Re: [Qemu-devel] Fwd: Re: Tunneled Migration with Non-Shared Storage
On 11/20/14 3:54 AM, Dr. David Alan Gilbert wrote:
> * Gary R Hook (grhookatw...@gmail.com) wrote:
> > Ugh, I wish I could teach Thunderbird to understand how to reply to a
> > newsgroup.
> >
> > Apologies to Paolo for the direct note.
> >
> > On 11/19/14 4:19 AM, Paolo Bonzini wrote:
> > > On 19/11/2014 10:35, Dr. David Alan Gilbert wrote:
> > > > * Paolo Bonzini (pbonz...@redhat.com) wrote:
> > > > > On 18/11/2014 21:28, Dr. David Alan Gilbert wrote:
> > > > > > This seems odd, since as far as I know the tunneling code is
> > > > > > quite separate from the migration code; I thought the only
> > > > > > thing the migration code sees differently is the file
> > > > > > descriptors it gets passed. (Having said that, again I don't
> > > > > > know storage stuff, so if this is a storage special there may
> > > > > > be something there...)
> > > > >
> > > > > Tunnelled migration uses the old block-migration.c code.
> > > > > Non-tunnelled migration uses the NBD server and block/mirror.c.
> > > >
> > > > OK, that explains that. Is that because the tunneling code can't
> > > > deal with tunneling the NBD server connection?
> > > >
> > > > > The main problem with the old code is that it uses a possibly
> > > > > unbounded amount of memory in mig_save_device_dirty and can have
> > > > > huge jitter if any serious workload is running in the guest.
> > > >
> > > > So that's sending dirty blocks iteratively? Not that I can see
> > > > when the allocations get freed; but is the amount allocated there
> > > > related to total disk size (as Gary suggested) or to the amount of
> > > > dirty blocks?
> > >
> > > It should be related to the maximum rate limit (which can be set to
> > > arbitrarily high values, however).
> >
> > This makes no sense. The code in block_save_iterate() specifically
> > attempts to control the rate of transfer. But when
> > qemu_file_get_rate_limit() returns a number like 922337203685372723
> > (0xCCCCCCCCCCB3333) I'm under the impression that no bandwidth
> > constraints are being imposed at this layer. Why, then, would that
> > transfer be occurring at 20MB/sec (simple, under-utilized 1 GigE
> > connection) with no clear bottleneck in CPU or network? What other
> > relation might exist?
>
> Disk IO on the disk that you're trying to transfer?

Well, non-tunneled runs fast enough (120 MB/s) to saturate the network
pipe, so it's evident to me that the blocks can come screaming from the
disk plenty fast. And there's no CPU bottleneck; the VM is really not
doing much of anything at all. So I'll say no. I shall continue my
investigation.

> > > The reads are started, then the ones that are ready are sent and
> > > the blocks are freed in flush_blks. The jitter happens when the
> > > guest reads a lot but only writes a few blocks. In that case, the
> > > bdrv_drain_all in mig_save_device_dirty can be called relatively
> > > often and it can be expensive because it also waits for all
> > > guest-initiated reads to complete.
> >
> > Pardon my ignorance, but this does not match my observations. What I
> > am seeing is the process size of the source qemu grow steadily until
> > the COR completes; during this time the backing file on the
> > destination system does not change/grow at all, which implies that no
> > blocks are being transferred. (I have tested this with a 25GB VM
> > disk, and larger; no network activity occurs during this period.)
> > Once the COR is done and the in-memory copy is ready (marked by a
> > "Completed 100%" message from blk_mig_save_bulked_block()) the
> > transfer begins. At an abysmally slow rate, I'll add, per the above.
> > Another problem to be investigated.
>
> Odd thought; can you try dropping your migration bandwidth limit
> (migrate_set_speed) - try something low, like 10M - does the behaviour
> stay the same, or does it start transmitting disk data before it's read
> the lot?

Interesting idea. I shall attempt that.

> > > The bulk phase is similar, just with different functions (the reads
> > > are done in mig_save_device_bulk). With a high rate limit, the
> > > total allocated memory can reach a few gigabytes indeed.
> >
> > Much, much more than that. It's definitely dependent upon the disk
> > file size. Tiny VM disks are a nit; big VM disks are a problem.
>
> Well, if as you say it's not starting transmitting for some reason
> until it's read the lot then that would make sense.

Right. I'm just saying that I don't think this works the way people
think it works.

> > > Depending on the scenario, a possible disadvantage of NBD migration
> > > is that it can only throttle each disk separately, while the old
> > > code will apply a single limit to all migrations.
> >
> > How about no throttling at all? And just to be very clear, the goal
> > is fast (NBD-based) migrations of VMs using non-shared storage over
> > an encrypted channel. Safest, worst-case scenario. Aside from gaining
> > an understanding of this code.
>
> There are vague plans to add TLS support for encrypting these streams
> internally to qemu; but they're just thoughts at the moment.

:-(

--
Gary R Hook
Senior Kernel Engineer
NIMBOXX, Inc
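For readers following the rate-limit exchange above, here is a minimal
sketch of how such a per-period byte-budget gate typically works. This
is illustrative C only, not qemu source: block_save_iterate() and
qemu_file_get_rate_limit() are the real names discussed in the thread,
while SketchFile, sketch_get_rate_limit(), and send_one_ready_block()
are invented stand-ins. The point is that with a budget like
0xCCCCCCCCCCB3333 bytes per period, the gate never closes, so this
layer imposes no bandwidth constraint at all.

    /* Minimal sketch (not actual qemu code) of a per-period
     * byte-budget gate in a block_save_iterate()-style loop. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct SketchFile {
        int64_t bytes_xfer;  /* bytes queued so far this period */
        int64_t xfer_limit;  /* allowed bytes per period        */
    } SketchFile;

    /* Stand-in for qemu_file_get_rate_limit(). */
    static int64_t sketch_get_rate_limit(SketchFile *f)
    {
        return f->xfer_limit;
    }

    /* Stand-in for sending one completed read; returns bytes sent,
     * or 0 when no block is ready yet. */
    static int64_t send_one_ready_block(SketchFile *f)
    {
        (void)f;
        return 1024 * 1024;  /* pretend a 1 MiB block was ready */
    }

    static void block_save_iterate_sketch(SketchFile *f)
    {
        /* Keep sending while under the per-period budget. */
        while (f->bytes_xfer < sketch_get_rate_limit(f)) {
            int64_t sent = send_one_ready_block(f);
            if (sent == 0) {
                break;  /* nothing ready; retry on the next call */
            }
            f->bytes_xfer += sent;
        }
        /* The caller resets bytes_xfer once per period; with a huge
         * xfer_limit the while condition never trips. */
    }

    int main(void)
    {
        SketchFile f = { .bytes_xfer = 0, .xfer_limit = 10 << 20 };
        block_save_iterate_sketch(&f);
        printf("queued %lld bytes this period\n",
               (long long)f.bytes_xfer);
        return 0;
    }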
Re: [Qemu-devel] Fwd: Re: Tunneled Migration with Non-Shared Storage
* Gary R Hook (grhookatw...@gmail.com) wrote:
> Ugh, I wish I could teach Thunderbird to understand how to reply to a
> newsgroup.
>
> Apologies to Paolo for the direct note.
>
> On 11/19/14 4:19 AM, Paolo Bonzini wrote:
> >On 19/11/2014 10:35, Dr. David Alan Gilbert wrote:
> >>* Paolo Bonzini (pbonz...@redhat.com) wrote:
> >>>On 18/11/2014 21:28, Dr. David Alan Gilbert wrote:
> >>>>This seems odd, since as far as I know the tunneling code is quite
> >>>>separate from the migration code; I thought the only thing the
> >>>>migration code sees differently is the file descriptors it gets
> >>>>passed. (Having said that, again I don't know storage stuff, so if
> >>>>this is a storage special there may be something there...)
> >>>
> >>>Tunnelled migration uses the old block-migration.c code. Non-tunnelled
> >>>migration uses the NBD server and block/mirror.c.
> >>
> >>OK, that explains that. Is that because the tunneling code can't
> >>deal with tunneling the NBD server connection?
> >>
> >>>The main problem with the old code is that it uses a possibly
> >>>unbounded amount of memory in mig_save_device_dirty and can have huge
> >>>jitter if any serious workload is running in the guest.
> >>
> >>So that's sending dirty blocks iteratively? Not that I can see
> >>when the allocations get freed; but is the amount allocated there
> >>related to total disk size (as Gary suggested) or to the amount
> >>of dirty blocks?
> >
> >It should be related to the maximum rate limit (which can be set to
> >arbitrarily high values, however).
>
> This makes no sense. The code in block_save_iterate() specifically
> attempts to control the rate of transfer. But when
> qemu_file_get_rate_limit() returns a number like 922337203685372723
> (0xCCCCCCCCCCB3333) I'm under the impression that no bandwidth
> constraints are being imposed at this layer. Why, then, would that
> transfer be occurring at 20MB/sec (simple, under-utilized 1 GigE
> connection) with no clear bottleneck in CPU or network? What other
> relation might exist?

Disk IO on the disk that you're trying to transfer?

> >The reads are started, then the ones that are ready are sent and the
> >blocks are freed in flush_blks. The jitter happens when the guest reads
> >a lot but only writes a few blocks. In that case, the bdrv_drain_all in
> >mig_save_device_dirty can be called relatively often and it can be
> >expensive because it also waits for all guest-initiated reads to complete.
>
> Pardon my ignorance, but this does not match my observations. What I am
> seeing is the process size of the source qemu grow steadily until the
> COR completes; during this time the backing file on the destination
> system does not change/grow at all, which implies that no blocks are
> being transferred. (I have tested this with a 25GB VM disk, and larger;
> no network activity occurs during this period.) Once the COR is done
> and the in-memory copy is ready (marked by a "Completed 100%" message
> from blk_mig_save_bulked_block()) the transfer begins. At an abysmally
> slow rate, I'll add, per the above. Another problem to be investigated.

Odd thought; can you try dropping your migration bandwidth limit
(migrate_set_speed) - try something low, like 10M - does the behaviour
stay the same, or does it start transmitting disk data before it's read
the lot?

> >The bulk phase is similar, just with different functions (the reads are
> >done in mig_save_device_bulk). With a high rate limit, the total
> >allocated memory can reach a few gigabytes indeed.
>
> Much, much more than that. It's definitely dependent upon the disk file
> size. Tiny VM disks are a nit; big VM disks are a problem.

Well, if as you say it's not starting transmitting for some reason until
it's read the lot then that would make sense.

> >Depending on the scenario, a possible disadvantage of NBD migration is
> >that it can only throttle each disk separately, while the old code will
> >apply a single limit to all migrations.
>
> How about no throttling at all? And just to be very clear, the goal is
> fast (NBD-based) migrations of VMs using non-shared storage over an
> encrypted channel. Safest, worst-case scenario. Aside from gaining an
> understanding of this code.

There are vague plans to add TLS support for encrypting these streams
internally to qemu; but they're just thoughts at the moment.

> Thank you for your attention.

Dave

> --
> Gary R Hook
> Senior Kernel Engineer
> NIMBOXX, Inc

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
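For reference, Dave's suggested experiment maps onto the HMP monitor
roughly as follows (a sketch; desthost:4444 is a placeholder, -d
detaches the migration, and -b requests full block migration through
the old block-migration.c path):

    (qemu) migrate_set_speed 10m
    (qemu) migrate -d -b tcp:desthost:4444
    (qemu) info migrate

If the buffer-everything-first behaviour persists even at a 10M cap,
the rate limiter is not what governs it; if disk data starts flowing
before the bulk read finishes, the limiter is implicated.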
[Qemu-devel] Fwd: Re: Tunneled Migration with Non-Shared Storage
Ugh, I wish I could teach Thunderbird to understand how to reply to a
newsgroup.

Apologies to Paolo for the direct note.

On 11/19/14 4:19 AM, Paolo Bonzini wrote:
> On 19/11/2014 10:35, Dr. David Alan Gilbert wrote:
> > * Paolo Bonzini (pbonz...@redhat.com) wrote:
> > > On 18/11/2014 21:28, Dr. David Alan Gilbert wrote:
> > > > This seems odd, since as far as I know the tunneling code is quite
> > > > separate from the migration code; I thought the only thing the
> > > > migration code sees differently is the file descriptors it gets
> > > > passed. (Having said that, again I don't know storage stuff, so if
> > > > this is a storage special there may be something there...)
> > >
> > > Tunnelled migration uses the old block-migration.c code.
> > > Non-tunnelled migration uses the NBD server and block/mirror.c.
> >
> > OK, that explains that. Is that because the tunneling code can't deal
> > with tunneling the NBD server connection?
> >
> > > The main problem with the old code is that it uses a possibly
> > > unbounded amount of memory in mig_save_device_dirty and can have
> > > huge jitter if any serious workload is running in the guest.
> >
> > So that's sending dirty blocks iteratively? Not that I can see when
> > the allocations get freed; but is the amount allocated there related
> > to total disk size (as Gary suggested) or to the amount of dirty
> > blocks?
>
> It should be related to the maximum rate limit (which can be set to
> arbitrarily high values, however).

This makes no sense. The code in block_save_iterate() specifically
attempts to control the rate of transfer. But when
qemu_file_get_rate_limit() returns a number like 922337203685372723
(0xCCCCCCCCCCB3333) I'm under the impression that no bandwidth
constraints are being imposed at this layer. Why, then, would that
transfer be occurring at 20MB/sec (simple, under-utilized 1 GigE
connection) with no clear bottleneck in CPU or network? What other
relation might exist?

> The reads are started, then the ones that are ready are sent and the
> blocks are freed in flush_blks. The jitter happens when the guest reads
> a lot but only writes a few blocks. In that case, the bdrv_drain_all in
> mig_save_device_dirty can be called relatively often and it can be
> expensive because it also waits for all guest-initiated reads to
> complete.

Pardon my ignorance, but this does not match my observations. What I am
seeing is the process size of the source qemu grow steadily until the
COR completes; during this time the backing file on the destination
system does not change/grow at all, which implies that no blocks are
being transferred. (I have tested this with a 25GB VM disk, and larger;
no network activity occurs during this period.) Once the COR is done
and the in-memory copy is ready (marked by a "Completed 100%" message
from blk_mig_save_bulked_block()) the transfer begins. At an abysmally
slow rate, I'll add, per the above. Another problem to be investigated.

> The bulk phase is similar, just with different functions (the reads are
> done in mig_save_device_bulk). With a high rate limit, the total
> allocated memory can reach a few gigabytes indeed.

Much, much more than that. It's definitely dependent upon the disk file
size. Tiny VM disks are a nit; big VM disks are a problem.

> Depending on the scenario, a possible disadvantage of NBD migration is
> that it can only throttle each disk separately, while the old code will
> apply a single limit to all migrations.

How about no throttling at all? And just to be very clear, the goal is
fast (NBD-based) migrations of VMs using non-shared storage over an
encrypted channel. Safest, worst-case scenario. Aside from gaining an
understanding of this code.

Thank you for your attention.

--
Gary R Hook
Senior Kernel Engineer
NIMBOXX, Inc
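For contrast with the old block-migration.c path discussed throughout
this thread, the NBD-based (non-tunnelled) flow is driven from outside
qemu. The sequence below is a rough sketch of the sort of QMP commands
a management layer such as libvirt issues; the host, port, and device
names are placeholders:

    # Destination (started with -incoming): export the pre-created disk.
    { "execute": "nbd-server-start",
      "arguments": { "addr": { "type": "inet",
          "data": { "host": "0.0.0.0", "port": "3333" } } } }
    { "execute": "nbd-server-add",
      "arguments": { "device": "drive-virtio-disk0", "writable": true } }

    # Source: mirror the disk into that export, then migrate RAM/state.
    { "execute": "drive-mirror",
      "arguments": { "device": "drive-virtio-disk0",
          "target": "nbd:desthost:3333:exportname=drive-virtio-disk0",
          "sync": "full", "mode": "existing" } }
    { "execute": "migrate", "arguments": { "uri": "tcp:desthost:4444" } }

Because each disk is a separate block job here, any throttling is
applied per disk via block-job-set-speed, which is the per-disk limit
Paolo contrasts with the single migrate_set_speed cap of the old code.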