On Thu, Oct 22, 2015 at 11:39:39AM +0100, Ian Campbell wrote:
> On Thu, 2015-10-22 at 11:28 +0100, Wei Liu wrote:
> > On Thu, Oct 22, 2015 at 10:50:54AM +0100, Ian Campbell wrote:
> > > On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote:
> > > > On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote:
> > > > > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> > > > > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030:
> > > > > > regressions - FAIL"):
> > > > > > > From mere code inspection and the documentation of lwip 1.3.0
> > > > > > > I think mini-os does send gratuitous ARP.
> > > > > >
> > > > > > The guest is using the PVHVM drivers at this point, with the
> > > > > > backend directly in dom0, so it is the guest's gratuitous arp
> > > > > > which is needed, I think.
> > > > >
> > > > > It would be worth investigating whether mini-os's gratuitous ARP
> > > > > might also be occurring and confusing things, e.g. by coming after
> > > > > and therefore taking precedence over the one coming from the guest.
> > > > >
> > > >
> > > > Several observations:
> > > >
> > > > 1. The guest doesn't always send gratuitous arp -- but this might not
> > > >    be the cause of this failure. Guest works fine when using
> > > >    qemu-trad only.
> > >
> > > As in it always sends the arp when using qemu-trad, or that it is fine
> > > irrespective of not always sending it?
> > >
> >
> > Whether or not stubdom is in use, the guest behaves the same -- it
> > doesn't always send gratuitous arp.
> >
> > When using qemu-trad alone, it's always fine when it doesn't send
> > gratuitous arp because either there is cache in dom0 that already has
> > guest mac address or the guest responds instantly to dom0 arp request.
>
> Where has this cache entry come from? Any preexisting ARP cache would be
> associated with vifX.0 and would go away when that device was destroyed and
> replaced with vif(X+1).0.
>
No, the vif-bridge script has two runes for off-lining a vif:

    brctl delif $bridge $vif
    ifconfig $vif down

Neither of these causes the cache entry to be flushed.

> Also this only works for localhost migration. If the domain actually moved
> to another host then the ARP is required in order for the physical switch
> to learn the new location.
>
> Thus it seems to me that not always sending the gratuitous ARP is the most
> important thing to get to the bottom of here.
>

That's another issue, but it would cause a different error (no route to
host) instead of a timeout. The failure exhibits a timeout error -- let's do
one thing at a time.

> > So it comes down to the responsiveness of guest is the key.
> >
> > > [...]
> > > > 3. When using stubdom, guest is a lot less responsive. See two
> > > >    experiments and analysis below.
> > >
> > > Less responsive in use or only while migrating, or to ssh after
> > > migration, or to something else?
> > >
> >
> > For every activity after migration for a period of time, including both
> > arp request / reply and ssh connection.
> >
> > > > Scenario 1:
> > > >   xl shows "Migration successful."
> > > >   ...30s...
> > > >   xenbr0 receives gratuitous arp
> > > >   ...1s...
> > > >   ssh date command comes back
> > > >
> > > > Scenario 2:
> > > >   xenbr0 receives gratuitous arp
> > > >   ...1s...
> > > >   xl shows "Migration successful."
> > > >   ssh date command comes back
> > > >
> > > > When stubdom was not present I never saw scenario 1.
>
> So in that case you only saw Scenario 2 which includes a "receives
> gratuitous ARP". But above you state that even with non-stub case sometimes
> the gratuitous ARP is not sent. Is this a 3rd case which isn't mentioned
> here?
>

Scenario 3:
  xl shows "Migration successful."
  dom0 sends arp request because arp cache entry not available
  guest takes a long time to respond when using stubdom
  or responds instantly when not using stubdom

Scenario 4:
  xl shows "Migration successful."
  (arp cache entry still available)
  guest takes a long time to respond to ssh when using stubdom
  or responds instantly when not using stubdom

> > > It would be worth looking at the possibility of a delay between
> > > "Migration successful" and the target domain actually running. A 30s
> > > delay between the guest restarting and it sending the ARP would be
> > > pretty strange IMHO
> > >
> >
> > The guest is in a weird state.
> >
> > xl list shows the stubdom is in "b" state while guest has no state at
> > all, heh.
>
> Has it actually been started/unpaused then?
>

Yes, of course -- otherwise the state would have been "p". And I observed
the transition from "p" to "weird state".

Wei.

> Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel