On Thu, 2015-10-22 at 12:03 +0100, Wei Liu wrote:
> On Thu, Oct 22, 2015 at 11:39:39AM +0100, Ian Campbell wrote:
> > On Thu, 2015-10-22 at 11:28 +0100, Wei Liu wrote:
> > > On Thu, Oct 22, 2015 at 10:50:54AM +0100, Ian Campbell wrote:
> > > > On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote:
> > > > > On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote:
> > > > > > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> > > > > > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030:
> > > > > > > regressions - FAIL"):
> > > > > > > > From mere code inspection and document of lwip 1.3.0 I
> > > > > > > > think mini-os does send gratuitous ARP.
> > > > > > >
> > > > > > > The guest is using the PVHVM drivers at this point, with
> > > > > > > the backend directly in dom0, so it is the guest's
> > > > > > > gratuitous arp which is needed, I think.
> > > > > >
> > > > > > It would be worth investigating whether mini-os's gratuitous
> > > > > > ARP might also be occurring and confusing things, e.g. by
> > > > > > coming after and therefore taking precedence over the one
> > > > > > coming from the guest.
> > > > > >
> > > > >
> > > > > Several observations:
> > > > >
> > > > > 1. The guest doesn't always send gratuitous arp -- but this
> > > > >    might not be the cause of this failure. Guest works fine
> > > > >    when using qemu-trad only.
> > > >
> > > > As in it always sends the arp when using qemu-trad, or that it
> > > > is fine irrespective of not always sending it?
> > > >
> > >
> > > Whether or not stubdom is in use, the guest behaves the same -- it
> > > doesn't always send gratuitous arp.
> > >
> > > When using qemu-trad alone, it's always fine when it doesn't send
> > > gratuitous arp because either there is a cache entry in dom0 that
> > > already has the guest mac address or the guest responds instantly
> > > to the dom0 arp request.
> >
> > Where has this cache entry come from? Any preexisting ARP cache
> > would be associated with vifX.0 and would go away when that device
> > was destroyed and replaced with vif(X+1).0.
> >
>
> No, vif-bridge script has two runes for off-lining a vif
>     brctl delif $bridge $vif
>     ifconfig $vif down
>
> Neither of these causes the cache entry to be flushed.

$vif disappearing when netback finally deletes the device will though.
Or it should/used to. Maybe this is happening after the new guest has
started and confusing things somewhere?

> > Also this only works for localhost migration. If the domain actually
> > moved to another host then the ARP is required in order for the
> > physical switch to learn the new location.
> >
> > Thus it seems to me that not always sending the gratuitous ARP is
> > the most important thing to get to the bottom of here.
> >
>
> That's another issue, but this would cause a different error (no
> route to host) instead of a timeout. The failure exhibits a timeout
> error -- let's do one thing at a time.

The presence of an ARP cache entry in dom0 pointing to the old VIF
would also cause a timeout issue, I think, since the guest is no longer
connected to that vif. This stale ARP cache entry should be the first
thing to investigate, before either the lack of a grat ARP or the
slowness of the guest, since its presence will confuse the results in
both those other cases.

> > > So it comes down to the responsiveness of guest is the key.
> > > > > [...]
> > > > > 3. When using stubdom, guest is a lot less responsive. See two
> > > > >    experiments and analysis below.
> > > >
> > > > Less responsive in use or only while migrating, or to ssh after
> > > > migration, or to something else?
> > > >
> > >
> > > For every activity after migration for a period of time, including
> > > both arp request / reply and ssh connection.
> > >
> > > > > Scenario 1:
> > > > >   xl shows "Migration successful."
> > > > >   ...30s...
> > > > >   xenbr0 receives gratuitous arp
> > > > >   ...1s...
> > > > >   ssh date command comes back
> > > > >
> > > > > Scenario 2:
> > > > >   xenbr0 receives gratuitous arp
> > > > >   ...1s...
> > > > >   xl shows "Migration successful."
> > > > >   ssh date command comes back
> > > > >
> > > > > When stubdom was not present I never saw scenario 1.
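
For anyone watching xenbr0 while reproducing the scenarios above, here
is a rough sketch of how a captured ARP payload could be classified as
gratuitous, i.e. sender and target protocol address are the same. It
assumes Ethernet/IPv4 ARP with the Ethernet header already stripped.
The function names and the example MAC and IP are made up for
illustration and do not come from any tool used in this thread.

```python
import socket
import struct

# ARP payload layout for Ethernet/IPv4 (RFC 826):
# htype, ptype, hlen, plen, oper, sender MAC, sender IP, target MAC, target IP
ARP_FORMAT = "!HHBBH6s4s6s4s"
ARP_LEN = struct.calcsize(ARP_FORMAT)  # 28 bytes


def build_arp(oper, sha, spa, tha, tpa):
    """Pack an Ethernet/IPv4 ARP payload (hypothetical helper for testing)."""
    return struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, oper, sha, spa, tha, tpa)


def is_gratuitous_arp(payload):
    """Return True if the payload is an ARP request or reply whose
    sender and target IPv4 addresses are identical."""
    if len(payload) < ARP_LEN:
        return False
    htype, ptype, hlen, plen, oper, _sha, spa, _tha, tpa = struct.unpack(
        ARP_FORMAT, payload[:ARP_LEN]
    )
    # Only handle Ethernet (1) carrying IPv4 (0x0800) addresses.
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return False
    return oper in (1, 2) and spa == tpa


if __name__ == "__main__":
    guest_mac = bytes.fromhex("00163e000001")   # made-up guest MAC
    guest_ip = socket.inet_aton("192.0.2.10")   # made-up guest IP
    announce = build_arp(1, guest_mac, guest_ip, b"\x00" * 6, guest_ip)
    print(is_gratuitous_arp(announce))
```

This only classifies a frame that was seen. It says nothing about
whether dom0 updated its cache in response, which would still need
checking separately.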
> >
> > So in that case you only saw Scenario 2 which includes a "receives
> > gratuitous ARP". But above you state that even with the non-stub
> > case sometimes the gratuitous ARP is not sent. Is this a 3rd case
> > which isn't mentioned here?
> >
>
> Scenario 3:
>   xl shows "Migration successful."
>   dom0 sends arp request because arp cache entry not available
>   guest takes a long time to respond when using stubdom or responds
>   instantly when not using stubdom
>
> Scenario 4:
>   xl shows "Migration successful."
>   (arp cache entry still available)
>   guest takes a long time to respond to ssh when using stubdom or
>   responds instantly when not using stubdom
>
> > > > It would be worth looking at the possibility of a delay between
> > > > "Migration successful" and the target domain actually running.
> > > > A 30s delay between the guest restarting and it sending the ARP
> > > > would be pretty strange IMHO
> > > >
> > >
> > > The guest is in a weird state.
> > >
> > > xl list shows the stubdom is in "b" state while guest has no state
> > > at all, heh.
> >
> > Has it actually been started/unpaused then?
> >
>
> Yes, of course -- otherwise the state would have been "p". And I
> observed the transition from "p" to "weird state".

If weird state is "-----" then I think that is normal, it is "runnable
but not running" IIRC.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel