Hi Andrija,

Do you use NIC bonds? I have seen this before when using active-active bonds, 
and as you say it can be very difficult to troubleshoot and the behaviour makes 
little sense. What can happen is that network traffic is load balanced between 
the two NICs, but the MAC table updates between the two switches don't keep up 
with the load-balanced traffic. In other words, a MAC address which used to 
transmit on hypervisor eth0 of a bond (attached to your first top of rack 
switch) has, due to load, suddenly started transmitting on eth1 (attached to 
the second top of rack switch), but the physical switch stack still thinks the 
MAC address lives on eth0, hence traffic is dropped until the next time the 
switches sync their MAC tables. 

We used to see this a lot in the past on XenServer; the solution is to move to 
an active-passive bond mode, or to go up to LACP/802.3ad if your hardware 
allows for it. The same principle also applies to generic Linux bonds.
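
For reference, you can check which bond mode is currently in use with 
"cat /proc/net/bonding/bond0". Below is a rough sketch of what an LACP 
(802.3ad) bond might look like in /etc/network/interfaces on Ubuntu 14.04; the 
interface names are placeholders and your switch ports would need to be 
configured for LACP as well:

auto bond0
iface bond0 inet manual
    bond-mode 802.3ad            # or active-backup to avoid the MAC flapping
    bond-slaves eth0 eth1        # placeholder slave NIC names
    bond-miimon 100              # link monitoring interval in ms
    bond-lacp-rate fast
    bond-xmit-hash-policy layer2+3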

Regards, 
Dag Sonstebo
Cloud Architect
ShapeBlue
 S: +44 20 3603 0540 | dag.sonst...@shapeblue.com | http://www.shapeblue.com | Twitter: @ShapeBlue


On 09/10/2017, 21:52, "Andrija Panic" <andrija.pa...@gmail.com> wrote:

    Hi guys,
    
    we have an occasional but serious problem that seems to start happening
    randomly (i.e. NOT under high load) - not ACS related afaik, purely KVM,
    but feedback is really welcome.
    
    - VM is reachable in general from everywhere, but not reachable from a
    specific IP address ?!
    - VM is NOT under high load, network traffic next to zero, same for
    CPU/disk...
    - We mitigate this problem by migrating the VM away to another host, not
    much of a solution...
    
    Description of problem:
    
    We ping from the "problematic" source IP address to the problematic VM, and
    capture traffic on the KVM host where the problematic VM lives (example
    capture commands below):
    
    - tcpdump on the VXLAN interface (the physical incoming interface on the
    host) = we see the packet fine
    - tcpdump on the BRIDGE = we see the packet fine
    - tcpdump on the VNET = we DON'T see the packet.
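
    For illustration, the captures are roughly of the following form (the
    interface names here are just examples, ours differ):

    tcpdump -e -ni vxlan100 icmp and host <problem-source-ip>   # VXLAN/physical
    tcpdump -e -ni brvx100  icmp and host <problem-source-ip>   # bridge
    tcpdump -e -ni vnet5    icmp and host <problem-source-ip>   # VNET/tap to VM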
    
    In the scenario above, I should add that:
    - we can tcpdump packets from other source IPs on the VNET interface just
    fine (as expected), so we should also see this problematic source IP's
    packets
    - we can actually ping in the opposite direction - from the problematic VM
    to the problematic "source" IP
    
    We checked everything we could think of: bridge port forwarding, mac-to-vtep
    mapping and many other things; we removed traffic shaping from the VNET
    interface, confirmed there are no iptables/ebtables rules and no STP on the
    bridge, removed and rejoined interfaces to the bridge, and destroyed the
    bridge and recreated it manually on the fly (some of the checks are
    sketched below).
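
    For reference, the checks were roughly along these lines (bridge and
    interface names are again just examples):

    cat /sys/class/net/brvx100/bridge/stp_state   # 0 = STP disabled
    brctl showmacs brvx100                        # MACs learned on the bridge
    bridge fdb show dev vxlan100                  # mac-to-vtep mapping
    tc qdisc del dev vnet5 root                   # remove traffic shaping on VNET
    iptables -L -n; ebtables -L                   # confirm no filtering rules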
    
    The problem is really crazy and I cannot explain it - there are no iptables
    and no ebtables rules on this host (removed for troubleshooting purposes).
    
    This is Ubuntu 14.04, QEMU 2.5 (libvirt 1.3.1),
    stock kernel 3.16-xx, regular bridge (not OVS).
    
    Has anyone else ever heard of such a problem? This is not intermittent
    packet dropping, but a complete blackout/packet drop of sorts...
    
    Thanks,
    
    -- 
    
    Andrija Panić
    

