Re: [ovs-discuss] Restarting network kills ovs-vswitchd (and network)... ?

2019-06-13 Thread SCHAER Frederic
Hi,

An update on the issue I faced...

I noticed that openstack VMs migrated onto the buggy OVS hosts lost network.
Actually, ARP and DHCP were flowing from the VM to the outside, but were 
dropped "somewhere" on the return path.
I tried to find the cause using ovs-dpctl/ovs-ofctl, but failed to 
understand why. I even tried an ovs-vsctl emer-reset... but this 
changed nothing.
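For reference, the kind of commands I used were along these lines (from 
memory; the port number and flow match are illustrative):

  # flows installed in the kernel datapath
  ovs-dpctl dump-flows
  # OpenFlow rules on the bridge carrying the VM traffic
  ovs-ofctl dump-flows brflat
  # trace a synthetic ARP packet through the OpenFlow pipeline
  ovs-appctl ofproto/trace brflat in_port=1,arp
  # last resort: reset the OVS database to a near-pristine state
  ovs-vsctl emer-reset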
 
Since I was out of ideas, I reinstalled the host using the exact same config 
(snapshotted repos, Puppet for everything), and this actually fixed 
both issues:
- network restart no longer kills/stops connectivity
- migrated VMs' network is no longer "ARP broken"

I have to say I'm puzzled; I did not think a reinstall would fix anything...
The only thing I can think of is that the reinstall truly wiped out the OVS 
config on the node, along with any possible lingering OpenFlow rules or 
openvswitch upgrade inconsistencies (I'm using the RDO repo)...
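If that theory is right, the leftover state could presumably have been 
spotted or cleared by hand with something like this (a sketch; the conf.db 
path is the CentOS/RDO default):

  # inspect what ovsdb still remembers (bridges, ports, external-ids)
  ovs-vsctl show
  # look for flows left behind by a previous agent run
  ovs-ofctl dump-flows brflat
  # the heavy-handed equivalent of what the reinstall did:
  systemctl stop openvswitch
  rm -f /etc/openvswitch/conf.db
  systemctl start openvswitch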

Thanks anyway for the support and answers you gave me.
Regards

> -----Original Message-----
> From: Flavio Leitner 
> Sent: Friday, May 17, 2019 12:13
> To: SCHAER Frederic 
> Cc: b...@openvswitch.org
> Subject: Re: [ovs-discuss] Restarting network kills ovs-vswitchd (and
> network)... ?
> 
> On Fri, May 17, 2019 at 09:45:36AM +, SCHAER Frederic wrote:
> > Hi
> >
> > Thank you for your answer.
> > I actually forgot to say that I had already checked the syslogs and
> > the OVS and network journals/logs... no coredump reference anywhere.
> >
> > For me, a core dump or a crash would not return an exit code of 0,
> > which seems to be what systemd saw :/ I even straced -f the
> > ovs-vswitchd process and made it stop/crash with an ifdown/ifup,
> > but it looks to me like this is an exit...
> >
> > (I can retry and save the strace output if necessary or useful.)
> > The end of the strace output was (I see "brflat" in the long strings,
> > which is the bridge hosting em1):
> 
> Is this a strace of ovs-vswitchd or ovs-vsctl? Because SIGABRT happens
> when ovs-vsctl is stuck and the alarm fires. Then this would just
> indicate that ovs-vswitchd is not running.
> 
> If ovs-vswitchd is not crashing, something is stopping the service, and
> maybe running sh -x /sbin/ifdown   helps to shed some light?
> 
> Or add 'set -x' to the /etc/sysconfig/network-scripts/if*-ovs scripts.
> 
> fbl
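
For the record, on my CentOS hosts such a debug run would look something 
like this (a sketch; em1 is the physical interface sitting on brflat, and 
I'm assuming the standard ifup-ovs/ifdown-ovs scripts shipped with the RPM):

  # trace the ifdown path of the physical interface
  sh -x /sbin/ifdown em1
  # or enable tracing inside the OVS hooks (insert after the shebang line)
  sed -i '1a set -x' /etc/sysconfig/network-scripts/ifup-ovs
  sed -i '1a set -x' /etc/sysconfig/network-scripts/ifdown-ovs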
> 
> >
> > [strace output trimmed; the full trace is in the original message
> > quoted below]

Re: [ovs-discuss] Restarting network kills ovs-vswitchd (and network)... ?

2019-05-17 Thread SCHAER Frederic
Hi

Thank you for your answer.
I actually forgot to say that I had already checked the syslogs and the OVS 
and network journals/logs... no coredump reference anywhere.

For me, a core dump or a crash would not return an exit code of 0, which seems 
to be what systemd saw :/
I even straced -f the ovs-vswitchd process and made it stop/crash with an 
ifdown/ifup, but it looks to me like this is an exit...
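
For reference, the invocation was something like the following (reconstructed 
from memory; the exact flags may have differed):

  # attach to the running daemon and follow all of its threads
  strace -f -p $(pidof ovs-vswitchd)
  # then, from another shell, trigger the stop/crash
  ifdown em1 && ifup em1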

(I can retry and save the strace output if necessary or useful.)
The end of the strace output was (I see "brflat" in the long strings, which 
is the bridge hosting em1):

[pid 175068] sendmsg(18, {msg_name(0)=NULL, 
msg_iov(1)=[{",\0\0\0\22\0\1\0\223\6\0\0!\353\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\3\0brflat\0\0",
 44}], msg_
controllen=0, msg_flags=0}, 0 
[pid 175233] <... futex resumed> )  = 0
[pid 175068] <... sendmsg resumed> )= 44
[pid 175068] recvmsg(18,  
[pid 175234] futex(0x55b8aaa19128, FUTEX_WAKE_PRIVATE, 1 
...skipping...
[pid 175233] futex(0x7f7f226b9140, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 175068] <... sendmsg resumed> )= 44
[pid 175234] <... futex resumed> )  = 0
[pid 175233] <... futex resumed> )  = -1 EAGAIN (Resource temporarily 
unavailable)
[pid 175068] recvmsg(18,  
[pid 175234] futex(0x7f7f226b9140, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 175233] futex(0x7f7f226b9140, FUTEX_WAKE_PRIVATE, 1 
[pid 175068] <... recvmsg resumed> {msg_name(0)=NULL, 
msg_iov(2)=[{"\360\4\0\0\20\0\0\0\224\6\0\0!\353\377\377\0\0\1\0\36\0\0\0C\20\1\0\0\0\0\0\v\0\3\0brflat\0\0\10\0\r\0\350\3\0\0\5\0\20\0\0\0\0\0\5\0\21\0\0\0\0\0\10\0\4\0\334\5\0\0\10\0\33\0\0\0\0\0\10\0\36\0\1\0\0\0\10\0\37\0\1\0\0\0\10\0(\0\377\377\0\0\10\0)\0\0\0\1\0\10\0
 
\0\1\0\0\0\5\0!\0\1\0\0\0\f\0\6\0noqueue\0\10\0#\0\0\0\0\0\5\0'\0\0\0\0\0$\0\16\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0H\234\377\377\n\0\1\0\276J\307\307\207I\0\0\n\0\2\0\377\377\377\377\377\377\0\0\304\0\27\0Y\22\5\0\0\0\0\0Uf\0\0\0\0\0\0^0j\1\0\0\0\0\372\371k\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0d\0\7\0Y\22\5\0Uf\0\0^0j\1\372\371k\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
 1024}, 
{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0y0\2\0\0\0\0\0\256\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\214\254\235\0\0\0\0\0
 
z\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\211\223\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0004\0\6\0\6\0\0\0\0\0\0\0r\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0A\7\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\24\0\7\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\5\0\10\0\0\0\0\0",
 65536}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 1264
[pid 175234] <... futex resumed> )  = -1 EAGAIN (Resource temporarily 
unavailable)
[pid 175233] <... futex resumed> )  = 0
[pid 175234] futex(0x7f7f226b9140, FUTEX_WAKE_PRIVATE, 1 
[pid 175068] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  
[pid 175234] <... futex resumed> )  = 0
[pid 175068] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid 175068] tgkill(175068, 175068, SIGABRT 
[pid 175233] futex(0x7f7f226b9140, FUTEX_WAKE_PRIVATE, 1 
[pid 175068] <... tgkill resumed> ) = 0
[pid 175233] <... futex resumed> )  = 0
[pid 175068] --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=175068, 
si_uid=393} ---
[pid 189862] +++ killed by SIGABRT +++
[pid 175237] +++ killed by SIGABRT +++
[pid 175236] +++ killed by SIGABRT +++
[pid 175235] +++ killed by SIGABRT +++
[pid 175234] +++ killed by SIGABRT +++
[pid 175233] +++ killed by SIGABRT +++
[pid 175232] +++ killed by SIGABRT +++
[pid 175231] +++ killed by SIGABRT +++
[pid 175230] +++ killed by SIGABRT +++
[pid 175229] +++ killed by SIGABRT +++
[pid 175228] +++ killed by SIGABRT +++
[pid 175227] +++ killed by SIGABRT +++
[pid 175226] +++ killed by SIGABRT +++
[pid 175225] +++ killed by SIGABRT +++
[pid 175224] +++ killed by SIGABRT +++
[pid 175223] +++ killed by SIGABRT +++
[pid 175222] +++ killed by SIGABRT +++
[pid 175085] +++ killed by SIGABRT +++
+++ killed by SIGABRT +++


Regards

> -----Original Message-----
> From: Flavio Leitner 
> Sent: Friday, May 17, 2019 10:29
> To: SCHAER Frederic 
> Cc: b...@openvswitch.org
> Subject: Re: [ovs-discuss] Restarting network kills ovs-vswitchd (and
> network)... ?
> 
> On Thu, May 16, 2019 at 09:34:28AM +, SCHAER Frederic wrote:
> > Hi,
> > I'm facing an issue with openvswitch, which I think is new (not even sure).
> > Here is the description:
> > [...]