On Mon, Oct 27, 2014 at 04:21:54PM -0700, Duncan Idaho wrote:
> We're currently seeing this crash several times a day in our Ubuntu
> Icehouse OpenStack environment of about 60 nodes.
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  nl_attr_get_size (nla=nla@entry=0x0) at ../lib/netlink.c:506
> #1  0x0000000000460473 in format_generic_odp_key (a=a@entry=0x0,
> ds=ds@entry=0x7fffc803d290)
> at ../lib/odp-util.c:767
> #2  0x0000000000460cd2 in format_odp_key_attr (a=a@entry=0x1a63c98,
> ma=ma@entry=0x0, ds=ds@entry=0x7fffc803d290, verbose=verbose@entry=true) at
> ../lib/odp-util.c:1332
> #3  0x00000000004609d7 in odp_flow_format (key=key@entry=0x1a63c50,
> key_len=key_len@entry=80, mask=mask@entry=0x0, mask_len=mask_len@entry=0,
> ds=ds@entry=0x7fffc803d290,
>     verbose=verbose@entry=true) at ../lib/odp-util.c:1402
> #4  0x00000000004450f3 in log_flow_message (error=error@entry=2,
> operation=operation@entry=0x4d0e73 "flow_del", key=0x1a63c50, key_len=80,
> mask=mask@entry=0x0,
>     mask_len=mask_len@entry=0, stats=0x0, actions=actions@entry=0x0,
> actions_len=actions_len@entry=0, dpif=<optimized out>) at ../lib/dpif.c:1354
> #5  0x00000000004453c9 in log_flow_del_message (dpif=dpif@entry=0x1a06c70,
> del=del@entry=0x7fffc803d340, error=error@entry=2) at ../lib/dpif.c:1397
> #6  0x0000000000445433 in log_flow_del_message (error=2,
> del=0x7fffc803d340, dpif=0x1a06c70) at ../lib/dpif.c:1396
> #7  dpif_flow_del__ (dpif=0x1a06c70, del=del@entry=0x7fffc803d340) at
> ../lib/dpif.c:945
> #8  0x00000000004455ca in dpif_flow_del (dpif=<optimized out>,
> key=<optimized out>, key_len=<optimized out>, 
> stats=stats@entry=0x7fffc803d370)
> at ../lib/dpif.c:965
> #9  0x000000000041b423 in subfacet_uninstall (subfacet=0x1be9a80) at
> ../ofproto/ofproto-dpif.c:4686
> #10 0x0000000000420f18 in facet_remove (facet=0x1be9680) at
> ../ofproto/ofproto-dpif.c:4014
> #11 0x0000000000422f52 in facet_revalidate (facet=facet@entry=0x1be9680) at
> ../ofproto/ofproto-dpif.c:4321
> #12 0x0000000000423b96 in type_run (type=<optimized out>) at
> ../ofproto/ofproto-dpif.c:836
> #13 0x000000000041224f in ofproto_type_run (datapath_type=<optimized out>,
> datapath_type@entry=0x1ab88a0 "system") at ../ofproto/ofproto.c:1309
> #14 0x000000000040d755 in bridge_run () at ../vswitchd/bridge.c:2384
> #15 0x00000000004059bb in main (argc=<optimized out>, argv=<optimized out>)
> at ../vswitchd/ovs-vswitchd.c:118

This backtrace doesn't quite add up.

We can see from frames 4 and 3 that we've got a nonnull 'key', which
becomes a nonnull nlattr 'a' in frame 2.  Along the same chain, we
have a null 'mask' that becomes a null 'ma'.  I often don't trust GDB
to give me correct arguments in backtraces, but all of that adds up
nicely, so I tend to believe it.

Take a look at the code for format_odp_key_attr().  It always
dereferences 'a' to get its type 'attr':

    enum ovs_key_attr attr = nl_attr_type(a);

A few lines later we can see 'is_exact' getting set to true (since
'ma' is NULL):

    bool is_exact;

    is_exact = ma ? odp_mask_attr_is_exact(ma) : true;

We're evidently hitting the default case in the switch statement given
the line number cited in the backtrace, which runs this code:

    case OVS_KEY_ATTR_UNSPEC:
    case __OVS_KEY_ATTR_MAX:
    default:
        format_generic_odp_key(a, ds);
        if (!is_exact) {
            ds_put_char(ds, '/');
            format_generic_odp_key(ma, ds);      <---- line 1332
        }
        break;

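but that doesn't make sense.  Frame 1 shows format_generic_odp_key()
called with a null 'a', which can only be the second call (the one
passing 'ma'), and we should never get there, because is_exact is
true.  So--WTF?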
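
To make the contradiction concrete, here is a minimal standalone
sketch of the control flow in that default case.  It is not the OVS
code: every fake_* name is a hypothetical stand-in, the mask check is
stubbed to return false (the worst case for this argument), and the
formatter only counts how often it is handed a null attribute instead
of dereferencing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct nlattr; only null vs. nonnull matters. */
struct fake_nlattr {
    unsigned short len;
    unsigned short type;
};

/* Hypothetical stub for odp_mask_attr_is_exact().  The real function
 * inspects the mask payload; returning false here forces the masked
 * branch whenever a mask is present. */
static bool
fake_mask_is_exact(const struct fake_nlattr *ma)
{
    (void) ma;
    return false;
}

/* Number of times the generic formatter was reached with a null attribute. */
static int null_format_calls;

/* Stand-in for format_generic_odp_key(): records a null argument
 * instead of crashing on it. */
static void
fake_format_generic(const struct fake_nlattr *a)
{
    if (!a) {
        null_format_calls++;
    }
}

/* Same shape as the default case quoted above. */
static void
fake_format_default_case(const struct fake_nlattr *a,
                         const struct fake_nlattr *ma)
{
    bool is_exact = ma ? fake_mask_is_exact(ma) : true;

    fake_format_generic(a);
    if (!is_exact) {
        fake_format_generic(ma);    /* corresponds to line 1332 */
    }
}

/* Runs the default case once and reports how many null calls it made. */
int
null_calls_for(const struct fake_nlattr *a, const struct fake_nlattr *ma)
{
    null_format_calls = 0;
    fake_format_default_case(a, ma);
    return null_format_calls;
}
```

With a nonnull key, null_calls_for(&key, NULL) comes back as 0: a null
'ma' forces is_exact to true, so the second call is skipped.  As
written, the source gives no path to the frame 1 we see in the
backtrace.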

> This is probably related to the following "fixed" Ubuntu bug:
> https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1352570
> 
> The fix referenced was:
> https://github.com/openvswitch/ovs/commit/dd2e44f835fac8c2df99f84c54250c3ca981f2f5
> 
> Not sure if it's relevant but part of this patch was reverted prior to the
> 2.0.2 release:
> https://github.com/openvswitch/ovs/commit/e8ac8c3940535fb439eba980afa6c61bdd428003

commit dd2e44f835 is about a race between two threads when a bridge is
being deleted.  I don't see any evidence that there's a bridge being
deleted here.

> Any help will be appreciated!  Let me know if I can provide any more
> relevant information.

What GCC version was used for this build?  I've seen an unusual number
of code generation bugs with GCC 4.9.x.