I think I might have stumbled across a bug in ospfd, where it's possible for it
to forget the mapping between OSPF external tags and routing labels defined in
its config file.

It's possible to reproduce in various network configurations by restarting
ospfd on individual routers, but a minimal set of reproduction steps would be
something like this:

1. Two peers running ospfd, both with the same set of routing label to external
   tag mappings in their config files.
2. Configure both peers to run ospfd on a shared subnet.
3. One of the two peers should advertise a route to a subnet (not the common
   one) which has a routing label assigned.
4. Start ospfd on both peers and wait until the advertised route propagates.
5. Restart ospfd on the peer advertising the route and wait until the route
   propagates again. Notice that the route still has a label assigned.
6. Restart ospfd on the peer advertising the route again. This time notice that,
   although the route propagates, it no longer has a label assigned.

Once a label mapping is forgotten by an ospfd instance, it's possible to restore
it by asking the instance to reload its config file via ospfctl.

Looking at the code, I noticed that there is a reference counting system used
for storing the association between OSPF external tags, routing labels and
internal interface ids. I wondered if there might be a problem with how the
references were tracked, so I built the latest version of ospfd with
additional logging added to each function in the name2id.c file.

Router 10.0.0.1 (the forgetful router in this case) is using interface em2 to
connect to the common subnet and is using the config file:

  router-id 10.0.0.1

  auth-type crypt
  ...

  rtlabel default external-tag 1
  rtlabel internal external-tag 2 
  rtlabel external external-tag 3 
  rtlabel restricted external-tag 4
  rtlabel isolated external-tag 5
  rtlabel hosted external-tag 6

  area 0.0.0.0 {
    interface em2 {
    }
  }

Router 10.1.0.1 is using em0 to connect to the common subnet. It is advertising
10.66/16, which is the route to the subnet of interface em3, and is assigned
the routing label "isolated". Nothing else on either router is using this
routing label. The ospfd config file is:

  router-id 10.1.0.1

  auth-type crypt
  ...

  rtlabel default external-tag 1
  rtlabel internal external-tag 2 
  rtlabel external external-tag 3 
  rtlabel restricted external-tag 4
  rtlabel isolated external-tag 5
  rtlabel hosted external-tag 6

  redistribute 10.66.0.0/16

  area 0.0.0.0 {
    interface em0 {
    }
  }

Log of ospfd starting on router 10.0.0.1:

  rtlabel_name2id(default)
  ref++ new name=default
  rtlabel_tag(id=1,tag=1)
  ... repeated for other label mappings ...
  rtlabel_name2id(isolated)
  ref++ new name=isolated
  rtlabel_tag(id=5,tag=5)
  ...
  startup
  kr_init: priority filter enabled
  rtlabel_name2id(default)
  ref++ existing name=default
  rtlabel_id2tag(1)
  ... repeated for other local interfaces using labels with mappings ...
  spf_calc: area 0.0.0.0 calculated

After 10.1.0.1 joins:

  nbr_fsm: event HELLO_RECEIVED resulted in action START_INACTIVITY_TIMER and changing state for neighbor ID 10.1.0.1 (em2) from DOWN to INIT
  ...
  spf_calc: area 0.0.0.0 calculated
  rtlabel_tag2id(5)
  rtlabel_id2name(5)
  rtlabel_unref(id=0)

10.1.0.1 then leaves:

  ...
  spf_calc: area 0.0.0.0 calculated
  rtlabel_id2name(5)
  rtlabel_unref(id=5)
  ref-- id=5
  ref==0 remove id=5

10.1.0.1 rejoins:

  nbr_fsm: event 2_WAY_RECEIVED resulted in action EVAL and changing state for neighbor ID 10.1.0.1 (em2) from INIT to EXSTA
  ...
  spf_calc: area 0.0.0.0 calculated
  rtlabel_tag2id(5)
  rtlabel_unref(id=0)
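
To make my reading of these snippets concrete, here is a toy C model of the
sequence. It is only a sketch and is not the ospfd code: every toy_* name is
made up for illustration, and it assumes (as the logs suggest) that a lookup
by name takes a reference, a lookup by tag does not, and releasing the last
reference removes the entry.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX 8

/* One entry of a hypothetical label table: name, id, external tag, refcount. */
struct toy_label {
    int      in_use;
    unsigned id;
    unsigned tag;
    unsigned ref;
    char     name[16];
};

static struct toy_label toy_table[TOY_MAX];

struct toy_label *toy_find(unsigned id)
{
    size_t i;

    for (i = 0; i < TOY_MAX; i++)
        if (toy_table[i].in_use && toy_table[i].id == id)
            return &toy_table[i];
    return NULL;
}

/* Look up or create an entry by name; takes a reference either way. */
unsigned toy_name2id(const char *name)
{
    size_t i;

    for (i = 0; i < TOY_MAX; i++)
        if (toy_table[i].in_use && strcmp(toy_table[i].name, name) == 0) {
            toy_table[i].ref++;
            return toy_table[i].id;
        }
    for (i = 0; i < TOY_MAX; i++)
        if (!toy_table[i].in_use) {
            memset(&toy_table[i], 0, sizeof(toy_table[i]));
            toy_table[i].in_use = 1;
            toy_table[i].id = (unsigned)i + 1;
            toy_table[i].ref = 1;
            strncpy(toy_table[i].name, name, sizeof(toy_table[i].name) - 1);
            return toy_table[i].id;
        }
    return 0;   /* table full */
}

void toy_set_tag(unsigned id, unsigned tag)
{
    struct toy_label *p = toy_find(id);

    if (p != NULL)
        p->tag = tag;
}

/* Look up an id by tag; assumed NOT to take a reference. */
unsigned toy_tag2id(unsigned tag)
{
    size_t i;

    for (i = 0; i < TOY_MAX; i++)
        if (toy_table[i].in_use && toy_table[i].tag == tag)
            return toy_table[i].id;
    return 0;
}

/* Drop one reference; the entry is removed when the count reaches zero. */
void toy_unref(unsigned id)
{
    struct toy_label *p;

    if (id == 0)
        return;
    p = toy_find(id);
    if (p != NULL && --p->ref == 0)
        p->in_use = 0;
}

/* Replay config load, neighbor join, leave, rejoin; return the final lookup. */
unsigned toy_replay(void)
{
    static const char *names[] = { "default", "internal", "external",
        "restricted", "isolated", "hosted" };
    unsigned i, id;

    memset(toy_table, 0, sizeof(toy_table));
    for (i = 0; i < 6; i++) {           /* config load: one ref per label */
        id = toy_name2id(names[i]);
        assert(id != 0);
        toy_set_tag(id, i + 1);
    }
    id = toy_tag2id(5);                 /* join: tag 5 resolved, no ref taken */
    assert(id == 5);
    toy_unref(0);                       /* route had no previous label */
    toy_unref(id);                      /* leave: drops the config's only ref */
    return toy_tag2id(5);               /* rejoin: mapping is gone */
}
```

In this model toy_replay() returns 0, because the release on route removal
drops the only reference, which was the one taken when the config was loaded.
Whether that is what really happens inside ospfd is exactly what I'm unsure
about.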

I think it's reasonably clear from these log snippets what is
happening, but let me know if a complete log would be helpful or if it
would be useful to rerun the steps above with some additional debugging
output.
