Re: [c-nsp] ASR9k - IPoE termination

Nathan Ward Wed, 22 Jun 2016 01:06:35 -0700

Hi,

> On 22/06/2016, at 11:39, Pshem Kowalczyk <pshe...@gmail.com> wrote:
> 
> Hi,
> 
> We're testing IPoE termination on ASR9ks and ran into a small but annoying
> issue.
> Our subs will terminate on PW-Eth interfaces that ultimately connect to an
> L2 broadcast domain (the access network; this is not something we can change).
> So when two BNGs are attached to the same broadcast domain, they both
> see the same DHCP-Discovery packets. As a result, both BNGs build a
> session for the sub, and both pass the DHCP-Discovery on to the actual
> DHCP servers, which allocate IPs. Many CPEs will send a DHCP-Request to
> both BNGs as well, but the sub's CPE then takes only one lease and
> completely ignores the other, so the second session remains idle for the
> duration of the lease (we use the 'proxy' functionality) and ultimately
> drops. This is not too bad for subs that end up with different IPs on the
> two BNGs, but where a static IP is involved, the same IP ends up on both
> BNGs (which happily advertise it into the rest of the network), and this
> blackholes a portion of the traffic until the 'idle' session times out.
> Not ideal.
> 
> The overarching requirement here is that we have to provide basic
> redundancy (so the session can connect to the second BNG if the first one
> drops).
> 
> So far I've come up with the following scenarios:
> 1. Active and backup PWEs
>  - won't work, as there are 2 independent entry points from the broadcast
> domain into the MPLS cloud.
> 2. Somehow checking the state on radius or DHCP servers and either not
> allowing the session in (radius) or not responding (or at least delaying
> the response) to requests (DHCP)
> - the difficulty here is in determining reliably if the session should be
> allowed or not
> - there might be a delay in radius accounting propagation
> - many of the clients we tested send a high number of DHCP-Discovery
> packets in a short time, which results in race conditions in the
> RADIUS/DHCP setup, as they hit different BNGs/RADIUS/DHCP servers
> 
> Ideally I would like to delay building the session on one BNG (similar
> to pado-delay in the PPPoE world). Any idea how this can be achieved?


I have been looking at exactly this thing in the last week.

Firstly, PW-HE has worked very poorly for us for IPoE. We’ve tried on a number 
of releases over the last 12 months, including the latest release a few weeks 
back. Of course, there are the problems where NP queue resources are consumed 
for each physical link that a PW might transit. We also found that if a 
PW-carrying MPLS link flapped or a whole PW flapped and customers were 
disconnected/reconnected very rapidly, the DHCPv4 functionality would just die. 
I presume this is related to the queue resource creation/deletion in the NP, 
though we haven’t had a chance to debug this closely.
I noted that DHCPv6 (which kept working when DHCPv4 died) didn’t handle 
remote-id correctly, but haven’t had a chance to look into whether this was 
just PW-HE or a general DHCPv6 problem on the ASR9000.

Now to the main part of your message.

I have considered delaying DHCP-triggered RADIUS responses in order to delay 
the DHCP response, but given that we can only have 255 RADIUS requests in 
flight, this would make getting customers online very slow. If the delay is 
only a second or two, then maybe this would work. I’m not certain what would 
happen to BNG resources if we were to accept the request, though; I presume 
the subscriber would time out at some point if the client never sends a 
DHCP REQUEST after the BNG sends a DHCP OFFER. That assumes the client 
doesn’t send a DHCP REQUEST when it receives a second OFFER… and in our 
market, who knows what CPEs people are running. I see you note that some 
send REQUESTs to both BNGs, so there you go.
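The cost of the delay can be estimated with Little's law: with at most 255 requests outstanding, holding each one for d seconds caps the authentication rate at roughly 255/d requests per second. A rough sketch, assuming the deliberate hold dominates RADIUS round-trip time and using a hypothetical 10,000-subscriber reconnect storm for illustration:

```python
# Rough upper bound on RADIUS request rate when each DHCP-triggered
# request is deliberately held for a fixed delay before being answered.
# Assumptions: a hard cap of 255 outstanding requests (the figure quoted
# above) and that the hold time dominates the real RADIUS latency.

MAX_IN_FLIGHT = 255  # outstanding RADIUS requests, per the text


def max_auth_rate(hold_seconds: float, in_flight: int = MAX_IN_FLIGHT) -> float:
    """Little's law: throughput <= concurrency / time each request is held."""
    if hold_seconds <= 0:
        raise ValueError("hold_seconds must be positive")
    return in_flight / hold_seconds


def time_to_reconnect(subscribers: int, hold_seconds: float) -> float:
    """Seconds needed to re-authenticate `subscribers` after a mass drop."""
    return subscribers / max_auth_rate(hold_seconds)


# Hypothetical reconnect storm of 10,000 subscribers.
for delay in (0.5, 1.0, 2.0):
    print(f"{delay:>4}s hold -> {max_auth_rate(delay):6.1f} req/s, "
          f"10k subs in {time_to_reconnect(10_000, delay):6.1f}s")
```

A two-second hold caps the rate at about 127 requests per second, so 10,000 subscribers would need well over a minute to come back, which is why only a very short delay seems workable.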

I am also considering some sort of memcache or similar solution to allow the 
RADIUS servers to share state about which requests have been answered in the 
last n seconds, so that the second server can reject the request.
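A minimal sketch of that shared-state idea, assuming the cache offers an atomic "store only if absent, with expiry" operation (memcached's `add` command has these semantics). The `SharedState` class below is a hypothetical in-process stand-in, not a memcache client, and the 5-second window and remote-id values are illustrative:

```python
import time
from typing import Callable, Dict


class SharedState:
    """Hypothetical in-process stand-in for a shared cache such as memcached.

    Models only an atomic "add if absent, with TTL" operation.
    """

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expiry: Dict[str, float] = {}

    def add(self, key: str, ttl: float) -> bool:
        """Store key if it is absent or expired; return True if stored."""
        now = self._clock()
        exp = self._expiry.get(key)
        if exp is not None and exp > now:
            return False  # another RADIUS server already answered recently
        self._expiry[key] = now + ttl
        return True


def should_accept(state: SharedState, remote_id: str, window: float = 5.0) -> bool:
    """Accept the first request for a remote-id within `window` seconds;
    reject duplicates arriving via the other BNG inside that window."""
    return state.add(f"dhcp-auth:{remote_id}", ttl=window)


# Usage with a fake clock so the expiry is deterministic.
fake_now = [0.0]
state = SharedState(clock=lambda: fake_now[0])
first = should_accept(state, "olt1/1/3")   # True: first BNG wins
second = should_accept(state, "olt1/1/3")  # False: duplicate rejected
fake_now[0] = 6.0
third = should_accept(state, "olt1/1/3")   # True: window expired
```

Keeping the window short matters for the redundancy requirement: once it expires, the other BNG can still bring the subscriber up if the first one has failed.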

I am also considering the above, but with an optimisation where RADIUS servers 
act differently depending on the output of a hash function over, say, the 
remote-id: either responding immediately and then pushing to memcache, or 
waiting a short delay and then checking memcache.
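One possible reading of that optimisation: every server hashes the remote-id onto the same ordered server list, so exactly one server takes the fast path for a given subscriber and the others wait, then check. A minimal sketch, assuming a plain Python set stands in for memcache (no TTL, no network) and using hypothetical server names and remote-id values:

```python
import hashlib
import time
from typing import Callable, List, Set


def is_fast_path(remote_id: str, me: str, servers: List[str]) -> bool:
    """Deterministically pick one server per remote-id to answer immediately.
    All servers must use the same ordered `servers` list."""
    digest = hashlib.sha256(remote_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index] == me


def decide(remote_id: str, me: str, servers: List[str], claimed: Set[str],
           delay: float = 1.0,
           sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True to accept the request, False to reject it as a duplicate."""
    if is_fast_path(remote_id, me, servers):
        claimed.add(remote_id)  # answer now, then publish the claim
        return True
    sleep(delay)  # give the fast-path server time to publish its claim
    if remote_id in claimed:
        return False  # already answered via the other path
    # Fast-path server stayed silent (down, or never saw the request): take over.
    # Note: check-then-add is not atomic here; a real cache would need an
    # atomic add to avoid a race.
    claimed.add(remote_id)
    return True


servers = ["radius-a", "radius-b"]
claimed: Set[str] = set()
rid = "olt1/1/3"
fast = next(s for s in servers if is_fast_path(rid, s, servers))
slow = next(s for s in servers if s != fast)
print(decide(rid, fast, servers, claimed, sleep=lambda _: None))  # True
print(decide(rid, slow, servers, claimed, sleep=lambda _: None))  # False
```

The appeal over a plain first-come check is that only the slow path pays the delay, and it pays it only for the subscribers hashed away from it.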

I have not tested any of these yet, and am mulling them over.

If you are using proxy DHCP functionality, perhaps you can auth both BNGs, and 
control which you respond to in your DHCP server - if your DHCP server can 
support such things. Perhaps the FreeRADIUS DHCP support can help out here.
We too are doing proxy DHCP on the BNGs, though I’d like to move to 
RADIUS-backed DHCP on the BNGs.

--
Nathan Ward

_______________________________________________
cisco-nsp mailing list  cisco-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/
