Re: [j-nsp] Junos 20 - slow RPD
I've been down the path of very slow RPD with JTAC recently. In our case it
was due to some mildly complex BGP community stuff that we do, which was
exhausting memory limits. A good fix for us was to bump up the memory
allocation using these hidden commands:

set policy-options as-path-match memory-limit 16m
set policy-options community-match memory-limit 16m

The default is 2097152 bytes (2 MiB), so very small. You can see some
interesting numbers with some other hidden commands:

show policy community-match
show policy as-path-match

Also, if you're running EVPN, check out this PR, which is a whole world of fun:
https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1616167

On Fri, Mar 25, 2022 at 6:27 AM Mark Tinka via juniper-nsp
<juniper-nsp@puck.nether.net> wrote:

> On 3/25/22 11:21, Mihai via juniper-nsp wrote:
>
> > In my case I just upgraded one MX204 in the lab to 21.2R2, enabled
> > rib-sharding and increased the JunosVM memory to 24G, and things look
> > better now.
>
> Glad to hear!
>
> Mark.

___
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
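For anyone wanting to try the hidden-knob fix above, the session might look
like the following. This is only a sketch: these are hidden commands, so
they won't tab-complete, the syntax may vary by release, and 16m is just
the value that worked for us.

```shell
# Sketch only -- hidden knobs, not in CLI help; syntax may differ
# across Junos releases. 16m is an example value, not a recommendation.
configure
set policy-options as-path-match memory-limit 16m
set policy-options community-match memory-limit 16m
commit and-quit

# Inspect current policy match memory usage (also hidden commands):
show policy community-match
show policy as-path-match
```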
Re: [j-nsp] Junos 20 - slow RPD
On 3/25/22 11:21, Mihai via juniper-nsp wrote:

> In my case I just upgraded one MX204 in the lab to 21.2R2, enabled
> rib-sharding and increased the JunosVM memory to 24G, and things look
> better now.

Glad to hear!

Mark.
Re: [j-nsp] Junos 20 - slow RPD
In my case I just upgraded one MX204 in the lab to 21.2R2, enabled
rib-sharding and increased the JunosVM memory to 24G, and things look
better now.

On 25/03/2022 00:58, Gustavo Santos via juniper-nsp wrote:

> Hi,
>
> I thought I was the only one with this issue. Even with an RE-S-X6-64G,
> we have very slow outbound updates. Sending full routing tables to
> customers can take up to 60 minutes or more when you have a lot of BGP
> groups, for instance one group per customer. And if we have an issue
> with the preferred upstream provider, the customer routers may be
> offline until all updates are sent.
>
> We got new routers and are going to try the latest Junos 20.4R3 service
> release with update threading and rib-sharding to see if we get some
> improvement. It is better to lose NSR than to blackhole traffic for
> over an hour.
>
> On Wed, Mar 23, 2022 at 06:41, Mark Tinka via juniper-nsp
> <juniper-nsp@puck.nether.net> wrote:
>
> > On 3/22/22 22:42, Mihai via juniper-nsp wrote:
> >
> > > Hi Saku,
> > >
> > > The routes are in VRF, so no support for rib-sharding, unfortunately.
> > > This MX204 is running 20.2R3-S3, so probably the only option is to
> > > try another version.
> >
> > We've had some terrible experiences with RPD due to NSR sync to re1
> > for BGP, on an RE-S-1800 running Junos 20.4R3.8. Turns out the code
> > can't deal with grouping outbound updates to eBGP neighbors at scale
> > for that RE, which crashes RPD on re1.
> >
> > The options were to either disable NSR, rewrite our outbound policies
> > and combine multiple customers in the same outbound group, or get
> > more memory. We went for the last option.
> >
> > No more problems on the RE-S-X6-64G.
> >
> > Juniper have some work to do to optimize the code in these use-cases.
> >
> > Mark.
Re: [j-nsp] Junos 20 - slow RPD
On 3/25/22 02:58, Gustavo Santos via juniper-nsp wrote:

> Hi,
>
> I thought I was the only one with this issue.

From their feedback, it seems the issue of scaling outbound updates of
full tables to eBGP neighbors is known within Juniper, because they told
us they have had to come up with all manner of hacks for many of their
large-scale customers as well.

So it's a fundamental problem, one I'm not sure they are addressing very
well. We can't keep throwing hardware at the problem.

> It is better to lose NSR than to blackhole traffic for over an hour.

Agreed - we had gotten to the point where we were willing to give up NSR
until we figure this out.

Mark.
Re: [j-nsp] Junos 20 - slow RPD
Hi,

I thought I was the only one with this issue. Even with an RE-S-X6-64G,
we have very slow outbound updates. Sending full routing tables to
customers can take up to 60 minutes or more when you have a lot of BGP
groups, for instance one group per customer. And if we have an issue with
the preferred upstream provider, the customer routers may be offline
until all updates are sent.

We got new routers and are going to try the latest Junos 20.4R3 service
release with update threading and rib-sharding to see if we get some
improvement. It is better to lose NSR than to blackhole traffic for over
an hour.

On Wed, Mar 23, 2022 at 06:41, Mark Tinka via juniper-nsp
<juniper-nsp@puck.nether.net> wrote:

> On 3/22/22 22:42, Mihai via juniper-nsp wrote:
>
> > Hi Saku,
> >
> > The routes are in VRF, so no support for rib-sharding, unfortunately.
> > This MX204 is running 20.2R3-S3, so probably the only option is to try
> > another version.
>
> We've had some terrible experiences with RPD due to NSR sync to re1 for
> BGP, on an RE-S-1800 running Junos 20.4R3.8. Turns out the code can't
> deal with grouping outbound updates to eBGP neighbors at scale for that
> RE, which crashes RPD on re1.
>
> The options were to either disable NSR, rewrite our outbound policies
> and combine multiple customers in the same outbound group, or get more
> memory. We went for the last option.
>
> No more problems on the RE-S-X6-64G.
>
> Juniper have some work to do to optimize the code in these use-cases.
>
> Mark.
Re: [j-nsp] Junos 20 - slow RPD
On 3/22/22 22:42, Mihai via juniper-nsp wrote:

> Hi Saku,
>
> The routes are in VRF, so no support for rib-sharding, unfortunately.
> This MX204 is running 20.2R3-S3, so probably the only option is to try
> another version.

We've had some terrible experiences with RPD due to NSR sync to re1 for
BGP, on an RE-S-1800 running Junos 20.4R3.8. Turns out the code can't
deal with grouping outbound updates to eBGP neighbors at scale for that
RE, which crashes RPD on re1.

The options were to either disable NSR, rewrite our outbound policies and
combine multiple customers in the same outbound group, or get more
memory. We went for the last option.

No more problems on the RE-S-X6-64G.

Juniper have some work to do to optimize the code in these use-cases.

Mark.
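For illustration, the "combine multiple customers in the same outbound
group" option looks roughly like this. Group name, policy name, ASNs, and
addresses are all hypothetical; the point is that eBGP peers sharing one
group with an identical export policy let RPD build each outbound update
once and replicate it, rather than once per customer group.

```shell
# Sketch with hypothetical names and values. Instead of one BGP group
# per customer, several customers share a group and a single export
# policy, so outbound update generation can be shared across them.
set protocols bgp group CUSTOMERS-FULL type external
set protocols bgp group CUSTOMERS-FULL export FULL-TABLE-OUT
set protocols bgp group CUSTOMERS-FULL neighbor 192.0.2.1 peer-as 64500
set protocols bgp group CUSTOMERS-FULL neighbor 192.0.2.5 peer-as 64501
```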
Re: [j-nsp] Junos 20 - slow RPD
Hi Saku,

The routes are in VRF, so no support for rib-sharding, unfortunately.
This MX204 is running 20.2R3-S3, so probably the only option is to try
another version.

Thank you for your time and info, very useful as always.

On 22/03/2022 17:58, Saku Ytti wrote:

> Hey,
>
> > On MX204 with ~4M routes, after upgrading from 18.2 to 20.2 the RPD
> > is way slower in processing BGP policies and sending the routes to
> > neighbors. For example, on a BGP group with one neighbor and an
> > export policy containing 5 terms each matching a community, it takes
> > ~1 min (100% RPD utilisation) to send 1k routes to the neighbor in
> > 20.2, compared to 15s in 18.2. Disabling terms will reduce the time.
> >
> > Anyone experienced something similar?
>
> I don't recognise this problem specifically. It seems a rather terrible
> regression, so you probably should either open a JTAC case or do the
> Junos dance.
>
> If you have a large RIB/FIB ratio, allowing more than one core to work
> on BGP will produce improvement:
>
> set system processes routing bgp rib-sharding number-of-shards 4
> set system processes routing bgp update-threading
>
> This is a disruptive change. JNPR wanted us on 20.3 (we are on
> 20.3R3-S2) for rib-sharding, but we did run it previously on 20.2R3-S3
> with success. We are currently targeting 21.4R1-S1.
>
> If you have memory pressure, you can expand the default 16GB DRAM to
> 24GB DRAM via a configuration toggle (post 21.2R1). If you are
> comfortable hacking the QEMU/KVM config manually, you can do it on any
> release and can entertain other sizes.
Re: [j-nsp] Junos 20 - slow RPD
Hey,

> On MX204 with ~4M routes, after upgrading from 18.2 to 20.2 the RPD is
> way slower in processing BGP policies and sending the routes to
> neighbors. For example, on a BGP group with one neighbor and an export
> policy containing 5 terms each matching a community, it takes ~1 min
> (100% RPD utilisation) to send 1k routes to the neighbor in 20.2,
> compared to 15s in 18.2. Disabling terms will reduce the time.
>
> Anyone experienced something similar?

I don't recognise this problem specifically. It seems a rather terrible
regression, so you probably should either open a JTAC case or do the
Junos dance.

If you have a large RIB/FIB ratio, allowing more than one core to work on
BGP will produce improvement:

set system processes routing bgp rib-sharding number-of-shards 4
set system processes routing bgp update-threading

This is a disruptive change. JNPR wanted us on 20.3 (we are on 20.3R3-S2)
for rib-sharding, but we did run it previously on 20.2R3-S3 with success.
We are currently targeting 21.4R1-S1.

If you have memory pressure, you can expand the default 16GB DRAM to 24GB
DRAM via a configuration toggle (post 21.2R1). If you are comfortable
hacking the QEMU/KVM config manually, you can do it on any release and
can entertain other sizes.

--
++ytti
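Applying those two knobs end to end would look something like the sketch
below. Assumptions worth flagging: the shard count of 4 is just the
example value from this thread (tune to your RE's core count), the change
restarts rpd and is therefore disruptive, and per the rest of this thread
you effectively trade away NSR to use it.

```shell
# Sketch -- disruptive change (restarts rpd); shard count of 4 is an
# example, and elsewhere in this thread NSR is given up to use this.
configure
set system processes routing bgp rib-sharding number-of-shards 4
set system processes routing bgp update-threading
commit and-quit
```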
[j-nsp] Junos 20 - slow RPD
Hi,

On MX204 with ~4M routes, after upgrading from 18.2 to 20.2 the RPD is
way slower in processing BGP policies and sending the routes to
neighbors. For example, on a BGP group with one neighbor and an export
policy containing 5 terms each matching a community, it takes ~1 min
(100% RPD utilisation) to send 1k routes to the neighbor in 20.2,
compared to 15s in 18.2. Disabling terms will reduce the time.

Anyone experienced something similar?

Thanks!
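For context, a policy of the shape described (several terms, each matching
one community) looks roughly like this. Community names, values, and the
policy name are hypothetical, and only two of the terms are shown; the
original report had five.

```shell
# Hypothetical reconstruction of the policy shape described above:
# one term per community match (two of five terms shown).
set policy-options community CUST-A members 64496:100
set policy-options community CUST-B members 64496:200
set policy-options policy-statement EXPORT-TO-PEER term A from community CUST-A
set policy-options policy-statement EXPORT-TO-PEER term A then accept
set policy-options policy-statement EXPORT-TO-PEER term B from community CUST-B
set policy-options policy-statement EXPORT-TO-PEER term B then accept
set policy-options policy-statement EXPORT-TO-PEER then reject
```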