Hello list,

We are experiencing PFE crashes (core dumps) on one of our EX4200 VCs.

It appears we've hit an unknown bug and we have been working with JTAC and 
Juniper engineering to find the root cause of the issue, but so far without any 
luck. (There is no public PR for this issue yet.)

I was hoping our issue looks familiar to someone on this list. We are getting 
desperate, since we don't have a workaround or solution other than switching 
to a new vendor.

So here is our setup:

- 6 VC members
- about 60 virtualization servers connected, each hosting about 60 VMs and 
each connected with a 2x1GE LAG to the VC
- Each VM has two interfaces in two different VLANs (public and private network)
- These VLANs are big broadcast domains, shared by all virtualization servers 
and VMs within this VC
- We provide both v4 and v6 connectivity on this VC

So that means thousands of MAC, ARP and v6 neighbour entries in the PFE 
database (but nowhere near the supported limit of 16k entries).
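As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. It uses only the approximate figures from the setup above; reading "16k" as 16,000 entries is an assumption, and the count covers VM MAC addresses only (host, ARP and v6 neighbour entries are not included).

```python
# Rough estimate of VM MAC addresses, using the approximate figures above.
SERVERS = 60          # "about 60 virtualization servers"
VMS_PER_SERVER = 60   # "about 60 VMs" per server
IFACES_PER_VM = 2     # one interface in each of the two VLANs
LIMIT = 16_000        # assumption: "16k" read as 16,000 entries


def estimate_vm_macs(servers=SERVERS, vms_per_server=VMS_PER_SERVER,
                     ifaces_per_vm=IFACES_PER_VM):
    """Return the approximate number of VM MAC addresses in the VC."""
    return servers * vms_per_server * ifaces_per_vm


vm_macs = estimate_vm_macs()
share = vm_macs / LIMIT
print(f"approx. VM MACs: {vm_macs} ({share:.0%} of the assumed limit)")
```

The real counts on the switches will differ, since the per-server figures are only approximate.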

We use OSPF + OSPFv3 to distribute routes, but we've got fairly small tables 
(about 30 v4 and 17 v6 routes). Other than that the configuration of this VC is 
fairly trivial.

So the trouble started about a month ago. We were still running 10.4R9.5 back 
then. Suddenly the PFE daemons of two separate VCs started core dumping for no 
apparent reason about every two hours. No configuration changes had been made 
whatsoever.

JTAC analysed our core dumps and told us this was a known issue (a null pointer 
exception). It would not be fixed in 10.4 but was fixed in 11.2R5.5, the 
JTAC-recommended release for EX4200.

So we planned emergency maintenance and upgraded to the recommended release. 
About a week later, the issue returned, on just one VC this time, and it has 
happened almost once a week since then. Our other VCs seem to be stable so 
far.

Long story short: JTAC ran out of ideas, created an internal PR and forwarded 
our case to their engineering team, which is still looking for a root cause as 
I'm writing this.

In the meantime we are still experiencing this issue and our customers are 
becoming a bit impatient (and rightfully so). We need to work out a plan B in 
case Juniper can't find the root cause and provide a fix.

We could upgrade to an even newer release, but we don't have the impression 
this would solve our issue at all. It could even make matters worse (no way to 
tell in advance).

We would appreciate it if anyone could share any information about similar 
issues and workarounds or solutions. Thanks in advance!

Regards,

--
Dennis Krul 
Tilaa


_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
