Not sure, but anything shipped before May (and possibly a bit into May) would most 
likely be affected, since we were one of the first customers, if not the first, to 
get the fix applied, on equipment we received at the end of March.

We never knew the real root cause, other than that when it happened the primary RE 
would lock up solid and not respond to any command or input or allow access to any 
management port, and there were no crash dumps or logs.  The backup RE would NOT 
take over; it would get stuck in a loop trying to take over mastership.  The backup 
RE would still respond to management, but even rebooting it would not let it take 
mastership.  The only solution was a full power-plug removal, or pulling the REs 
from the chassis for a power reset.  But Juniper was able to reproduce this in 
their lab right after we reported it, worked on a fix, and got it to us about 1.5 
weeks later.  We got lucky in that one of our 3 boxes would never last more than 6 
hours after a reboot before the master RE locked up (no matter whether RE0 or RE1 
was master), while the other 2 boxes could go a week or more before locking up.  
So we were a good test to see whether the fix worked, since in the lab it could 
take up to 8 days before locking up.

FYI: The one that would lock up inside 6 hours was our spare, which had no traffic 
at all, not even an optic plugged into any port, and none of the test traffic the 
other 2 had going.  We did not go into production until 2 weeks after the fix was 
applied, just to make sure.

This problem would only surface if you had more than one RE plugged into the 
system, even if failover was not configured.  It was just the presence of the 2nd 
RE that would trigger it.  I understand that the engineering team is now fully 
regression-testing all releases with multiple REs; I guess that was not true 
before we found the bug.
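
For reference, the generic way to confirm that a second RE is present and see what 
state it is in is the standard chassis commands; nothing below is specific to this 
bug, just the usual checks:

    show chassis routing-engine              (mastership state and status of RE0/RE1)
    show chassis hardware | match Routing    (confirms a second RE is physically installed)
    show system switchover                   (run on the backup RE to check switchover readiness)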

-----Original Message-----
From: juniper-nsp <juniper-nsp-boun...@puck.nether.net> On Behalf Of Mark Tinka 
via juniper-nsp
Sent: Thursday, June 8, 2023 10:53 PM
To: juniper-nsp@puck.nether.net
Subject: Re: [EXT] [j-nsp] MX304 Port Layout



On 6/9/23 00:03, Litterick, Jeff (BIT) via juniper-nsp wrote:

> The big issue we ran into is that if you have redundant REs, there is a super 
> bad bug that after 6 hours to 8 days (1 of our 3 would lock up quickly after a 
> reboot and the other 2 would take a very long time) will lock the entire 
> chassis up solid, to the point where we had to pull the REs physically out to 
> reboot them.  It is fixed now, but they had to manually poke new firmware into 
> the ASICs on each RE while they were in a half-powered state.  It was a very 
> complex procedure with tech support and the MX304 engineering team.  It took 
> about 3 hours to do all 3 MX304s, one RE at a time.  We have not seen an update 
> with this built in yet.  (We just did this back at the end of April)

Oh dear, that's pretty nasty. So did they say new units shipping today would come 
with the REs already fixed?

We've been suffering a somewhat similar issue on the PTX1000, where a bug 
introduced via regression in Junos 21.4, 22.1 and 22.2 causes CPU queues to fill 
up with unknown MAC address frames that are never cleared. It takes 64 days for 
this packet accumulation to grow to the point where the queues are exhausted, 
causing a host loopback wedge.

You would see an error like this in the logs:

<date> <time> <hostname> alarmd[27630]: Alarm set: FPC id=150995048, color=RED, class=CHASSIS, reason=FPC 0 Major Errors
<date> <time> <hostname> fpc0 Performing action cmalarm for error /fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[1] (0x20002) in module: Host Loopback with scope: pfe category: functional level: major
<date> <time> <hostname> fpc0 Cmerror Op Set: Host Loopback: HOST LOOPBACK WEDGE DETECTED IN PATH ID 1  (URI: /fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[1])
Apr 1 03:52:28  PTX1000 fpc0 CMError: /fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[3] (0x20004), in module: Host Loopback with scope: pfe category: functional level: major

This causes the router to drop all control plane traffic, which, basically, 
makes it unusable. One has to reboot the box to get it back up and running, 
until it happens again 64 days later.

The issue is resolved in Junos 21.4R3-S4, 22.4R2, 23.2R1 and 23.3R1.

However, these releases are not shipping yet, so Juniper gave us a workaround 
SLAX script that automatically runs and clears the CPU queues before the 64 
days are up.
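
(For anyone wanting to do something similar while waiting on the fixed releases: 
the generic on-box way to run a periodic workaround script is event-options. A 
minimal sketch follows; the script file name is just a placeholder, not the actual 
SLAX script Juniper supplied:

event-options {
    generate-event {
        daily-queue-check time-interval 86400;   /* fire roughly once a day */
    }
    policy run-queue-workaround {
        events daily-queue-check;
        then {
            event-script clear-host-queues.slax; /* placeholder name for the workaround script */
        }
    }
    event-script {
        file clear-host-queues.slax;             /* script also has to be declared here and copied to /var/db/scripts/event */
    }
}

The actual queue-clearing logic lives in the SLAX script itself; the above only 
shows how such a script would be scheduled.)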

We are currently running Junos 22.1R3.9 on this platform, and will move to 
22.4R2 in a few weeks to permanently fix this.

Junos 20.2, 20.3 and 20.4 are not affected, nor is anything after 23.2R1.

I understand it may also affect the QFX and MX, but I don't have details on 
that.

Mark.

_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net 
https://puck.nether.net/mailman/listinfo/juniper-nsp