Re: [j-nsp] MX304 Port Layout

2023-06-09 Thread Litterick, Jeff (BIT) via juniper-nsp
This is why we got the MX304.  It was a test to replace our MX10008 chassis, 
which we bought a few of a few years back because we had to get into 
high-density 100G at a reasonable price at multiple sites, though we really 
only need 4 line cards, with 2 being for redundancy.  The MX10004 was not 
available at the time (wish it had been; the MX10008 is a heavy beast indeed, 
and we had to use forklifts to move them around into the data centers).  But 
after handling the MX304, we will most likely go to the MX10004 line for 400G 
in the future and just use the MX304 at very small edge sites if needed, 
mainly due to full FPC redundancy requirements at many of our locations.  And 
yes, we had multiple full FPC failures in the past on the MX10008 line.  We 
first went through an RMA cycle with multiple line cards, which in the end was 
due to just 1 line card causing a full FPC failure on a different line card in 
the chassis around every 3 months or so.  Only having everything redundant 
across both FPCs allowed us to avoid serious downtime.


-----Original Message-----
From: juniper-nsp On Behalf Of Andrey Kostin via juniper-nsp
Sent: Friday, June 9, 2023 11:09 AM
To: Mark Tinka 
Cc: Saku Ytti ; juniper-nsp 
Subject: Re: [EXT] [j-nsp] MX304 Port Layout

Mark Tinka wrote on 2023-06-09 10:26:
> On 6/9/23 16:12, Saku Ytti wrote:
> 
>> I expect many people in this list have no need for more performance 
>> than single Trio YT in any pop at all, yet they need ports. And they 
>> are not adequately addressed by vendors. But they do need the deep 
>> features of NPU.
> 
> This.
> 
> There is sufficient performance in Trio today (even a single Trio chip 
> on the board) that people are willing to take an oversubscribed box or 
> line card because in real life, they will run out of ports long before 
> they run out of aggregate forwarding capacity.
> 
> The MX204, even though it's a pizza box, is a good example of how it 
> could do with 8x 100Gbps ports, even though Trio on it will only 
> forward 400Gbps. Most use-cases will require another MX204 chassis, 
> just for ports, before the existing one has hit anywhere close to 
> capacity.

Agreed, there is a gap between the 204 and the 304, but don't forget that they 
belong to different generations. The 304 is shiny and new, with next-level 
performance, and is replacing the MX10k3. The previous generation was announced 
for retirement, but the life of the MX204 was extended because Juniper realized 
that they don't have anything atm to replace it and would probably lose 
revenue. Maybe this gap was caused by covid slowing down the new platform. And 
we may possibly see a single-NPU model based on the new-gen chip, because the 
supply of chips for the 204 is finite. At least it would be logical to make 
it, considering the success of the MX204.
> 
> Really, folk are just chasing the Trio capability, otherwise they'd 
> have long solved their port-count problems by choosing any 
> Broadcom-based box on the market. Juniper know this, and they are 
> using it against their customers, knowingly or otherwise. Cisco was 
> good at this back in the day, over-subscribing line cards on their 
> switches and routers. Juniper have always been a little more purist, 
> but the market can't handle it because the rate of traffic growth is 
> being out-paced by what a single Trio chip can do for a couple of 
> ports, in the edge.

I don't think it's rational to make another chipset with lower bandwidth; it's 
easier to limit an existing, more powerful chip. That leads to an 
MX5/MX10/MX40/MX80-style hardware and licensing model. It could be a single 
Trio6 with up to 1.6T in access ports and 1.6T in uplink ports, with a limited 
feature set. Maybe it will come, who knows, let's watch ;)

Kind regards,
Andrey
___
juniper-nsp mailing list juniper-nsp@puck.nether.net 
https://puck.nether.net/mailman/listinfo/juniper-nsp


Re: [j-nsp] MX304 Port Layout

2023-06-09 Thread Litterick, Jeff (BIT) via juniper-nsp
Not sure, but anything shipped before May would most likely be affected, and 
possibly units shipped a bit into May as well, since we were one of the first 
customers, if not the first, to get the fix applied, and that was to equipment 
we received at the end of March.

We never knew the real root cause, beyond that when it happened the primary RE 
would lock up solid and not respond to any command or input or allow access to 
any management port, and there were no crash dumps or logs.  The backup RE 
would NOT take over; it would get stuck in a loop trying to take over 
mastership.  The backup RE would still respond to management, but even 
rebooting it would not allow it to take mastership.  The only solution was 
pulling the power plugs entirely or removing the RE from the chassis to force 
a power reset.  But Juniper was able to reproduce this in their lab right 
after we reported it; they worked on a fix and got it to us about 1.5 weeks 
later.  We got lucky in that one of the 3 boxes would never last more than 6 
hours after a reboot before the master RE locked up (no matter whether RE0 or 
RE1 was master).  The other 2 boxes could go a week or more before locking up. 
So we were a good test to see if the fix worked, since in the lab it would 
take up to 8 days before locking up.

FYI:  The one that would lock up inside 6 hours was our spare, and it had no 
traffic at all, not even an optic plugged into any port, and not even the test 
traffic that the other 2 had going.  We did not go into production until 2 
weeks after the fix was applied, to make sure.

This problem would also only surface if you have more than one RE plugged into 
the system, even if failover was not configured.  It was just the presence of 
the 2nd RE that would trigger it.  I understand that the engineering team is 
now fully regression testing all releases with multiple REs.  I guess that was 
not true before we found the bug.

-----Original Message-----
From: juniper-nsp On Behalf Of Mark Tinka via juniper-nsp
Sent: Thursday, June 8, 2023 10:53 PM
To: juniper-nsp@puck.nether.net
Subject: Re: [EXT] [j-nsp] MX304 Port Layout



On 6/9/23 00:03, Litterick, Jeff (BIT) via juniper-nsp wrote:

> The big issue we ran into is that if you have redundant REs, there is a 
> super bad bug that, after anywhere from 6 hours to 8 days (1 of our 3 would 
> lock up quickly after a reboot and the other 2 would take a very long time), 
> will lock the entire chassis up solid, to the point where we had to 
> physically pull the REs out to reboot them.  It is fixed now, but they had 
> to manually poke new firmware into the ASICs on each RE while they were in a 
> half-powered state.  It was a very complex procedure with tech support and 
> the MX304 engineering team.  It took about 3 hours to do all 3 MX304s, one 
> RE at a time.  We have not seen an update with this built in yet.  (We just 
> did this back at the end of April.)

Oh dear, that's pretty nasty. So did they say new units shipping today would 
come with the REs already fixed?

We've been suffering a somewhat similar issue on the PTX1000, where a bug 
introduced via regression in Junos 21.4, 22.1 and 22.2 causes CPU queues to 
fill up with unknown MAC address frames that are never cleared. It takes 64 
days for this packet accumulation to grow to the point where the queues are 
exhausted, causing a host loopback wedge.

You would see an error like this in the logs:

   alarmd[27630]: Alarm set: FPC id=150995048, color=RED, class=CHASSIS,
     reason=FPC 0 Major Errors
   fpc0 Performing action cmalarm for error
     /fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[1]
     (0x20002) in module: Host Loopback with scope: pfe category: functional
     level: major
   fpc0 Cmerror Op Set: Host Loopback: HOST LOOPBACK WEDGE DETECTED IN PATH ID 1
     (URI: /fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[1])
   Apr 1 03:52:28  PTX1000 fpc0 CMError:
     /fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[3]
     (0x20004), in module: Host Loopback with scope: pfe category: functional
     level: major

This causes the router to drop all control plane traffic, which, basically, 
makes it unusable. One has to reboot the box to get it back up and running, 
until it happens again 64 days later.
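
For anyone wanting to watch for this condition, a minimal Python sketch of a 
log scan follows. It assumes only the message text quoted above; the function 
name and the sample string are illustrative, and the whitespace is matched 
loosely because the message may be wrapped across lines when copied from mail.

```python
import re

# Message text taken from the log excerpt in this thread. \s+ is used between
# words so the pattern still matches when the line has been wrapped.
WEDGE_RE = re.compile(
    r"HOST\s+LOOPBACK\s+WEDGE\s+DETECTED\s+IN\s+PATH\s+ID\s+(\d+)"
)


def find_wedge_path_ids(log_text: str) -> list[int]:
    """Return the path ID of every host loopback wedge report in log_text."""
    return [int(m.group(1)) for m in WEDGE_RE.finditer(log_text)]


# Illustrative input: the wedge message as it appears, wrapped, in the thread.
sample = (
    "fpc0 Cmerror Op Set: Host Loopback: HOST LOOPBACK \n"
    "WEDGE DETECTED IN PATH ID 1  (URI: \n"
    "/fpc/0/pfe/0/cm/0/Host_Loopback/0/HOST_LOOPBACK_MAKE_CMERROR_ID[1])\n"
)

print(find_wedge_path_ids(sample))  # prints [1]
```

This only detects a wedge after the fact; it is not the SLAX workaround 
mentioned below and does not clear any queues.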

The issue is resolved in Junos 21.4R3-S4, 22.4R2, 23.2R1 and 23.3R1.

However, these releases are not shipping yet, so Juniper gave us a workaround 
SLAX script that automatically runs and clears the CPU queues before the 64 
days are up.

We are currently running Junos 22.1R3.9 on this platform, and will move to 
22.4R2 in a few weeks to permanently fix this.

Junos 20.2, 20.3 and 20.4 are not affected, nor is anything after 23.2R1.

I understand it may also affect the QFX and MX, but I don't have details on 
that.

Mark.


Re: [j-nsp] MX304 Port Layout

2023-06-08 Thread Litterick, Jeff (BIT) via juniper-nsp
No, that is not quite right.  We have 2 MX304 chassis in production today and 
1 spare, all with redundant REs.  You do not need all the ports filled in a 
port group.  I know, since we mixed in some 40G, and 40G is ONLY supported on 
the bottom row of ports, so we have a mix and had to break stuff out, leaving 
empty ports because of that limitation, and it is running just fine.  But you 
do have to be careful which type of optics get plugged into which ports, i.e. 
port 0/2 vs. port 1/3 in a grouping, if you are not using 100G optics.

The big issue we ran into is that if you have redundant REs, there is a super 
bad bug that, after anywhere from 6 hours to 8 days (1 of our 3 would lock up 
quickly after a reboot and the other 2 would take a very long time), will lock 
the entire chassis up solid, to the point where we had to physically pull the 
REs out to reboot them.  It is fixed now, but they had to manually poke new 
firmware into the ASICs on each RE while they were in a half-powered state.  
It was a very complex procedure with tech support and the MX304 engineering 
team.  It took about 3 hours to do all 3 MX304s, one RE at a time.  We have 
not seen an update with this built in yet.  (We just did this back at the end 
of April.)


-----Original Message-----
From: juniper-nsp On Behalf Of Thomas Bellman via juniper-nsp
Sent: Thursday, June 8, 2023 2:09 PM
To: juniper-nsp 
Subject: Re: [EXT] [j-nsp] MX304 Port Layout

On 2023-06-08 17:18, Kevin Shymkiw via juniper-nsp wrote:

> Along with this - I would suggest looking at Port Checker ( 
> https://apps.juniper.net/home/port-checker/index.html ) to make sure 
> your port combinations are valid.

The port checker claims an interesting "feature": if you have anything in port 
3, then *all* the other ports in that port group must also be occupied.  So if 
you use all four of those ports for e.g. 100GE, everything is fine, but if you 
then want to stop using any of ports 0, 1 or 2, the configuration becomes 
invalid...

(And similarly for ports 5, 8 and 14 in their respective groups.)
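
To make the claimed rule concrete, here is a small Python sketch of it. It 
models only what the port checker appears to say, not confirmed MX304 
behaviour. The grouping into four consecutive ports (0-3, 4-7, 8-11, 12-15) 
is an assumption; the only group spelled out above is ports 0-3. All names 
are hypothetical.

```python
# Assumption: each port group is four consecutive ports (0-3, 4-7, 8-11,
# 12-15). Only the grouping of port 3 with ports 0, 1 and 2 is stated above.
GROUP_SIZE = 4

# Ports named above as requiring the rest of their group to be occupied.
TRIGGER_PORTS = {3, 5, 8, 14}


def group_of(port: int) -> set[int]:
    """Return all ports in the (assumed) group that contains `port`."""
    start = (port // GROUP_SIZE) * GROUP_SIZE
    return set(range(start, start + GROUP_SIZE))


def violations(occupied: set[int]) -> list[int]:
    """Return occupied trigger ports whose group is not fully occupied."""
    return sorted(
        p for p in TRIGGER_PORTS & occupied if not group_of(p) <= occupied
    )


print(violations({0, 1, 2, 3}))  # prints []   (full group: valid)
print(violations({0, 1, 3}))     # prints [3]  (port 2 vacated: invalid)
print(violations({0, 1, 2}))     # prints []   (port 3 unused: rule not triggered)
```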

I hope that's a bug in the port checker, not actual behaviour by the MX304...


--
Thomas Bellman,  National Supercomputer Centre,  Linköping Univ., Sweden
"We don't understand the software, and sometimes we don't understand
 the hardware, but we can *see* the blinking lights!"
