Re: [j-nsp] MX304 Port Layout
On Sun, 2 Jul 2023 at 17:15, Mark Tinka wrote:

> Technically, do we not think that an oversubscribed Juniper box with a
> single Trio 6 chip with no fabric is feasible? And is it not being built
> because Juniper don't want to cannibalize their other distributed
> compact boxes?
>
> The MX204, for example, is a single Trio 3 chip that is oversubscribed
> by an extra 240Gbps. So we know they can do it. The issue with the MX204
> is that most customers will run out of ports before they run out of
> bandwidth.

Not disagreeing here, but how do we define "oversubscribed"? Are all boxes oversubscribed which can't do a) 100% at max packet size, b) 100% at min packet size, and c) 100% of packets to the delay buffer? I think this would be a quite reasonable definition, but as far as I know, no current device of non-modest scale would satisfy all three; almost all of them would only satisfy a).

Let's consider first-generation Trio serdes:

1) 2/4 go to the fabric (btree replication)
2) 1/4 go to the delay buffer
3) 1/4 go to the WAN ports (and actually another ~0.2 go to the lookup engine)

So you're selling less than 1/4th of the serdes you ship; more than 3/4 are 'overhead'. Compared to, say, Silicon1, which is partially buffered, they're selling almost 1/2 of the serdes they ship. You could in theory put ports on all of these serdes in bps terms, but not in pps terms, at least not with off-chip memory. And in the pizza-box case, you could sell those fabric serdes as ports, as there is no fabric. So a given NPU always has ~2x the bps in pizza-box format (but usually no more pps).

And in the MX80/MX104, Juniper did just this: they sell 80G of WAN ports, when in line-card mode the same chip is only a 40G WAN-port device. I don't consider it oversubscribed, even though the minimum packet size went up, because the lookup capacity didn't increase.

Curiously, AMZN told NANOG their ratio: when the design is fully scaled to 100T, it is 1/4, i.e. 400T of bought ports for 100T of useful ports.
Unclear how long 100T was going to scale, but obviously they wouldn't launch an architecture which needs to be redone next year, so when they decided on the 100T cap for the scale, they didn't have the 100T need yet. This design was with 112Gx128 chips, and the boxes were single-chip, so all serdes connect to ports, no fabrics, i.e. a true pizza box. I found this very interesting, because the 100T design was, I think, 3 racks? And last year 50T ASICs shipped; next year we'd likely get 100T ASICs (224Gx512? or 112Gx1024?). So even hyperscalers are growing slower than silicon, and can basically put their DC in a chip, greatly reducing cost (both CAPEX and OPEX), as there is no need to waste 3/4ths of the investment on overhead.

The scale also surprised me, even though perhaps it should not have. They quoted 1M+ network devices; considering they quote 20M+ Nitro systems shipped, that's fewer than 20 revenue-generating compute nodes per network device. Depending on the refresh cycle, this means Amazon is buying 15-30k network devices per month, which I expect is significantly more than Cisco+Juniper+Nokia ship combined to SP infra, so no wonder SPs get little love.

--
++ytti
___
juniper-nsp mailing list
juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
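Some of the arithmetic in the message above can be sanity-checked in a few lines. A sketch only: the Trio serdes split, the 400T/100T ratio, and the 1M-device figure are taken from the discussion above; the 64B+20B wire overhead is the standard Ethernet minimum-frame accounting, and the refresh-cycle lengths are assumptions for illustration.

```python
# Sanity-check of the serdes/overhead arithmetic from the thread.
# Fractions are the first-gen Trio split quoted above; refresh cycles
# are assumed values for illustration.

# First-gen Trio serdes allocation (fractions of total serdes shipped)
fabric = 2 / 4        # fabric links (btree replication)
delay_buffer = 1 / 4  # off-chip delay-buffer memory
wan = 1 / 4           # revenue-generating WAN ports

sellable = wan / (fabric + delay_buffer + wan)
print(f"sellable fraction of shipped serdes: {sellable:.2f}")  # 0.25

# AMZN ratio quoted at NANOG: 400T bought ports for 100T useful ports
print(f"AMZN useful/bought ratio: {100 / 400:.2f}")  # 0.25

# Criterion b) above: line-rate pps at minimum packet size.
# 64B frame + 20B preamble/IFG overhead on the wire.
pps_400g = 400e9 / ((64 + 20) * 8)
print(f"400G at 64B line rate: {pps_400g / 1e6:.0f} Mpps")  # 595 Mpps

# ~1M network devices; purchase rate over assumed refresh cycles
for years in (3, 5):
    per_month = 1_000_000 / (years * 12)
    print(f"{years}-year refresh: ~{per_month / 1000:.0f}k devices/month")
```

Both assumed refresh cycles land inside the 15-30k devices/month range quoted above, and the Trio sellable fraction matches the 1/4 ratio AMZN reported for their scaled-out design.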
Re: [j-nsp] MX304 Port Layout
On 7/2/23 15:19, Saku Ytti wrote:

> Right, as is MX304. I don't think this is 'my definition', everything
> was centralised originally, until Cisco7500 came out, which then had
> distributed forwarding capabilities.
>
> Now does centralisation truly mean a BOM benefit to vendors? Probably
> not, but it may allow them to address a lower-margin market which has
> lower per-port performance needs, without cannibalising the
> larger-margin market.

Technically, do we not think that an oversubscribed Juniper box with a single Trio 6 chip with no fabric is feasible? And is it not being built because Juniper don't want to cannibalize their other distributed compact boxes?

The MX204, for example, is a single Trio 3 chip that is oversubscribed by an extra 240Gbps. So we know they can do it. The issue with the MX204 is that most customers will run out of ports before they run out of bandwidth.

I don't think the issue is vendors using Broadcom to oversubscribe a high-capacity chip. It's that other vendors with in-house silicon won't do the same with their own silicon.

Mark.
Re: [j-nsp] MX304 Port Layout
On Sun, 2 Jul 2023 at 15:53, Mark Tinka via juniper-nsp wrote:

> Well, by your definition, the ASR9903, for example, is a distributed
> platform, which has a fabric ASIC via the RP, with 4x NPU's on the fixed
> line card, 2x NPU's on the 800Gbps PEC and 4x NPU's on the 2Tbps PEC.

Right, as is MX304. I don't think this is 'my definition', everything was centralised originally, until Cisco7500 came out, which then had distributed forwarding capabilities.

Now does centralisation truly mean a BOM benefit to vendors? Probably not, but it may allow them to address a lower-margin market which has lower per-port performance needs, without cannibalising the larger-margin market.

--
++ytti
Re: [j-nsp] MX304 Port Layout
On 6/28/23 09:29, Saku Ytti via juniper-nsp wrote:

> This of course makes it more redundant than a distributed box, because
> distributed boxes don't have NPU redundancy.

Well, by your definition, the ASR9903, for example, is a distributed platform, which has a fabric ASIC via the RP, with 4x NPU's on the fixed line card, 2x NPU's on the 800Gbps PEC and 4x NPU's on the 2Tbps PEC.

Mark.
Re: [j-nsp] MX304 Port Layout
On 7/2/23 11:18, Saku Ytti wrote:

> In this context, these are all distributed platforms; they have
> multiple NPUs and fabric. Centralised has a single forwarding chip, and
> significantly more ports than bandwidth.

So to clarify your definition of "centralized": even if there is no replaceable fabric, and the line cards communicate via a fixed fabric ASIC, you'd still define that as a distributed platform?

By your definition, you are speaking about fixed-form-factor platforms with neither a replaceable fabric nor a fabric ASIC, like the MX204, ASR920, ACX7024, 7520-IXR, etc.?

Mark.
Re: [j-nsp] MX304 Port Layout
On Sun, 2 Jul 2023 at 12:11, Mark Tinka wrote:

> Well, for data centre aggregation, especially for 100Gbps transit ports
> to customers, centralized routers make sense (MX304, MX10003, ASR9903,
> etc.). But those boxes don't make sense as Metro-E routers... they can
> aggregate Metro-E routers, but can't be Metro-E routers due to their
> cost.

In this context, these are all distributed platforms; they have multiple NPUs and fabric. Centralised has a single forwarding chip, and significantly more ports than bandwidth.

--
++ytti
Re: [j-nsp] MX304 Port Layout
On 7/2/23 10:42, Saku Ytti wrote:

> Yes. Satellite is basically VLAN aggregation, but a little bit less
> broken. Both are much inferior to MPLS.

I agree that using vendor satellites solves this problem. The issue, IIRC, was what happens when you need to have the satellites in rings?

Satellites work well when fibre is not an issue, and each satellite can hang off the PE router like a spur. But if you need to build rings in order to cover as many areas as possible at a reasonable cost, satellites seemed to struggle with scalable ring topologies. This could have changed over time, not sure. I stopped tracking satellite technologies around 2010.

> But usually that's not the comparison due to real or perceived cost
> reasons. So in the absence of a vendor selling you the front-plate you
> need, the option space often considered is satellite or VLAN
> aggregation, instead of connecting some smaller MPLS edge boxes to
> bigger aggregation MPLS boxes, which would be, in my opinion, obviously
> better.

The cost you pay for a small Metro-E router optimized for ring deployments is more than paid back in the operational simplicity that comes with MPLS-based rings. Having run such architectures for close to 15 years now (since the Cisco ME3600X/3800X), I can tell you how much easier it has been for us to scale and keep customers because we did not have to run Layer 2 rings like our competitors did.

> But as discussed, centralised chassis boxes are appearing as a new
> option in the option space.

Well, for data centre aggregation, especially for 100Gbps transit ports to customers, centralized routers make sense (MX304, MX10003, ASR9903, etc.). But those boxes don't make sense as Metro-E routers... they can aggregate Metro-E routers, but can't be Metro-E routers due to their cost.

I think there is still a use-case for distributed boxes like the MX480 and MX960, for cases where you have to aggregate plenty of 1Gbps and 10Gbps customers.
Those line cards, especially the ones that are now EoS/EoL, are extremely cheap and more than capable of supporting 1Gbps and 10Gbps services in the data centre. At the moment, with modern centralized routers optimized for 100Gbps and 400Gbps, using them to aggregate 10Gbps services or lower may be costlier than, say, an MX480 or MX960 with MPC2E or MPC7E line cards attached to a dense Ethernet switch via 802.1Q.

For the moment, the Metro-E router that makes the most sense to us is the ACX7024. Despite its Broadcom base, we seem to have found a way to make it work for us and replace the ASR920.

Mark.
Re: [j-nsp] MX304 Port Layout
On Sun, 2 Jul 2023 at 11:38, Mark Tinka wrote:

> So all the above sounds to me like scenarios where Metro-E rings are
> built on 802.1Q/Q-in-Q/REP/STP/etc., rather than IP/MPLS.

Yes. Satellite is basically VLAN aggregation, but a little bit less broken. Both are much inferior to MPLS.

But usually that's not the comparison due to real or perceived cost reasons. So in the absence of a vendor selling you the front-plate you need, the option space often considered is satellite or VLAN aggregation, instead of connecting some smaller MPLS edge boxes to bigger aggregation MPLS boxes, which would be, in my opinion, obviously better.

But as discussed, centralised chassis boxes are appearing as a new option in the option space.

--
++ytti
Re: [j-nsp] MX304 Port Layout
On 6/28/23 08:44, Saku Ytti wrote:

> Apart from obvious stuff like QoS getting difficult, not full feature
> parity between VLAN and main interface, or counters becoming less
> useful, as many are port-level so identifying the true source port may
> not be easy. There are things that you'll just discover over time that
> don't even come to your mind, and I don't know what those will be in
> your deployment. I can give anecdotes.
>
> 2*VXR termination of metro L2 ring
> - everything is 'ok'
> - ethernet pseudowire service is introduced to customers
> - occasionally there are loops now
> - well, the VXR goes to promisc mode when you add an ethernet
>   pseudowire, because while it has VLAN local significancy, it doesn't
>   have a per-VLAN MAC filter
> - now an unrelated L3 VLAN, which is redundantly terminated on both
>   VXRs, has a customer CE down in the L2 metro
> - because the ARP timeout is 4h and the MAC timeout is 300s, the metro
>   will forget the MAC fast, L3 slowly
> - so the primary PE gets a packet off the internet and sends it to the
>   metro; the metro floods it to all ports, including the secondary PE
> - the secondary PE sends the packet to the primary PE, over the WAN
> - now you learned 'oh yeah, I should have ensured there is a per-VLAN
>   MAC filter' and 'oh yeah, my MAC/ARP timeouts are misconfigured'
> - but these are probably not the examples you'll learn; they'll be
>   something different
> - when you do satellite, you can solve a lot of the problem scope in
>   software, as you control L2 and L3, and can do proprietary code
>
> L2 transparency
> - You do QinQ in L2 aggregation, to pass the customer frame to the
>   aggregation termination
> - You do MAC rewrite in/out of the L2 aggregation (customer MAC
>   addresses get rewritten coming in from the customer, and mangled back
>   to the legitimate MAC going out to termination). You need this to
>   pass STP and such in pseudowires from customer to termination
> - In termination, the hardware physically doesn't consider VLAN+ISIS a
>   legitimate packet and will kill it, so you have no way of supporting
>   ISIS inside a pseudowire when you have L2 aggregation to the
>   customer. Technically it's not valid: technically ISIS isn't
>   EthernetII, and 802.3 doesn't have VLANs. But technically correct
>   rarely reduces the red hue in customers' faces when they inform you
>   about the issues they are experiencing.
> - even if this works, there are plenty of other ways pseudowire
>   transparency suffers with L2 aggregation, as you are experiencing the
>   sets of limitations of two boxes instead of one box when it comes to
>   transparency, and these sets won't be identical
> - you will introduce a MAC limit to your point-to-point martini
>   product, which didn't previously exist, because your L2 ring is
>   redundant and you need MAC learning. If it's just a single switch,
>   you can turn off MAC learning per VLAN, and be closer to the
>   satellite solution
>
> Convergence
> - your termination no longer observes hardware liveness detection, so
>   you need some solution to transfer L2 port state to the VLAN. Which
>   will occasionally break, as it's new complexity.

So all the above sounds to me like scenarios where Metro-E rings are built on 802.1Q/Q-in-Q/REP/STP/etc., rather than IP/MPLS.

We run fairly large Metro-E rings, but we run them as IP/MPLS rings, and all the issues you describe above are the reasons we pushed the vendors (Cisco in particular) to provide boxes that were optimized for Metro-E applications, but had proper IP/MPLS support. In other words, these are largely solved problems.

I think many - if not all - of the issues you raise above can be fixed by, say, a Cisco ASR920 deployed at scale in the Metro, running IP/MPLS for the backbone, end-to-end.

Mark.
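The ARP/MAC timeout mismatch in the VXR anecdote quoted above can be put in numbers. A back-of-the-envelope sketch: the 4h ARP and 300s MAC timeouts come from the anecdote; everything else is illustrative.

```python
# Flooding exposure in the VXR anecdote: L3 keeps the ARP entry far
# longer than the L2 metro keeps the MAC entry, so once the CE goes
# quiet, the metro has forgotten the MAC while the PE still resolves
# the destination, and every unicast frame is flooded out all ports
# in the VLAN, including towards the secondary PE.

ARP_TIMEOUT_S = 4 * 3600  # L3 ARP cache: 4 hours (from the anecdote)
MAC_TIMEOUT_S = 300       # L2 MAC table: 300 seconds (from the anecdote)

# Window during which frames are flooded: MAC forgotten, ARP still valid.
flood_window_s = ARP_TIMEOUT_S - MAC_TIMEOUT_S
print(f"flooding window: {flood_window_s}s (~{flood_window_s / 3600:.1f}h)")

# Setting the ARP timeout at or below the MAC timeout closes the window,
# which is one reading of 'my MAC/ARP timeouts are misconfigured' above.
aligned_arp_s = MAC_TIMEOUT_S
print(f"aligned ARP timeout: {aligned_arp_s}s -> no flooding window")
```

Nearly four hours of unknown-unicast flooding per silent host is why the misconfigured timers show up as mystery traffic between the PEs rather than as an obvious fault.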