Re: [j-nsp] MX304 Port Layout

2023-07-02 Thread Saku Ytti via juniper-nsp
On Sun, 2 Jul 2023 at 17:15, Mark Tinka  wrote:

> Technically, do we not think that an oversubscribed Juniper box with a
> single Trio 6 chip with no fabric is feasible? And is it not being built
> because Juniper don't want to cannibalize their other distributed
> compact boxes?
>
> The MX204, for example, is a single Trio 3 chip that is oversubscribed
> by an extra 240Gbps. So we know they can do it. The issue with the MX204
> is that most customers will run out of ports before they run out of
> bandwidth.

Not disagreeing, but how do we define oversubscribed here? Are all
boxes oversubscribed which can't do a) 100% at max packet size, b) 100%
at min packet size, and c) 100% of packets to the delay buffer? I think
that would be quite a reasonable definition, but as far as I know, no
current device of non-modest scale satisfies all three; almost all of
them only satisfy a).
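A rough way to express that three-part definition, as a Python sketch
with illustrative numbers (no particular box; the constants are
assumptions for the example):

AVG_MAX_PKT_BITS = 1500 * 8   # max-size packet, illustrative
AVG_MIN_PKT_BITS = 64 * 8     # min-size packet, illustrative

def failed_criteria(wan_bps, lookup_pps, buffer_write_bps):
    # a) line rate at max packet size (bps-bound)
    # b) line rate at min packet size (pps-bound)
    # c) every packet can be written to the delay buffer
    failed = []
    if lookup_pps * AVG_MAX_PKT_BITS < wan_bps:
        failed.append('a')
    if lookup_pps * AVG_MIN_PKT_BITS < wan_bps:
        failed.append('b')
    if buffer_write_bps < wan_bps:
        failed.append('c')
    return failed

# e.g. an imaginary 400G box with 300Mpps of lookup and 200G of buffer
# write bandwidth passes a) but fails b) and c):
print(failed_criteria(400e9, 300e6, 200e9))   # ['b', 'c']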

Let's consider first-gen Trio SerDes:
1) 2/4 go to the fabric (btree replication)
2) 1/4 goes to the delay buffer
3) 1/4 goes to the WAN ports
(and actually something like an additional 0.2 goes to the lookup engine)

So you're selling less than a quarter of the SerDes you ship; more than
three quarters are 'overhead'. Compare that to, say, Silicon One, which
is partially buffered: there they're selling almost half of the SerDes
they ship. You could in theory put ports on all of these SerDes in bps
terms, but not in pps terms, at least not with off-chip memory.

And in each case, in a pizza-box format, you could sell those fabric
ports, as there is no fabric. So a given NPU always has ~2x the bps in
pizza-box format (but usually no more pps). And in the MX80/MX104
Juniper did just this: they sell 80G of WAN ports, when in linecard mode
the same NPU is a 40G WAN port device. I don't consider it
oversubscribed, even though the minimum packet size went up, because the
lookup capacity didn't increase.
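Putting numbers on that SerDes budget (a back-of-the-envelope Python
sketch; the 2/4 + 1/4 + 1/4 + ~0.2 split is from above, the rest is
just arithmetic):

FABRIC = 0.50   # 2/4 of SerDes: towards fabric, btree replication
BUFFER = 0.25   # 1/4: off-chip delay buffer
WAN    = 0.25   # 1/4: revenue-earning WAN ports
LOOKUP = 0.20   # roughly this much extra towards the lookup engine

total = FABRIC + BUFFER + WAN + LOOKUP
print(f"sold as WAN: {WAN / total:.0%} of shipped SerDes")   # ~21%, <1/4

# Pizza box, MX80/MX104 style: no fabric, so enough former fabric SerDes
# face front to double the WAN bps (40G linecard mode -> 80G), while the
# pps stays the same.
print(f"pizza box:   {2 * WAN / total:.0%}")                 # ~42%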

Curiously, AMZN told NANOG their ratio: when the design is fully scaled
to 100T, it is 1/4, i.e. 400T of bought ports for 100T of useful ports.
It's unclear how long 100T was going to scale, but obviously they
wouldn't launch an architecture which needs to be redone next year, so
when they decided on the 100T cap for the scale, they didn't have a
100T need yet. This design was built with 112G x 128 chips, and the
boxes were single-chip, so all SerDes connect to ports, no fabrics,
i.e. a true pizza box.
I found this very interesting, because the 100T design was, I think, 3
racks? And last year 50T ASICs shipped; next year we'll likely get 100T
ASICs (224G x 512? or 112G x 1024?). So even hyperscalers are growing
slower than silicon, and can basically put their DC in a single chip,
greatly reducing cost (both CAPEX and OPEX), as there is no need to
waste 3/4 of the investment on overhead.
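Rough math on those figures (Python; the box count is my derivation,
not AMZN's, and ignores any internal oversubscription):

bought_t, useful_t = 400, 100
print(useful_t / bought_t)                   # 0.25, the quoted 1/4 ratio

chip_t = 112 * 128 / 1000                    # ~14.3T per 112G x 128 chip
print(bought_t / chip_t)                     # ~28 single-chip boxes in all

# candidate next-gen 100T-class chips, either SerDes mix:
print(224 * 512 / 1000, 112 * 1024 / 1000)   # ~114.7T each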
The scale also surprised me, even though perhaps it should not have.
They quoted 1M+ network devices; considering they quote 20M+ Nitro
systems shipped, that's fewer than 20 revenue-generating compute nodes
per network device. Depending on the refresh cycle, this means Amazon
is buying 15-30k network devices per month, which I expect is
significantly more than Cisco, Juniper and Nokia combined ship to SP
infrastructure, so no wonder SPs get little love.
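The fleet arithmetic behind that (Python; the 3-6 year refresh-cycle
range is an assumption):

devices, nitro = 1_000_000, 20_000_000
print(nitro / devices)                       # 20 compute nodes per device

for refresh_years in (3, 6):                 # assumed refresh-cycle range
    print(devices / (refresh_years * 12))    # ~28k .. ~14k per month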

-- 
  ++ytti


Re: [j-nsp] MX304 Port Layout

2023-07-02 Thread Mark Tinka via juniper-nsp

On 7/2/23 15:19, Saku Ytti wrote:

> Right, as is the MX304.
>
> I don't think this is 'my definition'; everything was centralised
> originally, until the Cisco 7500 came out, which then had distributed
> forwarding capabilities.
>
> Now does centralisation truly mean a BOM benefit to vendors? Probably
> not, but it may allow them to address a lower-margin market which has
> lower per-port performance needs, without cannibalising the
> larger-margin market.


Technically, do we not think that an oversubscribed Juniper box with a 
single Trio 6 chip with no fabric is feasible? And is it not being built 
because Juniper don't want to cannibalize their other distributed 
compact boxes?


The MX204, for example, is a single Trio 3 chip that is oversubscribed 
by an extra 240Gbps. So we know they can do it. The issue with the MX204 
is that most customers will run out of ports before they run out of 
bandwidth.


I don't think vendors using Broadcom to oversubscribe a high-capacity
chip is the issue. It's that other vendors with in-house silicon won't
do the same with their own silicon.



Mark.


Re: [j-nsp] MX304 Port Layout

2023-07-02 Thread Saku Ytti via juniper-nsp
On Sun, 2 Jul 2023 at 15:53, Mark Tinka via juniper-nsp
 wrote:

> Well, by your definition, the ASR9903, for example, is a distributed
> platform: it has a fabric ASIC via the RP, with 4x NPUs on the fixed
> line card, 2x NPUs on the 800Gbps PEC and 4x NPUs on the 2Tbps PEC.

Right, as is the MX304.

I don't think this is 'my definition'; everything was centralised
originally, until the Cisco 7500 came out, which then had distributed
forwarding capabilities.

Now does centralisation truly mean a BOM benefit to vendors? Probably
not, but it may allow them to address a lower-margin market which has
lower per-port performance needs, without cannibalising the
larger-margin market.



-- 
  ++ytti


Re: [j-nsp] MX304 Port Layout

2023-07-02 Thread Mark Tinka via juniper-nsp

On 6/28/23 09:29, Saku Ytti via juniper-nsp wrote:

> This of course makes it more redundant than a distributed box, because
> distributed boxes don't have NPU redundancy.


Well, by your definition, the ASR9903, for example, is a distributed
platform: it has a fabric ASIC via the RP, with 4x NPUs on the fixed
line card, 2x NPUs on the 800Gbps PEC and 4x NPUs on the 2Tbps PEC.


Mark.


Re: [j-nsp] MX304 Port Layout

2023-07-02 Thread Mark Tinka via juniper-nsp

On 7/2/23 11:18, Saku Ytti wrote:


> In this context, these are all distributed platforms: they have
> multiple NPUs and a fabric. Centralised has a single forwarding chip,
> and significantly more ports than bandwidth.


So to clarify your definition of "centralized": even if there is no
replaceable fabric, and the line cards communicate via a fixed fabric
ASIC, you'd still define that as a distributed platform?


By your definition, you are speaking about fixed form factor platforms
with neither a replaceable fabric nor a fabric ASIC, like the MX204,
ASR920, ACX7024, 7520-IXR, etc.?


Mark.


Re: [j-nsp] MX304 Port Layout

2023-07-02 Thread Saku Ytti via juniper-nsp
On Sun, 2 Jul 2023 at 12:11, Mark Tinka  wrote:

> Well, for data centre aggregation, especially for 100Gbps transit ports
> to customers, centralized routers make sense (MX304, MX10003, ASR9903,
> etc.). But those boxes don't make sense as Metro-E routers... they can
> aggregate Metro-E routers, but can't be Metro-E routers due to their cost.

In this context, these are all distributed platforms: they have
multiple NPUs and a fabric. Centralised has a single forwarding chip,
and significantly more ports than bandwidth.

-- 
  ++ytti


Re: [j-nsp] MX304 Port Layout

2023-07-02 Thread Mark Tinka via juniper-nsp

On 7/2/23 10:42, Saku Ytti wrote:


> Yes. Satellite is basically VLAN aggregation, but a little bit less
> broken. Both are much inferior to MPLS.


I agree that using vendor satellites solves this problem. The issue,
IIRC, is what happens when you need to have the satellites in rings?


Satellites work well when fibre is not an issue, and each satellite can
hang off the PE router like a spur. But if you need to build rings in
order to cover as many areas as possible at a reasonable cost,
satellites seemed to struggle to support scalable ring topologies. This
could have changed over time, I'm not sure; I stopped tracking
satellite technologies around 2010.




> But usually that's not the comparison, due to real or perceived cost
> reasons. So in the absence of a vendor selling you the front plate you
> need, the option space often considered is satellite or VLAN
> aggregation, instead of connecting some smaller MPLS edge boxes to
> bigger MPLS aggregation boxes, which would be, in my opinion,
> obviously better.


The cost you pay for a small Metro-E router optimized for ring
deployments is more than paid back in the operational simplicity that
comes with MPLS-based rings. Having run such architectures for close to
15 years now (since the Cisco ME3600X/3800X), I can tell you how much
easier it has been for us to scale and keep customers, because we did
not have to run Layer 2 rings like our competitors did.




> But as discussed, centralised chassis boxes are appearing as a new
> option in the option space.


Well, for data centre aggregation, especially for 100Gbps transit ports
to customers, centralized routers make sense (MX304, MX10003, ASR9903,
etc.). But those boxes don't make sense as Metro-E routers... they can
aggregate Metro-E routers, but can't be Metro-E routers due to their cost.


I think there is still a use-case for distributed boxes like the MX480
and MX960, for cases where you have to aggregate plenty of 1Gbps and
10Gbps customers. Those line cards, especially the ones that are now
EoS/EoL, are extremely cheap and more than capable of supporting 1Gbps
and 10Gbps services in the data centre. At the moment, with modern
centralized routers optimized for 100Gbps and 400Gbps, using them to
aggregate 10Gbps services or lower may be costlier than, say, an MX480
or MX960 with MPC2E or MPC7E line cards attached to a dense Ethernet
switch via 802.1Q.


For the moment, the Metro-E router that makes the most sense to us is
the ACX7024. Despite its Broadcom base, we seem to have found a way to
make it work for us and replace the ASR920.


Mark.


Re: [j-nsp] MX304 Port Layout

2023-07-02 Thread Saku Ytti via juniper-nsp
On Sun, 2 Jul 2023 at 11:38, Mark Tinka  wrote:

> So all the above sounds to me like scenarios where Metro-E rings are
> built on 802.1Q/Q-in-Q/REP/STP/etc., rather than IP/MPLS.

Yes. Satellite is basically VLAN aggregation, but a little bit less
broken. Both are much inferior to MPLS. But usually that's not the
comparison, due to real or perceived cost reasons. So in the absence of
a vendor selling you the front plate you need, the option space often
considered is satellite or VLAN aggregation, instead of connecting some
smaller MPLS edge boxes to bigger MPLS aggregation boxes, which would
be, in my opinion, obviously better.

But as discussed, centralised chassis boxes are appearing as a new
option in the option space.

-- 
  ++ytti


Re: [j-nsp] MX304 Port Layout

2023-07-02 Thread Mark Tinka via juniper-nsp

On 6/28/23 08:44, Saku Ytti via juniper-nsp wrote:


> Apart from obvious stuff like QoS getting difficult, not having full
> feature parity between VLANs and the main interface, or counters
> becoming less useful (as many are port-level, so identifying the true
> source port may not be easy), there are things that you'll just
> discover over time that don't even come to mind now, and I don't know
> what those will be in your deployment. I can give anecdotes:
>
> 2*VXR termination of a metro L2 ring
>  - everything is 'ok'
>  - an ethernet pseudowire service is introduced to customers
>  - occasionally there are loops now
>  - well, the VXR goes into promiscuous mode when you add an ethernet
> pseudowire, because while it has VLAN local significance, it doesn't
> have a per-VLAN MAC filter
>  - now an unrelated L3 VLAN, which is redundantly terminated on both
> VXRs, has a customer CE down in the L2 metro
>  - because the ARP timeout is 4h and the MAC timeout is 300s, the
> metro will forget the MAC fast, L3 slowly
>  - so the primary PE gets a packet off the internet and sends it to
> the metro; the metro floods it to all ports, including the secondary PE
>  - the secondary PE sends the packet back to the primary PE, over the WAN
>  - now you've learned 'oh yeah, I should have ensured there is a
> per-VLAN MAC filter' and 'oh yeah, my MAC/ARP timeouts are
> misconfigured'
>  - but these are probably not the examples you'll learn; they'll be
> something different
>  - when you do satellite, you can solve a lot of the problem scope in
> software, as you control L2 and L3 and can run proprietary code
>
> L2 transparency
>  - You do QinQ in the L2 aggregation, to pass customer frames to the
> aggregation termination
>  - You do MAC rewrite in/out of the L2 aggregation (customer MAC
> addresses get rewritten coming in from the customer, and mangled back
> to the legitimate MAC going out to termination). You need this to pass
> STP and such in pseudowires from customer to termination
>  - In termination, the hardware simply doesn't consider VLAN+ISIS a
> legitimate packet and will kill it, so you have no way of supporting
> ISIS inside a pseudowire when you have L2 aggregation to the customer.
> Technically it's not valid: ISIS isn't EthernetII, and 802.3 doesn't
> have VLANs. But technically correct rarely reduces the red hue in
> customers' faces when they report the issues they are experiencing.
>  - even if this works, there are plenty of other ways pseudowire
> transparency suffers with L2 aggregation, as you are experiencing the
> sets of limitations of two boxes instead of one when it comes to
> transparency, and these sets won't be identical
>  - you will introduce a MAC limit to your point-to-point Martini
> product, which didn't previously exist, because your L2 ring is
> redundant and you need MAC learning. If it's just a single switch, you
> can turn off MAC learning per VLAN and be closer to the satellite
> solution
>
> Convergence
>  - your termination no longer observes hardware liveness detection,
> so you need some solution to transfer L2 port state to the VLAN. Which
> will occasionally break, as it's new complexity.
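To put a number on the flooding window in that VXR anecdote (a Python
sketch using the timers quoted above):

ARP_TIMEOUT = 4 * 3600   # seconds: L3 keeps the entry, and keeps sending
MAC_TIMEOUT = 300        # seconds: L2 forgets the MAC, starts flooding

# CE goes silent at t=0; between MAC_TIMEOUT and ARP_TIMEOUT every frame
# towards that MAC is unknown-unicast flooded on all metro ports.
window = ARP_TIMEOUT - MAC_TIMEOUT
print(f"flooding window: {window}s (~{window / 3600:.1f}h)")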


So all the above sounds to me like scenarios where Metro-E rings are
built on 802.1Q/Q-in-Q/REP/STP/etc., rather than IP/MPLS.


We run fairly large Metro-E rings, but we run them as IP/MPLS rings,
and all the issues you describe above are the reasons we pushed the
vendors (Cisco in particular) to provide boxes that were optimized for
Metro-E applications but had proper IP/MPLS support. In other words,
these are largely solved problems.


I think many, if not all, of the issues you raise above can be fixed
by, say, a Cisco ASR920 deployed at scale in the Metro, running
IP/MPLS for the backbone, end-to-end.


Mark.