Disclaimer:  I work for Cisco on a bunch of silicon.  I'm not intimately 
familiar with any of these devices, but I'm familiar with the high level 
tradeoffs.  There are also exceptions to almost EVERYTHING I'm about to say, 
especially once you get into the second- and third-order implementation 
details.  Your mileage will vary...   ;-)

If you have a model where a single core/block does ALL of the processing for a 
packet (run-to-completion), you generally benefit from lower latency, simpler 
programming, etc.  A major downside is that, to do this, every one of those 
cores has to have access to all of the different memories used to forward said 
packet.  Conversely, if you break up the processing into stages, you can 
connect the FIB lookup memory only to the cores that are going to be doing the 
FIB lookup, and connect the encap memories only to the cores/blocks that are 
doing the encapsulation work.  Those interconnects take up silicon space, which 
equates to higher cost and power.  
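
To make that concrete, here's a toy sketch (in Python, with invented core and 
memory counts -- not taken from any real chip) that just counts core-to-memory 
connections for the two models: a run-to-completion pool where every core must 
reach every memory, versus a staged pipeline where each stage is wired only to 
the memory it actually uses.

MEMORIES = ["fib", "nexthop", "encap", "counters"]  # hypothetical memory blocks
NUM_CORES = 16                                      # hypothetical core count

# Run-to-completion: any core can handle any packet end to end,
# so every core needs a path to every memory.
run_to_completion_links = NUM_CORES * len(MEMORIES)

# Staged pipeline: split the same 16 cores into 4 stages, and wire each
# stage only to the single memory that stage needs.
cores_per_stage = NUM_CORES // len(MEMORIES)
pipelined_links = cores_per_stage * len(MEMORIES)   # one link per core

print("run-to-completion core<->memory links:", run_to_completion_links)  # 64
print("staged pipeline core<->memory links:  ", pipelined_links)          # 16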

Packaging two cores on a single device is beneficial in that you only have one 
physical chip to work with instead of two.  This often simplifies the board 
designers' job, and it's often lower power than two separate chips.  This 
starts to break down with exceptionally large chips, where you bump into the 
physical/reticle limits on how large a chip you can actually build.  With 
newer packaging technology (2.5D chips, HBM and similar memories, chiplets 
down the road, etc.) this becomes even more complicated, but the answer to 
"why would you put two XYZs on a package?" is that it's just cheaper and 
lower power from a system standpoint (and often also from a pure silicon 
standpoint...)

Buffer designs are *really* hard in modern high speed chips, and there are 
always lots and lots of tradeoffs.  The "ideal" answer is an extremely large 
block of memory that ALL of the forwarding/queueing elements have fair/equal 
access to... but this physically looks more or less like a full mesh between 
the memory/buffering subsystem and all the forwarding engines, which becomes 
really unwieldy (expensive!) from a design standpoint.  The amount of memory 
you can practically put on the main NPU die is on the order of 20-200 
**mega**bytes, whereas a single stack of HBM memory comes in at 4GB -- on the 
order of 100x the size.  Figuring out which side of this gigantic gulf you want to live 
on is a super important part of the basic architecture and also drives lots of 
other decisions down the line... once you've decided how much buffering memory 
you're willing/able to put down, the next challenge is coming up with ways to 
provide access to that memory from all the different potential clients.  It's a 
LOT easier to wire up/design a chip where you have four separate 
pipelines/cores/whatever and each one of them accesses 1/4 of the buffer 
memory... but that also means that any given port only has access to 1/4 of the 
memory for burst absorption.  Lots and lots of Smart People Time has gone into 
different memory designs that attempt to attack this tradeoff, and it's a 
major part of the intellectual property of various chip designs.  
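
As a rough illustration of that last point (all numbers here are invented for 
the example, not taken from any particular chip), here's the burst-absorption 
arithmetic in a few lines of Python:

TOTAL_BUFFER_BYTES = 100e6   # assume 100 MB of on-die packet buffer
NUM_PARTITIONS = 4           # four pipelines, each owning 1/4 of it
PORT_RATE_BPS = 400e9        # one 400G port bursting at line rate

def burst_absorption_us(buffer_bytes, rate_bps):
    """Microseconds of full line-rate burst a buffer can soak up."""
    return buffer_bytes * 8 / rate_bps * 1e6

shared = burst_absorption_us(TOTAL_BUFFER_BYTES, PORT_RATE_BPS)
split = burst_absorption_us(TOTAL_BUFFER_BYTES / NUM_PARTITIONS, PORT_RATE_BPS)

print(f"fully shared buffer: {shared:.0f} us of 400G burst")   # 2000 us
print(f"1/4 partition:       {split:.0f} us of 400G burst")    #  500 us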

--lj

-----Original Message-----
From: NANOG <nanog-bounces+ljwobker=gmail....@nanog.org> On Behalf Of Saku Ytti
Sent: Friday, August 5, 2022 3:16 AM
To: Jeff Tantsura <jefftant.i...@gmail.com>
Cc: NANOG <nanog@nanog.org>; Jeff Doyle <jdo...@juniper.net>
Subject: Re: 400G forwarding - how does it work?

Thank you for this.

I wish there had been a deeper dive into the lookup side. My open questions:

a) The Trio model, where a packet stays in a single PPE until done, vs. the FP 
model of a line of PPEs (identical cores). I don't understand the advantages 
of the FP model; the Trio model's advantages are clear to me. Obviously the FP 
model has to have some advantages -- what are they?

b) What exactly are the gains of putting two Trios on-package in Trio6? There 
is no local switching between the WANs of the in-package Trios; as far as I 
can tell they are ships in the night, and packets between the Trios go via the 
fabric, as they would with separate Trios. I can understand the benefit of 
putting a Trio and HBM2 on the same package, to reduce distance so that 
wattage goes down or frequency goes up.

c) What evolution are they thinking of for the shallow ingress buffers in 
Trio6? The potential for collateral damage is significant, because the WAN 
that asks for the most gets the most, instead of each getting its fair share; 
thus a potentially arbitrarily low-rate WAN ingress might not get access to 
the ingress buffer at all, causing drops. Would it be practical in terms of 
wattage/area to add some sort of preQoS in front of the shallow ingress 
buffer, so that each WAN ingress has a fair guaranteed rate into the shallow 
buffers?
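
A toy sketch of the preQoS idea in (c): reserve a small guaranteed slice of the 
shallow buffer for each WAN port, and only make the remainder first-come, 
first-served.  All numbers are invented for illustration, and this is not how 
Trio (or any specific chip) actually implements admission:

SHALLOW_BUFFER_BYTES = 4 * 1024 * 1024   # assumed shallow-buffer size
NUM_PORTS = 8
GUARANTEE = SHALLOW_BUFFER_BYTES // (2 * NUM_PORTS)  # half split as per-port guarantees

reserved_used = [0] * NUM_PORTS          # bytes of guarantee used per port
shared_pool = SHALLOW_BUFFER_BYTES - GUARANTEE * NUM_PORTS
shared_used = 0

def admit(port, pkt_bytes):
    """Admit a packet if it fits in the port's guarantee or the shared pool."""
    global shared_used
    if reserved_used[port] + pkt_bytes <= GUARANTEE:
        reserved_used[port] += pkt_bytes     # charge the per-port guarantee first
        return True
    if shared_used + pkt_bytes <= shared_pool:
        shared_used += pkt_bytes             # then fall back to the shared pool
        return True
    return False                             # drop: no room left for this port

# A bursting port can exhaust the shared pool but never another port's
# guaranteed slice, so a low-rate port still gets some buffer.
for _ in range(10_000):
    admit(0, 1500)        # port 0 hammers the buffer
print(admit(5, 1500))     # port 5 is still admitted -> True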

On Fri, 5 Aug 2022 at 02:18, Jeff Tantsura <jefftant.i...@gmail.com> wrote:
>
> Apologies for garbage/HTMLed email, not sure what happened (thanks 
> Brian F for letting me know).
> Anyway, the podcast with Juniper (mostly around Trio/Express) was 
> broadcast today and is available at 
> https://www.youtube.com/watch?v=1he8GjDBq9g
> Next in the pipeline are:
> Cisco SiliconOne
> Broadcom DNX (Jericho/Qumran/Ramon)
> For both, the guests are the main architects of the silicon
>
> Enjoy
>
> On Wed, Aug 3, 2022 at 5:06 PM Jeff Tantsura <jefftant.i...@gmail.com> wrote:
> >
> > Hey,
> >
> > This is not an advertisement but an attempt to help folks to better 
> > understand networking HW.
> >
> > Some of you might know (and love 😊) the “between 0x2 nerds” podcast Jeff 
> > Doyle and I have been hosting for a couple of years.
> >
> > Following up on the discussion, we have decided to dedicate a number of 
> > upcoming podcasts to networking HW, a topic where more information and 
> > better education is very much needed (no, you won’t have to sign an NDA 
> > before joining 😊). We have lined up a number of great guests, people who 
> > design and build ASICs and can talk firsthand about the evolution of 
> > networking HW, the complexity of the process, the differences between 
> > fixed and programmable pipelines, memories and databases. This Thursday 
> > (08/04) at 11:00 PST we are joined by the one and only Sharada Yeluri - 
> > Sr. Director ASIC at Juniper. Other vendors will be joining in later 
> > episodes; the usual rules apply: no marketing, no BS.
> >
> > More to come, stay tuned.
> >
> > Live feed: https://lnkd.in/gk2x2ezZ
> >
> > Between 0x2 nerds playlist, videos will be published to: 
> > https://www.youtube.com/playlist?list=PLMYH1xDLIabuZCr1Yeoo39enogPA2yJB7
> >
> > Cheers,
> > Jeff
> >
> > From: James Bensley
> > Sent: Wednesday, July 27, 2022 12:53 PM
> > To: Lawrence Wobker; NANOG
> > Subject: Re: 400G forwarding - how does it work?
> >
> > On Tue, 26 Jul 2022 at 21:39, Lawrence Wobker <ljwob...@gmail.com> wrote:
> >
> > > So if this pipeline can do 1.25 billion PPS and I want to be able to 
> > > forward 10BPPS, I can build a chip that has 8 of these pipelines and get 
> > > my performance target that way.  I could also build a "pipeline" that 
> > > processes multiple packets per clock, if I have one that does 2 
> > > packets/clock then I only need 4 of said pipelines... and so on and so 
> > > forth.
> >
> > Thanks for the response Lawrence.
> >
> > The Broadcom BCM16K KBP has a clock speed of 1.2 GHz, so I expect the
> > J2 to have something similar (as someone already mentioned, most chips
> > I've seen are in the 1-1.5 GHz range), so in this case "only" 2
> > pipelines would be needed to maintain the headline 2 Bpps rate of the
> > J2, or even just 1 if they have managed to squeeze out two packets per
> > cycle through parallelisation within the pipeline.
> >
> > Cheers,
> > James.



--
  ++ytti
