Re: [Bloat] Hardware upticks

2016-01-05 Thread Jonathan Morton

> On 6 Jan, 2016, at 02:22, Steinar H. Gunderson  wrote:
> 
> On Tue, Jan 05, 2016 at 04:06:03PM -0800, Stephen Hemminger wrote:
>> The expensive part is often having to save and restore all the state in
>> registers and other bits on context switch.
> 
> Are you sure? There's not really all that much state to save, and all I've
> been taught before says the opposite.
> 
> Also, I've never ever seen the actual context switch turn up high in a perf
> profile.  Is this because of some sampling artifact?

ARM has dedicated register banks for several interrupt levels for exactly this 
reason.  Simple interrupt handlers can operate in these without spilling *any* 
userspace registers.  This gives ARM quite good interrupt latency, especially 
in the simpler implementations.

That doesn’t help for an actual context switch of course.  What does help is 
“lazy FPU state switching”, where on a context switch the FPU is simply marked 
as unavailable.  Only if/when the process attempts to *use* the FPU does this 
trap, and the trap handler restores the correct state before returning an 
enabled FPU to userspace.  The same goes for SIMD register banks, of course.

Lazy context switching is a kernel feature.  It’s used on all architectures 
that have a runtime disable-able FPU, AFAIK.  For a context switch to kernel 
and back to the same process, the FPU & SIMD are never actually switched, so 
there is almost no overhead.
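The bookkeeping behind lazy FPU switching can be sketched in a toy model 
(Python for illustration; real kernels do this in the trap handler with 
hardware flags such as x86's CR0.TS, and the names below are made up, not any 
kernel's API):

```python
# Toy model of lazy FPU state switching: a context switch only marks the FPU
# "unavailable"; the actual save/restore happens lazily, inside the trap
# raised by the first real FPU use after the switch.

class Task:
    def __init__(self, regs):
        self.saved_fpu = regs      # this task's FPU state when not on the CPU

class CPU:
    def __init__(self):
        self.fpu_enabled = False
        self.fpu_owner = None      # task whose state is live in the FPU bank
        self.fpu_regs = None       # the physical FPU register bank
        self.switches = 0          # count of real save/restore operations

    def context_switch(self, task):
        self.fpu_enabled = False   # cheap: just disable, save nothing
        self.current = task

    def use_fpu(self):
        if not self.fpu_enabled:   # would trap to the kernel on real hardware
            if self.fpu_owner is not self.current:
                if self.fpu_owner is not None:
                    self.fpu_owner.saved_fpu = self.fpu_regs  # spill old state
                self.fpu_regs = self.current.saved_fpu        # load new state
                self.fpu_owner = self.current
                self.switches += 1
            self.fpu_enabled = True
        return self.fpu_regs

cpu = CPU()
a, b = Task("A-state"), Task("B-state")

cpu.context_switch(a); cpu.use_fpu()
cpu.context_switch(b)              # e.g. kernel work; b never touches the FPU
cpu.context_switch(a); cpu.use_fpu()
print(cpu.switches)                # only one real FPU state load happened
```

Switching away and back to the same task costs nothing beyond re-enabling the 
FPU, which is the point made above.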

 - Jonathan Morton

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Hardware upticks

2016-01-05 Thread Stephen Hemminger
On Wed, 6 Jan 2016 01:55:37 +0100
"Steinar H. Gunderson"  wrote:

> On Tue, Jan 05, 2016 at 04:53:14PM -0800, Stephen Hemminger wrote:
> >> Also, I've never ever seen the actual context switch turn up high in a perf
> >> profile.  Is this because of some sampling artifact?
> > Yes, especially with Intel processors getting more and more SSE/floating 
> > point
> > registers.
> 
> But those are not saved on context switch to the kernel, no? (Thus the rule
> of no floating-point in the kernel.) Only if you switch between userspace
> processes

Right, that just punts the work to the kernel for when it context-switches
to the next process.


Re: [Bloat] Hardware upticks

2016-01-05 Thread Steinar H. Gunderson
On Tue, Jan 05, 2016 at 04:53:14PM -0800, Stephen Hemminger wrote:
>> Also, I've never ever seen the actual context switch turn up high in a perf
>> profile.  Is this because of some sampling artifact?
> Yes, especially with Intel processors getting more and more SSE/floating point
> registers.

But those are not saved on context switch to the kernel, no? (Thus the rule
of no floating-point in the kernel.) Only if you switch between userspace
processes.

/* Steinar */
-- 
Homepage: https://www.sesse.net/


Re: [Bloat] Hardware upticks

2016-01-05 Thread Stephen Hemminger
On Wed, 6 Jan 2016 01:22:13 +0100
"Steinar H. Gunderson"  wrote:

> On Tue, Jan 05, 2016 at 04:06:03PM -0800, Stephen Hemminger wrote:
> > The expensive part is often having to save and restore all the state in
> > registers and other bits on context switch.
> 
> Are you sure? There's not really all that much state to save, and all I've
> been taught before says the opposite.
> 
> Also, I've never ever seen the actual context switch turn up high in a perf
> profile.  Is this because of some sampling artifact?

Yes, especially with Intel processors getting more and more SSE/floating point
registers.


Re: [Bloat] Hardware upticks

2016-01-05 Thread Steinar H. Gunderson
On Tue, Jan 05, 2016 at 04:06:03PM -0800, Stephen Hemminger wrote:
> The expensive part is often having to save and restore all the state in
> registers and other bits on context switch.

Are you sure? There's not really all that much state to save, and all I've
been taught before says the opposite.

Also, I've never ever seen the actual context switch turn up high in a perf
profile.  Is this because of some sampling artifact?

/* Steinar */
-- 
Homepage: https://www.sesse.net/


Re: [Bloat] Hardware upticks

2016-01-05 Thread Stephen Hemminger
The expensive part is often having to save and restore all the state in
registers and other bits on context switch.


On Tue, Jan 5, 2016 at 4:01 PM, Steinar H. Gunderson  wrote:

> On Tue, Jan 05, 2016 at 03:36:10PM -0600, Benjamin Cronce wrote:
> > You can't have different virtual memory space and not take some large
> > switching overhead without devoting a lot of transistors to massive
> caches.
> > And the larger the caches, the higher the latency.
>
> I'm sure you already know this, but just to add to what you're saying:
> Modern CPUs actually have cache-line tagging tricks so that they don't have
> to blow the entire L1 just because you do a context switch. It would be too
> expensive.
>
> /* Steinar */
> --
> Homepage: https://www.sesse.net/


Re: [Bloat] Hardware upticks

2016-01-05 Thread Steinar H. Gunderson
On Tue, Jan 05, 2016 at 03:36:10PM -0600, Benjamin Cronce wrote:
> You can't have different virtual memory space and not take some large
> switching overhead without devoting a lot of transistors to massive caches.
> And the larger the caches, the higher the latency.

I'm sure you already know this, but just to add to what you're saying:
Modern CPUs actually have cache-line tagging tricks so that they don't have
to blow the entire L1 just because you do a context switch. It would be too
expensive.
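For the TLB side of the same problem, the trick is address-space tags (PCID on 
x86, ASID on ARM): entries carry the owning context's ID, so a context switch 
needn't flush them, and a process comes back to a warm TLB. (L1 data caches 
avoid flushes differently, by being physically tagged.) A toy model, Python 
for illustration, not real hardware:

```python
# Toy model of an address-space-tagged TLB: entries are keyed by
# (asid, virtual_page), so two processes mapping the same virtual page
# coexist, and returning to a process hits its surviving entries.

class TaggedTLB:
    def __init__(self):
        self.entries = {}          # (asid, vpage) -> physical page
        self.misses = 0

    def translate(self, asid, vpage, page_table):
        key = (asid, vpage)
        if key not in self.entries:
            self.misses += 1       # would walk the page table in hardware
            self.entries[key] = page_table[vpage]
        return self.entries[key]

tlb = TaggedTLB()
pt_a = {0: 100}                    # two processes mapping virtual page 0
pt_b = {0: 200}                    # to different physical pages

tlb.translate(asid=1, vpage=0, page_table=pt_a)   # miss, fill
tlb.translate(asid=2, vpage=0, page_table=pt_b)   # miss: different tag
tlb.translate(asid=1, vpage=0, page_table=pt_a)   # hit: survived the switch
print(tlb.misses)                  # two compulsory misses, no flush-induced one
```

Without the tag, the switch to ASID 2 would have forced a flush, and the 
return to ASID 1 would miss again.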

/* Steinar */
-- 
Homepage: https://www.sesse.net/


Re: [Bloat] Hardware upticks

2016-01-05 Thread Steinar H. Gunderson
On Tue, Jan 05, 2016 at 12:27:03PM -0800, Stephen Hemminger wrote:
> Intel has some new Cache QoS stuff that allows configuring how much
> cache is allowed per context.  But of course it is only on the 
> newest/latest/unobtainium.

Note that this is for improving fairness between applications, not total
throughput of the machine. It will gain you nothing if you run a single
server; what it gains you is that you can take your low-priority batch job
and run it next to your mission-critical web server, without having to worry
it will eat up all your L3.

Needless to say, it is for advanced users.

/* Steinar */
-- 
Homepage: https://www.sesse.net/


Re: [Bloat] Hardware upticks

2016-01-05 Thread Steinar H. Gunderson
On Tue, Jan 05, 2016 at 11:37:02AM -0800, Dave Täht wrote:
>> Anyway, the biggest cost of a context switch isn't necessarily the time used
>> to set up registers and such. It's increased L1 pressure; your CPU is now
>> running different code and looking at (largely) different data.
> A L1/L2 Icache dedicated to interrupt processing code could make a great
> deal of difference, if only cpu makers and benchmarkers would make
> CS time something we valued.
> 
> Dcache, not so much, except for the intel architectures which are now
> doing DMA direct to cache. (any arms doing that?)

Note that I'm saying L1 pressure, not “bad choice of what to keep in L1”.
If you dedicate L1i space to interrupt-processing code (which presumably
includes large parts of your TCP/IP stack?), you will have less left for your
normal userspace code, and I'd want to see some very hard data showing that's
a win before taking it at face value.

In a sense, if you tie your interrupts to a dedicated core, you get exactly
this, though.

/* Steinar */
-- 
Homepage: https://www.sesse.net/


Re: [Bloat] Hardware upticks

2016-01-05 Thread Benjamin Cronce
On Tue, Jan 5, 2016 at 12:57 PM, Dave Täht  wrote:

>
>
> On 1/5/16 10:27 AM, Jonathan Morton wrote:
> > Undoubtedly.  But that beefy quad-core CPU should be able to handle it
> > without them.
>
> Sigh. It's not just the CPU that matters. Context switch time, memory
> bus and I/O bus architecture, the intelligence or lack thereof of the
> network interface, and so on.
>
> To give a real world case of stupidity in a hardware design - the armada
> 385 in the linksys platform connects tx and rx packet related interrupts
> to a single interrupt line, requiring that tx and rx ring buffer cleanup
> (in particular) be executed on a single cpu, *at the same time, in a
> dedicated thread*.
>
> Saving a single pin (which doesn't even exist off chip) serializes
> tx and rx processing. DUMB. (I otherwise quite like much of the Marvell
> ethernet design and am looking forward to the Turris Omnia very much)
>
> ...
>
> Context switch time is probably one of the biggest hidden nightmares in
> modern OOO cpu architectures - they only go fast in a straight line. I'd
> love to see a 1ghz processor that could context switch in 5 cycles.
>

Seeing that most modern CPUs take thousands to tens of thousands of cycles
to context-switch, 5 cycles is effectively "instantly". Some of that overhead
is shooting down the TLB and taking many layers of cache misses. You can't
have different virtual memory spaces and not take some large switching
overhead without devoting a lot of transistors to massive caches. And the
larger the caches, the higher the latency.

Modern PC hardware can use soft interrupts to reduce hardware interrupts
and context switching. My Intel i350 issues a steady ~150 interrupts per
second per core regardless of network load, while maintaining ping times in
the tens of microseconds.

I'm not sure what they could do with custom architectures; there will always
be an issue with context-switching overhead, but they may be able to cache a
few specific contexts, knowing that an embedded system will rarely have more
than a few contexts doing the bulk of the work.


>
> Having 4 cores responding to interrupts masks this latency somewhat
> when having multiple sources of interrupt contending... (but see above -
> you need dedicated interrupt lines per major source of interrupts for
> it to work)
>
> and the inherent context switch latency is still always there. (see
> Cheshire's rant)
>
> The general purpose "mainstream" processors not handling interrupts well
> anymore is one of the market drivers towards specialized co-processors.
>
> ...
>
> Knowing broadcom, there's probably so many invasive offloads, bugs
> and errata in this new chip that 90% of the features will never be
> used. But "forwarding uninspected, unfirewalled packets in massive
> bulk so as to win a benchmark race", ok, happy they are trying.
>
> Maybe they'll even publish a data sheet worth reading.
>
> >
> > - Jonathan Morton
> >
> >
> >


Re: [Bloat] Hardware upticks

2016-01-05 Thread Jonathan Morton
Yes, Intel is the master of market segmentation here.  I don't believe for
a second that most of their best features just happen to have a defect rate
high enough to warrant setting the kill bit on all the cheaper badges
slapped onto the common die.

A few years ago, I got a killer deal from AMD.  The Phenom II X2 555 BE.
In the right motherboard, it would happily attempt to turn the two missing
cores back on.  If successful, you had a Phenom II X4 955 BE - and mine was.
It's still a pretty nice beast - shame it's locked away in storage
for the moment.

Intel doesn't allow such nice tricks.  They'd lose too much money from it.

- Jonathan Morton


Re: [Bloat] Hardware upticks

2016-01-05 Thread Stephen Hemminger
On Tue, 5 Jan 2016 11:37:02 -0800
Dave Täht  wrote:

> 
> 
> On 1/5/16 11:29 AM, Steinar H. Gunderson wrote:
> > On Tue, Jan 05, 2016 at 10:57:13AM -0800, Dave Täht wrote:
> >> Context switch time is probably one of the biggest hidden nightmares in
> >> modern OOO cpu architectures - they only go fast in a straight line. I'd
> >> love to see a 1ghz processor that could context switch in 5 cycles.
> > 
> > It's called hyperthreading? ;-)
> > 
> > Anyway, the biggest cost of a context switch isn't necessarily the time used
> > to set up registers and such. It's increased L1 pressure; your CPU is now
> > running different code and looking at (largely) different data.
> 
> +10.
> 
> A L1/L2 Icache dedicated to interrupt processing code could make a great
> deal of difference, if only cpu makers and benchmarkers would make
> CS time something we valued.
> 
> Dcache, not so much, except for the intel architectures which are now
> doing DMA direct to cache. (any arms doing that?)
> 
> > /* Steinar */
> > 

Intel has some new Cache QoS stuff that allows configuring how much
cache is allowed per context.  But of course it is only on the 
newest/latest/unobtainium.


Re: [Bloat] Hardware upticks

2016-01-05 Thread David Collier-Brown
The SPARC T5 is surprisingly good here, with a very short path to cache 
and a moderate number of threads with hot cache lines.  Cache performance 
was one of the surprises when the slowish early T-machines came out; it 
surprised a smarter colleague and me, as we had apps bottlenecking on cold 
cache lines on what were nominally much faster processors.


I'd love to have a T5-1 on an experimenter board, or perhaps even in my 
laptop (I used to own a SPARC laptop), but that's not where Snoracle is 
going.


--dave

On 05/01/16 02:37 PM, Dave Täht wrote:


On 1/5/16 11:29 AM, Steinar H. Gunderson wrote:

On Tue, Jan 05, 2016 at 10:57:13AM -0800, Dave Täht wrote:

Context switch time is probably one of the biggest hidden nightmares in
modern OOO cpu architectures - they only go fast in a straight line. I'd
love to see a 1ghz processor that could context switch in 5 cycles.

It's called hyperthreading? ;-)

Anyway, the biggest cost of a context switch isn't necessarily the time used
to set up registers and such. It's increased L1 pressure; your CPU is now
running different code and looking at (largely) different data.

+10.

A L1/L2 Icache dedicated to interrupt processing code could make a great
deal of difference, if only cpu makers and benchmarkers would make
CS time something we valued.

Dcache, not so much, except for the intel architectures which are now
doing DMA direct to cache. (any arms doing that?)


/* Steinar */





--
David Collier-Brown, | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
dav...@spamcop.net   |  -- Mark Twain



Re: [Bloat] Hardware upticks

2016-01-05 Thread Dave Täht


On 1/5/16 11:29 AM, Steinar H. Gunderson wrote:
> On Tue, Jan 05, 2016 at 10:57:13AM -0800, Dave Täht wrote:
>> Context switch time is probably one of the biggest hidden nightmares in
>> modern OOO cpu architectures - they only go fast in a straight line. I'd
>> love to see a 1ghz processor that could context switch in 5 cycles.
> 
> It's called hyperthreading? ;-)
> 
> Anyway, the biggest cost of a context switch isn't necessarily the time used
> to set up registers and such. It's increased L1 pressure; your CPU is now
> running different code and looking at (largely) different data.

+10.

An L1/L2 Icache dedicated to interrupt-processing code could make a great
deal of difference, if only CPU makers and benchmarkers made context-switch
time something we valued.

Dcache, not so much, except for the intel architectures which are now
doing DMA direct to cache. (any arms doing that?)

> /* Steinar */
> 


Re: [Bloat] Hardware upticks

2016-01-05 Thread Steinar H. Gunderson
On Tue, Jan 05, 2016 at 10:57:13AM -0800, Dave Täht wrote:
> Context switch time is probably one of the biggest hidden nightmares in
> modern OOO cpu architectures - they only go fast in a straight line. I'd
> love to see a 1ghz processor that could context switch in 5 cycles.

It's called hyperthreading? ;-)

Anyway, the biggest cost of a context switch isn't necessarily the time used
to set up registers and such. It's increased L1 pressure; your CPU is now
running different code and looking at (largely) different data.

/* Steinar */
-- 
Homepage: https://www.sesse.net/


Re: [Bloat] Hardware upticks

2016-01-05 Thread Dave Täht


On 1/5/16 10:27 AM, Jonathan Morton wrote:
> Undoubtedly.  But that beefy quad-core CPU should be able to handle it
> without them.

Sigh. It's not just the CPU that matters. Context switch time, memory
bus and I/O bus architecture, the intelligence or lack thereof of the
network interface, and so on.

To give a real-world case of stupidity in a hardware design - the Armada
385 in the Linksys platform connects tx and rx packet-related interrupts
to a single interrupt line, requiring that tx and rx ring buffer cleanup
(in particular) be executed on a single cpu, *at the same time, in a
dedicated thread*.

Saving a single pin (which doesn't even exist off chip) serializes
tx and rx processing. DUMB. (I otherwise quite like much of the Marvell
ethernet design and am looking forward to the Turris Omnia very much.)

...

Context switch time is probably one of the biggest hidden nightmares in
modern OOO cpu architectures - they only go fast in a straight line. I'd
love to see a 1ghz processor that could context switch in 5 cycles.

Having 4 cores responding to interrupts masks this latency somewhat
when having multiple sources of interrupt contending... (but see above -
you need dedicated interrupt lines per major source of interrupts for
it to work)

and the inherent context switch latency is still always there. (see
Cheshire's rant)

The general purpose "mainstream" processors not handling interrupts well
anymore is one of the market drivers towards specialized co-processors.

...

Knowing Broadcom, there are probably so many invasive offloads, bugs
and errata in this new chip that 90% of the features will never be
used. But "forwarding uninspected, unfirewalled packets in massive
bulk so as to win a benchmark race" - ok, happy they are trying.

Maybe they'll even publish a data sheet worth reading.

> 
> - Jonathan Morton
> 
> 
> 


Re: [Bloat] Hardware upticks

2016-01-05 Thread Jonathan Morton
Undoubtedly.  But that beefy quad-core CPU should be able to handle it
without them.

- Jonathan Morton


Re: [Bloat] Hardware upticks

2016-01-05 Thread Aaron Wood
'5Gbps system throughput "without taxing the CPU,"'

Lots of offloads?

On Mon, Jan 4, 2016 at 10:37 PM, Jonathan Morton 
wrote:

> This looks potentially interesting:
> http://www.theregister.co.uk/2016/01/05/broadcom_pimps_iot_router_chip/
>
> Even if that particular device turns out to be hard to work with in an
> open-source manner, it looks like hardware in general might be about to
> improve.
>
>  - Jonathan Morton
>