Re: CPA patchset

2008-01-11 Thread dean gaudet
On Fri, 11 Jan 2008, dean gaudet wrote:

> On Fri, 11 Jan 2008, Ingo Molnar wrote:
> 
> > * Andi Kleen <[EMAIL PROTECTED]> wrote:
> > 
> > > Cached requires the cache line to be read first before you can write 
> > > it.
> > 
> > nonsense, and you should know it. It is perfectly possible to construct 
> > fully written cachelines, without reading the cacheline first. MOVDQ is 
> > SSE1 so on basically in every CPU today - and it is 16 byte aligned and 
> > can generate full cacheline writes, _without_ filling in the cacheline 
> > first.
> 
> did you mean to write MOVNTPS above?

btw in case you were thinking a normal store to WB rather than a 
non-temporal store... i ran a microbenchmark streaming stores to every 16 
bytes of a 16MiB region aligned to 4096 bytes on a xeon 53xx series CPU 
(4MiB L2) + 5000X northbridge and the avg latency of MOVNTPS is 12 cycles 
whereas the avg latency of MOVAPS is 20 cycles.

the inner loop is unrolled 16 times so there are literally 4 cache lines 
worth of stores being stuffed into the store queue as fast as possible... 
and there's no coalescing for normal stores even on this modern CPU.

i'm certain i'll see the same thing on AMD... it's a very hard thing to do 
in hardware without the non-temporal hint.

-dean


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-11 Thread Arjan van de Ven
On Fri, 11 Jan 2008 09:02:46 -0800 (PST)
dean gaudet <[EMAIL PROTECTED]> wrote:

> > Bulk ops (string ops, etc.) will do full cacheline writes too, 
> > without filling in the cacheline.
> 
> on intel with fast strings enabled yes.  mind you intel gives hints in
> the documentation these operations don't respect coherence... and i
> asked about this when they posted their memory ordering paper but got
> no response.

I know there's an answer on the way; it just needs to get checked by
architects for cpus we shipped in like the last 10 years ;)

-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-11 Thread dean gaudet
On Fri, 11 Jan 2008, Ingo Molnar wrote:

> * Andi Kleen <[EMAIL PROTECTED]> wrote:
> 
> > Cached requires the cache line to be read first before you can write 
> > it.
> 
> nonsense, and you should know it. It is perfectly possible to construct 
> fully written cachelines, without reading the cacheline first. MOVDQ is 
> SSE1 so on basically in every CPU today - and it is 16 byte aligned and 
> can generate full cacheline writes, _without_ filling in the cacheline 
> first.

did you mean to write MOVNTPS above?


> Bulk ops (string ops, etc.) will do full cacheline writes too, 
> without filling in the cacheline.

on intel with fast strings enabled yes.  mind you intel gives hints in
the documentation these operations don't respect coherence... and i
asked about this when they posted their memory ordering paper but got no
response.

-dean
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-11 Thread Andi Kleen
> It is perfectly possible to construct 
> fully written cachelines, without reading the cacheline first. MOVDQ is 

If you write a aligned full 64 (or 128) byte area and even then you can 
have occassional reads which can be either painfully slow or even incorrect.

> but that's totally besides the point anyway. WC or WB accesses, if a 3D 
> app or a driver does high-freq change_page_attr() calls, it will _lose_ 
> the performance game:

Yes, high frequency as in doing it in fast paths is not a good idea, but 
reasonably low frequency (as in acceptable process exit latencies for 
example) are something to aim for. Right now with WBINVD and other problems 
it is too slow.

> > > in everything from the card and use it. In graphics, if you remap 
> > > anything on the fly and it's not a slowpath you've lost the 
> > > performance game even before you began it.
> > 
> > The typical case would be lots of user space DRI clients supplying 
> > their own buffers on the fly. There's not really a fixed pool in this 
> > case, but it all varies dynamically. In some scenarios that could 
> > happen quite often.
> 
> in what scenarios? Please give me in-tree examples of such high-freq 
> change_page_attr() cases, where the driver authors would like to call it 
> with high frequency but are unable to do it and see performance problems 
> due to the WBINVD.

Some workloads do regular mapping into the GART aperture, but it is
not too critical yet.

But it is not too widely used because it is too slow; but i've got
requests from various parties over the years for more efficient c_p_a().
It's a chicken'n'egg problem -- you're asking for users but the users
don't use it yet because it's too slow.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-11 Thread Andi Kleen
> Write-Combining can be very useful for devices that are behind a slow or 
> a high-latency transport, such as PCI, and which are mapped UnCached 

That is what I wrote! If you meant the same we must have been
spectacularly miscommunicating.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-11 Thread Andi Kleen
 It is perfectly possible to construct 
 fully written cachelines, without reading the cacheline first. MOVDQ is 

If you write a aligned full 64 (or 128) byte area and even then you can 
have occassional reads which can be either painfully slow or even incorrect.

 but that's totally besides the point anyway. WC or WB accesses, if a 3D 
 app or a driver does high-freq change_page_attr() calls, it will _lose_ 
 the performance game:

Yes, high frequency as in doing it in fast paths is not a good idea, but 
reasonably low frequency (as in acceptable process exit latencies for 
example) are something to aim for. Right now with WBINVD and other problems 
it is too slow.

   in everything from the card and use it. In graphics, if you remap 
   anything on the fly and it's not a slowpath you've lost the 
   performance game even before you began it.
  
  The typical case would be lots of user space DRI clients supplying 
  their own buffers on the fly. There's not really a fixed pool in this 
  case, but it all varies dynamically. In some scenarios that could 
  happen quite often.
 
 in what scenarios? Please give me in-tree examples of such high-freq 
 change_page_attr() cases, where the driver authors would like to call it 
 with high frequency but are unable to do it and see performance problems 
 due to the WBINVD.

Some workloads do regular mapping into the GART aperture, but it is
not too critical yet.

But it is not too widely used because it is too slow; but i've got
requests from various parties over the years for more efficient c_p_a().
It's a chicken'n'egg problem -- you're asking for users but the users
don't use it yet because it's too slow.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-11 Thread Andi Kleen
 Write-Combining can be very useful for devices that are behind a slow or 
 a high-latency transport, such as PCI, and which are mapped UnCached 

That is what I wrote! If you meant the same we must have been
spectacularly miscommunicating.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-11 Thread dean gaudet
On Fri, 11 Jan 2008, Ingo Molnar wrote:

 * Andi Kleen [EMAIL PROTECTED] wrote:
 
  Cached requires the cache line to be read first before you can write 
  it.
 
 nonsense, and you should know it. It is perfectly possible to construct 
 fully written cachelines, without reading the cacheline first. MOVDQ is 
 SSE1 so on basically in every CPU today - and it is 16 byte aligned and 
 can generate full cacheline writes, _without_ filling in the cacheline 
 first.

did you mean to write MOVNTPS above?


 Bulk ops (string ops, etc.) will do full cacheline writes too, 
 without filling in the cacheline.

on intel with fast strings enabled yes.  mind you intel gives hints in
the documentation these operations don't respect coherence... and i
asked about this when they posted their memory ordering paper but got no
response.

-dean
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-11 Thread dean gaudet
On Fri, 11 Jan 2008, dean gaudet wrote:

 On Fri, 11 Jan 2008, Ingo Molnar wrote:
 
  * Andi Kleen [EMAIL PROTECTED] wrote:
  
   Cached requires the cache line to be read first before you can write 
   it.
  
  nonsense, and you should know it. It is perfectly possible to construct 
  fully written cachelines, without reading the cacheline first. MOVDQ is 
  SSE1 so on basically in every CPU today - and it is 16 byte aligned and 
  can generate full cacheline writes, _without_ filling in the cacheline 
  first.
 
 did you mean to write MOVNTPS above?

btw in case you were thinking a normal store to WB rather than a 
non-temporal store... i ran a microbenchmark streaming stores to every 16 
bytes of a 16MiB region aligned to 4096 bytes on a xeon 53xx series CPU 
(4MiB L2) + 5000X northbridge and the avg latency of MOVNTPS is 12 cycles 
whereas the avg latency of MOVAPS is 20 cycles.

the inner loop is unrolled 16 times so there are literally 4 cache lines 
worth of stores being stuffed into the store queue as fast as possible... 
and there's no coalescing for normal stores even on this modern CPU.

i'm certain i'll see the same thing on AMD... it's a very hard thing to do 
in hardware without the non-temporal hint.

-dean


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-11 Thread Arjan van de Ven
On Fri, 11 Jan 2008 09:02:46 -0800 (PST)
dean gaudet [EMAIL PROTECTED] wrote:

  Bulk ops (string ops, etc.) will do full cacheline writes too, 
  without filling in the cacheline.
 
 on intel with fast strings enabled yes.  mind you intel gives hints in
 the documentation these operations don't respect coherence... and i
 asked about this when they posted their memory ordering paper but got
 no response.

I know there's an answer on the way; it just needs to get checked by
architects for cpus we shipped in like the last 10 years ;)

-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> > > I think you have it fundamentally backwards: the best for 
> > > performance is WB + cflush. What would WC offer for performance 
> > > that cflush cannot do?
> > 
> > Cached requires the cache line to be read first before you can write 
> > it.
> 
> nonsense, and you should know it. It is perfectly possible to 
> construct fully written cachelines, without reading the cacheline 
> first. MOVDQ is SSE1 so on basically in every CPU today - and it is 16 
> byte aligned and can generate full cacheline writes, _without_ filling 
> in the cacheline first. Bulk ops (string ops, etc.) will do full 
> cacheline writes too, without filling in the cacheline. Especially 
> with high performance 3D ops we do _NOT_ need any funky reads from 
> anywhere because 3D software can stream a lot of writes out: we 
> construct a full frame or a portion of a frame, or upload vertices or 
> shader scripts, textures, etc.
> 
> ( also, _even_ when there is a cache fill pending on for a partially
>   written cacheline, that might go on in parallel and it is not 
>   necessarily holding up the CPU unless it has an actual data dependency 
>   on that. )

btw., that's not to suggest that WC does not have its place - but it's 
not where you think it is.

Write-Combining can be very useful for devices that are behind a slow or 
a high-latency transport, such as PCI, and which are mapped UnCached 
today. Such as a networking card (or storage card) descriptor rings. New 
entries in the TX ring are typically constructed on the CPU side without 
knowing anything about their contents (and hence no need to read them, 
the drivers typically shadow the ring on the host side already, it is 
required for good UC performance today), so the new descriptors can be 
streamed out efficiently with WC, instead of generating lots of small UC 
accesses. But these descriptor rings have constant size, they have a 
fixed place in the aperture and are ioremap()ed once during device init 
and that's it. No high-frequency change_page_attributes() usage is done 
or desired here.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

* Andi Kleen <[EMAIL PROTECTED]> wrote:

> > > > but that's not too smart: why dont they use WB plus cflush 
> > > > instead?
> > > 
> > > Because they need to access it WC for performance.
> > 
> > I think you have it fundamentally backwards: the best for 
> > performance is WB + cflush. What would WC offer for performance that 
> > cflush cannot do?
> 
> Cached requires the cache line to be read first before you can write 
> it.

nonsense, and you should know it. It is perfectly possible to construct 
fully written cachelines, without reading the cacheline first. MOVDQ is 
SSE1 so on basically in every CPU today - and it is 16 byte aligned and 
can generate full cacheline writes, _without_ filling in the cacheline 
first. Bulk ops (string ops, etc.) will do full cacheline writes too, 
without filling in the cacheline. Especially with high performance 3D 
ops we do _NOT_ need any funky reads from anywhere because 3D software 
can stream a lot of writes out: we construct a full frame or a portion 
of a frame, or upload vertices or shader scripts, textures, etc.

( also, _even_ when there is a cache fill pending on for a partially
  written cacheline, that might go on in parallel and it is not 
  necessarily holding up the CPU unless it has an actual data dependency 
  on that. )

but that's totally besides the point anyway. WC or WB accesses, if a 3D 
app or a driver does high-freq change_page_attr() calls, it will _lose_ 
the performance game:

> > also, it's irrelevant to change_page_attr() call frequency. Just map 
> > in everything from the card and use it. In graphics, if you remap 
> > anything on the fly and it's not a slowpath you've lost the 
> > performance game even before you began it.
> 
> The typical case would be lots of user space DRI clients supplying 
> their own buffers on the fly. There's not really a fixed pool in this 
> case, but it all varies dynamically. In some scenarios that could 
> happen quite often.

in what scenarios? Please give me in-tree examples of such high-freq 
change_page_attr() cases, where the driver authors would like to call it 
with high frequency but are unable to do it and see performance problems 
due to the WBINVD.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Andi Kleen
On Thu, Jan 10, 2008 at 01:22:04PM +0100, Ingo Molnar wrote:
> 
> * Andi Kleen <[EMAIL PROTECTED]> wrote:
> 
> > > What is very real though are the hard limitations of MTRRs. So i'd 
> > > rather first like to see a clean PAT approach (which all other 
> > > modern OSs have already migrated to in the past 10 years)
> > 
> > That's mostly orthogonal. Don't know why you bring it up now?
> 
> because the PAT (Page Attribute Table support) patchset and the CPA 
> (change_page_attr()) patchset are are not orthogonal at all - as their 
> name already signals: because they change the implementation/effects of 
> the same interface(s). [just at different levels].
> 
> Both patchsets change how the kernel pagetable caching is handled. PAT 
> changes the kernel pte details and unshackles us from MTRR reliance and 
> thus solves real problems on real boxes:
> 
>  55 files changed, 1288 insertions(+), 237 deletions(-)
> 
> CPA moves change_page_attr() from invwb flushing to cflush flushing, for 
> a speedup/latency-win, plus a whole bunch of intermingled fixes and 
> improvements to page attribute modification:
> 
>  26 files changed, 882 insertions(+), 423 deletions(-)
> 
> so in terms of risk management, the "perfect patch order" is:
> 
>  - minimal_set of correctness fixes to the highlevel cpa code.

The two clear bug fix patches are refcount and flush order.

refcnt could be moved earlier; flush order would be quite painful
because there are quite a lot of patches dependent on it.

I could move ref count earlier, but I would prefer not to because
of the significant work it would be for me. 

Since it is all already bisectable I'm also not sure what debugging
advantages the reordering would be. 

It's already bisectable (I have not booted all immediate steps,
but several of them and I believe all compile) and in small pieces for that.

If it's really broken it would need to be reverted and then the ref count 
stuff would go too.  But I hope that won't be needed. 

And even losing the reference count fixes wouldn't be catastrophic -- in the
worst case you lose some minor performance because kernel mappings are 
unnecessarily split to 4K pages, but it's not a correctness fix. 

So while the reordering would be possible, it would imho not
bring very much advantages and I must admit I'm not too motivated
to do another time-consuming reordering for a relatively weak reason.  

If it was a 5 patch series I would probably not complain too much
about this, but it's 25+ patches.

>  - ( then any provably NOP cleanups to pave the way. )
> 
>  - then change the lowlevel pte code (PAT) to reduce/eliminate the need 
>to have runtime MTRR use

That's a completely different area really. Most of the PAT code
has nothing to do with clear_page_attr(). Please think that through
again after reading both patchkits.

As far as I can see making it wait for PAT would mean delaying
it for a longer time which would be a pity.


>  - then structural improvements/cleanups of the highlevel cpa code
> 
>  - then the cflush (optional) performance feature ontop of it.

There are actual a lot more performance features in there (self snoop,
minimal TLB flushing some other stuff). Most of it is related to 
that in fact.
 
> 
>  - then gigabyte-largepages/TLBs support [new CPU feature that further
>complicates page-attribute management]

That's already at the end.

> 
> All in an easy-to-revert fashion. We _will_ regress here, and this stuff 

That's already the case.

-Andi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

* Andi Kleen <[EMAIL PROTECTED]> wrote:

> > What is very real though are the hard limitations of MTRRs. So i'd 
> > rather first like to see a clean PAT approach (which all other 
> > modern OSs have already migrated to in the past 10 years)
> 
> That's mostly orthogonal. Don't know why you bring it up now?

because the PAT (Page Attribute Table support) patchset and the CPA 
(change_page_attr()) patchset are are not orthogonal at all - as their 
name already signals: because they change the implementation/effects of 
the same interface(s). [just at different levels].

Both patchsets change how the kernel pagetable caching is handled. PAT 
changes the kernel pte details and unshackles us from MTRR reliance and 
thus solves real problems on real boxes:

 55 files changed, 1288 insertions(+), 237 deletions(-)

CPA moves change_page_attr() from invwb flushing to cflush flushing, for 
a speedup/latency-win, plus a whole bunch of intermingled fixes and 
improvements to page attribute modification:

 26 files changed, 882 insertions(+), 423 deletions(-)

so in terms of risk management, the "perfect patch order" is:

 - minimal_set of correctness fixes to the highlevel cpa code.

 - ( then any provably NOP cleanups to pave the way. )

 - then change the lowlevel pte code (PAT) to reduce/eliminate the need 
   to have runtime MTRR use

 - then structural improvements/cleanups of the highlevel cpa code

 - then the cflush (optional) performance feature ontop of it.

 - then gigabyte-largepages/TLBs support [new CPU feature that further
   complicates page-attribute management]

All in an easy-to-revert fashion. We _will_ regress here, and this stuff 
is very hard to debug.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Andi Kleen
On Thu, Jan 10, 2008 at 11:57:26AM +0100, Ingo Molnar wrote:
> 
> > > > >   WBINVD isnt particular fast (takes a few msecs), but why is 
> > > > >   that a problem? Drivers dont do high-frequency ioremap-ing. 
> > > > >   It's typically only done at driver/device startup and that's 
> > > > >   it.
> > > > 
> > > > Actually graphics drivers can do higher frequency allocation of WC 
> > > > memory (with PAT) support.
> > > 
> > > but that's not too smart: why dont they use WB plus cflush instead?
> > 
> > Because they need to access it WC for performance.
> 
> I think you have it fundamentally backwards: the best for performance is 
> WB + cflush. What would WC offer for performance that cflush cannot do?

Cached requires the cache line to be read first before you can write it.

WC on the other hand does not allocate a cache line and just dumps
the data into a special write combining buffer.  It was invented originally 
because reads from AGP were incredibly slow.

And it's race less regarding the caching protocol (assuming you flush
the caches and TLBs correctly). 

Another typical problem is that if something
is uncached then you can't have it in any other caches because if that
cache eventually flushes it will corrupt the data.
That can happen with remapping apertures for example which remap data
behind the CPUs back.

CLFLUSH is really only a hint but it cannot be used if UC is needed
for correctness.

> also, it's irrelevant to change_page_attr() call frequency. Just map in 
> everything from the card and use it. In graphics, if you remap anything 
> on the fly and it's not a slowpath you've lost the performance game even 
> before you began it.

The typical case would be lots of user space DRI clients supplying
their own buffers on the fly. There's not really a fixed pool 
in this case, but it all varies dynamically. In some scenarios
that could happen quite often.

-Andi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Andi Kleen
On Thu, Jan 10, 2008 at 11:43:51AM +0100, Ingo Molnar wrote:
> > > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> > >   from doing a cflush-range instruction instead of a WBINVD. But why?
> > 
> > - WBINVD is a very nasty operation. I was talking to some CPU people 
> > and they really recommended to get rid of it as far as possible. 
> > Stopping the CPU for msecs is just wrong and there are apparently even 
> > some theoretical live lock situations. - It is not interruptible in 
> > earlier VT versions and messes up real time in the hypervisor. Some 
> > people were doing KVM on rt kernels and had latency spikes from that.
> 
> ok, i thought you might have had some higher rationale too. Frankly, 
> WBINVD wont ever go away from the x86 instruction set, and while i am 

We can (and should) eliminate it from normal operation at least. The one during
reboot has to be kept unfortunately.

> very symphathetic to rt and latency issues, it's still a risky change in 
> a historically fragile area of code.

It is, but the resulting code is significantly more efficient and does
various other things in better ways (the explicit list allows various
other optimizations). It's how c_p_a should have been written from day one. 


> What is very real though are the hard limitations of MTRRs. So i'd 
> rather first like to see a clean PAT approach (which all other modern 

That's mostly orthogonal. Don't know why you bring it up now?

Anyways more efficient c_p_a() makes PAT usage easier.

> structural cleanups and bugfixes you did as well, which would allow us 
> to phase out MTRR use (of the DRM drivers, etc.), and _then_ layer an 
> (optional) cflush approach basically as the final step. Right now cflush 
> is mixed inextractably into the CPA patchset. WBINVD latency is really 

You could always disable it. Just set full_flush = 1.
Anyways it would be possible to disable it more fully, but at least
the individual flush infrastructure is used by other code too and useful
for it.


> the last of our problems here.

I disagree on that.

> 
> > >   WBINVD isnt particular fast (takes a few msecs), but why is that a 
> > >   problem? Drivers dont do high-frequency ioremap-ing. It's 
> > >   typically
> > 
> > Actually graphics drivers can do higher frequency allocation of WC 
> > memory (with PAT) support.
> 
> but that is plain stupid. If it can be WC mapped, why not map it WB and 
> use cflush when updating memory? 

Not reliable due to the speculative out of order nature of x86 CPUs. 

See my other mail to Dave explaining one scenario where it can go 
wrong.

> > >   all'
> > 
> > Yes, that is why it took so long. But it's eventually needed to make 
> > PAT actually useful for once.
> 
> as i see it, the utility of PAT comes from going away from MTRRs, and 
> the resulting flexibility it gives system designers to shape memory and 
> IO BARs.

MTRRs will not go away, just used less.

One advantage of PAT is that you can set any page in memory efficiently
to WC or UC without requiring hacks like an hardware aperture. But to
make good use of that this operation needs to be reasonably efficient.
> 
> > >   WBINVD instruction wrong (as long as we do it on all cpus, etc.). 
> > >   But with this specific range flush we've got all the risks of 
> > >   accidentally not flushing _enough_. Especially if some boundary of 
> > >   a mapped area is imprecisely.
> > 
> > Not sure what you mean here? If someone doesn't pass in the correct 
> > area then not everything will be uncached in the first place. Using 
> > something as cached that should be uncached is usually noticed because 
> > it tends to be quite slow or corrupt data.
> 
> yes, 'quite slow or corrupt data' is what i meant:

That will already happen without my patchkit.  No change to the positive
or the negative.

> 
> > >   asking for trouble as it's very hard to debug and the operations 
>  ^^
> > >   here (ioremap) are typically done only once per system bootup. 
> > >   [...]
> > 
> > change_page_attr() is used for more than just ioremap (e.g. a common 
> > case is mapping memory into AGP apertures which still happens in many 
> > graphics drivers). I also expect it will be used even more in the 
> > future when PAT usage will become more wide spread.
> 
> i expect usage to not change in pattern. ioremap()ping on the fly is not 
> too smart.

Once PAT is in people will use it more I expect. And contrary to your
claims it is already used for more than just ioremap. Please grep
for it.

Actually in DEBUG_PAGEALLOC kernels it is currently heavily used,
but I eliminated that.

> 
> [ quoting this from the top: ]
> 
> > > finally managed to get the time to review your CPA patchset, and i 
> > > fundamentally agree with most of the detail changes done in it. But 
> > > here are a few structural high-level observations:
> > 
> > I have a few changes and will post a updated version shortly.
> 
> thanks. 

Re: CPA patchset

2008-01-10 Thread Ingo Molnar

* Andi Kleen <[EMAIL PROTECTED]> wrote:

> > > >   WBINVD isnt particular fast (takes a few msecs), but why is 
> > > >   that a problem? Drivers dont do high-frequency ioremap-ing. 
> > > >   It's typically only done at driver/device startup and that's 
> > > >   it.
> > > 
> > > Actually graphics drivers can do higher frequency allocation of WC 
> > > memory (with PAT) support.
> > 
> > but that's not too smart: why dont they use WB plus cflush instead?
> 
> Because they need to access it WC for performance.

I think you have it fundamentally backwards: the best for performance is 
WB + cflush. What would WC offer for performance that cflush cannot do?

also, it's irrelevant to change_page_attr() call frequency. Just map in 
everything from the card and use it. In graphics, if you remap anything 
on the fly and it's not a slowpath you've lost the performance game even 
before you began it.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Andi Kleen
On Thu, Jan 10, 2008 at 08:20:26PM +1000, Dave Airlie wrote:
> This is only possible as long as we know all the parts involved, for
> example on AMD we have problems with that
> over-eager prefetching so for drivers on AMD chipsets we have to do
> something else more than likely using change_page_attr.

What over eager prefetching? AFAIK AMD chips don't do any prefetching
that Intel chips don't do. In fact P4 does more aggressive speculation
than any AMD chip. 

One problem I see is that the CPU (both AMD and Intel) can follow random 
addresses even from speculative execution and then there is a window between
your CLFLUSH and eventual use where the data could be forced back
into cache behind your back. Then eventually the cache gets flushed
and overwrites the data in memory.

The only way to prevent that is to unmap or remap to uncached.

Might happen only rarely of course, but it can. We've had reports
over the years for bugs which were tracked down to such cases. Each
was a bitch to debug usually keeping people busy for a long time.

> 
> >
> > Well GARTs and those are widely used even without AGP busses.
> 
> Yes mainly we have used this scheme with Intel GARTs (also the only
> driver we have done with the new GPU system) by working closely with
> Intel engineers..

Have you discussed the case above?

Also BTW even if you can solve it for Intel GPUs there are others too
who really rely on WC/UC.

-Andi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

* Andi Kleen <[EMAIL PROTECTED]> wrote:

> There's also unfortunately still one outstanding bug (triggeredb y 
> some recent changes) on 32bit that makes the self test suite not pass 
> currently.

ok. I'd expect there to be a whole lot of bugs triggered in this area. 
We have to shape it in a way that puts the most useful bits to the head 
and the riskiest bits to the end, so that we can do reverts in a 
sensible and least destructive way - if the need arises.

> > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> >   from doing a cflush-range instruction instead of a WBINVD. But why?
> 
> - WBINVD is a very nasty operation. I was talking to some CPU people 
> and they really recommended to get rid of it as far as possible. 
> Stopping the CPU for msecs is just wrong and there are apparently even 
> some theoretical live lock situations. - It is not interruptible in 
> earlier VT versions and messes up real time in the hypervisor. Some 
> people were doing KVM on rt kernels and had latency spikes from that.

ok, i thought you might have had some higher rationale too. Frankly, 
WBINVD wont ever go away from the x86 instruction set, and while i am 
very symphathetic to rt and latency issues, it's still a risky change in 
a historically fragile area of code.

What is very real though are the hard limitations of MTRRs. So i'd 
rather first like to see a clean PAT approach (which all other modern 
OSs have already migrated to in the past 10 years), with all the 
structural cleanups and bugfixes you did as well, which would allow us 
to phase out MTRR use (of the DRM drivers, etc.), and _then_ layer an 
(optional) cflush approach basically as the final step. Right now cflush 
is mixed inextractably into the CPA patchset. WBINVD latency is really 
the last of our problems here.

> >   WBINVD isnt particular fast (takes a few msecs), but why is that a 
> >   problem? Drivers dont do high-frequency ioremap-ing. It's 
> >   typically
> 
> Actually graphics drivers can do higher frequency allocation of WC 
> memory (with PAT) support.

but that is plain stupid. If it can be WC mapped, why not map it WB and 
use cflush when updating memory? That gives cache-coherency on demand 
_and_ true read side caching - so it's faster as well. Modern GX cards 
can take that just fine. The best of both worlds. Unconditional WC is 
way overrated.

> > - secondly, obviously doing a 'flush some' instead of a 'flush all'
> >   operation is risky. There's not many ways we can get the 'flush 
> >   all'
> 
> Yes, that is why it took so long. But it's eventually needed to make 
> PAT actually useful for once.

as i see it, the utility of PAT comes from going away from MTRRs, and 
the resulting flexibility it gives system designers to shape memory and 
IO BARs.

> >   WBINVD instruction wrong (as long as we do it on all cpus, etc.). 
> >   But with this specific range flush we've got all the risks of 
> >   accidentally not flushing _enough_. Especially if some boundary of 
> >   a mapped area is imprecisely.
> 
> Not sure what you mean here? If someone doesn't pass in the correct 
> area then not everything will be uncached in the first place. Using 
> something as cached that should be uncached is usually noticed because 
> it tends to be quite slow or corrupt data.

yes, 'quite slow or corrupt data' is what i meant:

> >   asking for trouble as it's very hard to debug and the operations 
 ^^
> >   here (ioremap) are typically done only once per system bootup. 
> >   [...]
> 
> change_page_attr() is used for more than just ioremap (e.g. a common 
> case is mapping memory into AGP apertures which still happens in many 
> graphics drivers). I also expect it will be used even more in the 
> future when PAT usage will become more wide spread.

i expect usage to not change in pattern. ioremap()ping on the fly is not 
too smart.

[ quoting this from the top: ]

> > finally managed to get the time to review your CPA patchset, and i 
> > fundamentally agree with most of the detail changes done in it. But 
> > here are a few structural high-level observations:
> 
> I have a few changes and will post a updated version shortly.

thanks. I've test-merged the PAT patchset from Venki/Suresh to have a 
look, and to me the most logical way to layer these changes would be:

 - PAT
 - non-cflush bits of CPA
 - cflush bits of CPA
 - GBPAGES

hm?

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Dave Airlie
On Jan 10, 2008 7:55 PM, Andi Kleen <[EMAIL PROTECTED]> wrote:
> On Thu, Jan 10, 2008 at 07:44:03PM +1000, Dave Airlie wrote:
> > >
> > > finally managed to get the time to review your CPA patchset, and i
> > > fundamentally agree with most of the detail changes done in it. But here
> > > are a few structural high-level observations:
> > >
> > > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> > >   from doing a cflush-range instruction instead of a WBINVD. But why?
> > >   WBINVD isnt particular fast (takes a few msecs), but why is that a
> > >   problem? Drivers dont do high-frequency ioremap-ing. It's typically
> > >   only done at driver/device startup and that's it. Whether module load
> > >   time takes 1254 msecs instead of 1250 msecs is no big deal.
> >
> > read graphics drivers, even though I think we may avoid the whole path
>
> You mean avoid change_page_attr() ?

yes in some cases...

>
> > if we can and end up doing some of this in the drivers
> > when they know more about the situation so can avoid safeties..
>
> Please explain, but it sounds very dubious.

So on Intel 9xx hardware we know how to flush the whole pipe from chip
to chipset to RAM, so when in the Intel
 driver we can leave pages mapped cached, and just flush all stages
before having the GPU access them,

This is only possible as long as we know all the parts involved, for
example on AMD we have problems with that
over-eager prefetching so for drivers on AMD chipsets we have to do
something else more than likely using change_page_attr.

>
> Well GARTs and those are widely used even without AGP busses.

Yes mainly we have used this scheme with Intel GARTs (also the only
driver we have done with the new GPU system) by working closely with
Intel engineers..

Dave.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Andi Kleen
On Thu, Jan 10, 2008 at 11:04:43AM +0100, Ingo Molnar wrote:
> 
> * Andi Kleen <[EMAIL PROTECTED]> wrote:
> 
> > >   WBINVD isnt particular fast (takes a few msecs), but why is that a 
> > >   problem? Drivers dont do high-frequency ioremap-ing. It's 
> > >   typically only done at driver/device startup and that's it.
> > 
> > Actually graphics drivers can do higher frequency allocation of WC 
> > memory (with PAT) support.
> 
> but that's not too smart: why dont they use WB plus cflush instead? 

Because they need to access it WC for performance.

> ioremap() will always be slow and we dont care about stupid uses of it.

Uninterruptible msec latencies are too slow even for slow path
initialization code imho.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

* Andi Kleen <[EMAIL PROTECTED]> wrote:

> >   WBINVD isnt particular fast (takes a few msecs), but why is that a 
> >   problem? Drivers dont do high-frequency ioremap-ing. It's 
> >   typically only done at driver/device startup and that's it.
> 
> Actually graphics drivers can do higher frequency allocation of WC 
> memory (with PAT) support.

but that's not too smart: why dont they use WB plus cflush instead? 
ioremap() will always be slow and we dont care about stupid uses of it.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

* Dave Airlie <[EMAIL PROTECTED]> wrote:

> > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> >   from doing a cflush-range instruction instead of a WBINVD. But why?
> >   WBINVD isnt particular fast (takes a few msecs), but why is that a
> >   problem? Drivers dont do high-frequency ioremap-ing. It's typically
> >   only done at driver/device startup and that's it. Whether module load
> >   time takes 1254 msecs instead of 1250 msecs is no big deal.
> 
> read graphics drivers, even though I think we may avoid the whole path 
> if we can and end up doing some of this in the drivers when they know 
> more about the situation so can avoid safeties..

by 'read graphics drivers' do you mean direct framebuffer access? In any 
case, a driver (or even userspace) can use cflush just fine, it's an 
unprivileged instruction. But this is about the ioremap() implementation 
using cflush instead of WBINVD, and that is a slowpath and is up to the 
kernel anyway - i'm not aware of many high-frequency ioremap() users, so 
robustness concerns control the policy here. So could you please explain 
your point in more detail?

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Andi Kleen
On Thu, Jan 10, 2008 at 07:44:03PM +1000, Dave Airlie wrote:
> >
> > finally managed to get the time to review your CPA patchset, and i
> > fundamentally agree with most of the detail changes done in it. But here
> > are a few structural high-level observations:
> >
> > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> >   from doing a cflush-range instruction instead of a WBINVD. But why?
> >   WBINVD isnt particular fast (takes a few msecs), but why is that a
> >   problem? Drivers dont do high-frequency ioremap-ing. It's typically
> >   only done at driver/device startup and that's it. Whether module load
> >   time takes 1254 msecs instead of 1250 msecs is no big deal.
> 
> read graphics drivers, even though I think we may avoid the whole path

You mean avoid change_page_attr() ? 

> if we can and end up doing some of this in the drivers
> when they know more about the situation so can avoid safeties..

Please explain, but it sounds very dubious.

> but I still see this being used in AGP a fair bit at some point on
> some drivers..

Well GARTs and those are widely used even without AGP busses.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Andi Kleen
On Thu, Jan 10, 2008 at 10:31:26AM +0100, Ingo Molnar wrote:
> 
> Andi,
> 
> finally managed to get the time to review your CPA patchset, and i 
> fundamentally agree with most of the detail changes done in it. But here 
> are a few structural high-level observations:

I have a few changes and will post a updated version shortly.
There's also unfortunately still one outstanding bug (triggeredb y
some recent changes) on 32bit that makes the self test suite not pass
currently.

> - firstly, there's no rationale given. So we'll change ioremap()/etc.
>   from doing a cflush-range instruction instead of a WBINVD. But why?

- WBINVD is a very nasty operation. I was talking to some CPU people
and they really recommended to get rid of it as far as possible.
Stopping the CPU for msecs is just wrong and there are apparently
even some theoretical live lock situations.
- It is not interruptible in earlier VT versions and messes up
real time in the hypervisor. Some people were doing KVM on rt 
kernels and had latency spikes from that.

I'll add that to the changelog.

>   WBINVD isnt particular fast (takes a few msecs), but why is that a
>   problem? Drivers dont do high-frequency ioremap-ing. It's typically

Actually graphics drivers can do higher frequency allocation of WC
memory (with PAT) support.

>   only done at driver/device startup and that's it. Whether module load
>   time takes 1254 msecs instead of 1250 msecs is no big deal.
> 
> - secondly, obviously doing a 'flush some' instead of a 'flush all' 
>   operation is risky. There's not many ways we can get the 'flush all'

Yes, that is why it took so long. But it's eventually needed
to make PAT actually useful for once.

>   WBINVD instruction wrong (as long as we do it on all cpus, etc.). But 
>   with this specific range flush we've got all the risks of accidentally
>   not flushing _enough_. Especially if some boundary of a mapped area is 
>   imprecisely.

Not sure what you mean here? If someone doesn't pass in the correct
area then not everything will be uncached in the first place. 
Using something as cached that should be uncached is usually
noticed because it tends to be quite slow or corrupt data.

I don't think you have thought that concern through.

>   asking for trouble as it's very hard to debug and the operations here
>   (ioremap) are typically done only once per system bootup. So we'll add

change_page_attr() is used for more than just ioremap (e.g. a common
case is mapping memory into AGP apertures which still happens in
many graphics drivers). I also expect it will be used even more
in the future when PAT usage will become more wide spread.

> - the bugfixes and cleanups to pgattr.c i like very much - but shouldnt
>   they come first in the series, instead of being mixed into it?

It all depends on each other; I don't think any of the changes can
be easily reordered.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Dave Airlie
>
> finally managed to get the time to review your CPA patchset, and i
> fundamentally agree with most of the detail changes done in it. But here
> are a few structural high-level observations:
>
> - firstly, there's no rationale given. So we'll change ioremap()/etc.
>   from doing a cflush-range instruction instead of a WBINVD. But why?
>   WBINVD isnt particular fast (takes a few msecs), but why is that a
>   problem? Drivers dont do high-frequency ioremap-ing. It's typically
>   only done at driver/device startup and that's it. Whether module load
>   time takes 1254 msecs instead of 1250 msecs is no big deal.

read graphics drivers, even though I think we may avoid the whole path
if we can and end up doing some of this in the drivers
when they know more about the situation so can avoid safeties..

but I still see this being used in AGP a fair bit at some point on
some drivers..

Dave.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

Andi,

finally managed to get the time to review your CPA patchset, and i 
fundamentally agree with most of the detail changes done in it. But here 
are a few structural high-level observations:

- firstly, there's no rationale given. So we'll change ioremap()/etc.
  from doing a cflush-range instruction instead of a WBINVD. But why?
  WBINVD isnt particular fast (takes a few msecs), but why is that a
  problem? Drivers dont do high-frequency ioremap-ing. It's typically
  only done at driver/device startup and that's it. Whether module load
  time takes 1254 msecs instead of 1250 msecs is no big deal.

- secondly, obviously doing a 'flush some' instead of a 'flush all' 
  operation is risky. There's not many ways we can get the 'flush all'
  WBINVD instruction wrong (as long as we do it on all cpus, etc.). But 
  with this specific range flush we've got all the risks of accidentally
  not flushing _enough_. Especially if some boundary of a mapped area is 
  imprecisely. Accidentally leaving aliases around in the cache is
  asking for trouble as it's very hard to debug and the operations here
  (ioremap) are typically done only once per system bootup. So we'll add
  _fundamental_ fragility to an already historically fragile and 
  bug-ridden piece of code. That does not sound too smart to me and you 
  do not analyze these concerns in your submission.

- the bugfixes and cleanups to pgattr.c i like very much - but shouldnt
  they come first in the series, instead of being mixed into it?

so i'd suggest for you to reshape this ontop of the PAT patchset done by 
Venki and Suresh, with the cflush stuff kept at the very end of the 
series, and even that should be boot and .config configurable, with 
default off - at least initially. The cflush performance advantages seem 
dubious at best, and they bring in very real regression dangers.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

Andi,

finally managed to get the time to review your CPA patchset, and i 
fundamentally agree with most of the detail changes done in it. But here 
are a few structural high-level observations:

- firstly, there's no rationale given. So we'll change ioremap()/etc.
  from doing a cflush-range instruction instead of a WBINVD. But why?
  WBINVD isnt particular fast (takes a few msecs), but why is that a
  problem? Drivers dont do high-frequency ioremap-ing. It's typically
  only done at driver/device startup and that's it. Whether module load
  time takes 1254 msecs instead of 1250 msecs is no big deal.

- secondly, obviously doing a 'flush some' instead of a 'flush all' 
  operation is risky. There's not many ways we can get the 'flush all'
  WBINVD instruction wrong (as long as we do it on all cpus, etc.). But 
  with this specific range flush we've got all the risks of accidentally
  not flushing _enough_. Especially if some boundary of a mapped area is 
  imprecisely. Accidentally leaving aliases around in the cache is
  asking for trouble as it's very hard to debug and the operations here
  (ioremap) are typically done only once per system bootup. So we'll add
  _fundamental_ fragility to an already historically fragile and 
  bug-ridden piece of code. That does not sound too smart to me and you 
  do not analyze these concerns in your submission.

- the bugfixes and cleanups to pgattr.c i like very much - but shouldnt
  they come first in the series, instead of being mixed into it?

so i'd suggest for you to reshape this ontop of the PAT patchset done by 
Venki and Suresh, with the cflush stuff kept at the very end of the 
series, and even that should be boot and .config configurable, with 
default off - at least initially. The cflush performance advantages seem 
dubious at best, and they bring in very real regression dangers.

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Dave Airlie

 finally managed to get the time to review your CPA patchset, and i
 fundamentally agree with most of the detail changes done in it. But here
 are a few structural high-level observations:

 - firstly, there's no rationale given. So we'll change ioremap()/etc.
   from doing a cflush-range instruction instead of a WBINVD. But why?
   WBINVD isnt particular fast (takes a few msecs), but why is that a
   problem? Drivers dont do high-frequency ioremap-ing. It's typically
   only done at driver/device startup and that's it. Whether module load
   time takes 1254 msecs instead of 1250 msecs is no big deal.

read graphics drivers, even though I think we may avoid the whole path
if we can and end up doing some of this in the drivers
when they know more about the situation so can avoid safeties..

but I still see this being used in AGP a fair bit at some point on
some drivers..

Dave.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Andi Kleen
On Thu, Jan 10, 2008 at 10:31:26AM +0100, Ingo Molnar wrote:
 
 Andi,
 
 finally managed to get the time to review your CPA patchset, and i 
 fundamentally agree with most of the detail changes done in it. But here 
 are a few structural high-level observations:

I have a few changes and will post a updated version shortly.
There's also unfortunately still one outstanding bug (triggeredb y
some recent changes) on 32bit that makes the self test suite not pass
currently.

 - firstly, there's no rationale given. So we'll change ioremap()/etc.
   from doing a cflush-range instruction instead of a WBINVD. But why?

- WBINVD is a very nasty operation. I was talking to some CPU people
and they really recommended to get rid of it as far as possible.
Stopping the CPU for msecs is just wrong and there are apparently
even some theoretical live lock situations.
- It is not interruptible in earlier VT versions and messes up
real time in the hypervisor. Some people were doing KVM on rt 
kernels and had latency spikes from that.

I'll add that to the changelog.

   WBINVD isnt particular fast (takes a few msecs), but why is that a
   problem? Drivers dont do high-frequency ioremap-ing. It's typically

Actually graphics drivers can do higher frequency allocation of WC
memory (with PAT) support.

   only done at driver/device startup and that's it. Whether module load
   time takes 1254 msecs instead of 1250 msecs is no big deal.
 
 - secondly, obviously doing a 'flush some' instead of a 'flush all' 
   operation is risky. There's not many ways we can get the 'flush all'

Yes, that is why it took so long. But it's eventually needed
to make PAT actually useful for once.

   WBINVD instruction wrong (as long as we do it on all cpus, etc.). But 
   with this specific range flush we've got all the risks of accidentally
   not flushing _enough_. Especially if some boundary of a mapped area is 
   imprecisely.

Not sure what you mean here? If someone doesn't pass in the correct
area then not everything will be uncached in the first place. 
Using something as cached that should be uncached is usually
noticed because it tends to be quite slow or corrupt data.

I don't think you have thought that concern through.

   asking for trouble as it's very hard to debug and the operations here
   (ioremap) are typically done only once per system bootup. So we'll add

change_page_attr() is used for more than just ioremap (e.g. a common
case is mapping memory into AGP apertures which still happens in
many graphics drivers). I also expect it will be used even more
in the future when PAT usage will become more wide spread.

 - the bugfixes and cleanups to pgattr.c i like very much - but shouldnt
   they come first in the series, instead of being mixed into it?

It all depends on each other; I don't think any of the changes can
be easily reordered.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Andi Kleen
On Thu, Jan 10, 2008 at 07:44:03PM +1000, Dave Airlie wrote:
 
  finally managed to get the time to review your CPA patchset, and i
  fundamentally agree with most of the detail changes done in it. But here
  are a few structural high-level observations:
 
  - firstly, there's no rationale given. So we'll change ioremap()/etc.
from doing a cflush-range instruction instead of a WBINVD. But why?
WBINVD isnt particular fast (takes a few msecs), but why is that a
problem? Drivers dont do high-frequency ioremap-ing. It's typically
only done at driver/device startup and that's it. Whether module load
time takes 1254 msecs instead of 1250 msecs is no big deal.
 
 read graphics drivers, even though I think we may avoid the whole path

You mean avoid change_page_attr() ? 

 if we can and end up doing some of this in the drivers
 when they know more about the situation so can avoid safeties..

Please explain, but it sounds very dubious.

 but I still see this being used in AGP a fair bit at some point on
 some drivers..

Well GARTs and those are widely used even without AGP busses.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

* Dave Airlie [EMAIL PROTECTED] wrote:

  - firstly, there's no rationale given. So we'll change ioremap()/etc.
from doing a cflush-range instruction instead of a WBINVD. But why?
WBINVD isnt particular fast (takes a few msecs), but why is that a
problem? Drivers dont do high-frequency ioremap-ing. It's typically
only done at driver/device startup and that's it. Whether module load
time takes 1254 msecs instead of 1250 msecs is no big deal.
 
 read graphics drivers, even though I think we may avoid the whole path 
 if we can and end up doing some of this in the drivers when they know 
 more about the situation so can avoid safeties..

by 'read graphics drivers' do you mean direct framebuffer access? In any 
case, a driver (or even userspace) can use cflush just fine, it's an 
unprivileged instruction. But this is about the ioremap() implementation 
using cflush instead of WBINVD, and that is a slowpath and is up to the 
kernel anyway - i'm not aware of many high-frequency ioremap() users, so 
robustness concerns control the policy here. So could you please explain 
your point in more detail?

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

* Andi Kleen [EMAIL PROTECTED] wrote:

WBINVD isnt particular fast (takes a few msecs), but why is that a 
problem? Drivers dont do high-frequency ioremap-ing. It's 
typically only done at driver/device startup and that's it.
 
 Actually graphics drivers can do higher frequency allocation of WC 
 memory (with PAT) support.

but that's not too smart: why dont they use WB plus cflush instead? 
ioremap() will always be slow and we dont care about stupid uses of it.

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Andi Kleen
On Thu, Jan 10, 2008 at 11:04:43AM +0100, Ingo Molnar wrote:
 
 * Andi Kleen [EMAIL PROTECTED] wrote:
 
 WBINVD isnt particular fast (takes a few msecs), but why is that a 
 problem? Drivers dont do high-frequency ioremap-ing. It's 
 typically only done at driver/device startup and that's it.
  
  Actually graphics drivers can do higher frequency allocation of WC 
  memory (with PAT) support.
 
 but that's not too smart: why dont they use WB plus cflush instead? 

Because they need to access it WC for performance.

 ioremap() will always be slow and we dont care about stupid uses of it.

Uninterruptible msec latencies are too slow even for slow path
initialization code imho.

-Andi
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Dave Airlie
On Jan 10, 2008 7:55 PM, Andi Kleen [EMAIL PROTECTED] wrote:
 On Thu, Jan 10, 2008 at 07:44:03PM +1000, Dave Airlie wrote:
  
   finally managed to get the time to review your CPA patchset, and i
   fundamentally agree with most of the detail changes done in it. But here
   are a few structural high-level observations:
  
   - firstly, there's no rationale given. So we'll change ioremap()/etc.
 from doing a cflush-range instruction instead of a WBINVD. But why?
 WBINVD isnt particular fast (takes a few msecs), but why is that a
 problem? Drivers dont do high-frequency ioremap-ing. It's typically
 only done at driver/device startup and that's it. Whether module load
 time takes 1254 msecs instead of 1250 msecs is no big deal.
 
  read graphics drivers, even though I think we may avoid the whole path

 You mean avoid change_page_attr() ?

yes in some cases...


  if we can and end up doing some of this in the drivers
  when they know more about the situation so can avoid safeties..

 Please explain, but it sounds very dubious.

So on Intel 9xx hardware we know how to flush the whole pipe from chip
to chipset to RAM, so when in the Intel
 driver we can leave pages mapped cached, and just flush all stages
before having the GPU access them,

This is only possible as long as we know all the parts involved, for
example on AMD we have problems with that
over-eager prefetching so for drivers on AMD chipsets we have to do
something else more than likely using change_page_attr.


 Well GARTs and those are widely used even without AGP busses.

Yes mainly we have used this scheme with Intel GARTs (also the only
driver we have done with the new GPU system) by working closely with
Intel engineers..

Dave.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

* Andi Kleen [EMAIL PROTECTED] wrote:

 There's also unfortunately still one outstanding bug (triggeredb y 
 some recent changes) on 32bit that makes the self test suite not pass 
 currently.

ok. I'd expect there to be a whole lot of bugs triggered in this area. 
We have to shape it in a way that puts the most useful bits to the head 
and the riskiest bits to the end, so that we can do reverts in a 
sensible and least destructive way - if the need arises.

  - firstly, there's no rationale given. So we'll change ioremap()/etc.
from doing a cflush-range instruction instead of a WBINVD. But why?
 
 - WBINVD is a very nasty operation. I was talking to some CPU people 
 and they really recommended to get rid of it as far as possible. 
 Stopping the CPU for msecs is just wrong and there are apparently even 
 some theoretical live lock situations. - It is not interruptible in 
 earlier VT versions and messes up real time in the hypervisor. Some 
 people were doing KVM on rt kernels and had latency spikes from that.

ok, i thought you might have had some higher rationale too. Frankly, 
WBINVD wont ever go away from the x86 instruction set, and while i am 
very symphathetic to rt and latency issues, it's still a risky change in 
a historically fragile area of code.

What is very real though are the hard limitations of MTRRs. So i'd 
rather first like to see a clean PAT approach (which all other modern 
OSs have already migrated to in the past 10 years), with all the 
structural cleanups and bugfixes you did as well, which would allow us 
to phase out MTRR use (of the DRM drivers, etc.), and _then_ layer an 
(optional) cflush approach basically as the final step. Right now cflush 
is mixed inextractably into the CPA patchset. WBINVD latency is really 
the last of our problems here.

WBINVD isnt particular fast (takes a few msecs), but why is that a 
problem? Drivers dont do high-frequency ioremap-ing. It's 
typically
 
 Actually graphics drivers can do higher frequency allocation of WC 
 memory (with PAT) support.

but that is plain stupid. If it can be WC mapped, why not map it WB and 
use cflush when updating memory? That gives cache-coherency on demand 
_and_ true read side caching - so it's faster as well. Modern GX cards 
can take that just fine. The best of both worlds. Unconditional WC is 
way overrated.

  - secondly, obviously doing a 'flush some' instead of a 'flush all'
operation is risky. There's not many ways we can get the 'flush 
all'
 
 Yes, that is why it took so long. But it's eventually needed to make 
 PAT actually useful for once.

as i see it, the utility of PAT comes from going away from MTRRs, and 
the resulting flexibility it gives system designers to shape memory and 
IO BARs.

WBINVD instruction wrong (as long as we do it on all cpus, etc.). 
But with this specific range flush we've got all the risks of 
accidentally not flushing _enough_. Especially if some boundary of 
a mapped area is imprecisely.
 
 Not sure what you mean here? If someone doesn't pass in the correct 
 area then not everything will be uncached in the first place. Using 
 something as cached that should be uncached is usually noticed because 
 it tends to be quite slow or corrupt data.

yes, 'quite slow or corrupt data' is what i meant:

asking for trouble as it's very hard to debug and the operations 
 ^^
here (ioremap) are typically done only once per system bootup. 
[...]
 
 change_page_attr() is used for more than just ioremap (e.g. a common 
 case is mapping memory into AGP apertures which still happens in many 
 graphics drivers). I also expect it will be used even more in the 
 future when PAT usage will become more wide spread.

i expect usage to not change in pattern. ioremap()ping on the fly is not 
too smart.

[ quoting this from the top: ]

  finally managed to get the time to review your CPA patchset, and i 
  fundamentally agree with most of the detail changes done in it. But 
  here are a few structural high-level observations:
 
 I have a few changes and will post a updated version shortly.

thanks. I've test-merged the PAT patchset from Venki/Suresh to have a 
look, and to me the most logical way to layer these changes would be:

 - PAT
 - non-cflush bits of CPA
 - cflush bits of CPA
 - GBPAGES

hm?

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Andi Kleen
On Thu, Jan 10, 2008 at 08:20:26PM +1000, Dave Airlie wrote:
 This is only possible as long as we know all the parts involved, for
 example on AMD we have problems with that
 over-eager prefetching so for drivers on AMD chipsets we have to do
 something else more than likely using change_page_attr.

What over eager prefetching? AFAIK AMD chips don't do any prefetching
that Intel chips don't do. In fact P4 does more aggressive speculation
than any AMD chip. 

One problem I see is that the CPU (both AMD and Intel) can follow random 
addresses even from speculative execution and then there is a window between
your CLFLUSH and eventual use where the data could be forced back
into cache behind your back. Then eventually the cache gets flushed
and overwrites the data in memory.

The only way to prevent that is to unmap or remap to uncached.

Might happen only rarely of course, but it can. We've had reports
over the years for bugs which were tracked down to such cases. Each
was a bitch to debug usually keeping people busy for a long time.

 
 
  Well GARTs and those are widely used even without AGP busses.
 
 Yes mainly we have used this scheme with Intel GARTs (also the only
 driver we have done with the new GPU system) by working closely with
 Intel engineers..

Have you discussed the case above?

Also BTW even if you can solve it for Intel GPUs there are others too
who really rely on WC/UC.

-Andi

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

* Andi Kleen [EMAIL PROTECTED] wrote:

  WBINVD isnt particular fast (takes a few msecs), but why is 
  that a problem? Drivers dont do high-frequency ioremap-ing. 
  It's typically only done at driver/device startup and that's 
  it.
   
   Actually graphics drivers can do higher frequency allocation of WC 
   memory (with PAT) support.
  
  but that's not too smart: why dont they use WB plus cflush instead?
 
 Because they need to access it WC for performance.

I think you have it fundamentally backwards: the best for performance is 
WB + cflush. What would WC offer for performance that cflush cannot do?

also, it's irrelevant to change_page_attr() call frequency. Just map in 
everything from the card and use it. In graphics, if you remap anything 
on the fly and it's not a slowpath you've lost the performance game even 
before you began it.

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Andi Kleen
On Thu, Jan 10, 2008 at 11:43:51AM +0100, Ingo Molnar wrote:
   - firstly, there's no rationale given. So we'll change ioremap()/etc.
 from doing a cflush-range instruction instead of a WBINVD. But why?
  
  - WBINVD is a very nasty operation. I was talking to some CPU people 
  and they really recommended to get rid of it as far as possible. 
  Stopping the CPU for msecs is just wrong and there are apparently even 
  some theoretical live lock situations. - It is not interruptible in 
  earlier VT versions and messes up real time in the hypervisor. Some 
  people were doing KVM on rt kernels and had latency spikes from that.
 
 ok, i thought you might have had some higher rationale too. Frankly, 
 WBINVD wont ever go away from the x86 instruction set, and while i am 

We can (and should) eliminate it from normal operation at least. The one during
reboot has to be kept unfortunately.

 very symphathetic to rt and latency issues, it's still a risky change in 
 a historically fragile area of code.

It is, but the resulting code is significantly more efficient and does
various other things in better ways (the explicit list allows various
other optimizations). It's how c_p_a should have been written from day one. 


 What is very real though are the hard limitations of MTRRs. So i'd 
 rather first like to see a clean PAT approach (which all other modern 

That's mostly orthogonal. Don't know why you bring it up now?

Anyways more efficient c_p_a() makes PAT usage easier.

 structural cleanups and bugfixes you did as well, which would allow us 
 to phase out MTRR use (of the DRM drivers, etc.), and _then_ layer an 
 (optional) cflush approach basically as the final step. Right now cflush 
 is mixed inextractably into the CPA patchset. WBINVD latency is really 

You could always disable it. Just set full_flush = 1.
Anyways it would be possible to disable it more fully, but at least
the individual flush infrastructure is used by other code too and useful
for it.


 the last of our problems here.

I disagree on that.

 
 WBINVD isnt particular fast (takes a few msecs), but why is that a 
 problem? Drivers dont do high-frequency ioremap-ing. It's 
 typically
  
  Actually graphics drivers can do higher frequency allocation of WC 
  memory (with PAT) support.
 
 but that is plain stupid. If it can be WC mapped, why not map it WB and 
 use cflush when updating memory? 

Not reliable due to the speculative out of order nature of x86 CPUs. 

See my other mail to Dave explaining one scenario where it can go 
wrong.

 all'
  
  Yes, that is why it took so long. But it's eventually needed to make 
  PAT actually useful for once.
 
 as i see it, the utility of PAT comes from going away from MTRRs, and 
 the resulting flexibility it gives system designers to shape memory and 
 IO BARs.

MTRRs will not go away, just used less.

One advantage of PAT is that you can set any page in memory efficiently
to WC or UC without requiring hacks like an hardware aperture. But to
make good use of that this operation needs to be reasonably efficient.
 
 WBINVD instruction wrong (as long as we do it on all cpus, etc.). 
 But with this specific range flush we've got all the risks of 
 accidentally not flushing _enough_. Especially if some boundary of 
 a mapped area is imprecisely.
  
  Not sure what you mean here? If someone doesn't pass in the correct 
  area then not everything will be uncached in the first place. Using 
  something as cached that should be uncached is usually noticed because 
  it tends to be quite slow or corrupt data.
 
 yes, 'quite slow or corrupt data' is what i meant:

That will already happen without my patchkit.  No change to the positive
or the negative.

 
 asking for trouble as it's very hard to debug and the operations 
  ^^
 here (ioremap) are typically done only once per system bootup. 
 [...]
  
  change_page_attr() is used for more than just ioremap (e.g. a common 
  case is mapping memory into AGP apertures which still happens in many 
  graphics drivers). I also expect it will be used even more in the 
  future when PAT usage will become more wide spread.
 
 i expect usage to not change in pattern. ioremap()ping on the fly is not 
 too smart.

Once PAT is in people will use it more I expect. And contrary to your
claims it is already used for more than just ioremap. Please grep
for it.

Actually in DEBUG_PAGEALLOC kernels it is currently heavily used,
but I eliminated that.

 
 [ quoting this from the top: ]
 
   finally managed to get the time to review your CPA patchset, and i 
   fundamentally agree with most of the detail changes done in it. But 
   here are a few structural high-level observations:
  
  I have a few changes and will post a updated version shortly.
 
 thanks. I've test-merged the PAT patchset from Venki/Suresh to have a 
 look, and to me the most logical way to layer these changes would 

Re: CPA patchset

2008-01-10 Thread Andi Kleen
On Thu, Jan 10, 2008 at 11:57:26AM +0100, Ingo Molnar wrote:
 
   WBINVD isnt particular fast (takes a few msecs), but why is 
   that a problem? Drivers dont do high-frequency ioremap-ing. 
   It's typically only done at driver/device startup and that's 
   it.

Actually graphics drivers can do higher frequency allocation of WC 
memory (with PAT) support.
   
   but that's not too smart: why dont they use WB plus cflush instead?
  
  Because they need to access it WC for performance.
 
 I think you have it fundamentally backwards: the best for performance is 
 WB + cflush. What would WC offer for performance that cflush cannot do?

Cached requires the cache line to be read first before you can write it.

WC on the other hand does not allocate a cache line and just dumps
the data into a special write combining buffer.  It was invented originally 
because reads from AGP were incredibly slow.

And it's race less regarding the caching protocol (assuming you flush
the caches and TLBs correctly). 

Another typical problem is that if something
is uncached then you can't have it in any other caches because if that
cache eventually flushes it will corrupt the data.
That can happen with remapping apertures for example which remap data
behind the CPUs back.

CLFLUSH is really only a hint but it cannot be used if UC is needed
for correctness.

 also, it's irrelevant to change_page_attr() call frequency. Just map in 
 everything from the card and use it. In graphics, if you remap anything 
 on the fly and it's not a slowpath you've lost the performance game even 
 before you began it.

The typical case would be lots of user space DRI clients supplying
their own buffers on the fly. There's not really a fixed pool 
in this case, but it all varies dynamically. In some scenarios
that could happen quite often.

-Andi

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

* Andi Kleen [EMAIL PROTECTED] wrote:

  What is very real though are the hard limitations of MTRRs. So i'd 
  rather first like to see a clean PAT approach (which all other 
  modern OSs have already migrated to in the past 10 years)
 
 That's mostly orthogonal. Don't know why you bring it up now?

because the PAT (Page Attribute Table support) patchset and the CPA 
(change_page_attr()) patchset are are not orthogonal at all - as their 
name already signals: because they change the implementation/effects of 
the same interface(s). [just at different levels].

Both patchsets change how the kernel pagetable caching is handled. PAT 
changes the kernel pte details and unshackles us from MTRR reliance and 
thus solves real problems on real boxes:

 55 files changed, 1288 insertions(+), 237 deletions(-)

CPA moves change_page_attr() from invwb flushing to cflush flushing, for 
a speedup/latency-win, plus a whole bunch of intermingled fixes and 
improvements to page attribute modification:

 26 files changed, 882 insertions(+), 423 deletions(-)

so in terms of risk management, the perfect patch order is:

 - minimal_set of correctness fixes to the highlevel cpa code.

 - ( then any provably NOP cleanups to pave the way. )

 - then change the lowlevel pte code (PAT) to reduce/eliminate the need 
   to have runtime MTRR use

 - then structural improvements/cleanups of the highlevel cpa code

 - then the cflush (optional) performance feature ontop of it.

 - then gigabyte-largepages/TLBs support [new CPU feature that further
   complicates page-attribute management]

All in an easy-to-revert fashion. We _will_ regress here, and this stuff 
is very hard to debug.

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPA patchset

2008-01-10 Thread Ingo Molnar

* Andi Kleen [EMAIL PROTECTED] wrote:

but that's not too smart: why dont they use WB plus cflush 
instead?
   
   Because they need to access it WC for performance.
  
  I think you have it fundamentally backwards: the best for 
  performance is WB + cflush. What would WC offer for performance that 
  cflush cannot do?
 
 Cached requires the cache line to be read first before you can write 
 it.

nonsense, and you should know it. It is perfectly possible to construct 
fully written cachelines, without reading the cacheline first. MOVDQ is 
SSE1 so on basically in every CPU today - and it is 16 byte aligned and 
can generate full cacheline writes, _without_ filling in the cacheline 
first. Bulk ops (string ops, etc.) will do full cacheline writes too, 
without filling in the cacheline. Especially with high performance 3D 
ops we do _NOT_ need any funky reads from anywhere because 3D software 
can stream a lot of writes out: we construct a full frame or a portion 
of a frame, or upload vertices or shader scripts, textures, etc.

( also, _even_ when there is a cache fill pending on for a partially
  written cacheline, that might go on in parallel and it is not 
  necessarily holding up the CPU unless it has an actual data dependency 
  on that. )

but that's totally besides the point anyway. WC or WB accesses, if a 3D 
app or a driver does high-freq change_page_attr() calls, it will _lose_ 
the performance game:

  also, it's irrelevant to change_page_attr() call frequency. Just map 
  in everything from the card and use it. In graphics, if you remap 
  anything on the fly and it's not a slowpath you've lost the 
  performance game even before you began it.
 
 The typical case would be lots of user space DRI clients supplying 
 their own buffers on the fly. There's not really a fixed pool in this 
 case, but it all varies dynamically. In some scenarios that could 
 happen quite often.

in what scenarios? Please give me in-tree examples of such high-freq 
change_page_attr() cases, where the driver authors would like to call it 
with high frequency but are unable to do it and see performance problems 
due to the WBINVD.

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/