RE: RE: RE: RE: Memory performance / Cache problem

2009-10-14 Thread Woodruff, Richard

> From: e...@gmx.de [mailto:e...@gmx.de]
> Sent: Wednesday, October 14, 2009 12:23 PM
> To: Woodruff, Richard; Premi, Sanjeev; linux-omap@vger.kernel.org
> Subject: Re: RE: RE: RE: Memory performance / Cache problem

> > Yes, aligned buffers can make a difference.  But probably more so for small
> > copies.  Of course you must touch the memory or mprotect() it so it's
> > faulted in, but indications are you have done this.
>
> Hm, alignment (to an address) is already done by malloc. You probably mean
> something different. I don't understand the difference. To me,
> malloc+memset is equivalent to calloc.
> I'll send you the benchmark code, if you like.

Ok, if it compiles I could try on sdp3430 or sdp3630.

Alignment to a cache line is always best.  This is 64 bytes in A8.  Better yet, 
being 4K aligned is a good thing to reduce MMU effects.
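
A minimal sketch of the alignment point above, assuming a POSIX system where posix_memalign() is available. The helper names alloc_aligned and is_aligned are hypothetical and are not taken from the benchmark discussed in this thread; the 64-byte and 4K values are the ones quoted above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64u    /* Cortex-A8 cache-line size quoted above */
#define PAGE_4K    4096u  /* 4K alignment quoted above */

/* Allocate `size` bytes aligned to `align`, which must be a power of two
 * and a multiple of sizeof(void *). Returns NULL on failure; the caller
 * releases the buffer with free(). */
void *alloc_aligned(size_t align, size_t size)
{
    void *p = NULL;

    assert(align != 0 && (align & (align - 1)) == 0);
    if (posix_memalign(&p, align, size) != 0)
        return NULL;
    return p;
}

/* Nonzero if `p` sits on an `align`-byte boundary (align is a power of two). */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

A benchmark could allocate its source and destination buffers with alloc_aligned(PAGE_4K, len) so that both kernels are compared with identical buffer alignment.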

> In both kernels I used the same rootfs (via NFS). Indeed I used CS2009q1 and
> its libs, but we are talking about a factor of 2 to 6. This must be
> something serious.

The 2009 paid version has optimized Thumb-2 and ARM mode libraries that you 
download separately.  I don't recall whether the free/lite version integrated 
them.

> What is your feeling? Is the 22 kernel doing something strange, or are the
> newer kernels slower than they have to be?

There are a lot of differences between 22 kernel and current ones.  First 
things I'd check would be around MMU settings, special ARM cp15 memory control 
regs, then omap memory and clock settings.  Also some bad device could undercut 
you.  Always good to cat /proc/interrupts and make sure something isn't spewing.
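
A small sketch of the /proc/interrupts check, under the assumption that each interrupt line has the usual "label: count [count ...] description" layout. The function name irq_line_total is hypothetical. The idea is to read the file twice, a second or so apart, and compare the totals per line; a line whose total jumps by a large amount is a candidate for a "spewing" device.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on one line of /proc/interrupts.
 * Returns 0 and stores the sum in *total on success, or -1 if the line
 * has no "label:" prefix followed by at least one counter (for example
 * the CPU header line). */
int irq_line_total(const char *line, unsigned long long *total)
{
    const char *p;
    unsigned long long sum = 0;
    int fields = 0;

    assert(line != NULL && total != NULL);
    p = strchr(line, ':');
    if (p == NULL)
        return -1;
    p++;
    for (;;) {
        char *end;
        unsigned long long v;

        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break;                 /* reached the textual description */
        v = strtoull(p, &end, 10);
        if (*end != '\0' && !isspace((unsigned char)*end))
            break;                 /* e.g. "2-edge": not a counter */
        sum += v;
        fields++;
        p = end;
    }
    if (fields == 0)
        return -1;
    *total = sum;
    return 0;
}
```

A description that begins with a bare number would be miscounted by this parser, so the output is only a first hint and not a substitute for reading the file itself.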

> It would be interesting to see results on other OMAP3 boards with both old
> and new kernels.

I've not noticed anything on sdp on somewhat recent kernels.

Regards,
Richard W.

--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RE: RE: RE: Memory performance / Cache problem

2009-10-14 Thread epsi
> > The memory clock is 166 MHz in both cases. I don't know whether there are
> > differences in access cycles and timing, but the memory clock is fine.
> 
> How did you physically verify this?

The oscilloscope shows 166 MHz, and the kernel messages about the frequency 
are the same in both kernels.

> > Following Siarhei's hint to initialize the buffers (around 1.2 MByte each),
> > I get different results in the 22 kernel for
> > malloc alone:
> > memcpy =   473.764, loop4 =   448.430, loop1 =   102.770, rand =    29.641
> > calloc alone:
> > memcpy =   405.947, loop4 =   361.550, loop1 =    95.441, rand =    21.853
> > malloc+memset:
> > memcpy =   239.294, loop4 =   188.617, loop1 =    80.871, rand =     4.726
> >
> > In the 31 kernel all three measurements are at about the same
> > (unfortunately low) level as malloc+memset in 22.
> 
> Yes, aligned buffers can make a difference.  But probably more so for small
> copies.  Of course you must touch the memory or mprotect() it so it's
> faulted in, but indications are you have done this.

Hm, alignment (to an address) is already done by malloc. You probably mean 
something different. I don't understand the difference. To me, 
malloc+memset is equivalent to calloc. 
I'll send you the benchmark code, if you like. 

> > I used a standard memcpy (I think this is the glibc one and hence not
> > NEON-based). To be NEON-based, I guess it has to be recompiled?
> 
> The version of glibc in use can make a difference.  CodeSourcery added PLD
> instructions to the memory operations in its 2009 release.  This can give a
> good benefit.  It might be that you have an optimized library in one case
> and a non-optimized one in another.

In both kernels I used the same rootfs (via NFS). Indeed I used CS2009q1 and 
its libs, but we are talking about a factor of 2 to 6. This must be something 
serious.

What is your feeling? Is the 22 kernel doing something strange, or are the 
newer kernels slower than they have to be?

It would be interesting to see results on other OMAP3 boards with both old 
and new kernels.

Best regards
Steffen


RE: RE: RE: Memory performance / Cache problem

2009-10-14 Thread Woodruff, Richard

> From: e...@gmx.de [mailto:e...@gmx.de]
> Sent: Wednesday, October 14, 2009 9:49 AM
> To: Woodruff, Richard; linux-omap@vger.kernel.org; Premi, Sanjeev
> Subject: Re: RE: RE: Memory performance / Cache problem
>
> The memory clock is 166 MHz in both cases. I don't know whether there are
> differences in access cycles and timing, but the memory clock is fine.

How did you physically verify this?

> Following Siarhei's hint to initialize the buffers (around 1.2 MByte each),
> I get different results in the 22 kernel for
> malloc alone:
> memcpy =   473.764, loop4 =   448.430, loop1 =   102.770, rand =    29.641
> calloc alone:
> memcpy =   405.947, loop4 =   361.550, loop1 =    95.441, rand =    21.853
> malloc+memset:
> memcpy =   239.294, loop4 =   188.617, loop1 =    80.871, rand =     4.726
>
> In the 31 kernel all three measurements are at about the same (unfortunately
> low) level as malloc+memset in 22.

Yes, aligned buffers can make a difference.  But probably more so for small 
copies.  Of course you must touch the memory or mprotect() it so it's faulted 
in, but indications are you have done this.
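
To illustrate "touching" the memory: a hedged sketch that writes one byte per page so the kernel has to back the whole buffer with real pages before any timing starts. The names touch_pages and alloc_touched are hypothetical. One possible reading of the malloc/calloc/malloc+memset gap, which is an assumption and not something confirmed in this thread, is that buffers which were never written may not be backed by distinct physical pages yet, so a copy that reads them exercises the cache very differently.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Store one byte in every page of `buf` so each page is faulted in.
 * Returns the number of pages touched. The stored value is arbitrary. */
size_t touch_pages(void *buf, size_t len)
{
    volatile unsigned char *p = buf;
    long ps = sysconf(_SC_PAGESIZE);
    size_t page = ps > 0 ? (size_t)ps : 4096;  /* fall back to 4K */
    size_t n = 0;

    assert(buf != NULL || len == 0);
    for (size_t off = 0; off < len; off += page) {
        p[off] = 0xA5;   /* any store forces a private page to be mapped */
        n++;
    }
    return n;
}

/* malloc() a buffer and fault it in; the caller frees it. */
void *alloc_touched(size_t len)
{
    void *buf = malloc(len);

    if (buf != NULL)
        touch_pages(buf, len);
    return buf;
}
```

Using the same allocation path (for example alloc_touched) for every variant of a benchmark removes the allocator as a variable when comparing kernels.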

> I used a standard memcpy (I think this is the glibc one and hence not
> NEON-based). To be NEON-based, I guess it has to be recompiled?

The version of glibc in use can make a difference.  CodeSourcery added PLD 
instructions to the memory operations in its 2009 release.  This can give a 
good benefit.  It might be that you have an optimized library in one case and 
a non-optimized one in another.
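
For readers unfamiliar with PLD: a rough sketch of a copy loop with a software prefetch hint, assuming GCC or Clang, whose __builtin_prefetch builtin is typically emitted as a PLD instruction on ARM. This is not the CodeSourcery implementation; the function name copy_with_prefetch and the prefetch distance AHEAD are hypothetical and would need tuning by measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE  64          /* assumed cache-line size (Cortex-A8) */
#define AHEAD (4 * LINE)  /* hypothetical prefetch distance */

/* Copy `len` bytes one cache line at a time, hinting the source a few
 * lines ahead of the line currently being copied. */
void copy_with_prefetch(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);
    while (len >= LINE) {
        if (len > AHEAD)                          /* stay inside the buffer */
            __builtin_prefetch(s + AHEAD, 0, 0);  /* read, low locality */
        memcpy(d, s, LINE);  /* fixed size, usually inlined by the compiler */
        d += LINE;
        s += LINE;
        len -= LINE;
    }
    memcpy(d, s, len);       /* remaining tail, possibly zero bytes */
}
```

Whether the hint helps depends on the core and cache configuration, which is the kind of library difference described above.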

Regards,
Richard W.



Re: RE: RE: Memory performance / Cache problem

2009-10-14 Thread epsi
The memory clock is 166 MHz in both cases. I don't know whether there are 
differences in access cycles and timing, but the memory clock is fine.

Following Siarhei's hint to initialize the buffers (around 1.2 MByte each),
I get different results in the 22 kernel for
malloc alone:
memcpy =   473.764, loop4 =   448.430, loop1 =   102.770, rand =    29.641
calloc alone:
memcpy =   405.947, loop4 =   361.550, loop1 =    95.441, rand =    21.853
malloc+memset:
memcpy =   239.294, loop4 =   188.617, loop1 =    80.871, rand =     4.726

In the 31 kernel all three measurements are at about the same (unfortunately 
low) level as malloc+memset in 22.
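
The benchmark source is not included in this thread, so the following is only a sketch of how a memcpy figure of this kind could be measured, assuming POSIX clock_gettime(). The function name memcpy_mb_per_s is hypothetical, and the MB/s unit is an assumption that may not match the units of the numbers above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time `reps` memcpy() calls of `len` bytes each and return the rate in
 * MB/s (1 MB = 1e6 bytes). Returns a negative value if allocation fails
 * or the elapsed time cannot be measured. */
double memcpy_mb_per_s(size_t len, int reps)
{
    unsigned char *src, *dst;
    struct timespec t0, t1;
    volatile unsigned char sink = 0;
    double secs;

    assert(len > 0 && reps > 0);
    src = malloc(len);
    dst = malloc(len);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0x5a, len);  /* write both buffers so they are faulted in */
    memset(dst, 0, len);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, len);
        sink = dst[(size_t)i % len];  /* read back so the copy stays observable */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    (void)sink;
    free(src);
    free(dst);
    secs = (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return -1.0;
    return ((double)len * (double)reps) / secs / 1e6;
}
```

Calling memcpy_mb_per_s(1200000, 100) roughly mirrors the 1.2 MByte buffers mentioned above. Because both buffers are written before timing, this corresponds to the malloc+memset case.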

First of all: What performance can be expected?
Is the 22 kernel doing something wrong if it is so much faster?
Can the later kernels get a boost in memory handling?

I used a standard memcpy (I think this is the glibc one and hence not 
NEON-based). To be NEON-based, I guess it has to be recompiled?

How can I find out whether the NEON and cache settings are OK?
I am using an OMAP3530 on an EVM board.
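
One partial check that is possible from user space, sketched under the assumption that the ARM kernel lists CPU capabilities on a "Features" line in /proc/cpuinfo. The function names has_word and cpu_has_feature are hypothetical. This only shows whether the kernel reports NEON; the cache-related cp15 control registers are privileged and cannot be read this way.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Nonzero if `flag` appears as a whole whitespace-separated word in `list`. */
int has_word(const char *list, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = list;

    assert(n > 0);
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == list) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' || p[n] == '\n';

        if (starts && ends)
            return 1;
        p += n;
    }
    return 0;
}

/* Returns 1 if the "Features" line of /proc/cpuinfo lists `flag`,
 * 0 if it does not, and -1 if the file or the line is unavailable. */
int cpu_has_feature(const char *flag)
{
    char line[1024];
    FILE *f = fopen("/proc/cpuinfo", "r");
    int result = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "Features", 8) == 0) {
            const char *colon = strchr(line, ':');

            result = colon != NULL ? has_word(colon + 1, flag) : 0;
            break;
        }
    }
    fclose(f);
    return result;
}
```

On the board, cpu_has_feature("neon") returning 1 would indicate that the kernel reports NEON support; it says nothing about whether the C library's memcpy actually uses it.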

Unfortunately I don't have a Lauterbach, just a Spectrum Digital, which works 
only until the Linux kernel starts booting.

Best regards
Steffen


-------- Original Message --------
> Date: Wed, 14 Oct 2009 08:59:05 -0500
> From: "Woodruff, Richard"
> To: "e...@gmx.de", "Premi, Sanjeev", "linux-omap@vger.kernel.org"
> Subject: RE: RE: Memory performance / Cache problem

> > There is no newer u-boot available from TI. There is an SDK 02.01.03.11,
> > but it contains the same u-boot 2008.10; the only addition is support for
> > the second generation of EVM boards with a different network chip.
> >
> > So I checked the u-boot from git, but this no longer supports Micron's
> > NAND flash. It only works with OneNAND.
> >
> > I found a patch that shows the L2 cache status during kernel boot and
> > applied it: the L2 cache seems to be enabled already, so this is not the
> > reason.
> >
> > So any other ideas?
> 
> Are you confident your memory bus isn't running at 1/2 speed?
> 
> I recall there was a window of a couple of days during wtbu kernel upgrades
> where the memory bus with pm was running at 1/2 speed after the kernel
> started up.  This was somewhat a side effect of the constraints framework
> and a regression in forward porting. It seems unlikely the psp kernel would
> have shipped with this bug, but it's something to check. This would match
> your results.
> 
> If your memcpy() is NEON-based then I might be worried about
> l1neon-caching effects along with factors of (exclusive-l1-l2-read-allocate
> cache + pld not being effective on l1 only l2).
> 
> Which memcpy test are you using? Something in lmbench, or just one you
> wrote?  Generally results are a little hard to interpret with the exclusive
> cache behavior in the 3430's r1px core.  The 3630's r3p2 core gives more
> traditional results, as the exclusive feature has been removed by ARM.
> 
> If you have the ability, using a Lauterbach with a per file allows an
> internal space dump which shows all critical parameters during the test.
> It's a 1 minute check for someone who has done it before to ensure the few
> parameters needed are in line.  I can send an example off line of how to do
> the capture.  I won't have time to expand on all relevant parameters.
> 
> Regards,
> Richard W.



RE: RE: Memory performance / Cache problem

2009-10-14 Thread Woodruff, Richard
> There is no newer u-boot available from TI. There is an SDK 02.01.03.11,
> but it contains the same u-boot 2008.10; the only addition is support for
> the second generation of EVM boards with a different network chip.
>
> So I checked the u-boot from git, but this no longer supports Micron's NAND
> flash. It only works with OneNAND.
>
> I found a patch that shows the L2 cache status during kernel boot and
> applied it: the L2 cache seems to be enabled already, so this is not the
> reason.
>
> So any other ideas?

Are you confident your memory bus isn't running at 1/2 speed?

I recall there was a window of a couple of days during wtbu kernel upgrades 
where the memory bus with pm was running at 1/2 speed after the kernel started 
up.  This was somewhat a side effect of the constraints framework and a 
regression in forward porting. It seems unlikely the psp kernel would have 
shipped with this bug, but it's something to check. This would match your 
results.

If your memcpy() is NEON-based then I might be worried about l1neon-caching 
effects along with factors of (exclusive-l1-l2-read-allocate cache + pld not 
being effective on l1 only l2).

Which memcpy test are you using? Something in lmbench, or just one you wrote?  
Generally results are a little hard to interpret with the exclusive cache 
behavior in the 3430's r1px core.  The 3630's r3p2 core gives more traditional 
results, as the exclusive feature has been removed by ARM.

If you have the ability, using a Lauterbach with a per file allows an internal 
space dump which shows all critical parameters during the test.  It's a 1 
minute check for someone who has done it before to ensure the few parameters 
needed are in line.  I can send an example off line of how to do the capture.  
I won't have time to expand on all relevant parameters.

Regards,
Richard W.



Re: RE: Memory performance / Cache problem

2009-10-13 Thread epsi
> 
> Can you upgrade to a newer u-boot? Either from the PSP release
> or the u-boot tree hosted at git.denx.de (at least 2009.03)?
> 
> Also, it will be good to see the sample program you are using.
> 
> ~sanjeev
> 

There is no newer u-boot available from TI. There is an SDK 02.01.03.11,
but it contains the same u-boot 2008.10; the only addition is support for the 
second generation of EVM boards with a different network chip.

So I checked the u-boot from git, but this no longer supports Micron's NAND 
flash. It only works with OneNAND.

I found a patch that shows the L2 cache status during kernel boot and 
applied it: the L2 cache seems to be enabled already, so this is not the 
reason.

So any other ideas? 