Re: [Cbe-oss-dev] [RFC 3/3] powerpc: copy_4K_page tweaked for Cell

2008-06-19 Thread Mark Nelson
On Fri, 20 Jun 2008 07:28:50 am Arnd Bergmann wrote:
> On Thursday 19 June 2008, Mark Nelson wrote:
> > 	.align	7
> > _GLOBAL(copy_4K_page)
> > 	dcbt	0,r4		/* Prefetch ONE SRC cacheline */
> > 
> > 	addi	r6,r3,-8	/* prepare for stdu */
> > 	addi	r4,r4,-8	/* prepare for ldu */
> > 
> > 	li	r10,32		/* copy 32 cache lines for a 4K page */
> > 	li	r12,128+8	/* prefetch distance */
> 
> Since you have a loop here anyway instead of the fully unrolled
> code, why not provide a copy_64K_page function as well, jumping in
> here?

That is a good idea. What effect will that have on how the code
patching will work?

> 
> The inline 64k copy_page function otherwise just adds code size,
> as well as being a tiny bit slower. It may even be good to
> have an out-of-line copy_64K_page for the regular code, just
> calling copy_4K_page repeatedly.

Doing that sounds like it'll make the code patching easier.

Thanks!

Mark


Re: [Cbe-oss-dev] [RFC 3/3] powerpc: copy_4K_page tweaked for Cell

2008-06-19 Thread Arnd Bergmann
On Thursday 19 June 2008, Mark Nelson wrote:
> 	.align	7
> _GLOBAL(copy_4K_page)
> 	dcbt	0,r4		/* Prefetch ONE SRC cacheline */
> 
> 	addi	r6,r3,-8	/* prepare for stdu */
> 	addi	r4,r4,-8	/* prepare for ldu */
> 
> 	li	r10,32		/* copy 32 cache lines for a 4K page */
> 	li	r12,128+8	/* prefetch distance */

Since you have a loop here anyway instead of the fully unrolled
code, why not provide a copy_64K_page function as well, jumping in
here?

The inline 64k copy_page function otherwise just adds code size,
as well as being a tiny bit slower. It may even be good to
have an out-of-line copy_64K_page for the regular code, just
calling copy_4K_page repeatedly.

Arnd <><
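
[Editorial note: for illustration only, here is a minimal C sketch of the out-of-line wrapper Arnd describes, assuming a 64K page is copied as sixteen 4K chunks. The name copy_64K_page and its prototype are hypothetical and not taken from any posted patch.]

/*
 * Hypothetical sketch of the out-of-line wrapper described above:
 * a 64K page copied as sixteen 4K chunks.  The name copy_64K_page
 * and the prototype are illustrative, not part of the posted patch.
 */
extern void copy_4K_page(void *to, void *from);

void copy_64K_page(void *to, void *from)
{
	int i;

	for (i = 0; i < 16; i++)	/* 16 * 4K == 64K */
		copy_4K_page((char *)to + i * 4096,
			     (char *)from + i * 4096);
}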


[RFC 3/3] powerpc: copy_4K_page tweaked for Cell

2008-06-19 Thread Mark Nelson
/*
 * Copyright (C) 2008 Gunnar von Boehn, IBM Corp.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 *
 *
 * copy_4K_page routine optimized for CELL-BE-PPC
 *
 * The CELL PPC core has 1 integer unit and 1 load/store unit.
 * CELL data caches: 1st level = 32K, 2nd level = 512K, 3rd level = none.
 * To improve copy performance we need to prefetch the source data
 * far ahead to hide the memory latency.
 * For best performance, instruction forms ending in "." (such as "andi.")
 * should be avoided, as they are implemented in microcode on CELL.
 *
 * The code below is loop unrolled for the CELL cache line size of 128 bytes.
 */

#include <asm/processor.h>
#include <asm/ppc_asm.h>

#define PREFETCH_AHEAD	6	/* cache lines to prefetch ahead of the copy */
#define ZERO_AHEAD	4	/* cache lines to dcbz ahead of the stores */

	.align	7
_GLOBAL(copy_4K_page)
	dcbt	0,r4		/* Prefetch ONE SRC cacheline */

	addi	r6,r3,-8	/* prepare for stdu */
	addi	r4,r4,-8	/* prepare for ldu */

	li	r10,32		/* copy 32 cache lines for a 4K page */
	li	r12,128+8	/* prefetch distance */

	subi	r11,r10,PREFETCH_AHEAD
	li	r10,PREFETCH_AHEAD

	mtctr	r10
.LprefetchSRC:
	dcbt	r12,r4
	addi	r12,r12,128
	bdnz	.LprefetchSRC

.Louterloop:			/* copy whole cache lines */
	mtctr	r11

	li	r11,128*ZERO_AHEAD+8	/* DCBZ distance */

	.align	4
	/* Copy whole cache lines, optimized by prefetching SRC cache lines */
.Lloop:				/* copy aligned body */
	dcbt	r12,r4		/* prefetch source some cache lines ahead */
	ld	r9, 0x08(r4)
	dcbz	r11,r6
	ld	r7, 0x10(r4)	/* 4 register stride copy */
	ld	r8, 0x18(r4)	/* 4 are optimal to hide 1st level cache latency */
	ld	r0, 0x20(r4)
	std	r9, 0x08(r6)
	std	r7, 0x10(r6)
	std	r8, 0x18(r6)
	std	r0, 0x20(r6)
	ld	r9, 0x28(r4)
	ld	r7, 0x30(r4)
	ld	r8, 0x38(r4)
	ld	r0, 0x40(r4)
	std	r9, 0x28(r6)
	std	r7, 0x30(r6)
	std	r8, 0x38(r6)
	std	r0, 0x40(r6)
	ld	r9, 0x48(r4)
	ld	r7, 0x50(r4)
	ld	r8, 0x58(r4)
	ld	r0, 0x60(r4)
	std	r9, 0x48(r6)
	std	r7, 0x50(r6)
	std	r8, 0x58(r6)
	std	r0, 0x60(r6)
	ld	r9, 0x68(r4)
	ld	r7, 0x70(r4)
	ld	r8, 0x78(r4)
	ldu	r0, 0x80(r4)
	std	r9, 0x68(r6)
	std	r7, 0x70(r6)
	std	r8, 0x78(r6)
	stdu	r0, 0x80(r6)

	bdnz	.Lloop

	sldi	r10,r10,2	/* adjust from 128 to 32 byte stride */
	mtctr	r10
.Lloop2:			/* copy aligned body */
	ld	r9, 0x08(r4)
	ld	r7, 0x10(r4)
	ld	r8, 0x18(r4)
	ldu	r0, 0x20(r4)
	std	r9, 0x08(r6)
	std	r7, 0x10(r6)
	std	r8, 0x18(r6)
	stdu	r0, 0x20(r6)

	bdnz	.Lloop2

.Lendloop2:
	blr
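
[Editorial note: for readers less familiar with PowerPC assembly, the following is an illustrative C-level sketch of the prefetch-ahead pattern the routine above implements. The function name copy_page_sketch, the SKETCH_* macros, and the use of __builtin_prefetch/memcpy are stand-ins for the dcbt-based prefetching, not part of the patch; the dcbz-ahead of the destination has no direct C equivalent and is omitted here.]

#include <string.h>

#define SKETCH_PAGE_SIZE	4096
#define SKETCH_LINE		128	/* Cell cache line size */
#define SKETCH_PF_AHEAD		6	/* prefetch distance, as PREFETCH_AHEAD above */

/*
 * Illustrative stand-in for the assembly above: copy one 4K page a
 * cache line at a time, prefetching the source a few lines ahead so
 * that later loads hit the cache (the role played by dcbt above).
 */
static void copy_page_sketch(void *to, const void *from)
{
	char *d = to;
	const char *s = from;
	int line, pf;

	for (line = 0; line < SKETCH_PAGE_SIZE / SKETCH_LINE; line++) {
		pf = line + SKETCH_PF_AHEAD;
		if (pf < SKETCH_PAGE_SIZE / SKETCH_LINE)
			__builtin_prefetch(s + pf * SKETCH_LINE, 0);	/* read prefetch */
		memcpy(d + line * SKETCH_LINE, s + line * SKETCH_LINE, SKETCH_LINE);
	}
}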
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev