subject:"\[Cbe\-oss\-dev\] \[RFC, PATCH\] CELL Oprofile SPU profiling updated patch"

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-27 Thread Arnd Bergmann

On Tuesday 27 February 2007, Maynard Johnson wrote:
> I have applied the "cleanup" patch that Arnd sent, but had to fix up a 
> few things:
>    -  Bug fix:  Initialize retval in spu_task_sync.c, line 95, otherwise 
> OProfile this function returns non-zero and OProfile fails.
>    -  Remove unused codes in include/linux/oprofile.h
>    -  Compile warnings:  Initialize offset and spu_cookie at lines 283 
> and 284 in spu_task_sync.c
> 
> With these changes and some userspace changes that were necessary to 
> correspond with Arnd's changes, our testing was successful.
> 
> A fixup patch is attached.
> 

The patch does not contain any of the metadata I need to apply it
(subject, description, signed-off-by).

> @@ -280,8 +280,8 @@ static int process_context_switch(struct
>  {
> unsigned long flags;
> int retval;
> -   unsigned int offset;
> -   unsigned long spu_cookie, app_dcookie;
> +   unsigned int offset = 0;
> +   unsigned long spu_cookie = 0, app_dcookie;
> retval = prepare_cached_spu_info(spu, objectId);
> if (retval)
> goto out;

No, this is wrong. Leaving the variables uninitialized at least warns
you about the bug you have in this function: when there is anything wrong,
you just continue writing the record with zero offset and dcookie values
in it. Instead, you should get handle the error condition somewhere
down the code.

It's harmless most of the time, but you really should not be painting
over your bugs by blindly initializing variables.

> diff -paur linux-orig/include/linux/oprofile.h 
> linux-new/include/linux/oprofile.h
> --- linux-orig/include/linux/oprofile.h 2007-02-27 14:41:29.0 -0600
> +++ linux-new/include/linux/oprofile.h  2007-02-27 14:43:18.0 -0600
> @@ -36,9 +36,6 @@
>  #define XEN_ENTER_SWITCH_CODE  10
>  #define SPU_PROFILING_CODE 11
>  #define SPU_CTX_SWITCH_CODE12
> -#define SPU_OFFSET_CODE13
> -#define SPU_COOKIE_CODE14
> -#define SPU_SHLIB_COOKIE_CODE  15
>  
>  struct super_block;
>  struct dentry;
> 
Right, I forgot about this.

Arnd <><

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-27 Thread Maynard Johnson

I have applied the "cleanup" patch that Arnd sent, but had to fix up a 
few things:
  -  Bug fix:  Initialize retval in spu_task_sync.c, line 95, otherwise 
OProfile this function returns non-zero and OProfile fails.

  -  Remove unused codes in include/linux/oprofile.h
  -  Compile warnings:  Initialize offset and spu_cookie at lines 283 
and 284 in spu_task_sync.c


With these changes and some userspace changes that were necessary to 
correspond with Arnd's changes, our testing was successful.


A fixup patch is attached.

P.S.  We have a single patch with all these changes applied if anyone 
would like us to post it.


-Maynard


Arnd Bergmann wrote:


On Thursday 22 February 2007, Carl Love wrote:
 


This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
to add in the SPU profiling capabilities.  In addition, a 'cell' subdirectory
was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
code.
   



There was a significant amount of whitespace breakage in this patch,
which I cleaned up. The patch below consists of the other things
I changed as a further cleanup. Note that I changed the format
of the context switch record, which I found too complicated, as
I described on IRC last week.

Arnd <><

 



diff -paur linux-orig/arch/powerpc/oprofile/cell/spu_task_sync.c linux-new/arch/powerpc/oprofile/cell/spu_task_sync.c
--- linux-orig/arch/powerpc/oprofile/cell/spu_task_sync.c	2007-02-27 17:10:24.0 -0600
+++ linux-new/arch/powerpc/oprofile/cell/spu_task_sync.c	2007-02-27 17:08:57.0 -0600
@@ -92,7 +92,7 @@ prepare_cached_spu_info(struct spu * spu
 {
 	unsigned long flags;
 	struct vma_to_fileoffset_map * new_map;
-	int retval;
+	int retval = 0;
 	struct cached_info * info;
 
 /* We won't bother getting cache_lock here since
@@ -280,8 +280,8 @@ static int process_context_switch(struct
 {
 	unsigned long flags;
 	int retval;
-	unsigned int offset;
-	unsigned long spu_cookie, app_dcookie;
+	unsigned int offset = 0;
+	unsigned long spu_cookie = 0, app_dcookie;
 	retval = prepare_cached_spu_info(spu, objectId);
 	if (retval)
 		goto out;
diff -paur linux-orig/include/linux/oprofile.h linux-new/include/linux/oprofile.h
--- linux-orig/include/linux/oprofile.h	2007-02-27 14:41:29.0 -0600
+++ linux-new/include/linux/oprofile.h	2007-02-27 14:43:18.0 -0600
@@ -36,9 +36,6 @@
 #define XEN_ENTER_SWITCH_CODE  10
 #define SPU_PROFILING_CODE 11
 #define SPU_CTX_SWITCH_CODE12
-#define SPU_OFFSET_CODE13
-#define SPU_COOKIE_CODE14
-#define SPU_SHLIB_COOKIE_CODE  15
 
 struct super_block;
 struct dentry;

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-26 Thread Michael Ellerman

On Tue, 2007-02-27 at 00:50 +0100, Arnd Bergmann wrote:
> On Thursday 22 February 2007, Carl Love wrote:
> > This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
> > to add in the SPU profiling capabilities.  In addition, a 'cell' 
> > subdirectory
> > was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
> > code.
> 
> There was a significant amount of whitespace breakage in this patch,
> which I cleaned up. The patch below consists of the other things
> I changed as a further cleanup. Note that I changed the format
> of the context switch record, which I found too complicated, as
> I described on IRC last week.
> 
>   Arnd <><
> 
> --
> Subject: cleanup spu oprofile code
> 
> From: Arnd Bergmann <[EMAIL PROTECTED]>
> This cleans up some of the new oprofile code. It's mostly
> cosmetic changes, like way multi-line comments are formatted.
> The most significant change is a simplification of the
> context-switch record format.
> 
> It does mean the oprofile report tool needs to be adapted,
> but I'm sure that it pays off in the end.

I hate to be a stickler, but this patch is quite large, contains
multiple changes, and mixes formatting changes with functional
changes ... makes it a little hard to review :/

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person


signature.asc
Description: This is a digitally signed message part

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-26 Thread Arnd Bergmann

On Thursday 22 February 2007, Carl Love wrote:
> This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
> to add in the SPU profiling capabilities.  In addition, a 'cell' subdirectory
> was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
> code.

There was a significant amount of whitespace breakage in this patch,
which I cleaned up. The patch below consists of the other things
I changed as a further cleanup. Note that I changed the format
of the context switch record, which I found too complicated, as
I described on IRC last week.

Arnd <><

--
Subject: cleanup spu oprofile code

From: Arnd Bergmann <[EMAIL PROTECTED]>
This cleans up some of the new oprofile code. It's mostly
cosmetic changes, like way multi-line comments are formatted.
The most significant change is a simplification of the
context-switch record format.

It does mean the oprofile report tool needs to be adapted,
but I'm sure that it pays off in the end.

Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
Index: linux-2.6/arch/powerpc/oprofile/cell/spu_task_sync.c
===
--- linux-2.6.orig/arch/powerpc/oprofile/cell/spu_task_sync.c
+++ linux-2.6/arch/powerpc/oprofile/cell/spu_task_sync.c
@@ -61,11 +61,12 @@ static void destroy_cached_info(struct k
 static struct cached_info * get_cached_info(struct spu * the_spu, int spu_num)
 {
struct kref * ref;
-   struct cached_info * ret_info = NULL;
+   struct cached_info * ret_info;
if (spu_num >= num_spu_nodes) {
printk(KERN_ERR "SPU_PROF: "
   "%s, line %d: Invalid index %d into spu info cache\n",
   __FUNCTION__, __LINE__, spu_num);
+   ret_info = NULL;
goto out;
}
if (!spu_info[spu_num] && the_spu) {
@@ -89,9 +90,9 @@ static struct cached_info * get_cached_i
 static int
 prepare_cached_spu_info(struct spu * spu, unsigned int objectId)
 {
-   unsigned long flags = 0;
+   unsigned long flags;
struct vma_to_fileoffset_map * new_map;
-   int retval = 0;
+   int retval;
struct cached_info * info;
 
/* We won't bother getting cache_lock here since
@@ -112,6 +113,7 @@ prepare_cached_spu_info(struct spu * spu
printk(KERN_ERR "SPU_PROF: "
   "%s, line %d: create vma_map failed\n",
   __FUNCTION__, __LINE__);
+   retval = -ENOMEM;
goto err_alloc;
}
new_map = create_vma_map(spu, objectId);
@@ -119,6 +121,7 @@ prepare_cached_spu_info(struct spu * spu
printk(KERN_ERR "SPU_PROF: "
   "%s, line %d: create vma_map failed\n",
   __FUNCTION__, __LINE__);
+   retval = -ENOMEM;
goto err_alloc;
}
 
@@ -144,7 +147,7 @@ prepare_cached_spu_info(struct spu * spu
goto out;
 
 err_alloc:
-   retval = -1;
+   kfree(info);
 out:
return retval;
 }
@@ -215,11 +218,9 @@ static inline unsigned long fast_get_dco
 static unsigned long
 get_exec_dcookie_and_offset(struct spu * spu, unsigned int * offsetp,
unsigned long * spu_bin_dcookie,
-   unsigned long * shlib_dcookie,
unsigned int spu_ref)
 {
unsigned long app_cookie = 0;
-   unsigned long * image_cookie = NULL;
unsigned int my_offset = 0;
struct file * app = NULL;
struct vm_area_struct * vma;
@@ -252,24 +253,17 @@ get_exec_dcookie_and_offset(struct spu *
 my_offset, spu_ref,
 vma->vm_file->f_dentry->d_name.name);
*offsetp = my_offset;
-   if (my_offset == 0)
-   image_cookie = spu_bin_dcookie;
-   else if (vma->vm_file != app)
-   image_cookie = shlib_dcookie;
break;
}
 
-   if (image_cookie) {
-   *image_cookie = fast_get_dcookie(vma->vm_file->f_dentry,
+   *spu_bin_dcookie = fast_get_dcookie(vma->vm_file->f_dentry,
 vma->vm_file->f_vfsmnt);
-   pr_debug("got dcookie for %s\n",
-vma->vm_file->f_dentry->d_name.name);
-   }
+   pr_debug("got dcookie for %s\n", vma->vm_file->f_dentry->d_name.name);
 
- out:
+out:
return app_cookie;
 
- fail_no_image_cookie:
+fail_no_image_cookie:
printk(KERN_ERR "SPU_PROF: "
"%s, line %d: Cannot find dcookie for SPU binary\n",
__FUNCTION__, __LINE__);
@@ -285,18 +279,18 @@ get_exec_dcookie_and_offset(struct spu *
 static int process_context_switch(struct spu * spu, unsigned int objectId)
 {
unsigned long flags;
-   int retval = 0;
-   unsigned int offset = 0;
-   unsigned long spu_cookie = 0, app_dcookie = 0, shlib_cookie = 0;
+

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-18 Thread Maynard Johnson


Maynard Johnson wrote:


Arnd Bergmann wrote:

 


On Friday 16 February 2007 01:32, Maynard Johnson wrote:

   


config OPROFILE_CELL
  bool "OProfile for Cell Broadband Engine"
  depends on OPROFILE && SPU_FS
  default y if ((SPU_FS = y && OPROFILE = y) || (SPU_FS = m && 
OPROFILE = m))

  help
Profiling of Cell BE SPUs requires special support enabled
by this option.  Both SPU_FS and OPROFILE options must be
set 'y' or both be set 'm'.
=

Can anyone see a problem with any of this . . . or perhaps a suggestion 
of a better way?
 


The text suggests it doesn't allow SPU_FS=y with OPROFILE=m, which I think
should be allowed. 
   

Right, good catch.  I'll add another OR to the 'default y' and correct 
the text.
 


Actually, it makes more sense to do the following:

config OPROFILE_CELL
  bool "OProfile for Cell Broadband Engine"
  depends on  (SPU_FS = y && OPROFILE = m) || (SPU_FS = y && 
OPROFILE = y) || (SPU_FS = m && OPROFILE = m)

  default y
  help
Profiling of Cell BE SPUs requires special support enabled by 
this option.



> I also don't see any place in the code where you actually
 


use CONFIG_OPROFILE_CELL.
   

As I mentioned, I will use CONFIG_OPROFILE_CELL in the 
arch/powerpc/oprofile/Makefile as follows:

 oprofile-$(CONFIG_OPROFILE_CELL) += op_model_cell.o \
  cell/spu_profiler.o cell/vma_map.o cell/spu_task_sync.o

 




[snip]


Arnd <><
   




___
Linuxppc-dev mailing list
[EMAIL PROTECTED]
https://ozlabs.org/mailman/listinfo/linuxppc-dev
 




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-16 Thread Maynard Johnson


Arnd Bergmann wrote:


On Friday 16 February 2007 01:32, Maynard Johnson wrote:


config OPROFILE_CELL
   bool "OProfile for Cell Broadband Engine"
   depends on OPROFILE && SPU_FS
   default y if ((SPU_FS = y && OPROFILE = y) || (SPU_FS = m && 
OPROFILE = m))

   help
 Profiling of Cell BE SPUs requires special support enabled
 by this option.  Both SPU_FS and OPROFILE options must be
 set 'y' or both be set 'm'.
=

Can anyone see a problem with any of this . . . or perhaps a suggestion 
of a better way?



The text suggests it doesn't allow SPU_FS=y with OPROFILE=m, which I think
should be allowed. 
Right, good catch.  I'll add another OR to the 'default y' and correct 
the text.


> I also don't see any place in the code where you actually

use CONFIG_OPROFILE_CELL.
As I mentioned, I will use CONFIG_OPROFILE_CELL in the 
arch/powerpc/oprofile/Makefile as follows:

 oprofile-$(CONFIG_OPROFILE_CELL) += op_model_cell.o \
  cell/spu_profiler.o cell/vma_map.o cell/spu_task_sync.o



Ideally, you should be able to have an oprofile_spu module that can be
loaded after spufs.ko and oprofile.ko. In that case you only need

config OPROFILE_SPU
depends on OPROFILE && SPU_FS
default y

and it will automatically build oprofile_spu as a module if one of the two
is a module and won't build it if one of them is disabled.
Hmmm . . . I guess that would entail splitting out the SPU-related stuff 
from op_model_cell.c into a new file.  Maybe more -- that's just what 
comes to mind right now.  Could be very tricky, and I wonder if it's 
worth the bother.


Arnd <><



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-16 Thread Arnd Bergmann

On Friday 16 February 2007 01:32, Maynard Johnson wrote:
> config OPROFILE_CELL
>         bool "OProfile for Cell Broadband Engine"
>         depends on OPROFILE && SPU_FS
>         default y if ((SPU_FS = y && OPROFILE = y) || (SPU_FS = m && 
> OPROFILE = m))
>         help
>           Profiling of Cell BE SPUs requires special support enabled
>           by this option.  Both SPU_FS and OPROFILE options must be
>           set 'y' or both be set 'm'.
> =
> 
> Can anyone see a problem with any of this . . . or perhaps a suggestion 
> of a better way?

The text suggests it doesn't allow SPU_FS=y with OPROFILE=m, which I think
should be allowed. I also don't see any place in the code where you actually
use CONFIG_OPROFILE_CELL.

Ideally, you should be able to have an oprofile_spu module that can be
loaded after spufs.ko and oprofile.ko. In that case you only need

config OPROFILE_SPU
depends on OPROFILE && SPU_FS
default y

and it will automatically build oprofile_spu as a module if one of the two
is a module and won't build it if one of them is disabled.

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-15 Thread Maynard Johnson


Arnd Bergmann wrote:


On Thursday 15 February 2007 00:52, Carl Love wrote:


 


--- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/Kconfig 2007-01-18 
16:43:14.0 -0600
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/Kconfig  2007-02-13 
19:04:46.271028904 -0600
@@ -7,7 +7,8 @@

config OPROFILE
tristate "OProfile system profiling (EXPERIMENTAL)"
-   depends on PROFILING
+   default m
+   depends on SPU_FS && PROFILING
help
  OProfile is a profiling system capable of profiling the
  whole system, include the kernel, kernel modules, libraries,
   



Milton already commented on this being wrong. I think what you want
is
depends on PROFILING && (SPU_FS = n || SPU_FS)

that should make sure that when SPU_FS=y that OPROFILE can not be 'm'.
 

The above suggestion would not work if SPU_FS is not defined, since the 
entire config option is ignored if an undefined symbol is used.  So, 
here's what I propose instead: 
   - Leave the existing 'config OPROFILE' unchanged from its current 
form in mainline (shown below)

   - Add the new 'config OPROFILE_CELL' (shown below)
   - In arch/powerpc/configs/cell-defconfig, set CONFIG_OPROFILE=m, to 
correspond to setting for CONFIG_SPU_FS

   - In arch/powerpc/oprofile/Makefile, do the following:
   oprofile-$(CONFIG_OPROFILE_CELL) += op_model_cell.o \
   
cell/spu_profiler.o cell/vma_map.o cell/spu_task_sync.o


===
config OPROFILE
   tristate "OProfile system profiling (EXPERIMENTAL)"
   depends on PROFILING
   help
 OProfile is a profiling system capable of profiling the
 whole system, include the kernel, kernel modules, libraries,
 and applications.

 If unsure, say N.

config OPROFILE_CELL
   bool "OProfile for Cell Broadband Engine"
   depends on OPROFILE && SPU_FS
   default y if ((SPU_FS = y && OPROFILE = y) || (SPU_FS = m && 
OPROFILE = m))

   help
 Profiling of Cell BE SPUs requires special support enabled
 by this option.  Both SPU_FS and OPROFILE options must be
 set 'y' or both be set 'm'.
=

Can anyone see a problem with any of this . . . or perhaps a suggestion 
of a better way?


Thanks.

-Maynard

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-15 Thread Arnd Bergmann

On Thursday 15 February 2007 22:50, Paul E. McKenney wrote:
> Is this 1.5ms with interrupts disabled?  This time period is problematic
> from a realtime perspective if so -- need to be able to preempt.

No, interrupts should be enabled here. Still, 1.5ms is probably a little
too long without a cond_resched() in case kernel preemption is disabled.

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-15 Thread Paul E. McKenney

On Thu, Feb 15, 2007 at 12:21:58PM -0800, Carl Love wrote:
> On Thu, 2007-02-15 at 15:37 +0100, Arnd Bergmann wrote:

[ . . . ]

> > I agree with Milton that it would be far nicer even to calculate
> > the value from user space, but since you say that would
> > violate the oprofile interface conventions, let's not go there.
> > In order to make this code nicer on the user, you should probably
> > insert a 'cond_resched()' somewhere in the loop, maybe every
> > 500 iterations or so.
> > 
> > it also looks like there is whitespace damage in the code here.
> 
> I will double check on the whitespace damage.  I thought I had gotten
> all that out.  
> 
> I have done some quick measurements.  The above method limits the loop
> to at most 2^16 iterations.  Based on running the algorithm in user
> space, it takes about 3ms of computation time to do the loop 2^16 times.
> 
> At the vary least, we need to put the resched in say every 10,000
> iterations which would be about every 0.5ms.  Should we do a resched
> more often?  
> 
> Additionally we could up the size of the table to 512 which would reduce
> the maximum time to about 1.5ms.  What do people think about increasing
> the table size?

Is this 1.5ms with interrupts disabled?  This time period is problematic
from a realtime perspective if so -- need to be able to preempt.

Thanx, Paul

> A little more general discussion about the logarithmic algorithm and
> limiting the range.  The hardware supports a 24 bit LFSR value. This
> means the user can say is capture a sample every N cycles, where N is in
> the range of 1 to 2^24.  The OProfile user tool enforces a minimum value
> of N to make sure the overhead of OProfile doesn't bring the machine to
> its knees.  The minimum values is not intended to guarantee the
> performance impact of OProfile will not be significant.  It is left as
> an exercise for the user to pick an N that will give minimal performance
> impact.  We set the lower limit for N for SPU profiling to 100,000. This
> is actually high enough that we don't seem to see much performance
> impact when running OProfile.  If the user picked N=2^24 then for a
> 3.2GHz machine you would get about 200 samples per second on each node.
> Where a sample consists of the PC value for all 8 SPUs on the node.  If
> the user wanted to do a relatively long OProfile run, I can see where
> they might use N=2^24 to avoid gathering too much data.  My gut feeling
> is that the sampling frequency for N=2^24 is not low enough that someone
> would never want to use it when doing long runs.  Hence, we should not
> arbitrarily reduce the maximum value for N.  Although I would expect
> that the typical value for N will be in the range of several hundred
> thousand to a few million.
> 
> As for using a logarithmic spacing of the precomputed values, this
> approach means that the space between the precomputed values at the high
> end would be much larger then 2^14, assuming 256 precomputed values.
> That means it could take much longer then 3ms to get the needed LFSR
> value for a large N.  By evenly spacing the precomputed values, we can
> ensure that for all N it will take less then 3ms to get the value.
> Personally, I am more comfortable with a hard limit on the compute time
> then a variable time that could get much bigger then the 1ms threshold
> that Arnd wants for resched.  Any thoughts?
> 
> > 
> > > +
> > > +/* This interface allows a profiler (e.g., OProfile) to store
> > > + * spu_context information needed for profiling, allowing it to
> > > + * be saved across context save/restore operation.
> > > + *
> > > + * Assumes the caller has already incremented the ref count to
> > > + * profile_info; then spu_context_destroy must call kref_put
> > > + * on prof_info_kref.
> > > + */
> > > +void spu_set_profile_private(struct spu_context * ctx, void * 
> > > profile_info,
> > > +  struct kref * prof_info_kref,
> > > +  void (* prof_info_release) (struct kref * kref))
> > > +{
> > > + ctx->profile_private = profile_info;
> > > + ctx->prof_priv_kref = prof_info_kref;
> > > + ctx->prof_priv_release = prof_info_release;
> > > +}
> > > +EXPORT_SYMBOL_GPL(spu_set_profile_private);
> > 
> > I think you don't need the profile_private member here, if you just use
> > container_of with ctx->prof_priv_kref in all users.
> > 
> > Arnd <><
> 
> ___
> cbe-oss-dev mailing list
> [EMAIL PROTECTED]
> https://ozlabs.org/mailman/listinfo/cbe-oss-dev
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-15 Thread Arnd Bergmann

On Thursday 15 February 2007 21:21, Carl Love wrote:

> I have done some quick measurements.  The above method limits the loop
> to at most 2^16 iterations.  Based on running the algorithm in user
> space, it takes about 3ms of computation time to do the loop 2^16 times.
> 
> At the vary least, we need to put the resched in say every 10,000
> iterations which would be about every 0.5ms.  Should we do a resched
> more often?  

Yes, just to be on the safe side, I'd suggest to do it every 1000
iterations.
 
> Additionally we could up the size of the table to 512 which would reduce
> the maximum time to about 1.5ms.  What do people think about increasing
> the table size?

No, that won't help too much. I'd say 256 or 128 entries is the most
we should have.

> As for using a logarithmic spacing of the precomputed values, this
> approach means that the space between the precomputed values at the high
> end would be much larger then 2^14, assuming 256 precomputed values.
> That means it could take much longer then 3ms to get the needed LFSR
> value for a large N.  By evenly spacing the precomputed values, we can
> ensure that for all N it will take less then 3ms to get the value.
> Personally, I am more comfortable with a hard limit on the compute time
> then a variable time that could get much bigger then the 1ms threshold
> that Arnd wants for resched.  Any thoughts?

When using precomputed values on a logarithmic scale, I'd recommend
just rounding to the closest value and accepting the relative inaccuracy,
instead of using the precomputed value as the base and then calculating
from there.

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-15 Thread Carl Love

On Thu, 2007-02-15 at 15:37 +0100, Arnd Bergmann wrote:
> On Thursday 15 February 2007 00:52, Carl Love wrote:
> 
> 
> > --- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/Kconfig 2007-01-18 
> > 16:43:14.0 -0600
> > +++ linux-2.6.20-rc1/arch/powerpc/oprofile/Kconfig  2007-02-13 
> > 19:04:46.271028904 -0600
> > @@ -7,7 +7,8 @@
> >  
> >  config OPROFILE
> > tristate "OProfile system profiling (EXPERIMENTAL)"
> > -   depends on PROFILING
> > +   default m
> > +   depends on SPU_FS && PROFILING
> > help
> >   OProfile is a profiling system capable of profiling the
> >   whole system, include the kernel, kernel modules, libraries,
> 
> Milton already commented on this being wrong. I think what you want
> is
>   depends on PROFILING && (SPU_FS = n || SPU_FS)
> 
> that should make sure that when SPU_FS=y that OPROFILE can not be 'm'.
> 
> > @@ -15,3 +16,10 @@
> >  
> >   If unsure, say N.
> >  
> > +config OPROFILE_CELL
> > +   bool "OProfile for Cell Broadband Engine"
> > +   depends on SPU_FS && OPROFILE
> > +   default y
> > +   help
> > + OProfile for Cell BE requires special support enabled
> > + by this option.
> 
> You should at least mention that this allows profiling the spus.
> 
> > +#define EFWCALL  ENOSYS /* Use an existing error number that is as
> > +* close as possible for a FW call that failed.
> > +* The probability of the call failing is
> > +* very low.  Passing up the error number
> > +* ensures that the user will see an error
> > +* message saying OProfile did not start.
> > +* Dmesg will contain an accurate message
> > +* about the failure.
> > +*/
> 
> ENOSYS looks wrong though. It would appear to the user as if the oprofile
> function in the kernel was not present. I'd suggest EIO, and not use 
> an extra define for that.
> 

OK, will do. 

> 
> >  static int
> >  rtas_ibm_cbe_perftools(int subfunc, int passthru,
> >void *address, unsigned long length)
> >  {
> > u64 paddr = __pa(address);
> >  
> > -   return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, passthru,
> > -paddr >> 32, paddr & 0x, length);
> > +   pm_rtas_token = rtas_token("ibm,cbe-perftools");
> > +
> > +   if (unlikely(pm_rtas_token == RTAS_UNKNOWN_SERVICE)) {
> > +   printk(KERN_ERR
> > +  "%s: rtas token ibm,cbe-perftools unknown\n",
> > +  __FUNCTION__);
> > +   return -EFWCALL;
> > +   } else {
> > +
> > +   return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, 
> > +passthru, paddr >> 32, paddr & 0x, length); 
> > +   }
> >  }
> 
> Are you now reading the rtas token every time you call rtas? that seems
> like a waste of time.

There are actually two RTAS calls, i.e. two tokens.  Once for setting up
the debug bus.  The other to do the SPU PC collection.  Yes, we are
getting the token each time using the single global pm_rtas_token.  To
make sure we had the correct token, I made sure to call it each time.
As you point out it is very wasteful.  It probably would be best to just
have a second global variable say spu_rtas_token.  Then do a single call
for each global variable.  Then we just use the global variable in the
appropriate rtas_call.  This would eliminate a significant number of
calls to look up the token.  I should have thought of that earlier.  

> 
> 
> > +#define size 24
> > +#define ENTRIES  (0x1<<8) /* 256 */
> > +#define MAXLFSR  0xFF
> > +
> > +int initial_lfsr[] =
> > +{16777215, 3797240, 13519805, 11602690, 6497030, 7614675, 2328937, 2889445,
> > + 12364575, 8723156, 2450594, 16280864, 14742496, 10904589, 6434212, 
> > 4996256,
> > + 5814270, 13014041, 9825245, 410260, 904096, 15151047, 15487695, 3061843,
> > + 16482682, 7938572, 4893279, 9390321, 4320879, 5686402, 1711063, 10176714,
> > + 4512270, 1057359, 16700434, 5731602, 2070114, 16030890, 1208230, 15603106,
> > + 11857845, 6470172, 1362790, 7316876, 8534496, 1629197, 10003072, 1714539,
> > + 1814669, 7106700, 5427154, 3395151, 3683327, 12950450, 16620273, 12122372,
> > + 7194999, 9952750, 3608260, 13604295, 2266835, 14943567, 7079230, 777380,
> > + 4516801, 1737661, 8730333, 13796927, 3247181, 9950017, 3481896, 16527555,
> > + 13116123, 14505033, 9781119, 4860212, 7403253, 13264219, 12269980, 100120,
> > + 664506, 607795, 8274553, 13133688, 6215305, 13208866, 16439693, 3320753,
> > + 8773582, 13874619, 1784784, 4513501, 11002978, 9318515, 3038856, 14254582,
> > + 15484958, 15967857, 13504461, 13657322, 14724513, 13955736, 5695315, 
> > 7330509,
> > + 12630101, 6826854, 439712, 4609055, 13288878, 1309632, 4996398, 11392266,
> > + 793740, 7653789, 2472670, 14641200, 5164364, 5482529, 10415855, 1629108,
> > + 2012376, 13661123, 14655718, 95340

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-15 Thread Arnd Bergmann

On Thursday 15 February 2007 17:15, Maynard Johnson wrote:
> >>+void spu_set_profile_private(struct spu_context * ctx, void * profile_info,
> >>+      struct kref * prof_info_kref,
> >>+      void (* prof_info_release) (struct kref * kref))
> >>+{
> >>+ ctx->profile_private = profile_info;
> >>+ ctx->prof_priv_kref = prof_info_kref;
> >>+ ctx->prof_priv_release = prof_info_release;
> >>+}
> >>+EXPORT_SYMBOL_GPL(spu_set_profile_private);
> >>    
> >>
> >
> >I think you don't need the profile_private member here, if you just use
> >container_of with ctx->prof_priv_kref in all users.
> >  
> >
> Sorry, I don't follow. We want the profile_private to be stored in the 
> spu_context, don't we?  How else would I be able to do that?  And 
> besides, wouldn't container_of need the struct name of profile_private?  
> SPUFS doesn't have access to the type.

The idea was to have spu_get_profile_private return the kref pointer,
and then change the user of that to do

+   if (!spu_info[spu_num] && the_spu) {
+   spu_info[spu_num] = container_of(
+   spu_get_profile_private(the_spu->ctx),
+   struct cached_info, cache_kref);
+   if (spu_info[spu_num])
+   kref_get(&spu_info[spu_num]->cache_ref);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-15 Thread Maynard Johnson


Arnd Bergmann wrote:


On Thursday 15 February 2007 00:52, Carl Love wrote:


 


--- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/Kconfig 2007-01-18 
16:43:14.0 -0600
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/Kconfig  2007-02-13 
19:04:46.271028904 -0600
@@ -7,7 +7,8 @@

config OPROFILE
tristate "OProfile system profiling (EXPERIMENTAL)"
-   depends on PROFILING
+   default m
+   depends on SPU_FS && PROFILING
help
  OProfile is a profiling system capable of profiling the
  whole system, include the kernel, kernel modules, libraries,
   



Milton already commented on this being wrong. I think what you want
is
depends on PROFILING && (SPU_FS = n || SPU_FS)

that should make sure that when SPU_FS=y that OPROFILE can not be 'm'.
 

Blast it!  I did this right on our development system, but neglected to 
update the patch correctly to remove this dependency and 'default m'.  
I'll fix in the next patch.


 


@@ -15,3 +16,10 @@

  If unsure, say N.

+config OPROFILE_CELL
+   bool "OProfile for Cell Broadband Engine"
+   depends on SPU_FS && OPROFILE
+   default y
+   help
+ OProfile for Cell BE requires special support enabled
+ by this option.
   



You should at least mention that this allows profiling the spus.
 


OK.

 


+#define EFWCALL  ENOSYS /* Use an existing error number that is as
+* close as possible for a FW call that failed.
+* The probability of the call failing is
+* very low.  Passing up the error number
+* ensures that the user will see an error
+* message saying OProfile did not start.
+* Dmesg will contain an accurate message
+* about the failure.
+*/
   



ENOSYS looks wrong though. It would appear to the user as if the oprofile
function in the kernel was not present. I'd suggest EIO, and not use 
an extra define for that.
 


Carl will reply to this.



 


static int
rtas_ibm_cbe_perftools(int subfunc, int passthru,
   void *address, unsigned long length)
{
u64 paddr = __pa(address);

-   return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, passthru,
-paddr >> 32, paddr & 0x, length);
+   pm_rtas_token = rtas_token("ibm,cbe-perftools");
+
+   if (unlikely(pm_rtas_token == RTAS_UNKNOWN_SERVICE)) {
+   printk(KERN_ERR
+  "%s: rtas token ibm,cbe-perftools unknown\n",
+  __FUNCTION__);
+   return -EFWCALL;
+   } else {
+
+		return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, 
+			 passthru, paddr >> 32, paddr & 0x, length); 
+	}

}
   



Are you now reading the rtas token every time you call rtas? that seems
like a waste of time.
 


Carl will reply.



 


+#define size 24
+#define ENTRIES  (0x1<<8) /* 256 */
+#define MAXLFSR  0xFF
+
+int initial_lfsr[] =
+{16777215, 3797240, 13519805, 11602690, 6497030, 7614675, 2328937, 2889445,
+ 12364575, 8723156, 2450594, 16280864, 14742496, 10904589, 6434212, 4996256,
+ 5814270, 13014041, 9825245, 410260, 904096, 15151047, 15487695, 3061843,
+ 16482682, 7938572, 4893279, 9390321, 4320879, 5686402, 1711063, 10176714,
+ 4512270, 1057359, 16700434, 5731602, 2070114, 16030890, 1208230, 15603106,
+ 11857845, 6470172, 1362790, 7316876, 8534496, 1629197, 10003072, 1714539,
+ 1814669, 7106700, 5427154, 3395151, 3683327, 12950450, 16620273, 12122372,
+ 7194999, 9952750, 3608260, 13604295, 2266835, 14943567, 7079230, 777380,
+ 4516801, 1737661, 8730333, 13796927, 3247181, 9950017, 3481896, 16527555,
+ 13116123, 14505033, 9781119, 4860212, 7403253, 13264219, 12269980, 100120,
+ 664506, 607795, 8274553, 13133688, 6215305, 13208866, 16439693, 3320753,
+ 8773582, 13874619, 1784784, 4513501, 11002978, 9318515, 3038856, 14254582,
+ 15484958, 15967857, 13504461, 13657322, 14724513, 13955736, 5695315, 7330509,
+ 12630101, 6826854, 439712, 4609055, 13288878, 1309632, 4996398, 11392266,
+ 793740, 7653789, 2472670, 14641200, 5164364, 5482529, 10415855, 1629108,
+ 2012376, 13661123, 14655718, 9534083, 16637925, 2537745, 9787923, 12750103,
+ 4660370, 3283461, 14862772, 7034955, 6679872, 8918232, 6506913, 103649,
+ 6085577, 13324033, 14251613, 11058220, 11998181, 3100233, 468898, 7104918,
+ 12498413, 14408165, 1208514, 15712321, 3088687, 14778333, 3632503, 11151952,
+ 98896, 9159367, 8866146, 4780737, 4925758, 12362320, 4122783, 8543358,
+ 7056879, 10876914, 6282881, 1686625, 5100373, 4573666, 9265515, 13593840,
+ 5853060, 110, 4237111, 1576, 14344137, 4608332, 6590210, 13745050,
+ 10916568, 12340402, 7145275, 4417153, 2300360, 12079643, 7608534, 15238251,
+ 4947424, 7014722, 3984546, 7168073, 10759589, 16293080, 3757181, 4577717,
+ 5163790, 2488841

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-15 Thread Arnd Bergmann

On Thursday 15 February 2007 00:52, Carl Love wrote:


> --- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/Kconfig   2007-01-18 
> 16:43:14.0 -0600
> +++ linux-2.6.20-rc1/arch/powerpc/oprofile/Kconfig2007-02-13 
> 19:04:46.271028904 -0600
> @@ -7,7 +7,8 @@
>  
>  config OPROFILE
>   tristate "OProfile system profiling (EXPERIMENTAL)"
> - depends on PROFILING
> + default m
> + depends on SPU_FS && PROFILING
>   help
> OProfile is a profiling system capable of profiling the
> whole system, include the kernel, kernel modules, libraries,

Milton already commented on this being wrong. I think what you want
is
depends on PROFILING && (SPU_FS = n || SPU_FS)

that should make sure that when SPU_FS=y that OPROFILE can not be 'm'.

> @@ -15,3 +16,10 @@
>  
> If unsure, say N.
>  
> +config OPROFILE_CELL
> + bool "OProfile for Cell Broadband Engine"
> + depends on SPU_FS && OPROFILE
> + default y
> + help
> +   OProfile for Cell BE requires special support enabled
> +   by this option.

You should at least mention that this allows profiling the spus.
  
> +#define EFWCALL  ENOSYS /* Use an existing error number that is as
> +  * close as possible for a FW call that failed.
> +  * The probability of the call failing is
> +  * very low.  Passing up the error number
> +  * ensures that the user will see an error
> +  * message saying OProfile did not start.
> +  * Dmesg will contain an accurate message
> +  * about the failure.
> +  */

ENOSYS looks wrong though. It would appear to the user as if the oprofile
function in the kernel was not present. I'd suggest EIO, and not use 
an extra define for that.


>  static int
>  rtas_ibm_cbe_perftools(int subfunc, int passthru,
>  void *address, unsigned long length)
>  {
>   u64 paddr = __pa(address);
>  
> - return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, passthru,
> -  paddr >> 32, paddr & 0x, length);
> + pm_rtas_token = rtas_token("ibm,cbe-perftools");
> +
> + if (unlikely(pm_rtas_token == RTAS_UNKNOWN_SERVICE)) {
> + printk(KERN_ERR
> +"%s: rtas token ibm,cbe-perftools unknown\n",
> +__FUNCTION__);
> + return -EFWCALL;
> + } else {
> +
> + return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, 
> +  passthru, paddr >> 32, paddr & 0x, length); 
> + }
>  }

Are you now reading the rtas token every time you call rtas? that seems
like a waste of time.


> +#define size 24
> +#define ENTRIES  (0x1<<8) /* 256 */
> +#define MAXLFSR  0xFF
> +
> +int initial_lfsr[] =
> +{16777215, 3797240, 13519805, 11602690, 6497030, 7614675, 2328937, 2889445,
> + 12364575, 8723156, 2450594, 16280864, 14742496, 10904589, 6434212, 4996256,
> + 5814270, 13014041, 9825245, 410260, 904096, 15151047, 15487695, 3061843,
> + 16482682, 7938572, 4893279, 9390321, 4320879, 5686402, 1711063, 10176714,
> + 4512270, 1057359, 16700434, 5731602, 2070114, 16030890, 1208230, 15603106,
> + 11857845, 6470172, 1362790, 7316876, 8534496, 1629197, 10003072, 1714539,
> + 1814669, 7106700, 5427154, 3395151, 3683327, 12950450, 16620273, 12122372,
> + 7194999, 9952750, 3608260, 13604295, 2266835, 14943567, 7079230, 777380,
> + 4516801, 1737661, 8730333, 13796927, 3247181, 9950017, 3481896, 16527555,
> + 13116123, 14505033, 9781119, 4860212, 7403253, 13264219, 12269980, 100120,
> + 664506, 607795, 8274553, 13133688, 6215305, 13208866, 16439693, 3320753,
> + 8773582, 13874619, 1784784, 4513501, 11002978, 9318515, 3038856, 14254582,
> + 15484958, 15967857, 13504461, 13657322, 14724513, 13955736, 5695315, 
> 7330509,
> + 12630101, 6826854, 439712, 4609055, 13288878, 1309632, 4996398, 11392266,
> + 793740, 7653789, 2472670, 14641200, 5164364, 5482529, 10415855, 1629108,
> + 2012376, 13661123, 14655718, 9534083, 16637925, 2537745, 9787923, 12750103,
> + 4660370, 3283461, 14862772, 7034955, 6679872, 8918232, 6506913, 103649,
> + 6085577, 13324033, 14251613, 11058220, 11998181, 3100233, 468898, 7104918,
> + 12498413, 14408165, 1208514, 15712321, 3088687, 14778333, 3632503, 11151952,
> + 98896, 9159367, 8866146, 4780737, 4925758, 12362320, 4122783, 8543358,
> + 7056879, 10876914, 6282881, 1686625, 5100373, 4573666, 9265515, 13593840,
> + 5853060, 110, 4237111, 1576, 14344137, 4608332, 6590210, 13745050,
> + 10916568, 12340402, 7145275, 4417153, 2300360, 12079643, 7608534, 15238251,
> + 4947424, 7014722, 3984546, 7168073, 10759589, 16293080, 3757181, 4577717,
> + 5163790, 2488841, 4650617, 3650022, 5440654, 1814617, 6939232, 15540909,
> + 501788, 1060986, 5058235, 5078222, 3734500, 10762065, 390862, 5172712,
> + 1070780, 790442

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-12 Thread Carl Love

On Sun, 2007-02-11 at 16:46 -0600, Milton Miller wrote:

[cut]

> 
> As far as I understand, you are providing access to a completely new
> hardware that is related to the PMU hardware by the fact that it
> collects a program counter.   It doesn't use the PMU counters nor the
> PMU event selection.
> 
> In fact, why can the existing op_model_cell profiling not run while
> the SPU profiling runs?   Is there a shared debug bus inside the
> chip?   Or just the data stream with your buffer_sync code?
> 

There are two reasons you cannot do SPU profiling and profiling on non
SPU events at the sametime.  1) the SPU PC values are routed on the
debug bus.  You cannot also route the signals for other non SPU events
on the debug bus since they will conflict.  Specifically, the signals
would get logically OR'd on the bus.  The exception is PPU cycles which
does not use the debug bus.  2) the hardware that captures the SPU
program counters has some shared components with the HW performance
counters.  To use the SPU program counter hardware, the performance
counter hardware must be disabled.

In summary, we cannot do SPU cycle profiling and non SPU event profiling
at the same time due to limitations in the hardware.

[cut]

> milton
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-11 Thread Milton Miller



On Feb 9, 2007, at 10:17 AM, Carl Love wrote:


On Thu, 2007-02-08 at 20:46 -0600, Milton Miller wrote:

 On Feb 8, 2007, at 4:51 PM, Carl Love wrote:


On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote:

On Thursday 08 February 2007 15:18, Milton Miller wrote:


1) sample rate setup

In the current patch, the user specifies a sample rate as a 
time

interval.
The kernel is (a) calling cpufreq to get the current cpu
frequency,
(b)
converting the rate to a cycle count, (c) converting this to a
24 bit
LFSR count, an iterative algorithm (in this patch, starting 
from

one of 256 values so a max of 2^16 or 64k iterations), (d)
calculating
an trace unload interval.   In addition, a cpufreq notifier is
registered
to recalculate on frequency changes.


No.  The user issues the command opcontrol --event:N  where N is the
number of events (cycles, l2 cache misses, instructions retired etc)
that are to elapse between collecting the samples.


So you are saying that b and c are primary, and a is used to calculate
a safe value for d.   All of the above work is dont, just from a
different starting point?


There are two things 1) setup the LFSR to control how often the 
Hardware

puts samples into the trace buffer.  2) setup the kernel timer to read
the trace buffer (your d, d is a function of the cpu freq) and process
the samples.

(c is nonsense)



Well, its cacluate the rate from the count vs count from the rate.



The kernel timer was set with the goal the the hardware trace buffer
would not get more then half full to ensure we would not lose samples
even for the maximum rate that the hardware would be adding samples to
the trace buffer (user specified N=100,000).


That should be well commented.




The OProfile passes
the value N to the kernel via the variable ctr[i].count.  Where i is
the
performance counter entry for that event.


Ok I haven't looked a the api closely.



Actually, the oprofile userspace fills out files in a file system to 
tell
the kernel what it needs to know.   The powerpc code defines the 
reources

needed to use the PMU hardware, which is (1) common mux selects, for
processors that need them, and (2) a set of directories, one for each 
of the

pmu counters, each of which contain the controls for that counter (such
as enable kernel space, enable user space, event timeout to interrupt or
sample collection, etc.   The ctr[i].count is one of these files.




Specifically with SPU
profiling, we do not use performance counters because the CELL HW 
does

not allow the normal the PPU to read the SPU PC when a performance
counter interrupt occurs.  We are using some additional hw support in
the chip that allows us to periodically capture the SPU PC.  There is
an
LFSR hardware counter that can be started at an arbitrary LFSR value.
When the last LFSR value in the sequence is reached, a sample is 
taken
and stored in the trace buffer.  Hence, the value of N specified by 
the
user must get converted to the LFSR value that is N from the end of 
the

sequence.


Ok so its arbitray load count to max vs count and compare.   A 
critical

detail when computing the value to load, but the net result is the
same; the value for the count it hard to determine.


The above statement makes no sense to me.


I think I talked past you.  Your description of hadrware vs mine was
different in that the counter always ends at a specified point and is
loaded with the variable count, where mine had it comparing to the
count as it incremented.

However, I now see that you were referring to the fact that what the
user specifes, the count, has to be converted to a LFSR value.  My point
is this can be done in the user-space oprofile code.  It already has
to look up magic numbers for setting up event selection muxes for other
hardware, adding a lfsr calculation is not beyond resoon.  Nor is having
it provide two values in two files.



Determining the initial LFSR value that is N values from the last value
in the sequence is not easy to do.


Well, its easy, its just order(N).


The same clock that the processor is running at is used to
control the LFSR count.  Hence the LFSR counter increments once per 
CPU

clock cycle regardless of the CPU frequency or changes in the
frequency.
There is no calculation for the LFSR value that is a function of the
processor frequency.  There is no need to adjust the LFSR when the
processor frequency changes.


Oh, so the lfsr doesn't have to be recomputed, only the time
between unloads.


The LFSR value is computed ONCE when you start OProfile.  The value is
setup in the hardware once when OProfile starts.  The hardware will
always restart with the value given to it after it reaches the last
value in the sequence.  What you call the "time between unloads" is the
time at which you schedule the kernel routine to empty the trace 
buffer.

It is calculated once.  It would only need to be recomputed if the cpu
frequency changed.






The obvious problem is

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-09 Thread Milton Miller



On Feb 9, 2007, at 1:10 PM, Arnd Bergmann wrote:


On Friday 09 February 2007 19:47, Milton Miller wrote:

On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote:




Doing the translation in two stages in user space, as you
suggest here, definitely makes sense to me. I think it
can be done a little simpler though:

Why would you need the accurate dcookie information to be
provided by the kernel? The ELF loader is done in user
space, and the kernel only reproduces what it thinks that
came up with. If the kernel only gives the dcookie information
about the SPU ELF binary to the oprofile user space, then
that can easily recreate the same mapping.


Actually, I was trying to cover issues such as anonymous
memory.   If the kernel doesn't generate dcookies for
the load segments the information is not recoverable once
the task exits.  This would also allow the loader to create
an artifical elf header that covered both the base executable
and a dynamicly linked one.

Other alternatives exist, such as a structure for context-id
that would have its own identifing magic and an array of elf
header pointers.


But _why_ do you want to solve that problem? we don't have
dynamically linked binaries and I really don't see why the loader
would want to create artificial elf headers...


I'm explainign how they could be handled.   Actully I think
the other proposal (an identified structure that points to
sevear elf headers) would be more approprate.  As you point
out, if there are presently no dynamic libraries in use, it
doesn't have to be solved today.  I'm just trying to make
the code future proof, or at least a clear path forward.




The kernel still needs to provide the overlay identifiers
though.


Yes, which means it needs to parse the header (or libpse
be enhanced to write the monitor info to a spufs file).


we thought about this in the past and discarded it because of
the complexity of an spufs interface that would handle this
correctly.


Not sure what would be difficuult, and it would allow other
binary formats.   But parsing the headers in the kernel
means existing userspace doesn't have to be upgraded, so I
am not proposing this requirement.




yes, this sounds nice. But tt does not at all help accuracy,
only performance, right?


It allows the user space to know when the sample was taken
and  be aware of the ambiguity.   If the sample rate is
high enough in relation to the overlay switch rate, user space
could decide to discard the ambiguous samples.


yes, good point.


This approach allows multiple objects by its nature.  A new
elf header could be constructed in memory that contained
the union of the elf objects load segments, and the tools
will magically work.   Alternatively the object id could
point to a new structure, identified via a new header, that
it points to other elf headers (easily differentiated by the
elf magic headers).   Other binary formats, including several
objects in a ar archive, could be supported.


Yes, that would be a new feature if the kernel passed dcookie
information for every section, but I doubt that it is worth
it. I have not seen any program that allows loading code
from more than one ELF file. In particular, the ELF format
on the SPU is currently lacking the relocation mechanisms
that you would need for resolving spu-side symbols at load
time


Actually, It could check all load segments, and only report
those where the dcookie changes (as opposed to the offset).


I'm not really following you here, but probably you misunderstood
my point as well.


I was thinking in terms of dyanmic libraries, and totally skipped
your comment about the relocaiton info being missing.   My reply
point was that the table could be compressed to the current
entry if all hased to the same vm area.




This seems to incur a run-time overhead on the SPU even if not
profiling, I would consider that not acceptable.


It definitely is overhead.  Which means it would have to be
optional, like gprof.


There is some work going on for another profiler independent
of the hardware feature that only relies on instrumenting the
spu executable for things like DMA transfers and overlay
changes.


Regardless, its beyond the current scope.

milton

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-09 Thread Arnd Bergmann

On Friday 09 February 2007 19:47, Milton Miller wrote:
> On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote:
> 

> > Doing the translation in two stages in user space, as you
> > suggest here, definitely makes sense to me. I think it
> > can be done a little simpler though:
> >
> > Why would you need the accurate dcookie information to be
> > provided by the kernel? The ELF loader is done in user
> > space, and the kernel only reproduces what it thinks that
> > came up with. If the kernel only gives the dcookie information
> > about the SPU ELF binary to the oprofile user space, then
> > that can easily recreate the same mapping.
> 
> Actually, I was trying to cover issues such as anonymous
> memory.   If the kernel doesn't generate dcookies for
> the load segments the information is not recoverable once
> the task exits.  This would also allow the loader to create
> an artifical elf header that covered both the base executable
> and a dynamicly linked one.
> 
> Other alternatives exist, such as a structure for context-id
> that would have its own identifing magic and an array of elf
> header pointers.

But _why_ do you want to solve that problem? we don't have
dynamically linked binaries and I really don't see why the loader
would want to create artificial elf headers...

> > The kernel still needs to provide the overlay identifiers
> > though.
> 
> Yes, which means it needs to parse the header (or libpse
> be enhanced to write the monitor info to a spufs file).

we thought about this in the past and discarded it because of
the complexity of an spufs interface that would handle this
correctly. 

> > yes, this sounds nice. But tt does not at all help accuracy,
> > only performance, right?
> 
> It allows the user space to know when the sample was taken
> and  be aware of the ambiguity.   If the sample rate is
> high enough in relation to the overlay switch rate, user space
> could decide to discard the ambiguous samples.

yes, good point.

> >> This approach allows multiple objects by its nature.  A new
> >> elf header could be constructed in memory that contained
> >> the union of the elf objects load segments, and the tools
> >> will magically work.   Alternatively the object id could
> >> point to a new structure, identified via a new header, that
> >> it points to other elf headers (easily differentiated by the
> >> elf magic headers).   Other binary formats, including several
> >> objects in a ar archive, could be supported.
> >
> > Yes, that would be a new feature if the kernel passed dcookie
> > information for every section, but I doubt that it is worth
> > it. I have not seen any program that allows loading code
> > from more than one ELF file. In particular, the ELF format
> > on the SPU is currently lacking the relocation mechanisms
> > that you would need for resolving spu-side symbols at load
> > time
> 
> Actually, It could check all load segments, and only report
> those where the dcookie changes (as opposed to the offset).

I'm not really following you here, but probably you misunderstood
my point as well.

> > This seems to incur a run-time overhead on the SPU even if not
> > profiling, I would consider that not acceptable.
> 
> It definitely is overhead.  Which means it would have to be
> optional, like gprof.

There is some work going on for another profiler independent
of the hardware feature that only relies on instrumenting the
spu executable for things like DMA transfers and overlay
changes. 

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-09 Thread Milton Miller



On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote:


On Thursday 08 February 2007 15:18, Milton Miller wrote:


The current patch specifically identifies that only single
elf objects are handled.  There is no code to handle dynamic
linked libraries or overlays.   Nor is there any method to
present samples that may have been collected during context
switch processing, they must be discarded.


I thought it already did handle overlays, what did I miss here?


It does, see my reply to Maynard.  Not sure what I was thinking
when I wrote this, possibly I was thinking of the inaccuracies.




My proposal is to change what is presented to user space.  Instead
of trying to translate the SPU address to the backing file
as the samples are recorded, store the samples as the SPU
context and address.  The context switch would record tid,
pid, object id as it does now.   In addition, if this is a
new object-id, the kernel would read elf headers as it does
today.  However, it would then proceed to provide accurate
dcookie information for each loader region and overlay.


Doing the translation in two stages in user space, as you
suggest here, definitely makes sense to me. I think it
can be done a little simpler though:

Why would you need the accurate dcookie information to be
provided by the kernel? The ELF loader is done in user
space, and the kernel only reproduces what it thinks that
came up with. If the kernel only gives the dcookie information
about the SPU ELF binary to the oprofile user space, then
that can easily recreate the same mapping.


Actually, I was trying to cover issues such as anonymous
memory.   If the kernel doesn't generate dcookies for
the load segments the information is not recoverable once
the task exits.  This would also allow the loader to create
an artifical elf header that covered both the base executable
and a dynamicly linked one.

Other alternatives exist, such as a structure for context-id
that would have its own identifing magic and an array of elf
header pointers.




The kernel still needs to provide the overlay identifiers
though.


Yes, which means it needs to parse the header (or libpse
be enhanced to write the monitor info to a spufs file).




To identify which overlays are active, (instead of the present
read on use and search the list to translate approach) the
kernel would record the location of the overlay identifiers
as it parsed the kernel, but would then read the identification
word and would record the present value as an sample from
a separate but related stream.   The kernel could maintain
the last value for each overlay and only send profile events
for the deltas.


right.


This approach trades translation lookup overhead for each
recorded sample for a burst of data on new context activation.
In addition it exposes the sample point of the overlay identifier
vs the address collection.  This allows the ambiguity to be
exposed to user space.   In addition, with the above proposed
kernel timer vs sample collection, user space could limit the
elapsed time between the address collection and the overlay
id check.


yes, this sounds nice. But tt does not at all help accuracy,
only performance, right?


It allows the user space to know when the sample was taken
and  be aware of the ambiguity.   If the sample rate is
high enough in relation to the overlay switch rate, user space
could decide to discard the ambiguous samples.




This approach allows multiple objects by its nature.  A new
elf header could be constructed in memory that contained
the union of the elf objects load segments, and the tools
will magically work.   Alternatively the object id could
point to a new structure, identified via a new header, that
it points to other elf headers (easily differentiated by the
elf magic headers).   Other binary formats, including several
objects in a ar archive, could be supported.


Yes, that would be a new feature if the kernel passed dcookie
information for every section, but I doubt that it is worth
it. I have not seen any program that allows loading code
from more than one ELF file. In particular, the ELF format
on the SPU is currently lacking the relocation mechanisms
that you would need for resolving spu-side symbols at load
time


Actually, It could check all load segments, and only report
those where the dcookie changes (as opposed to the offset).


.


If better overlay identification is required, in theory the
overlay switch code could be augmented to record the switches
(DMA reference time from the PowerPC memory and record a
relative decrementer in the SPU), this is obviously a future
item.  But it is facilitated by having user space resolve the
SPU to source file translation.


This seems to incur a run-time overhead on the SPU even if not
profiling, I would consider that not acceptable.


It definitely is overhead.  Which means it would have to be
optional, like gprof.


milton

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-09 Thread Milton Miller



On Feb 8, 2007, at 5:59 PM, Maynard Johnson wrote:


Milton,
Thank you for your comments.  Carl will reply to certain parts of your 
posting where he's more knowledgeable than I.  See my replies below.




Thanks for the pleasant tone and dialog.


Milton Miller wrote:

On Feb 6, 2007, at 5:02 PM, Carl Love wrote:

This is the first update to the patch previously posted by Maynard
Johnson as "PATCH 4/4. Add support to OProfile for profiling CELL".





Data collected


The current patch starts tackling these translation issues for the
presently common case of a static self contained binary from a single
file, either single separate source file or embedded in the data of
the host application.   When creating the trace entry for a SPU
context switch, it records the application owner, pid, tid, and
dcookie of the main executable.   It addition, it looks up the
object-id as a virtual address and records the offset if it is 
non-zero,

or the dcookie of the object if it is zero.   The code then creates
a data structure by reading the elf headers from the user process
(at the address given by the object-id) and building a list of
SPU address to elf object offsets, as specified by the ELF loader
headers.   In addition to the elf loader section, it processes the
overlay headers and records the address, size, and magic number of
the overlay.

When the hardware trace entries are processed, each address is
looked up this structure and translated to the elf offset.  If
it is an overlay region, the overlay identify word is read and
the list is searched for the matching overlay.  The resulting
offset is sent to the oprofile system.

The current patch specifically identifies that only single
elf objects are handled.  There is no code to handle dynamic
linked libraries or overlays.   Nor is there any method to


Yes, we do handle overlays.  (Note: I'm looking into a bug
right now in our overlay support.)


I knew you handled overlays, and I did not mean to say that
you did not.   I am not sure how that got there.  I may have
been thinking of the kernel supplied context switch code
discussion, or how the code supplied dcookie or offset but
not both.  Actually, I might have been referring to the
fact that overlays are guessed rather than recorded.


present samples that may have been collected during context
switch processing, they must be discarded.


My proposal is to change what is presented to user space.  Instead
of trying to translate the SPU address to the backing file
as the samples are recorded, store the samples as the SPU
context and address.  The context switch would record tid,
pid, object id as it does now.   In addition, if this is a
new object-id, the kernel would read elf headers as it does
today.  However, it would then proceed to provide accurate
dcookie information for each loader region and overlay.  To
identify which overlays are active, (instead of the present
read on use and search the list to translate approach) the
kernel would record the location of the overlay identifiers
as it parsed the kernel, but would then read the identification
word and would record the present value as an sample from
a separate but related stream.   The kernel could maintain
the last value for each overlay and only send profile events
for the deltas.


Discussions on this topic in the past have resulted in the
current implementation precisely because we're able to record
the samples as fileoffsets, just as the userspace tools expect.


I was not part of the previous discussions, so please forgive me.


I haven't had time to check out how much this would impact the
userspace tools, but my gut feel is that it would be quite
significant.  If we were developing this module with a matching
newly-created userspace tool, I would be more inclined to agree
that this makes sense.


I have not yet studied the user space tool.   In fact, when I
made this proposal, I had not studied the kernel oprofile code
either, although I had read the concepts and discussion of the
event buffer when the base patch was added to the kernel.

I have read and now consider myself to have some understanding
of the kernel side.  I note that the user space tool calls itself
alpha and the kernel support experimental.   I only looked at
the user space enough to determine it is written in C++.

I would hope the tool would be modular enough to insert a data
transformation pass to do this conversion that the kernel is
doing.


But you give no rationale for your
proposal that justifies the change.  The current implementation
works, it has no impact on normal, non-profiling behavior,  and
the overhead during profiling is not noticeable.


I was proposing this for several of reasons.

One, there were expressed limitations  in the current
proposal.  There is a requirement that everything be linked
into one ELF object for it to be profiled seemed significant
to me.  This implies that shared libraries (well, dynamic
linked ones) have no path forward in this framew

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-09 Thread Carl Love

On Thu, 2007-02-08 at 20:46 -0600, Milton Miller wrote:
>  On Feb 8, 2007, at 4:51 PM, Carl Love wrote:
>  
> > On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote:
> >> On Thursday 08 February 2007 15:18, Milton Miller wrote:
> >>
> >>> 1) sample rate setup
> >>>
> >>> In the current patch, the user specifies a sample rate as a time
> >>> interval.
> >>> The kernel is (a) calling cpufreq to get the current cpu 
> >>> frequency,
> >>> (b)
> >>> converting the rate to a cycle count, (c) converting this to a 
> >>> 24 bit
> >>> LFSR count, an iterative algorithm (in this patch, starting from
> >>> one of 256 values so a max of 2^16 or 64k iterations), (d)
> >>> calculating
> >>> an trace unload interval.   In addition, a cpufreq notifier is
> >>> registered
> >>> to recalculate on frequency changes.
> >
> > No.  The user issues the command opcontrol --event:N  where N is the
> > number of events (cycles, l2 cache misses, instructions retired etc)
> > that are to elapse between collecting the samples.
>  
> So you are saying that b and c are primary, and a is used to calculate
> a safe value for d.   All of the above work is dont, just from a
> different starting point?

There are two things 1) setup the LFSR to control how often the Hardware
puts samples into the trace buffer.  2) setup the kernel timer to read
the trace buffer (your d, d is a function of the cpu freq) and process
the samples.  

(c is nonsense)

The kernel timer was set with the goal the the hardware trace buffer
would not get more then half full to ensure we would not lose samples
even for the maximum rate that the hardware would be adding samples to
the trace buffer (user specified N=100,000).  

>  
> 
> > The OProfile passes
> > the value N to the kernel via the variable ctr[i].count.  Where i is 
> > the
> > performance counter entry for that event.
>  
> Ok I haven't looked a the api closely.
>  
> > Specifically with SPU
> > profiling, we do not use performance counters because the CELL HW does
> > not allow the normal the PPU to read the SPU PC when a performance
> > counter interrupt occurs.  We are using some additional hw support in
> > the chip that allows us to periodically capture the SPU PC.  There is 
> > an
> > LFSR hardware counter that can be started at an arbitrary LFSR value.
> > When the last LFSR value in the sequence is reached, a sample is taken
> > and stored in the trace buffer.  Hence, the value of N specified by the
> > user must get converted to the LFSR value that is N from the end of the
> > sequence.
>  
> Ok so its arbitray load count to max vs count and compare.   A critical
> detail when computing the value to load, but the net result is the
> same; the value for the count it hard to determine.

The above statement makes no sense to me.  

Determining the initial LFSR value that is N values from the last value
in the sequence is not easy to do.  

>  
> > The same clock that the processor is running at is used to
> > control the LFSR count.  Hence the LFSR counter increments once per CPU
> > clock cycle regardless of the CPU frequency or changes in the 
> > frequency.
> > There is no calculation for the LFSR value that is a function of the
> > processor frequency.  There is no need to adjust the LFSR when the
> > processor frequency changes.
>  
> 
> 
> Oh, so the lfsr doesn't have to be recomputed, only the time
> between unloads.

The LFSR value is computed ONCE when you start OProfile.  The value is
setup in the hardware once when OProfile starts.  The hardware will
always restart with the value given to it after it reaches the last
value in the sequence.  What you call the "time between unloads" is the
time at which you schedule the kernel routine to empty the trace buffer.
It is calculated once.  It would only need to be recomputed if the cpu
frequency changed.

>  
> >
> > Milton had a comment about the code
> >
> >  if (ctr[0].event == SPU_CYCLES_EVENT_NUM) {
> >> + spu_cycle_reset = ctr[0].count;
> >> + return;
> >> + }
> >
> > Well, given the above description, it is clear that if you are doing 
> > SPU
> > event profiling, the value N is put into the cntr[0].cnt entry since
> > there is only one event.  Thus in cell_global_start_spu() you use
> > spu_cycle_reset as the argument to the lfsr calculation routine to get
> > the LFSR value that is N from the end of the sequence.
>  
> I was looking at the patch and the context was not very good.   You
> might consider adding -p to your diff command, it provides the function
> name after the @@.
>  
> However, in this case, I think I just need to see the final result.
>  
> >
> >>>
> >>> The obvious problem is step (c), running a loop potentially 64
> >>> thousand
> >>> times in kernel space will have a noticeable impact on other 
> >>> threads.
> >>>
> >>> I propose instead that user space perform the above 4 steps, and
> >>> provide
> >>> the kernel with two inputs: (1) the v

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-08 Thread Milton Miller



On Feb 8, 2007, at 4:51 PM, Carl Love wrote:


On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote:

On Thursday 08 February 2007 15:18, Milton Miller wrote:


1) sample rate setup

In the current patch, the user specifies a sample rate as a time
interval.
The kernel is (a) calling cpufreq to get the current cpu 
frequency,

(b)
converting the rate to a cycle count, (c) converting this to a 
24 bit

LFSR count, an iterative algorithm (in this patch, starting from
one of 256 values so a max of 2^16 or 64k iterations), (d)
calculating
an trace unload interval.   In addition, a cpufreq notifier is
registered
to recalculate on frequency changes.


No.  The user issues the command opcontrol --event:N  where N is the
number of events (cycles, l2 cache misses, instructions retired etc)
that are to elapse between collecting the samples.


So you are saying that b and c are primary, and a is used to calculate
a safe value for d.   All of the above work is dont, just from a
different starting point?



The OProfile passes
the value N to the kernel via the variable ctr[i].count.  Where i is 
the

performance counter entry for that event.


Ok I haven't looked a the api closely.


Specifically with SPU
profiling, we do not use performance counters because the CELL HW does
not allow the normal the PPU to read the SPU PC when a performance
counter interrupt occurs.  We are using some additional hw support in
the chip that allows us to periodically capture the SPU PC.  There is 
an

LFSR hardware counter that can be started at an arbitrary LFSR value.
When the last LFSR value in the sequence is reached, a sample is taken
and stored in the trace buffer.  Hence, the value of N specified by the
user must get converted to the LFSR value that is N from the end of the
sequence.


Ok so its arbitray load count to max vs count and compare.   A critical
detail when computing the value to load, but the net result is the
same; the value for the count it hard to determine.


The same clock that the processor is running at is used to
control the LFSR count.  Hence the LFSR counter increments once per CPU
clock cycle regardless of the CPU frequency or changes in the 
frequency.

There is no calculation for the LFSR value that is a function of the
processor frequency.  There is no need to adjust the LFSR when the
processor frequency changes.




Oh, so the lfsr doesn't have to be recomputed, only the time
between unloads.



Milton had a comment about the code

 if (ctr[0].event == SPU_CYCLES_EVENT_NUM) {

+ spu_cycle_reset = ctr[0].count;
+ return;
+ }


Well, given the above description, it is clear that if you are doing 
SPU

event profiling, the value N is put into the cntr[0].cnt entry since
there is only one event.  Thus in cell_global_start_spu() you use
spu_cycle_reset as the argument to the lfsr calculation routine to get
the LFSR value that is N from the end of the sequence.


I was looking at the patch and the context was not very good.   You
might consider adding -p to your diff command, it provides the function
name after the @@.

However, in this case, I think I just need to see the final result.





The obvious problem is step (c), running a loop potentially 64
thousand
times in kernel space will have a noticeable impact on other 
threads.


I propose instead that user space perform the above 4 steps, and
provide
the kernel with two inputs: (1) the value to load in the LFSR 
and (2)
the periodic frequency / time interval at which to empty the 
hardware

trace buffer, perform sample analysis, and send the data to the
oprofile
subsystem.

There should be no security issues with this approach.   If the 
LFSR

value
is calculated incorrectly, either it will be too short, causing 
the

trace
array to overfill and data to be dropped, or it will be too 
long, and

there will be fewer samples.   Likewise, the kernel periodic poll
can be
too long, again causing overflow, or too frequent, causing only 
timer

execution overhead.

Various data is collected by the kernel while processing the
periodic timer,
this approach would also allow the profiling tools to control the
frequency of this collection.   More frequent collection results 
in

more
accurate sample data, with the linear cost of poll execution
overhead.

Frequency changes can be handled either by the profile code 
setting
collection at a higher than necessary rate, or by interacting 
with

the
governor to limit the speeds.

Optionally, the kernel can add a record indicating that some 
data was

likely dropped if it is able to read all 256 entries without
underflowing
the array.  This can be used as hint to user space that the 
kernel

time
was too long for the collection rate.


Moving the sample rate computation to user space sounds like the right
idea, but why not have a more drastic version of it:


No, I do not

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-08 Thread Maynard Johnson


Milton,
Thank you for your comments.  Carl will reply to certain parts of your 
posting where he's more knowledgeable than I.  See my replies below.


-Maynard

Milton Miller wrote:


On Feb 6, 2007, at 5:02 PM, Carl Love wrote:

 


This is the first update to the patch previously posted by Maynard
Johnson as "PATCH 4/4. Add support to OProfile for profiling CELL".


   

 


[snip]



Data collected


The current patch starts tackling these translation issues for the
presently common case of a static self contained binary from a single
file, either single separate source file or embedded in the data of
the host application.   When creating the trace entry for a SPU
context switch, it records the application owner, pid, tid, and
dcookie of the main executable.   It addition, it looks up the
object-id as a virtual address and records the offset if it is non-zero,
or the dcookie of the object if it is zero.   The code then creates
a data structure by reading the elf headers from the user process
(at the address given by the object-id) and building a list of
SPU address to elf object offsets, as specified by the ELF loader
headers.   In addition to the elf loader section, it processes the
overlay headers and records the address, size, and magic number of
the overlay.

When the hardware trace entries are processed, each address is
looked up this structure and translated to the elf offset.  If
it is an overlay region, the overlay identify word is read and
the list is searched for the matching overlay.  The resulting
offset is sent to the oprofile system.

The current patch specifically identifies that only single
elf objects are handled.  There is no code to handle dynamic
linked libraries or overlays.   Nor is there any method to
 

Yes, we do handle overlays.  (Note: I'm looking into a bug right now in 
our overlay support.)



present samples that may have been collected during context
switch processing, they must be discarded.


My proposal is to change what is presented to user space.  Instead
of trying to translate the SPU address to the backing file
as the samples are recorded, store the samples as the SPU
context and address.  The context switch would record tid,
pid, object id as it does now.   In addition, if this is a
new object-id, the kernel would read elf headers as it does
today.  However, it would then proceed to provide accurate
dcookie information for each loader region and overlay.  To
identify which overlays are active, (instead of the present
read on use and search the list to translate approach) the
kernel would record the location of the overlay identifiers
as it parsed the kernel, but would then read the identification
word and would record the present value as an sample from
a separate but related stream.   The kernel could maintain
the last value for each overlay and only send profile events
for the deltas.
 

Discussions on this topic in the past have resulted in the current 
implementation precisely because we're able to record the samples as 
fileoffsets, just as the userspace tools expect.   I haven't had time to 
check out how much this would impact the userspace tools, but my gut 
feel is that it would be quite significant.  If we were developing this 
module with a matching newly-created userspace tool, I would be more 
inclined to agree that this makes sense.  But you give no rationale for 
your proposal that justifies the change.  The current implementation 
works, it has no impact on normal, non-profiling behavior,  and the 
overhead during profiling is not noticeable.



This approach trades translation lookup overhead for each
recorded sample for a burst of data on new context activation.
In addition it exposes the sample point of the overlay identifier
vs the address collection.  This allows the ambiguity to be
exposed to user space.   In addition, with the above proposed
kernel timer vs sample collection, user space could limit the
elapsed time between the address collection and the overlay
id check.
 

Yes, there is a window here where an overlay could occur before we 
finish processing a group of samples that were actually taken from a 
different overlay.  The obvious way to prevent that is for the kernel 
(or SPUFS) to be notified of the overlay and let OProfile know that we 
need to drain (perhaps discard would be best) our sample trace buffer.  
As you indicate above, your proposal faces the same issue, but would 
just decrease the number of bogus samples.  I contend that the relative 
number of bogus samples will be quite low in either case.  Ideally, we 
should have a mechanism to eliminate them completely so as to avoid 
confusion the user's part when they're looking at a report.  Even a few 
bogus samples in the wrong place can be troubling.  Such a mechanism 
will be a good future enhancement.


[snip]


milton
--
[EMAIL PROTECTED]   Milton Miller
Speaking for myself only.

___
Linuxppc-dev mailing list
[EMAIL PROTEC

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-08 Thread Carl Love

On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote:
> On Thursday 08 February 2007 15:18, Milton Miller wrote:
>  
> > 1) sample rate setup
> > 
> > In the current patch, the user specifies a sample rate as a time 
> > interval.
> > The kernel is (a) calling cpufreq to get the current cpu frequency, 
> > (b)
> > converting the rate to a cycle count, (c) converting this to a 24 bit
> > LFSR count, an iterative algorithm (in this patch, starting from
> > one of 256 values so a max of 2^16 or 64k iterations), (d) 
> > calculating
> > an trace unload interval.   In addition, a cpufreq notifier is 
> > registered
> > to recalculate on frequency changes.

No.  The user issues the command opcontrol --event:N  where N is the
number of events (cycles, l2 cache misses, instructions retired etc)
that are to elapse between collecting the samples.  The OProfile passes
the value N to the kernel via the variable ctr[i].count.  Where i is the
performance counter entry for that event.  Specifically with SPU
profiling, we do not use performance counters because the CELL HW does
not allow the normal the PPU to read the SPU PC when a performance
counter interrupt occurs.  We are using some additional hw support in
the chip that allows us to periodically capture the SPU PC.  There is an
LFSR hardware counter that can be started at an arbitrary LFSR value.
When the last LFSR value in the sequence is reached, a sample is taken
and stored in the trace buffer.  Hence, the value of N specified by the
user must get converted to the LFSR value that is N from the end of the
sequence.  The same clock that the processor is running at is used to
control the LFSR count.  Hence the LFSR counter increments once per CPU
clock cycle regardless of the CPU frequency or changes in the frequency.
There is no calculation for the LFSR value that is a function of the
processor frequency.  There is no need to adjust the LFSR when the
processor frequency changes.  

Milton had a comment about the code 

 if (ctr[0].event == SPU_CYCLES_EVENT_NUM) {
> + spu_cycle_reset = ctr[0].count;
> + return;
> + }

Well, given the above description, it is clear that if you are doing SPU
event profiling, the value N is put into the cntr[0].cnt entry since
there is only one event.  Thus in cell_global_start_spu() you use
spu_cycle_reset as the argument to the lfsr calculation routine to get
the LFSR value that is N from the end of the sequence.

> > 
> > The obvious problem is step (c), running a loop potentially 64 
> > thousand
> > times in kernel space will have a noticeable impact on other threads.
> > 
> > I propose instead that user space perform the above 4 steps, and 
> > provide
> > the kernel with two inputs: (1) the value to load in the LFSR and (2)
> > the periodic frequency / time interval at which to empty the hardware
> > trace buffer, perform sample analysis, and send the data to the 
> > oprofile
> > subsystem.
> > 
> > There should be no security issues with this approach.   If the LFSR 
> > value
> > is calculated incorrectly, either it will be too short, causing the 
> > trace
> > array to overfill and data to be dropped, or it will be too long, and
> > there will be fewer samples.   Likewise, the kernel periodic poll 
> > can be
> > too long, again causing overflow, or too frequent, causing only timer
> > execution overhead.
> > 
> > Various data is collected by the kernel while processing the 
> > periodic timer,
> > this approach would also allow the profiling tools to control the
> > frequency of this collection.   More frequent collection results in 
> > more
> > accurate sample data, with the linear cost of poll execution 
> > overhead.
> > 
> > Frequency changes can be handled either by the profile code setting
> > collection at a higher than necessary rate, or by interacting with 
> > the
> > governor to limit the speeds.
> > 
> > Optionally, the kernel can add a record indicating that some data was
> > likely dropped if it is able to read all 256 entries without 
> > underflowing
> > the array.  This can be used as hint to user space that the kernel 
> > time
> > was too long for the collection rate.
>  
> Moving the sample rate computation to user space sounds like the right
> idea, but why not have a more drastic version of it:

No, I do not agree.  The user/kernel API pass N where N is the number of
events between samples.  We are not at liberty to just change the API.
We we did do this, we fully expect that John Levon will push back saying
why make an architecture specific API change when it isn't necessary.  

Please define "drastic" in this context.  Do you mean make the table
bigger i.e. more then 256 precomputed elements?  I did 256 since Arnd
seemed to think that would be a reasonable size. Based on his example
How much kernel space are we willing to use to same some computati

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-08 Thread Adrian Reber

On Thu, Feb 08, 2007 at 06:21:56PM +0100, Arnd Bergmann wrote:
[...]
> Moving the sample rate computation to user space sounds like the right
> idea, but why not have a more drastic version of it:
> 
> Right now, all products that support this feature run at the same clock
> rate (3.2 Ghz), with cpufreq, we can reduce this to 1.6 Ghz. If I understand
> this correctly, the value depends only on the frequency, so we could simply
> hardcode this in the kernel, and print out a warning message if we ever
> encounter a different frequency, right?

Just for the record... CAB is running with 2.8 GHz. At least all the boards
I have seen.

Adrian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-08 Thread Arnd Bergmann

On Thursday 08 February 2007 15:18, Milton Miller wrote:

> 1) sample rate setup
> 
> In the current patch, the user specifies a sample rate as a time 
> interval.
> The kernel is (a) calling cpufreq to get the current cpu frequency, 
> (b)
> converting the rate to a cycle count, (c) converting this to a 24 bit
> LFSR count, an iterative algorithm (in this patch, starting from
> one of 256 values so a max of 2^16 or 64k iterations), (d) 
> calculating
> an trace unload interval.   In addition, a cpufreq notifier is 
> registered
> to recalculate on frequency changes.
> 
> The obvious problem is step (c), running a loop potentially 64 
> thousand
> times in kernel space will have a noticeable impact on other threads.
> 
> I propose instead that user space perform the above 4 steps, and 
> provide
> the kernel with two inputs: (1) the value to load in the LFSR and (2)
> the periodic frequency / time interval at which to empty the hardware
> trace buffer, perform sample analysis, and send the data to the 
> oprofile
> subsystem.
> 
> There should be no security issues with this approach.   If the LFSR 
> value
> is calculated incorrectly, either it will be too short, causing the 
> trace
> array to overfill and data to be dropped, or it will be too long, and
> there will be fewer samples.   Likewise, the kernel periodic poll 
> can be
> too long, again causing overflow, or too frequent, causing only timer
> execution overhead.
> 
> Various data is collected by the kernel while processing the 
> periodic timer,
> this approach would also allow the profiling tools to control the
> frequency of this collection.   More frequent collection results in 
> more
> accurate sample data, with the linear cost of poll execution 
> overhead.
> 
> Frequency changes can be handled either by the profile code setting
> collection at a higher than necessary rate, or by interacting with 
> the
> governor to limit the speeds.
> 
> Optionally, the kernel can add a record indicating that some data was
> likely dropped if it is able to read all 256 entries without 
> underflowing
> the array.  This can be used as hint to user space that the kernel 
> time
> was too long for the collection rate.

Moving the sample rate computation to user space sounds like the right
idea, but why not have a more drastic version of it:

Right now, all products that support this feature run at the same clock
rate (3.2 Ghz), with cpufreq, we can reduce this to 1.6 Ghz. If I understand
this correctly, the value depends only on the frequency, so we could simply
hardcode this in the kernel, and print out a warning message if we ever
encounter a different frequency, right?

> The current patch specifically identifies that only single
> elf objects are handled.  There is no code to handle dynamic
> linked libraries or overlays.   Nor is there any method to
> present samples that may have been collected during context
> switch processing, they must be discarded.

I thought it already did handle overlays, what did I miss here?

> My proposal is to change what is presented to user space.  Instead
> of trying to translate the SPU address to the backing file
> as the samples are recorded, store the samples as the SPU
> context and address.  The context switch would record tid,
> pid, object id as it does now.   In addition, if this is a
> new object-id, the kernel would read elf headers as it does
> today.  However, it would then proceed to provide accurate
> dcookie information for each loader region and overlay.

Doing the translation in two stages in user space, as you
suggest here, definitely makes sense to me. I think it
can be done a little simpler though:

Why would you need the accurate dcookie information to be
provided by the kernel? The ELF loader is done in user
space, and the kernel only reproduces what it thinks that
came up with. If the kernel only gives the dcookie information
about the SPU ELF binary to the oprofile user space, then
that can easily recreate the same mapping.

The kernel still needs to provide the overlay identifiers
though.

> To identify which overlays are active, (instead of the present
> read on use and search the list to translate approach) the
> kernel would record the location of the overlay identifiers
> as it parsed the kernel, but would then read the identification
> word and would record the present value as an sample from
> a separate but related stream.   The kernel could maintain
> the last value for each overlay and only send profile events
> for the deltas.

right.

> This approach trades translation lookup overhead for each
> recorded sample for a burst of data on new context activation.
> In addition it exposes the sample point of the overlay identifier
> vs the address collection.  This allows the ambiguity to be
> exposed to user space.   In addition, with the above proposed
> kernel timer

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-08 Thread Maynard Johnson


Michael,
Thanks very much for the advice.  Both issues have been solved now, with 
your help.


-Maynard

Michael Ellerman wrote:


On Wed, 2007-02-07 at 09:41 -0600, Maynard Johnson wrote:
 


Carl Love wrote:

   


Subject: Add support to OProfile for profiling Cell BE SPUs

From: Maynard Johnson <[EMAIL PROTECTED]>

This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
to add in the SPU profiling capabilities.  In addition, a 'cell' subdirectory
was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
code.

Signed-off-by: Carl Love <[EMAIL PROTECTED]>
Signed-off-by: Maynard Johnson <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc1/arch/powerpc/configs/cell_defconfig


 

I've discovered more problems with the kref handling for the cached_info 
object that we store in the spu_context.  :-(


When the OProfile module initially determines that no cached_info yet 
exists for a given spu_context, it creates the cached_info, inits the 
cached_info's kref (which increments the refcount) and does a kref_get 
(for SPUFS' ref) before passing the cached_info reference off to SUPFS 
to store into the spu_context.  When OProfile shuts down or the SPU job 
ends, OProfile gives up its ref to the cached_info with kref_put.  Then 
when SPUFS destroys the spu_context, it also gives up its ref.  HOWEVER 
. . . . If OProfile shuts down while the SPU job is still active _and_ 
_then_ is restarted while the job is still active, OProfile will find 
that the cached_info exists for the given spu_context, so it won't go 
through the process of creating it and doing kref_init on the kref.  
Under this scenario, OProfile does not own a ref of its own to the 
cached_info, and should not be doing a kref_put when done using the 
cached_info -- but it does, and so does SPUFS when the spu_context is 
destroyed.  The end result (with the code as currently written) is that 
an extra kref_put is done when the refcount is already down to zero.  To 
fix this, OProfile needs to detect when it finds an existing cached_info 
already stored in the spu_context.  Then, instead of creating a new one, 
it sets a reminder flag to be used later when it's done using the cached 
info to indicate whether or not it needs to call kref_put.
   



I think all you want to do is when oprofile finds the cached_info
already existing, it does a kref_get(). After all it doesn't have a
reference to it, so before it starts using it it must inc the ref count.

 

Unfortunately, there's another problem (one that should have been 
obvious to me).  The cached_info's kref "release" function is 
destroy_cached_info(), defined in the OProfile module.  If the OProfile 
module is unloaded when SPUFS destroys the spu_context and calls 
kref_put on the cached_info's kref -- KABOOM!  The destroy_cached_info 
function (the second arg to kref_put) is not in memory, so we get a 
paging fault.  I see a couple options to solve this:
   1) Don't store the cached_info in the spu_context.  Basically, go 
back to the simplistic model of creating/deleting the cached_info on 
every SPU task activation/deactivation.
   2)  If there's some way to do this, force the OProfile module to 
stay loaded until SPUFS has destroyed its last spu_context that holds a 
cached_info object.
   



There is a mechanism for that, you just have each cached_info inc the
module's refcount.

Another option would be to have a mapping, in the oprofile code, from
spu_contexts to cached_infos, ie. a hash table or something.

cheers

 




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-08 Thread Milton Miller



On Feb 6, 2007, at 5:02 PM, Carl Love wrote:


This is the first update to the patch previously posted by Maynard
Johnson as "PATCH 4/4. Add support to OProfile for profiling CELL".

This repost fixes the line wrap issue that Ben mentioned.  Also the 
kref

handling for the cached info has been fixed and simplified.

There are still a few items from the comments being discussed
specifically how to profile the dynamic code for the SPFS context 
switch

code and how to deal with dynamic code stubs for library support.  Our
proposal is to assign the samples from the SPFS and dynamic library 
code

to an anonymous sample bucket.  The support for properly handling the
symbol extraction in these cases would be deferred to a later SDK.

There is also a bug in profiling overlay code that we are 
investigating.


I'd like to talk about about both some of the concepts, including what
is logged and what the kernel API is, and provide some directed comments
on the code in the patch as it stands.

[Ok, the background and concepts portion is big enough, I'll send
this to the several lists for comment, and the code comments to
just cbe-oss-dev]

First, I'll give some background and my understanding of both the
hardware, and some of the linux interface.  I'm not consulting any
manuals, so I could be mistaken.   I'm basing this on a combination
of past reading of the kernel, this patch, a discussion with Arnd,
and my knowledge of the PowerPC processor and other IBM processors.
Hopefully this will be of use to other reviewers.

Background:

The cell broadband architecture consists of PowerPC processors
and synergistic processing units, or SPUs.  The current chip has
a multi-threaded PowerPC core and 8 SPUs.   Each SPU has an
execution core (running its own instruction set), a 256kB
private memory (local store), and DMA engine that can access
the coherent memory domain of the PowerPC processor(s).  Multiple
chips can be connected together in a single coherent SMP system.
The addresses provided to the DMA engine are effective, translated
through virtual to real address spaces like the PowerPC MMU.
The SPUs are intended to run application threads, and require
the PowerPC to perform exception handling and other operating
system tasks.

Because of the limited address space of the SPU (18
bits), the compilers (gcc) have support for code
overlays.  The ELF format defines these overlays
including file offsets, sizes, load address, and
a unique word to identify the overlay.  The toolchain
has support to check that an overlay is present and
setup the DMA engine to load it if it is not present.

In Linux, the SPUs are controlled through spufs.  Once
a SPU context is created, the local store and registers can be
modified through the file system.  A linux thread makes
a syscall requesting the context to run.  This call is
blocking to that thread; it waits for an event from the
SPU (either an exception or a reschedule). In this regard
the syscall can be thought of as a cross instruction set
function call.  The contents of the local store are
initialized and modified via read/write or mmap and memcopy
of a spufs file. Specifically, pages are not mapped into the
local store, they are copied into it.  The SPU context is
tied to the creating mm; this provides a clear context for
the DMA engine (controlled by the program in the SPU).

To assist with the use of the SPUs, and to provide some
portability to other environments, a library, libpse, is
available and widely used.   Among other items, it provides
an elf loader for SPU elf objects.  Applications may mmap
external objects or embed the elf objects into their
executable (as data).  The contained objects copied to the
SPU local store as dictated by the elf header.   To assist
other tools, spufs provides an object-id file.   Before this
patch, it has been treated as an opaque number. Although not
present in the first release of libpse, current versions write
the virtual address of the elf header to this file
when the loader is called.

This patch is trying to enable oprofile to collect and
analyze SPU execution, using the trace collection
hardware of the processor.   The trace collection hardware
consists of a LFSR (linear-feedback shift register) counter
and an array 256 bits wide and 256 entries deep.  When the
counter matches the programed value, it causes an entry
to be written to the trace array and the counter to be
reloaded.  On chip logic tracks how many entries have
been written and not yet read.  When programed to trace
the SPUs, the trace entry consists of the upper 16 bits of
the 18 bit program counter for each SPU; the 8 SPUs split
over the two words.  By their nature, LFSRs require a minimum
of hardware (typically 3 exclusive-or gates and a shift
register), but the count sequence is pseudo-random.

The oprofile system is designed to collect a stream of trace
data and context information to and provide it to user space
for post processing and analysis.   While at the lowest l

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-07 Thread Michael Ellerman

On Wed, 2007-02-07 at 09:41 -0600, Maynard Johnson wrote:
> Carl Love wrote:
> 
> >
> >Subject: Add support to OProfile for profiling Cell BE SPUs
> >
> >From: Maynard Johnson <[EMAIL PROTECTED]>
> >
> >This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
> >to add in the SPU profiling capabilities.  In addition, a 'cell' subdirectory
> >was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
> >code.
> >
> >Signed-off-by: Carl Love <[EMAIL PROTECTED]>
> >Signed-off-by: Maynard Johnson <[EMAIL PROTECTED]>
> >
> >Index: linux-2.6.20-rc1/arch/powerpc/configs/cell_defconfig
> >  
> >
> I've discovered more problems with the kref handling for the cached_info 
> object that we store in the spu_context.  :-(
> 
> When the OProfile module initially determines that no cached_info yet 
> exists for a given spu_context, it creates the cached_info, inits the 
> cached_info's kref (which increments the refcount) and does a kref_get 
> (for SPUFS' ref) before passing the cached_info reference off to SUPFS 
> to store into the spu_context.  When OProfile shuts down or the SPU job 
> ends, OProfile gives up its ref to the cached_info with kref_put.  Then 
> when SPUFS destroys the spu_context, it also gives up its ref.  HOWEVER 
> . . . . If OProfile shuts down while the SPU job is still active _and_ 
> _then_ is restarted while the job is still active, OProfile will find 
> that the cached_info exists for the given spu_context, so it won't go 
> through the process of creating it and doing kref_init on the kref.  
> Under this scenario, OProfile does not own a ref of its own to the 
> cached_info, and should not be doing a kref_put when done using the 
> cached_info -- but it does, and so does SPUFS when the spu_context is 
> destroyed.  The end result (with the code as currently written) is that 
> an extra kref_put is done when the refcount is already down to zero.  To 
> fix this, OProfile needs to detect when it finds an existing cached_info 
> already stored in the spu_context.  Then, instead of creating a new one, 
> it sets a reminder flag to be used later when it's done using the cached 
> info to indicate whether or not it needs to call kref_put.

I think all you want to do is when oprofile finds the cached_info
already existing, it does a kref_get(). After all it doesn't have a
reference to it, so before it starts using it it must inc the ref count.

> Unfortunately, there's another problem (one that should have been 
> obvious to me).  The cached_info's kref "release" function is 
> destroy_cached_info(), defined in the OProfile module.  If the OProfile 
> module is unloaded when SPUFS destroys the spu_context and calls 
> kref_put on the cached_info's kref -- KABOOM!  The destroy_cached_info 
> function (the second arg to kref_put) is not in memory, so we get a 
> paging fault.  I see a couple options to solve this:
> 1) Don't store the cached_info in the spu_context.  Basically, go 
> back to the simplistic model of creating/deleting the cached_info on 
> every SPU task activation/deactivation.
> 2)  If there's some way to do this, force the OProfile module to 
> stay loaded until SPUFS has destroyed its last spu_context that holds a 
> cached_info object.

There is a mechanism for that, you just have each cached_info inc the
module's refcount.

Another option would be to have a mapping, in the oprofile code, from
spu_contexts to cached_infos, ie. a hash table or something.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person


signature.asc
Description: This is a digitally signed message part

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-07 Thread Maynard Johnson

Carl Love wrote:

Subject: Add support to OProfile for profiling Cell BE SPUs

From: Maynard Johnson <[EMAIL PROTECTED]>

This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
to add in the SPU profiling capabilities.  In addition, a 'cell' subdirectory
was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
code.

Signed-off-by: Carl Love <[EMAIL PROTECTED]>
Signed-off-by: Maynard Johnson <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc1/arch/powerpc/configs/cell_defconfig

I've discovered more problems with the kref handling for the cached_info 
object that we store in the spu_context.  :-(

When the OProfile module initially determines that no cached_info yet 
exists for a given spu_context, it creates the cached_info, inits the 
cached_info's kref (which increments the refcount) and does a kref_get 
(for SPUFS' ref) before passing the cached_info reference off to SUPFS 
to store into the spu_context.  When OProfile shuts down or the SPU job 
ends, OProfile gives up its ref to the cached_info with kref_put.  Then 
when SPUFS destroys the spu_context, it also gives up its ref.  HOWEVER 
. . . . If OProfile shuts down while the SPU job is still active _and_ 
_then_ is restarted while the job is still active, OProfile will find 
that the cached_info exists for the given spu_context, so it won't go 
through the process of creating it and doing kref_init on the kref.  
Under this scenario, OProfile does not own a ref of its own to the 
cached_info, and should not be doing a kref_put when done using the 
cached_info -- but it does, and so does SPUFS when the spu_context is 
destroyed.  The end result (with the code as currently written) is that 
an extra kref_put is done when the refcount is already down to zero.  To 
fix this, OProfile needs to detect when it finds an existing cached_info 
already stored in the spu_context.  Then, instead of creating a new one, 
it sets a reminder flag to be used later when it's done using the cached 
info to indicate whether or not it needs to call kref_put.

Unfortunately, there's another problem (one that should have been 
obvious to me).  The cached_info's kref "release" function is 
destroy_cached_info(), defined in the OProfile module.  If the OProfile 
module is unloaded when SPUFS destroys the spu_context and calls 
kref_put on the cached_info's kref -- KABOOM!  The destroy_cached_info 
function (the second arg to kref_put) is not in memory, so we get a 
paging fault.  I see a couple options to solve this:
   1) Don't store the cached_info in the spu_context.  Basically, go 
back to the simplistic model of creating/deleting the cached_info on 
every SPU task activation/deactivation.
   2)  If there's some way to do this, force the OProfile module to 
stay loaded until SPUFS has destroyed its last spu_context that holds a 
cached_info object.

I thought about putting the cached_info's kref "release" function in 
SPUFS, but this just won't work. This implies that SPUFS needs to know 
about the structure of the cached_info, e.g., that it contains the 
vma_map member that needs to be freed.  But even with that information, 
it's not enough, since the vma_map member consists of list of vma_maps, 
which is why we have the vma_map_free() function.  So SPUFS would still 
need access to vma_map_free() from the OProfile module.

Opinions from others would be appreciated.

Thanks.
-Maynard

+/* Container for caching information about an active SPU task.
+ * 
+ */

+struct cached_info {
+   struct vma_to_fileoffset_map * map;
+   struct spu * the_spu;   /* needed to access pointer to local_store */
+   struct kref cache_ref;
+};
+
+static struct cached_info * spu_info[MAX_NUMNODES * 8];
+
+static void destroy_cached_info(struct kref * kref)
+{
+   struct cached_info * info;
+   info = container_of(kref, struct cached_info, cache_ref);
+   vma_map_free(info->map);
+   kfree(info);
+}
+
+/* Return the cached_info for the passed SPU number.
+ * 
+ */

+static struct cached_info * get_cached_info(struct spu * the_spu, int spu_num)
+{
+   struct cached_info * ret_info = NULL;
+   unsigned long flags = 0;
+   if (spu_num >= num_spu_nodes) {
+		printk(KERN_ERR "SPU_PROF: " 
+		   "%s, line %d: Invalid index %d into spu info cache\n",
+		   __FUNCTION__, __LINE__, spu_num); 
+		goto out;

+   }
+   spin_lock_irqsave(&cache_lock, flags);
+   if (!spu_info[spu_num] && the_spu)
+   spu_info[spu_num] = (struct cached_info *)
+   spu_get_profile_private(the_spu->ctx);
+
+   ret_info = spu_info[spu_num];
+   spin_unlock_irqrestore(&cache_lock, flags);
+ out:
+   return ret_info;
+}
+
+
+/* Looks for cached info for the passed spu.  If not found, the
+ * cached info is created for the passed spu.
+ * Returns 0 for success; otherwise, -1 for error.  
+ */ 
+static int

+prepare_cached_spu_info(struct spu * spu, unsigned int objectId)
+{
+   unsigned lon

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-06 Thread Carl Love

This is the first update to the patch previously posted by Maynard
Johnson as "PATCH 4/4. Add support to OProfile for profiling CELL".  

This repost fixes the line wrap issue that Ben mentioned.  Also the kref
handling for the cached info has been fixed and simplified.

There are still a few items from the comments being discussed
specifically how to profile the dynamic code for the SPFS context switch
code and how to deal with dynamic code stubs for library support.  Our
proposal is to assign the samples from the SPFS and dynamic library code
to an anonymous sample bucket.  The support for properly handling the
symbol extraction in these cases would be deferred to a later SDK.

There is also a bug in profiling overlay code that we are investigating.


Subject: Add support to OProfile for profiling Cell BE SPUs

From: Maynard Johnson <[EMAIL PROTECTED]>

This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
to add in the SPU profiling capabilities.  In addition, a 'cell' subdirectory
was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
code.

Signed-off-by: Carl Love <[EMAIL PROTECTED]>
Signed-off-by: Maynard Johnson <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc1/arch/powerpc/configs/cell_defconfig
===
--- linux-2.6.20-rc1.orig/arch/powerpc/configs/cell_defconfig   2007-01-18 
16:43:14.230540320 -0600
+++ linux-2.6.20-rc1/arch/powerpc/configs/cell_defconfig2007-02-01 
17:21:46.928875304 -0600
@@ -1403,7 +1403,7 @@
 # Instrumentation Support
 #
 CONFIG_PROFILING=y
-CONFIG_OPROFILE=y
+CONFIG_OPROFILE=m
 # CONFIG_KPROBES is not set
 
 #
Index: linux-2.6.20-rc1/arch/powerpc/oprofile/cell/pr_util.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/cell/pr_util.h   2007-02-03 
15:56:01.094856152 -0600
@@ -0,0 +1,78 @@
+ /*
+ * Cell Broadband Engine OProfile Support
+ *
+ * (C) Copyright IBM Corporation 2006
+ *
+ * Author: Maynard Johnson <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef PR_UTIL_H
+#define PR_UTIL_H
+
+#include 
+#include 
+#include 
+#include 
+
+static inline int number_of_online_nodes(void) 
+{
+   u32 cpu; u32 tmp;
+   int nodes = 0;
+   for_each_online_cpu(cpu) {
+   tmp = cbe_cpu_to_node(cpu) + 1;
+   if (tmp > nodes)
+   nodes++;
+   }
+   return nodes;
+}
+
+/* Defines used for sync_start */
+#define SKIP_GENERIC_SYNC 0
+#define SYNC_START_ERROR -1
+#define DO_GENERIC_SYNC 1
+
+struct vma_to_fileoffset_map
+{
+   struct vma_to_fileoffset_map *next;
+   unsigned int vma;
+   unsigned int size;
+   unsigned int offset;
+   unsigned int guard_ptr;
+   unsigned int guard_val;
+};
+
+/* The three functions below are for maintaining and accessing
+ * the vma-to-fileoffset map.
+ */
+struct vma_to_fileoffset_map * create_vma_map(const struct spu * spu, u64 
objectid);
+unsigned int vma_map_lookup(struct vma_to_fileoffset_map *map, unsigned int 
vma,
+   const struct spu * aSpu);
+void vma_map_free(struct vma_to_fileoffset_map *map);
+
+/*
+ * Entry point for SPU profiling.
+ * cycles_reset is the SPU_CYCLES count value specified by the user.
+ */
+void start_spu_profiling(unsigned int cycles_reset);
+
+void stop_spu_profiling(void);
+
+ 
+/* add the necessary profiling hooks */
+int spu_sync_start(void);
+
+/* remove the hooks */
+int spu_sync_stop(void);
+ 
+/* Record SPU program counter samples to the oprofile event buffer. */
+void spu_sync_buffer(int spu_num, unsigned int * samples, 
+int num_samples);
+
+void set_profiling_frequency(unsigned int freq_khz, unsigned int cycles_reset);
+
+#endif// PR_UTIL_H 
Index: linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_profiler.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_profiler.c  2007-02-05 
09:32:25.708937424 -0600
@@ -0,0 +1,203 @@
+/*
+ * Cell Broadband Engine OProfile Support
+ *
+ * (C) Copyright IBM Corporation 2006
+ *
+ * Authors: Maynard Johnson <[EMAIL PROTECTED]>
+ *  Carl Love <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "pr_util.h"
+
+#define TRACE_ARRAY_SIZE 1024
+#define SCALE_SHIFT 14 
+
+static u3

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

32 matches

Site Navigation

Mail list logo

Footer information