Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Tuesday 27 February 2007, Maynard Johnson wrote: > I have applied the "cleanup" patch that Arnd sent, but had to fix up a > few things: > - Bug fix: Initialize retval in spu_task_sync.c, line 95, otherwise > OProfile this function returns non-zero and OProfile fails. > - Remove unused codes in include/linux/oprofile.h > - Compile warnings: Initialize offset and spu_cookie at lines 283 > and 284 in spu_task_sync.c > > With these changes and some userspace changes that were necessary to > correspond with Arnd's changes, our testing was successful. > > A fixup patch is attached. > The patch does not contain any of the metadata I need to apply it (subject, description, signed-off-by). > @@ -280,8 +280,8 @@ static int process_context_switch(struct > { > unsigned long flags; > int retval; > - unsigned int offset; > - unsigned long spu_cookie, app_dcookie; > + unsigned int offset = 0; > + unsigned long spu_cookie = 0, app_dcookie; > retval = prepare_cached_spu_info(spu, objectId); > if (retval) > goto out; No, this is wrong. Leaving the variables uninitialized at least warns you about the bug you have in this function: when there is anything wrong, you just continue writing the record with zero offset and dcookie values in it. Instead, you should get handle the error condition somewhere down the code. It's harmless most of the time, but you really should not be painting over your bugs by blindly initializing variables. > diff -paur linux-orig/include/linux/oprofile.h > linux-new/include/linux/oprofile.h > --- linux-orig/include/linux/oprofile.h 2007-02-27 14:41:29.0 -0600 > +++ linux-new/include/linux/oprofile.h 2007-02-27 14:43:18.0 -0600 > @@ -36,9 +36,6 @@ > #define XEN_ENTER_SWITCH_CODE 10 > #define SPU_PROFILING_CODE 11 > #define SPU_CTX_SWITCH_CODE12 > -#define SPU_OFFSET_CODE13 > -#define SPU_COOKIE_CODE14 > -#define SPU_SHLIB_COOKIE_CODE 15 > > struct super_block; > struct dentry; > Right, I forgot about this. Arnd <>< - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
I have applied the "cleanup" patch that Arnd sent, but had to fix up a few things: - Bug fix: Initialize retval in spu_task_sync.c, line 95, otherwise OProfile this function returns non-zero and OProfile fails. - Remove unused codes in include/linux/oprofile.h - Compile warnings: Initialize offset and spu_cookie at lines 283 and 284 in spu_task_sync.c With these changes and some userspace changes that were necessary to correspond with Arnd's changes, our testing was successful. A fixup patch is attached. P.S. We have a single patch with all these changes applied if anyone would like us to post it. -Maynard Arnd Bergmann wrote: On Thursday 22 February 2007, Carl Love wrote: This patch updates the existing arch/powerpc/oprofile/op_model_cell.c to add in the SPU profiling capabilities. In addition, a 'cell' subdirectory was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling code. There was a significant amount of whitespace breakage in this patch, which I cleaned up. The patch below consists of the other things I changed as a further cleanup. Note that I changed the format of the context switch record, which I found too complicated, as I described on IRC last week. Arnd <>< diff -paur linux-orig/arch/powerpc/oprofile/cell/spu_task_sync.c linux-new/arch/powerpc/oprofile/cell/spu_task_sync.c --- linux-orig/arch/powerpc/oprofile/cell/spu_task_sync.c 2007-02-27 17:10:24.0 -0600 +++ linux-new/arch/powerpc/oprofile/cell/spu_task_sync.c 2007-02-27 17:08:57.0 -0600 @@ -92,7 +92,7 @@ prepare_cached_spu_info(struct spu * spu { unsigned long flags; struct vma_to_fileoffset_map * new_map; - int retval; + int retval = 0; struct cached_info * info; /* We won't bother getting cache_lock here since @@ -280,8 +280,8 @@ static int process_context_switch(struct { unsigned long flags; int retval; - unsigned int offset; - unsigned long spu_cookie, app_dcookie; + unsigned int offset = 0; + unsigned long spu_cookie = 0, app_dcookie; retval = prepare_cached_spu_info(spu, objectId); if (retval) goto out; diff -paur linux-orig/include/linux/oprofile.h linux-new/include/linux/oprofile.h --- linux-orig/include/linux/oprofile.h 2007-02-27 14:41:29.0 -0600 +++ linux-new/include/linux/oprofile.h 2007-02-27 14:43:18.0 -0600 @@ -36,9 +36,6 @@ #define XEN_ENTER_SWITCH_CODE 10 #define SPU_PROFILING_CODE 11 #define SPU_CTX_SWITCH_CODE12 -#define SPU_OFFSET_CODE13 -#define SPU_COOKIE_CODE14 -#define SPU_SHLIB_COOKIE_CODE 15 struct super_block; struct dentry;
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Tue, 2007-02-27 at 00:50 +0100, Arnd Bergmann wrote: > On Thursday 22 February 2007, Carl Love wrote: > > This patch updates the existing arch/powerpc/oprofile/op_model_cell.c > > to add in the SPU profiling capabilities. In addition, a 'cell' > > subdirectory > > was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling > > code. > > There was a significant amount of whitespace breakage in this patch, > which I cleaned up. The patch below consists of the other things > I changed as a further cleanup. Note that I changed the format > of the context switch record, which I found too complicated, as > I described on IRC last week. > > Arnd <>< > > -- > Subject: cleanup spu oprofile code > > From: Arnd Bergmann <[EMAIL PROTECTED]> > This cleans up some of the new oprofile code. It's mostly > cosmetic changes, like way multi-line comments are formatted. > The most significant change is a simplification of the > context-switch record format. > > It does mean the oprofile report tool needs to be adapted, > but I'm sure that it pays off in the end. I hate to be a stickler, but this patch is quite large, contains multiple changes, and mixes formatting changes with functional changes ... makes it a little hard to review :/ cheers -- Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person signature.asc Description: This is a digitally signed message part
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Thursday 22 February 2007, Carl Love wrote: > This patch updates the existing arch/powerpc/oprofile/op_model_cell.c > to add in the SPU profiling capabilities. In addition, a 'cell' subdirectory > was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling > code. There was a significant amount of whitespace breakage in this patch, which I cleaned up. The patch below consists of the other things I changed as a further cleanup. Note that I changed the format of the context switch record, which I found too complicated, as I described on IRC last week. Arnd <>< -- Subject: cleanup spu oprofile code From: Arnd Bergmann <[EMAIL PROTECTED]> This cleans up some of the new oprofile code. It's mostly cosmetic changes, like way multi-line comments are formatted. The most significant change is a simplification of the context-switch record format. It does mean the oprofile report tool needs to be adapted, but I'm sure that it pays off in the end. Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]> Index: linux-2.6/arch/powerpc/oprofile/cell/spu_task_sync.c === --- linux-2.6.orig/arch/powerpc/oprofile/cell/spu_task_sync.c +++ linux-2.6/arch/powerpc/oprofile/cell/spu_task_sync.c @@ -61,11 +61,12 @@ static void destroy_cached_info(struct k static struct cached_info * get_cached_info(struct spu * the_spu, int spu_num) { struct kref * ref; - struct cached_info * ret_info = NULL; + struct cached_info * ret_info; if (spu_num >= num_spu_nodes) { printk(KERN_ERR "SPU_PROF: " "%s, line %d: Invalid index %d into spu info cache\n", __FUNCTION__, __LINE__, spu_num); + ret_info = NULL; goto out; } if (!spu_info[spu_num] && the_spu) { @@ -89,9 +90,9 @@ static struct cached_info * get_cached_i static int prepare_cached_spu_info(struct spu * spu, unsigned int objectId) { - unsigned long flags = 0; + unsigned long flags; struct vma_to_fileoffset_map * new_map; - int retval = 0; + int retval; struct cached_info * info; /* We won't bother getting cache_lock here since @@ -112,6 +113,7 @@ prepare_cached_spu_info(struct spu * spu printk(KERN_ERR "SPU_PROF: " "%s, line %d: create vma_map failed\n", __FUNCTION__, __LINE__); + retval = -ENOMEM; goto err_alloc; } new_map = create_vma_map(spu, objectId); @@ -119,6 +121,7 @@ prepare_cached_spu_info(struct spu * spu printk(KERN_ERR "SPU_PROF: " "%s, line %d: create vma_map failed\n", __FUNCTION__, __LINE__); + retval = -ENOMEM; goto err_alloc; } @@ -144,7 +147,7 @@ prepare_cached_spu_info(struct spu * spu goto out; err_alloc: - retval = -1; + kfree(info); out: return retval; } @@ -215,11 +218,9 @@ static inline unsigned long fast_get_dco static unsigned long get_exec_dcookie_and_offset(struct spu * spu, unsigned int * offsetp, unsigned long * spu_bin_dcookie, - unsigned long * shlib_dcookie, unsigned int spu_ref) { unsigned long app_cookie = 0; - unsigned long * image_cookie = NULL; unsigned int my_offset = 0; struct file * app = NULL; struct vm_area_struct * vma; @@ -252,24 +253,17 @@ get_exec_dcookie_and_offset(struct spu * my_offset, spu_ref, vma->vm_file->f_dentry->d_name.name); *offsetp = my_offset; - if (my_offset == 0) - image_cookie = spu_bin_dcookie; - else if (vma->vm_file != app) - image_cookie = shlib_dcookie; break; } - if (image_cookie) { - *image_cookie = fast_get_dcookie(vma->vm_file->f_dentry, + *spu_bin_dcookie = fast_get_dcookie(vma->vm_file->f_dentry, vma->vm_file->f_vfsmnt); - pr_debug("got dcookie for %s\n", -vma->vm_file->f_dentry->d_name.name); - } + pr_debug("got dcookie for %s\n", vma->vm_file->f_dentry->d_name.name); - out: +out: return app_cookie; - fail_no_image_cookie: +fail_no_image_cookie: printk(KERN_ERR "SPU_PROF: " "%s, line %d: Cannot find dcookie for SPU binary\n", __FUNCTION__, __LINE__); @@ -285,18 +279,18 @@ get_exec_dcookie_and_offset(struct spu * static int process_context_switch(struct spu * spu, unsigned int objectId) { unsigned long flags; - int retval = 0; - unsigned int offset = 0; - unsigned long spu_cookie = 0, app_dcookie = 0, shlib_cookie = 0; +
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
Maynard Johnson wrote: Arnd Bergmann wrote: On Friday 16 February 2007 01:32, Maynard Johnson wrote: config OPROFILE_CELL bool "OProfile for Cell Broadband Engine" depends on OPROFILE && SPU_FS default y if ((SPU_FS = y && OPROFILE = y) || (SPU_FS = m && OPROFILE = m)) help Profiling of Cell BE SPUs requires special support enabled by this option. Both SPU_FS and OPROFILE options must be set 'y' or both be set 'm'. = Can anyone see a problem with any of this . . . or perhaps a suggestion of a better way? The text suggests it doesn't allow SPU_FS=y with OPROFILE=m, which I think should be allowed. Right, good catch. I'll add another OR to the 'default y' and correct the text. Actually, it makes more sense to do the following: config OPROFILE_CELL bool "OProfile for Cell Broadband Engine" depends on (SPU_FS = y && OPROFILE = m) || (SPU_FS = y && OPROFILE = y) || (SPU_FS = m && OPROFILE = m) default y help Profiling of Cell BE SPUs requires special support enabled by this option. > I also don't see any place in the code where you actually use CONFIG_OPROFILE_CELL. As I mentioned, I will use CONFIG_OPROFILE_CELL in the arch/powerpc/oprofile/Makefile as follows: oprofile-$(CONFIG_OPROFILE_CELL) += op_model_cell.o \ cell/spu_profiler.o cell/vma_map.o cell/spu_task_sync.o [snip] Arnd <>< ___ Linuxppc-dev mailing list [EMAIL PROTECTED] https://ozlabs.org/mailman/listinfo/linuxppc-dev - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
Arnd Bergmann wrote: On Friday 16 February 2007 01:32, Maynard Johnson wrote: config OPROFILE_CELL bool "OProfile for Cell Broadband Engine" depends on OPROFILE && SPU_FS default y if ((SPU_FS = y && OPROFILE = y) || (SPU_FS = m && OPROFILE = m)) help Profiling of Cell BE SPUs requires special support enabled by this option. Both SPU_FS and OPROFILE options must be set 'y' or both be set 'm'. = Can anyone see a problem with any of this . . . or perhaps a suggestion of a better way? The text suggests it doesn't allow SPU_FS=y with OPROFILE=m, which I think should be allowed. Right, good catch. I'll add another OR to the 'default y' and correct the text. > I also don't see any place in the code where you actually use CONFIG_OPROFILE_CELL. As I mentioned, I will use CONFIG_OPROFILE_CELL in the arch/powerpc/oprofile/Makefile as follows: oprofile-$(CONFIG_OPROFILE_CELL) += op_model_cell.o \ cell/spu_profiler.o cell/vma_map.o cell/spu_task_sync.o Ideally, you should be able to have an oprofile_spu module that can be loaded after spufs.ko and oprofile.ko. In that case you only need config OPROFILE_SPU depends on OPROFILE && SPU_FS default y and it will automatically build oprofile_spu as a module if one of the two is a module and won't build it if one of them is disabled. Hmmm . . . I guess that would entail splitting out the SPU-related stuff from op_model_cell.c into a new file. Maybe more -- that's just what comes to mind right now. Could be very tricky, and I wonder if it's worth the bother. Arnd <>< - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Friday 16 February 2007 01:32, Maynard Johnson wrote: > config OPROFILE_CELL > bool "OProfile for Cell Broadband Engine" > depends on OPROFILE && SPU_FS > default y if ((SPU_FS = y && OPROFILE = y) || (SPU_FS = m && > OPROFILE = m)) > help > Profiling of Cell BE SPUs requires special support enabled > by this option. Both SPU_FS and OPROFILE options must be > set 'y' or both be set 'm'. > = > > Can anyone see a problem with any of this . . . or perhaps a suggestion > of a better way? The text suggests it doesn't allow SPU_FS=y with OPROFILE=m, which I think should be allowed. I also don't see any place in the code where you actually use CONFIG_OPROFILE_CELL. Ideally, you should be able to have an oprofile_spu module that can be loaded after spufs.ko and oprofile.ko. In that case you only need config OPROFILE_SPU depends on OPROFILE && SPU_FS default y and it will automatically build oprofile_spu as a module if one of the two is a module and won't build it if one of them is disabled. Arnd <>< - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
Arnd Bergmann wrote: On Thursday 15 February 2007 00:52, Carl Love wrote: --- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/Kconfig 2007-01-18 16:43:14.0 -0600 +++ linux-2.6.20-rc1/arch/powerpc/oprofile/Kconfig 2007-02-13 19:04:46.271028904 -0600 @@ -7,7 +7,8 @@ config OPROFILE tristate "OProfile system profiling (EXPERIMENTAL)" - depends on PROFILING + default m + depends on SPU_FS && PROFILING help OProfile is a profiling system capable of profiling the whole system, include the kernel, kernel modules, libraries, Milton already commented on this being wrong. I think what you want is depends on PROFILING && (SPU_FS = n || SPU_FS) that should make sure that when SPU_FS=y that OPROFILE can not be 'm'. The above suggestion would not work if SPU_FS is not defined, since the entire config option is ignored if an undefined symbol is used. So, here's what I propose instead: - Leave the existing 'config OPROFILE' unchanged from its current form in mainline (shown below) - Add the new 'config OPROFILE_CELL' (shown below) - In arch/powerpc/configs/cell-defconfig, set CONFIG_OPROFILE=m, to correspond to setting for CONFIG_SPU_FS - In arch/powerpc/oprofile/Makefile, do the following: oprofile-$(CONFIG_OPROFILE_CELL) += op_model_cell.o \ cell/spu_profiler.o cell/vma_map.o cell/spu_task_sync.o === config OPROFILE tristate "OProfile system profiling (EXPERIMENTAL)" depends on PROFILING help OProfile is a profiling system capable of profiling the whole system, include the kernel, kernel modules, libraries, and applications. If unsure, say N. config OPROFILE_CELL bool "OProfile for Cell Broadband Engine" depends on OPROFILE && SPU_FS default y if ((SPU_FS = y && OPROFILE = y) || (SPU_FS = m && OPROFILE = m)) help Profiling of Cell BE SPUs requires special support enabled by this option. Both SPU_FS and OPROFILE options must be set 'y' or both be set 'm'. = Can anyone see a problem with any of this . . . or perhaps a suggestion of a better way? Thanks. -Maynard - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Thursday 15 February 2007 22:50, Paul E. McKenney wrote: > Is this 1.5ms with interrupts disabled? This time period is problematic > from a realtime perspective if so -- need to be able to preempt. No, interrupts should be enabled here. Still, 1.5ms is probably a little too long without a cond_resched() in case kernel preemption is disabled. Arnd <>< - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Thu, Feb 15, 2007 at 12:21:58PM -0800, Carl Love wrote: > On Thu, 2007-02-15 at 15:37 +0100, Arnd Bergmann wrote: [ . . . ] > > I agree with Milton that it would be far nicer even to calculate > > the value from user space, but since you say that would > > violate the oprofile interface conventions, let's not go there. > > In order to make this code nicer on the user, you should probably > > insert a 'cond_resched()' somewhere in the loop, maybe every > > 500 iterations or so. > > > > it also looks like there is whitespace damage in the code here. > > I will double check on the whitespace damage. I thought I had gotten > all that out. > > I have done some quick measurements. The above method limits the loop > to at most 2^16 iterations. Based on running the algorithm in user > space, it takes about 3ms of computation time to do the loop 2^16 times. > > At the vary least, we need to put the resched in say every 10,000 > iterations which would be about every 0.5ms. Should we do a resched > more often? > > Additionally we could up the size of the table to 512 which would reduce > the maximum time to about 1.5ms. What do people think about increasing > the table size? Is this 1.5ms with interrupts disabled? This time period is problematic from a realtime perspective if so -- need to be able to preempt. Thanx, Paul > A little more general discussion about the logarithmic algorithm and > limiting the range. The hardware supports a 24 bit LFSR value. This > means the user can say is capture a sample every N cycles, where N is in > the range of 1 to 2^24. The OProfile user tool enforces a minimum value > of N to make sure the overhead of OProfile doesn't bring the machine to > its knees. The minimum values is not intended to guarantee the > performance impact of OProfile will not be significant. It is left as > an exercise for the user to pick an N that will give minimal performance > impact. We set the lower limit for N for SPU profiling to 100,000. This > is actually high enough that we don't seem to see much performance > impact when running OProfile. If the user picked N=2^24 then for a > 3.2GHz machine you would get about 200 samples per second on each node. > Where a sample consists of the PC value for all 8 SPUs on the node. If > the user wanted to do a relatively long OProfile run, I can see where > they might use N=2^24 to avoid gathering too much data. My gut feeling > is that the sampling frequency for N=2^24 is not low enough that someone > would never want to use it when doing long runs. Hence, we should not > arbitrarily reduce the maximum value for N. Although I would expect > that the typical value for N will be in the range of several hundred > thousand to a few million. > > As for using a logarithmic spacing of the precomputed values, this > approach means that the space between the precomputed values at the high > end would be much larger then 2^14, assuming 256 precomputed values. > That means it could take much longer then 3ms to get the needed LFSR > value for a large N. By evenly spacing the precomputed values, we can > ensure that for all N it will take less then 3ms to get the value. > Personally, I am more comfortable with a hard limit on the compute time > then a variable time that could get much bigger then the 1ms threshold > that Arnd wants for resched. Any thoughts? > > > > > > + > > > +/* This interface allows a profiler (e.g., OProfile) to store > > > + * spu_context information needed for profiling, allowing it to > > > + * be saved across context save/restore operation. > > > + * > > > + * Assumes the caller has already incremented the ref count to > > > + * profile_info; then spu_context_destroy must call kref_put > > > + * on prof_info_kref. > > > + */ > > > +void spu_set_profile_private(struct spu_context * ctx, void * > > > profile_info, > > > + struct kref * prof_info_kref, > > > + void (* prof_info_release) (struct kref * kref)) > > > +{ > > > + ctx->profile_private = profile_info; > > > + ctx->prof_priv_kref = prof_info_kref; > > > + ctx->prof_priv_release = prof_info_release; > > > +} > > > +EXPORT_SYMBOL_GPL(spu_set_profile_private); > > > > I think you don't need the profile_private member here, if you just use > > container_of with ctx->prof_priv_kref in all users. > > > > Arnd <>< > > ___ > cbe-oss-dev mailing list > [EMAIL PROTECTED] > https://ozlabs.org/mailman/listinfo/cbe-oss-dev - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Thursday 15 February 2007 21:21, Carl Love wrote: > I have done some quick measurements. The above method limits the loop > to at most 2^16 iterations. Based on running the algorithm in user > space, it takes about 3ms of computation time to do the loop 2^16 times. > > At the vary least, we need to put the resched in say every 10,000 > iterations which would be about every 0.5ms. Should we do a resched > more often? Yes, just to be on the safe side, I'd suggest to do it every 1000 iterations. > Additionally we could up the size of the table to 512 which would reduce > the maximum time to about 1.5ms. What do people think about increasing > the table size? No, that won't help too much. I'd say 256 or 128 entries is the most we should have. > As for using a logarithmic spacing of the precomputed values, this > approach means that the space between the precomputed values at the high > end would be much larger then 2^14, assuming 256 precomputed values. > That means it could take much longer then 3ms to get the needed LFSR > value for a large N. By evenly spacing the precomputed values, we can > ensure that for all N it will take less then 3ms to get the value. > Personally, I am more comfortable with a hard limit on the compute time > then a variable time that could get much bigger then the 1ms threshold > that Arnd wants for resched. Any thoughts? When using precomputed values on a logarithmic scale, I'd recommend just rounding to the closest value and accepting the relative inaccuracy, instead of using the precomputed value as the base and then calculating from there. Arnd <>< - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Thu, 2007-02-15 at 15:37 +0100, Arnd Bergmann wrote: > On Thursday 15 February 2007 00:52, Carl Love wrote: > > > > --- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/Kconfig 2007-01-18 > > 16:43:14.0 -0600 > > +++ linux-2.6.20-rc1/arch/powerpc/oprofile/Kconfig 2007-02-13 > > 19:04:46.271028904 -0600 > > @@ -7,7 +7,8 @@ > > > > config OPROFILE > > tristate "OProfile system profiling (EXPERIMENTAL)" > > - depends on PROFILING > > + default m > > + depends on SPU_FS && PROFILING > > help > > OProfile is a profiling system capable of profiling the > > whole system, include the kernel, kernel modules, libraries, > > Milton already commented on this being wrong. I think what you want > is > depends on PROFILING && (SPU_FS = n || SPU_FS) > > that should make sure that when SPU_FS=y that OPROFILE can not be 'm'. > > > @@ -15,3 +16,10 @@ > > > > If unsure, say N. > > > > +config OPROFILE_CELL > > + bool "OProfile for Cell Broadband Engine" > > + depends on SPU_FS && OPROFILE > > + default y > > + help > > + OProfile for Cell BE requires special support enabled > > + by this option. > > You should at least mention that this allows profiling the spus. > > > +#define EFWCALL ENOSYS /* Use an existing error number that is as > > +* close as possible for a FW call that failed. > > +* The probability of the call failing is > > +* very low. Passing up the error number > > +* ensures that the user will see an error > > +* message saying OProfile did not start. > > +* Dmesg will contain an accurate message > > +* about the failure. > > +*/ > > ENOSYS looks wrong though. It would appear to the user as if the oprofile > function in the kernel was not present. I'd suggest EIO, and not use > an extra define for that. > OK, will do. > > > static int > > rtas_ibm_cbe_perftools(int subfunc, int passthru, > >void *address, unsigned long length) > > { > > u64 paddr = __pa(address); > > > > - return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, passthru, > > -paddr >> 32, paddr & 0x, length); > > + pm_rtas_token = rtas_token("ibm,cbe-perftools"); > > + > > + if (unlikely(pm_rtas_token == RTAS_UNKNOWN_SERVICE)) { > > + printk(KERN_ERR > > + "%s: rtas token ibm,cbe-perftools unknown\n", > > + __FUNCTION__); > > + return -EFWCALL; > > + } else { > > + > > + return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, > > +passthru, paddr >> 32, paddr & 0x, length); > > + } > > } > > Are you now reading the rtas token every time you call rtas? that seems > like a waste of time. There are actually two RTAS calls, i.e. two tokens. Once for setting up the debug bus. The other to do the SPU PC collection. Yes, we are getting the token each time using the single global pm_rtas_token. To make sure we had the correct token, I made sure to call it each time. As you point out it is very wasteful. It probably would be best to just have a second global variable say spu_rtas_token. Then do a single call for each global variable. Then we just use the global variable in the appropriate rtas_call. This would eliminate a significant number of calls to look up the token. I should have thought of that earlier. > > > > +#define size 24 > > +#define ENTRIES (0x1<<8) /* 256 */ > > +#define MAXLFSR 0xFF > > + > > +int initial_lfsr[] = > > +{16777215, 3797240, 13519805, 11602690, 6497030, 7614675, 2328937, 2889445, > > + 12364575, 8723156, 2450594, 16280864, 14742496, 10904589, 6434212, > > 4996256, > > + 5814270, 13014041, 9825245, 410260, 904096, 15151047, 15487695, 3061843, > > + 16482682, 7938572, 4893279, 9390321, 4320879, 5686402, 1711063, 10176714, > > + 4512270, 1057359, 16700434, 5731602, 2070114, 16030890, 1208230, 15603106, > > + 11857845, 6470172, 1362790, 7316876, 8534496, 1629197, 10003072, 1714539, > > + 1814669, 7106700, 5427154, 3395151, 3683327, 12950450, 16620273, 12122372, > > + 7194999, 9952750, 3608260, 13604295, 2266835, 14943567, 7079230, 777380, > > + 4516801, 1737661, 8730333, 13796927, 3247181, 9950017, 3481896, 16527555, > > + 13116123, 14505033, 9781119, 4860212, 7403253, 13264219, 12269980, 100120, > > + 664506, 607795, 8274553, 13133688, 6215305, 13208866, 16439693, 3320753, > > + 8773582, 13874619, 1784784, 4513501, 11002978, 9318515, 3038856, 14254582, > > + 15484958, 15967857, 13504461, 13657322, 14724513, 13955736, 5695315, > > 7330509, > > + 12630101, 6826854, 439712, 4609055, 13288878, 1309632, 4996398, 11392266, > > + 793740, 7653789, 2472670, 14641200, 5164364, 5482529, 10415855, 1629108, > > + 2012376, 13661123, 14655718, 95340
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Thursday 15 February 2007 17:15, Maynard Johnson wrote: > >>+void spu_set_profile_private(struct spu_context * ctx, void * profile_info, > >>+ struct kref * prof_info_kref, > >>+ void (* prof_info_release) (struct kref * kref)) > >>+{ > >>+ ctx->profile_private = profile_info; > >>+ ctx->prof_priv_kref = prof_info_kref; > >>+ ctx->prof_priv_release = prof_info_release; > >>+} > >>+EXPORT_SYMBOL_GPL(spu_set_profile_private); > >> > >> > > > >I think you don't need the profile_private member here, if you just use > >container_of with ctx->prof_priv_kref in all users. > > > > > Sorry, I don't follow. We want the profile_private to be stored in the > spu_context, don't we? How else would I be able to do that? And > besides, wouldn't container_of need the struct name of profile_private? > SPUFS doesn't have access to the type. The idea was to have spu_get_profile_private return the kref pointer, and then change the user of that to do + if (!spu_info[spu_num] && the_spu) { + spu_info[spu_num] = container_of( + spu_get_profile_private(the_spu->ctx), + struct cached_info, cache_kref); + if (spu_info[spu_num]) + kref_get(&spu_info[spu_num]->cache_ref); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
Arnd Bergmann wrote: On Thursday 15 February 2007 00:52, Carl Love wrote: --- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/Kconfig 2007-01-18 16:43:14.0 -0600 +++ linux-2.6.20-rc1/arch/powerpc/oprofile/Kconfig 2007-02-13 19:04:46.271028904 -0600 @@ -7,7 +7,8 @@ config OPROFILE tristate "OProfile system profiling (EXPERIMENTAL)" - depends on PROFILING + default m + depends on SPU_FS && PROFILING help OProfile is a profiling system capable of profiling the whole system, include the kernel, kernel modules, libraries, Milton already commented on this being wrong. I think what you want is depends on PROFILING && (SPU_FS = n || SPU_FS) that should make sure that when SPU_FS=y that OPROFILE can not be 'm'. Blast it! I did this right on our development system, but neglected to update the patch correctly to remove this dependency and 'default m'. I'll fix in the next patch. @@ -15,3 +16,10 @@ If unsure, say N. +config OPROFILE_CELL + bool "OProfile for Cell Broadband Engine" + depends on SPU_FS && OPROFILE + default y + help + OProfile for Cell BE requires special support enabled + by this option. You should at least mention that this allows profiling the spus. OK. +#define EFWCALL ENOSYS /* Use an existing error number that is as +* close as possible for a FW call that failed. +* The probability of the call failing is +* very low. Passing up the error number +* ensures that the user will see an error +* message saying OProfile did not start. +* Dmesg will contain an accurate message +* about the failure. +*/ ENOSYS looks wrong though. It would appear to the user as if the oprofile function in the kernel was not present. I'd suggest EIO, and not use an extra define for that. Carl will reply to this. static int rtas_ibm_cbe_perftools(int subfunc, int passthru, void *address, unsigned long length) { u64 paddr = __pa(address); - return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, passthru, -paddr >> 32, paddr & 0x, length); + pm_rtas_token = rtas_token("ibm,cbe-perftools"); + + if (unlikely(pm_rtas_token == RTAS_UNKNOWN_SERVICE)) { + printk(KERN_ERR + "%s: rtas token ibm,cbe-perftools unknown\n", + __FUNCTION__); + return -EFWCALL; + } else { + + return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, + passthru, paddr >> 32, paddr & 0x, length); + } } Are you now reading the rtas token every time you call rtas? that seems like a waste of time. Carl will reply. +#define size 24 +#define ENTRIES (0x1<<8) /* 256 */ +#define MAXLFSR 0xFF + +int initial_lfsr[] = +{16777215, 3797240, 13519805, 11602690, 6497030, 7614675, 2328937, 2889445, + 12364575, 8723156, 2450594, 16280864, 14742496, 10904589, 6434212, 4996256, + 5814270, 13014041, 9825245, 410260, 904096, 15151047, 15487695, 3061843, + 16482682, 7938572, 4893279, 9390321, 4320879, 5686402, 1711063, 10176714, + 4512270, 1057359, 16700434, 5731602, 2070114, 16030890, 1208230, 15603106, + 11857845, 6470172, 1362790, 7316876, 8534496, 1629197, 10003072, 1714539, + 1814669, 7106700, 5427154, 3395151, 3683327, 12950450, 16620273, 12122372, + 7194999, 9952750, 3608260, 13604295, 2266835, 14943567, 7079230, 777380, + 4516801, 1737661, 8730333, 13796927, 3247181, 9950017, 3481896, 16527555, + 13116123, 14505033, 9781119, 4860212, 7403253, 13264219, 12269980, 100120, + 664506, 607795, 8274553, 13133688, 6215305, 13208866, 16439693, 3320753, + 8773582, 13874619, 1784784, 4513501, 11002978, 9318515, 3038856, 14254582, + 15484958, 15967857, 13504461, 13657322, 14724513, 13955736, 5695315, 7330509, + 12630101, 6826854, 439712, 4609055, 13288878, 1309632, 4996398, 11392266, + 793740, 7653789, 2472670, 14641200, 5164364, 5482529, 10415855, 1629108, + 2012376, 13661123, 14655718, 9534083, 16637925, 2537745, 9787923, 12750103, + 4660370, 3283461, 14862772, 7034955, 6679872, 8918232, 6506913, 103649, + 6085577, 13324033, 14251613, 11058220, 11998181, 3100233, 468898, 7104918, + 12498413, 14408165, 1208514, 15712321, 3088687, 14778333, 3632503, 11151952, + 98896, 9159367, 8866146, 4780737, 4925758, 12362320, 4122783, 8543358, + 7056879, 10876914, 6282881, 1686625, 5100373, 4573666, 9265515, 13593840, + 5853060, 110, 4237111, 1576, 14344137, 4608332, 6590210, 13745050, + 10916568, 12340402, 7145275, 4417153, 2300360, 12079643, 7608534, 15238251, + 4947424, 7014722, 3984546, 7168073, 10759589, 16293080, 3757181, 4577717, + 5163790, 2488841
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Thursday 15 February 2007 00:52, Carl Love wrote: > --- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/Kconfig 2007-01-18 > 16:43:14.0 -0600 > +++ linux-2.6.20-rc1/arch/powerpc/oprofile/Kconfig2007-02-13 > 19:04:46.271028904 -0600 > @@ -7,7 +7,8 @@ > > config OPROFILE > tristate "OProfile system profiling (EXPERIMENTAL)" > - depends on PROFILING > + default m > + depends on SPU_FS && PROFILING > help > OProfile is a profiling system capable of profiling the > whole system, include the kernel, kernel modules, libraries, Milton already commented on this being wrong. I think what you want is depends on PROFILING && (SPU_FS = n || SPU_FS) that should make sure that when SPU_FS=y that OPROFILE can not be 'm'. > @@ -15,3 +16,10 @@ > > If unsure, say N. > > +config OPROFILE_CELL > + bool "OProfile for Cell Broadband Engine" > + depends on SPU_FS && OPROFILE > + default y > + help > + OProfile for Cell BE requires special support enabled > + by this option. You should at least mention that this allows profiling the spus. > +#define EFWCALL ENOSYS /* Use an existing error number that is as > + * close as possible for a FW call that failed. > + * The probability of the call failing is > + * very low. Passing up the error number > + * ensures that the user will see an error > + * message saying OProfile did not start. > + * Dmesg will contain an accurate message > + * about the failure. > + */ ENOSYS looks wrong though. It would appear to the user as if the oprofile function in the kernel was not present. I'd suggest EIO, and not use an extra define for that. > static int > rtas_ibm_cbe_perftools(int subfunc, int passthru, > void *address, unsigned long length) > { > u64 paddr = __pa(address); > > - return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, passthru, > - paddr >> 32, paddr & 0x, length); > + pm_rtas_token = rtas_token("ibm,cbe-perftools"); > + > + if (unlikely(pm_rtas_token == RTAS_UNKNOWN_SERVICE)) { > + printk(KERN_ERR > +"%s: rtas token ibm,cbe-perftools unknown\n", > +__FUNCTION__); > + return -EFWCALL; > + } else { > + > + return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, > + passthru, paddr >> 32, paddr & 0x, length); > + } > } Are you now reading the rtas token every time you call rtas? that seems like a waste of time. > +#define size 24 > +#define ENTRIES (0x1<<8) /* 256 */ > +#define MAXLFSR 0xFF > + > +int initial_lfsr[] = > +{16777215, 3797240, 13519805, 11602690, 6497030, 7614675, 2328937, 2889445, > + 12364575, 8723156, 2450594, 16280864, 14742496, 10904589, 6434212, 4996256, > + 5814270, 13014041, 9825245, 410260, 904096, 15151047, 15487695, 3061843, > + 16482682, 7938572, 4893279, 9390321, 4320879, 5686402, 1711063, 10176714, > + 4512270, 1057359, 16700434, 5731602, 2070114, 16030890, 1208230, 15603106, > + 11857845, 6470172, 1362790, 7316876, 8534496, 1629197, 10003072, 1714539, > + 1814669, 7106700, 5427154, 3395151, 3683327, 12950450, 16620273, 12122372, > + 7194999, 9952750, 3608260, 13604295, 2266835, 14943567, 7079230, 777380, > + 4516801, 1737661, 8730333, 13796927, 3247181, 9950017, 3481896, 16527555, > + 13116123, 14505033, 9781119, 4860212, 7403253, 13264219, 12269980, 100120, > + 664506, 607795, 8274553, 13133688, 6215305, 13208866, 16439693, 3320753, > + 8773582, 13874619, 1784784, 4513501, 11002978, 9318515, 3038856, 14254582, > + 15484958, 15967857, 13504461, 13657322, 14724513, 13955736, 5695315, > 7330509, > + 12630101, 6826854, 439712, 4609055, 13288878, 1309632, 4996398, 11392266, > + 793740, 7653789, 2472670, 14641200, 5164364, 5482529, 10415855, 1629108, > + 2012376, 13661123, 14655718, 9534083, 16637925, 2537745, 9787923, 12750103, > + 4660370, 3283461, 14862772, 7034955, 6679872, 8918232, 6506913, 103649, > + 6085577, 13324033, 14251613, 11058220, 11998181, 3100233, 468898, 7104918, > + 12498413, 14408165, 1208514, 15712321, 3088687, 14778333, 3632503, 11151952, > + 98896, 9159367, 8866146, 4780737, 4925758, 12362320, 4122783, 8543358, > + 7056879, 10876914, 6282881, 1686625, 5100373, 4573666, 9265515, 13593840, > + 5853060, 110, 4237111, 1576, 14344137, 4608332, 6590210, 13745050, > + 10916568, 12340402, 7145275, 4417153, 2300360, 12079643, 7608534, 15238251, > + 4947424, 7014722, 3984546, 7168073, 10759589, 16293080, 3757181, 4577717, > + 5163790, 2488841, 4650617, 3650022, 5440654, 1814617, 6939232, 15540909, > + 501788, 1060986, 5058235, 5078222, 3734500, 10762065, 390862, 5172712, > + 1070780, 790442
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Sun, 2007-02-11 at 16:46 -0600, Milton Miller wrote: [cut] > > As far as I understand, you are providing access to a completely new > hardware that is related to the PMU hardware by the fact that it > collects a program counter. It doesn't use the PMU counters nor the > PMU event selection. > > In fact, why can the existing op_model_cell profiling not run while > the SPU profiling runs? Is there a shared debug bus inside the > chip? Or just the data stream with your buffer_sync code? > There are two reasons you cannot do SPU profiling and profiling on non SPU events at the sametime. 1) the SPU PC values are routed on the debug bus. You cannot also route the signals for other non SPU events on the debug bus since they will conflict. Specifically, the signals would get logically OR'd on the bus. The exception is PPU cycles which does not use the debug bus. 2) the hardware that captures the SPU program counters has some shared components with the HW performance counters. To use the SPU program counter hardware, the performance counter hardware must be disabled. In summary, we cannot do SPU cycle profiling and non SPU event profiling at the same time due to limitations in the hardware. [cut] > milton > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Feb 9, 2007, at 10:17 AM, Carl Love wrote: On Thu, 2007-02-08 at 20:46 -0600, Milton Miller wrote: On Feb 8, 2007, at 4:51 PM, Carl Love wrote: On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote: On Thursday 08 February 2007 15:18, Milton Miller wrote: 1) sample rate setup In the current patch, the user specifies a sample rate as a time interval. The kernel is (a) calling cpufreq to get the current cpu frequency, (b) converting the rate to a cycle count, (c) converting this to a 24 bit LFSR count, an iterative algorithm (in this patch, starting from one of 256 values so a max of 2^16 or 64k iterations), (d) calculating an trace unload interval. In addition, a cpufreq notifier is registered to recalculate on frequency changes. No. The user issues the command opcontrol --event:N where N is the number of events (cycles, l2 cache misses, instructions retired etc) that are to elapse between collecting the samples. So you are saying that b and c are primary, and a is used to calculate a safe value for d. All of the above work is dont, just from a different starting point? There are two things 1) setup the LFSR to control how often the Hardware puts samples into the trace buffer. 2) setup the kernel timer to read the trace buffer (your d, d is a function of the cpu freq) and process the samples. (c is nonsense) Well, its cacluate the rate from the count vs count from the rate. The kernel timer was set with the goal the the hardware trace buffer would not get more then half full to ensure we would not lose samples even for the maximum rate that the hardware would be adding samples to the trace buffer (user specified N=100,000). That should be well commented. The OProfile passes the value N to the kernel via the variable ctr[i].count. Where i is the performance counter entry for that event. Ok I haven't looked a the api closely. Actually, the oprofile userspace fills out files in a file system to tell the kernel what it needs to know. The powerpc code defines the reources needed to use the PMU hardware, which is (1) common mux selects, for processors that need them, and (2) a set of directories, one for each of the pmu counters, each of which contain the controls for that counter (such as enable kernel space, enable user space, event timeout to interrupt or sample collection, etc. The ctr[i].count is one of these files. Specifically with SPU profiling, we do not use performance counters because the CELL HW does not allow the normal the PPU to read the SPU PC when a performance counter interrupt occurs. We are using some additional hw support in the chip that allows us to periodically capture the SPU PC. There is an LFSR hardware counter that can be started at an arbitrary LFSR value. When the last LFSR value in the sequence is reached, a sample is taken and stored in the trace buffer. Hence, the value of N specified by the user must get converted to the LFSR value that is N from the end of the sequence. Ok so its arbitray load count to max vs count and compare. A critical detail when computing the value to load, but the net result is the same; the value for the count it hard to determine. The above statement makes no sense to me. I think I talked past you. Your description of hadrware vs mine was different in that the counter always ends at a specified point and is loaded with the variable count, where mine had it comparing to the count as it incremented. However, I now see that you were referring to the fact that what the user specifes, the count, has to be converted to a LFSR value. My point is this can be done in the user-space oprofile code. It already has to look up magic numbers for setting up event selection muxes for other hardware, adding a lfsr calculation is not beyond resoon. Nor is having it provide two values in two files. Determining the initial LFSR value that is N values from the last value in the sequence is not easy to do. Well, its easy, its just order(N). The same clock that the processor is running at is used to control the LFSR count. Hence the LFSR counter increments once per CPU clock cycle regardless of the CPU frequency or changes in the frequency. There is no calculation for the LFSR value that is a function of the processor frequency. There is no need to adjust the LFSR when the processor frequency changes. Oh, so the lfsr doesn't have to be recomputed, only the time between unloads. The LFSR value is computed ONCE when you start OProfile. The value is setup in the hardware once when OProfile starts. The hardware will always restart with the value given to it after it reaches the last value in the sequence. What you call the "time between unloads" is the time at which you schedule the kernel routine to empty the trace buffer. It is calculated once. It would only need to be recomputed if the cpu frequency changed. The obvious problem is
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Feb 9, 2007, at 1:10 PM, Arnd Bergmann wrote: On Friday 09 February 2007 19:47, Milton Miller wrote: On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote: Doing the translation in two stages in user space, as you suggest here, definitely makes sense to me. I think it can be done a little simpler though: Why would you need the accurate dcookie information to be provided by the kernel? The ELF loader is done in user space, and the kernel only reproduces what it thinks that came up with. If the kernel only gives the dcookie information about the SPU ELF binary to the oprofile user space, then that can easily recreate the same mapping. Actually, I was trying to cover issues such as anonymous memory. If the kernel doesn't generate dcookies for the load segments the information is not recoverable once the task exits. This would also allow the loader to create an artifical elf header that covered both the base executable and a dynamicly linked one. Other alternatives exist, such as a structure for context-id that would have its own identifing magic and an array of elf header pointers. But _why_ do you want to solve that problem? we don't have dynamically linked binaries and I really don't see why the loader would want to create artificial elf headers... I'm explainign how they could be handled. Actully I think the other proposal (an identified structure that points to sevear elf headers) would be more approprate. As you point out, if there are presently no dynamic libraries in use, it doesn't have to be solved today. I'm just trying to make the code future proof, or at least a clear path forward. The kernel still needs to provide the overlay identifiers though. Yes, which means it needs to parse the header (or libpse be enhanced to write the monitor info to a spufs file). we thought about this in the past and discarded it because of the complexity of an spufs interface that would handle this correctly. Not sure what would be difficuult, and it would allow other binary formats. But parsing the headers in the kernel means existing userspace doesn't have to be upgraded, so I am not proposing this requirement. yes, this sounds nice. But tt does not at all help accuracy, only performance, right? It allows the user space to know when the sample was taken and be aware of the ambiguity. If the sample rate is high enough in relation to the overlay switch rate, user space could decide to discard the ambiguous samples. yes, good point. This approach allows multiple objects by its nature. A new elf header could be constructed in memory that contained the union of the elf objects load segments, and the tools will magically work. Alternatively the object id could point to a new structure, identified via a new header, that it points to other elf headers (easily differentiated by the elf magic headers). Other binary formats, including several objects in a ar archive, could be supported. Yes, that would be a new feature if the kernel passed dcookie information for every section, but I doubt that it is worth it. I have not seen any program that allows loading code from more than one ELF file. In particular, the ELF format on the SPU is currently lacking the relocation mechanisms that you would need for resolving spu-side symbols at load time Actually, It could check all load segments, and only report those where the dcookie changes (as opposed to the offset). I'm not really following you here, but probably you misunderstood my point as well. I was thinking in terms of dyanmic libraries, and totally skipped your comment about the relocaiton info being missing. My reply point was that the table could be compressed to the current entry if all hased to the same vm area. This seems to incur a run-time overhead on the SPU even if not profiling, I would consider that not acceptable. It definitely is overhead. Which means it would have to be optional, like gprof. There is some work going on for another profiler independent of the hardware feature that only relies on instrumenting the spu executable for things like DMA transfers and overlay changes. Regardless, its beyond the current scope. milton - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Friday 09 February 2007 19:47, Milton Miller wrote: > On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote: > > > Doing the translation in two stages in user space, as you > > suggest here, definitely makes sense to me. I think it > > can be done a little simpler though: > > > > Why would you need the accurate dcookie information to be > > provided by the kernel? The ELF loader is done in user > > space, and the kernel only reproduces what it thinks that > > came up with. If the kernel only gives the dcookie information > > about the SPU ELF binary to the oprofile user space, then > > that can easily recreate the same mapping. > > Actually, I was trying to cover issues such as anonymous > memory. If the kernel doesn't generate dcookies for > the load segments the information is not recoverable once > the task exits. This would also allow the loader to create > an artifical elf header that covered both the base executable > and a dynamicly linked one. > > Other alternatives exist, such as a structure for context-id > that would have its own identifing magic and an array of elf > header pointers. But _why_ do you want to solve that problem? we don't have dynamically linked binaries and I really don't see why the loader would want to create artificial elf headers... > > The kernel still needs to provide the overlay identifiers > > though. > > Yes, which means it needs to parse the header (or libpse > be enhanced to write the monitor info to a spufs file). we thought about this in the past and discarded it because of the complexity of an spufs interface that would handle this correctly. > > yes, this sounds nice. But tt does not at all help accuracy, > > only performance, right? > > It allows the user space to know when the sample was taken > and be aware of the ambiguity. If the sample rate is > high enough in relation to the overlay switch rate, user space > could decide to discard the ambiguous samples. yes, good point. > >> This approach allows multiple objects by its nature. A new > >> elf header could be constructed in memory that contained > >> the union of the elf objects load segments, and the tools > >> will magically work. Alternatively the object id could > >> point to a new structure, identified via a new header, that > >> it points to other elf headers (easily differentiated by the > >> elf magic headers). Other binary formats, including several > >> objects in a ar archive, could be supported. > > > > Yes, that would be a new feature if the kernel passed dcookie > > information for every section, but I doubt that it is worth > > it. I have not seen any program that allows loading code > > from more than one ELF file. In particular, the ELF format > > on the SPU is currently lacking the relocation mechanisms > > that you would need for resolving spu-side symbols at load > > time > > Actually, It could check all load segments, and only report > those where the dcookie changes (as opposed to the offset). I'm not really following you here, but probably you misunderstood my point as well. > > This seems to incur a run-time overhead on the SPU even if not > > profiling, I would consider that not acceptable. > > It definitely is overhead. Which means it would have to be > optional, like gprof. There is some work going on for another profiler independent of the hardware feature that only relies on instrumenting the spu executable for things like DMA transfers and overlay changes. Arnd <>< - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote: On Thursday 08 February 2007 15:18, Milton Miller wrote: The current patch specifically identifies that only single elf objects are handled. There is no code to handle dynamic linked libraries or overlays. Nor is there any method to present samples that may have been collected during context switch processing, they must be discarded. I thought it already did handle overlays, what did I miss here? It does, see my reply to Maynard. Not sure what I was thinking when I wrote this, possibly I was thinking of the inaccuracies. My proposal is to change what is presented to user space. Instead of trying to translate the SPU address to the backing file as the samples are recorded, store the samples as the SPU context and address. The context switch would record tid, pid, object id as it does now. In addition, if this is a new object-id, the kernel would read elf headers as it does today. However, it would then proceed to provide accurate dcookie information for each loader region and overlay. Doing the translation in two stages in user space, as you suggest here, definitely makes sense to me. I think it can be done a little simpler though: Why would you need the accurate dcookie information to be provided by the kernel? The ELF loader is done in user space, and the kernel only reproduces what it thinks that came up with. If the kernel only gives the dcookie information about the SPU ELF binary to the oprofile user space, then that can easily recreate the same mapping. Actually, I was trying to cover issues such as anonymous memory. If the kernel doesn't generate dcookies for the load segments the information is not recoverable once the task exits. This would also allow the loader to create an artifical elf header that covered both the base executable and a dynamicly linked one. Other alternatives exist, such as a structure for context-id that would have its own identifing magic and an array of elf header pointers. The kernel still needs to provide the overlay identifiers though. Yes, which means it needs to parse the header (or libpse be enhanced to write the monitor info to a spufs file). To identify which overlays are active, (instead of the present read on use and search the list to translate approach) the kernel would record the location of the overlay identifiers as it parsed the kernel, but would then read the identification word and would record the present value as an sample from a separate but related stream. The kernel could maintain the last value for each overlay and only send profile events for the deltas. right. This approach trades translation lookup overhead for each recorded sample for a burst of data on new context activation. In addition it exposes the sample point of the overlay identifier vs the address collection. This allows the ambiguity to be exposed to user space. In addition, with the above proposed kernel timer vs sample collection, user space could limit the elapsed time between the address collection and the overlay id check. yes, this sounds nice. But tt does not at all help accuracy, only performance, right? It allows the user space to know when the sample was taken and be aware of the ambiguity. If the sample rate is high enough in relation to the overlay switch rate, user space could decide to discard the ambiguous samples. This approach allows multiple objects by its nature. A new elf header could be constructed in memory that contained the union of the elf objects load segments, and the tools will magically work. Alternatively the object id could point to a new structure, identified via a new header, that it points to other elf headers (easily differentiated by the elf magic headers). Other binary formats, including several objects in a ar archive, could be supported. Yes, that would be a new feature if the kernel passed dcookie information for every section, but I doubt that it is worth it. I have not seen any program that allows loading code from more than one ELF file. In particular, the ELF format on the SPU is currently lacking the relocation mechanisms that you would need for resolving spu-side symbols at load time Actually, It could check all load segments, and only report those where the dcookie changes (as opposed to the offset). . If better overlay identification is required, in theory the overlay switch code could be augmented to record the switches (DMA reference time from the PowerPC memory and record a relative decrementer in the SPU), this is obviously a future item. But it is facilitated by having user space resolve the SPU to source file translation. This seems to incur a run-time overhead on the SPU even if not profiling, I would consider that not acceptable. It definitely is overhead. Which means it would have to be optional, like gprof. milton - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Feb 8, 2007, at 5:59 PM, Maynard Johnson wrote: Milton, Thank you for your comments. Carl will reply to certain parts of your posting where he's more knowledgeable than I. See my replies below. Thanks for the pleasant tone and dialog. Milton Miller wrote: On Feb 6, 2007, at 5:02 PM, Carl Love wrote: This is the first update to the patch previously posted by Maynard Johnson as "PATCH 4/4. Add support to OProfile for profiling CELL". Data collected The current patch starts tackling these translation issues for the presently common case of a static self contained binary from a single file, either single separate source file or embedded in the data of the host application. When creating the trace entry for a SPU context switch, it records the application owner, pid, tid, and dcookie of the main executable. It addition, it looks up the object-id as a virtual address and records the offset if it is non-zero, or the dcookie of the object if it is zero. The code then creates a data structure by reading the elf headers from the user process (at the address given by the object-id) and building a list of SPU address to elf object offsets, as specified by the ELF loader headers. In addition to the elf loader section, it processes the overlay headers and records the address, size, and magic number of the overlay. When the hardware trace entries are processed, each address is looked up this structure and translated to the elf offset. If it is an overlay region, the overlay identify word is read and the list is searched for the matching overlay. The resulting offset is sent to the oprofile system. The current patch specifically identifies that only single elf objects are handled. There is no code to handle dynamic linked libraries or overlays. Nor is there any method to Yes, we do handle overlays. (Note: I'm looking into a bug right now in our overlay support.) I knew you handled overlays, and I did not mean to say that you did not. I am not sure how that got there. I may have been thinking of the kernel supplied context switch code discussion, or how the code supplied dcookie or offset but not both. Actually, I might have been referring to the fact that overlays are guessed rather than recorded. present samples that may have been collected during context switch processing, they must be discarded. My proposal is to change what is presented to user space. Instead of trying to translate the SPU address to the backing file as the samples are recorded, store the samples as the SPU context and address. The context switch would record tid, pid, object id as it does now. In addition, if this is a new object-id, the kernel would read elf headers as it does today. However, it would then proceed to provide accurate dcookie information for each loader region and overlay. To identify which overlays are active, (instead of the present read on use and search the list to translate approach) the kernel would record the location of the overlay identifiers as it parsed the kernel, but would then read the identification word and would record the present value as an sample from a separate but related stream. The kernel could maintain the last value for each overlay and only send profile events for the deltas. Discussions on this topic in the past have resulted in the current implementation precisely because we're able to record the samples as fileoffsets, just as the userspace tools expect. I was not part of the previous discussions, so please forgive me. I haven't had time to check out how much this would impact the userspace tools, but my gut feel is that it would be quite significant. If we were developing this module with a matching newly-created userspace tool, I would be more inclined to agree that this makes sense. I have not yet studied the user space tool. In fact, when I made this proposal, I had not studied the kernel oprofile code either, although I had read the concepts and discussion of the event buffer when the base patch was added to the kernel. I have read and now consider myself to have some understanding of the kernel side. I note that the user space tool calls itself alpha and the kernel support experimental. I only looked at the user space enough to determine it is written in C++. I would hope the tool would be modular enough to insert a data transformation pass to do this conversion that the kernel is doing. But you give no rationale for your proposal that justifies the change. The current implementation works, it has no impact on normal, non-profiling behavior, and the overhead during profiling is not noticeable. I was proposing this for several of reasons. One, there were expressed limitations in the current proposal. There is a requirement that everything be linked into one ELF object for it to be profiled seemed significant to me. This implies that shared libraries (well, dynamic linked ones) have no path forward in this framew
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Thu, 2007-02-08 at 20:46 -0600, Milton Miller wrote: > On Feb 8, 2007, at 4:51 PM, Carl Love wrote: > > > On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote: > >> On Thursday 08 February 2007 15:18, Milton Miller wrote: > >> > >>> 1) sample rate setup > >>> > >>> In the current patch, the user specifies a sample rate as a time > >>> interval. > >>> The kernel is (a) calling cpufreq to get the current cpu > >>> frequency, > >>> (b) > >>> converting the rate to a cycle count, (c) converting this to a > >>> 24 bit > >>> LFSR count, an iterative algorithm (in this patch, starting from > >>> one of 256 values so a max of 2^16 or 64k iterations), (d) > >>> calculating > >>> an trace unload interval. In addition, a cpufreq notifier is > >>> registered > >>> to recalculate on frequency changes. > > > > No. The user issues the command opcontrol --event:N where N is the > > number of events (cycles, l2 cache misses, instructions retired etc) > > that are to elapse between collecting the samples. > > So you are saying that b and c are primary, and a is used to calculate > a safe value for d. All of the above work is dont, just from a > different starting point? There are two things 1) setup the LFSR to control how often the Hardware puts samples into the trace buffer. 2) setup the kernel timer to read the trace buffer (your d, d is a function of the cpu freq) and process the samples. (c is nonsense) The kernel timer was set with the goal the the hardware trace buffer would not get more then half full to ensure we would not lose samples even for the maximum rate that the hardware would be adding samples to the trace buffer (user specified N=100,000). > > > > The OProfile passes > > the value N to the kernel via the variable ctr[i].count. Where i is > > the > > performance counter entry for that event. > > Ok I haven't looked a the api closely. > > > Specifically with SPU > > profiling, we do not use performance counters because the CELL HW does > > not allow the normal the PPU to read the SPU PC when a performance > > counter interrupt occurs. We are using some additional hw support in > > the chip that allows us to periodically capture the SPU PC. There is > > an > > LFSR hardware counter that can be started at an arbitrary LFSR value. > > When the last LFSR value in the sequence is reached, a sample is taken > > and stored in the trace buffer. Hence, the value of N specified by the > > user must get converted to the LFSR value that is N from the end of the > > sequence. > > Ok so its arbitray load count to max vs count and compare. A critical > detail when computing the value to load, but the net result is the > same; the value for the count it hard to determine. The above statement makes no sense to me. Determining the initial LFSR value that is N values from the last value in the sequence is not easy to do. > > > The same clock that the processor is running at is used to > > control the LFSR count. Hence the LFSR counter increments once per CPU > > clock cycle regardless of the CPU frequency or changes in the > > frequency. > > There is no calculation for the LFSR value that is a function of the > > processor frequency. There is no need to adjust the LFSR when the > > processor frequency changes. > > > > Oh, so the lfsr doesn't have to be recomputed, only the time > between unloads. The LFSR value is computed ONCE when you start OProfile. The value is setup in the hardware once when OProfile starts. The hardware will always restart with the value given to it after it reaches the last value in the sequence. What you call the "time between unloads" is the time at which you schedule the kernel routine to empty the trace buffer. It is calculated once. It would only need to be recomputed if the cpu frequency changed. > > > > > Milton had a comment about the code > > > > if (ctr[0].event == SPU_CYCLES_EVENT_NUM) { > >> + spu_cycle_reset = ctr[0].count; > >> + return; > >> + } > > > > Well, given the above description, it is clear that if you are doing > > SPU > > event profiling, the value N is put into the cntr[0].cnt entry since > > there is only one event. Thus in cell_global_start_spu() you use > > spu_cycle_reset as the argument to the lfsr calculation routine to get > > the LFSR value that is N from the end of the sequence. > > I was looking at the patch and the context was not very good. You > might consider adding -p to your diff command, it provides the function > name after the @@. > > However, in this case, I think I just need to see the final result. > > > > >>> > >>> The obvious problem is step (c), running a loop potentially 64 > >>> thousand > >>> times in kernel space will have a noticeable impact on other > >>> threads. > >>> > >>> I propose instead that user space perform the above 4 steps, and > >>> provide > >>> the kernel with two inputs: (1) the v
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Feb 8, 2007, at 4:51 PM, Carl Love wrote: On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote: On Thursday 08 February 2007 15:18, Milton Miller wrote: 1) sample rate setup In the current patch, the user specifies a sample rate as a time interval. The kernel is (a) calling cpufreq to get the current cpu frequency, (b) converting the rate to a cycle count, (c) converting this to a 24 bit LFSR count, an iterative algorithm (in this patch, starting from one of 256 values so a max of 2^16 or 64k iterations), (d) calculating an trace unload interval. In addition, a cpufreq notifier is registered to recalculate on frequency changes. No. The user issues the command opcontrol --event:N where N is the number of events (cycles, l2 cache misses, instructions retired etc) that are to elapse between collecting the samples. So you are saying that b and c are primary, and a is used to calculate a safe value for d. All of the above work is dont, just from a different starting point? The OProfile passes the value N to the kernel via the variable ctr[i].count. Where i is the performance counter entry for that event. Ok I haven't looked a the api closely. Specifically with SPU profiling, we do not use performance counters because the CELL HW does not allow the normal the PPU to read the SPU PC when a performance counter interrupt occurs. We are using some additional hw support in the chip that allows us to periodically capture the SPU PC. There is an LFSR hardware counter that can be started at an arbitrary LFSR value. When the last LFSR value in the sequence is reached, a sample is taken and stored in the trace buffer. Hence, the value of N specified by the user must get converted to the LFSR value that is N from the end of the sequence. Ok so its arbitray load count to max vs count and compare. A critical detail when computing the value to load, but the net result is the same; the value for the count it hard to determine. The same clock that the processor is running at is used to control the LFSR count. Hence the LFSR counter increments once per CPU clock cycle regardless of the CPU frequency or changes in the frequency. There is no calculation for the LFSR value that is a function of the processor frequency. There is no need to adjust the LFSR when the processor frequency changes. Oh, so the lfsr doesn't have to be recomputed, only the time between unloads. Milton had a comment about the code if (ctr[0].event == SPU_CYCLES_EVENT_NUM) { + spu_cycle_reset = ctr[0].count; + return; + } Well, given the above description, it is clear that if you are doing SPU event profiling, the value N is put into the cntr[0].cnt entry since there is only one event. Thus in cell_global_start_spu() you use spu_cycle_reset as the argument to the lfsr calculation routine to get the LFSR value that is N from the end of the sequence. I was looking at the patch and the context was not very good. You might consider adding -p to your diff command, it provides the function name after the @@. However, in this case, I think I just need to see the final result. The obvious problem is step (c), running a loop potentially 64 thousand times in kernel space will have a noticeable impact on other threads. I propose instead that user space perform the above 4 steps, and provide the kernel with two inputs: (1) the value to load in the LFSR and (2) the periodic frequency / time interval at which to empty the hardware trace buffer, perform sample analysis, and send the data to the oprofile subsystem. There should be no security issues with this approach. If the LFSR value is calculated incorrectly, either it will be too short, causing the trace array to overfill and data to be dropped, or it will be too long, and there will be fewer samples. Likewise, the kernel periodic poll can be too long, again causing overflow, or too frequent, causing only timer execution overhead. Various data is collected by the kernel while processing the periodic timer, this approach would also allow the profiling tools to control the frequency of this collection. More frequent collection results in more accurate sample data, with the linear cost of poll execution overhead. Frequency changes can be handled either by the profile code setting collection at a higher than necessary rate, or by interacting with the governor to limit the speeds. Optionally, the kernel can add a record indicating that some data was likely dropped if it is able to read all 256 entries without underflowing the array. This can be used as hint to user space that the kernel time was too long for the collection rate. Moving the sample rate computation to user space sounds like the right idea, but why not have a more drastic version of it: No, I do not
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
Milton, Thank you for your comments. Carl will reply to certain parts of your posting where he's more knowledgeable than I. See my replies below. -Maynard Milton Miller wrote: On Feb 6, 2007, at 5:02 PM, Carl Love wrote: This is the first update to the patch previously posted by Maynard Johnson as "PATCH 4/4. Add support to OProfile for profiling CELL". [snip] Data collected The current patch starts tackling these translation issues for the presently common case of a static self contained binary from a single file, either single separate source file or embedded in the data of the host application. When creating the trace entry for a SPU context switch, it records the application owner, pid, tid, and dcookie of the main executable. It addition, it looks up the object-id as a virtual address and records the offset if it is non-zero, or the dcookie of the object if it is zero. The code then creates a data structure by reading the elf headers from the user process (at the address given by the object-id) and building a list of SPU address to elf object offsets, as specified by the ELF loader headers. In addition to the elf loader section, it processes the overlay headers and records the address, size, and magic number of the overlay. When the hardware trace entries are processed, each address is looked up this structure and translated to the elf offset. If it is an overlay region, the overlay identify word is read and the list is searched for the matching overlay. The resulting offset is sent to the oprofile system. The current patch specifically identifies that only single elf objects are handled. There is no code to handle dynamic linked libraries or overlays. Nor is there any method to Yes, we do handle overlays. (Note: I'm looking into a bug right now in our overlay support.) present samples that may have been collected during context switch processing, they must be discarded. My proposal is to change what is presented to user space. Instead of trying to translate the SPU address to the backing file as the samples are recorded, store the samples as the SPU context and address. The context switch would record tid, pid, object id as it does now. In addition, if this is a new object-id, the kernel would read elf headers as it does today. However, it would then proceed to provide accurate dcookie information for each loader region and overlay. To identify which overlays are active, (instead of the present read on use and search the list to translate approach) the kernel would record the location of the overlay identifiers as it parsed the kernel, but would then read the identification word and would record the present value as an sample from a separate but related stream. The kernel could maintain the last value for each overlay and only send profile events for the deltas. Discussions on this topic in the past have resulted in the current implementation precisely because we're able to record the samples as fileoffsets, just as the userspace tools expect. I haven't had time to check out how much this would impact the userspace tools, but my gut feel is that it would be quite significant. If we were developing this module with a matching newly-created userspace tool, I would be more inclined to agree that this makes sense. But you give no rationale for your proposal that justifies the change. The current implementation works, it has no impact on normal, non-profiling behavior, and the overhead during profiling is not noticeable. This approach trades translation lookup overhead for each recorded sample for a burst of data on new context activation. In addition it exposes the sample point of the overlay identifier vs the address collection. This allows the ambiguity to be exposed to user space. In addition, with the above proposed kernel timer vs sample collection, user space could limit the elapsed time between the address collection and the overlay id check. Yes, there is a window here where an overlay could occur before we finish processing a group of samples that were actually taken from a different overlay. The obvious way to prevent that is for the kernel (or SPUFS) to be notified of the overlay and let OProfile know that we need to drain (perhaps discard would be best) our sample trace buffer. As you indicate above, your proposal faces the same issue, but would just decrease the number of bogus samples. I contend that the relative number of bogus samples will be quite low in either case. Ideally, we should have a mechanism to eliminate them completely so as to avoid confusion the user's part when they're looking at a report. Even a few bogus samples in the wrong place can be troubling. Such a mechanism will be a good future enhancement. [snip] milton -- [EMAIL PROTECTED] Milton Miller Speaking for myself only. ___ Linuxppc-dev mailing list [EMAIL PROTEC
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote: > On Thursday 08 February 2007 15:18, Milton Miller wrote: > > > 1) sample rate setup > > > > In the current patch, the user specifies a sample rate as a time > > interval. > > The kernel is (a) calling cpufreq to get the current cpu frequency, > > (b) > > converting the rate to a cycle count, (c) converting this to a 24 bit > > LFSR count, an iterative algorithm (in this patch, starting from > > one of 256 values so a max of 2^16 or 64k iterations), (d) > > calculating > > an trace unload interval. In addition, a cpufreq notifier is > > registered > > to recalculate on frequency changes. No. The user issues the command opcontrol --event:N where N is the number of events (cycles, l2 cache misses, instructions retired etc) that are to elapse between collecting the samples. The OProfile passes the value N to the kernel via the variable ctr[i].count. Where i is the performance counter entry for that event. Specifically with SPU profiling, we do not use performance counters because the CELL HW does not allow the normal the PPU to read the SPU PC when a performance counter interrupt occurs. We are using some additional hw support in the chip that allows us to periodically capture the SPU PC. There is an LFSR hardware counter that can be started at an arbitrary LFSR value. When the last LFSR value in the sequence is reached, a sample is taken and stored in the trace buffer. Hence, the value of N specified by the user must get converted to the LFSR value that is N from the end of the sequence. The same clock that the processor is running at is used to control the LFSR count. Hence the LFSR counter increments once per CPU clock cycle regardless of the CPU frequency or changes in the frequency. There is no calculation for the LFSR value that is a function of the processor frequency. There is no need to adjust the LFSR when the processor frequency changes. Milton had a comment about the code if (ctr[0].event == SPU_CYCLES_EVENT_NUM) { > + spu_cycle_reset = ctr[0].count; > + return; > + } Well, given the above description, it is clear that if you are doing SPU event profiling, the value N is put into the cntr[0].cnt entry since there is only one event. Thus in cell_global_start_spu() you use spu_cycle_reset as the argument to the lfsr calculation routine to get the LFSR value that is N from the end of the sequence. > > > > The obvious problem is step (c), running a loop potentially 64 > > thousand > > times in kernel space will have a noticeable impact on other threads. > > > > I propose instead that user space perform the above 4 steps, and > > provide > > the kernel with two inputs: (1) the value to load in the LFSR and (2) > > the periodic frequency / time interval at which to empty the hardware > > trace buffer, perform sample analysis, and send the data to the > > oprofile > > subsystem. > > > > There should be no security issues with this approach. If the LFSR > > value > > is calculated incorrectly, either it will be too short, causing the > > trace > > array to overfill and data to be dropped, or it will be too long, and > > there will be fewer samples. Likewise, the kernel periodic poll > > can be > > too long, again causing overflow, or too frequent, causing only timer > > execution overhead. > > > > Various data is collected by the kernel while processing the > > periodic timer, > > this approach would also allow the profiling tools to control the > > frequency of this collection. More frequent collection results in > > more > > accurate sample data, with the linear cost of poll execution > > overhead. > > > > Frequency changes can be handled either by the profile code setting > > collection at a higher than necessary rate, or by interacting with > > the > > governor to limit the speeds. > > > > Optionally, the kernel can add a record indicating that some data was > > likely dropped if it is able to read all 256 entries without > > underflowing > > the array. This can be used as hint to user space that the kernel > > time > > was too long for the collection rate. > > Moving the sample rate computation to user space sounds like the right > idea, but why not have a more drastic version of it: No, I do not agree. The user/kernel API pass N where N is the number of events between samples. We are not at liberty to just change the API. We we did do this, we fully expect that John Levon will push back saying why make an architecture specific API change when it isn't necessary. Please define "drastic" in this context. Do you mean make the table bigger i.e. more then 256 precomputed elements? I did 256 since Arnd seemed to think that would be a reasonable size. Based on his example How much kernel space are we willing to use to same some computati
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Thu, Feb 08, 2007 at 06:21:56PM +0100, Arnd Bergmann wrote: [...] > Moving the sample rate computation to user space sounds like the right > idea, but why not have a more drastic version of it: > > Right now, all products that support this feature run at the same clock > rate (3.2 Ghz), with cpufreq, we can reduce this to 1.6 Ghz. If I understand > this correctly, the value depends only on the frequency, so we could simply > hardcode this in the kernel, and print out a warning message if we ever > encounter a different frequency, right? Just for the record... CAB is running with 2.8 GHz. At least all the boards I have seen. Adrian - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Thursday 08 February 2007 15:18, Milton Miller wrote: > 1) sample rate setup > > In the current patch, the user specifies a sample rate as a time > interval. > The kernel is (a) calling cpufreq to get the current cpu frequency, > (b) > converting the rate to a cycle count, (c) converting this to a 24 bit > LFSR count, an iterative algorithm (in this patch, starting from > one of 256 values so a max of 2^16 or 64k iterations), (d) > calculating > an trace unload interval. In addition, a cpufreq notifier is > registered > to recalculate on frequency changes. > > The obvious problem is step (c), running a loop potentially 64 > thousand > times in kernel space will have a noticeable impact on other threads. > > I propose instead that user space perform the above 4 steps, and > provide > the kernel with two inputs: (1) the value to load in the LFSR and (2) > the periodic frequency / time interval at which to empty the hardware > trace buffer, perform sample analysis, and send the data to the > oprofile > subsystem. > > There should be no security issues with this approach. If the LFSR > value > is calculated incorrectly, either it will be too short, causing the > trace > array to overfill and data to be dropped, or it will be too long, and > there will be fewer samples. Likewise, the kernel periodic poll > can be > too long, again causing overflow, or too frequent, causing only timer > execution overhead. > > Various data is collected by the kernel while processing the > periodic timer, > this approach would also allow the profiling tools to control the > frequency of this collection. More frequent collection results in > more > accurate sample data, with the linear cost of poll execution > overhead. > > Frequency changes can be handled either by the profile code setting > collection at a higher than necessary rate, or by interacting with > the > governor to limit the speeds. > > Optionally, the kernel can add a record indicating that some data was > likely dropped if it is able to read all 256 entries without > underflowing > the array. This can be used as hint to user space that the kernel > time > was too long for the collection rate. Moving the sample rate computation to user space sounds like the right idea, but why not have a more drastic version of it: Right now, all products that support this feature run at the same clock rate (3.2 Ghz), with cpufreq, we can reduce this to 1.6 Ghz. If I understand this correctly, the value depends only on the frequency, so we could simply hardcode this in the kernel, and print out a warning message if we ever encounter a different frequency, right? > The current patch specifically identifies that only single > elf objects are handled. There is no code to handle dynamic > linked libraries or overlays. Nor is there any method to > present samples that may have been collected during context > switch processing, they must be discarded. I thought it already did handle overlays, what did I miss here? > My proposal is to change what is presented to user space. Instead > of trying to translate the SPU address to the backing file > as the samples are recorded, store the samples as the SPU > context and address. The context switch would record tid, > pid, object id as it does now. In addition, if this is a > new object-id, the kernel would read elf headers as it does > today. However, it would then proceed to provide accurate > dcookie information for each loader region and overlay. Doing the translation in two stages in user space, as you suggest here, definitely makes sense to me. I think it can be done a little simpler though: Why would you need the accurate dcookie information to be provided by the kernel? The ELF loader is done in user space, and the kernel only reproduces what it thinks that came up with. If the kernel only gives the dcookie information about the SPU ELF binary to the oprofile user space, then that can easily recreate the same mapping. The kernel still needs to provide the overlay identifiers though. > To identify which overlays are active, (instead of the present > read on use and search the list to translate approach) the > kernel would record the location of the overlay identifiers > as it parsed the kernel, but would then read the identification > word and would record the present value as an sample from > a separate but related stream. The kernel could maintain > the last value for each overlay and only send profile events > for the deltas. right. > This approach trades translation lookup overhead for each > recorded sample for a burst of data on new context activation. > In addition it exposes the sample point of the overlay identifier > vs the address collection. This allows the ambiguity to be > exposed to user space. In addition, with the above proposed > kernel timer
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
Michael, Thanks very much for the advice. Both issues have been solved now, with your help. -Maynard Michael Ellerman wrote: On Wed, 2007-02-07 at 09:41 -0600, Maynard Johnson wrote: Carl Love wrote: Subject: Add support to OProfile for profiling Cell BE SPUs From: Maynard Johnson <[EMAIL PROTECTED]> This patch updates the existing arch/powerpc/oprofile/op_model_cell.c to add in the SPU profiling capabilities. In addition, a 'cell' subdirectory was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling code. Signed-off-by: Carl Love <[EMAIL PROTECTED]> Signed-off-by: Maynard Johnson <[EMAIL PROTECTED]> Index: linux-2.6.20-rc1/arch/powerpc/configs/cell_defconfig I've discovered more problems with the kref handling for the cached_info object that we store in the spu_context. :-( When the OProfile module initially determines that no cached_info yet exists for a given spu_context, it creates the cached_info, inits the cached_info's kref (which increments the refcount) and does a kref_get (for SPUFS' ref) before passing the cached_info reference off to SUPFS to store into the spu_context. When OProfile shuts down or the SPU job ends, OProfile gives up its ref to the cached_info with kref_put. Then when SPUFS destroys the spu_context, it also gives up its ref. HOWEVER . . . . If OProfile shuts down while the SPU job is still active _and_ _then_ is restarted while the job is still active, OProfile will find that the cached_info exists for the given spu_context, so it won't go through the process of creating it and doing kref_init on the kref. Under this scenario, OProfile does not own a ref of its own to the cached_info, and should not be doing a kref_put when done using the cached_info -- but it does, and so does SPUFS when the spu_context is destroyed. The end result (with the code as currently written) is that an extra kref_put is done when the refcount is already down to zero. To fix this, OProfile needs to detect when it finds an existing cached_info already stored in the spu_context. Then, instead of creating a new one, it sets a reminder flag to be used later when it's done using the cached info to indicate whether or not it needs to call kref_put. I think all you want to do is when oprofile finds the cached_info already existing, it does a kref_get(). After all it doesn't have a reference to it, so before it starts using it it must inc the ref count. Unfortunately, there's another problem (one that should have been obvious to me). The cached_info's kref "release" function is destroy_cached_info(), defined in the OProfile module. If the OProfile module is unloaded when SPUFS destroys the spu_context and calls kref_put on the cached_info's kref -- KABOOM! The destroy_cached_info function (the second arg to kref_put) is not in memory, so we get a paging fault. I see a couple options to solve this: 1) Don't store the cached_info in the spu_context. Basically, go back to the simplistic model of creating/deleting the cached_info on every SPU task activation/deactivation. 2) If there's some way to do this, force the OProfile module to stay loaded until SPUFS has destroyed its last spu_context that holds a cached_info object. There is a mechanism for that, you just have each cached_info inc the module's refcount. Another option would be to have a mapping, in the oprofile code, from spu_contexts to cached_infos, ie. a hash table or something. cheers - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Feb 6, 2007, at 5:02 PM, Carl Love wrote: This is the first update to the patch previously posted by Maynard Johnson as "PATCH 4/4. Add support to OProfile for profiling CELL". This repost fixes the line wrap issue that Ben mentioned. Also the kref handling for the cached info has been fixed and simplified. There are still a few items from the comments being discussed specifically how to profile the dynamic code for the SPFS context switch code and how to deal with dynamic code stubs for library support. Our proposal is to assign the samples from the SPFS and dynamic library code to an anonymous sample bucket. The support for properly handling the symbol extraction in these cases would be deferred to a later SDK. There is also a bug in profiling overlay code that we are investigating. I'd like to talk about about both some of the concepts, including what is logged and what the kernel API is, and provide some directed comments on the code in the patch as it stands. [Ok, the background and concepts portion is big enough, I'll send this to the several lists for comment, and the code comments to just cbe-oss-dev] First, I'll give some background and my understanding of both the hardware, and some of the linux interface. I'm not consulting any manuals, so I could be mistaken. I'm basing this on a combination of past reading of the kernel, this patch, a discussion with Arnd, and my knowledge of the PowerPC processor and other IBM processors. Hopefully this will be of use to other reviewers. Background: The cell broadband architecture consists of PowerPC processors and synergistic processing units, or SPUs. The current chip has a multi-threaded PowerPC core and 8 SPUs. Each SPU has an execution core (running its own instruction set), a 256kB private memory (local store), and DMA engine that can access the coherent memory domain of the PowerPC processor(s). Multiple chips can be connected together in a single coherent SMP system. The addresses provided to the DMA engine are effective, translated through virtual to real address spaces like the PowerPC MMU. The SPUs are intended to run application threads, and require the PowerPC to perform exception handling and other operating system tasks. Because of the limited address space of the SPU (18 bits), the compilers (gcc) have support for code overlays. The ELF format defines these overlays including file offsets, sizes, load address, and a unique word to identify the overlay. The toolchain has support to check that an overlay is present and setup the DMA engine to load it if it is not present. In Linux, the SPUs are controlled through spufs. Once a SPU context is created, the local store and registers can be modified through the file system. A linux thread makes a syscall requesting the context to run. This call is blocking to that thread; it waits for an event from the SPU (either an exception or a reschedule). In this regard the syscall can be thought of as a cross instruction set function call. The contents of the local store are initialized and modified via read/write or mmap and memcopy of a spufs file. Specifically, pages are not mapped into the local store, they are copied into it. The SPU context is tied to the creating mm; this provides a clear context for the DMA engine (controlled by the program in the SPU). To assist with the use of the SPUs, and to provide some portability to other environments, a library, libpse, is available and widely used. Among other items, it provides an elf loader for SPU elf objects. Applications may mmap external objects or embed the elf objects into their executable (as data). The contained objects copied to the SPU local store as dictated by the elf header. To assist other tools, spufs provides an object-id file. Before this patch, it has been treated as an opaque number. Although not present in the first release of libpse, current versions write the virtual address of the elf header to this file when the loader is called. This patch is trying to enable oprofile to collect and analyze SPU execution, using the trace collection hardware of the processor. The trace collection hardware consists of a LFSR (linear-feedback shift register) counter and an array 256 bits wide and 256 entries deep. When the counter matches the programed value, it causes an entry to be written to the trace array and the counter to be reloaded. On chip logic tracks how many entries have been written and not yet read. When programed to trace the SPUs, the trace entry consists of the upper 16 bits of the 18 bit program counter for each SPU; the 8 SPUs split over the two words. By their nature, LFSRs require a minimum of hardware (typically 3 exclusive-or gates and a shift register), but the count sequence is pseudo-random. The oprofile system is designed to collect a stream of trace data and context information to and provide it to user space for post processing and analysis. While at the lowest l
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Wed, 2007-02-07 at 09:41 -0600, Maynard Johnson wrote: > Carl Love wrote: > > > > >Subject: Add support to OProfile for profiling Cell BE SPUs > > > >From: Maynard Johnson <[EMAIL PROTECTED]> > > > >This patch updates the existing arch/powerpc/oprofile/op_model_cell.c > >to add in the SPU profiling capabilities. In addition, a 'cell' subdirectory > >was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling > >code. > > > >Signed-off-by: Carl Love <[EMAIL PROTECTED]> > >Signed-off-by: Maynard Johnson <[EMAIL PROTECTED]> > > > >Index: linux-2.6.20-rc1/arch/powerpc/configs/cell_defconfig > > > > > I've discovered more problems with the kref handling for the cached_info > object that we store in the spu_context. :-( > > When the OProfile module initially determines that no cached_info yet > exists for a given spu_context, it creates the cached_info, inits the > cached_info's kref (which increments the refcount) and does a kref_get > (for SPUFS' ref) before passing the cached_info reference off to SUPFS > to store into the spu_context. When OProfile shuts down or the SPU job > ends, OProfile gives up its ref to the cached_info with kref_put. Then > when SPUFS destroys the spu_context, it also gives up its ref. HOWEVER > . . . . If OProfile shuts down while the SPU job is still active _and_ > _then_ is restarted while the job is still active, OProfile will find > that the cached_info exists for the given spu_context, so it won't go > through the process of creating it and doing kref_init on the kref. > Under this scenario, OProfile does not own a ref of its own to the > cached_info, and should not be doing a kref_put when done using the > cached_info -- but it does, and so does SPUFS when the spu_context is > destroyed. The end result (with the code as currently written) is that > an extra kref_put is done when the refcount is already down to zero. To > fix this, OProfile needs to detect when it finds an existing cached_info > already stored in the spu_context. Then, instead of creating a new one, > it sets a reminder flag to be used later when it's done using the cached > info to indicate whether or not it needs to call kref_put. I think all you want to do is when oprofile finds the cached_info already existing, it does a kref_get(). After all it doesn't have a reference to it, so before it starts using it it must inc the ref count. > Unfortunately, there's another problem (one that should have been > obvious to me). The cached_info's kref "release" function is > destroy_cached_info(), defined in the OProfile module. If the OProfile > module is unloaded when SPUFS destroys the spu_context and calls > kref_put on the cached_info's kref -- KABOOM! The destroy_cached_info > function (the second arg to kref_put) is not in memory, so we get a > paging fault. I see a couple options to solve this: > 1) Don't store the cached_info in the spu_context. Basically, go > back to the simplistic model of creating/deleting the cached_info on > every SPU task activation/deactivation. > 2) If there's some way to do this, force the OProfile module to > stay loaded until SPUFS has destroyed its last spu_context that holds a > cached_info object. There is a mechanism for that, you just have each cached_info inc the module's refcount. Another option would be to have a mapping, in the oprofile code, from spu_contexts to cached_infos, ie. a hash table or something. cheers -- Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person signature.asc Description: This is a digitally signed message part
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
Carl Love wrote: Subject: Add support to OProfile for profiling Cell BE SPUs From: Maynard Johnson <[EMAIL PROTECTED]> This patch updates the existing arch/powerpc/oprofile/op_model_cell.c to add in the SPU profiling capabilities. In addition, a 'cell' subdirectory was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling code. Signed-off-by: Carl Love <[EMAIL PROTECTED]> Signed-off-by: Maynard Johnson <[EMAIL PROTECTED]> Index: linux-2.6.20-rc1/arch/powerpc/configs/cell_defconfig I've discovered more problems with the kref handling for the cached_info object that we store in the spu_context. :-( When the OProfile module initially determines that no cached_info yet exists for a given spu_context, it creates the cached_info, inits the cached_info's kref (which increments the refcount) and does a kref_get (for SPUFS' ref) before passing the cached_info reference off to SUPFS to store into the spu_context. When OProfile shuts down or the SPU job ends, OProfile gives up its ref to the cached_info with kref_put. Then when SPUFS destroys the spu_context, it also gives up its ref. HOWEVER . . . . If OProfile shuts down while the SPU job is still active _and_ _then_ is restarted while the job is still active, OProfile will find that the cached_info exists for the given spu_context, so it won't go through the process of creating it and doing kref_init on the kref. Under this scenario, OProfile does not own a ref of its own to the cached_info, and should not be doing a kref_put when done using the cached_info -- but it does, and so does SPUFS when the spu_context is destroyed. The end result (with the code as currently written) is that an extra kref_put is done when the refcount is already down to zero. To fix this, OProfile needs to detect when it finds an existing cached_info already stored in the spu_context. Then, instead of creating a new one, it sets a reminder flag to be used later when it's done using the cached info to indicate whether or not it needs to call kref_put. Unfortunately, there's another problem (one that should have been obvious to me). The cached_info's kref "release" function is destroy_cached_info(), defined in the OProfile module. If the OProfile module is unloaded when SPUFS destroys the spu_context and calls kref_put on the cached_info's kref -- KABOOM! The destroy_cached_info function (the second arg to kref_put) is not in memory, so we get a paging fault. I see a couple options to solve this: 1) Don't store the cached_info in the spu_context. Basically, go back to the simplistic model of creating/deleting the cached_info on every SPU task activation/deactivation. 2) If there's some way to do this, force the OProfile module to stay loaded until SPUFS has destroyed its last spu_context that holds a cached_info object. I thought about putting the cached_info's kref "release" function in SPUFS, but this just won't work. This implies that SPUFS needs to know about the structure of the cached_info, e.g., that it contains the vma_map member that needs to be freed. But even with that information, it's not enough, since the vma_map member consists of list of vma_maps, which is why we have the vma_map_free() function. So SPUFS would still need access to vma_map_free() from the OProfile module. Opinions from others would be appreciated. Thanks. -Maynard +/* Container for caching information about an active SPU task. + * + */ +struct cached_info { + struct vma_to_fileoffset_map * map; + struct spu * the_spu; /* needed to access pointer to local_store */ + struct kref cache_ref; +}; + +static struct cached_info * spu_info[MAX_NUMNODES * 8]; + +static void destroy_cached_info(struct kref * kref) +{ + struct cached_info * info; + info = container_of(kref, struct cached_info, cache_ref); + vma_map_free(info->map); + kfree(info); +} + +/* Return the cached_info for the passed SPU number. + * + */ +static struct cached_info * get_cached_info(struct spu * the_spu, int spu_num) +{ + struct cached_info * ret_info = NULL; + unsigned long flags = 0; + if (spu_num >= num_spu_nodes) { + printk(KERN_ERR "SPU_PROF: " + "%s, line %d: Invalid index %d into spu info cache\n", + __FUNCTION__, __LINE__, spu_num); + goto out; + } + spin_lock_irqsave(&cache_lock, flags); + if (!spu_info[spu_num] && the_spu) + spu_info[spu_num] = (struct cached_info *) + spu_get_profile_private(the_spu->ctx); + + ret_info = spu_info[spu_num]; + spin_unlock_irqrestore(&cache_lock, flags); + out: + return ret_info; +} + + +/* Looks for cached info for the passed spu. If not found, the + * cached info is created for the passed spu. + * Returns 0 for success; otherwise, -1 for error. + */ +static int +prepare_cached_spu_info(struct spu * spu, unsigned int objectId) +{ + unsigned lon
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
This is the first update to the patch previously posted by Maynard Johnson as "PATCH 4/4. Add support to OProfile for profiling CELL". This repost fixes the line wrap issue that Ben mentioned. Also the kref handling for the cached info has been fixed and simplified. There are still a few items from the comments being discussed specifically how to profile the dynamic code for the SPFS context switch code and how to deal with dynamic code stubs for library support. Our proposal is to assign the samples from the SPFS and dynamic library code to an anonymous sample bucket. The support for properly handling the symbol extraction in these cases would be deferred to a later SDK. There is also a bug in profiling overlay code that we are investigating. Subject: Add support to OProfile for profiling Cell BE SPUs From: Maynard Johnson <[EMAIL PROTECTED]> This patch updates the existing arch/powerpc/oprofile/op_model_cell.c to add in the SPU profiling capabilities. In addition, a 'cell' subdirectory was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling code. Signed-off-by: Carl Love <[EMAIL PROTECTED]> Signed-off-by: Maynard Johnson <[EMAIL PROTECTED]> Index: linux-2.6.20-rc1/arch/powerpc/configs/cell_defconfig === --- linux-2.6.20-rc1.orig/arch/powerpc/configs/cell_defconfig 2007-01-18 16:43:14.230540320 -0600 +++ linux-2.6.20-rc1/arch/powerpc/configs/cell_defconfig2007-02-01 17:21:46.928875304 -0600 @@ -1403,7 +1403,7 @@ # Instrumentation Support # CONFIG_PROFILING=y -CONFIG_OPROFILE=y +CONFIG_OPROFILE=m # CONFIG_KPROBES is not set # Index: linux-2.6.20-rc1/arch/powerpc/oprofile/cell/pr_util.h === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.20-rc1/arch/powerpc/oprofile/cell/pr_util.h 2007-02-03 15:56:01.094856152 -0600 @@ -0,0 +1,78 @@ + /* + * Cell Broadband Engine OProfile Support + * + * (C) Copyright IBM Corporation 2006 + * + * Author: Maynard Johnson <[EMAIL PROTECTED]> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#ifndef PR_UTIL_H +#define PR_UTIL_H + +#include +#include +#include +#include + +static inline int number_of_online_nodes(void) +{ + u32 cpu; u32 tmp; + int nodes = 0; + for_each_online_cpu(cpu) { + tmp = cbe_cpu_to_node(cpu) + 1; + if (tmp > nodes) + nodes++; + } + return nodes; +} + +/* Defines used for sync_start */ +#define SKIP_GENERIC_SYNC 0 +#define SYNC_START_ERROR -1 +#define DO_GENERIC_SYNC 1 + +struct vma_to_fileoffset_map +{ + struct vma_to_fileoffset_map *next; + unsigned int vma; + unsigned int size; + unsigned int offset; + unsigned int guard_ptr; + unsigned int guard_val; +}; + +/* The three functions below are for maintaining and accessing + * the vma-to-fileoffset map. + */ +struct vma_to_fileoffset_map * create_vma_map(const struct spu * spu, u64 objectid); +unsigned int vma_map_lookup(struct vma_to_fileoffset_map *map, unsigned int vma, + const struct spu * aSpu); +void vma_map_free(struct vma_to_fileoffset_map *map); + +/* + * Entry point for SPU profiling. + * cycles_reset is the SPU_CYCLES count value specified by the user. + */ +void start_spu_profiling(unsigned int cycles_reset); + +void stop_spu_profiling(void); + + +/* add the necessary profiling hooks */ +int spu_sync_start(void); + +/* remove the hooks */ +int spu_sync_stop(void); + +/* Record SPU program counter samples to the oprofile event buffer. */ +void spu_sync_buffer(int spu_num, unsigned int * samples, +int num_samples); + +void set_profiling_frequency(unsigned int freq_khz, unsigned int cycles_reset); + +#endif// PR_UTIL_H Index: linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_profiler.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_profiler.c 2007-02-05 09:32:25.708937424 -0600 @@ -0,0 +1,203 @@ +/* + * Cell Broadband Engine OProfile Support + * + * (C) Copyright IBM Corporation 2006 + * + * Authors: Maynard Johnson <[EMAIL PROTECTED]> + * Carl Love <[EMAIL PROTECTED]> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#include +#include +#include +#include +#include +#include "pr_util.h" + +#define TRACE_ARRAY_SIZE 1024 +#define SCALE_SHIFT 14 + +static u3