On Fri, 2008-05-16 at 16:22 +0200, Arnd Bergmann wrote:
> On Thursday 15 May 2008, Carl Love wrote:
> > On Thu, 2008-05-15 at 17:39 +0200, Arnd Bergmann wrote:
> > >
> > > I noticed now that you are indexing arrays by SPU number. This is not
> > > a good idea, because you make assumptions about the system that may
> > > not be true. Better pass around 'struct spu' pointers and put those
> > > values in there if you need them.
> >
> > Hmm, well, we have arrays for last_guard_val[], spu_info[] as well that
> > are indexed by spu number. So this would seem to be a more general
> > issue then just spu_ctx_seen[]. Not sure exactly what you are
> > suggesting with passing around 'struct spu' as a solution. Are you
> > suggesting that a field be added to the structure for the spu context
> > seen flag? Not sure that is a good idea. We will need to clarify how
> > you propose to fix this not only for spu_ctx_sw_seen but the other
> > arrays as well that are already being indexed by spu number.
>
> I'd suggest you add fields to struct spu, either directly to it, if
> it's short, or a pointer to your own struct.
>
> As I said, let's not do that now.
>
> > So, you can either add
> > a context switch sequence to a given CPU buffer or you can add an SPU
> > program counter to the CPU buffer not both at the same time. The lock
> > is held until the entire SPU context switch entry is added to the CPU
> > queue to make sure entries are not intermingled.
>
> My point was that oprofile collecting samples for the CPU could possibly
> add entries to the same queue, which would race with this.
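As an aside, Arnd's earlier suggestion of hanging per-SPU oprofile state off struct spu rather than indexing global arrays by SPU number could look roughly like the sketch below. This is a toy, compilable stand-in, not the real kernel structures: spu_prof_state, ctx_sw_seen, and spu_mark_ctx_switch are hypothetical names, and struct spu here is heavily trimmed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy sketch: per-SPU oprofile state travels with the spu structure
 * itself instead of global arrays indexed by physical SPU number
 * (last_guard_val[], spu_ctx_sw_seen[]).  Hypothetical names; NOT the
 * real kernel's struct spu. */
struct spu_prof_state {
	unsigned long last_guard_val;  /* replaces last_guard_val[spu->number] */
	int ctx_sw_seen;               /* replaces spu_ctx_sw_seen[spu->number] */
};

struct spu {
	int number;
	struct spu_prof_state *prof;   /* pointer keeps struct spu itself short */
};

/* mark a context switch without assuming anything about SPU numbering */
static void spu_mark_ctx_switch(struct spu *spu)
{
	spu->prof->ctx_sw_seen = 1;
}

/* tiny demo: returns the flag after one simulated switch */
static int demo_ctx_seen(void)
{
	struct spu_prof_state st = { 0, 0 };
	struct spu s = { 0, &st };

	spu_mark_ctx_switch(&s);
	return s.prof->ctx_sw_seen;
}
```

The pointer variant matches Arnd's "or a pointer to your own struct" option, which avoids bloating struct spu for code that is only built when oprofile is enabled.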
Ah, I see what you are getting at: that we would be collecting profiling
data on a CPU (i.e. a PPU thread) and SPU cycle profiling data at the
same time. No, due to hardware constraints on SPU cycle profiling, the
hardware cannot be configured to do both PPU profiling and SPU cycle
profiling at the same time. I am working on SPU event profiling. The
hardware will only support event profiling on one SPU per node at a
time. I will have to think about it more, but my initial thought is
that supporting PPU profiling and SPU event profiling will again not
work due to hardware constraints.

> > When the trace buffer is read, each 128 bit entry contains the 16 bit
> > SPU PC for each of the eight SPUs on the node. Secondly, we really want
> > to keep the timing of the context switches and storing the SPU PC values
> > as close as possible. Clearly there is some noise as the switch will
> > occur while the trace buffer is filling.
>
> can't you just empty the trace buffer every time you encounter a context
> switch, even in the absence of a timer interrupt? That would prevent
> all of the noise.

That is a good thought. It should help to keep the timing more accurate
and prevent the issue mentioned below about processing multiple trace
buffers of data before processing the context switch info.

> > Storing the all of the SPU PC
> > values, into the current CPU buffer causes two issues. One is that one
> > of the buffers will be much fuller then the rest increasing the
> > probability of overflowing the CPU buffer and dropping data. As it is,
> > in user land, I have to increase the size of the CPU buffers to
> > accommodate four SPUs worth of data in a given CPU buffer. Secondly,
> > the SPU context switch info for a given SPU would be going to a
> > different CPU buffer then the data.
>
> In practice, the calling CPUs should be well spread around by the way
> that spufs works, so I don't think it's a big problem.
> Did you observe this problem, or just expect it to happen?
I did see the problem initially. The first version of
oprofile_add_value() did not take the CPU buffer argument. Rather,
internally oprofile_add_value() stored the data to the current CPU
buffer by calling smp_processor_id(). When processing the trace buffer,
the function would extract the PC value for each of the 8 SPUs on the
node, then store them in the current CPU buffer. It would do this for
all of the samples in the trace buffer, typically about 250 (the
maximum is 1024). So you would put 8 * 250 samples all into the same
CPU buffer and nothing in the other CPU buffers for that node. In order
to ensure I don't drop any samples, I had to increase the size of each
CPU buffer more than when I distribute them to the two CPU buffers in
each node. I changed oprofile_add_value() to take a CPU buffer argument
to better utilize the CPU buffers and to avoid the ordering issue with
the SPU context switch data. I didn't do a detailed study to see
exactly how much more the CPU buffer size needed to be increased, as
the SPU context switch ordering was the primary issue I was trying to
fix. The default CPU buffer size must be increased because four
processors (SPUs) worth of data is being stored in each CPU buffer and
each entry takes two locations to store (the special escape code, then
the actual data).

> > Depending on the timing of the CPU
> > buffer syncs to the kernel buffer you might get the trace buffer emptied
> > multiple times before an SPU context switch event was sync'd to the
> > kernel buffer. Thus causing undesired additional skew in the data.
>
> good point. would this also get solved if you flush the trace buffer
> on a context switch? I suppose you need to make sure that the samples
> still end up ordered correctly throughout the CPU buffers. Is that
> possible?

Yes, this would be avoided by processing the trace buffer on a context
switch.
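The ordering property Arnd asks about can be illustrated with a toy model (not kernel code): drain the pending hardware trace samples into the sample queue before appending the switch record, so every PC collected under the old context is ordered ahead of the switch marker. SWITCH_MARKER and the single flat queue here are stand-ins for the real escape-code sequence and per-CPU buffers.

```c
#include <assert.h>

#define QLEN 64
#define SWITCH_MARKER 0xdeadUL  /* stand-in for the real escape sequence */

static unsigned long queue[QLEN];
static int qhead;

static void add_sample(unsigned long v)
{
	if (qhead < QLEN)
		queue[qhead++] = v;
}

/* drain the old context's samples first, then log the switch record */
static void on_context_switch(const unsigned long *pending, int n)
{
	for (int i = 0; i < n; i++)
		add_sample(pending[i]);
	add_sample(SWITCH_MARKER);
}

/* demo: returns the queue index of the switch marker */
static int demo_marker_index(void)
{
	unsigned long pending[3] = { 0x100, 0x104, 0x108 };

	qhead = 0;
	on_context_switch(pending, 3);
	for (int i = 0; i < qhead; i++)
		if (queue[i] == SWITCH_MARKER)
			return i;
	return -1;
}
```

With three pending samples the marker lands at index 3, i.e. strictly after all samples that belong to the outgoing context.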
We have the added benefit that it should help minimize the skew between
the data collection and context switch information.

> > Any thoughts about moving oprofile_add_value() to buffer_sync.c and
> > having it grab the buffer_mutex lock while it inserts values directly
> > into the kernel buffer? At the moment it looks the easiest.
>
> That would mean that again you can't call it from interrupt context,
> which is the problem you were trying to solve with the work queue
> in the previous version of your patch.

Yes, forgot about that little detail, again. Argh!

> > I can see a few other solutions but they involve creating per SPU arrays
> > for initially storing the switch info and SPU data and then having each
> > CPU flush a subset of the SPU data to its CPU buffer. A lot more
> > storage and overhead we want.
>
> It might work if you do a per SPU buffer and generalize the way
> that per-CPU buffers are flushed to the kernel buffer so it works
> with either one.

If we flush on SPU context switches, we are just left with how best to
manage storing the data. I see two choices: allocate more memory so we
can store data into a per-SPU buffer and then figure out how to flush
the data to the kernel buffer, or just increase the per-CPU buffer size
so we can store all of the node's SPU data in a single CPU buffer.
Either way we use more memory. In the first approach, I would need to
allocate SPU arrays large enough to store data for the worst case,
i.e. the trace buffer being full. This would be 8*1024 entries.
Typically only about 1/3 would be needed, as the hrtimer is set up to
go off at about 1/3 full. This gives some headroom in case it takes
time to get the timer function call scheduled. In the second case,
doubling the size of the per-CPU buffers to handle the typical case of
the trace buffer being 1/3 full would correspond to allocating an
additional 2*1024/3 = 682 entries. This would be less memory than the
first approach and be simpler to implement.
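The sizing argument above can be spelled out with the numbers from the mail: a 1024-entry hardware trace buffer, 8 SPUs per node, 2 CPUs per node, and the hrtimer firing at roughly 1/3 full. The last two helpers also quantify the earlier buffer-pressure point, comparing the fullest buffer when everything goes to one CPU versus being spread across the node's CPUs. Purely illustrative arithmetic, not kernel code.

```c
#include <assert.h>

enum {
	TRACE_ENTRIES = 1024,  /* hardware trace buffer capacity */
	SPUS_PER_NODE = 8,
	CPUS_PER_NODE = 2,
};

/* option 1: per-SPU arrays sized for the worst case (trace buffer full) */
static int per_spu_worst_case(void)
{
	return SPUS_PER_NODE * TRACE_ENTRIES;  /* 8 * 1024 */
}

/* option 2: extra per-CPU entries for the typical 1/3-full case,
 * matching the mail's 2*1024/3 = 682 figure */
static int extra_per_cpu_typical(void)
{
	return CPUS_PER_NODE * TRACE_ENTRIES / 3;
}

/* peak load on a single CPU buffer if all SPU samples land there */
static int one_buffer_peak(int trace_entries)
{
	return trace_entries * SPUS_PER_NODE;
}

/* peak load when samples are spread round-robin over the node's CPUs */
static int spread_peak(int trace_entries)
{
	int total = trace_entries * SPUS_PER_NODE;

	return (total + CPUS_PER_NODE - 1) / CPUS_PER_NODE;
}
```

For the typical 250-entry case, one buffer absorbs 2000 samples while spreading caps each buffer at 1000, which is why the single-buffer variant needed larger per-CPU buffers to avoid dropping data.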
Given the code, it would be easy to measure what the minimum number of
buffers needed is with the current patch that spreads the entries
evenly across all of the buffers. Then it is just a one-line hack to
put the data all into the current CPU buffer, and then re-tune to find
the minimum number of CPU buffers. I will do the experiments, as it
might be helpful to see what the memory cost would be.

> > > > +	/* The SPU_PROFILING_CODE escape sequence must proceed
> > > > +	 * the SPU context switch info.
> > > > +	 */
> > > > +	for_each_online_cpu(cpu) {
> > > > +		oprofile_add_value(ESCAPE_CODE, cpu);
> > > > +		oprofile_add_value(SPU_PROFILING_CODE, cpu);
> > > > +		oprofile_add_value((unsigned long int)
> > > > +				    num_spu_nodes, cpu);
> > > > +	}
> > >
> > > You are no longer holding a lock while adding the events, which
> > > may be correct as long as no switch_events come in, but it's
> > > still inconsistent. Do you mind just adding the spinlock here
> > > as well?
> >
> > The claim is that the lock is not needed because we have not yet
> > registered for notification of the context switches as mentioned above.
> > Hence there is no race to worry about. Registration for the switch
> > event notification happens right after this loop.
>
> Right, that's what I understood as well. My point was that it's more
> consistent if you always call the function with the same locks held,
> in case somebody changes or tries to understand your code.

Yup, not a big deal. I had thought about doing it for completeness'
sake but then opted not to, figuring I would get dinged for
unnecessarily grabbing a lock and doing extra work. I will put it in.
That is what I get for trying to second-guess the reviewers. :-)

> 	Arnd <><

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev