The test case we found is under 'extreme' duress (intense loading on an
MPC8572), with many applications using A LOT of SPE instructions...
----
If you look at the context switch code (in latest code entry_32.S), I
believe the context switch performs a SAVE_NVGPR() - which in our
interpretation (in ppc_asm.h) - only saves the lower 32 bits of the GPR
(stw/lwz)...

This is only a guess of where the problem lies - based upon the single
SPE instruction that seemingly got misinterpreted, and shifts the data
by '1 byte' (and this code gets executed successfully MANY more times at
lower bandwidths - than failures seen at higher bandwidths)...
----

I am not sure how to proceed...we know how to recreate with our
application, but we would love to know how to change (safely) the
pt_regs to "long long" for the GPRs and then safely move all 64 bits of
each GPR into these doubles...

We could then re-test and see if this helps?

Tom

>> -----Original Message-----
>> From: Michael Neuling [mailto:mi...@neuling.org]
>> Sent: Tuesday, May 05, 2009 8:02 PM
>> To: Morrison, Tom
>> Cc: Kumar Gala; linuxppc-dev@ozlabs.org
>> Subject: Re: MSR_SPE - being turned off...
>>
>> > Hi Kumar/Michael...
>> >
>> > Sorry, I really didn't explain myself very well...
>> >
>> > The Problem (answer to Michael):
>> > ================================
>> > We started using a new compiler that upon -O2 optimization - added
>> > heavy SPE related instructions into our applications (where the older
>> > compiler might not use as many). Once this was done, we started
>> > experiencing problems with data being 'shifted' and/or corrupted
>> > throughout the applications which didn't immediately cause problems,
>> > but either scribbled on someone else's memory and/or bad results...
>> > We knew where one of the offending scribbles started (by the shifting
>> > by 1 byte of a structure) and found by comparing binaries with 'older'
>> > compiler vs. this one that the only major difference was the 'density'
>> > of the SPE instructions...
>> >
>> > As to your question, Kumar:
>> > ===========================
>> > Naively, I explicitly enabled the SPE in a BSP 'early_init' program
>> > (as well as enabling Machine Checks) - which is what I meant by
>> > Enabling SPE...
>>
>> Yeah, you don't want to do this. It'll potentially break your
>> application.
>>
>> I'm not that familiar with the CPU you are using but I'm guessing that
>> you can't write the MSR from user space anyway.
>>
>> > Michael explained that it is 'normal' if we asynchronously polled
>> > the MSR (in an application and/or in the kernel) that it might be
>> > disabled at the moment, but that you do a 'lazy switch' that
>> > enables it...and gets turned on when an SPE exception comes in...
>> >
>> > ...ok...I can live with that...
>> >
>> > -------where I was really going---------
>> >
>> > This is where I was trying to go. A developer at our company (who no
>> > longer works for us) - did some research/development on the SPE
>> > functionality, in the hopes that we could create an optimized library.
>> > The results were successful, but because of some of the restrictions
>> > (including 8 byte alignment for some instructions) - we decided not
>> > to incorporate this library into our application(s)
>> >
>> > But, this developer in his results, indicated that he believed our
>> > kernels were NOT properly saving/restoring the upper 32bits of the
>> > GPR (which can/will be used in the SPE instructions)... Thus, if the
>> > upper 32bits were not saved (and restored when the application got
>> > the SPE to operate on)...then, he thought there would be problems.
>> > He unfortunately, was unable to finish his work and fix these 'bugs'
>> > before he left our company...
>> >
>> > Again, I am only going on his results, and not my own investigations
>> > (I am not sure where to start to find this problem to begin with)...
>> >
>> > So, I was REALLY asking - has anybody else run into this type of
>> > problem, and/or the Linux community has recognized this problem and
>> > has fixed this?
>>
>> If GPRs were getting corrupted in userspace, that would be a serious
>> bug and would be noticed by someone pretty quickly.
>>
>> We'd really need a test case to get anywhere with this report.
>>
>> Mikey

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
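
The concern raised in this thread (a context switch that saves and restores only the lower 32 bits of a 64-bit GPR) can be illustrated with a small user-space model. This is only a sketch of the suspected failure mode, not kernel code and not a confirmation that the kernel behaves this way. The type gpr64_t and the functions save_low32, restore_low32, roundtrip_low32 and roundtrip_full64 are hypothetical names made up for this illustration, and the model assumes that a 32-bit restore leaves the upper half of the register holding whatever the previous task wrote there.

```c
#include <stdint.h>

/* Hypothetical model of one 64-bit SPE-capable GPR; not kernel code. */
typedef uint64_t gpr64_t;

/* Models an stw-style save: only the low 32 bits reach the save area. */
static uint32_t save_low32(gpr64_t reg)
{
    return (uint32_t)(reg & 0xFFFFFFFFu);
}

/*
 * Models an lwz-style restore. ASSUMPTION: the upper half of the live
 * register keeps whatever the last writer (another task) left in it.
 */
static gpr64_t restore_low32(gpr64_t live_reg, uint32_t saved)
{
    return (live_reg & 0xFFFFFFFF00000000ULL) | (gpr64_t)saved;
}

/* Task A is switched out, task B uses the register, task A is switched in. */
gpr64_t roundtrip_low32(gpr64_t task_a_value, gpr64_t task_b_value)
{
    uint32_t saved = save_low32(task_a_value); /* switch away from A */
    gpr64_t live = task_b_value;               /* B overwrites the GPR */
    return restore_low32(live, saved);         /* switch back to A */
}

/* Same sequence, but the save area holds all 64 bits. */
gpr64_t roundtrip_full64(gpr64_t task_a_value, gpr64_t task_b_value)
{
    gpr64_t saved = task_a_value;              /* full-width save */
    gpr64_t live = task_b_value;               /* B overwrites the GPR */
    live = saved;                              /* full-width restore */
    return live;
}
```

In this model, roundtrip_low32(0x1122334455667788ULL, 0xAAAABBBB00000001ULL) returns 0xAAAABBBB55667788: task A gets back its own low word but task B's upper word. roundtrip_full64 with the same arguments returns 0x1122334455667788 unchanged. A real test case for the list would need to run actual SPE instructions on the target under context-switch pressure and compare the upper halves before and after.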