The test case we found runs under 'extreme' duress
(intense loading on an MPC8572), with many applications
using A LOT of SPE instructions...

----

If you look at the context switch code (in latest code entry_32.S), 
I believe the context switch performs a SAVE_NVGPR() - which in our 
interpretation (in ppc_asm.h) - only saves the lower 32 bits of 
the GPR (stw/lwz)...
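
To illustrate the concern above, here is a minimal sketch in plain C that only models the suspected effect; it is not kernel code. It assumes a 64-bit SPE register whose low word alone is saved and restored (as stw/lwz would do), and it assumes a word load leaves the upper half untouched. All names (gpr64_t, save_low_word, restore_low_word, simulate_switch) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 64-bit SPE-capable GPR; not a kernel type. */
typedef uint64_t gpr64_t;

/* Models an stw-style save: only the low 32 bits reach the save area. */
uint32_t save_low_word(gpr64_t reg)
{
    return (uint32_t)(reg & 0xFFFFFFFFu);
}

/*
 * Models an lwz-style restore (assumption: the upper half of the live
 * register is left as the previous user of the register set it).
 */
gpr64_t restore_low_word(gpr64_t live_reg, uint32_t saved_low)
{
    return (live_reg & 0xFFFFFFFF00000000ULL) | saved_low;
}

/*
 * Task A is switched out, task B overwrites the full 64-bit register,
 * then task A is switched back in with a low-word-only restore.
 */
gpr64_t simulate_switch(gpr64_t task_a_value, gpr64_t task_b_value)
{
    uint32_t saved = save_low_word(task_a_value);
    gpr64_t live = task_b_value;
    return restore_low_word(live, saved);
}
```

In this model, task A gets its low word back intact, but its upper word now holds whatever task B left there. Whether the real context switch path behaves this way is exactly the open question.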

This is only a guess of where the problem lies - based upon the single
SPE instruction that seemingly got misinterpreted, and shifts the data
by '1 byte' (and this code gets executed successfully MANY more times
at lower bandwidths - than failures seen at higher bandwidths)...

----

I am not sure how to proceed...we know how to recreate with our 
application, but we would love to know how to change (safely) 
the pt_regs to "long long" for the GPRs and then safely move
all 64 bits of each GPR into these doubles...

We could then re-test and see if this helps?
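
As a rough sketch of the idea of keeping all 64 bits per register, the following plain C models a save area with 64-bit slots. It is only an illustration: struct wide_regs and the helper names are hypothetical, this is NOT the kernel's struct pt_regs, and widening the real pt_regs would likely change a layout that user space also sees, so that part would need careful checking.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_GPRS 32

/* Hypothetical save area with one 64-bit slot per GPR (not pt_regs). */
struct wide_regs {
    uint64_t gpr[NUM_GPRS];
};

/* Store both halves of register n into its 64-bit slot. */
void save_wide(struct wide_regs *area, int n, uint32_t upper, uint32_t lower)
{
    area->gpr[n] = ((uint64_t)upper << 32) | lower;
}

/* Recover the upper 32 bits of register n from the save area. */
uint32_t restore_upper(const struct wide_regs *area, int n)
{
    return (uint32_t)(area->gpr[n] >> 32);
}

/* Recover the lower 32 bits of register n from the save area. */
uint32_t restore_lower(const struct wide_regs *area, int n)
{
    return (uint32_t)(area->gpr[n] & 0xFFFFFFFFu);
}
```

With a save area like this, both halves survive a save/restore round trip in the model, which is the property the re-test would be looking for.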

Tom



>> -----Original Message-----
>> From: Michael Neuling [mailto:mi...@neuling.org]
>> Sent: Tuesday, May 05, 2009 8:02 PM
>> To: Morrison, Tom
>> Cc: Kumar Gala; linuxppc-dev@ozlabs.org
>> Subject: Re: MSR_SPE - being turned off...
>> 
>> > Hi Kumar/Michael...
>> >
>> > Sorry, I really didn't explain myself very well...
>> >
>> > The Problem (answer to Michael):
>> >
>> > ================================
>> > We started using a new compiler that upon -O2 optimization - added
>> > heavy SPE related instructions into our applications (where the older
>> > compiler might not use as many). Once this was done, we started
>> > experiencing problems with data being 'shifted' and/or corrupted
>> > throughout the applications which didn't immediately cause problems,
>> > but either scribbled on someone else's memory and/or bad results...
>> > We knew where one of the offending scribbles started (by the shifting
>> > by 1 byte of a structure) and found by comparing binaries with 'older'
>> > compiler vs. this one that the only major difference was the 'density'
>> > of the SPE instructions...
>> >
>> > As to your question, Kumar:
>> >
>> > ===========================
>> > Naively, I explicitly enabled the SPE in a BSP 'early_init' program
>> > (as well as enabling Machine Checks) - which is what I meant by
>> > Enabling SPE...
>> 
>> Yeah, you don't want to do this.  It'll potentially break your
>> application.
>> 
>> I'm not that familiar with the CPU you are using but I'm guessing that
>> you can't write the MSR from user space anyway.
>> 
>> > Michael explained that it is 'normal' if we asynchronously polled
>> > the MSR (in an application and/or in the kernel) that it might be
>> > disabled at the moment, but that you do a 'lazy switch' that
>> > enables it...and gets turned on when an SPE exception comes in...
>> >
>> > ...ok...I can live with that...
>> >
>> > -------where I was really going---------
>> >
>> > This is where I was trying to go. A developer at our company (who no
>> > longer works for us) - did some research/development on the SPE
>> > functionality, in the hopes that we could create an optimized library.
>> > The results were successful, but because of some of the restrictions
>> > (including 8 byte alignment for some instructions) - we decided not
>> > to incorporate this library into our application(s)
>> >
>> > But, this developer in his results, indicated that he believed our
>> > kernels were NOT properly saving/restoring the upper 32bits of the
>> > GPR (which can/will be used in the SPE instructions)... Thus, if the
>> > upper 32bits were not saved (and restored when the application got
>> > the SPE to operate on)...then, he thought there would be problems.
>> > He unfortunately, was unable to finish his work and fix these 'bugs'
>> > before he left our company...
>> >
>> > Again, I am only going on his results, and not my own investigations
>> > (I am not sure where to start to find this problem to begin with)...
>> >
>> > So, I was REALLY asking - has anybody else run into this type of
>> > problem, and/or the Linux community has recognized this problem and
>> > has fixed this?
>> 
>> If GPRs were getting corrupted in userspace, that would be a serious
>> bug and would be noticed by someone pretty quickly.
>> 
>> We'd really need a test case to get anywhere with this report.
>> 
>> Mikey
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
