Does Dev Tree WORK with [EMAIL PROTECTED] <#address/size> = <2/1>

2008-08-11 Thread Morrison, Tom
I am sorry, but I've butted my head against a tree for over a
week and some things just aren't making sense...especially how
the prom parse code is working to extract / resolve physical
addresses to then ioremap...

a) Setup: I have a working MPC8548E board using 2.6.23.8 (ARCH=ppc)
   with PHYS/PTE_64BIT enabled (with the proper patches).

b) Goal: we want to move our board to a generic 2.6.23 version that 
   Freescale has produced to support the MPC8572DS (and eventually 
   use that kernel to build our BSP for our next board).

c) I have successfully gotten a pure '32-bit' dev tree to work
   (PHYS/PTE_64BIT not defined - and the CCSRBAR @ 0xE000_).
   The soc #address-cells is '1' (and the #size-cells is '1' as well).

d) I have modified this dev tree to support the 36bit mode addressing
   (with the CCSRBAR and PCIExpress defined to be in the last 4Gig
   instead of the default first 4Gig - e.g.: 0xC_E000_)...

e) I then set the soc to have #address-cells of '2' (#size-cells is '1')
   (and added the additional values accordingly to the ranges)...

f) The prom_parse code has a problem with translating these addresses

The main question is: has anyone ever tested this with soc #address/size
being <2/1> in this fashion?

I am more than happy to send my text dev tree - but I don't want to clog
up the group with this info - if there are some fundamental questions 
to be answered before then...

Thanks in advance!

Sincerely,


Tom Morrison
Principal Software Engineer
email: [EMAIL PROTECTED] 
www.empirix.com



___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


RE: Does Dev Tree WORK with [EMAIL PROTECTED] <#address/size> = <2/1>

2008-08-12 Thread Morrison, Tom
Thank you...I will take that recommendation...

I will try that and get some results before I complete my 
response to Becky this morning...


T


-Original Message-
From: Kumar Gala [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, August 12, 2008 8:42 AM
To: Morrison, Tom
Cc: ppc-dev list; Becky Bruce; Paul Mackerras
Subject: Re: Does Dev Tree WORK with [EMAIL PROTECTED] <#address/size> =
<2/1>

I recommend you look at <2/2> instead of <2/1>.  <2/1> is just a  
degenerate case and it doesn't really get you much value.

- k
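
One plausible reading of why <2/1> gains little: every cell is a 32-bit word, so with #size-cells = <1> no single entry can describe a region of 4 GiB or more, even though two address cells can place it anywhere in a 36-bit map. A minimal sketch of that limit follows. The helper name `to_cells` is made up for illustration and is not part of any device tree tool, and the address is an illustrative 36-bit value rather than one taken from the board.

```python
def to_cells(value, ncells):
    """Split an integer into `ncells` 32-bit cells, most significant first."""
    if value < 0 or value >> (32 * ncells):
        raise ValueError(f"{value:#x} does not fit in {ncells} cell(s)")
    return [(value >> (32 * i)) & 0xFFFFFFFF for i in reversed(range(ncells))]

# An address above 4 GiB needs two cells (illustrative value).
addr_cells = to_cells(0xC_0000_1000, 2)

# A 2 GiB size fits in a single size cell ...
size_small = to_cells(0x8000_0000, 1)

# ... but a 4 GiB size does not, which is where <2/1> runs out.
try:
    to_cells(0x1_0000_0000, 1)
    fits = True
except ValueError:
    fits = False

# With two size cells the same region is expressible.
size_large = to_cells(0x1_0000_0000, 2)
```

With <2/2> both the base and the size of a large region fit in one entry, which matches the recommendation above.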


RE: Does Dev Tree WORK with [EMAIL PROTECTED] <#address/size> = <2/1>

2008-08-12 Thread Morrison, Tom
Thank you Becky (and Kumar) for all the information and help!

To answer your questions, yes, we are using 4GB++ of memory
(and plan more in the near future). But, for the initial bring
up, I reduced the memory to 2Gig. Further, I have modified u-boot 
to NOT modify the memory reg properties (see below my snippet)...

Question: what other 'devices' does u-boot fill in that I should care
about, now that I have modified u-boot to put down the correct memory
structure?

Also, are you saying there are additional patches for 
prom parsing code for this to work right - or are you 
talking about in general for the 4Gig memory??

Here is a dts snippet with some of the interesting parts:

/ {
    model = "MPC8548_CHEETAH";
    compatible = "MPC8548_CHEETAH";
    #address-cells = <2>;
    #size-cells = <2>;

    memory {
        device_type = "memory";
        reg = <0   0 8000>; // 2 GIG @ 0x0
    };

    [EMAIL PROTECTED] {
        #address-cells = <1>;
        #size-cells = <1>;
        #interrupt-cells = <2>;
        device_type = "soc";
        ranges = <1000 c 6df0 000ff000>;
        reg = <000C 6df0 0 001000>; // CCSRBAR

        .
        [EMAIL PROTECTED] {
            device_type = "serial";
            compatible = "ns16550";
            reg = <4500 100>;   // reg base, size
            clock-frequency = <0>;  // should we fill in in uboot?
            interrupts = <2a 2>;
            interrupt-parent = <&mpic>;
        };

        [EMAIL PROTECTED] {
            device_type = "serial";
            compatible = "ns16550";
            reg = <4600 100>;   // reg base, size
            clock-frequency = <0>;  // should we fill in in uboot?
            interrupts = <2a 2>;
            interrupt-parent = <&mpic>;
        };

        .
    };

Now, it looks like I am successfully parsing and translating the 
address to the expected address for the default stdout (0xc_6df0_4600)!
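
The translation that appears to be working here can be sketched as follows. This is a simplified model of a ranges lookup, not the kernel's prom_parse code. The archive has dropped digits from the ranges values in the snippet, so the numbers below (child base 0x1000, parent base 0xc_6df0_1000, size 0xff000) are assumptions chosen to agree with the translated stdout address quoted in this message.

```python
def translate(child_addr, ranges):
    """Map a child-bus address to the parent bus via (child, parent, size) entries."""
    for child_base, parent_base, size in ranges:
        if child_base <= child_addr < child_base + size:
            return parent_base + (child_addr - child_base)
    return None  # no entry covers the address, so it cannot be translated

# Assumed values: soc child addresses 0x1000..0xfffff mapped at 0xc_6df0_1000.
soc_ranges = [(0x1000, 0xC_6DF0_1000, 0xFF000)]

uart = translate(0x4600, soc_ranges)   # serial port with reg = <4600 100>
print(hex(uart))

outside = translate(0x0, soc_ranges)   # below the window, not translatable
```

An address that falls outside every ranges entry simply has no parent address, which is one way a translation can fail without any obvious error in the node itself.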

FWIW, I have identified another problem with the code configuring
the CCSRBAR (and I am sure I'll figure that one out soon because we have
a working solution in the arch/ppc directory).

While I have your expert attention, I'd like to have you comment about
what potentially could be right/wrong with my definitions for the pci
express settings...

   a) Do I put those ranges in the ranges for the parent soc device
      (also)?

   b) Do the below correctly define a 2 Gig PCI memory Window starting 
  at 0xC_6F00_ (to 0xC_EF00_) and PCI IO 16M Window starting
  at 0xC_6E00_ (to 0xC_6F00_)?

-
  /* PCI Express */
  [EMAIL PROTECTED] {
compatible = "fsl,mpc8548-pcie";
device_type = "pci";
#interrupt-cells = <1>;
#size-cells = <2>;
#address-cells = <3>;
reg = ;
bus-range = <0 ff>;
ranges = <0200 0 c 6f00 c 6f00 0 8000
   0100  0  000c 6E00 0 0100>;

-
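
For question (b), the end of each window is simply base plus size, so the intended layout can be checked with a little arithmetic. The archive shows the addresses truncated, so this sketch assumes the full values are 0xC_6F00_0000 (memory, 2 GiB) and 0xC_6E00_0000 (I/O, 16 MiB). It checks only the arithmetic, not whether the ranges property encodes it correctly.

```python
def window(base, size):
    """Return (start, end) of a window; end is exclusive."""
    return base, base + size

def overlaps(a, b):
    """True if two (start, end) windows share any address."""
    return a[0] < b[1] and b[0] < a[1]

mem = window(0xC_6F00_0000, 0x8000_0000)   # assumed 2 GiB PCI memory window
io = window(0xC_6E00_0000, 0x0100_0000)    # assumed 16 MiB PCI I/O window

print([hex(x) for x in mem])
print([hex(x) for x in io])
print(overlaps(mem, io))

# Both windows must also stay inside the 36-bit physical address space.
in_36_bits = mem[1] <= 1 << 36 and io[1] <= 1 << 36
```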

Thank you for all your help/comments...

Sincerely,

Tom Morrison
-Original Message-
From: Becky Bruce [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 11, 2008 6:26 PM
To: Morrison, Tom
Cc: linuxppc-dev@ozlabs.org; Paul Mackerras
Subject: Re: Does Dev Tree WORK with [EMAIL PROTECTED] <#address/size> =
<2/1>


On Aug 11, 2008, at 4:37 PM, Morrison, Tom wrote:

> I am sorry, but I've butted my head against a tree for over a
> week and some things just aren't making sense...especially how
> the prom parse code is working to exact / resolve physical
> addresses to then ioremap...
>
> a) Setup, I have a working MPC8548E board using 2.6.23.8 (ARCH=ppc)
>   with PHYS/PTE_64BIT enabled (with the proper patches)).

So, how much RAM are you trying to use?  If it's 4GB+, there are still
patches you need that haven't been released yet but should be out in
the next week or so (I'm in the middle of pulling everything up to
top-of-tree and re-testing).

>
>
> b) Goal: we want to move our board to a generic 2.6.23 version that
>   Freescale has produced to support the MPC8572DS (and eventually
>   use that kernel to build our BSP for our next board).
>
> c) I have successfully gotten to work a pure '32-bit' dev tree
>   (PHYS/PTE_64BIT not defined - and the CCSRBAR @ 0xE000_)
>The #address/#size <1/1> is '1' (as well as the #size is '1')
>
> d) I have modified this dev tree to supp

uboot version for 8572CDS

2008-06-03 Thread Morrison, Tom
I am wondering if there is a development branch that contains
support for the Freescale 8572CDS...Looking in the latest uboot
version (1.3.3) there is support for sbc8641 & sbc8548 - but
no specific support for a sbc8572?

Tom Morrison
Principal Software Engineer

EMPIRIX 
20 Crosby Drive - Bedford, MA  01730
p: 781.266.3567 f: 781.266.3670 
email: [EMAIL PROTECTED]   
www.empirix.com  




 


Which Git Tree to pull from?

2007-07-24 Thread Morrison, Tom
I am a little confused here. I've been working in an older (2.6.11)
ppc release and haven't been paying much attention to where I
should be getting the latest / stable git tree (for powerpc).

I am working on an e500/8548 branch - it looks like Kumar's
Git tree is relatively new with lots of new patches...?

Any/all advice would be appreciated!

Tom Morrison
Principal S/W Engineer
Empirix, Inc (www.empirix.com)
[EMAIL PROTECTED]
(781) 266 - 3567


Trying to use Device Tree...and getting continuous interrupts from attached 88e1145

2007-08-03 Thread Morrison, Tom
All,

Connected to eth1 (etsec2) of my mpc8548 cpu is a 88E1145 and I 
am trying to get the core functionality running with the device tree
paradigm - I know the sense of the 88E1145 is active-low for my 
mpc8548 board and have it working with an older 2.6.11++ kernel.  

I built this new kernel with the marvell driver - it seemingly 
does all the same things we did in the 2.6.11 kernel in separate 
spots...

Here are the relevant parts of my device tree for this part of the
core...

>>  [EMAIL PROTECTED] {
>>  #address-cells = <1>;
>>  #size-cells = <0>;
>>  device_type = "mdio";
>>  compatible = "gianfar"; 
>>  reg = <24520 20>;
>>  phy1: [EMAIL PROTECTED] {
>>  interrupt-parent = <&mpic>;
>>  interrupts = <37 1>;
>>  reg = <11>;
>>  device_type = "ethernet-phy";
>>  };
>>  };  
>>  [EMAIL PROTECTED] {
>>  #address-cells = <1>;
>>  #size-cells = <0>;
>>  device_type = "network";
>>  model = "eTSEC";
>>  compatible = "gianfar";
>>  reg = <25000 1000>;
>>  local-mac-address = [ 00 00 00 00 00 00 ];
>>  interrupts = <23 2 24 2 28 2>;
>>  interrupt-parent = <&mpic>;
>>  phy-handle = <&phy1>;
>>  };
>>  mpic: [EMAIL PROTECTED] {
>>  clock-frequency = <0>;
>>  interrupt-controller;
>>  #address-cells = <0>;
>>  #interrupt-cells = <2>;
>>  reg = <4 4>;
>>  built-in;
>>  compatible = "chrp,open-pic";
>>  device_type = "open-pic";
>>big-endian;
>>  };

The device tree seems to be parsed OK:

>> of_irq_map_one: dev=/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED], index=0
>>  intsize=2 intlen=2
>> of_irq_map_raw: par=/[EMAIL PROTECTED]/[EMAIL PROTECTED],intspec=[0x0037
>>  0x0001...],ointsize=2
>> of_irq_map_raw: ipar=/[EMAIL PROTECTED]/[EMAIL PROTECTED], size=2
>> mpic: xlate (2 cells: 0x0037 0x0001) to line 0x37 sense 0x8

Now, that looks OK! Those are what I would expect. And when the 
mdio/phy are probed, configured, and the 88E1145 interrupt (EXT7 
(0x37H)) is enabled, the interrupt never (seemingly) gets cleared,
and basically hangs the entire box up and eventually it panics!
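
The xlate line in the log shows the two interrupt cells being turned into a line number and a trigger type. A small sketch of that decoding follows. The table order and the IRQ_TYPE_* values are assumed to mirror the kernel's definitions. The only mapping the log itself confirms is sense cell 1 becoming 0x8.

```python
# Linux IRQ trigger flags (values assumed from the kernel headers).
IRQ_TYPE_EDGE_RISING = 0x1
IRQ_TYPE_EDGE_FALLING = 0x2
IRQ_TYPE_LEVEL_HIGH = 0x4
IRQ_TYPE_LEVEL_LOW = 0x8

# Assumed order of the MPIC sense table, indexed by the second cell.
MPIC_SENSES = [
    IRQ_TYPE_EDGE_RISING,   # 0
    IRQ_TYPE_LEVEL_LOW,     # 1
    IRQ_TYPE_LEVEL_HIGH,    # 2
    IRQ_TYPE_EDGE_FALLING,  # 3
]

def mpic_xlate(cells):
    """Turn a two-cell interrupt specifier into (line, trigger flags)."""
    line, sense = cells
    return line, MPIC_SENSES[sense & 3]

line, flags = mpic_xlate([0x37, 0x1])
print(f"line {line:#x} sense {flags:#x}")
```

Under these assumptions the sense decodes as level, active-low, so the log is consistent with the intended polarity and says nothing about whether line 0x37 is the right line.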

I don't even have an external phy (SFP) connected to this 88E1145 phy..

I am at a loss - is there something I am missing in the device tree?
Help mr. wizard (Kumar?)...

Tom


RE: Trying to use Device Tree...and getting continuous interrupts from attached 88e1145

2007-08-13 Thread Morrison, Tom
It turns out that Andy was right and I had not understood the
NEW MPIC format for the definition of the external interrupts.
This was different than in the 2.6.11++ kernel...

Thank you Ben & Andy for your suggestions, unfortunately,
I had to learn the hard way that it was more fundamental
than I had imagined...

I continue to have problems, but I will itemize those in a
separate email.

Tom Morrison


-Original Message-
From: Andy Fleming [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 06, 2007 2:10 PM
To: Morrison, Tom
Cc: linuxppc-dev@ozlabs.org
Subject: Re: Trying to use Device Tree...and getting continuous
interrupts from attached 88e1145


>>> [EMAIL PROTECTED] {
>>> #address-cells = <1>;
>>> #size-cells = <0>;
>>> device_type = "mdio";
>>> compatible = "gianfar"; 
>>> reg = <24520 20>;
>>> phy1: [EMAIL PROTECTED] {
>>> interrupt-parent = <&mpic>;
>>> interrupts = <37 1>;


> How recent of a kernel are you using?  The current kernel assigns
> the external interrupts to be the low 12 interrupts, which would make
> your interrupt assignment wrong.
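
Andy's point can be turned into a small conversion. The sketch assumes the older arch/ppc numbering put the 12 external interrupts at offset 48, which is consistent with EXT7 being 0x37 in the original post, and that the newer numbering puts them at 0..11. The function name is hypothetical.

```python
OLD_EXT_BASE = 48   # assumed: old arch/ppc number of EXT0 (0x30)
NUM_EXT = 12        # "the low 12 interrupts"

def old_ext_to_new(old_irq):
    """Convert an old-style external interrupt number to the new 0..11 range."""
    ext = old_irq - OLD_EXT_BASE
    if not 0 <= ext < NUM_EXT:
        raise ValueError(f"{old_irq:#x} is not an old-style external interrupt")
    return ext

new_irq = old_ext_to_new(0x37)         # EXT7 in the old numbering
print(f"interrupts = <{new_irq:x} 1>;")
```

Under that assumption the phy node's interrupt line would be 7 rather than 0x37, with the sense cell unchanged.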



RE: Early UART setup on 2.6 kernel for mpc85xx

2007-08-13 Thread Morrison, Tom
>> In order to debug the kernel 2.6, I want setup serial port with 
>> UART on mpc85xx as early as possible. I add the register access 
>> code at the beginning of  platform_init(). For example, I try 
>> to write THR register(0xe0004500). However the system just 
>> hanging there with this line.



If you are using a relatively new kernel like I am starting 
up with - you don't need to add anything - you can use the
"Early Debugging/Early Console" which defines PPC_EARLY_DEBUG
You can find this in the kernel hacking options when you 
go in and configure your linux kernel.

This causes the udbg serial driver to be initialized, and 
99% of the early debug output is already put to the screen.
This hands the serial port over to the console driver later
on in the boot, and it works great (good job whoever wrote
this piece - more than a helpful tool!).

FWIW, you really can't debug the earliest init code 
because most of that is in assembly. Get a JTAG emulator
(BDI or Lauterbach) and start stepping through.

Tom Morrison




On the other side of early_debug...

2007-08-13 Thread Morrison, Tom
It looks like I am getting up to the point where the
rootfs has been NFS mounted correctly. Whew, but the
first thing the INIT program does is try to open the
/dev/console device - and it goes oops:

>> kernel BUG at drivers/char/tty_io.c:781!
>> Oops: Exception in kernel mode, sig: 5 [#1]
>> Empirix MPC848 Cheetah Board
>> Modules linked in:
>> NIP: c01390d4 LR: c013abe4 CTR: 
>> REGS: effc3d50 TRAP: 0700   Not tainted (2.6.23-r3_tom-g46b28357-dirty)
>> MSR: 00021000   CR: 24242422  XER: 
>> TASK = effc1a80[1] 'init' THREAD: effc2000
>> GPR00: 0001 effc3e00 effc1a80  effc3de0 0001 c02b60c8
>> GPR08: 1aa3  c030b730  22242422 1001f2f8 fff0 007fff94
>> GPR16: 2000   effc3e30 c17c2528 c17c2530 0130 0128
>> GPR24: c02c 0001 0001   00029000 c17c2400 0001
>> NIP [c01390d4] tty_ldisc_put+0x44/0x70
>> LR [c013abe4] release_dev+0x4ec/0x6a0
>> Call Trace:
>>  [effc3e00] [c021d800] __mutex_unlock_slowpath+0x40/0xd4 (unreliable)
>>  [effc3e20] [c013abe4] release_dev+0x4ec/0x6a0
>>  [effc3ee0] [c013adac] tty_release+0x14/0x28
>>  [effc3ef0] [c0068844] __fput+0x178/0x19c
>>  [effc3f10] [c0066a5c] filp_close+0x54/0xac
>>  [effc3f30] [c0066b34] sys_close+0x80/0xc0
>>  [effc3f40] [c000c54c] ret_from_syscall+0x0/0x3c
>> Instruction dump:
>> 7c000110 7cd0 0f00 7fa000a6 7c000146 1d43004c 3d20c031 3929b730
>> 7d4a4a14 816a0048 212b 7c095914 <0f00> 396b 806a0044 916a0048

Now, it seems the root cause is that the tty's
console_init was NEVER being called.

I am wondering if there is any rhyme or reason for this?




Discontiguous Memory for MPC85xx Series

2007-08-14 Thread Morrison, Tom
I am working on a separate project where there is a
discontiguous physical memory map. In looking at
solutions, it looks like the way to go is to
use something like the SPARSEMEM option.

FWIW, it even mentions something about discontinuous 
memory in the Device Tree example about the 970 CPU
(Documentation/powerpc/booting-without-of.txt)

I am wondering if there is an implementation for
SPARSEMEM or DISCONTIGMEM being worked on in the 
powerpc branch that I could use for the mpc85xx family? 

Or does somebody have another suggestion on how 
I would do this?

Thanks in advance for anything you can suggest!

Tom Morrison
Principal S/W Engineer
Empirix, Inc (www.empirix.com)
[EMAIL PROTECTED]
(781) 266 - 3567



Problem with PHYS_64BIT on E500 Core (2.6.23.1)

2007-11-06 Thread Morrison, Tom
I have a MPC8548E Board on which, with an earlier version of
the kernel (2.6.11++), we customized head_e500.S and other
files to support PHYS_64BIT & PTE_64BIT based upon
the work done for PPC64. It works very well.

I am attempting to update our kernel to the latest and have
gotten the basic system up & running (with some hacks/problems
that I won't go into until I am finished). We are using a cuboot.85xx
image because our u-boot does NOT support dtb.

I noticed that head_fsl_booke.S had the Large Physical
Address support, and I ported the other changes required, but
I get nowhere close to the code before the processor hangs.
I have tracked it down to where it is booting into the vmlinux
(which I assume is into head_fsl_booke.S). We haven't
hooked a debugger up to this yet - but I am positive that it
isn't making it out of the initial assembly initialization code.

The question is: Has anyone actually tried to do this yet?

Thanks in advance for your responses!

Tom Morrison
Principal S/W Engineer
Empirix, Inc (www.empirix.com)
[EMAIL PROTECTED]
(781) 266 - 3567



MSR_SPE - being turned off...

2009-05-04 Thread Morrison, Tom
I have both a MPC8548 SBC and MPC8572 system that are running different
flavors of the same Linux - 2.6.23.

I am explicitly turning MSR_SPE on very early on. Later, I have an
application that is compiled with SPE instructions (e.g.: evstdd), and
that is where the problems happen. If I explicitly make sure there are
NO SPE instructions in the application, nothing bad happens!

I am polling the MSR - and it seems the SPE is turned OFF?

What have I done wrong and/or have there been fixes in later kernels that
I should be aware of that might help this issue?

 

Tom Morrison
Principal Software Engineer

EMPIRIX 
20 Crosby Drive - Bedford, MA  01730
p: 781.266.3567 f: 781.266.3670 
email: tmorri...@empirix.com   
www.empirix.com  




 


RE: MSR_SPE - being turned off...

2009-05-05 Thread Morrison, Tom
Hi Kumar/Michael...

Sorry, I really didn't explain myself very well...

The Problem (answer to Michael):

We started using a new compiler that upon -O2 optimization - added
heavy SPE related instructions into our applications (where the older
compiler might not use as many). Once this was done, we started 
experiencing problems with data being 'shifted' and/or corrupted 
throughout the applications which didn't immediately cause problems,
but either scribbled on someone else's memory and/or bad results...
We knew where one of the offending scribbles started (by the shifting 
by 1 byte of a structure) and found by comparing binaries with 'older'
compiler vs. this one that the only major difference was the 'density' 
of the SPE instructions...

As to your question, Kumar: 
===
Naively, I explicitly enabled the SPE in a BSP 'early_init' program 
(as well as enabling Machine Checks) - which is what I meant by
Enabling SPE...

Michael explained that it is 'normal' if we asynchronously polled 
the MSR (in an application and/or in the kernel) that it might be 
disabled at the moment, but that you do a 'lazy switch' that 
enables it...and gets turned on when an SPE exception comes in...

...ok...I can live with that...

---where I was really going-

This is where I was trying to go. A developer at our company (who no
longer works for us) - did some research/development on the SPE 
functionality, in the hopes that we could create an optimized library.
The results were successful, but because of some of the restrictions 
(including 8 byte alignment for some instructions) - we decided not
to incorporate this library into our application(s)

But, this developer in his results, indicated that he believed our
kernels were NOT properly saving/restoring the upper 32bits of the 
GPR (which can/will be used in the SPE instructions)... Thus, if the
upper 32bits were not saved (and restored when the application got
the SPE to operate on)...then, he thought there would be problems.
He unfortunately, was unable to finish his work and fix these 'bugs'
before he left our company...

Again, I am only going on his results, and not my own investigations
(I am not sure where to start to find this problem to begin with)...

So, I was REALLY asking - has anybody else run into this type of
problem, and/or has the Linux community recognized this problem and
fixed it?

--

I hope I am a little clearer in the history / and outline of the 
problem I am trying to solve this time?

Thanks in advance!

Tom Morrison


>> -Original Message-
>> From: Kumar Gala [mailto:ga...@kernel.crashing.org]
>> Sent: Tuesday, May 05, 2009 7:08 AM
>> To: Morrison, Tom
>> Cc: linuxppc-dev@ozlabs.org
>> Subject: Re: MSR_SPE - being turned off...
>> 
>> 
>> On May 4, 2009, at 5:25 PM, Morrison, Tom wrote:
>> 
>> > I have both a MPC8548 SBC and MPC8572 system that are running
>> > different flavors of the
>> > same Linux - 2.6.23.
>> >
>> > I explicitly am turning it on very early on. Later, I have an
>> > application that is compiled
>> > with SPE instructions (e.g.: evstdd) , and there is where the
>> > problems happen. If I explicitly
>> > make sure there are NO SPE instructions in the application, nothing
>> > bad happens!
>> >
>> > I am polling the MSR - and it seems the SPE is turned OFF?
>> >
>> > What have I done wrong and/or has there been fixes in later kernels
>> > that I should be aware of that might help this issue?
>> 
>> Can you explain what you mean by explicitly am turning it on very
>> early on.
>> 
>> I can't think of anything that has changed w/regards to SPE handling.
>> 
>> - k


RE: MSR_SPE - being turned off...

2009-05-05 Thread Morrison, Tom
Ok...taken out...

>> -Original Message-
>> From: Kumar Gala [mailto:ga...@kernel.crashing.org]
>> Sent: Tuesday, May 05, 2009 5:18 PM
>> To: Morrison, Tom
>> Cc: linuxppc-dev@ozlabs.org; Michael Neuling
>> Subject: Re: MSR_SPE - being turned off...
>> 
>> 
>> On May 5, 2009, at 7:56 AM, Morrison, Tom wrote:
>> 
>> > Hi Kumar/Michael...
>> >
>> > Sorry, I really didn't explain myself very well...
>> >
>> > The Problem (answer to Michael):
>> > 
>> > We started using a new compiler that upon -O2 optimization - added
>> > heavy SPE related instructions into our applications (where the older
>> > compiler might not use as many). Once this was done, we started
>> > experiencing problems with data being 'shifted' and/or corrupted
>> > throughout the applications which didn't immediately cause problems,
>> > but either scribbled on someone else's memory and/or bad results...
>> > We knew where one of the offending scribbles started (by the shifting
>> > by 1 byte of a structure) and found by comparing binaries with 'older'
>> > compiler vs. this one that the only major difference was the 'density'
>> > of the SPE instructions...
>> >
>> > As to your question, Kumar:
>> > ===
>> > Naively, I explicitly enabled the SPE in a BSP 'early_init' program
>> > (as well as enabling Machine Checks) - which is what I meant by
>> > Enabling SPE...
>> 
>> Are you setting MSR_SPE in your own board code?  If so stop doing so.
>> There isn't any need or reason to be doing that.  MSR_SPE will get set
>> when an application starts using SPE code and the kernel will manage
>> it properly.
>> 
>> - k
>> 



RE: MSR_SPE - being turned off...

2009-05-05 Thread Morrison, Tom
The test case we found is under 'extreme' duress 
(intense loading on an MPC8572)...with many applications
using A LOT of SPE instructions...



If you look at the context switch code (in latest code entry_32.S), 
I believe the context switch performs a SAVE_NVGPR() - which in our 
interpretation (in ppc_asm.h) - only saves the lower 32 bits of 
the GPR (stw/lwz)...

This is only a guess of where the problem lies - based upon the single
SPE instruction that seemingly got misinterpreted, and shifts the data
By '1 byte' (and this code gets executed successfully MANY more times 
at lower bandwidths - than failures seen at higher bandwidths)...



I am not sure how to proceed...we know how to recreate with our 
application, but we would love to know how to change (safely) 
the pt_regs to "long long" for the GPRs and then safely move
all 64bits of each GPR into these doubles...

We could then re-test and see if this helps?

Tom



>> -Original Message-
>> From: Michael Neuling [mailto:mi...@neuling.org]
>> Sent: Tuesday, May 05, 2009 8:02 PM
>> To: Morrison, Tom
>> Cc: Kumar Gala; linuxppc-dev@ozlabs.org
>> Subject: Re: MSR_SPE - being turned off...
>> 
>> > Hi Kumar/Michael...
>> >
>> > Sorry, I really didn't explain myself very well...
>> >
>> > The Problem (answer to Michael):
>> >
>> > ================================
>> > We started using a new compiler that upon -O2 optimization - added
>> > heavy SPE related instructions into our applications (where the older
>> > compiler might not use as many). Once this was done, we started
>> > experiencing problems with data being 'shifted' and/or corrupted
>> > throughout the applications which didn't immediately cause problems,
>> > but either scribbled on someone else's memory and/or bad results...
>> > We knew where one of the offending scribbles started (by the shifting
>> > by 1 byte of a structure) and found by comparing binaries with 'older'
>> > compiler vs. this one that the only major difference was the 'density'
>> > of the SPE instructions...
>> >
>> > As to your question, Kumar:
>> > ===========================
>> > Naively, I explicitly enabled the SPE in a BSP 'early_init' program
>> > (as well as enabling Machine Checks) - which is what I meant by
>> > Enabling SPE...
>> 
>> Yeah, you don't want to do this.  It'll potentially break your
>> application.
>> 
>> I'm not that familiar with the CPU you are using but I'm guessing that
>> you can't write the MSR from user space anyway.
>> 
>> > Michael explained that it is 'normal' if we asynchronously polled
>> > the MSR (in an application and/or in the kernel) that it might be
>> > disabled at the moment, but that you do a 'lazy switch' that
>> > enables it...and gets turned on when an SPE exception comes in...
>> >
>> > ...ok...I can live with that...
>> >
>> > ---where I was really going-
>> >
>> > This is where I was trying to go. A developer at our company (who no
>> > longer works for us) - did some research/development on the SPE
>> > functionality, in the hopes that we could create an optimized library.
>> > The results were successful, but because of some of the restrictions
>> > (including 8 byte alignment for some instructions) - we decided not
>> > to incorporate this library into our application(s)
>> >
>> > But, this developer in his results, indicated that he believed our
>> > kernels were NOT properly saving/restoring the upper 32bits of the
>> > GPR (which can/will be used in the SPE instructions)... Thus, if the
>> > upper 32bits were not saved (and restored when the application got
>> > the SPE to operate on)...then, he thought there would be problems.
>> > He unfortunately, was unable to finish his work and fix these 'bugs'
>> > before he left our company...
>> >
>> > Again, I am only going on his results, and not my own investigations
>> > (I am not sure where to start to find this problem to begin with)...
>> >
>> > So, I was REALLY asking - has anybody else run into this type of
>> > problem, and/or the Linux community has recognized this problem and
>> > has fixed this?
>> 
>> If GPRs were getting corrupted in userspace, that would be a serious
>> bug and would be noticed by someone pretty quickly.
>> 
>> We'd really need a test case to get anywhere with this report.
>> 
>> Mikey


RE: MSR_SPE - being turned off...

2009-05-06 Thread Morrison, Tom
Kumar,
 
What about the case of a context switch (i.e.: when things are setup
in registers for the SPE, but then a context switch happens before
the SPE is executed)? 
 
As to load_up_spe & giveup_spe, it was pointed out to me tonight by a co-worker
to look at how things are saved in those routines. I definitely will look at
this again and see how it is done...
 
This is happening for us on an 8572 SMP. We are trying to get it to happen 
on 8548 (and single core 8572), but we haven't been able to push this part 
of the application as hard as it is being pushed on 8572...but we will keep 
trying
 
thank you for your patience and suggestions on this...and I will keep working it
 
Tom 



From: Kumar Gala [mailto:ga...@kernel.crashing.org]
Sent: Wed 5/6/2009 12:23 AM
To: Morrison, Tom
Cc: Michael Neuling; linuxppc-dev@ozlabs.org
Subject: Re: MSR_SPE - being turned off... 




On May 5, 2009, at 7:42 PM, Morrison, Tom wrote:

> The test case we found is under 'extreme' duress
> (intense loading on an MPC8572)...with many applications
> using A LOT of SPE instructions...
>
> 
>
> If you look at the context switch code (in latest code entry_32.S),
> I believe the context switch performs a SAVE_NVGPR() - which in our
> interpretation (in ppc_asm.h) - only saves the lower 32 bits of
> the GPR (stw/lwz)...
>
> This is only a guess of where the problem lies - based upon the single
> SPE instruction that seemingly got misinterpreted, and shifts the data
> By '1 byte' (and this code gets executed successfully MANY more times
> at lower bandwidths - than failures seen at higher bandwidths)...
>
> 
>
> I am not sure how to proceed...we know how to recreate with our
> application, but we would love to know how to change (safely)
> the pt_regs to "long long" for the GPRs and then safely move
> all 64bits of each GPR into these doubles...
>
> We could then re-test and see if this helps?
>
> Tom

If you use SPE in an application the full 64 bits are saved and
restored; they are just split into two locations (one for the lower
32 bits and one for the upper 32 bits).

Look at load_up_spe and giveup_spe in
arch/powerpc/kernel/head_fsl_booke.S
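
The split described here can be pictured as below. It is only a model of the data layout (one 32-bit slot for the low half and a separate one for the high half), not the kernel's assembly. The names `gpr_lo` and `evr_hi` are illustrative.

```python
MASK32 = 0xFFFFFFFF

def save_spe_gpr(reg64):
    """Split a 64-bit SPE register into (low half, high half)."""
    return reg64 & MASK32, (reg64 >> 32) & MASK32

def restore_spe_gpr(gpr_lo, evr_hi):
    """Rebuild the 64-bit register from its two saved halves."""
    return (evr_hi << 32) | gpr_lo

value = 0x1122334455667788
gpr_lo, evr_hi = save_spe_gpr(value)
restored = restore_spe_gpr(gpr_lo, evr_hi)

# If only the low half were restored, the upper 32 bits would be lost.
low_only = restore_spe_gpr(gpr_lo, 0)
```

The `low_only` case is the failure being suspected in this thread: a path that saves the low words with stw/lwz is fine as long as some other path saves and restores the high words.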

On the 8572 are you running w/SMP?  What kernel version are you using 
if so?  Do you see the same issue on the MPC8548?

- k




RE: MSR_SPE - being turned off...

2009-05-06 Thread Morrison, Tom
I'm sorry I forgot to mention that: this issue was found with our
currently running kernel 2.6.23.final (what comes with the
Freescale LTIB BSP package dated 05/23/2009).

I am sorry if I don't understand your statement that the SMP might
be broken on this kernel, because I tried to analyze the kernel that
came with the latest BSP LTIB package from Freescale (dated 12/18/2009,
where we got the 4.2.171 compiler from), and the associated 'switch
context' code is exactly the same. Unfortunately, I have not started
the process of porting my current platform's BSP to this new kernel -
otherwise, I would have done the test on that platform (this also
requires a new version of u-boot in order to test correctly).

I may have mis-interpreted something and/or I am sure I don't 
understand everything about the SMP resource management (and 
associated SPE management), so thank you for any insight you 
may have on this front...

Tom

>> -Original Message-
>> From: Kumar Gala [mailto:ga...@kernel.crashing.org]
>> Sent: Wednesday, May 06, 2009 8:32 AM
>> To: Morrison, Tom
>> Cc: Michael Neuling; linuxppc-dev@ozlabs.org
>> Subject: Re: MSR_SPE - being turned off...
>> 
>> 
>> On May 6, 2009, at 3:31 AM, Morrison, Tom wrote:
>> 
>> > Kumar,
>> >
>> > What about the case of a context switch (i.e.: when things are setup
>> > in registers for the SPE, but then a context switch happens before
>> > the SPE is executed)?
>> 
>> context switches will be fine.  What we normally do is keep track of
>> which user app used SPE last and when some other app needs it we clear
>> MSR_SPE for the old app, save its registers.  Then we load up the
>> registers for the new app and set MSR_SPE.  When the old app context
>> switches in it will get an SPE unavail exception at the point it
>> executes its next SPE insn and we will repeat the process.
>> 
>> > As to load_up_spe & give_up_spe, it was pointed out to me tonight by
>> > a co-worker to look at how things are saved in those routines, I
>> > definitely will look at this again, and see how it is done...
>> >
>> > This is happening for us on an 8572 SMP. We are trying to get it to
>> > happen on 8548 (and single core 8572), but we haven't been able to
>> > push this part of the application as hard as it is being pushed on
>> > 8572...but we will keep trying
>> 
>> Again, what kernel version for 8572?  It's possible old SMP kernels are
>> broken on 8572.
>> 
>> - k
>> 
>> > 
>> >
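
For readers following the thread, the lazy SPE hand-off Kumar describes in the quoted text above can be sketched as a single-CPU toy model. This is only an illustration in Python, not the kernel's code; the class and method names are hypothetical, and `regs` is a stand-in for the whole SPE register file.

```python
# Toy model of lazy SPE ownership; class and method names are hypothetical.
class Cpu:
    def __init__(self):
        self.spe_owner = None   # task whose SPE state is live in the registers
        self.saved = {}         # per-task saved SPE register contents
        self.regs = None        # stand-in for the 64-bit SPE register file

    def spe_unavailable(self, task):
        """Model of the 'SPE unavailable' exception taken by `task`."""
        if self.spe_owner is not None and self.spe_owner != task:
            # Save the old owner's registers (its MSR_SPE is now clear).
            self.saved[self.spe_owner] = self.regs
        self.regs = self.saved.get(task)   # load the new owner's registers
        self.spe_owner = task              # MSR_SPE is now set for `task`

    def run_spe_insn(self, task, value=None):
        """Execute one SPE instruction for `task`, optionally writing `value`."""
        if self.spe_owner != task:
            self.spe_unavailable(task)
        if value is not None:
            self.regs = value
        return self.regs


cpu = Cpu()
cpu.run_spe_insn("A", 1)      # A becomes the SPE owner
cpu.run_spe_insn("B", 2)      # B faults, A's state is saved first
print(cpu.run_spe_insn("A"))  # A faults again and gets its own state back
```

In this model a task only ever sees the register contents it last wrote, which is the property the discussion in this thread is questioning for the real SMP kernel.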



RE: MSR_SPE - being turned off...

2009-05-06 Thread Morrison, Tom
After sitting with the developer of the application for a while, we may
have 
two separate issues...

a) Alignment (aka: alignment exceptions) - Looking at how it handles it 
And attempts to 

b) For aligned data - we still contend that if you have enough tasks
working

>> -Original Message-
>> From: Kumar Gala [mailto:ga...@kernel.crashing.org]
>> Sent: Wednesday, May 06, 2009 8:44 AM
>> To: Morrison, Tom
>> Cc: Michael Neuling; linuxppc-dev@ozlabs.org
>> Subject: Re: MSR_SPE - being turned off...
>> 
>> Can you describe the # of processes you are running in your test.  Is
>> it possible for you to try the tests w/2.6.29 from kernel.org?
>> 
>> - k
>> 
>> On May 6, 2009, at 7:42 AM, Morrison, Tom wrote:
>> 
>> > I'm sorry I forgot to put that, this issue was found with our
>> > currently running kernel 2.6.23.final (what comes with the
>> > Freescale LTIB BSP package dated 05/23/2009).
>> >
>> > I am sorry if I don't understand your statement that the SMP might
>> > be broken on this kernel, because I tried to analyze the kernel that
>> > came with the latest BSP LTIB package from Freescale (dated
>> > 12/18/2009 (where we got the 4.2.171 compiler from)), and the
>> > associated 'switch context' code is exactly the same. Unfortunately,
>> > I have not started the process of porting my current platform's BSP
>> > to this new kernel - otherwise, I would have done the test on that
>> > platform (this also requires a new version of u-boot in order to
>> > test correctly))..
>> >
>> > I may have mis-interpreted something and/or I am sure I don't
>> > understand everything about the SMP resource management (and
>> > associated SPE management), so thank you for any insight you
>> > may have on this front...
>> >
>> > Tom
>> >
>> >>> -Original Message-




RE: MSR_SPE - being turned off...

2009-05-06 Thread Morrison, Tom
Sorry, let me try again...

After sitting with the developer of the application for a while, we may
have two separate issues...

a) Alignment (aka: alignment exceptions) - Looking at how it
   handles the instruction: it interprets these SPE instructions as
   common instructions & then resets the 'upper' 32 bits.

   I was just made aware that on 9/14/2007 Kumar submitted a
   patch that handles these instructions correctly (we don't
   have that version - I am in the process of trying to port it
   to my current version of the kernel, to see if it is part of
   the problem).

   In general, this is a VERY disturbing thing. (a) We 'turn on
   SPE' in the compiler (-mspe=yes). (b) We are NOT explicitly
   using SPE instructions in our application, BUT (c) the 4.2.171
   compiler (having origins from Code Sourcery (via Freescale)),
   upon optimization, puts SPE instructions in without any regard
   for alignment (which, instead of making the code faster, might
   actually make the code slower)? It's a little disturbing to me.

   Stay tuned for more details about my port - and seeing if some
   of my problems go away..

b) We still contend that if you have multiple tasks using a (VERY)
   high density of SPE instructions - and the system is taxed heavily
   (with lots of context switches) - there is the possibility that
   a task will get unlucky and the registers it set up will NOT be
   there after the context switches back (if some other task does
   something else with the entire 64 bits).



Tom

>> 
>> >> -Original Message-
>> >> From: Kumar Gala [mailto:ga...@kernel.crashing.org]
>> >> Sent: Wednesday, May 06, 2009 8:44 AM
>> >> To: Morrison, Tom
>> >> Cc: Michael Neuling; linuxppc-dev@ozlabs.org
>> >> Subject: Re: MSR_SPE - being turned off...
>> >>
>> >> Can you describe the # of processes you are running in your test.
>> >> Is it possible for you to try the tests w/2.6.29 from kernel.org?
>> >>
>> >> - k
>> >>
>> >> On May 6, 2009, at 7:42 AM, Morrison, Tom wrote:
>> >>
>> >> > I'm sorry I forgot to put that, this issue was found with our
>> >> > currently running kernel 2.6.23.final (what comes with the
>> >> > Freescale LTIB BSP package dated 05/23/2009).
>> >> >
>> >> > I am sorry if I don't understand your statement that the SMP
>> >> > might be broken on this kernel, because I tried to analyze the
>> >> > kernel that came with the latest BSP LTIB package from Freescale
>> >> > (dated 12/18/2009 (where we got the 4.2.171 compiler from)), and
>> >> > the associated 'switch context' code is exactly the same.
>> >> > Unfortunately, I have not started the process of porting my
>> >> > current platform's BSP to this new kernel - otherwise, I would
>> >> > have done the test on that platform (this also requires a new
>> >> > version of u-boot in order to test correctly))..
>> >> >
>> >> > I may have mis-interpreted something and/or I am sure I don't
>> >> > understand everything about the SMP resource management (and
>> >> > associated SPE management), so thank you for any insight you
>> >> > may have on this front...
>> >> >
>> >> > Tom
>> >> >
>> >> >>> -Original Message-
>> 



How to debug a hung multi-core system....

2009-05-20 Thread Morrison, Tom
All,

First off, we turned SPE off completely in our build - so we 
could debug a much deeper problem that seems to be occurring 
in our application (before we try to find a potential test 
case for corruption of GPR registers).

We have had this problem for 3 weeks, and just recently have 
come down to a single test case that makes it fail (although 
extremely complicated test case)...

Setup:   
   Master Blade (8548E) with Linux 2.6.23 (and custom BSP)
   Slave Blade (8572E) with Linux 2.6.23 (and similar custom BSP).

The Master Blade works flawlessly (and also works flawlessly in a
slave capacity). The single 'slave' 8572E blade communicates with
the 'master' blade over TCP/IP & PCI Express (and is running a
similar application)...

Running Single Core on slave 8572E (nosmp option on command line) 
the application works in all conditions (from modestly loaded to 
well oversubscribed/pegged CPU).

In Multi-core option, the application also works flawlessly. The 
problem comes when we oversubscribe our application and push 
this 'slave' blade to the extreme edge of processing (falling 
behind in our processing...etc). 

Eventually, sometime between 5-15 minutes, this board becomes 
hung (where the console becomes completely unresponsive and 
you cannot 'ping' the box).

I have a JTAG WindRiver ICE and connect to this blade after it 
is hung, and it appears that both cores are running to some 
extent:

   Core 1 seems to be in the idle loop - happily doing nothing
   (and not servicing TCP and/or the console)...

   Core 0 seems to be 'stuck' at the "InstructionStorage"
   exception. And it seems to be going 'nowhere' fast

SRR0 seems to point to this same spot (0xc6C0)
SRR1 value is 0x00021200 

I am at a loss to see how the kernel (and/or our kernel BSP) could
cause this exception, and I am even more at a loss figuring out
how an application could cause this exception...

Anybody have any ideas - and/or ways to re-configure our 
setup to obtain more data? Or does this sound familiar to 
a bug somebody has already found in the kernel?

We are even having trouble defining a test program that can
cause (on purpose) the 'InstructionStorage' exception. Does
anybody have a simple 'c' (or ppc assembly) program that
causes this exception (so we can run it in user application land
and see if the symptoms are similar)?

Thank you in advance for any / all help you can provide
because I am completely stumped on even how to proceed!

Sincerely,

Tom Morrison
Principal Software Engineer


EMPIRIX 
20 Crosby Drive - Bedford, MA  01730
p: 781.266.3567 f: 781.266.3670 
email: tmorri...@empirix.com 
www.empirix.com





RE: How to debug a hung multi-core system....

2009-05-21 Thread Morrison, Tom


 

>> -Original Message-
>> From: Kumar Gala [mailto:ga...@kernel.crashing.org]
>> Sent: Thursday, May 21, 2009 9:13 AM
>> To: Morrison, Tom
>> Cc: linuxppc-dev@ozlabs.org; Young, Andrew; Brown, Jeff
>> Subject: Re: How to debug a hung multi-core system
>> 
>> On May 20, 2009, at 6:17 PM, Morrison, Tom wrote:
>> >
>> >   Core 1 seems to be Idle loop - happily doing nothing
>> >(and not servicing TCP and/or the console)...
>> >
>> >   Core 0 seems to be 'stuck' at the "InstructionStorage"
>> >Exception. And it seems to be going 'nowhere' fast
>> >
>> > SRR0 seems to point to this same spot (0xc6C0)
>> > SRR1 value is 0x00021200
>> >
>> > I am at a loss to see how the kernel (and/or our kernel BSP)
>> > cause this exception, and I am even more of a loss on figuring
>> > out an application could cause this exception...
>> 
>> This is a bit odd as we shouldn't see an ISI from 0xc6C0.
>> 
>> Are you able to single step Core0?  Can you dump the contents of the
>> TLBs on Core0

[Morrison, Tom]

Yes, very odd...

And I am able to get TLB entries from the core that is in
Instruction Storage Exception, I made

>BKM>tat

Entry  EPN  RPN  TID  TMASK   WIMGE  TSIZ U0:3  X0:1   PID  TS PROT SHEN   UR   UW   UX   SR   SW   SX  TIDZ VAL

IT0  C000  00 000 0A 0 0 0 0 0UPDDDDDDDI
IT1  C000  00 000 0A 0 0 0 0 0UPDDDDDDDI
IT2  C000  00 000 0A 0 0 0 0 0UPDDDDDDDI
IT3  C000  00 000 0A 0 0 0 0 0UPDDDDDDDI
DT0  0011C000  00 000 06 0 0 0 0 0UPDDDDDDDI
DT1  D435C000 2000 00 000 1E 0 0 0 0 0UPDDDDDDDI
DT2  0011C000  00 000 06 0 0 0 0 0UPDDDDDDDI
DT3  D435C000 2000 00 000 1E 0 0 0 0 0UPDDDDDDDI
LT0  C000  00 0FF 04 9 0 0 0 0PPEEDEEDDV
LT1  D000 0100 00 0FF 04 9 0 0 0 0PPEEDEEDDV
LT2  E000 0200 00 0FF 04 9 0 0 0 0PPEEDEEDDV
LT3  39A4 027FF700 0D 000 06 E A 3 0 1USDDDEEDDI
LT4  F924E000 7C054500 BA 000 0B E 0 3 0 0PSEEDEEDDV
LT5  82A9F000 46664C00 FB 000 1A F 4 2 0 0USEEDDEDDI
LT6  8000 1F00 F2 0FF 1D 9 B 3 0 0USDEDEEEDV
LT7  6400 1F00 B3 07F 02 8 B 0 0 1USDEDDEEDV
LT8  E5BF1000 995EA900 96 000 0C D 8 0 0 1USDEEEEDDV
LT9  7F3BF000 C6DF7300 DF 000 15 1 2 3 0 1USEDDEEEDI
LT10 917C7000 EEA67F00 7F 000 17 C 5 3 0 1PSEEEEEEDI
LT11 6B00 F570 BC 03F 04 7 D 0 0 1PSEEEEEEDV
LT12 712DB000 F1B59100 2A 000 19 C F 1 0 1PSEEEEDEDV
LT13  F000 7F 0FF 07 B 0 0 0 1PSDDEEEEDV
LT14 A300 FDD0 C5 03F 16 7 E 3 0 1PSEEEDDEDV
LT15 F7F0 B0B8 82 00F 1F 5 F 0 0 1PPEEDDDDDV

To answer your 2nd question - we have about 10 processes, and
about 60-70 threads total (30+ for the main processing process)...

>> > Anybody have any ideas - and/or ways to re-configure our
>> > setup to obtain more data? Or does this sound familiar to
>> > a bug somebody has already found in the kernel?
>> >
>> > We are even 

RE: How to debug a hung multi-core system....

2009-05-21 Thread Morrison, Tom
What do you mean by 'odd' mappings (the EPN or RPN or ??)


Entry  EPN  RPN  TID  TMASK   WIMGE  TSIZ U0:3  X0:1
---
LT8  E5BF1000 995EA900 96 000 0C D 8 0



PID  TS  PROT SHEN   UR   UW   UX   SR   SW   SX  TIDZ VAL
01USDEEEEDDV

We are using a 36bit address (mainly to remap our I/O and local bus
devices to outside the 32bit addressing space)...

t



>> -Original Message-
>> From: Kumar Gala [mailto:ga...@kernel.crashing.org]
>> Sent: Thursday, May 21, 2009 10:45 AM
>> To: Morrison, Tom
>> Cc: linuxppc-dev@ozlabs.org; Young, Andrew; Brown, Jeff; Geary Sean-
>> R60898
>> Subject: Re: How to debug a hung multi-core system
>> 
>> > [Morrison, Tom]
>> > >BKM>tat
>> >
>> > Entry  EPN  RPN  TID  TMASK   WIMGE  TSIZ U0:3  X0:1
>> > PID  TS  PROT SHEN   UR   UW   UX   SR   SW   SX  TIDZ VAL
>> >
>> 
>> > LT0  C000  00 0FF 04 9 0 0
>> > 00PPEEDEEDDV
>> > LT1  D000 0100 00 0FF 04 9 0 0
>> > 00PPEEDEEDDV
>> > LT2  E000 0200 00 0FF 04 9 0 0
>> > 00PPEEDEEDDV
>> 
>> > LT4  F924E000 7C054500 BA 000 0B E 0 3
>> > 00PSEEDEEDDV
>> 
>> > LT6  8000 1F00 F2 0FF 1D 9 B 3
>> > 00USDEDEEEDV
>> > LT7  6400 1F00 B3 07F 02 8 B 0
>> > 01USDEDDEEDV
>> > LT8  E5BF1000 995EA900 96 000 0C D 8 0
>> > 01USDEEEEDDV
>> 
>> > LT11 6B00 F570 BC 03F 04 7 D 0
>> > 01PSEEEEEEDV
>> > LT12 712DB000 F1B59100 2A 000 19 C F 1
>> > 01PSEEEEDEDV
>> > LT13  F000 7F 0FF 07 B 0 0
>> > 01PSDDEEEEDV
>> > LT14 A300 FDD0 C5 03F 16 7 E 3
>> > 01PSEEEDDEDV
>> > LT15 F7F0 B0B8 82 00F 1F 5 F 0
>> > 01PPEEDDDDDV
>> 
>> Do you know what the Entry field means?  Are you guys putting your
>> own mappings into the TLB?  LT4..LT15 (with VAL = V) seem very odd
>> to me.
>> 
>> - k


RE: How to debug a hung multi-core system....

2009-05-21 Thread Morrison, Tom
Just had a little conference with several co-workers to go over
the results.

We think that LT0 (the one that maps the kernel) has been corrupted:

   Entry  EPN  RPN  TID  TMASK   WIMGE  TSIZ U0:3  X0:1
   ---
   LT0  C000  00 0FF 04 9 0 0

   PID  TS  PROT SHEN   UR   UW   UX   SR   SW   SX  TIDZ VAL
   ---
   00PPEEDEEDDV

This is absolutely wrong - this is the TLB entry for the kernel - and
as you can see, it does NOT have execute privileges (and in fact the
user space HAS execute privileges for this area, the complete opposite
of what it should be)...

This is why it is stuck AT that instruction (can't even single step
from that location)..
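
As a reading aid for the flag string in the dump above, here is a small Python sketch that pairs each character with its column name. The column names are copied from the debugger header; the one-character-per-column split is an assumption about the tool's output format, and `decode_flags` is a hypothetical helper, not part of any debugger.

```python
# Column names come from the debugger header; splitting the flag string
# one character per column is an assumption about the output format.
COLUMNS = ["PID", "TS", "PROT", "SHEN", "UR", "UW", "UX",
           "SR", "SW", "SX", "TIDZ", "VAL"]


def decode_flags(flags):
    """Map each header column to the matching character of the flag string."""
    if len(flags) != len(COLUMNS):
        raise ValueError("expected %d characters, got %d"
                         % (len(COLUMNS), len(flags)))
    return dict(zip(COLUMNS, flags))


lt0 = decode_flags("00PPEEDEEDDV")
print(lt0["SX"], lt0["VAL"])
```

Under that assumption, the supervisor-execute column (SX) for LT0 reads `D` while the entry is still marked valid (`V`).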

(One of) the first problem(s) is: how and when did this TLB entry get
corrupted?

Tom



Kernel bug in 2.6.23...was: RE: How to debug a hung multi-core system....

2009-05-28 Thread Morrison, Tom
Kumar,

To follow up on our postings from late last week...
(which I was expecting a response (but never got) from you)...

-

We (well, mostly a very bright engineer who was very persistent) 
have(has) found the origin of how the kernel TLB got corrupted.

We tracked down the problem to a programming bug in the DataStorage
exception handler for our kernel (2.6.23). We have looked at newer
kernels, and have noticed that this piece of processing has changed, 
but let me explain to you what happened (and the conditions that 
caused the problem on our MPC8572E (running SMP)...

If you follow the logic in this version of the kernel, it reads
the SPRN_DEAR into register R10, and then does some operations
(including a tlbsx operation (which uses R10)), and then attempts
to update the associated PTE entry.

Well, if you have REALLY bad luck, sometime between the time you 
took this exception and try to update the PTE for this page, the 
other core has decided to invalidate this page's PTE. The good 
part is the kernel recognizes this unlucky case.

Unfortunately, in this 'bad luck' case, a kernel bug was
introduced. The kernel uses R10 for some processing (puts
the physical address associated with this virtual page in it) and
then branches up 'above' the tlbsx operation to try again...

...without restoring R10 to the SPRN_DEAR value required by the
tlbsx operation...

This means that even though the kernel recognized this exceptional
problem, it NEVER did the right thing, and instead, the kernel would
(attempt to) modify the unlucky TLB entry whose virtual address
corresponds to the physical address of the original DataStorage
exception.

The only way we caught this is that we also had a second piece of
'bad luck' by having that physical address map to the virtual address
of the kernel (0xC000), and thus, when it loops back to try again,
it gets the kernel page(s) from the tlbsx operation, and modifies
permissions on the kernel pages, thus causing an InstructionStorage
exception (forever).

We fixed this in our kernel by just restoring R10 to SPRN_DEAR value
just before it loops back, something like this:


  
mtspr   SPRN_MAS1, r13
tlbwe

/* because we did NOT find in PTE */
/* r10 was changed - so we need   */
/* to re-load it here to work */
mfspr   r10, SPRN_DEAR  /* restore the faulting address */
b   5b  /* Try again */
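
To make the failure mode easier to follow, here is a toy Python model of the retry path as described in this message. It is only a sketch of the description, not the kernel's handler: every name is hypothetical, and the addresses are made-up placeholders chosen so that the physical address of the faulting page happens to equal the stand-in kernel virtual address.

```python
# Toy model of the retry path described above. Addresses are made-up
# placeholders and every name here is hypothetical, not kernel code.
FAULT_VADDR = 0x1000   # user address held in DEAR when the fault was taken
KERNEL_VADDR = 0x8000  # stand-in for the kernel's virtual base
FAULT_PADDR = 0x8000   # physical page behind FAULT_VADDR; collides numerically


def data_storage_retry(restore_r10):
    """Return the virtual address whose TLB entry gets its permissions rewritten."""
    tlb = {FAULT_VADDR: "user", KERNEL_VADDR: "kernel"}
    pte_present = [False, True]  # first pass: other core invalidated the PTE
    r10 = FAULT_VADDR            # mfspr r10, SPRN_DEAR
    for present in pte_present:
        searched = r10           # tlbsx looks up the TLB by the value in r10
        if present and searched in tlb:
            tlb[searched] = "rewritten"
            return searched
        r10 = FAULT_PADDR        # r10 reused as scratch on the unlucky path
        if restore_r10:
            r10 = FAULT_VADDR    # the fix: reload DEAR before branching back
    return None


print(hex(data_storage_retry(restore_r10=False)))
print(hex(data_storage_retry(restore_r10=True)))
```

Without the reload, the second pass searches the TLB with the stale scratch value and rewrites the entry for the stand-in kernel address; with the reload, it rewrites the entry for the faulting address.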
 


That's the short and long of it...and 4 weeks of very stressful
problems...

I am wondering why nobody has found this problem before - are we the
first to be this unlucky? I am not sure that is a good thing!

Comments? Suggestions? What else should I be doing with this
information?

Tom Morrison
Principal Software Engineer
EMPIRIX 
20 Crosby Drive - Bedford, MA  01730
p: 781.266.3567 f: 781.266.3670 
email: tmorri...@empirix.com 
www.empirix.com


>> -----Original Message-
>> From: Morrison, Tom
>> Sent: Thursday, May 21, 2009 11:24 AM
>> To: Morrison, Tom; Kumar Gala
>> Cc: linuxppc-dev@ozlabs.org; Young, Andrew; Brown, Jeff; Geary Sean-
>> R60898
>> Subject: RE: How to debug a hung multi-core system
>> 
>> Just had a little conference with several co-workers...to go over
>> results
>> 
>> We think that LT0 (the one that maps the kernel) has been corrupted:
>> 
>>    Entry  EPN  RPN  TID  TMASK   WIMGE  TSIZ U0:3  X0:1
>>    ---
>>    LT0  C000  00 0FF 04 9 0 0
>> 
>>    PID  TS  PROT SHEN   UR   UW   UX   SR   SW   SX  TIDZ VAL
>>    ---
>>    00PPEEDEEDDV
>> 
>> Is absolutely wrong - this is TLB for the kernel - and as you can see
>> ...it does NOT have execution privileges (and in fact the user space
>> HAS executive privileges for this area (complete opposite of what it
>> should be)...
>> 
>> This is why it is stuck AT that instruction (can't even single step
>> from that location)..
>> 
>> (one of) The first problem(s) is how can/when did this TLB get
>> corrupted!
>> 
>> Tom



PCI HotPlug and Adding Resources after Linux Boots

2009-09-22 Thread Morrison, Tom
I am not exactly sure who to direct this question to (general Linux kernel or
LinuxPPC), so I am directing it to both - in hopes that someone will recognize
this problem - and perhaps give me some suggestions on how to proceed...

I am running Linux (2.6.23x (and 2.6.27.x)) on a MPC8572 based system.

I have an 8616 switch that has a Port (6) connected to a FPGA that is
NOT loaded before Linux boots (note: this port is configured for HOTPLUG
events - which we do get after the FPGA is loaded). We are NOT using a
static device tree map (because the devices in the system are very dynamic).

We use instead the pci auto scan mechanism(s) to scan/assign resources
(including into the BAR registers) at bootup to all of the devices that are
attached to this MPC8572...

Here is the port that is attached to the device (note: there are NO
resources assigned at this point to this port):

-
02:06.0 PCI bridge: PLX Technology, Inc.: Unknown device 8616 (rev bb) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=02, secondary=05, subordinate=05, sec-latency=0
Capabilities: [40] Power Management version 3
Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/2 Enable+
Capabilities: [68] #10 [0162]
Capabilities: [a4] #0d []

r...@slave7 ~ # lspci -t
-+-[01]---00.0-[02-05]--+-01.0
 |  +-04.0-[03]--
 |  +-05.0-[04]--
 |  \-06.0-[05]-

-

Later, after I detect there is an FPGA to load - I load it. At completion of the
loading of the FPGA - the 8616  detects the FPGA - and creates a HotPlug
event that the PCI Express HotPlug Driver handles:
-

r...@slave7 ~ # pciehp: pcie_isr: intr_loc 8
pciehp: pciehp:  Presence/Notify input change.
pciehp: Card present on Slot(0005_0070)
pciehp: Surprise Removal
pciehp: hpc_get_power_status: SLOTCTRL 80 value read 8
pciehp: hpc_get_attention_status: SLOTCTRL 80, value read 8
pciehp: board_added: slot device, slot offset, hp slot = 0, 0 ,0
pciehp: hpc_check_lnk_status: lnk_status = 2021
PCI: Found :05:00.0 [1172/0004] 00ff00 00
PCI: Calling quirk c0012d3c for :05:00.0
program_fw_provided_values: Could not get hotplug parameters
entering assign resources (size: 200)
PCI: Failed to allocate mem resource #0:2000...@0 for :05:00.0
bus pci: add device :05:00.0
entering uevent
pci: Trying to Match Device :05:00.0 with Driver pcieport-driver
pci: Trying to Match Device :05:00.0 with Driver serial
pci: Trying to Match Device :05:00.0 with Driver pexntb
pciehp: hpc_get_power_status: SLOTCTRL 80 value read 8
pciehp: hpc_get_attention_status: SLOTCTRL 80, value read 8

02:06.0 PCI bridge: PLX Technology, Inc.: Unknown device 8616 (rev bb) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=02, secondary=05, subordinate=05, sec-latency=0
Capabilities: [40] Power Management version 3
Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/2 Enable+
Capabilities: [68] #10 [0162]
Capabilities: [a4] #0d []

05:00.0 Class ff00: Altera Corporation: Unknown device 0004 (rev 01)
Subsystem: Altera Corporation: Unknown device 0004
Flags: fast devsel
Capabilities: [50] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
Capabilities: [78] Power Management version 3
Capabilities: [80] #10 [0001]

r...@slave7 ~ # lspci -t
-+-[01]---00.0-[02-05]--+-01.0
 |  +-04.0-[03]--
 |  +-05.0-[04]--
 |  \-06.0-[05]00.0
 \-[00]---00.0

-

So, as you can see - the device has been read - and it requires 32M of
resources, but because its parent doesn't have any resources allocated - it
seemingly can't allocate and use any additional resources.
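
To illustrate why the allocation fails, here is a minimal Python sketch of the check involved: a BAR can only be placed if it fits, naturally aligned, inside the memory window the parent bridge forwards. `bar_fits` is a hypothetical helper and the window addresses in the usage lines are made-up values, not taken from this system; only the 32M size comes from the text above.

```python
# Hypothetical helper; the sample window values below are illustrative only.
MIB = 1024 * 1024


def bar_fits(window_base, window_limit, bar_size):
    """True if a naturally aligned BAR of bar_size fits in [base, limit]."""
    if window_limit < window_base:
        return False  # empty window: the bridge forwards no memory at all
    # Round the base up to the next multiple of the BAR size.
    aligned = (window_base + bar_size - 1) // bar_size * bar_size
    return aligned + bar_size - 1 <= window_limit


# A bridge with no window programmed cannot host the 32M BAR.
print(bar_fits(0x00100000, 0x00000000, 32 * MIB))
# A 64M window that was reserved up front can.
print(bar_fits(0x80000000, 0x80000000 + 64 * MIB - 1, 32 * MIB))
```

The second call models the idea of reserving a window on the bridge before the hotplug event arrives.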

How do I 'customize' and/or add resources at this point for this device (using
semi-standard mechanisms)?

Thanks in advance for any/all ideas...




Tom Morrison
Principal Software Engineer
EMPIRIX
20 Crosby Drive - Bedford, MA  01730
p: 781.266.3567 f: 781.266.3670
email: tmorri...@empirix.com
www.empirix.com




RE: PCI HotPlug and Adding Resources after Linux Boots

2009-09-23 Thread Morrison, Tom
Thank you for taking the time in confirming some of the 
potential strategies I was already thinking about...:-)

I am going to try to pre-allocate some resources for
the specific port/slot/bridge device (because we know
at all times that port #6 (bus #5) is going to have
a FPGA end device associated with it)...

Any other thoughts - whenever you can make them - would
be more than welcome!

Tom


>> -Original Message-
>> From: Benjamin Herrenschmidt [mailto:b...@kernel.crashing.org]
>> Sent: Wednesday, September 23, 2009 5:42 AM
>> To: Morrison, Tom
>> Cc: linuxppc-...@ozlabs.org; linux-ker...@vger.kernel.org
>> Subject: Re: PCI HotPlug and Adding Resources after Linux Boots
>> 
>> On Tue, 2009-09-22 at 15:36 -0400, Morrison, Tom wrote:
>> > I am not exactly sure who to direct this question to (general Linux
>> > kernel or LinuxPPC),
>> 
>> PCI Hotplug is reasonably arch specific at the moment so I suppose
>> here's is as good as anywhere else to ask :-)
>> 
>> > so I am directing to both - in hopes that someone will recognize this
>> > problem - and perhaps
>> >
>> > give me some suggestions on how to proceed...
>> 
>> There's a few things you can do, though I don't have time just right now
>> to give you a detailed answer. I'll try again later.
>> 
>> In the meantime, some of the answers could be around not using full
>> automatic resource assignment, but instead, pre-initializing the top
>> bridge with some resources that are going to be enough for the device.
>> 
>> You can also try to get the bridge to re-allocate. There's various funky
>> locking issues with doing that though as long as it's during boot time,
>> it's not too much of a problem.
>> 
>> There are other more or less hackish ways to do it, but I'll have to
>> give it more thought.
>> 
>> I'm quite stretched at the moment so if you don't hear back from me in
>> the upcoming few days, don't hesitate to ping me again.
>> 
>> Cheers,
>> Ben.
>> >
>> > I am running Linux (2.6.23x (and 2.6.27.x)) on a MPC8572 based system.
>> 


RE: PCI HotPlug and Adding Resources after Linux Boots

2009-09-25 Thread Morrison, Tom
Ben/all,

I have had some luck reserving some memory in the autoscan code
(which reserves the resources): in this hack code, I reserve an
appropriate amount of resources for the bridge (by detecting
this special type of port).

The problem comes later on - when the Hotplug event comes - and it
still can't allocate the resources...

A member of this group (who is away from the office right now) had 
the following comments:

>> In your case, if the only problem that you are running into is 
>> that the resources cannot be assigned to the FPGA, it may be 
>> sufficient to hardcode the forwarding addresses and subordinate 
>> bridge number for port 6 of your 8616.  The reason you are seeing 
>> that error message is because the parent bridge device for the 
>> detected FPGA (port 6 of the 8616) does not have forwarded 
>> resource regions that match what the FPGA is trying to claim.

I guess the question I am asking is whether my friend's statement is
possibly true. And, if he is correct - I am a little confused as to
exactly where/how I determine and configure these forwarding
addresses/subordinate bridge number (I'm really a newbie at this
level of PCI configuration).

Thank you in advance to any/all who can help me figure this out!

Tom



>> -Original Message-

>> 
>> There's a few things you can do, though I don't have time just right now
>> to give you a detailed answer. I'll try again later.
>> 
>> In the meantime, some of the answers could be around not using full
>> automatic resource assignment, but instead, pre-initializing the top
>> bridge with some resources that are going to be enough for the device.
>> 
>> You can also try to get the bridge to re-allocate. There's various funky
>> locking issues with doing that though as long as it's during boot time,
>> it's not too much of a problem.
>> 
>> There are other more or less hackish ways to do it, but I'll have to
>> give it more thought.
>> 
>> I'm quite stretched at the moment so if you don't hear back from me in
>> the upcoming few days, don't hesitate to ping me again.
>> 
>> Cheers,
>> Ben.
>> 
>> >
>> > I am running Linux (2.6.23x (and 2.6.27.x)) on a MPC8572 based system.
>> >
>> > I have an 8616 switch that has a Port (6) connected to a FPGA that is
>> > NOT loaded at before Linux boots (note: this port is configured for
>> > HOTPLUG events - which we do get after FPGA  is loaded). We are NOT
>> > using a static device tree map (because the devices in the system are
>> > very dynamic).
>> >
>> > We use instead the pci auto scan mechanism(s) to scan/assign
>> > resources (including into the BAR registers) at bootup to all of the
>> > devices that are attached to this MPC8572...
>> >
>> > Here is the port that is attached to the device (note: there are NO
>> > resources assigned at this point this port):
>> >
>> > ---
>> >
>> > 02:06.0 PCI bridge: PLX Technology, Inc.: Unknown device 8616 (rev bb)
>> > (prog-if 00 [Normal decode])
>> > Flags: bus master, fast devsel, latency 0
>> > Bus: primary=02, secondary=05, subordinate=05, sec-latency=0
>> > Capabilities: [40] Power Management version 3
>> > Capabilities: [48] Message Signalled Interrupts: 64bit+
>> > Queue=0/2 Enable+
>> > Capabilities: [68] #10 [0162]
>> > Capabilities: [a4] #0d []
>> >
>> > r...@slave7 ~ # lspci -t
>> > -+-[01]---00.0-[02-05]--+-01.0
>> >  |  +-04.0-[03]--
>> >  |  +-05.0-[04]--
>> >  |  \-06.0-[05]-
>> >
>> > ---
>> >
>> > Later, after I detect there is an FPGA to load - I load it. At
>> > completion of the loading of the FPGA - the 8616  detects the FPGA -
>> > and creates a Hot

PCI Express between MPC8572 & PLX8616

2008-11-04 Thread Morrison, Tom
Has anybody ever used this chip in their design??

FYI, we have a custom board that I am bringing up right now that has a
MPC8572E that has its PCIE1 connected to the subject line PLX8616
PCI Express Switch...

As part of the boot, we successfully discover and configure the internal
bridge (it's in root complex mode) inside the MPC8572E, and from 'many'
view points (hardware LED's and status registers) - there seem to
be 4 lanes in between the Switch and the CPU...

But, at the first external config cycle - something goes very wrong -
the response (or lack of that ==> 0x) is returned, and the associated
CCSR's PCIE1 interface is completely reset (we implement and attempt
to recover this interface via Freescale's CPU Errata #4) - but beyond
that - it never recovers and re-syncs these 4 lanes up

Has anybody had anything similar to this happen to them
in this type of situation...

We are grasping at straws, and have yet to get any further than this...

Thanks in advance...

 

Tom Morrison
Principal Software Engineer

EMPIRIX 
20 Crosby Drive - Bedford, MA  01730
p: 781.266.3567 f: 781.266.3670 
email: [EMAIL PROTECTED]   
www.empirix.com  




 


Patch to fix problem? was: PCI Express between MPC8572 & PLX8616

2008-11-06 Thread Morrison, Tom
Sorry for such a wide query...but I did some searching of the linux trees...
and I am at a loss to find what the below email refers to in terms of a fix
for PCI Express...and I am having a hard time finding it in all of the commits?

Any pointers to the right tree and/or the specific fix would be greatly
appreciated!




==> a friend responded to the original subject line..
 
> Has anybody every used this chip in their design??

No, but I have 2 other PLX PCI-E switches that would not behave.

> FYI, we have a custom board that I am bringing up right now
> that has a
> MPC8572E that has its PCIE1 connected to the subject line PLX8616
> PCI Express Switch...

So you have the cpu as root-complex attached to the PLX upstream port
and the PLX downstream ports going somewhere else.

> As part of the boot, we successfully discover config the
> internal bridge
> (its in root complex mode) inside the MPC8572E, and from 'many'
> view points (hardware LED's and status registers) - there seems to
> be 4 lanes in between the Switch and the CPU...

Are you doing this by using a PCI-E analyzer, or by what you see U-boot
doing.  BTW, in my case, the LEDs always worked but nothing else did.
So the lane good LEDs would light up, but the whole interface was
hopelessly broken.

> But, at the first external config cycle - something goes very wrong -
> the response (or lack of that ==> 0x) is returned, as well as
> and the associated CCSR's PCIE1 interface is completely
> reset (we implement and attempt to recover this interface
> via Freescale's CPU Errata #4) - but beyond that - it
> never recovers and re-sync's these 4 lanes up

By external configuration cycle, do you mean when the cpu goes out to
configure the items on the secondary bus side of the plx switch?  or
the secondary side of the internal pci-e bridge.  In either case, this
could be the same problem I just went through.  A workaround was recently
(like maybe 2 weeks ago) put into the kernel to address this on powerpc
systems, so you might try pulling down a recent kernel via git and see
if that helps.  I would have copied the mailing list, but for some
reason it is rejecting my mail at the moment.

I can say that I tested this fix and it definitely fixed all the
secondary ports on my system.  I used a pci-e analyzer so it was a
little easier to see what was going on, but in the end it was clear that
there was something wrong in the powerpc ports, b/c I did not see this
with the same parts on an x86 host.



 



8572E - machine check pin (MCP0)

2008-11-24 Thread Morrison, Tom
Running 2.6.23.25 kernel...

I have an external watchdog timer that is going off - and pulsing into
the MCP0 of the 8572E. I get the printk indicating that the MCP0 went
off - the problem is - how do I clear the condition that caused this?
My hardware engineer swears that the pulse is ONLY 250ms long (and I
have a delay of 250ms after my printk), so I am pretty sure the pulse
is over before I reset the status registers.

I am resetting the condition in mcpsumr (also, extra: the rstsr),
but after writing mcpsumr - and reading back - it still has 
the mcp0 bit set?
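
For illustration only, here is a toy model of one way a status bit can
read back set right after it was "cleared". It is a sketch under
assumptions that this thread does not confirm: that mcpsumr is a
write-1-to-clear register, and that a bit is latched again for as long
as its input is still active. The class, its methods and the MCP0_BIT
position are hypothetical names for this example, not the real register
layout.

```python
class W1CStatusRegister:
    """Toy model of a write-1-to-clear status register fed by a level input."""

    def __init__(self):
        self.value = 0      # latched status bits
        self.asserted = 0   # inputs currently held active by the hardware

    def set_input(self, mask, active):
        """Drive (active=True) or release (active=False) the inputs in mask."""
        if active:
            self.asserted |= mask
            self.value |= mask   # an active input latches its status bit
        else:
            self.asserted &= ~mask

    def write(self, data):
        """Writing 1 clears a bit; an input that is still active latches it again."""
        self.value &= ~data
        self.value |= self.asserted

    def read(self):
        return self.value


MCP0_BIT = 0x1  # hypothetical bit position, for illustration only

reg = W1CStatusRegister()
reg.set_input(MCP0_BIT, True)    # watchdog pulse asserts MCP0
reg.write(MCP0_BIT)              # clear attempt while the pulse is still active
still_set = reg.read()           # the bit reads back set
reg.set_input(MCP0_BIT, False)   # the pulse ends
reg.write(MCP0_BIT)              # clear attempt after the pulse
cleared = reg.read()             # the bit reads back clear
```

If the real hardware behaved like this model, a clear would only stick
once the MCP0 input had been released, so the actual pulse length would
be worth checking on a scope.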

Where else do I need to reset the status - I think I am doing it right...
but it isn't clearing the exception - and it 'dies' the next time through
this (why is another problem - but first, I'd like to know why the
condition is NOT being cleared...at all)...

One possibility is that, because the external MCP0 condition is
actually a pulse, another machine check could be clocked in on the
'falling' edge - but this pulse is long gone before I even come close
to attempting to clear the mcp0 exception in mcpsumr.

FWIW, I also have the same signal pulsing to my UDE - and at least after
resetting the UDE condition, it 'looks' reset - before immediately getting
another UDE interrupt (potentially for the falling edge)...

I am very confused here...

Thanks for any advice...

Tom


RE: 8572E - machine check pin (MCP0)

2008-11-25 Thread Morrison, Tom
I wrote:

>> I have an external watchdog timer that is going off - and pulsing into
>> the MCP0 of the 8572E. I get the printk indicating that the MCP0 went
>> off - the problem is - how do I clear the condition that caused this?
>> My hardware engineer swears that the pulse is ONLY 250ms long (and I
>> have a delay of 250ms after my printk), so I am pretty sure the pulse
>> is over before I reset the status registers.
>>
>> I am resetting the condition in mcpsumr (also, extra: the rstsr),
>> but after writing mcpsumr - and reading back - it still has
>> the mcp0 bit set?





Trent wrote:


>SRESET# also sets MCP0 and MCP1, maybe that is on?
>I'd also check the EMCP bit in SPRN_HID0 (on core 0 for MCP0).

I think SRESET is a separate signal - and even if it was ON
(which it shouldn't be) - it should show up in the MCPSUMR 
Register (and I am clearing that condition)...

I am getting the first machine check (with an indication that 
the MCP0 is pulled) - I don't think you can get a Machine check without 
SPRN_HID0's EMCP being set? 

The only thing that I can think of is that I have two edges, and
after returning from the machine check (first time) the ME bit is
NOT enabled, so when the falling edge of that pulse occurs, it 
causes another machine check - which, because the ME bit is NOT set,
causes a checkstop - and it goes away...

That explains why it hangs for the second machine check - although
it still 'starts' into the machine check handler before dying 
very early in its execution (before it does a full dump of registers)...
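
To make the two-edge theory above concrete, here is a small sketch of
the sequence. It is a simplified model and not a description of the
real core: it assumes that a machine check taken with ME set enters the
handler and clears ME, that a machine check arriving with ME clear
causes a checkstop, and that whether ME is set again on return is the
open question (the handler_restores_me flag). The function and event
names are made up for this example.

```python
def run_events(events, handler_restores_me):
    """Walk a list of events and report what the modelled core does.

    events: sequence of "mcp" (machine check edge) or "return"
            (return from the machine check handler).
    handler_restores_me: True if ME is set again on return.
    """
    me = True
    log = []
    for event in events:
        if event == "mcp":
            if not me:
                log.append("checkstop")   # machine check with ME clear
                break
            me = False                    # ME is cleared on handler entry
            log.append("handler")
        elif event == "return":
            me = handler_restores_me
            log.append("returned")
    return log


# First edge, return from handler, then the second (falling) edge.
without_me = run_events(["mcp", "return", "mcp"], handler_restores_me=False)
with_me = run_events(["mcp", "return", "mcp"], handler_restores_me=True)
```

In this model the second edge ends in a checkstop only when ME is not
set again on return; with ME restored, the second edge simply enters
the handler a second time.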

very strange stuff here...

Tom



MPC8572 - IPR Register

2008-12-16 Thread Morrison, Tom
We are having a problem with an external interrupt not actually being
received / detected on the MPC8572. 

This external device 'believes' that it has sent an interrupt
(over PCIe) to the MPC8572, and we believe that we have correctly
unmasked/configured the associated ExVPR register.

But, still NO interrupt...

If you read the documentation about this configuration register, it
indicates that there is some type of "IPR" register internal to the
8572 that indicates if an interrupt has been received by the PIC...

We want to read that IPR register to verify that:
   a) the external device has sent the interrupt 
      and we have configured something wrong in the chip

   b) there is no pending interrupt (thus none received) from
      this external device...

Is there any way (hook (indirection) or crook (aka: secret register))
that would allow us to read this register? From all my investigations
it looks like there isn't a 'straight forward' / documented way to 
do so...I am hoping you guys have gone beyond the 'straight forward' 
means and have found a way...
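
As a possible starting point, here is a minimal sketch of decoding a
value read back from a vector/priority register. It assumes the
OpenPIC-style layout used by MPIC-type controllers (mask bit
0x80000000, activity bit 0x40000000, priority in bits 16-19, vector in
the low 16 bits); these masks and the field widths should be checked
against the MPC8572E reference manual before relying on them. The
function name and the sample value are made up for this example.

```python
# Bit masks assumed for an OpenPIC-style vector/priority register;
# confirm them against the MPC8572E reference manual.
VPR_MASK = 0x80000000      # MSK: 1 = interrupt source is masked
VPR_ACTIVITY = 0x40000000  # A: 1 = source is pending or in service
VPR_PRIORITY = 0x000F0000  # priority field
VPR_VECTOR = 0x0000FFFF    # vector field


def decode_vpr(value):
    """Split a 32-bit vector/priority register value into its fields."""
    return {
        "masked": bool(value & VPR_MASK),
        "active": bool(value & VPR_ACTIVITY),
        "priority": (value & VPR_PRIORITY) >> 16,
        "vector": value & VPR_VECTOR,
    }


# Sample value only: unmasked, activity bit set, priority 10, vector 0x12.
fields = decode_vpr(0x400A0012)
```

Under that assumption, the activity bit would be an indirect view of
the pending state: it does not separate "pending" from "in service",
but if it stays 0 after the device claims to have sent its interrupt,
that would point at case b).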

Thanks in Advance...

Sincerely...

Tom Morrison
Principal S/W Engineer 
Tmorrison (at) empirix (dot) com
www.empirix.com

