Re: [casper] SOLVED: ROACH 2's suddenly freezing left and right

2013-03-15 Thread G Jones
Hi Henno,
I'll send the model separately. Typically the crash occurs within a minute
or two, which corresponds to ~10-30 register/BRAM reads and writes.
Glenn
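
[Editor's sketch, not part of the original message.] One way to turn "a
minute or two, ~10-30 accesses" into a repeatable measurement is to count
reads until the client stops answering. The helper below is only an
illustration: `read_fn` is a hypothetical stand-in for whatever call performs
a single register or BRAM read (for example a method on the python katcp
client), and the function and parameter names are assumptions of this sketch.

```python
import time


def count_accesses_until_failure(read_fn, max_ops=1000, delay_s=0.0):
    """Call read_fn() repeatedly and report how many calls succeeded.

    read_fn is a hypothetical stand-in for one register or BRAM read.
    Returns (successful_ops, elapsed_seconds, error), where error is the
    exception that ended the run, or None if max_ops was reached.
    """
    start = time.time()
    done = 0
    error = None
    for _ in range(max_ops):
        try:
            read_fn()
        except Exception as exc:  # any client error ends the run
            error = exc
            break
        done += 1
        if delay_s:
            time.sleep(delay_s)
    return done, time.time() - start, error
```

Running the same loop against a known-good and a suspect boffile would give
two access counts that can be compared directly, rather than a rough time
span.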


On Fri, Mar 15, 2013 at 10:25 AM, Henno Kriel  wrote:

> Hi Glenn
>
> Is it possible to send me your model file?
>
> I have a fairly sizable design running with these changes, with many
> registers, shared BRAMs and snap blocks, and it runs without issues.
>
> You mentioned that the design crashes after a while - could you give me a
> more precise indication of the time span?
>
> Regards
> Henno


Re: [casper] SOLVED: ROACH 2's suddenly freezing left and right

2013-03-15 Thread G Jones
Hi Wes,
The problem shows up with both the latest pull from ska-sa and an earlier
one from a couple of months ago. The crashing bof crashes with both and the
working bof works with both. That's why I'm wondering if it's some
interaction with the ADC5G since I presume your designs are mostly with the
katADC?

Has anyone else compiled/run bofs using ADC5G with these latest changes?

Glenn


On Fri, Mar 15, 2013 at 10:19 AM, Wesley New  wrote:

> Hi Glenn,
>
> We are running many bof files with that change and are doing plenty of
> register and bram reads and writes and have not experienced any issues with
> these bus accesses. What version of TCPBorphServer are you running?
>
> Wes


Re: [casper] SOLVED: ROACH 2's suddenly freezing left and right

2013-03-15 Thread Henno Kriel
Hi Glenn

Is it possible to send me your model file?

I have a fairly sizable design running with these changes, with many
registers, shared BRAMs and snap blocks, and it runs without issues.

You mentioned that the design crashes after a while - could you give me a
more precise indication of the time span?

Regards
Henno

On Fri, Mar 15, 2013 at 3:28 PM, G Jones  wrote:

> Hi,
> It should have occurred to me sooner, but I checked through the commit
> logs for mlib_devel and remembered I had updated from ska-sa a couple of
> weeks ago to get the bugfix for the rcs block. In doing so, I had also
> pulled down this commit:
>
>
> https://github.com/ska-sa/mlib_devel/commit/bad95b18fe79146d288607e5fe3c0360c071c2ad
> "Simplified the EPB to OPB 32bit bus cycle and now supports legacy byte
> enable support for ROACH 1 modules on ROACH 2."
>
> which sounds suspicious since the problem seemed to be related to reading
> and writing BRAMs and software registers.
>
> Indeed, when I switched over to the commit right before that one and
> compiled the same test design, I ended up with a boffile that has not yet
> crashed (the bad bof would have certainly crashed by now).
>
> The design is simply two ADC5Gs connected to snapshot blocks. The ADCs
> are clocked at 2880 MHz, so the FPGA is running at 180 MHz.  I'm not sure
> if the problem is some interaction between the ADC5Gs and this commit, or
> the clock rate or what.
>
> Henno, can you double check the code in this commit and see if you can
> ascertain where the bug might be?
>
> Glenn


-- 
Henno Kriel

DSP Engineer
Digital Back End
meerKAT

SKA South Africa
Third Floor
The Park
Park Road (off Alexandra Road)
Pinelands
7405
Western Cape
South Africa

Latitude: -33.94329 (South); Longitude: 18.48945 (East).

(p) +27 (0)21 506 7300
(p) +27 (0)21 506 7365 (direct)
(f) +27 (0)21 506 7375
(m) +27 (0)84 504 5050


Re: [casper] SOLVED: ROACH 2's suddenly freezing left and right

2013-03-15 Thread Wesley New
Hi Glenn,

We are running many bof files with that change and are doing plenty of
register and bram reads and writes and have not experienced any issues with
these bus accesses. What version of TCPBorphServer are you running?

Wes


On Fri, Mar 15, 2013 at 3:28 PM, G Jones  wrote:

> Hi,
> It should have occurred to me sooner, but I checked through the commit
> logs for mlib_devel and remembered I had updated from ska-sa a couple of
> weeks ago to get the bugfix for the rcs block. In doing so, I had also
> pulled down this commit:
>
>
> https://github.com/ska-sa/mlib_devel/commit/bad95b18fe79146d288607e5fe3c0360c071c2ad
> "Simplified the EPB to OPB 32bit bus cycle and now supports legacy byte
> enable support for ROACH 1 modules on ROACH 2."
>
> which sounds suspicious since the problem seemed to be related to reading
> and writing BRAMs and software registers.
>
> Indeed, when I switched over to the commit right before that one and
> compiled the same test design, I ended up with a boffile that has not yet
> crashed (the bad bof would have certainly crashed by now).
>
> The design is simply two ADC5Gs connected to snapshot blocks. The ADCs
> are clocked at 2880 MHz, so the FPGA is running at 180 MHz.  I'm not sure
> if the problem is some interaction between the ADC5Gs and this commit, or
> the clock rate or what.
>
> Henno, can you double check the code in this commit and see if you can
> ascertain where the bug might be?
>
> Glenn


[casper] SOLVED: ROACH 2's suddenly freezing left and right

2013-03-15 Thread G Jones
Hi,
It should have occurred to me sooner, but I checked through the commit logs
for mlib_devel and remembered I had updated from ska-sa a couple of weeks
ago to get the bugfix for the rcs block. In doing so, I had also pulled
down this commit:

https://github.com/ska-sa/mlib_devel/commit/bad95b18fe79146d288607e5fe3c0360c071c2ad
"Simplified the EPB to OPB 32bit bus cycle and now supports legacy byte
enable support for ROACH 1 modules on ROACH 2."

which sounds suspicious since the problem seemed to be related to reading
and writing BRAMs and software registers.

Indeed, when I switched over to the commit right before that one and
compiled the same test design, I ended up with a boffile that has not yet
crashed (the bad bof would have certainly crashed by now).
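
[Editor's sketch, not part of the original message.] The check described
above compares the suspect commit with the one just before it. When the
suspect is not known in advance, the same idea extends to a binary search
over the commit history, which is what `git bisect` automates. The function
below is a minimal illustration under stated assumptions: `commits` is
ordered oldest to newest, the history goes from good to bad exactly once,
and `is_bad` is a hypothetical callback that would compile the test design
at a commit and report whether the resulting boffile crashes.

```python
def first_bad_commit(commits, is_bad):
    """Return the earliest commit for which is_bad(commit) is True.

    commits must be ordered oldest to newest, and the history is assumed
    to switch from good to bad exactly once. is_bad is a hypothetical
    callback (build the design, run it, report a crash).
    Returns None if no commit is bad.
    """
    lo, hi = 0, len(commits)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo] if lo < len(commits) else None
```

Each call to `is_bad` costs a full compile and a test run here, so halving
the search range at every step matters more than it would for a software-only
project.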

The design is simply two ADC5Gs connected to snapshot blocks. The ADCs
are clocked at 2880 MHz, so the FPGA is running at 180 MHz.  I'm not sure
if the problem is some interaction between the ADC5Gs and this commit, or
the clock rate or what.
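
[Editor's sketch, not part of the original message.] The two clock figures
quoted above imply a fixed ratio between the ADC clock and the FPGA clock.
The short calculation below only restates that arithmetic; the variable
names are assumptions of this sketch, and nothing here says why the ratio
has that value.

```python
adc_clock_mhz = 2880.0   # ADC clock, as stated in the message
fpga_clock_mhz = 180.0   # FPGA clock, as stated in the message

# Ratio between the two clocks implied by the quoted figures.
ratio = adc_clock_mhz / fpga_clock_mhz

# Length of one FPGA clock cycle in nanoseconds (1000 ns per microsecond).
fpga_period_ns = 1000.0 / fpga_clock_mhz

print(ratio)                     # 16.0
print(round(fpga_period_ns, 3))  # 5.556
```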

Henno, can you double check the code in this commit and see if you can
ascertain where the bug might be?

Glenn

On Thu, Mar 14, 2013 at 12:00 PM, G Jones  wrote:

> Hi,
> For some unknown reason, boffiles I generate with my toolflow cause
> ROACH 2's to freeze up after a few minutes (I think related to I/O to
> software registers and shared BRAMs rather than any specific amount of
> time). I don't know of any changes I made to my toolflow since the
> last time I compiled working boffiles. Previously working boffiles
> still work, but recompiled designs do not work. The symptom is that
> the python katcp client stops responding. SSHing to the ROACH and
> running ps shows that tcpborphserver3 is no longer running. It finally
> occurred to me to check dmesg, and on all crashed ROACHs, I see this
> in the dmesg output:
>
> ...
> About to toggle cpu_rdy pin<7>r2case_event(): Got type 11, code 8, value 1
> attempting led toggle
> About to toggle cpu_rdy pin<7>r2case_event(): Got type 11, code 8, value 0
> attempting led toggle
> About to toggle cpu_rdy pinMachine check in kernel mode.
> Data Read PLB Error
> Oops: Machine check, sig: 7 [#1]
> PowerPC 44x Platform
> Modules linked in:
> NIP: 0fea4048 LR: 0fea3f88 CTR: 0004
> REGS: ef00bf10 TRAP: 0214   Not tainted  (3.7.0-rc2+)
> MSR: 0002d000   CR: 2224  XER: 
> TASK = efb54060[516] 'tcpborphserver3' THREAD: ef00a000
> GPR00:  bfcb7290 48031e20 10628bf9 4802c010 0004 0018 7f7f7f7f
> GPR08:  10628bf0 10628ba0 0fea3f80 2222 1006ba18
> GPR16:
> GPR24:    0004 10628bf9 10628bf9 0ff91ff4 4802c011
> NIP [0fea4048] 0xfea4048
> LR [0fea3f88] 0xfea3f88
> Call Trace:
> ---[ end trace 59d28c137ef7dde2 ]---
>
> roach VMA close
> roach release mem called
>
> -
>
> If I then try to reboot the ROACH with shutdown -r now, it hard-freezes
> and requires a power cycle to get it running again.
>
> Any ideas where to look for this problem?
>
> Thanks,
> Glenn
>
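
[Editor's sketch, not part of the original message.] When several boards
have to be checked for the same oops, a small script can scan saved dmesg
output for the machine-check marker and report which task was running. The
pattern below is based only on the trace quoted in this message, so other
kernels may format the TASK line differently; the function and variable
names are assumptions of this sketch.

```python
import re

# Matches the task line of the oops, for example:
# TASK = efb54060[516] 'tcpborphserver3' THREAD: ef00a000
TASK_RE = re.compile(r"TASK = [0-9a-f]+\[(\d+)\] '([^']+)'")


def find_machine_checks(dmesg_text):
    """Return a list of (pid, task_name), one per machine-check oops found.

    Leading quote markers such as '> ' are stripped first, so text pasted
    from an email works as well as raw dmesg output.
    """
    hits = []
    in_oops = False
    for raw in dmesg_text.splitlines():
        line = raw.lstrip("> ").strip()
        if "Oops: Machine check" in line:
            in_oops = True   # the TASK line follows a few lines later
            continue
        if in_oops:
            match = TASK_RE.search(line)
            if match:
                hits.append((int(match.group(1)), match.group(2)))
                in_oops = False
    return hits
```

Applied to the trace above, this would report pid 516 and the task name
tcpborphserver3, matching the observation that the server process is gone
after the crash.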