Re: [casper] ROACH2 dies on fpga.read(...)

2013-07-17 Thread Ryan Monroe

Hey Glenn, all,

Just to follow up on this: I reverted to the older version, as
indicated by Glenn here, and that solved my problem.  Thanks, Glenn!


--Ryan Monroe
626.773.0805

On 07/12/2013 07:26 AM, G Jones wrote:

Below is a message I wrote with more about the problems we had at
NRAO, which did not make it to the list. By the way, others at NRAO
are using a recent version of the repository and have had better luck,
but based on your experience I wonder if there is still some subtle
issue with marginal signals or timing on some boards.

Glenn

Previous message:

The problem was caused by some errors that crept into the ska-sa
repository. I had to revert to a commit BEFORE this one:
https://github.com/ska-sa/mlib_devel/commit/bad95b18fe79146d288607e5fe3c0360c071c2ad
  (easy to remember since the hash starts with 'bad' :)
Something about the EPB to OPB optimization they did messes things
up. In theory they reverted these changes, but the problem was still
present the last time I looked. And this is of course the least fun
kind of problem to keep re-checking for...
Note that I had other issues with ROACH1s with this commit too.


On Thu, Jul 11, 2013 at 8:23 PM, Ryan Monroe wrote:

Thanks!  Sounds good

Also: I take back the deterministic part.  The other roach started having
the problem too, and it might have something to do with read lengths.  More
to follow (eventually)

--Ryan Monroe
626.773.0805


On 07/11/2013 05:20 PM, John Ford wrote:

Hi Ryan.  We had this problem, which appeared to be a "lockup".  I think
that Glenn and some others corresponded about it, and it was due to trying
to read/write bytes instead of words over the opb bus with a buggy kernel
or a buggy library.

You might search through the mailing list for Glenn's name in about
November of last year.

John



Hey all,

I'm trying to test out a new bit file (it uses the "pcore" feature and
has 4 black boxes under the hood for what it's worth). *On one ROACH2 it
works just fine* (in the context of this problem).

On the other one, for ~1/5 of the registers, upon reading that register
the ROACH2 stops responding to all katcp commands.  From dmesg, it looks
like tcpborphserver is crashing.  It appears that the registers which
kill it are deterministic across programmings. It also looks like the
registers which fail are all shared_brams, but there is nothing
exceptional about the ones which fail, imho

Attached are the results of a python script on the two roaches, and a
dmesg output of the failed board.  In addition, pictures of the
configuration for both roaches.

Anyone seen this before?

--
--Ryan Monroe
626.773.0805









Re: [casper] ROACH2 dies on fpga.read(...)

2013-07-12 Thread G Jones
Below is a message I wrote with more about the problems we had at
NRAO, which did not make it to the list. By the way, others at NRAO
are using a recent version of the repository and have had better luck,
but based on your experience I wonder if there is still some subtle
issue with marginal signals or timing on some boards.

Glenn

Previous message:

The problem was caused by some errors that crept into the ska-sa
repository. I had to revert to a commit BEFORE this one:
https://github.com/ska-sa/mlib_devel/commit/bad95b18fe79146d288607e5fe3c0360c071c2ad
 (easy to remember since the hash starts with 'bad' :)
Something about the EPB to OPB optimization they did messes things
up. In theory they reverted these changes, but the problem was still
present the last time I looked. And this is of course the least fun
kind of problem to keep re-checking for...
Note that I had other issues with ROACH1s with this commit too.


On Thu, Jul 11, 2013 at 8:23 PM, Ryan Monroe wrote:
> Thanks!  Sounds good
>
> Also: I take back the deterministic part.  The other roach started having
> the problem too, and it might have something to do with read lengths.  More
> to follow (eventually)
>
> --Ryan Monroe
> 626.773.0805
>
>
> On 07/11/2013 05:20 PM, John Ford wrote:
>>
>> Hi Ryan.  We had this problem, which appeared to be a "lockup".  I think
>> that Glenn and some others corresponded about it, and it was due to trying
>> to read/write bytes instead of words over the opb bus with a buggy kernel
>> or a buggy library.
>>
>> You might search through the mailing list for Glenn's name in about
>> November of last year.
>>
>> John
>>
>>
>>> Hey all,
>>>
>>> I'm trying to test out a new bit file (it uses the "pcore" feature and
>>> has 4 black boxes under the hood for what it's worth). *On one ROACH2 it
>>> works just fine* (in the context of this problem).
>>>
>>> On the other one, for ~1/5 of the registers, upon reading that register
>>> the ROACH2 stops responding to all katcp commands.  From dmesg, it looks
>>> like tcpborphserver is crashing.  It appears that the registers which
>>> kill it are deterministic across programmings. It also looks like the
>>> registers which fail are all shared_brams, but there is nothing
>>> exceptional about the ones which fail, imho
>>>
>>> Attached are the results of a python script on the two roaches, and a
>>> dmesg output of the failed board.  In addition, pictures of the
>>> configuration for both roaches.
>>>
>>> Anyone seen this before?
>>>
>>> --
>>> --Ryan Monroe
>>> 626.773.0805
>>>
>>>
>>
>
>



Re: [casper] ROACH2 dies on fpga.read(...)

2013-07-11 Thread Ryan Monroe

Thanks!  Sounds good

Also: I take back the deterministic part.  The other roach started 
having the problem too, and it might have something to do with read 
lengths.  More to follow (eventually)


--Ryan Monroe
626.773.0805
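
One way to test the read-length suspicion mentioned above is to try a
series of read sizes against a single register and stop at the first
failure, since a hung board will fail every later read anyway. The
following is only a sketch: probe_read_lengths and read_fn are
hypothetical names, and read_fn is assumed to be any callable taking
(name, nbytes) and returning the data read, or raising on a timeout.

```python
def probe_read_lengths(read_fn, name, lengths):
    """Try each read length in order and stop at the first failure.

    read_fn(name, nbytes) is a hypothetical stand-in for the real read
    call; it should return the data read or raise an exception.
    Returns a list of (nbytes, ok, detail) tuples.
    """
    results = []
    for nbytes in lengths:
        try:
            data = read_fn(name, nbytes)
        except Exception as exc:  # e.g. a request timeout after a lockup
            results.append((nbytes, False, repr(exc)))
            break
        ok = len(data) == nbytes
        detail = "" if ok else "short read: %d bytes" % len(data)
        results.append((nbytes, ok, detail))
        if not ok:
            break
    return results
```

If the client object exposes a read(name, size) method, it could be
passed in directly, e.g. probe_read_lengths(fpga.read, 'my_bram',
[4, 8, 16, 1024]), where 'my_bram' is a placeholder register name.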

On 07/11/2013 05:20 PM, John Ford wrote:

Hi Ryan.  We had this problem, which appeared to be a "lockup".  I think
that Glenn and some others corresponded about it, and it was due to trying
to read/write bytes instead of words over the opb bus with a buggy kernel
or a buggy library.

You might search through the mailing list for Glenn's name in about
November of last year.

John



Hey all,

I'm trying to test out a new bit file (it uses the "pcore" feature and
has 4 black boxes under the hood for what it's worth). *On one ROACH2 it
works just fine* (in the context of this problem).

On the other one, for ~1/5 of the registers, upon reading that register
the ROACH2 stops responding to all katcp commands.  From dmesg, it looks
like tcpborphserver is crashing.  It appears that the registers which
kill it are deterministic across programmings. It also looks like the
registers which fail are all shared_brams, but there is nothing
exceptional about the ones which fail, imho

Attached are the results of a python script on the two roaches, and a
dmesg output of the failed board.  In addition, pictures of the
configuration for both roaches.

Anyone seen this before?

--
--Ryan Monroe
626.773.0805









Re: [casper] ROACH2 dies on fpga.read(...)

2013-07-11 Thread John Ford
Hi Ryan.  We had this problem, which appeared to be a "lockup".  I think
that Glenn and some others corresponded about it, and it was due to trying
to read/write bytes instead of words over the opb bus with a buggy kernel
or a buggy library.

You might search through the mailing list for Glenn's name in about
November of last year.

John
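
If byte-granular access over the OPB bus is the trigger, one possible
client-side precaution is to issue only whole-word reads and trim the
result afterwards. The sketch below assumes 4-byte words; the helper
names and read_fn are hypothetical, with read_fn standing in for a
call shaped like read(name, size, offset). Nothing in this thread
confirms that this avoids the lockup.

```python
WORD_BYTES = 4  # assumption: the bus transfers 32-bit words


def word_aligned_span(offset, size, word_bytes=WORD_BYTES):
    """Return (aligned_offset, aligned_size, skip) covering the request.

    The aligned span starts and ends on word boundaries; skip is the
    number of leading bytes to discard to get back to `offset`.
    """
    if offset < 0 or size < 0:
        raise ValueError("offset and size must be non-negative")
    aligned_offset = offset - (offset % word_bytes)
    end = offset + size
    aligned_end = -(-end // word_bytes) * word_bytes  # round up to a word
    return aligned_offset, aligned_end - aligned_offset, offset - aligned_offset


def read_bytes_via_words(read_fn, name, size, offset=0):
    """Read `size` bytes at `offset` using only whole-word transfers.

    read_fn(name, size, offset) is a hypothetical stand-in for the real
    read call and must return a sliceable bytes-like object.
    """
    aligned_offset, aligned_size, skip = word_aligned_span(offset, size)
    data = read_fn(name, aligned_size, aligned_offset)
    return data[skip:skip + size]
```

For example, a request for 6 bytes at offset 5 becomes a single
8-byte read at offset 4, and the extra bytes are dropped from the
result.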


> Hey all,
>
> I'm trying to test out a new bit file (it uses the "pcore" feature and
> has 4 black boxes under the hood for what it's worth). *On one ROACH2 it
> works just fine* (in the context of this problem).
>
> On the other one, for ~1/5 of the registers, upon reading that register
> the ROACH2 stops responding to all katcp commands.  From dmesg, it looks
> like tcpborphserver is crashing.  It appears that the registers which
> kill it are deterministic across programmings. It also looks like the
> registers which fail are all shared_brams, but there is nothing
> exceptional about the ones which fail, imho
>
> Attached are the results of a python script on the two roaches, and a
> dmesg output of the failed board.  In addition, pictures of the
> configuration for both roaches.
>
> Anyone seen this before?
>
> --
> --Ryan Monroe
> 626.773.0805
>
>