Re: [casper] ROACH2 dies on fpga.read(...)
Hey Glenn, all,

Just to follow up on this: I reverted to the old version, as Glenn suggested in the message quoted below. This solved my problem. Thanks, Glenn!

--Ryan Monroe
626.773.0805

On 07/12/2013 07:26 AM, G Jones wrote:
> Below is a message I wrote with more about the problems we had at NRAO,
> which did not make it to the list. By the way, others at NRAO are using a
> recent version of the repository and have had better luck, but based on
> your experience I wonder if there is still some subtle issue with marginal
> signals or timing on some boards.
>
> Glenn
>
> Previous message:
>
> The problem was because of some errors that crept into the ska-sa
> repository. I had to revert to a commit BEFORE this one
> https://github.com/ska-sa/mlib_devel/commit/bad95b18fe79146d288607e5fe3c0360c071c2ad
> (easy to remember since the hash starts with 'bad' :)
> Something about this EPB to OPB optimization they did messes things up. In
> theory they reverted these changes, but I found it still was present last
> time I looked. And this is of course the least fun kind of problem to keep
> checking if it's still there...
> Note I had other issues with ROACH1s with this commit too.
Re: [casper] ROACH2 dies on fpga.read(...)
Below is a message I wrote with more about the problems we had at NRAO, which did not make it to the list. By the way, others at NRAO are using a recent version of the repository and have had better luck, but based on your experience I wonder if there is still some subtle issue with marginal signals or timing on some boards.

Glenn

Previous message:

The problem was caused by some errors that crept into the ska-sa repository. I had to revert to a commit BEFORE this one:
https://github.com/ska-sa/mlib_devel/commit/bad95b18fe79146d288607e5fe3c0360c071c2ad
(easy to remember since the hash starts with 'bad' :)

Something about this EPB to OPB optimization they did messes things up. In theory they reverted these changes, but I found the problem was still present last time I looked. And this is of course the least fun kind of problem to keep checking if it's still there...

Note I had other issues with ROACH1s with this commit too.

On Thu, Jul 11, 2013 at 8:23 PM, Ryan Monroe wrote:
> Thanks! Sounds good
>
> Also: I take back the deterministic part. The other roach started having
> the problem too, and it might have something to do with read lengths. More
> to follow (eventually)
>
> --Ryan Monroe
> 626.773.0805
Re: [casper] ROACH2 dies on fpga.read(...)
Thanks! Sounds good.

Also: I take back the deterministic part. The other ROACH started having the problem too, and it might have something to do with read lengths. More to follow (eventually).

--Ryan Monroe
626.773.0805

On 07/11/2013 05:20 PM, John Ford wrote:
> Hi Ryan. We had this problem, which appeared to be a "lockup". I think
> that Glenn and some others corresponded about it, and it was due to trying
> to read/write bytes instead of words over the opb bus with a buggy kernel
> or a buggy library.
>
> You might search through the mailing list for Glenn's name in about
> November of last year.
>
> John
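
One way to chase the read-length suspicion mentioned above is to sweep a few read sizes per register and log each attempt before issuing it, so that if the board stops responding, the last logged line names the read that triggered it. The following is only a sketch under assumptions: `probe_reads` is a hypothetical helper (not part of corr or any CASPER library), it assumes the client object exposes `read(name, size)` in the way the thread's `fpga.read(...)` suggests, and it assumes a failed read surfaces as a Python exception rather than an indefinite hang.

```python
def probe_reads(fpga, names, sizes, log=print):
    """Try every (register, size) pair, logging BEFORE each read.

    `fpga` is any object with a read(name, size) method (assumed interface).
    Returns a list of (name, size, ok) tuples in the order attempted.
    """
    results = []
    for name in names:
        for size in sizes:
            # Log first: if the board locks up on this read, this is the
            # last line that will appear.
            log("reading %s, %d bytes" % (name, size))
            try:
                fpga.read(name, size)
                results.append((name, size, True))
            except Exception as exc:  # e.g. a client-side timeout
                log("  failed: %r" % (exc,))
                results.append((name, size, False))
    return results
```

Comparing the resulting table between the two boards, and across reprogrammings, would show whether failures track the register, the read size, or neither.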
Re: [casper] ROACH2 dies on fpga.read(...)
Hi Ryan.

We had this problem, which appeared to be a "lockup". I think that Glenn and some others corresponded about it, and it was due to trying to read/write bytes instead of words over the OPB bus with a buggy kernel or a buggy library.

You might search through the mailing list for Glenn's name in about November of last year.

John

> Hey all,
>
> I'm trying to test out a new bit file (it uses the "pcore" feature and
> has 4 black boxes under the hood for what it's worth). *On one ROACH2 it
> works just fine* (in the context of this problem).
>
> On the other one, for ~1/5 of the registers, upon reading that register
> the ROACH2 stops responding to all katcp commands. From dmesg, it looks
> like tcpborphserver is crashing. It appears that the registers which
> kill it are deterministic across programmings. It also looks like the
> registers which fail are all shared_brams, but there is nothing
> exceptional about the ones which fail, imho
>
> Attached are the results of a python script on the two roaches, and a
> dmesg output of the failed board. In addition, pictures of the
> configuration for both roaches.
>
> Anyone seen this before?
>
> --
> --Ryan Monroe
> 626.773.0805
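
If byte-granular accesses over the OPB bus are the trigger, as John describes, one defensive workaround on the client side is to issue only whole-word reads and trim the result afterwards. The sketch below illustrates that idea and nothing more: the 4-byte word size is an assumption, `word_aligned_span` and `read_word_aligned` are hypothetical helper names, and the `read(name, size, offset)` argument order is assumed rather than confirmed by the thread.

```python
WORD_BYTES = 4  # assumption: 32-bit bus words


def word_aligned_span(offset, size, word_bytes=WORD_BYTES):
    """Return (aligned_offset, aligned_size) covering [offset, offset + size)."""
    if offset < 0 or size < 0:
        raise ValueError("offset and size must be non-negative")
    start = (offset // word_bytes) * word_bytes            # round down
    end = -(-(offset + size) // word_bytes) * word_bytes   # round up
    return start, end - start


def read_word_aligned(fpga, name, size, offset=0):
    """Read `size` bytes at `offset` using only whole-word transfers.

    `fpga` is any object with read(name, size, offset) (assumed interface).
    """
    start, span = word_aligned_span(offset, size)
    data = fpga.read(name, span, start)
    skip = offset - start
    return data[skip:skip + size]
```

For example, a 2-byte read at offset 3 becomes an 8-byte read at offset 0, from which bytes 3 and 4 are returned. This does not fix a buggy kernel or library; it only avoids asking for sub-word transfers.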