Hi, Richard,

On Oct 27, 2014, at 9:25 AM, Richard Black wrote:

> This is a reportedly fully-functional model that shouldn't require any major
> changes in order to operate. However, this has clearly not been the case in
> at least two independent situations (us and Peter). This begs the question:
> what's so different about our use of PAPER?

I just verified that the roach2_fengine_2013_Oct_14_1756.bof.gz file is the one being used by the PAPER correlator currently fielded in South Africa. It is definitely a fully functional model. That image (and all source files for it) is available from the git repo listed on the PAPER Correlator Manifest page of the CASPER Wiki:

https://casper.berkeley.edu/wiki/PAPER_Correlator_Manifest

> We, at BYU, have made painstakingly sure that our IP addressing schemes,
> switch ports, and scripts are all configured correctly (thanks to David
> MacMahon for that, btw), but we still have hit the proverbial brick wall of
> 10-GbE overflow. When I last corresponded with David, he explained that he
> remembers having a similar issue before, but can't recall exactly what the
> problem was.

Really? I recall saying that I often forget about increasing the MTU of the 10 GbE switch and NICs. I don't recall saying that I had a similar issue before but couldn't remember the problem.

> In any case, the fact that by turning down the ADC clock prior to start-up
> prevents the 10-GbE core from overflowing is a major lead for us at BYU
> (we've been spinning our wheels on this issue for several months now). By no
> means are we proposing mid-run ADC clock modifications, but this appears to
> be a very subtle (and quite sinister, in my opinion) bug.
>
> Any thoughts as to what might be going on?

I cannot explain the 10 GbE overflow that you and Peter are experiencing. I have pushed some updates to the rb-papergpu.git repository listed on the PAPER Correlator Manifest page.
The paper_feng_init.rb script now verifies that the ADC clocks are locked and provides options for issuing a software sync (only recommended for lab use) and for not storing the time of synchronization in redis (also only recommended for lab use).

The 10 GbE cores can overflow if they are fed valid data (i.e. tx_valid=1) while they are held in reset. Since you are using the paper_feng_init.rb script, this should not be happening (unless something has gone wrong during the running of that script) because that script specifically and explicitly disables the tx_valid signal before putting the cores into reset, and it takes the cores out of reset before enabling the tx_valid signal.

So assuming that this is not the cause of the overflows, there must be something else that is causing the 10 GbE cores to be unable to transmit data fast enough to keep up with the data stream they are being fed. Two things that could cause this are 1) running the design faster than the 200 MHz sample clock that it was built for and/or 2) some link issue that prevents the core from sending data. Unfortunately, I think both of those ideas are also pretty far-fetched given all you've done to try to get the system working.

I wonder whether there is some difference in the ROACH2 firmware (u-boot version or CPLD programming) or PPC Linux setup or tcpborphserver revision or ???.

Have you tried using adc16_dump_chans.rb to dump snapshots of the ADC data to make sure that it looks OK?

Dave
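
P.S. For reference, the reset ordering described above (tx_valid off, cores into reset, cores out of reset, tx_valid back on) can be sketched as a toy model. This is only an illustration in Python, not the paper_feng_init.rb code: the MockFEngine class, its methods, and the overflow rule it applies are all hypothetical stand-ins that encode nothing more than "valid data while held in reset causes overflow".

```python
class MockFEngine:
    """Hypothetical stand-in for an F-engine control interface (not a real API)."""

    def __init__(self):
        self.tx_valid = True     # data-valid strobe feeding the 10 GbE cores
        self.in_reset = False    # whether the 10 GbE cores are held in reset
        self.overflowed = False  # latched once the modelled overflow occurs
        self.log = []            # order in which the controls were changed

    def _check(self):
        # Assumed rule: valid data arriving while in reset latches an overflow.
        if self.tx_valid and self.in_reset:
            self.overflowed = True

    def set_tx_valid(self, enabled):
        self.tx_valid = enabled
        self.log.append(("tx_valid", enabled))
        self._check()

    def set_reset(self, asserted):
        self.in_reset = asserted
        self.log.append(("reset", asserted))
        self._check()


def safe_reset(fe):
    """Reset the cores using the ordering described in the message."""
    fe.set_tx_valid(False)  # 1) stop feeding valid data
    fe.set_reset(True)      # 2) put the cores into reset
    fe.set_reset(False)     # 3) take the cores out of reset
    fe.set_tx_valid(True)   # 4) only now resume feeding data


def unsafe_reset(fe):
    """Reset the cores while tx_valid is still asserted."""
    fe.set_reset(True)
    fe.set_reset(False)


good = MockFEngine()
safe_reset(good)

bad = MockFEngine()
unsafe_reset(bad)

print("safe ordering overflowed:  ", good.overflowed)
print("unsafe ordering overflowed:", bad.overflowed)
```

In this model only the second sequence latches an overflow, which is why a script that follows the first ordering should not be the source of the problem.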