Hi, Richard,

On Oct 27, 2014, at 9:25 AM, Richard Black wrote:

> This is a reportedly fully-functional model that shouldn't require any major 
> changes in order to operate. However, this has clearly not been the case in 
> at least two independent situations (us and Peter). This begs the question: 
> what's so different about our use of PAPER?

I just verified that the roach2_fengine_2013_Oct_14_1756.bof.gz file is the one 
being used by the PAPER correlator currently fielded in South Africa.  It is 
definitely a fully functional model.  That image (and all source files for it) 
is available from the git repo listed on the PAPER Correlator Manifest page of 
the CASPER Wiki:

https://casper.berkeley.edu/wiki/PAPER_Correlator_Manifest

> We, at BYU, have made painstakingly sure that our IP addressing schemes, 
> switch ports, and scripts are all configured correctly (thanks to David 
> MacMahon for that, btw), but we still have hit the proverbial brick wall of 
> 10-GbE overflow.  When I last corresponded with David, he explained that he 
> remembers having a similar issue before, but can't recall exactly what the 
> problem was.

Really?  I recall saying that I often forget about increasing the MTU of the 10 
GbE switch and NICs.  I don't recall saying that I had a similar issue before 
but couldn't remember the problem.
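
As a rough sketch of the NIC side of that MTU check, assuming the receiving hosts run Linux and expose the interface MTU under /sys/class/net: the helper names below are made up for illustration, and the required MTU is left as a parameter because the value the design needs is not stated in this message. The switch MTU has to be checked separately through the switch's own management interface.

```python
from pathlib import Path


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Return the MTU that Linux reports for the given network interface."""
    # Linux exposes the current MTU as a decimal string in <root>/<iface>/mtu.
    return int(Path(sysfs_root, iface, "mtu").read_text().strip())


def mtu_is_sufficient(iface, required_mtu, sysfs_root="/sys/class/net"):
    """True if the interface MTU is at least required_mtu bytes.

    required_mtu is an assumption supplied by the caller; it must be at
    least as large as the biggest packet the F engine sends.
    """
    return read_mtu(iface, sysfs_root) >= required_mtu
```

For example, mtu_is_sufficient("eth2", 9000) would check a hypothetical interface named eth2 against 9000 bytes, a common jumbo-frame setting used here only as a placeholder.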

> In any case, the fact that by turning down the ADC clock prior to start-up 
> prevents the 10-GbE core from overflowing is a major lead for us at BYU 
> (we've been spinning our wheels on this issue for several months now). By no 
> means are we proposing mid-run ADC clock modifications, but this appears to 
> be a very subtle (and quite sinister, in my opinion) bug.
> 
> Any thoughts as to what might be going on?

I cannot explain the 10 GbE overflow that you and Peter are experiencing.  I 
have pushed some updates to the rb-papergpu.git repository listed on the PAPER 
Correlator Manifest page.  The paper_feng_init.rb script now verifies that the 
ADC clocks are locked and provides options for issuing a software sync (only 
recommended for lab use) and for not storing the time of synchronization in 
redis (also only recommended for lab use).

The 10 GbE cores can overflow if they are fed valid data (i.e. tx_valid=1) 
while they are held in reset.  Since you are using the paper_feng_init.rb 
script, this should not be happening (unless something has gone wrong during 
the running of that script) because that script specifically and explicitly 
disables the tx_valid signal before putting the cores into reset and it takes 
the cores out of reset before enabling the tx_valid signal.  So assuming that 
this is not the cause of the overflows, there must be something else that is 
causing the 10 GbE cores to be unable to transmit data fast enough to keep up 
with the data stream they are being fed.  Two things that could cause this are 1) 
running the design faster than the 200 MHz sample clock that it was built for 
and/or 2) some link issue that prevents the core from sending data.  
Unfortunately, I think both of those ideas are also pretty far-fetched given 
all you've done to try to get the system working.  I wonder whether there is 
some difference in the ROACH2 firmware (u-boot version or CPLD programming) or 
PPC Linux setup or tcpborphserver revision or ???.
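
To make the reset/tx_valid ordering described above concrete, here is a toy model in Python. It is not the paper_feng_init.rb code and does not talk to a ROACH2; the class, its methods, and the two init functions are hypothetical names invented for this sketch. It also simplifies "can overflow" to "always overflows" so that the effect of the ordering is visible.

```python
class TenGbeCoreModel:
    """Toy stand-in for one 10 GbE core (not the real core or its registers)."""

    def __init__(self):
        self.in_reset = False
        self.tx_valid = False
        self.overflowed = False

    def _check(self):
        # Simplification: valid data arriving while the core is held in
        # reset is treated as a guaranteed overflow.
        if self.in_reset and self.tx_valid:
            self.overflowed = True

    def set_reset(self, value):
        self.in_reset = value
        self._check()

    def set_tx_valid(self, value):
        self.tx_valid = value
        self._check()


def init_in_safe_order(core):
    """Ordering described above: gate tx_valid around the reset pulse."""
    core.set_tx_valid(False)  # stop feeding valid data first
    core.set_reset(True)      # only then put the core into reset
    core.set_reset(False)     # take the core out of reset
    core.set_tx_valid(True)   # finally re-enable valid data


def init_in_unsafe_order(core):
    """Counter-example: reset is asserted while tx_valid may still be high."""
    core.set_reset(True)
    core.set_reset(False)
    core.set_tx_valid(True)
```

In this model a core that starts with tx_valid already asserted comes out of init_in_safe_order with its overflow flag clear, and out of init_in_unsafe_order with it set.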

Have you tried using adc16_dump_chans.rb to dump snapshots of the ADC data to 
make sure that it looks OK?

Dave

