David,

We'll take another close look at what model we are actually using, just to
be safe.

I went back and looked at our e-mails, and sure enough, you're right. You
were referring to the MTU issue as being the problem you tend to suppress
all memory of. It was just that you stated it in a separate paragraph, so,
out-of-context, I extrapolated that you have had the same problem before.
My bad for dragging your good name through the mud. :)

We will also update our local repositories, in the event some bizarre race
condition exists on our end.

I didn't know that the buffer could fill up while reset was asserted. We'll
definitely have to check up on that too.
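
To make the ordering David describes below concrete (disable tx_valid, assert reset, release reset, re-enable tx_valid), here is a minimal toy sketch. It is not the real 10 GbE core or the paper_feng_init.rb logic; the class `ToyTenGbeCore`, its buffer depth, and both helper functions are hypothetical names invented purely for illustration. The only behavior it models is the one stated in the thread: a core that is fed valid data while held in reset cannot drain its buffer and eventually overflows.

```python
from dataclasses import dataclass


@dataclass
class ToyTenGbeCore:
    """Toy stand-in for a 10 GbE TX core (hypothetical, not the CASPER block)."""
    fifo_depth: int = 4        # assumed buffer depth, in words
    tx_valid: bool = True      # upstream logic is presenting valid data
    in_reset: bool = False     # core is being held in reset
    fill: int = 0              # words currently buffered
    overflowed: bool = False   # sticky overflow flag

    def tick(self) -> None:
        """Advance one clock: accept a word if tx_valid, drain unless in reset."""
        if self.tx_valid:
            if self.fill >= self.fifo_depth:
                self.overflowed = True   # no room left for the incoming word
            else:
                self.fill += 1
        if not self.in_reset and self.fill > 0:
            self.fill -= 1               # one word leaves on the link


def reset_sequence(core: ToyTenGbeCore, hold_ticks: int = 16) -> None:
    """Reset in the order described: gate tx_valid, reset, release, ungate."""
    core.tx_valid = False
    core.in_reset = True
    for _ in range(hold_ticks):
        core.tick()
    core.in_reset = False
    core.tx_valid = True


def careless_reset(core: ToyTenGbeCore, hold_ticks: int = 16) -> None:
    """Same reset, but tx_valid is left asserted the whole time."""
    core.in_reset = True
    for _ in range(hold_ticks):
        core.tick()
    core.in_reset = False


good = ToyTenGbeCore()
reset_sequence(good)

bad = ToyTenGbeCore()
careless_reset(bad)

print("ordered reset overflowed: ", good.overflowed)
print("careless reset overflowed:", bad.overflowed)
```

In this toy model the ordered sequence never lets data into the buffer while the core is in reset, so the overflow flag stays clear, whereas the careless version fills the buffer within a few clocks. Whether the real core's flag is sticky across a reset is something we would still need to confirm on hardware.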

We haven't tried dumping raw ADC data yet since we have been trying to get
the data link working first. After that, we were planning to inject signal
and examine outputs.

Thanks,

Richard Black

On Mon, Oct 27, 2014 at 2:26 PM, David MacMahon <dav...@astro.berkeley.edu>
wrote:

> Hi, Richard,
>
> On Oct 27, 2014, at 9:25 AM, Richard Black wrote:
>
> > This is a reportedly fully-functional model that shouldn't require any
> major changes in order to operate. However, this has clearly not been the
> case in at least two independent situations (us and Peter). This begs the
> question: what's so different about our use of PAPER?
>
> I just verified that the roach2_fengine_2013_Oct_14_1756.bof.gz file is
> the one being used by the PAPER correlator currently fielded in South
> Africa.  It is definitely a fully functional model.  That image (and all
> source files for it) is available from the git repo listed on the PAPER
> Correlator Manifest page of the CASPER Wiki:
>
> https://casper.berkeley.edu/wiki/PAPER_Correlator_Manifest
>
> > We, at BYU, have made painstakingly sure that our IP addressing schemes,
> switch ports, and scripts are all configured correctly (thanks to David
> MacMahon for that, btw), but we still have hit the proverbial brick wall of
> 10-GbE overflow.  When I last corresponded with David, he explained that he
> remembers having a similar issue before, but can't recall exactly what the
> problem was.
>
> Really?  I recall saying that I often forget about increasing the MTU of
> the 10 GbE switch and NICs.  I don't recall saying that I had a similar
> issue before but couldn't remember the problem.
>
> > In any case, the fact that turning down the ADC clock prior to
> start-up prevents the 10-GbE core from overflowing is a major lead for us
> at BYU (we've been spinning our wheels on this issue for several months
> now). By no means are we proposing mid-run ADC clock modifications, but
> this appears to be a very subtle (and quite sinister, in my opinion) bug.
> >
> > Any thoughts as to what might be going on?
>
> I cannot explain the 10 GbE overflow that you and Peter are experiencing.
> I have pushed some updates to the rb-papergpu.git repository listed on the
> PAPER Correlator Manifest page.  The paper_feng_init.rb script now verifies
> that the ADC clocks are locked and provides options for issuing a software
> sync (only recommended for lab use) and for not storing the time of
> synchronization in redis (also only recommended for lab use).
>
> The 10 GbE cores can overflow if they are fed valid data (i.e. tx_valid=1)
> while they are held in reset.  Since you are using the paper_feng_init.rb
> script, this should not be happening (unless something has gone wrong
> during the running of that script) because that script specifically and
> explicitly disables the tx_valid signal before putting the cores into reset
> and it takes the cores out of reset before enabling the tx_valid signal.
> So assuming that this is not the cause of the overflows, there must be
> something else that is causing the 10 GbE cores to be unable to transmit
> data fast enough to keep up with the data stream they are being fed.  Two
> things that could cause this are 1) running the design faster than the 200
> MHz sample clock that it was built for and/or 2) some link issue that
> prevents the core from sending data.  Unfortunately, I think both of those
> ideas are also pretty far-fetched given all you've done to try to get the
> system working.  I wonder whether there is some difference in the ROACH2
> firmware (u-boot version or CPLD programming) or PPC Linux setup or
> tcpborphserver revision or ???.
>
> Have you tried using adc16_dump_chans.rb to dump snapshots of the ADC data
> to make sure that it looks OK?
>
> Dave
>
>
