Hi Jason,
Thanks for your pointers; I am currently not actually using the FPGA.
Just focusing on being able to talk to the powerpc reliably at the moment.
The system does also crash when using NFS, but as I said and you noted,
it is then more difficult to trace the crashes directly back to EMAC-related
kernel functions. It may very well be a secondary symptom of something else.
Now your suggested versions for Uboot/CPLD/Monitor are interesting.
We have two roach boards; the newer one that I have been testing is
reporting
U-Boot 2008.10-svn2157 (Jul 31 2009 - 17:15:22)
...
Monitor Revision: 7.3.0
CPLD Revision: 7.5.6
Whereas the older Roach that Wan has been using reports
U-Boot 2008.10-svn1923 (May 29 2009 - 17:22:43)
...
Monitor Revision: 6.5.1429
CPLD Revision: 2.0.5
Leaving this older one aside for reference for now, I have upgraded the U-boot
image on the newer roach to 20090807-uboot-nohack.bin, which is actually from
revision 2212, but seemed to be the closest to the suggested revision I could
find without compiling the image myself.
I looked around for how to upgrade the CPLD/Monitor, but without success.
Would you be able to point me in the right direction?
I'll test for any improvements with the new uboot now.
Thanks again
Kjetil
Jason Manley wrote:
Also, make sure you're running newer versions of uboot and the CPLD
image. Bus settings changed some months back and improved stability
significantly.
Uboot will report the versions, and I recommend:
U-Boot 2008.10-svn2226 (Aug 7 2009 - 16:06:44)
...
Monitor Revision: 8.3.1698
CPLD Revision: 8.1.0
at the very least, you should have CPLD Revision 8.0.1588.
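A quick way to check a board against this minimum is to parse the boot banner. The following is only a sketch: it assumes the banner lines look like the ones quoted in this thread and that revisions compare numerically, component by component. The function names are made up for illustration.

```python
import re

# Minimum CPLD revision quoted in the message above.
MIN_CPLD = (8, 0, 1588)


def parse_revision(banner, label):
    """Return the revision reported for `label` as a tuple of ints, or None."""
    match = re.search(rf"{re.escape(label)} Revision:\s*(\d+(?:\.\d+)*)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def cpld_is_recent_enough(banner, minimum=MIN_CPLD):
    """True if the banner reports a CPLD revision at or above `minimum`."""
    revision = parse_revision(banner, "CPLD")
    # Assumption: revisions order like tuples of integers (8.1.0 > 8.0.1588).
    return revision is not None and revision >= minimum


# Banner text as reported by the newer board earlier in this thread.
banner = """U-Boot 2008.10-svn2157 (Jul 31 2009 - 17:15:22)
Monitor Revision: 7.3.0
CPLD Revision: 7.5.6"""

print(cpld_is_recent_enough(banner))
```

For the banner above, the CPLD revision 7.5.6 falls below the 8.0.1588 minimum, so the check reports that an upgrade is needed.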
The only outstanding bug that regularly affects me is that u-boot
sometimes doesn't detect the PPC's SDRAM on startup. The system then
hangs. Replacing the DIMM with registered memory (same as FPGA DIMM)
apparently fixes this.
Jason
On 04 Nov 2009, at 07:56, Kjetil Wormnes wrote:
Hi all,
For reference I've attached a summary of our problems below, and a
few things I have attempted to do to isolate it. The short of it is
that we are unable to transfer large amounts of data across the
ethernet reliably regardless of:
--kernel version
--whether the root file system is mounted over USB or NFS
--network protocol used for transfer
The way the crash happens varies, and is not repeatable. Sometimes it
seems to be a userspace crash, sometimes it is a kernel panic. I
have been unable to see any real pattern in the crash reports. This
to me seems to indicate that the root cause of the problem may be
common, and either an obscure kernel problem or possibly something
in the interface between the kernel and the hardware or in the
hardware itself.
It wouldn't be a big effort to reimplement our software to run on a
remote machine and talk to the ROACH over KATCP, rather than run
locally on the ppc. But since it would require a complete rewrite of
the software, we haven't tested this yet. Perhaps it is worth trying.
The catch is that I am still really unsure whether we are dealing
with many symptoms of the same problem; or many different problems.
Anyway, I would like to thank you for all your input, and will let
you know if and how we find a satisfactory solution.
cheers
Kjetil
Here is the summary:
*The problem*
The system crashes when downloading large files. There appear to be
varying causes for this crash, which may or may not share a common
underlying reason.
I have attempted to isolate the problem by
• Downloading using different protocols and software; ssh and two
different ftp servers.
• Mounting the filesystem over NFS as opposed to USB
• Installing well-known and used kernels, and comparing to custom
kernels.
*SSH*
SSH always crashes with “Invalid MAC on input” or related error
messages. This appears to be a problem with SSH.
*FTP*
System instabilities were observed using two different ftp servers;
proftpd and pure-ftpd.
In the best case, with pure-ftpd, I was able to download 2-3 files
of about 2GB each before the system crashed. The call stack seemed
to indicate that the crash happened in the EMAC (i.e. ethernet)
interface functions.
However, we have no way of knowing whether these crashes are in fact
rather side-effects of the USB subsystem misbehaving. Jason from the
Casper mailing list has once again reconfirmed that USB on powerpcs
is "notoriously unreliable".
*DIFFERENT KERNELS - DIFFERENT PROBLEMS*
With some kernels (the latest) the link was unable to come up at
all. With both a custom-compiled older kernel (from a couple of
months ago) and a downloaded image, "uImage-20091006-mmcfix", the
link came up, but with all the crashes described.
*ELIMINATING USB AS A CAUSE*
To eliminate the effects of USB, I mounted the root filesystem
remotely using NFS. A few observations:
*SSH*
Still dies from time to time with the "Invalid MAC" error message.
This was expected as we have already pretty much determined that
this error is ssh-specific and not related to our other worries.
*ETHERNET*
Comes up nicely. System mounts remotely and file access has not
caused any obvious problems. In fact I have not really had any
problems that I can trace directly back to the Ethernet.
That being said, the system seems to crash after a little while
with this setup also. The error messages have been varying. Only
once has it been a kernel crash, and then, looking at the call stack
it no longer appears to crash inside EMAC access functions.
The download speeds seem quite variable, but since the operating
system itself is running over NFS, this is more likely due to the
network than to the ROACH board.
Jason Manley wrote:
Marc Welz or David George built that kernel. They are the best
people to ask about this. I've cc'd them, though I'm not sure
either would have the config file from that release. It might be
easiest to checkout an older svn version.
Might I suggest that instead of recording data to a USB HDD, that
you rather record it across the network to another computer? If
you don't want to use KATCP for dumping the data directly from
your FPGA, you can always mount an NFS network share on your ROACH
and record the data there. USB on the PPC platforms is
notoriously unreliable.
Jason
On 03 Nov 2009, at 03:05, Kjetil Wormnes wrote:
Hi Jason,
Thank you again for your reply. I can use FTP or even write my
own little raw socket transfer routine, and it seems to work, I
can transfer a few gigabyte-size files.
However, at the end of this, the other problem kicks in, causing
a system crash. I believe this is a kernel problem, as it
exhibits itself differently with different kernels I have tried.
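For reference, a transfer routine of the kind mentioned above could look roughly like the sketch below. It is not the code actually used on the ROACH: it assumes a plain TCP-style stream socket rather than a true raw socket, and the helper names and chunk size are invented for illustration. It adds a SHA-256 digest on both ends so that silent corruption can be told apart from a crash.

```python
import hashlib
import io
import socket
import threading

CHUNK = 64 * 1024  # bytes per read; an arbitrary choice for this sketch


def send_stream(sock, source, chunk=CHUNK):
    """Send everything from file-like `source`; return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    while True:
        block = source.read(chunk)
        if not block:
            break
        sock.sendall(block)
        digest.update(block)
    sock.shutdown(socket.SHUT_WR)  # tell the receiver there is no more data
    return digest.hexdigest()


def receive_stream(sock, sink, chunk=CHUNK):
    """Write everything received to file-like `sink`; return SHA-256 hex digest."""
    digest = hashlib.sha256()
    while True:
        block = sock.recv(chunk)
        if not block:  # empty read means the sender closed its side
            break
        sink.write(block)
        digest.update(block)
    return digest.hexdigest()


def loopback_demo(payload):
    """Push `payload` through a local socket pair and compare both digests."""
    left, right = socket.socketpair()
    sent = {}

    def sender():
        sent["digest"] = send_stream(left, io.BytesIO(payload))

    thread = threading.Thread(target=sender)
    thread.start()
    sink = io.BytesIO()
    received_digest = receive_stream(right, sink)
    thread.join()
    left.close()
    right.close()
    return sent["digest"] == received_digest and sink.getvalue() == payload
```

In a real setup the sender would run on the ROACH and the receiver on the remote machine, with the two digests compared after each file.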
So, putting the ssh problem aside as something that we can work
around and returning to the other request I made:
I am compiling my own kernel because I seem to need to in order
to get EHCI and EXT3 to work properly.
However, when I do, EMAC can't autonegotiate a link, and even
forcing it to something doesn't work. The link comes up, then
drops out again... repeatedly.
The interesting thing is this problem *does not* occur when I
compile my kernel using an svn checkout from a couple of months
ago. Even with the exact same .config file.
At least this is the case as far as I can tell.
Now, in order to be 100% sure that it is in fact a difference in
the source that is causing this problem, rather than just
the .config, I would love it if you could send me the .config
file used to compile the uImage-20091006-mmcfix kernel.
The ethernet interface does appear to be more stable with that
kernel, but unfortunately I can't use it as it doesn't allow USB
2.0 speeds, so if you please, the .config file would be very
useful.
Thanks again for all your help
Kjetil
Jason Manley wrote:
There appears to be some issue with ssh on ROACH with large
transfers. It is definitely not a hardware problem as other
network transfers work fine. Both Andrew Martens and myself
regularly transfer large amounts of data (>1GB) using KATCP.
This ssh bug has become a low priority for us as we concentrate
on other things. If you do not want to try'n debug it yourself,
I recommend you try an FTP server.
Kjetil, you are correct; at present, KATCP does not support
transfer of arbitrary files from the filesystem.
Jason
On 02 Nov 2009, at 00:51, Kjetil Wormnes wrote:
Hi Jason,
thank you for your reply. The SUN link was very descriptive.
Firstly, it appears the problem is still there with the kernel
build you suggested. After a few megabytes, the connection
closes telling me; "Corrupted MAC on input".
But interestingly it seems to have solved another problem that
I was having with one of our ROACH boards. It would be great
if you could send me the .config file for that build so I can
compare it with mine. I have a custom kernel as I like ext3
support and a few other bits and pieces, but have been having
some issues getting the network to establish a stable link.
Now, back to the problem; We have a locally attached harddrive
that we are writing our data to over USB. Occasionally we want
to connect and download these. That's why I am using ssh. I
can't really use KATCP for this, can I?
Thanks again,
Kjetil
Jason Manley wrote:
Um, no, this is probably a different problem. You are getting
these errors while using SSH/SCP, right? The hardware problem
with faulty PHY manifests as one or more of the PHY LEDs
flashing on/off (there are three red ones next to the PHY
chip). If your link is stable, then I believe the hardware
is fine.
The "MAC" problem appears to be software related, and comes
and goes depending on the kernel build. It does not refer to
the MAC address, but rather ssh's Message Authentication
Code. Check out http://blogs.sun.com/janp/entry/ssh_messages_code_bad_packet
for some info.
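To illustrate what a message authentication code check does, here is a small sketch using Python's standard hmac module. It is not ssh's actual packet format; the key, the payload and the choice of SHA-256 are assumptions made purely for the example. The point is that a single flipped bit in the data makes the check fail, which is the kind of event ssh reports as a corrupted MAC.

```python
import hashlib
import hmac


def make_tag(key: bytes, payload: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over `payload`."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(key: bytes, payload: bytes, tag: bytes) -> bool:
    """Check `payload` against `tag` using a constant-time comparison."""
    return hmac.compare_digest(make_tag(key, payload), tag)


key = b"hypothetical-session-key"  # illustrative only
payload = b"a block of transferred data"
tag = make_tag(key, payload)

# Flip one bit in the first byte to simulate corruption in transit.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]

print(verify(key, payload, tag))    # True
print(verify(key, corrupted, tag))  # False
```

A failed check therefore says that the bytes changed somewhere between sender and receiver; it does not by itself say where.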
Dave's made various changes to try'n fix it, and increasing
some software buffer has solved it for me. I no longer see
this problem, but it's probably been masked rather than
solved. Also, you never see it using KATCP, which is one
more reason to use that method for larger transfers.
WRT large (>1GB) transfers, remember that it will take a long
time to pull that much data off the FPGA. It does so in
pages of ~4000Bytes at a time. Also make sure you're using
the latest kernel. We discovered a bug in this paging system
during the workshop.
http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/linux/uImage-20091006-mmcfix
should be good. I have never tried pulling such volumes
over the SSH shell, but it works fine with KATCP.
I will ask him to comment further.
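To put the ~4000-byte paging mentioned above in perspective, the number of pages for a given transfer can be worked out as below. This is only a back-of-the-envelope sketch: the page size is the approximate figure from this message, and the per-page time is a placeholder to be replaced with a measured value.

```python
PAGE_BYTES = 4000  # approximate page size quoted above


def pages_needed(total_bytes, page_bytes=PAGE_BYTES):
    """Number of pages required, rounding up to a whole page."""
    return -(-total_bytes // page_bytes)  # ceiling division on integers


def transfer_seconds(total_bytes, seconds_per_page, page_bytes=PAGE_BYTES):
    """Estimated duration given a measured time per page."""
    return pages_needed(total_bytes, page_bytes) * seconds_per_page


one_gb = 10**9
print(pages_needed(one_gb))  # 250000

# Hypothetical 1 ms per page; substitute a value measured on the board.
print(transfer_seconds(one_gb, 0.001))
```

So a 1GB transfer means a quarter of a million page reads, and any fixed per-page overhead is multiplied accordingly.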
Jason
On 30 Oct 2009, at 01:25, John Ford wrote:
casper collaborators,
appended below is further info on roach ethernet problems
seen at CSIRO:
any ideas?
If I recall correctly, Alan mentioned this problem at the workshop, and
the problem was that some of the PHY chips were faulty at one point. This
may be what's going on. Hopefully someone knows for sure!
John
thanks,
dan
-------- Original Message --------
Subject: Re: SPDO ROACH spectrometer
Date: Fri, 30 Oct 2009 09:19:01 +1100
From: Kjetil Wormnes <kjetil.worm...@csiro.au>
To: Dan Werthimer <d...@ssl.berkeley.edu>
Hi Dan and Wan
I can confirm that we are seeing at least some of the problems with
another ROACH board as well. This time it is connected directly to a
computer with a short CAT5 cable.
So maybe this indicates that it is less likely to be a hardware problem?
Incidentally, the error message that happens when attempting to download
a large file over sftp is "Corrupted MAC on input".
cheers
Kjetil
Dan Werthimer wrote:
hi wan,
i don't know of anyone who has roach ethernet
problems at 100 Mbit/sec.
i'm cc'ing casper community to see if anyone has any ideas.
in general, it's good to post questions to cas...@lists,
so that everyone can help answer, and everyone can see the
answers,
and the info will be captured in the wiki/email archive.
if you want you can buy or ask digicom if they can send you
another national PHY chip and see if this helps.
also you might want to try using short cable, and/or a cat6
cable.
is your roach connected directly to a computer, or going
through a switch? might be interesting to try a different NIC
or different switch or different computer.
best,
dan
On 10/29/2009 02:47 PM, wan.ch...@csiro.au wrote:
Hi Dan:
I believe you have done a very nice job.
My problem is that the Ethernet port is not very reliable. Even running at
100 Mbit/s, the Ethernet port gets disconnected at times. Normally, it
recovers after rebooting the whole system.
I also cannot transfer big files over Ethernet. Small files of a few MB
are all right, but I could not download a 1GB file from the ROACH at all.
So Dan, could this problem be solved by replacing the on-board PHY?
Thanks
Wan