Re: [casper] ROACH hangs before Uboot prompt

2009-11-17 Thread Kjetil Wormnes

Hi Jason,

thanks for your email. All your help getting our systems up and running 
is very much appreciated.


I have upgraded to the latest version of uBoot as per your suggestion.
I also reinstalled the latest filesystem I could find; 
filesystem_etch_2009_10_28_tgtap.bz


Unfortunately, it still dies before the end of the boot sequence.

With both usbboot and nfsboot, the boot sequence gets almost to the end 
and then dies with a malloc error. The exact cause is hard to determine, 
but I have attached some logs for your reference. (Note that these were 
taken before I upgraded the filesystem, but upgrading changed nothing.)

The main things to note are:
Memory error message before uboot prompt: Memory error at 0150, 
wrote , read ffeb !

Malloc error at the end of the boot sequence

At this stage I have run out of ideas. Unless you have any other 
suggestions, the best option may be to wait for Wan to come back so 
that he can use his newfound knowledge to bring everything on this 
board to the same working state that his ROACH will hopefully be in.


Thanks again, and cheers

Kjetil

Jason Manley wrote:

Hi Kjetil

Since decreasing the bus speed back to the normal 66.67MHz (133MHz  
memory) and using registered DIMMs, we have not had any further memory  
troubles. This has been checked on 5 different ROACH boards from  
production run 1 and 2. Dave's even put one of the boards in a burn-in  
bootloop and after over a thousand boots, still not a single failure.  
But perhaps this sample size is too small to be meaningful.
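
As a rough way to judge whether that sample size is meaningful, the sketch below computes the largest per-boot failure probability that is still consistent, at 95% confidence, with seeing zero failures. It assumes the boots are independent and treats "over a thousand boots" as exactly 1000; the function name is hypothetical and not part of any ROACH tooling.

```python
def zero_failure_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-trial failure probability after n failure-free trials.

    Solves (1 - p) ** n_trials = 1 - confidence for p, assuming independent trials.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


if __name__ == "__main__":
    # Assumption: "over a thousand boots" is treated as exactly 1000 boots.
    bound = zero_failure_upper_bound(1000)
    print(f"95% upper bound on failure rate: {bound:.4%} per boot")
```

Under these assumptions, a failure rate of up to roughly 0.3% per boot would still be consistent with a thousand clean boots, so rarer faults could go unnoticed in a test of that length.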


Preparation H was trying to operate the memory right at the PPC's  
limits (167MHz), and it seems to be an unrealistic target no matter  
how carefully the board is designed. The PPC system just does not run  
reliably at those speeds. We knew this might be a problem from the  
outset and so made provision for booting at slower speeds through the  
use of DIP switches or remotely reconfiguring from the XPORT. I am  
surprised that you're still seeing memory errors. Both registered and  
unregistered DIMMs work for me at these speeds.


I doubt there will be a hardware fix for this. We could rev the board  
and try'n tighten-up the timing, but I am confident in the hardware  
design at this stage and suspect a software issue on your side. For  
one, there is an error in the floating-point-unit test that Uboot  
doesn't like which might be causing some of your troubles. Could you  
please provide a printout of Uboot's error messages? There is now a  
new version of Uboot in SVN (uboot-clkfix-20091113.bin) which has this  
test disabled. Please give that a go.


You do not need any DIMM in the FPGA to boot the PPC. And with  
registered DIMMs and bootstrap option C, you should not be seeing any  
more memory errors. If updating uboot doesn't fix your problem, I  
suggest you send all your ROACHs and memory modules with Wan next week  
and we'll get 'em all up and running with the latest firmware versions  
for you.


Jason


Re: [casper] ROACH hangs before Uboot prompt

2009-11-12 Thread Kjetil Wormnes

Hi again Jason,

Thanks again. Swapping to the FPGA DIMM did indeed get me to the uboot 
prompt (although with the same memory error messages along the way).


However, I have not been able to proceed to boot the kernel, but 
this may be because I now have no memory available to the FPGA. It may 
also be the other bugs you spoke of.


I should also mention that I took the opportunity of upgrading uboot to 
svn2226. The problem still persists, and only using the FPGA dimm gets 
me to the uboot prompt.


To be honest, the proposed solutions (swapping to registered DIMMs or 
changing to bootstrap C) seem more like hacks than anything else, and 
certainly do not inspire confidence in reliability.


Do you think the underlying problem, i.e. the poor signal integrity or 
the aggressive bus timing, is fixable by bug fixes to uboot? Or is this 
something that will require an upgrade to the hardware? If it is the 
latter, we would love (and need) to know, as it means we will probably 
need to delay our program until we can get more reliable hardware. If it 
is a software issue, then I would have to ask whether you think the fix 
could be given a fairly high priority.


But on another note, I pretty desperately need to get this thing booting 
again, even with the reliability issues we had before. This is so that I 
can work on some of the software interfaces while Wan is over your way.


So, I guess I might try to go out and buy some registered DIMMs. Could 
you please advise me on the other specs that are important? 512 MB 
DDR2 at 400 MHz seems very difficult to get hold of, and anything 
else appears to cause the memory errors I mentioned. Are these specs 
important? Or should it work if I got, say, a 1 GB stick at some faster 
speed? I noticed the FPGA RAM is 800 MHz, which should be obtainable... 
that is... if those errors are something we can live with for now.


cheers

Kjetil

Jason Manley wrote:

Hi Kjetil.

We have occasionally observed a similar problem here. Uboot tries to  
learn the required memory timing when booting. Sometimes it fails.  
It seems to be due to aggressive bus timing and poor signal integrity.  
Switching to registered DIMMs (like the one the FPGA uses) solves that  
problem, but introduces a new one which appears sporadically later in  
the Uboot boot process.


We've declocked our boards to Bootstrap C and it solved our memory  
issues. Since you've already tried this without success, I suggest you  
try registered DIMMs (put the FPGA dimm in the PPC slot) and see if  
that solves this problem for you.


If declocking doesn't fix this, we will have to work on a Uboot fix to  
enable reliable support for registered DIMMs.


Jason


[casper] ROACH hangs before Uboot prompt

2009-11-10 Thread Kjetil Wormnes

Hi all,

new week, new problem.

I tried to boot my Roach board this morning after not touching it for 
about 1.5 weeks. This time, however, it didn't get to the uBoot prompt; 
it hangs at the memory test. Interestingly, nothing has changed since it 
last booted successfully. Anyway, this is what is displayed:



U-Boot 2008.10-svn2212 (Aug  7 2009 - 12:20:58)

CPU:   AMCC PowerPC 440EPx Rev. A at 528 MHz (PLB=132, OPB=66, EBC=66 MHz)
   No Security/Kasumi support
   Bootstrap Option C - Boot ROM Location EBC (16 bits)
   32 kB I-Cache 32 kB D-Cache
Board: Roach
I2C:   ready
DTT:   1 is 26 C
DRAM:  (spd v1.0) 512 MB


I noticed that C-H Cheng had posted about this exact same problem in 
August this year.

(http://www.mail-archive.com/casper@lists.berkeley.edu/msg00870.html).

In their case the problem seems to have been solved by swapping the 
memory stick and upgrading uboot using a JTAG programmer.


As you can see above, I tried using SW3 to force Bootstrap option C. 
This did not make a difference.


I have tried to swap the memory; I could not find an identical stick, so 
I got a 1 GB stick at a higher speed. This just caused the system to 
return a memory error:

snip
DRAM:  (spd v1.2)  1 GB
Memory error at 0004, wrote , read 0055 !


I tried to swap with an identical memory stick from our other Roach 
board. This did not make a difference.


Before I send this board back to be reprogrammed I was wondering if 
anyone would have any other suggestions?


Thank you for all your continuing help

regards

Kjetil



Re: [casper] Fwd: Re: SPDO ROACH spectrometer

2009-11-05 Thread Kjetil Wormnes

Hi Jason,

Just out of curiosity, did you get my last email? I noticed that your 
reply was not to the last one I sent. In the last one I detailed some 
tests and the results. It also  showed the uboot and bootstrap config. 
Anyway, it had some attachments and I am unsure how the list handled 
those. I can't seem to find it in the archives, but it may just not have 
been listed yet. If you didn't get that email, let me know and I'll send 
it again.


So, it looks like the clocks and bootstrap settings match (although, to 
be honest, I only updated the EEPROM to make it boot from option H about 
a week ago).


I checked all the resistors you indicated; all except one are within 
1 ohm of 51 ohms. One resistor is at 59 ohms, but I don't suspect that 
should really be a problem.


To be honest, I am a bit reluctant to point to hardware problems since 
we have two Roach boards and the chance that we would have two dud ones 
seems slim. It seems much more likely to me that the problem is in the 
software/kernel or firmware. Or alternatively there is something a bit 
unreliable in the hardware that the software/kernel/firmware is not 
handling as well as it could.


I appreciate greatly that you are doing some tests on your hardware. I 
think what I would like to do now is to wait for the new uboot, and 
ensure that *everything*; uboot/kernel/cpld/filesystem/bootargs/physical 
setup is identical between our systems.


Then it would be good if we could develop a well-specified, simple test; 
copying a 2 GB file with scp a few times would probably be fine.
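
One way such a test could be made well specified is to compare checksums of the file before and after each transfer. The following is only a sketch under that assumption; the function names are hypothetical, and SHA-256 is picked for illustration rather than being anything the CASPER tools prescribe.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a multi-gigabyte file never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source_path: str, received_path: str) -> bool:
    """Compare the digest of the original file with that of the transferred copy."""
    return sha256_of_file(source_path) == sha256_of_file(received_path)
```

Running the source-side digest on the ROACH and the received-side digest on the host, then repeating the copy several times, would give a pass/fail result that both sites can compare directly.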


And if there are still mismatches then we can start worrying about 
hardware problems.


I'll be out of action at a course most of next week, but since we are 
waiting for the new uboot anyway that should not be such an issue. Wan 
and/or Aaron may wish to continue this discussion during that week 
though; otherwise I'll be back on the 9th. I've cc'd Aaron on this email.


Thanks once again. You are being very helpful and I am feeling that we 
are making progress.


best regards

Kjetil

Jason Manley wrote:

Hi Kjetil

Since you're not using the FPGA at all, that rules out bus issues. I
suspect a memory problem. Please check your memory DIMM as outlined in
my earlier email.

WRT Uboot versions: We'll work on releasing a new Uboot with latest
SVN source to be sure we're all running the same version. Expect an
update next week after we've had a chance to verify that the new
version works correctly.

Please also check your clocks: make sure you're booting with bootstrap
option H with the same bus speeds as listed below (check lines 3 and 5
in Uboot header). If this is the same, then don't worry about updating
the Fusion.

CPU:   AMCC PowerPC 440EPx Rev. A at 495 MHz (PLB=165, OPB=82, EBC=82 MHz)
No Security/Kasumi support
Bootstrap Option H - Boot ROM Location I2C (Addr 0x52)
...

If it's something else, clocks are set up incorrectly. To fix this,
first check that all DIP switches are set to off. If DIP switches are
off and it's booting into config C, you might need to flash your
Fusion (a flag in its eeprom toggles between boot option C and boot
option H).
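
The clock check described above could be automated from a captured serial log. This is a minimal sketch, assuming the banner has the same layout as the ones quoted in this thread; the function name and the returned dictionary keys are hypothetical.

```python
import re

# Patterns modelled on the U-Boot banners quoted in this thread.
CPU_LINE = re.compile(
    r"at\s+(?P<cpu>\d+)\s*MHz\s*"
    r"\(PLB=(?P<plb>\d+),\s*OPB=(?P<opb>\d+),\s*EBC=(?P<ebc>\d+)\s*MHz\)"
)
BOOTSTRAP_LINE = re.compile(r"Bootstrap Option (?P<option>[A-Z])\b")


def parse_banner(banner: str) -> dict:
    """Extract clock speeds (MHz) and the bootstrap option letter from a banner."""
    # Collapse all whitespace so banners wrapped by a mail client still match.
    text = " ".join(banner.split())
    cpu = CPU_LINE.search(text)
    boot = BOOTSTRAP_LINE.search(text)
    if cpu is None or boot is None:
        raise ValueError("banner does not contain the expected CPU/bootstrap lines")
    result = {name: int(value) for name, value in cpu.groupdict().items()}
    result["bootstrap"] = boot.group("option")
    return result
```

For the option H banner quoted above, this returns cpu=495, plb=165, opb=82, ebc=82 and bootstrap "H", which can then be compared against the expected values.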

If it is already boot option H, but speeds are wrong, then the
settings in an I2C EEPROM are wrong. Reset 'em as follows:
   *) Update your uboot to the latest version.
   *) Interrupt Uboot and clear the environment by executing: run clearenv
   *) Reboot.
   *) Interrupt boot and execute: run init_eeprom
   *) Reconfigure your MAC address by executing:
      setenv ethaddr 02:00:00:aa:bb:cc
      (where aabbcc is your board's serial number).
   *) Save the environment by executing: saveenv
   *) Reboot.
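
The ethaddr value in the steps above can be built mechanically from the serial number. The sketch below assumes the serial is supplied as exactly six hexadecimal digits (the thread does not say how it is printed on the board), and the helper name is hypothetical.

```python
HEX_DIGITS = set("0123456789abcdef")


def ethaddr_from_serial(serial: str) -> str:
    """Build a 02:00:00:aa:bb:cc style address from a six-hex-digit serial.

    Assumption: the serial is given as six hex digits; adjust if your board
    labels it differently.
    """
    digits = serial.strip().lower()
    if len(digits) != 6 or not set(digits) <= HEX_DIGITS:
        raise ValueError("expected a serial number of exactly six hex digits")
    # Split the serial into three byte-sized pairs: aa, bb, cc.
    pairs = [digits[i:i + 2] for i in range(0, 6, 2)]
    return ":".join(["02", "00", "00"] + pairs)


if __name__ == "__main__":
    # "010203" is a made-up serial used purely for illustration.
    print("setenv ethaddr " + ethaddr_from_serial("010203"))
```

The leading 02 octet marks the address as locally administered, so it should not clash with vendor-assigned addresses on the same network.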



FWIW, Dave's managed to transfer large files (2GB) without problem,
even using SCP, both sending and receiving. Tests ongoing this side.


Jason




Re: [casper] Fwd: Re: SPDO ROACH spectrometer

2009-11-04 Thread Kjetil Wormnes

Hi Jason,

Thanks for your pointers; I am currently not actually using the FPGA. 
Just focusing on being able to talk to the powerpc reliably at the moment.


The system does also crash when using NFS but, as I said and you noted, 
it is more difficult to trace those crashes directly back to EMAC-related 
kernel functions. They may very well be a secondary symptom of something else.


Now your suggested versions for Uboot/CPLD/Monitor are interesting.

We have two roach boards; the newer one that I have been testing is 
reporting


U-Boot 2008.10-svn2157 (Jul  31 2009 - 17:15:22)
...
Monitor Revision: 7.3.0
CPLD Revision:7.5.6

Whereas the older Roach that Wan has been using reports

U-Boot 2008.10-svn1923 (May  29 2009 - 17:22:43)
...
Monitor Revision: 6.5.1429
CPLD Revision:2.0.5


Leaving this older one aside for reference for now, I have upgraded the U-boot 
image on the newer roach to 20090807-uboot-nohack.bin, which is actually from 
revision 2212, but seemed to be the closest to the suggested revision I could 
find without compiling the image myself.


I looked around, without success, for how to upgrade the CPLD/Monitor. 
Would you be able to point me in the right direction?


I'll test for any improvements with the new uboot now.

Thanks again

Kjetil

Jason Manley wrote:

Also, make sure you're running newer versions of uboot and the CPLD
image. Bus settings changed some months back and improved stability
significantly.

Uboot will report the versions, and I recommend:

U-Boot 2008.10-svn2226 (Aug  7 2009 - 16:06:44)
...
Monitor Revision: 8.3.1698
CPLD Revision:8.1.0

At the very least, you should have CPLD Revision 8.0.1588.

The only outstanding bug that regularly affects me is that u-boot
sometimes doesn't detect the PPC's SDRAM on startup. The system then
hangs. Replacing the DIMM with registered memory (same as FPGA DIMM)
apparently fixes this.

Jason

On 04 Nov 2009, at 07:56, Kjetil Wormnes wrote:

  

Hi all,

For reference I've attached a summary of our problems below, and a
few things I have attempted to do to isolate it. The short of it is
that we are unable to transfer large amounts of data across the
ethernet reliably, regardless of:
--kernel version
--whether the root filesystem is mounted over USB or NFS
--network protocol used for the transfer

The way the crash happens varies, and is not repeatable. Sometimes it
seems to be a userspace crash, sometimes it is a kernel panic. I
have been unable to see any real pattern in the crash reports. This
to me seems to indicate that the root cause of the problem may be
common, and either an obscure kernel problem or possibly something
in the interface between the kernel and the hardware or in the
hardware itself.

It wouldn't be a big effort to reimplement our software to run on a
remote machine and talk to the ROACH over KATCP, rather than run
locally on the ppc. But since it would require a complete rewrite of
the software, we haven't tested this yet. Perhaps it is worth trying.

The catch is that I am still really unsure whether we are dealing
with many symptoms of the same problem; or many different problems.

Anyway, I would like to thank you for all your input, and will let
you know if and how we find a satisfactory solution.

cheers

Kjetil



Here is the summary:



*The problem*
The system crashes when downloading large files. There appear to be
varying causes for this crash that may or may not have a common
underlying reason.

I have attempted to isolate the problem by
• Downloading using different protocols and software; ssh and two
different ftp servers.
• Mounting the filesystem over NFS as opposed to USB
• Installing well-known and used kernels, and comparing to custom
kernels.

*SSH*
SSH always crashes with “Invalid MAC on input” or related error
messages. This appears to be a problem with SSH.

*FTP*
System instabilities were observed using two different ftp servers;
proftpd and pure-ftpd.

In the best case, with pure-ftpd, I was able to download 2-3 files,
each about 2 GB in size, before the system crashed. The call stack
seemed to indicate that the crash happened in EMAC interface
functions (i.e. ethernet).

However, we have no way of knowing whether these crashes are in fact
rather side-effects of the USB subsystem misbehaving. Jason from the
Casper mailing list has once again reconfirmed that USB on powerpcs
is notoriously unreliable.

*DIFFERENT KERNELS - DIFFERENT PROBLEMS*
Using some kernels (the latest) saw the link unable to come up at
all, while both a custom compiled older kernel (a couple of months
ago) and a downloaded image, uImage-20091006-mmcfix both saw the
link come up, but with all the crashes described.

*ELIMINATING USB AS A CAUSE*
To eliminate the effects of USB, I mounted the root filesystem
remotely using NFS. I make a few observations;

*SSH*
Still dies from time to time with the Invalid MAC error message.
This was expected as we have already pretty much determined that
this error is ssh

Re: [casper] Fwd: Re: SPDO ROACH spectrometer

2009-11-02 Thread Kjetil Wormnes

Hi Jason,

Thank you again for your reply. I can use FTP or even write my own 
little raw socket transfer routine, and it seems to work, I can 
transfer  a few gigabyte-size files.


However, at the end of this, the other problem kicks in; causing a 
system crash. I believe this is a kernel problem, as it exhibits itself 
differently with different kernels I have tried.


So, putting the ssh problem aside as something that we can work around 
and returning to the other request I made;


I am compiling my own kernel because I seem to need to in order to get 
EHCI and EXT3 to work properly.


However, when I do, EMAC can't autonegotiate a link, and even forcing it 
to something doesn't work. The link comes up, then drops out again... 
repeatedly.


The interesting thing is this problem *does not* occur when I compile my 
kernel using an svn checkout from a couple of months ago. Even with the 
exact same .config file.


At least this is the case as far as I can tell.

Now, in order to be 100% sure that it is in fact a difference in the 
source that is causing this problem, rather than just the .config, I 
would love it if you could send me the .config file used to compile the 
uImage-20091006-mmcfix kernel.


The ethernet interface does appear to be more stable with that kernel, 
but unfortunately I can't use it as it doesn't allow USB 2.0 speeds, so 
if you please, the .config file would be very useful.


Thanks again for all your help


Kjetil


Jason Manley wrote:
There appears to be some issue with ssh on ROACH with large transfers.  
It is definitely not a hardware problem as other network transfers  
work fine. Both Andrew Martens and myself regularly transfer large  
amounts of data (1GB) using KATCP. This ssh bug has become a low  
priority for us as we concentrate on other things. If you do not want  
to try'n debug it yourself, I recommend you try an FTP server.


Kjetil, you are correct; at present, KATCP does not support transfer  
of arbitrary files from filesystem.


Jason


Re: [casper] Fwd: Re: SPDO ROACH spectrometer

2009-11-01 Thread Kjetil Wormnes

Hi Jason,

thank you for your reply. The SUN link was very descriptive.

Firstly, it appears the problem is still there with the kernel build you 
suggested. After a few megabytes, the connection closes, telling me: 
Corrupted MAC on input.


But interestingly it seems to have solved another problem that I was 
having with one of our ROACH boards. It would be great if you could send 
me the .config file for that build so I can compare it with mine. I have 
a custom kernel as I like ext3 support and a few other bits and pieces, 
but have been having some issues getting the network to establish a 
stable link.


Now, back to the problem: we have a locally attached hard drive that we 
are writing our data to over USB. Occasionally we want to connect and 
download the data. That's why I am using ssh. I can't really use KATCP 
for this, can I?


Thanks again,

Kjetil




Jason Manley wrote:
Um, no, this is probably a different problem. You are getting these  
errors while using SSH/SCP, right? The hardware problem with faulty  
PHY manifests as one or more of the PHY LEDs flashing on/off (there  
are three red ones next to the PHY chip). If your link is stable, then  
I believe the hardware is fine.


The MAC problem appears to be software related, and comes and goes  
depending on the kernel build. It does not refer to the MAC address,  
but rather to ssh's Message Authentication Code. Check out 
http://blogs.sun.com/janp/entry/ssh_messages_code_bad_packet 
for some info.
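
To illustrate why corrupted data shows up as a MAC error: a message authentication code is a keyed checksum computed over each packet, and the receiver rejects any packet whose recomputed code does not match. The sketch below uses HMAC-SHA256 from the Python standard library purely as an illustration; it is not ssh's actual packet format, and the key and payload are made up.

```python
import hashlib
import hmac


def make_tag(key: bytes, payload: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the payload."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def packet_is_authentic(key: bytes, payload: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time with the received one."""
    return hmac.compare_digest(make_tag(key, payload), tag)


if __name__ == "__main__":
    key = b"session-key"  # hypothetical; ssh derives its keys during key exchange
    payload = b"block of file data"
    tag = make_tag(key, payload)

    # Flip a single bit, as a faulty memory or bus transfer might.
    corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]

    print(packet_is_authentic(key, payload, tag))
    print(packet_is_authentic(key, corrupted, tag))
```

The intact payload verifies and the one-bit-corrupted payload does not, which is why any data corruption between the two ssh endpoints, whatever its cause, ends the session with a MAC error rather than silently delivering bad data.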


Dave's made various changes to try'n fix it, and increasing some  
software buffer has solved it for me. I no longer see this problem,  
but it's probably been masked rather than solved. Also, you never see  
it using KATCP, which is one more reason to use that method for larger  
transfers.


WRT large (1GB) transfers, remember that it will take a long time to  
pull that much data off the FPGA. It does so in pages of ~4000 bytes at  
a time. Also make sure you're using the latest kernel. We discovered a  
bug in this paging system during the workshop. 
http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/linux/uImage-20091006-mmcfix 
should be good. I have never tried pulling such volumes over the SSH  
shell, but it works fine with KATCP.
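
To get a feel for why such a transfer is slow, the number of page reads can be estimated directly. This is back-of-the-envelope arithmetic only: it takes the approximate page size of 4000 bytes from the message above at face value, and the function name is hypothetical.

```python
def pages_needed(total_bytes: int, page_bytes: int = 4000) -> int:
    """Number of fixed-size pages needed to move total_bytes (rounded up)."""
    if total_bytes < 0 or page_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and page_bytes must be > 0")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (total_bytes + page_bytes - 1) // page_bytes


if __name__ == "__main__":
    # Assumption: "1GB" is taken as 10**9 bytes; the page size is approximate.
    print(pages_needed(10**9))
```

With those assumptions a 1 GB transfer needs 250000 separate page reads, so any fixed per-page overhead is multiplied a quarter of a million times.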


I will ask him to comment further.

Jason


On 30 Oct 2009, at 01:25, John Ford wrote:

  

casper collaborators,

appended below is further info on roach ethernet problems seen at  
CSIRO:

any ideas?
  
If I recall correctly, Alan mentioned this problem at the workshop, and
the problem was that some of the PHY chips were faulty at one point. This
may be what's going on.  Hopefully someone knows for sure!

John



thanks,

dan

 Original Message 
Subject: Re: SPDO ROACH spectrometer
Date: Fri, 30 Oct 2009 09:19:01 +1100
From: Kjetil Wormnes kjetil.worm...@csiro.au
To: Dan Werthimer d...@ssl.berkeley.edu

Hi Dan and Wan

I can confirm that we are seeing at least some of the problems with
another ROACH board as well. This time it is connected directly to a
computer with a short Cat5 cable.

So maybe this indicates that it is less likely to be a hardware problem?
Incidentally, the error message that happens when attempting to download
a large file over sftp is Corrupted MAC on input.

cheers

Kjetil

Dan Werthimer wrote:
  

hi wan,

i don't know of anyone who has roach ethernet
problems at 100 Mbit/sec.

i'm cc'ing casper community to see if anyone has any ideas.
in general, it's good to post questions to cas...@lists,
so that everyone can help answer, and everyone can see the answers,
and the info will be captured in the wiki/email archive.

if you want you can buy or ask digicom if they can send you
another national PHY chip and see if this helps.

also you might want to try using short cable, and/or a cat6 cable.
is your roach connected directly to a computer, or going
through a switch?  might be interesting to try a different NIC
or different switch or different computer.

best,

dan


On 10/29/2009 02:47 PM, wan.ch...@csiro.au wrote:



Hi Dan:

I believe you have done a very nice job.

My problem is that the Ethernet port is not very reliable. Even running at
100 Mbit/s, the Ethernet port will be disconnected at some times. Normally,
it can resume after rebooting the whole system.

And I could not transfer big files through ethernet. Small files like a
few MB are all right. But I could not download a 1GB file from the Roach at
all.

So Dan, could this problem be solved by replacing the on board PHY?

Thanks

Wan


  
  





  





Re: [casper] SPDO ROACH spectrometer

2009-10-29 Thread Kjetil Wormnes

Hi Dan and Wan

I can confirm that we are seeing at least some of the problems with 
another ROACH board as well. This time it is connected directly to a 
computer with a short Cat5 cable.


So maybe this indicates that it is less likely to be a hardware problem? 
Incidentally, the error message that happens when attempting to download 
a large file over sftp is Corrupted MAC on input.


cheers

Kjetil

Dan Werthimer wrote:

hi wan,

i don't know of anyone who has roach ethernet
problems at 100 Mbit/sec.

i'm cc'ing casper community to see if anyone has any ideas.
in general, it's good to post questions to cas...@lists,
so that everyone can help answer, and everyone can see the answers,
and the info will be captured in the wiki/email archive.

if you want you can buy or ask digicom if they can send you
another national PHY chip and see if this helps.

also you might want to try using short cable, and/or a cat6 cable.
is your roach connected directly to a computer, or going
through a switch?  might be interesting to try a different NIC
or different switch or different computer.

best,

dan


On 10/29/2009 02:47 PM, wan.ch...@csiro.au wrote:
  

Hi Dan:

I believe you have done a very nice job.

My problem is that the Ethernet port is not very reliable. Even running at 
100 Mbit/s, the port sometimes disconnects. Normally it recovers after 
rebooting the whole system.

Also, I cannot transfer big files over Ethernet. Small files of a few MB 
are all right, but I cannot download a 1 GB file from the ROACH at all.

So Dan, could this problem be solved by replacing the on board PHY?

Thanks

Wan







Re: [casper] Slow HD write speed

2009-10-18 Thread Kjetil Wormnes

Hi David,

Thank you so much for that. It was extremely helpful.

We were indeed mounting the drive synchronously, *and* the USB is falling 
back to OHCI. Following your suggestion and removing the sync flag 
seems to have helped a bit, so recompiling the kernel is next ...


cheers

Kjetil

David George wrote:

Hi Kjetil.

  
So, we have a ROACH system that we have set up to boot via usbboot 
into a full debian etch filesystem. The problem is that we get 
extremely low write speeds to the disk. In the order of a couple of 
Mbit/s.






The problem might be that your root filesystem is mounted with the 
'sync' flag. Edit your /etc/rcSimple file (Init runs this script first); 
if you have something like:

    mount -o remount,rw,sync,noatime /

change it to something like:

    mount -o remount,rw,noatime /

This was my mistake - I thought it would be a good idea for SD/MMC card 
access to be synchronous. Turns out it wasn't really. Perhaps a 
filesystem update is on the cards.
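
To confirm whether the flag is actually in effect after a reboot, the
mount options can be read back from /proc/mounts. A minimal sketch in
Python, assuming an interpreter is available on the ROACH or that the
file contents are copied to a host. The sample text is hypothetical and
not taken from a real ROACH log.

```python
def sync_mounted(mounts_text):
    """Return mount points whose option list contains the 'sync' flag.

    mounts_text is the content of /proc/mounts: one filesystem per line,
    whitespace-separated fields (device, mount point, type, options, ...).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        # Exact match, so 'async' is not mistaken for 'sync'.
        if "sync" in options:
            found.append(mount_point)
    return found


# Hypothetical sample resembling a system booted from a USB disk.
SAMPLE = """\
rootfs / rootfs rw 0 0
/dev/sda1 / ext2 rw,sync,noatime 0 0
proc /proc proc rw 0 0
"""

print(sync_mounted(SAMPLE))  # prints ['/']
```

On the board itself the input would be open("/proc/mounts").read(). An
empty list means no filesystem is mounted synchronously.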


The other problem is the AMCC PPC440EPX USB seems to misbehave quite 
badly. I have been fiddling around for a day or so, trying to work out 
why one of our flash sticks doesn't work reliably.


Firstly, there is a known issue for the PPC440EPX USB that can lead to 
screw-ups when both OHCI and EHCI Linux drivers are loaded.

https://kerneltrap.org/mailarchive/linux-usb/2008/11/2/3900114

There are fixes (hacks) in the upstream kernel, but we are running off 
the old ppc tree and updating to the new powerpc tree is not an 
insignificant task (it will probably happen this year though). I think 
this leads to USB devices falling back to OHCI (full-speed) even when 
they are EHCI (high-speed) compatible. This leads to data rates of 1-2 MB 
per second. This could also be your problem.
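
One way to see which mode a device actually negotiated is the 'speed'
attribute that the Linux USB core exposes in sysfs, which reports the
link rate in Mbit/s (12 for full-speed, 480 for high-speed). A minimal
sketch, assuming Python 3 and that this attribute is present on the
ROACH kernel in use, which has not been verified here. The function
names are illustrative.

```python
import glob
import os

# Mbit/s values reported by the sysfs 'speed' attribute, mapped to USB modes.
SPEED_NAMES = {
    "1.5": "low-speed",
    "12": "full-speed (OHCI/UHCI)",
    "480": "high-speed (EHCI)",
}


def classify_speed(raw):
    """Map the raw content of a sysfs 'speed' file to a readable mode name."""
    value = raw.strip()
    return SPEED_NAMES.get(value, "unknown (%s)" % value)


def list_usb_speeds(sysfs_root="/sys/bus/usb/devices"):
    """Return {device name: mode} for every USB device that has a 'speed' file."""
    result = {}
    for path in sorted(glob.glob(os.path.join(sysfs_root, "*", "speed"))):
        device = os.path.basename(os.path.dirname(path))
        with open(path) as handle:
            result[device] = classify_speed(handle.read())
    return result


# Prints nothing on a machine without USB devices in sysfs.
for device, mode in list_usb_speeds().items():
    print(device, mode)
```

A flash drive that supports high-speed but shows up as full-speed here
would be consistent with the fallback to OHCI described above.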


Now in theory, if you compile a kernel with just EHCI (high-speed mode) 
your USB devices should work at high speeds (9+ MB per second). However, 
I have seen very weird behaviour on one specific flash drive here with 
just EHCI compiled in. When I first insert the device the usb driver 
spews out errors. Then if I put in another device, which happens to 
work, and reinsert the old flakey flash drive it works fine from then 
on. This makes me think there is some software/setup problem. The same 
flakey device always works in OHCI mode when EHCI hasn't been compiled 
into the kernel.


In summary - USB on ROACH has some known problems which will hopefully 
improve when we update to the new mainline kernel. If you are having 
reliability trouble try compiling a kernel without EHCI. If you want 
maximum performance try compiling without OHCI. Also make sure that your 
root filesystem isn't mounted 'sync'.
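
To check whether any of these changes helps, it is useful to measure
write throughput the same way before and after. A minimal sketch,
assuming Python 3 and a writable directory on the disk under test. The
sizes are arbitrary, and the result is reported in both MB/s and Mbit/s
because this thread uses both units.

```python
import os
import tempfile
import time


def write_throughput(directory, total_bytes=4 * 1024 * 1024, block_size=64 * 1024):
    """Write at least total_bytes to a temporary file and return bytes per second."""
    block = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.time()
        with os.fdopen(fd, "wb") as handle:
            written = 0
            while written < total_bytes:
                handle.write(block)
                written += block_size
            handle.flush()
            os.fsync(handle.fileno())  # include the time to reach the device
        elapsed = time.time() - start
    finally:
        os.remove(path)
    return written / max(elapsed, 1e-9)


def to_mbyte_per_s(bytes_per_s):
    return bytes_per_s / 1e6


def to_mbit_per_s(bytes_per_s):
    return bytes_per_s * 8 / 1e6


# Small run against the system temp directory; point it at the USB disk instead.
rate = write_throughput(tempfile.gettempdir(), total_bytes=1024 * 1024)
print("%.1f MB/s = %.1f Mbit/s" % (to_mbyte_per_s(rate), to_mbit_per_s(rate)))
```

For comparison with the figures in this thread: 1-2 MB/s corresponds to
8-16 Mbit/s, 9 MB/s to 72 Mbit/s, and "a couple of Mbit/s" to roughly
0.25 MB/s.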


Regards,
David

  





[casper] Slow HD write speed

2009-10-14 Thread Kjetil Wormnes

Hello all,

We have a ROACH system and have butted against an ever so small problem 
that I was hoping one of you may be able to give some input on.


You may notice that I am new to the mailing list, so hello :-). Please 
don't hesitate to let me know if I am not conforming to the posting 
policies.


So, we have a ROACH system that we have set up to boot via usbboot into 
a full debian etch filesystem. The problem is that we get extremely low 
write speeds to the disk. In the order of a couple of Mbit/s.


Has anyone come across this problem before? Any ideas how to solve it?

Here is a bit of information about our setup

UBOOT version: U-Boot 2008.10-svn1923
bootargs: bootargs console=ttyS0,115200 mtdparts=${partitions} rootdelay=8 root=/dev/sda1 rw
kernel version: Linux-2.6.25-svn1867-dirty1

I can attach the full bootlog if that would be useful. I haven't done it 
here as I wouldn't want to scare you all with an enormous first post.


best regards
Kjetil