Re: [casper] Problem about the adc frequency in PAPER model.
On Wed, Nov 19, 2014 at 6:50 AM, Peter Niu peterniu...@163.com wrote:

> Hi Dave,
>
> Sorry for the late reply. The trouble I encountered in netboot turned out to be that the uImage I was using had changed. As a test, I downloaded the latest uImage from https://github.com/ska-sa/roach2_nfs_uboot/tree/master/boot and used uImage-roach2-3.16-hwmon (https://github.com/ska-sa/roach2_nfs_uboot/blob/master/boot/uImage-roach2-3.16-hwmon) as the netboot uImage. The file looks like this:
>
> [peter@roachserver ~]$ file -L /srv/roach_boot/boot/uImage
> /srv/roach_boot/boot/uImage: u-boot legacy uImage, Linux-3.16.0-saska-03675-g1c70f, Linux/PowerPC, OS Kernel Image (gzip), 3034204 bytes, Tue Aug 26 14:54:14 2014, Load Address: 0x0070, Entry Point: 0x007010C4, Header CRC: 0x66EDCF88, Data CRC: 0x42A230BA
>
> I changed the uImage to uImage-r2borph3 (https://github.com/ska-sa/roach2_nfs_uboot/blob/master/boot/uImage-r2borph3).

There should be an even newer uImage (ie linux kernel) and romfs (ie flash filesystem, containing tcpborphserver3) at that location. I think the most notable change is that we have changed the kernel memory model, so that the full 128 MB fpga address space is visible in one go. There are probably some other fixes and changes too - the commit logs in katcp_devel should have some information.

Things are rather busy here, so apologies for not updating the NFS filesystem - we currently don't use it, so it is likely to remain out of date, though Dave (I think?) maintains a more recent version.

regards

marc
Re: [casper] Problem about the adc frequency in PAPER model.
On Wed, Nov 19, 2014 at 8:37 AM, Marc Welz m...@ska.ac.za wrote:

> There should be an even newer uImage (ie linux kernel) and romfs (ie flash filesystem, containing tcpborphserver3) at that location. I think the most notable change is that we have changed the kernel memory model, so that the full 128 MB fpga address space is visible in one go.

... meaning that you would need to update both the kernel and tcpborphserver3 to the revisions checked in a week ago or so, to map the full address space - just updating one will not be sufficient.

regards

marc
Re: [casper] Problem about the adc frequency in PAPER model.
> Hello,
>
> I found an updated roach2-root-fullmap-2014-08-12.romfs. Could you please tell me what I should do to make it work? Should I put this file in the same place as tcpborphserver3 in the ROACH2 file system (/usr/local/sbin)? Thanks for your answer - I am totally new to this. :)
>
> Peter

If you are not solobooting, then on a linux pc somewhere

# mkdir -p /mnt/tmp
# mount -o loop roach2-root-fullmap-2014-08-12.romfs /mnt/tmp

... now copy out /mnt/tmp/sbin/tcpborphserver3 to where you need it

regards

marc
Re: [casper] Problem about the adc frequency in PAPER model.
On Thu, Nov 13, 2014 at 5:49 AM, Richard Black aeldstes...@gmail.com wrote:

> Wow. Well that seemed to be the magic bullet. Thanks! Any ideas why this works? Is it because of an NFS lock-out or a 10-GbE driver issue in the NFS kernel image?

So I don't know. It could also be a version difference? The things to look at are the kernel and tcpborphserver (the former is a file in its own right; the latter can be gotten by mounting a romfs image via loopback and copying out /sbin/tcpborphserver3).

We also have had interesting cases where the fpga doesn't quite do what the bus controller on the power pc expects to happen - in those cases random perturbations change the behaviour, although pathological cases can have the fpga contend with flash accesses, which then corrupts things.

Also look in https://github.com/ska-sa/roach2_nfs_uboot, particularly the boot directory - occasionally prebuilt images get uploaded there, though for the change information you will have to read the ska-sa/katcp_devel commits.

Final, unrelated, tip: it is fine to have another (interactive) telnet connection to port 7147 on the roach while your scripts are doing things - this connection can be used to see failures or problems, and for detailed debugging messages, try typing ?log-level trace - just be mindful of the performance impact. There is a tool (kcplog) which can be built for a remote machine to automate this.

regards

marc
Re: [casper] Problem about the adc frequency in PAPER model.
Hi, Richard,

I'm glad this fixed your problem as well! This is definitely one for the wiki!!!

Dave

On Nov 12, 2014, at 2:34 PM, Richard Black wrote:

Wow. Well that seemed to be the magic bullet. Thanks! Any ideas why this works? Is it because of an NFS lock-out or a 10-GbE driver issue in the NFS kernel image? In any case, this is a tremendous discovery! Thanks to all for all the effort!

Richard

On Wednesday, November 12, 2014, 牛晨辉 peterniu...@163.com wrote:

Hi All,

I'm happy to tell you that the PAPER model can finally run without overflow! I found that the bof file, whether the PAPER model or my own, could run at 200 MHz with the correct packet structure; it is the system setup on the roach that matters (thanks to Marc's help with soloboot!). I tried soloboot on the roach, and it works fine for the model. I don't know why the netboot setup is not OK (it influenced the frequency too much, I guess). However, FWIW, the overflow problem that accompanied me for a few weeks is finally solved! I can have a good sleep tonight. Thanks for your warm help!

Peter

At 2014-11-08 03:10:47, David MacMahon dav...@astro.berkeley.edu wrote:

Hi, Richard,

I think that your 1 PPS should be very usable. I think we typically generate the 1 PPS from a GPS clock. If you want to try a test, you could disconnect the 1 PPS and use the software generated sync signal as per the earlier emails. If that works and using the external 1 PPS doesn't, then you will have found the problem. I'd be surprised (but happy!) if that turns out to be the problem.

Dave

On Nov 7, 2014, at 10:55 AM, Richard Black wrote:

Thanks David and all,

I unfortunately misspoke when it came to the power in the ADC clock signal. In fact, we had it at 9 dBm, not -9. Sorry for any confusion.

I set up the pulse generator to swing from +0.0 to +3.0 V at 1 us. To check on possible ringing, I also hooked up our pulse generator to an oscilloscope (I increased the pulse width to 10 ms, so I could see it). The waveform I observe has some severe overshoot both on the uptake and down. I've attached a drawing to explain what I mean. I can't seem to mitigate this overshoot with our little Agilent arbitrary waveform generator. Is this similar to the ringing seen at NRAO? If so, how is the 1 PPS generated by casperites?

Thanks,

Richard Black

On Fri, Nov 7, 2014 at 11:29 AM, David MacMahon dav...@astro.berkeley.edu wrote:

Hi, Richard,

On Nov 7, 2014, at 9:03 AM, Richard Black wrote:

Haven't heard anything for a while, so I thought I would add some more detail about our system setup to see if it might shed some light on the problem:

1 PPS Signal - Square pulse
  Frequency: 1 Hz
  Amplitude: 3 Vpp
  Offset: 0 V
  Width: 10 ms
  Edge Time: 5 ns

That should be fine assuming the 3 Vpp is measured with the 50 ohm termination in place. If you want to try a software sync, you can pass -S (UPPERcase!) to the latest paper_feng_init.rb script. Check the output of paper_feng_init.rb --help to see whether your version supports that option.

ADC Clock - CW Tone
  Frequency: 200 MHz
  Power: -9 dBm

It would be a good idea to increase the power level to +6 dBm as described on this wiki page: https://casper.berkeley.edu/wiki/ADC16x250-8_coax_rev_2#ADC16x250-8_coax_rev_2_Inputs

But if the paper_feng_init.rb script reports that the ADC clocks are locked and they measure approximately 200 MHz, then I think this is unlikely to be the cause of the 10 GbE overflow problems (though it would be great if the fix were this simple!).

For David, are there any red flags with our UBoot version or ROACH CPLD? Here they are again for reference:

From serial interface after ROACH reboot
==
U-Boot 2011.06-rc2-0-g2694c9d-dirty (Dec 04 2013 - 20:58:06)
...
CPLD: 2.1
==

This matches one of our ROACH2s that is running and sending 10 GbE packets in our lab:

U-Boot 2011.06-rc2-0-g2694c9d-dirty (Dec 04 2013 - 20:58:06)
CPU: AMCC PowerPC 440EPx Rev. A at 533.333 MHz (PLB=133 OPB=66 EBC=66)
No Security/Kasumi support
Bootstrap Option C - Boot ROM Location EBC (16 bits)
32 kB I-Cache 32 kB D-Cache
Board: ROACH2
I2C: ready
DRAM: 512 MiB
Flash: 128 MiB
In: serial
Out: serial
Err: serial
CPLD: 2.1
USB: Host(int phy)
SN: ROACH2.2 batch=D#6#69
software fixups match
MAC: 02:44:01:02:06:45
DTT: 1 is 23 C
DTT: 2 is 26 C
Net: ppc_4xx_eth0

Hope this helps,

Dave

pulse_profile.png

-- Richard Black
Re: [casper] Problem about the adc frequency in PAPER model.
Hi, Marc,

On Nov 13, 2014, at 12:08 AM, Marc Welz wrote:

> On Thu, Nov 13, 2014 at 5:49 AM, Richard Black aeldstes...@gmail.com wrote:
>> Any ideas why this works? Is it because of an NFS lock-out or a 10-GbE driver issue in the NFS kernel image?

None of the control stuff goes over NFS, so I don't think that's likely to be the problem, but at this point (almost) nothing would surprise me.

> So I don't know. It could also be a version difference? The things to look at are the kernel and tcpborphserver (the former is a file in its own right, the latter can be gotten by mounting a romfs image via loopback and copying out /sbin/tcpborphserver3).

Are the drivers that provide the /dev/roach/mem and /dev/roach/config nodes compiled into the kernel image?

> We also have had interesting cases where the fpga doesn't quite do what the bus controller on the power pc expects to happen - in those cases random perturbations change the behaviour, although pathological cases can have the fpga contend with flash accesses which then corrupts things.
>
> Also look in https://github.com/ska-sa/roach2_nfs_uboot, particularly the boot directory - occasionally prebuilt images get uploaded there, though for the change information you will have to read the ska-sa/katcp_devel commits.
>
> Final, unrelated, tip: It is fine to have another (interactive) telnet connection to port 7147 on the roach while your scripts are doing things - this connection can be used to see failures or problems, and for detailed debugging messages, try typing ?log-level trace - just be mindful of the performance impact. There is a tool (kcplog) which can be built for a remote machine to automate this.

Thanks for the tips!

Dave
Re: [casper] Problem about the adc frequency in PAPER model.
On Thu, Nov 13, 2014 at 8:32 AM, David MacMahon dav...@astro.berkeley.edu wrote:

> Are the drivers that provide the /dev/roach/mem and /dev/roach/config nodes compiled into the kernel image?

Yes, the roach kernels have never used modules.

regards

marc
Re: [casper] Problem about the adc frequency in PAPER model.
Thanks, Marc,

On Nov 13, 2014, at 12:08 AM, Marc Welz wrote:

> Also look in https://github.com/ska-sa/roach2_nfs_uboot, particularly the boot directory - occasionally prebuilt images get uploaded there, though for the change information you will have to read the ska-sa/katcp_devel commits.

FWIW, we are using the boot/uImage-r2borph3 kernel image from commit a8da6b6 of that repository. The file command shows it as:

$ file -L /srv/tftpboot/uboot-roach2/uImage-r2borph3
/srv/tftpboot/uboot-roach2/uImage-r2borph3: u-boot legacy uImage, Linux-3.7.0-rc2+, Linux/PowerPC, OS Kernel Image (gzip), 2231485 bytes, Sun Nov 18 23:30:35 2012, Load Address: 0x0050, Entry Point: 0x005010D4, Header CRC: 0x9BDC0E32, Data CRC: 0xF3A1DC96

Interestingly, the (NOT used by PAPER) soloboot uImage kernel image in /dev/mtdblock0 on one of our ROACH2s deployed in South Africa is:

root@r2d020808:~# file -s /dev/mtdblock0
/dev/mtdblock0: u-boot legacy uImage, Linux-3.4.0-rc3+, Linux/PowerPC, OS Kernel Image (gzip), 2429134 bytes, Tue May 29 17:05:09 2012, Load Address: 0x0050, Entry Point: 0x00500460, Header CRC: 0xCAB17B63, Data CRC: 0x096FD3C7

...while the (NOT used by PAPER) soloboot uImage kernel image in /dev/mtdblock0 on two ROACH2s in our lab is:

root@r2d020813:~# file -s /dev/mtdblock0
/dev/mtdblock0: u-boot legacy uImage, Linux-3.9.0-rc1+, Linux/PowerPC, OS Kernel Image (gzip), 2345540 bytes, Wed Mar 6 02:54:34 2013, Load Address: 0x0050, Entry Point: 0x005010D4, Header CRC: 0xC0B47AFF, Data CRC: 0x9247592F

root@r2d020669:~# file -s /dev/mtdblock0
/dev/mtdblock0: u-boot legacy uImage, Linux-3.9.0-rc1+, Linux/PowerPC, OS Kernel Image (gzip), 2345540 bytes, Wed Mar 6 02:54:34 2013, Load Address: 0x0050, Entry Point: 0x005010D4, Header CRC: 0xC0B47AFF, Data CRC: 0x9247592F

These two ROACH2s were repaired by Digicom (813 for the U72 fix and 669 for vehicular stress). It looks like Digicom is populating the ROACH2 soloboot with a new uImage that is not available in the roach2_nfs_uboot repo. Are different kernels required for netboot vs soloboot, or is this just an oversight?

Richard and/or Peter, I'm curious to know what versions of uImage you have for both your netboot environment and in /dev/mtdblock0 on your ROACH2s. Can you please run the above file commands on your uImages and report back with the results? This will hopefully help us zero in on where the problem is (and where/when it was corrected).

Thanks,

Dave
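[Editor's illustration] The version, build date, size, and CRC fields that `file` reports above all come from the 64-byte legacy uImage header at the start of the image, so they can also be read without the `file` utility. Below is a minimal, hedged parser sketch in Python. The packed header here is synthetic, built only to show the layout; the load address 0x00500000 and the os/arch/type/comp byte values are assumptions for illustration, not taken from a real image dump.

```python
import struct
from datetime import datetime, timezone

# Legacy uImage header: 64 bytes, big-endian (per U-Boot's image.h):
# magic, header CRC, build time, data size, load addr, entry point,
# data CRC, then os/arch/type/comp bytes and a 32-byte image name.
UIMAGE_FMT = ">7I4B32s"
UIMAGE_MAGIC = 0x27051956

def parse_uimage_header(hdr):
    fields = struct.unpack(UIMAGE_FMT, hdr[:64])
    magic, hcrc, build_time, size, load, ep, dcrc = fields[:7]
    assert magic == UIMAGE_MAGIC, "not a legacy uImage"
    return {
        "built": datetime.fromtimestamp(build_time, tz=timezone.utc),
        "size": size,
        "load_addr": load,
        "entry_point": ep,
        "data_crc": dcrc,
    }

# Synthetic header using the build time Dave's uImage reports
# (Mon Nov 19 07:30:35 2012 UTC); other values are illustrative only.
ts = int(datetime(2012, 11, 19, 7, 30, 35, tzinfo=timezone.utc).timestamp())
hdr = struct.pack(UIMAGE_FMT, UIMAGE_MAGIC, 0, ts, 2231485,
                  0x00500000, 0x005010D4, 0xF3A1DC96,
                  5, 7, 2, 1, b"Linux-3.7.0-rc2+".ljust(32, b"\0"))
info = parse_uimage_header(hdr)
print(info["built"].strftime("%a %b %d %H:%M:%S %Y"))  # Mon Nov 19 07:30:35 2012
```

On a ROACH running busybox (where `file` may be absent), the same 64 bytes could be read straight from /dev/mtdblock0 and fed to a parser like this.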
Re: [casper] Problem about the adc frequency in PAPER model.
Hi Dave,

Though I am quite new to the uImage system, I did suspect that the uImage could be causing this problem. Do you remember the roach of mine that does not work normally in netboot, which I mentioned previously? It works fine with soloboot. Interestingly, another of my roaches worked fine with netboot before but now does not work in soloboot! (Using telnet, /boffiles could not be found in the soloboot roach linux, while it could on the others.)

I checked the uImage on the does-not-work-in-soloboot roach. Since I am using soloboot now, the file command is not available in busybox:

~ # file
-sh: file: not found

but the messages during the soloboot boot process report the uImage like this:

Image Name: Linux-3.4.0-rc3+
Image Type: PowerPC Linux Kernel Image (gzip compressed)
Data Size: 2429134 Bytes = 2.3 MiB
Load Address: 0050
Entry Point: 00500460
Verifying Checksum ... OK
Uncompressing Kernel Image ... OK

With the same roach running netboot, logged in as root over ssh, the information looks like this:

root@pf1:~# file -s /dev/mtdblock0
/dev/mtdblock0: u-boot legacy uImage, Linux-3.4.0-rc3+, Linux/PowerPC, OS Kernel Image (gzip), 2429134 bytes, Tue May 29 15:05:09 2012, Load Address: 0x00507

For netboot, we also use uImage-r2borph3. On our PC, the information looks like this:

[peter@roachserver ~]$ file -L /srv/roach_boot/boot/uImage
/srv/roach_boot/boot/uImage: u-boot legacy uImage, Linux-3.7.0-rc2+, Linux/PowerPC, OS Kernel Image (gzip), 2231485 bytes, Mon Nov 19 15:30:35 2012, Load Address: 0x0050, Entry Point: 0x005010D4, Header CRC: 0x9BDC0E32, Data CRC: 0xF3A1DC96

(I am not sure why the date is not the same as yours: Mon Nov 19 15:30:35 2012.)

For more information, the other roaches work in both soloboot and netboot. Their soloboot information during the boot process:

## Booting kernel from Legacy Image at f800 ...
Image Name: Linux-3.9.0-rc1+
Image Type: PowerPC Linux Kernel Image (gzip compressed)
Data Size: 2345540 Bytes = 2.2 MiB
Load Address: 0050
Entry Point: 005010d4
Verifying Checksum ... OK
Uncompressing Kernel Image ... OK

Since the does-not-work-in-soloboot roach reports the image name Linux-3.4.0-rc3+ in soloboot, I am not sure whether it is the Linux version in soloboot that matters. Jason once mentioned a similar question to me; he also sent me the latest binary romfs for soloboot: https://www.mail-archive.com/casper%40lists.berkeley.edu/msg05393.html

Hope this information is helpful to our question! Thanks for your warm help with the PAPER model!

Peter

PS: I also found a new version on https://github.com/ska-sa/roach2_nfs_uboot uploaded on Nov 12, 2014. I will try it later.

At 2014-11-14 08:40:35, David MacMahon dav...@astro.berkeley.edu wrote:

> Thanks, Marc,
> [...]
Re: [casper] Problem about the adc frequency in PAPER model.
Hi, Peter,

Thanks for this information!

On Nov 13, 2014, at 7:22 PM, Peter Niu wrote:

> In our PC, the information like this:
>
> [peter@roachserver ~]$ file -L /srv/roach_boot/boot/uImage
> /srv/roach_boot/boot/uImage: u-boot legacy uImage, Linux-3.7.0-rc2+, Linux/PowerPC, OS Kernel Image (gzip), 2231485 bytes, Mon Nov 19 15:30:35 2012, Load Address: 0x0050, Entry Point: 0x005010D4, Header CRC: 0x9BDC0E32, Data CRC: 0xF3A1DC96
>
> (I am not sure why the date is not the same as yours: Mon Nov 19 15:30:35 2012)

At 2014-11-14 08:40:35, David MacMahon dav...@astro.berkeley.edu wrote:

> $ file -L /srv/tftpboot/uboot-roach2/uImage-r2borph3
> /srv/tftpboot/uboot-roach2/uImage-r2borph3: u-boot legacy uImage, Linux-3.7.0-rc2+, Linux/PowerPC, OS Kernel Image (gzip), 2231485 bytes, Sun Nov 18 23:30:35 2012, Load Address: 0x0050, Entry Point: 0x005010D4, Header CRC: 0x9BDC0E32, Data CRC: 0xF3A1DC96

These are the same uImage. The length, header CRC, and data CRC match. The timestamps differ by 16 hours, but I think that's because the timestamp is printed in the local timezone. If you do:

env TZ=UTC file -L /srv/roach_boot/boot/uImage

...you will get a timestamp of Mon Nov 19 07:30:35 2012.

This means that the uImage file is NOT the cause of the problem, since the same version works for us but not for you. I think this might leave only the tcpborphserver version as the cause of the problem. Could it be anything else?

Can you please run:

telnet pf1 7147

(Type CTRL-] then q then ENTER to quit.) against both the soloboot and netboot environments and let me know the results?

Thanks again,

Dave
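[Editor's illustration] Dave's 16-hour explanation can be reproduced outside of `file`: the uImage header stores a single UTC build timestamp, and only the local rendering differs between sites. A small Python check; the two fixed offsets below are assumptions standing in for a US Pacific (UTC-8, winter) and a China Standard Time (UTC+8) workstation, which is exactly a 16-hour spread.

```python
from datetime import datetime, timezone, timedelta

# One UTC build timestamp, as stored in the uImage header.
utc = datetime(2012, 11, 19, 7, 30, 35, tzinfo=timezone.utc)

pst = timezone(timedelta(hours=-8))  # assumed: US Pacific standard time
cst = timezone(timedelta(hours=8))   # assumed: China Standard Time

# The same instant rendered in each local zone, 16 hours apart:
print(utc.astimezone(pst).strftime("%a %b %d %H:%M:%S %Y"))  # Sun Nov 18 23:30:35 2012
print(utc.astimezone(cst).strftime("%a %b %d %H:%M:%S %Y"))  # Mon Nov 19 15:30:35 2012
```

These match the two `file` outputs in the thread, which is why the length and CRCs, not the printed date, are the fields to compare.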
Re: [casper] Problem about the adc frequency in PAPER model.
Hi All,

I'm happy to tell you that the PAPER model can finally run without overflow! I found that the bof file, whether the PAPER model or my own, could run at 200 MHz with the correct packet structure; it is the system setup on the roach that matters (thanks to Marc's help with soloboot!). I tried soloboot on the roach, and it works fine for the model. I don't know why the netboot setup is not OK (it influenced the frequency too much, I guess). However, FWIW, the overflow problem that accompanied me for a few weeks is finally solved! I can have a good sleep tonight. Thanks for your warm help!

Peter

At 2014-11-08 03:10:47, David MacMahon dav...@astro.berkeley.edu wrote:

> Hi, Richard,
>
> I think that your 1 PPS should be very usable. I think we typically generate the 1 PPS from a GPS clock. If you want to try a test, you could disconnect the 1 PPS and use the software generated sync signal as per the earlier emails. If that works and using the external 1 PPS doesn't, then you will have found the problem. I'd be surprised (but happy!) if that turns out to be the problem.
>
> Dave
> [...]
Re: [casper] Problem about the adc frequency in PAPER model.
Wow. Well that seemed to be the magic bullet. Thanks! Any ideas why this works? Is it because of an NFS lock-out or a 10-GbE driver issue in the NFS kernel image? In any case, this is a tremendous discovery! Thanks to all for all the effort!

Richard

On Wednesday, November 12, 2014, 牛晨辉 peterniu...@163.com wrote:

> Hi All,
>
> I'm happy to tell you that the PAPER model can finally run without overflow! I found that the bof file, whether the PAPER model or my own, could run at 200 MHz with the correct packet structure; it is the system setup on the roach that matters (thanks to Marc's help with soloboot!). I tried soloboot on the roach, and it works fine for the model. I don't know why the netboot setup is not OK (it influenced the frequency too much, I guess). However, FWIW, the overflow problem that accompanied me for a few weeks is finally solved! I can have a good sleep tonight. Thanks for your warm help!
>
> Peter
>
> At 2014-11-08 03:10:47, David MacMahon dav...@astro.berkeley.edu wrote:
> [...]

-- Richard Black
Re: [casper] Problem about the adc frequency in PAPER model.
Hi all,

Haven't heard anything for a while, so I thought I would add some more detail about our system setup to see if it might shed some light on the problem:

1 PPS Signal - Square pulse
  Frequency: 1 Hz
  Amplitude: 3 Vpp
  Offset: 0 V
  Width: 10 ms
  Edge Time: 5 ns

ADC Clock - CW Tone
  Frequency: 200 MHz
  Power: -9 dBm

For David, are there any red flags with our UBoot version or ROACH CPLD? Here they are again for reference:

From serial interface after ROACH reboot
==
U-Boot 2011.06-rc2-0-g2694c9d-dirty (Dec 04 2013 - 20:58:06)
...
CPLD: 2.1
==

Thanks!

Richard Black

On Tue, Nov 4, 2014 at 12:05 PM, Richard Black aeldstes...@gmail.com wrote:

Hi David,

Comments below:

Richard Black

On Mon, Nov 3, 2014 at 3:51 PM, David MacMahon dav...@astro.berkeley.edu wrote:

Hi, Richard,

On Nov 3, 2014, at 11:47 AM, Richard Black wrote:

So, it's been a little while now, but not much has changed yet. We've gotten Chipscope working, and, so far, there aren't any red flags with the FPGA firmware 10-GbE control signals.

That's good to know, although maybe in some way it would have been nice if you had found some red flags.

We also confirmed that the bitstream we are using is in fact roach2_fengine_2013_Oct_14_1756.bof.gz, so that is unfortunately not the problem.

At least you are using a known good BOF file, so that eliminates a source of potential errors.

I also took a look at the ROACH2 PPC setup: we pulled from the .git repository on February 12, 2014 (commit number = e14df9016c3b7ccba62cc6d0cae05405f4929c94). There haven't been any changes to that repository since August 2013, so unless the SKA-SA ROACH-2s are using a pull from before then, I don't think that is our issue.

We use our own homegrown NFS root filesystem for the ROACH2s, so I can't comment on the status of the one you refer to (https://github.com/ska-sa/roach2_nfs_uboot.git). I am more interested in the U-Boot version you have (see https://github.com/ska-sa/roach2_uboot.git) and which version of the ROACH2 CPLD image you are using (not sure where to get this). I think these are unlikely to be problematic, but we've already checked all the likely problems.

When I rebooted the ROACH-2, I got the following header for U-Boot:

U-Boot 2011.06-rc2-0-g2694c9d-dirty (Dec 04 2013 - 20:58:06)
...
CPLD: 2.1

Hope this is informative.

We also tried out Jason Manley's suggestion of delaying the enabling of the 10-GbE cores to ensure that the sync pulse propagated through the entire system before buffering up data, but the problem persisted.

Do you have an external 1 PPS sync pulse connected, or have you tried the latest rb-papergpu software that supports a software-generated sync? The paper_feng_init.rb script already disables the data flow to the 10 GbE cores until the sync pulse has propagated through and the cores have been taken out of reset.

We are using an external 1 PPS sync pulse. However, we are certain that it's set up correctly. Although, this could just be me grasping at straws since nothing else seems to solve the problem. How would we go about setting up the software-generated pulse?

Does the latest rb-papergpu code show that the ADC clocks (MMCMs) are locked? Does it estimate the clock frequency correctly? Does adc16_dump_chans.rb show samples that correspond correctly to the analog inputs (e.g. a CW tone)?

I've attached an image of the output from xtor_up.sh -f 1 with the latest rb-papergpu code. Nothing significant to note: the clock reads ~200 MHz. I've also attached an image of the output from adc16_dump_chans.rb, where A1 has a CW tone with a 10-MHz 40-V emf signal. You can see the oscillations in the first column and noise everywhere else.

Just to rule it out, I double-checked (or more accurately triple-checked) the U72 part, and, sure enough, it is the correct oscillator, model number EEG-2121.

Does it have the L suffix on the 100.000L frequency part of the chip markings?

Yes, it does.

On a related note, as I sent off-list to you and Peter earlier today: The fact that Peter can send small packets at 200 MHz without overflow, but large packets give overflow, is very interesting and puzzling. I assume that the smaller packets are just fewer channels of the same length spectrum and that the number of packets per second remains the same (I think we discussed this previously). In that case, the small packets reduce the data rate, which suggests that the 156.25 MHz xaui_ref_clk clock is maybe not really 156.25 MHz but something somewhat slower. This clock is driven by the oscillator at U56 and the clock splitter at U54 (see attached schematic snippet). Can you please inspect those parts on your board(s)? I will be able to inspect a ROACH2 this afternoon and report what I have on a known working system. On one
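[Editor's illustration] Dave's xaui_ref_clk reasoning can be sanity-checked with back-of-the-envelope arithmetic. This sketch assumes the usual 10 GbE arrangement of a 64-bit datapath clocked at the 156.25 MHz reference; the F-engine output rate used below is a made-up illustrative figure, not a measured PAPER number.

```python
# Nominal 10 GbE capacity from the XAUI reference clock: 64 bits per
# 156.25 MHz clock cycle gives exactly 10 Gb/s.
xaui_ref_hz = 156.25e6
capacity_bps = xaui_ref_hz * 64
assert capacity_bps == 10e9

# If the U56 oscillator ran, say, 1% slow, capacity drops by 100 Mb/s.
slow_capacity_bps = capacity_bps * 0.99

# Hypothetical steady F-engine output rate just under line rate:
# fine at nominal capacity, but overflowing on the slow clock.
feng_output_bps = 9.95e9
print(feng_output_bps <= capacity_bps)       # True  (no overflow)
print(feng_output_bps <= slow_capacity_bps)  # False (TX FIFO overflows)
```

This matches the observed symptom: shrinking the packets (and hence the data rate) would hide a slightly slow transmit clock, while full-size packets would overflow.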
Re: [casper] Problem about the adc frequency in PAPER model.
you mentioned your 1 PPS is a square wave. that's different from everyone else's 1 PPS: standard 1 PPS systems output a pulse that is high for about 1 uS. (extremely low duty cycle). i don't know if a square wave could be a problem - my guess is that the correlator design uses an edge detection block, so is only sensitive to edges, not levels, but it might be worth investigating. best wishes, dan On Fri, Nov 7, 2014 at 9:03 AM, Richard Black aeldstes...@gmail.com wrote: Hi all, Haven't heard anything for a while, so I thought I would add some more detail about our system setup to see if it might shed some light on the problem: 1 PPS Signal - Square pulse Frequency: 1 Hz Amplitude: 3 Vpp Offset: 0 V Width: 10 ms Edge Time: 5 ns ADC Clock - CW Tone Frequency: 200 MHz Power: -9 dBm For David, are there any red flags with our UBoot version or ROACH CPLD? Here they are again for reference: From serial interface after ROACH reboot == U-Boot 2011.06-rc2-0-g2694c9d-dirty (Dec 04 2013 - 20:58:06) ... CPLD: 2.1 == Thanks! Richard Black On Tue, Nov 4, 2014 at 12:05 PM, Richard Black aeldstes...@gmail.com wrote: Hi David, Comments below: Richard Black On Mon, Nov 3, 2014 at 3:51 PM, David MacMahon dav...@astro.berkeley.edu wrote: Hi, Richard, On Nov 3, 2014, at 11:47 AM, Richard Black wrote: So, it's been a little while now, but not much has changed yet. We've gotten Chipscope working, and, so far, there aren't any red flags with the FPGA firmware 10-GbE control signals. That's good to know, although maybe in some way it would have been nice if you had found some red flags. We also confirmed that the bitstream we are using is in fact roach2_fengine_2013_Oct_14_1756.bof.gz, so that is unfortunately not the problem. At least you are using a known good BOF file, so that eliminates a source of potential errors. I also took a look at the ROACH2 PPC setup: we pulled from the .git repository on February 12, 2014 (commit number = e14df9016c3b7ccba62cc6d0cae05405f4929c94). 
There haven't been any changes to that repository since August 2013, so unless the SKA-SA ROACH-2s are using a pull from before then, I don't think that is our issue. We use our own homegrown NFS root filesystem for the ROACH2s, so I can't comment on the status of the one you refer to (https://github.com/ska-sa/roach2_nfs_uboot.git).
Re: [casper] Problem about the adc frequency in PAPER model.
Dan, We aren't using a square wave. It's a pulse function, but that pulse's shape can be easily described as a very thin square pulse. However, you are saying that the pulse is high for only 1 us? That is much shorter than what we are doing. I'll see if I can twiddle that down. Thanks, Richard Black
Re: [casper] Problem about the adc frequency in PAPER model.
Also, at least for many ADC boards that have a PPS input, the signal is connected to a 50 ohm resistor to ground and then goes into a TTL to LVDS converter chip. You mentioned 3 Vpp and 0 V offset, so that sounds like the signal is mostly at -1.5 V and then pulses up to +1.5 V. I would suggest a positive only waveform; 0 V pulsing up to 3 V would be better. Glenn
Re: [casper] Problem about the adc frequency in PAPER model.
seconding glenn, the 1 PPS pulse should be 0 to +3 volts when terminated in 50 ohms (when connected to the roach board). (that's 0 to 5 or 6 volts when not terminated). the 1 PPS pulse should not go negative. i suggest a pulse width of 1 uS (not 10 ms). best wishes, dan
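Dan's terminated vs. unterminated levels follow from a simple voltage divider: a generator with a 50-ohm output impedance driving the ROACH's 50-ohm PPS termination delivers half its open-circuit voltage. A minimal sketch (the function name is mine):

```python
def terminated_level_v(open_circuit_v, source_ohms=50.0, term_ohms=50.0):
    """Voltage across the termination for a source with the given output
    impedance (simple resistive divider)."""
    return open_circuit_v * term_ohms / (source_ohms + term_ohms)

# A 6 V open-circuit pulse lands at 3 V across a matched 50-ohm input,
# consistent with Dan's "0 to 5 or 6 volts when not terminated" figure.
print(terminated_level_v(6.0))  # 3.0
```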
Re: [casper] Problem about the adc frequency in PAPER model.
Hi Glenn, Richard, and all, First, do you think -9 dBm is a proper ADC clock level? I checked the manual on the CASPER website and it said +6 dBm; I suspected that was too big, so I used -1 dBm. Second, is it possible that the data received by Wireshark on the HPC is out of order? Is Wireshark's read order correct? When I use Wireshark, I receive packets where the header shows up in the middle of the packet, so I suspect the read order is not correct. Best Regards! peter -- Sent from NetEase Mail for Android
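To put the dBm figures being discussed in perspective, power into a 50-ohm load converts to the peak-to-peak amplitude of a sine wave as follows (a standard conversion, not something stated in the thread):

```python
import math

def dbm_to_vpp(dbm, load_ohms=50.0):
    """Peak-to-peak voltage of a sine wave at `dbm` into `load_ohms`."""
    watts = 10.0 ** (dbm / 10.0) / 1000.0     # dBm -> watts
    v_rms = math.sqrt(watts * load_ohms)      # P = Vrms^2 / R
    return 2.0 * math.sqrt(2.0) * v_rms       # Vpp = 2*sqrt(2)*Vrms

# +6 dBm is only ~1.26 Vpp into 50 ohms, while -9 dBm is ~0.22 Vpp,
# which illustrates how much weaker a -9 dBm clock drive is.
```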
Re: [casper] Problem about the adc frequency in PAPER model.
Hi, Richard, On Nov 7, 2014, at 9:03 AM, Richard Black wrote: Haven't heard anything for a while, so I thought I would add some more detail about our system setup to see if it might shed some light on the problem: 1 PPS Signal - Square pulse Frequency: 1 Hz Amplitude: 3 Vpp Offset: 0 V Width: 10 ms Edge Time: 5 ns That should be fine assuming the 3 Vpp is measured with the 50 ohm termination in place. If you want to try a software sync, you can pass -S (UPPERcase!) to the latest paper_feng_init.rb script. Check the output of paper_feng_init.rb --help to see whether your version supports that option. ADC Clock - CW Tone Frequency: 200 MHz Power: -9 dBm It would be a good idea to increase the power level to +6 dBm as described on this wiki page: https://casper.berkeley.edu/wiki/ADC16x250-8_coax_rev_2#ADC16x250-8_coax_rev_2_Inputs But if the paper_feng_init.rb script reports that the ADC clocks are locked and they measure approximately 200 MHz, then I think this is unlikely to be the cause of the 10 GbE overflow problems (though it would be great if the fix were this simple!). For David, are there any red flags with our UBoot version or ROACH CPLD? Here they are again for reference: From serial interface after ROACH reboot == U-Boot 2011.06-rc2-0-g2694c9d-dirty (Dec 04 2013 - 20:58:06) ... CPLD: 2.1 == This matches one of our ROACH2s that is running and sending 10 GbE packets in our lab:

U-Boot 2011.06-rc2-0-g2694c9d-dirty (Dec 04 2013 - 20:58:06)
CPU: AMCC PowerPC 440EPx Rev. A at 533.333 MHz (PLB=133 OPB=66 EBC=66)
No Security/Kasumi support
Bootstrap Option C - Boot ROM Location EBC (16 bits)
32 kB I-Cache 32 kB D-Cache
Board: ROACH2
I2C: ready
DRAM: 512 MiB
Flash: 128 MiB
In: serial
Out: serial
Err: serial
CPLD: 2.1
USB: Host(int phy)
SN:ROACH2.2 batch=D#6#69
software fixups match
MAC: 02:44:01:02:06:45
DTT: 1 is 23 C
DTT: 2 is 26 C
Net: ppc_4xx_eth0

Hope this helps, Dave
Re: [casper] Problem about the adc frequency in PAPER model.
Thanks David and all, I unfortunately misspoke about the power of the ADC clock signal. In fact, we had it at 9 dBm, not -9. Sorry for any confusion. I set up the pulse generator to swing from +0.0 to +3.0 V at 1 us. To check for possible ringing, I also hooked up our pulse generator to an oscilloscope (I increased the pulse width to 10 ms so I could see it). The waveform I observe has some severe overshoot on both the rising and falling edges. I've attached a drawing to explain what I mean. I can't seem to mitigate this overshoot with our little Agilent arbitrary waveform generator. Is this similar to the ringing seen at NRAO? If so, how do casperites generate their 1 PPS? Thanks, Richard Black
Re: [casper] Problem about the adc frequency in PAPER model.
Hi, Peter, Here is a tcpdump snapshot of the first part of a PAPER packet. The data from tcpdump includes the headers from the other network layers that encapsulate the application data. Here is the output:

$ sudo tcpdump -i eth4 -s 100 -xx -c 1 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth4, link-type EN10MB (Ethernet), capture size 100 bytes
20:36:04.678013 IP 10.10.4.1.8511 > 10.0.4.54.8511: UDP, length 8208
0x0000: ffff ffff ffff 0202 c0a8 0401 0800 4500
0x0010: 202c 0000 4000 ff11 3f80 0a0a 0401 0a00
0x0020: 0436 213f 213f 2018 0000 0006 e74d 2d6d
0x0030: 0510 c003 1f1f f200 eefe 0fed e2dd dbf0
0x0040: e00f e5c3 eef4 03e2 ff11 31ed 1011 1e3c
0x0050: 4ce5 f342 10bf 1ff9 1f2a 9f26 e334 4e60
0x0060: 1010 1ff2 ...

Here is a breakdown of what is there...

# Ethernet Header
Note the broadcast destination MAC (ff:ff:ff:ff:ff:ff) is used because this is a direct connection from ROACH2 to 10 GbE NIC (i.e. no switch).
0x0000: ffff ffff ffff 0202 c0a8 0401 0800

# IP Header
Note the source IP (10.10.4.1) and destination IP (10.0.4.54) in the last 8 octets.
0x0000: 4500
0x0010: 202c 0000 4000 ff11 3f80 0a0a 0401 0a00
0x0020: 0436

# UDP Header
PAPER uses port 8511 (0x213f) because US Letter Size paper is 8.5x11 inches. :-) The same port number is used for both source and destination ports. 0x2018 is the UDP packet length == UDP header length + application packet length. Here we have 8216 == 8 + 8208.
0x0020: 213f 213f 2018 0000

# PAPER Packet (finally!)
The first 6 bytes are MCOUNT (0x0006e74d2d6d). The next 1 byte is FID (5). The next 1 byte is XID (16). The next 8192 bytes (not all shown) are the data. The final 8 bytes (not shown) are 4 bytes CRC + 4 bytes of zeros. The CRC is of the PAPER header and data (mcount+fid+xid+data).
0x0020: 0006 e74d 2d6d
0x0030: 0510 c003 1f1f f200 eefe 0fed e2dd dbf0
0x0040: e00f e5c3 eef4 03e2 ff11 31ed 1011 1e3c
0x0050: 4ce5 f342 10bf 1ff9 1f2a 9f26 e334 4e60
0x0060: 1010 1ff2 ...

Hope this helps, Dave
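Dave's byte-level breakdown can be turned into a tiny parser for the PAPER application header (6-byte MCOUNT, 1-byte FID, 1-byte XID at the start of the UDP payload). A hedged sketch; the function name is mine:

```python
def parse_paper_header(payload: bytes):
    """Decode MCOUNT/FID/XID from the first 8 bytes of a PAPER UDP payload,
    following the byte layout in Dave's tcpdump walkthrough."""
    if len(payload) < 8:
        raise ValueError("payload too short for a PAPER header")
    mcount = int.from_bytes(payload[:6], "big")  # 6-byte big-endian counter
    fid = payload[6]                             # F-engine ID
    xid = payload[7]                             # X-engine ID
    return mcount, fid, xid

# The first payload bytes from the capture above:
mcount, fid, xid = parse_paper_header(bytes.fromhex("0006e74d2d6d0510"))
# mcount == 0x0006e74d2d6d, fid == 5, xid == 16
```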
Re: [casper] Problem about the adc frequency in PAPER model.
Hi, Richard, I think that your 1 PPS should be very usable. I think we typically generate the 1 PPS from a GPS clock. If you want to try a test, you could disconnect the 1 PPS and use the software generated sync signal as per the earlier emails. If that works and using the external 1 PPS doesn't, then you will have found the problem. I'd be surprised (but happy!) if that turns out to be the problem. Dave
Re: [casper] Problem about the adc frequency in PAPER model.
David, Well, unfortunately, using only the software-generated sync did not fix the packet overflow issue. :-( Richard Black On Fri, Nov 7, 2014 at 12:10 PM, David MacMahon dav...@astro.berkeley.edu wrote: [...]
Re: [casper] Problem about the adc frequency in PAPER model.
David, So, it's been a little while now, but not much has changed yet. We've gotten Chipscope working, and, so far, there aren't any red flags with the FPGA firmware 10-GbE control signals. We also confirmed that the bitstream we are using is in fact roach2_fengine_2013_Oct_14_1756.bof.gz, so that is unfortunately not the problem. I also took a look at the ROACH2 PPC setup: we pulled from the .git repository on February 12, 2014 (commit number = e14df9016c3b7ccba62cc6d0cae05405f4929c94). There haven't been any changes to that repository since August 2013, so unless the SKA-SA ROACH-2s are using a pull from before then, I don't think that is our issue. We also tried out Jason Manley's suggestion of delaying the enabling of the 10-GbE cores to ensure that the sync pulse propagated through the entire system before buffering up data, but the problem persisted. Just to rule it out, I double-checked (or more accurately triple-checked) the U72 part, and, sure enough, it is the correct oscillator, model number EEG-2121. There is another possibility, albeit an unlikely problem: we currently have the ROACH-2 board booting off another PC (i.e. not the same PC that the ruby control scripts are running on). I can't imagine that this is the problem, but I'm planning on trying to consolidate the NFS and ruby scripts onto a single PC to rule it out. So I suppose at this point, my questions are: (1) What version of the roach2_nfs_uboot .git repository are SKA-SA using? (2) Is SKA-SA using the same PCs for ROACH-2 net boots and file systems as the ruby control scripts? (3) Are there any additional steps that need to be taken when installing the Quad SFP+ mezzanine cards onto the ROACH-2 board? Are there potentially some drivers or configuration steps that are needed to make sure they function properly? As I recall, when we got the boards, we didn't do anything special with the cards outside of simply plugging them in. Again, thanks for your patient advice and suggestions. 
Richard Black On Mon, Oct 27, 2014 at 2:26 PM, David MacMahon dav...@astro.berkeley.edu wrote: Hi, Richard, On Oct 27, 2014, at 9:25 AM, Richard Black wrote: This is a reportedly fully-functional model that shouldn't require any major changes in order to operate. However, this has clearly not been the case in at least two independent situations (us and Peter). This begs the question: what's so different about our use of PAPER? I just verified that the roach2_fengine_2013_Oct_14_1756.bof.gz file is the one being used by the PAPER correlator currently fielded in South Africa. It is definitely a fully functional model. That image (and all source files for it) is available from the git repo listed on the PAPER Correlator Manifest page of the CASPER Wiki: https://casper.berkeley.edu/wiki/PAPER_Correlator_Manifest We, at BYU, have made painstakingly sure that our IP addressing schemes, switch ports, and scripts are all configured correctly (thanks to David MacMahon for that, btw), but we still have hit the proverbial brick wall of 10-GbE overflow. When I last corresponded with David, he explained that he remembers having a similar issue before, but can't recall exactly what the problem was. Really? I recall saying that I often forget about increasing the MTU of the 10 GbE switch and NICs. I don't recall saying that I had a similar issue before but couldn't remember the problem. In any case, the fact that by turning down the ADC clock prior to start-up prevents the 10-GbE core from overflowing is a major lead for us at BYU (we've been spinning our wheels on this issue for several months now). By no means are we proposing mid-run ADC clock modifications, but this appears to be a very subtle (and quite sinister, in my opinion) bug. Any thoughts as to what might be going on? I cannot explain the 10 GbE overflow that you and Peter are experiencing. I have pushed some updates to the rb-papergpu.git repository listed on the PAPER Correlator Manifest page. 
The paper_feng_init.rb script now verifies that the ADC clocks are locked and provides options for issuing a software sync (only recommended for lab use) and for not storing the time of synchronization in redis (also only recommended for lab use). The 10 GbE cores can overflow if they are fed valid data (i.e. tx_valid=1) while they are held in reset. Since you are using the paper_feng_init.rb script, this should not be happening (unless something has gone wrong during the running of that script) because that script specifically and explicitly disables the tx_valid signal before putting the cores into reset and it takes the cores out of reset before enabling the tx_valid signal. So assuming that this is not the cause of the overflows, there must be something else that is causing the 10 GbE cores to be unable to transmit data fast enough to keep up with the data stream it is being fed. Two things that could cause this are 1) running the
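The ordering constraint David describes (tx_valid deasserted before the cores go into reset, reset released before tx_valid is re-enabled) can be illustrated with a toy model. This is purely illustrative: the class and methods below are hypothetical and do not correspond to any real KATCP or gateware API.

```ruby
# Toy model of a 10 GbE TX core: presenting valid data while the core
# is held in reset corrupts its FIFO (mimicking the reported failure mode).
class ToyTxCore
  attr_reader :overflowed

  def initialize
    @in_reset   = true
    @overflowed = false
  end

  def release_reset
    @in_reset = false
  end

  # One fabric clock cycle; tx_valid high during reset is the hazard
  # that paper_feng_init.rb explicitly avoids.
  def clock(tx_valid)
    @overflowed = true if tx_valid && @in_reset
  end
end

# Wrong order: data enabled before the core leaves reset.
bad = ToyTxCore.new
bad.clock(true)
bad.release_reset

# Script order: tx_valid held low through reset, enabled only afterwards.
good = ToyTxCore.new
good.clock(false)
good.release_reset
good.clock(true)

puts bad.overflowed    # true
puts good.overflowed   # false
```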
Re: [casper] Problem about the adc frequency in PAPER model.
Hi, Richard, On Nov 3, 2014, at 11:47 AM, Richard Black wrote: So, it's been a little while now, but not much has changed yet. We've gotten Chipscope working, and, so far, there aren't any red flags with the FPGA firmware 10-GbE control signals. That's good to know, although maybe in some way it would have been nice if you had found some red flags. We also confirmed that the bitstream we are using is in fact roach2_fengine_2013_Oct_14_1756.bof.gz, so that is unfortunately not the problem. At least you are using a known good BOF file, so that eliminates a source of potential errors. I also took a look at the ROACH2 PPC setup: we pulled from the .git repository on February 12, 2014 (commit number = e14df9016c3b7ccba62cc6d0cae05405f4929c94). There haven't been any changes to that repository since August 2013, so unless the SKA-SA ROACH-2s are using a pull from before then, I don't think that is our issue. We use our own homegrown NFS root filesystem for the ROACH2s, so I can't comment on the status of the one you refer to (https://github.com/ska-sa/roach2_nfs_uboot.git). I am more interested in the U-Boot version you have (see https://github.com/ska-sa/roach2_uboot.git) and which version of the ROACH2 CPLD image you are using (not sure where to get this). I think these are unlikely to be problematic, but we've already checked all the likely problems. We also tried out Jason Manley's suggestion of delaying the enabling of the 10-GbE cores to ensure that the sync pulse propagated through the entire system before buffering up data, but the problem persisted. Do you have an external 1 PPS sync pulse connected or have you tried the latest rb-papergpu software that supports a software-generated sync? The paper_feng_init.rb script already disables the data flow to the 10 GbE cores until the sync pulse has propagated through and the cores have been taken out of reset. Does the latest rb-papergpu code show that the ADC clocks (MMCMs) are locked? 
Does it estimate the clock frequency correctly? Does adc16_dump_chans.rb show samples that correspond correctly to the analog inputs (e.g. a CW tone)? Just to rule it out, I double-checked (or more accurately triple-checked) the U72 part, and, sure enough, it is the correct oscillator, model number EEG-2121. Does it have the L suffix on the 100.000L frequency part of the chip markings? On a related note, as I sent off-list to you and Peter earlier today: The fact that Peter can send small packets at 200 MHz without overflow, but large packets overflow, is very interesting and puzzling. I assume that the smaller packets are just fewer channels of the same length spectrum and that the number of packets per second remains the same (I think we discussed this previously). In that case, the small packets reduce the data rate, which suggests that the 156.25 MHz xaui_ref_clk clock is maybe not really 156.25 MHz but something somewhat slower. This clock is driven by the oscillator at U56 and the clock splitter at U54 (see attached schematic snippet). Can you please inspect those parts on your board(s)? I will be able to inspect a ROACH2 this afternoon and report what I have on a known working system. On one of our ROACH2s U56 is labeled like this: EEG-2121 156.250L OGPN1Z5C Again, note the L suffix. I think that signifies LVDS, which is what is expected/required for the ROACH2. That's very important. I am not 100% sure about my transcription of the third line; it could have typos. There is another possibility, albeit an unlikely problem: we currently have the ROACH-2 board booting off another PC (i.e. not the same PC that the ruby control scripts are running on). I can't imagine that this is the problem, but I'm planning on trying to consolidate the NFS and ruby scripts onto a single PC to rule it out. The scripts communicate with the ROACH2 over the network via KATCP. 
There is no requirement that the scripts be running on the same server that is providing the NFS root filesystem to the ROACH2s. So I suppose at this point, my questions are: (1) What version of the roach2_nfs_uboot .git repository are SKA-SA using? I don't know. (2) Is SKA-SA using the same PCs for ROACH-2 net boots and file systems as the ruby control scripts? I doubt SKA-SA is using ruby, but as stated above the ruby scripts can be run on any system that can reach the ROACH2 via KATCP. (3) Are there any additional steps that need to be taken when installing the Quad SFP+ mezzanine cards onto the ROACH-2 board? Are there potentially some drivers or configuration steps that are needed to make sure they function properly? As I recall, when we got the boards, we didn't do anything special with the cards outside of simply plugging them in. Just plugging them in is all that is necessary. There is a slight complication in that the standoffs might not be exactly the right height and some
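David's xaui_ref_clk suspicion is quantitatively plausible: a TX core that drains one 64-bit word per reference-clock cycle needs exactly 156.25 MHz to sustain 10 Gb/s, so a slow reference clock directly lowers the drain rate. (The one-word-per-cycle figure is an assumption for illustration, not something stated in the thread.)

```ruby
# Line rate sustained by a core that moves one 64-bit word per cycle
# of its reference clock.
def line_rate_bps(ref_clk_hz, bits_per_cycle = 64)
  ref_clk_hz * bits_per_cycle
end

puts line_rate_bps(156.25e6)   # exactly 10 Gb/s
puts line_rate_bps(150.00e6)   # 9.6 Gb/s: a slow clock caps throughput
```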
Re: [casper] Problem about the adc frequency in PAPER model.
Hi all, Sorry for the late reply. First, although the serial numbers of all 8 ROACHes we have are in the range that might be affected, fortunately ours have the correct crystals installed (Epson EEG-2121-100.000L). I reviewed the discussion yesterday. My project's final frequency is 250 MHz, but I didn't turn it up to 250 MHz when I ran the PAPER model. As the initialization shows: [peter@roachserver rb_test]$ ./paper_feng_init.rb roach1:0 initializing roach1 as FID 0 connecting to roach1 roach1 roach2_fengine app/lib revision 47c59e2/cd26bd2 disabling network transmission setting roach1 FID to 0 setting fftshift to 2047 setting eq to 600/1 configuring 10 GbE interfaces setting corner turner mode 0 (8 F engines) arming sync generator(s) arming sync generator(s) storing sync time in redis on redishost seeding noise generators arming noise generator(s) Setting F-Engine inputs to ADC signals resetting network interfaces enable transmission to X engines enable transmission to switch all done The configuration looks OK, but no data is sent out because of the overflow. I agree with David that it may not be the script that matters, because I can use this script to initialize my own model, which is modified from PAPER for our use. What's more, my model can send out data packets from the ROACH at 200 MHz (even at 250 MHz), and the overflow problem has never happened. My model sends data in 4112-byte packets. I also find that neither the PAPER model at 75 MHz nor my model at 200 MHz delivers the correct data structure on my system; I mean the header appears in the middle of the packet. I saw this in Wireshark. I have run adc16_dump_chans.rb while running the PAPER model. 
The result is like the following: [peter@roachserver bin]$ ./adc16_dump_chans.rb -r -v pf1 data snap took 0.363328416 seconds 111.5 112.0 112.1 112.1 127.1 127.1 127.3 127.4 112.2 112.3 111.8 112.0 112.1 112.2 111.6 112.0 112.4 111.6 112.1 112.0 127.0 127.4 127.1 127.3 112.1 111.4 112.0 111.7 127.3 126.7 127.4 126.6 I also downloaded the new script as David suggested, but I hit a NameError: [peter@roachserver bin]$ ./paper_feng_init.rb pf1 initializing pf1 as FID 0 connecting to pf1 ./paper_feng_init.rb:130:in `block in main': undefined local variable or method `a' for main:Object (NameError) from ./paper_feng_init.rb:112:in `map' from ./paper_feng_init.rb:112:in `main' Thanks for your communication and suggestions! peter At 2014-10-28 05:03:14, David MacMahon dav...@astro.berkeley.edu wrote: Hi, Richard and Peter, Another possibility that crossed my mind is perhaps your ROACH2s were from the batch where the incorrect oscillator was installed for U72. This seems unlikely for Richard based on this email (which also describes the incorrect oscillator problem in general): https://www.mail-archive.com/casper@lists.berkeley.edu/msg04909.html Maybe it's worth a double check anyway? Dave On Oct 27, 2014, at 1:41 PM, Richard Black wrote: David, We'll take another close look at what model we are actually using, just to be safe. I went back and looked at our e-mails, and sure enough, you're right. You were referring to the MTU issue as being the problem you tend to suppress all memory of. It was just that you stated it in a separate paragraph, so, out-of-context, I extrapolated that you have had the same problem before. My bad for dragging your good name through the mud. :) We will also update our local repositories, in the event some bizarre race condition exists on our end. I didn't know that the buffer could fill up while reset was asserted. We'll definitely have to check up on that too. 
We haven't tried dumping raw ADC data yet since we have been trying to get the data link working first. After that, we were planning to inject signal and examine outputs. Thanks, Richard Black On Mon, Oct 27, 2014 at 2:26 PM, David MacMahon dav...@astro.berkeley.edu wrote: [...]
Re: [casper] Problem about the adc frequency in PAPER model.
Hi, Peter, On Oct 28, 2014, at 5:34 AM, peter wrote: First, Though the serial number of all 8 roaches we have are in the range that might got wrong,fortunately, ours are installed the correct crystals (Epson EEG-2121-100.000L). Thanks for checking. That eliminates one potential cause of the problem. I have run the adc16_dump_chans.rb when I run PAPER model. The result is like flowing: [peter@roachserver bin]$ ./adc16_dump_chans.rb -r -v pf1 data snap took 0.363328416 seconds 111.5 112.0 112.1 112.1 127.1 127.1 127.3 127.4 112.2 112.3 111.8 112.0 112.1 112.2 111.6 112.0 112.4 111.6 112.1 112.0 127.0 127.4 127.1 127.3 112.1 111.4 112.0 111.7 127.3 126.7 127.4 126.6 The '-r' option tells the script to output the RMS of the 32 inputs. Those RMS values are very, very high. A full scale sine wave would have an RMS of only 90. What signals are driving the ADC inputs? If you don't pass '-r' then it will dump 1K of samples from each input (one column per input, one row per sample). What does that show? [peter@roachserver bin]$ ./paper_feng_init.rb pf1 initializing pf1 as FID 0 connecting to pf1 ./paper_feng_init.rb:130:in `block in main': undefined local variable or method `a' for main:Object (NameError) from ./paper_feng_init.rb:112:in `map' from ./paper_feng_init.rb:112:in `main' Sorry about that copy/paste error! I have pushed a fix. Hope this helps to get us closer to understanding this problem, Dave
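David's rule of thumb (a full-scale sine in an 8-bit ADC has an RMS of only about 90) is just amplitude divided by the square root of two, i.e. 127/sqrt(2) ≈ 89.8. A quick Ruby check:

```ruby
# RMS of a full-scale sine in an 8-bit signed ADC (amplitude 127).
n = 4096
amplitude = 127.0
samples = (0...n).map { |i| amplitude * Math.sin(2 * Math::PI * i / n) }
rms = Math.sqrt(samples.reduce(0.0) { |acc, s| acc + s * s } / n)
puts rms.round(1)   # 89.8, i.e. 127 / sqrt(2)
```

RMS values well above this, like the 111–127 figures in Peter's dump, indicate the inputs are saturating the ADC.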
Re: [casper] Problem about the adc frequency in PAPER model.
Just a note that I don't recommend you adjust FPGA clock frequencies while it's operating. In theory, you should do a global reset in case the PLL/DLLs lose lock during clock transitions, in which case the logic could be in an uncertain state. But the Sysgen flow just does a single POR. A better solution might be to keep the 10GbE cores turned off (enable line pulled low) on initialisation, until things are configured (tgtap started etc), and only then enable the transmission using a SW register. Jason Manley CBF Manager SKA-SA Cell: +27 82 662 7726 Work: +27 21 506 7300 On 25 Oct 2014, at 10:34, peter peterniu...@163.com wrote: Hi Richard, Joe, all, Thanks for your help; it can finally receive packets now! As you pointed out, after enabling the ADC card and running the bof file (./adc_init.rb roach1 bof file) at 200 MHz (or higher), we need to run the f-engine init script (./paper_feng_init.rb roach1:0) at about 75 MHz; that allows the packets to transfer. Then we can turn the frequency higher. However, the final ADC clock frequency only reaches 120 MHz in my experiment; our target ADC frequency is 250 MHz. Maybe I need to run the bof file at a higher ADC frequency first to reach a steady 250 MHz ADC clock. Why does it need to be initialized at a lower frequency and then turned up? That doesn't make sense. Is the hardware going wrong? Since the yellow block adc16x250-8 is designed for 250 MHz, it should be fine at 200 MHz or 250 MHz. What is the final frequency in your experiment? Any reply will be helpful! Best Regards! peter At 2014-10-25 00:36:52, Richard Black aeldstes...@gmail.com wrote: Peter, That's correct. We downloaded the FPGA firmware and programmed the ROACH with the precompiled bitstream. When we didn't get any data beyond that single packet, we stuck some overflow status registers in the model and found that we were overflowing at 1025 64-bit words (i.e. 8200 bytes). We have actually found a way to get packets to flow, but it isn't a good fix. 
When we turn the ADC clock frequency down to about 75 MHz, the packets begin to flow. There is an opinion in our group that the 10-GbE buffer overflow is a transient behavior, and, hence, if we slowly turn up the clock frequency after the ROACH has started up, packets may continue to flow in steady-state operation. We haven't tested this yet, though. Richard Black On Thu, Oct 23, 2014 at 8:39 PM, peter peterniu...@163.com wrote: Hi Richard, All, As you said, the size of the isolated packet changes every time. ): tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on px1-2, link-type EN10MB (Ethernet), capture size 65535 bytes 10:10:55.622053 IP 10.10.2.1.8511 10.10.2.9.8511: UDP, length 4616 Did you download the PAPER gateware from the CASPER wiki (https://casper.berkeley.edu/wiki/PAPER_Correlator_Manifest ) directly? How does the PAPER bof file run on your system? Have you encountered the overflow before? I downloaded and installed the PAPER model as the website says, but the overflow shows up when I run paper_feng_netstat.rb. Thanks for your information. peter At 2014-10-24 09:59:12, Richard Black aeldstes...@gmail.com wrote: Peter, I don't mean to hijack your thread, but we've been having a very similar (and time-absorbing) issue with the PAPER f-engine FPGA firmware here at BYU. Out of curiosity, does this single packet that you're receiving in tcpdump change in size every time you reprogram the ROACH? We've seen this happen, and we're pretty sure that this isolated packet is the 10-GbE buffer flushing when the 10-GbE core is initialized (i.e. the enable signal isn't sync'd with the start of a new packet). Regardless of whether we have the same issue, I'm very interested to see this problem's resolution. 
Good luck, Richard Black On Thu, Oct 23, 2014 at 7:50 PM, peter peterniu...@163.com wrote: Hi Joe, All, I found something this morning: there is one packet sent out from the ROACH when I run the PAPER model, which I captured on the HPC with tcpdump: tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on px1-2, link-type EN10MB (Ethernet), capture size 65535 bytes 09:04:02.757813 IP 10.10.2.1.8511 10.10.2.9.8511: UDP, length 6456 The length is not the expected 8200+8, and is far from the full TX buffer size of 8K+512. And the other packets are stopped by the overflow. I have tried changing the tutorial 2 packet size to 8200 bytes and to 8K+512 bytes; both transfer fine. I also made sure the boundary is indeed 8K+512, because when I change the size to 8K+513 bytes, no data is sent. So the received packet this morning, with length 6456, is well under the limit. But what caused the other packets to overflow? Any suggestions would be helpful! peter At 2014-10-24 00:37:14, Kujawski, Joseph jkujaw...@siena.edu wrote: Peter, By cadence of the broadcast, I mean how often are the 8200 byte packets
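The numbers in this exchange are consistent with each other: the 8200-byte F-engine payload is 1025 64-bit words, one word past an 8 KiB (1024-word) boundary, while Peter's empirically determined TX buffer limit is 8 KiB + 512 bytes. A quick check of the arithmetic:

```ruby
payload_bytes = 8200
word_bytes    = 8                       # 64-bit words
puts payload_bytes / word_bytes         # 1025 words: one past 1024 (8 KiB)

tx_limit = 8 * 1024 + 512               # boundary Peter found empirically
puts tx_limit                           # 8704 bytes
puts payload_bytes <= tx_limit          # true: 8200 fits under the limit
```

So the payload itself fits the buffer, which supports the view that the overflow is about the rate data is offered to the core, not the size of a single packet.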
Re: [casper] Problem about the adc frequency in PAPER model.
Jason, Thanks for your comments. While I agree that changing the ADC frequency mid-operation is non-kosher and could result in uncertain behavior, the issue at hand for us is to figure out what is going on with the PAPER model that has been published on the CASPER wiki. This naturally won't be (and shouldn't be) the end-all solution to this problem. This is a reportedly fully-functional model that shouldn't require any major changes in order to operate. However, this has clearly not been the case in at least two independent situations (us and Peter). This begs the question: what's so different about our use of PAPER? We, at BYU, have painstakingly made sure that our IP addressing schemes, switch ports, and scripts are all configured correctly (thanks to David MacMahon for that, btw), but we still have hit the proverbial brick wall of 10-GbE overflow. When I last corresponded with David, he explained that he remembers having a similar issue before, but can't recall exactly what the problem was. In any case, the fact that turning down the ADC clock prior to start-up prevents the 10-GbE core from overflowing is a major lead for us at BYU (we've been spinning our wheels on this issue for several months now). By no means are we proposing mid-run ADC clock modifications, but this appears to be a very subtle (and quite sinister, in my opinion) bug. Any thoughts as to what might be going on? Richard Black On Mon, Oct 27, 2014 at 2:41 AM, Jason Manley jman...@ska.ac.za wrote: [...]
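A rough way to see why packets might flow at a 75 MHz ADC clock but overflow at 200 MHz is that the rate offered to each 10 GbE core scales with the fabric clock. The sketch below is built on hypothetical assumptions (fabric clock equal to the ADC clock, one 64-bit word offered per fabric cycle at some duty cycle); the real PAPER F-engine's numbers may differ, but the scaling argument is the point.

```ruby
# Offered rate into a 10 GbE core fed one 64-bit word per fabric clock
# cycle at duty cycle `duty` (illustrative assumptions, not the real design).
def offered_gbps(fabric_clk_mhz, duty = 1.0)
  fabric_clk_mhz * 1e6 * 64 * duty / 1e9
end

puts offered_gbps(75)    # 4.8 Gb/s: comfortably under the 10 Gb/s drain rate
puts offered_gbps(200)   # 12.8 Gb/s: overruns unless the duty cycle is < ~0.78
```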
Re: [casper] Problem about the adc frequency in PAPER model.
I suspect the 10GbE core's input FIFO is overflowing on startup. One key thing with this core is to the ensure that your design keeps the enable port held low until the core's been configured. The core becomes unusable once the TX FIFO overflows. This has been a long-standing bug (my emails trace back to 2009) but it's so easy to work around that I don't think anyone's bothered looking into fixing it. Jason Manley CBF Manager SKA-SA Cell: +27 82 662 7726 Work: +27 21 506 7300 On 27 Oct 2014, at 18:25, Richard Black aeldstes...@gmail.com wrote: Jason, Thanks for your comments. While I agree that changing the ADC frequency mid-operation is non-kosher and could result in uncertain behavior, the issue at hand for us is to figure out what is going on with the PAPER model that has been published on the CASPER wiki. This naturally won't be (and shouldn't be) the end-all solution to this problem. This is a reportedly fully-functional model that shouldn't require any major changes in order to operate. However, this has clearly not been the case in at least two independent situations (us and Peter). This begs the question: what's so different about our use of PAPER? We, at BYU, have made painstakingly sure that our IP addressing schemes, switch ports, and scripts are all configured correctly (thanks to David MacMahon for that, btw), but we still have hit the proverbial brick wall of 10-GbE overflow. When I last corresponded with David, he explained that he remembers having a similar issue before, but can't recall exactly what the problem was. In any case, the fact that by turning down the ADC clock prior to start-up prevents the 10-GbE core from overflowing is a major lead for us at BYU (we've been spinning our wheels on this issue for several months now). By no means are we proposing mid-run ADC clock modifications, but this appears to be a very subtle (and quite sinister, in my opinion) bug. Any thoughts as to what might be going on? 
Richard Black On Mon, Oct 27, 2014 at 2:41 AM, Jason Manley jman...@ska.ac.za wrote: Just a note that I don't recommend you adjust FPGA clock frequencies while it's operating. In theory, you should do a global reset in case the PLL/DLLs lose lock during clock transitions, in which case the logic could be in a uncertain state. But the Sysgen flow just does a single POR. A better solution might be to keep the 10GbE cores turned off (enable line pulled low) on initialisation, until things are configured (tgtap started etc), and only then enable the transmission using a SW register. Jason Manley CBF Manager SKA-SA Cell: +27 82 662 7726 Work: +27 21 506 7300 On 25 Oct 2014, at 10:34, peter peterniu...@163.com wrote: Hi Richard,Joe, all, Thanks for your help,It finally can receive packets now! As you point,After enabled the ADC card and run bof file(./adc_init.rb roach1 bof file)in 200 Mhz (or higher than it), We need run init fengien script in about 75 Mhz ,(./paper_feng_init.rb roach1:0 ) ,That will allow the packet transfer. then we can turn the frequency higher.However the finally ADC clock frequency is up to 120 Mhz in my experiment.Our final ADC frequency standard is 250 Mhz. Maybe I need run the bof file in a higher ADC frequency first to make a final steady 250 Mhz ADC clock frequncy. Why it need init in a lower frequency and turn it up? That didn't make sense.Is the hardware going wrong?As the yellow block adc16*250-8 is designed for 250 Mhz, it should be ok for 200Mhz or 250 Mhz.How about the final frequency in your experiment? Any reply will be helpful! Best Regards! peter At 2014-10-25 00:36:52, Richard Black aeldstes...@gmail.com wrote: Peter, That's correct. We downloaded the FPGA firmware and programmed the ROACH with the precompiled bitstream. When we didn't get any data beyond that single packet, we stuck some overflow status registers in the model and found that we were overflowing at 1025 64-bit words (i.e. 8200 bytes). 
We have actually found a way to get packets to flow, but it isn't a good fix. When we turn the ADC clock frequency down to about 75 MHz, the packets begin to flow. There is an opinion in our group that the 10-GbE buffer overflow is a transient behavior, and, hence, if we slowly turn up the clock frequency after the ROACH has started up, packets may continue to flow in steady-state operation. We haven't tested this yet, though. Richard Black On Thu, Oct 23, 2014 at 8:39 PM, peter peterniu...@163.com wrote: Hi Richard, All, As you said, the size of the isolated packet changes every time: tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on px1-2, link-type EN10MB (Ethernet), capture size 65535 bytes 10:10:55.622053 IP 10.10.2.1.8511 > 10.10.2.9.8511: UDP, length 4616 Did you download the PAPER gateware on the casper
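The numbers quoted in this thread fit together in a way worth checking. Below is a back-of-envelope sketch, illustrative only: it assumes the fabric offers the core one 64-bit word per FPGA clock with tx_valid held high continuously, which the real PAPER design need not do in steady state. Under that assumption, the observed 8200-byte overflow point, the 75 MHz and 120 MHz "working" clocks, and the 200 MHz failure are all consistent with a 10 Gb/s drain rate:

```python
# Back-of-envelope numbers for the 10GbE TX buffer behaviour described above.
# Assumption (not from the thread): the fabric presents one 64-bit word per
# FPGA clock with tx_valid held high, a worst-case duty cycle.

WORD_BITS = 64
LINE_RATE = 10e9        # approximate 10GbE drain rate, bits/s
FIFO_WORDS = 1025       # overflow point Richard observed

FIFO_BYTES = FIFO_WORDS * WORD_BITS // 8

def input_rate(fclk_hz):
    """Bits per second offered to the core at a given fabric clock."""
    return fclk_hz * WORD_BITS

def overflows(fclk_hz):
    """True if a continuously-valid stream outpaces the 10G link."""
    return input_rate(fclk_hz) > LINE_RATE

print(FIFO_BYTES)                   # 8200 bytes, matching the report
print(overflows(75e6))              # False: packets flow at 75 MHz
print(overflows(120e6))             # False: Peter's 120 MHz also works
print(overflows(200e6))             # True: 200 MHz exceeds line rate
print(LINE_RATE / WORD_BITS / 1e6)  # break-even fabric clock, 156.25 MHz
```

Interestingly, Peter's design stopped working somewhere above 120 MHz, which sits just below the 156.25 MHz break-even point of this naive model, though since the fielded PAPER correlator runs this bitstream at 200 MHz, sustained rate alone cannot be the whole story.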
Re: [casper] Problem about the adc frequency in PAPER model.
By enable port, I assume you mean the valid port. I've been looking at the PAPER model carefully for some time now, and that is how it operates. It has a gated valid signal with a software register on each 10-GbE core. Once again, this is not our model. This is one made available on the CASPER wiki and run without modifications. Richard Black On Mon, Oct 27, 2014 at 10:34 AM, Jason Manley jman...@ska.ac.za wrote: ...
Re: [casper] Problem about the adc frequency in PAPER model.
Yep, ok, so whoever did it (Dave?) already knows about this issue and has dealt with it. So scratch that idea then! Only other thing to check is to make sure you don't actually toggle that software register until the core is configured. Jason Manley CBF Manager SKA-SA Cell: +27 82 662 7726 Work: +27 21 506 7300 On 27 Oct 2014, at 18:38, Richard Black aeldstes...@gmail.com wrote: ...
Re: [casper] Problem about the adc frequency in PAPER model.
Jason, Fair point. One of our guys is currently trying to get ChipScope configured to make sure all our control signals are correct. We'll definitely look at that signal too. Hopefully that will finally put this issue to rest. Thanks for the tip, Richard Black On Mon, Oct 27, 2014 at 10:47 AM, Jason Manley jman...@ska.ac.za wrote: ...
Re: [casper] Problem about the adc frequency in PAPER model.
Hi Richard, I've just had a very brief look at the design / software, so take this email with a pinch of salt, but on the off-chance you haven't checked this: it looks like the PAPER F-engine setup, on running the start script for the out-of-the-box software / firmware, is -- 1. Disable all ethernet interfaces 2. Arm sync generator, wait 1 second for PPS 3. Reset ethernet interfaces 4. Enable interfaces. These four steps seem like they should be safe, yet the behaviour you're describing sounds like the design is midway through sending a packet, then gets a sync, gives up sending an end-of-frame and starts sending a new packet, at which point the old packet + the new packet = overflow. Knowing that the design works for PAPER, my wondering is whether, after arming the sync generator, syncs are flowing through the design before the ethernet interface is enabled. Do you have a PPS-like input? The F-engine initialisation script seems to wait for a second after arming, but if your sync input is something significantly slower, you could have problems. I'm sceptical about this theory (I think the symptoms would be lots of OK packets when you brought up the interface, and then it dying when the sync arrives, rather than a single good packet like you're seeing), but if the firmware + software really is the same as that working with PAPER, and the wiki hasn't just got out of sync with the PAPER devs, perhaps the problem is in your hardware setup. Cheers, Jack On 27 October 2014 16:38, Richard Black aeldstes...@gmail.com wrote: ...
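Jack's four-step bring-up order can be sketched as a function against a KATCP-style client object. The register names here (eth%d_ctrl, eth%d_rst, sync_arm) and the single-argument control scheme are placeholders for illustration, not the actual PAPER register map; only the ordering is the point:

```python
# Sketch of the start-up ordering Jack describes. The register names are
# hypothetical; 'fpga' is any object exposing write_int(name, value),
# e.g. a KATCP client wrapper.
import time

def init_fengine(fpga, n_ifaces=4, pps_wait=1.0):
    """Bring up the F-engine 10GbE cores in a safe order (sketch)."""
    # 1. Disable all ethernet interfaces so nothing transmits yet.
    for i in range(n_ifaces):
        fpga.write_int('eth%d_ctrl' % i, 0)
    # 2. Arm the sync generator, then wait out the next PPS edge.
    fpga.write_int('sync_arm', 1)
    time.sleep(pps_wait)
    # 3. Pulse reset on each interface while transmission is still gated off.
    for i in range(n_ifaces):
        fpga.write_int('eth%d_rst' % i, 1)
        fpga.write_int('eth%d_rst' % i, 0)
    # 4. Only now enable transmission via the software register.
    for i in range(n_ifaces):
        fpga.write_int('eth%d_ctrl' % i, 1)
```

Because the function only assumes a write_int method, the ordering invariant (enables strictly after resets, resets strictly after disables and the sync arm) is easy to verify off-hardware with a stub object that records writes.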
Re: [casper] Problem about the adc frequency in PAPER model.
Jack, I appreciate your help. I tend to agree that the issue is likely a hardware configuration problem, but we have been trying to match it as closely as possible. We do feed a 1-PPS signal into the board, but I'm hazy on the details of the other pulse parameters. I'll look into that as well. So, if I understand you correctly, you believe that the sync pulse is reaching the ethernet interfaces *after* the cores are enabled? If that is the case, couldn't we delay enabling the 10-GbE cores for another second to fix it? This might be a quick way to test that theory, but please correct me if I've misunderstood. Richard Black On Mon, Oct 27, 2014 at 11:05 AM, Jack Hickish jackhick...@gmail.com wrote: ...
Re: [casper] Problem about the adc frequency in PAPER model.
Hi Richard, That's my theory, though I doubt it's right. But as you say, an easy test is just to delay after issuing a sync for a couple more seconds and see if that helps. But if your PPS is a real PPS (rather than just a square wave at some vague 1s period) then I can't see what difference this would make. When that doesn't help, my inclination would be to start prodding the 10gbe control signals from software to make sure the reset / sw enables are working / see if a tge reset without a new sync behaves differently. But I can't imagine how that would be broken unless the stuff on github is out of date (which I doubt). Jack On 27 October 2014 17:28, Richard Black aeldstes...@gmail.com wrote: ...
Re: [casper] Problem about the adc frequency in PAPER model.
Hi, Richard, On Oct 27, 2014, at 9:25 AM, Richard Black wrote: This is a reportedly fully-functional model that shouldn't require any major changes in order to operate. However, this has clearly not been the case in at least two independent situations (us and Peter). This begs the question: what's so different about our use of PAPER? I just verified that the roach2_fengine_2013_Oct_14_1756.bof.gz file is the one being used by the PAPER correlator currently fielded in South Africa. It is definitely a fully functional model. That image (and all source files for it) is available from the git repo listed on the PAPER Correlator Manifest page of the CASPER Wiki: https://casper.berkeley.edu/wiki/PAPER_Correlator_Manifest We, at BYU, have made painstakingly sure that our IP addressing schemes, switch ports, and scripts are all configured correctly (thanks to David MacMahon for that, btw), but we still have hit the proverbial brick wall of 10-GbE overflow. When I last corresponded with David, he explained that he remembers having a similar issue before, but can't recall exactly what the problem was. Really? I recall saying that I often forget about increasing the MTU of the 10 GbE switch and NICs. I don't recall saying that I had a similar issue before but couldn't remember the problem. In any case, the fact that by turning down the ADC clock prior to start-up prevents the 10-GbE core from overflowing is a major lead for us at BYU (we've been spinning our wheels on this issue for several months now). By no means are we proposing mid-run ADC clock modifications, but this appears to be a very subtle (and quite sinister, in my opinion) bug. Any thoughts as to what might be going on? I cannot explain the 10 GbE overflow that you and Peter are experiencing. I have pushed some updates to the rb-papergpu.git repository listed on the PAPER Correlator Manifest page. 
The paper_feng_init.rb script now verifies that the ADC clocks are locked and provides options for issuing a software sync (only recommended for lab use) and for not storing the time of synchronization in redis (also only recommended for lab use). The 10 GbE cores can overflow if they are fed valid data (i.e. tx_valid=1) while they are held in reset. Since you are using the paper_feng_init.rb script, this should not be happening (unless something has gone wrong during the running of that script) because that script specifically and explicitly disables the tx_valid signal before putting the cores into reset, and it takes the cores out of reset before enabling the tx_valid signal. So assuming that this is not the cause of the overflows, there must be something else that is causing the 10 GbE cores to be unable to transmit data fast enough to keep up with the data stream they are being fed. Two things that could cause this are 1) running the design faster than the 200 MHz sample clock that it was built for and/or 2) some link issue that prevents the core from sending data. Unfortunately, I think both of those ideas are also pretty far-fetched given all you've done to try to get the system working. I wonder whether there is some difference in the ROACH2 firmware (u-boot version or CPLD programming) or PPC Linux setup or tcpborphserver revision or ???. Have you tried using adc16_dump_chans.rb to dump snapshots of the ADC data to make sure that it looks OK? Dave
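David's MTU reminder can be sanity-checked against Peter's capture. A small back-of-envelope calculation, assuming a minimal IPv4 header with no options and that tcpdump's "length 4616" reports the UDP payload size:

```python
# Why the MTU matters for this design: the F-engine's 4616-byte UDP
# payloads cannot fit in a default 1500-byte MTU, so every hop (switch
# ports and receiving NICs) must be configured for jumbo frames.

UDP_PAYLOAD = 4616   # from Peter's tcpdump capture
UDP_HEADER = 8
IP_HEADER = 20       # minimal IPv4 header, no options (assumption)

required_mtu = UDP_PAYLOAD + UDP_HEADER + IP_HEADER
print(required_mtu)          # 4644: MTU must be at least this on every hop
print(required_mtu > 1500)   # True: default-MTU links will not pass these frames
```

In practice this is usually handled by enabling jumbo frames (MTU 9000) on the 10 GbE switch and NICs, which is exactly the step David says he often forgets.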
Re: [casper] Problem about the adc frequency in PAPER model.
David,

We'll take another close look at what model we are actually using, just to be safe.

I went back and looked at our e-mails, and sure enough, you're right. You were referring to the MTU issue as being the problem you tend to suppress all memory of. It was just that you stated it in a separate paragraph, so, out of context, I extrapolated that you had had the same problem before. My bad for dragging your good name through the mud. :)

We will also update our local repositories, in the event some bizarre race condition exists on our end. I didn't know that the buffer could fill up while reset was asserted. We'll definitely have to check up on that too.

We haven't tried dumping raw ADC data yet, since we have been trying to get the data link working first. After that, we were planning to inject a signal and examine the outputs.

Thanks,
Richard Black
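[Editor's sketch] Dave's first suggested cause — running the design faster than the 200 MHz sample clock it was built for — can be sanity-checked with back-of-envelope arithmetic. The 10 GbE output rate scales linearly with the fabric clock, which in ROACH2 designs like this tracks the ADC sample clock, so a design that just fits at 200 MHz is pushed past line rate by any faster clock. The bits-per-cycle figure below is purely illustrative, not the actual PAPER F-engine packet format.

```ruby
# Illustrative only: assume the core ships an average of 48 payload bits
# per fabric clock cycle (hypothetical figure, not the real design's).
LINE_RATE_GBPS = 10.0

def payload_gbps(fabric_clock_mhz, bits_per_cycle = 48)
  fabric_clock_mhz * 1e6 * bits_per_cycle / 1e9
end

[180, 200, 220].each do |mhz|
  rate = payload_gbps(mhz)
  verdict = rate > LINE_RATE_GBPS ? "OVERFLOWS" : "ok"
  puts format("%d MHz -> %.2f Gb/s (%s)", mhz, rate, verdict)
end
```

With these assumed numbers, 200 MHz stays just under line rate while 220 MHz exceeds it, which is consistent with the observation that turning the ADC clock down makes the overflows stop.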
Re: [casper] Problem about the adc frequency in PAPER model.
Hi, Richard and Peter,

Another possibility that crossed my mind is that perhaps your ROACH2s were from the batch where the incorrect oscillator was installed for U72. This seems unlikely for Richard based on this email (which also describes the incorrect oscillator problem in general):

https://www.mail-archive.com/casper@lists.berkeley.edu/msg04909.html

Maybe it's worth a double-check anyway?

Dave