*sigh* tgtap does three things.

It resolves ARP, which is great. It creates a tap interface, which is of dubious value. (Does anyone know which packets received on 10GbE are passed along to the tap interface? It had better not be a lot of them, because there's no way that the PPC could keep up at full speed.) It runs ifconfig, which is a disaster when booted over NFS. I commented out the ifconfig and life is good.

That being said, should I expect ARP to still work when the 10 Gb/s traffic is coming in? And will the transmissions *from* the BORPH side use up buffer and cause tx overruns?

--Andy

On Mon, Mar 28, 2011 at 12:24 PM, Jason Manley <jasonman...@gmail.com> wrote:
> The fact that the board is not responding even to ssh isn't good. It sounds
> like a possible fault with the kernel or tcpborphserver. Which versions are
> you running?
>
> http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/tcpborphserver/tcpborphserver2-2011-03-10-r3405-fansilent
> http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/linux/uImage-fallback-20110303
>
> I would suggest connecting a serial cable to the ROACH, starting
> tcpborphserver manually (non-daemon) and watching what goes wrong.
>
> Perhaps Marc can comment?
>
> Jason
>
> On 28 Mar 2011, at 18:16, Jon Losh wrote:
>
>> I'm not running the code with the '-a' option, so it should skip the ARP
>> printing. I tried setting the logger level to debug, but the error messages
>> are the exact same.
>>
>> When I run the code and try to interact with the fpga, any requests like
>> fpga.listdev() or fpga.listbof() give me a "Request timed out after 10.0
>> seconds." fpga.is_connected() returns True, but I'm beginning to suspect
>> that's more of a relic of it not being set to false due to something else
>> being gummed up.
>>
>> Also, after running this code, the roach itself is unresponsive to pings or
>> attempts to ssh into it, and I have to go manually reset it. I've never seen
>> something that would lock up a roach like this...
>>
>> I tried stepping through the code manually, and the problems seem to start
>> when I try to start tgtap a second time. I thought there was something in
>> the mail archive about tgtap being stupid about starting a second instance,
>> but wasn't that fixed? I've compiled this .bof with the latest stable
>> version of the CASPER git, if that's useful to know. Here's a log of the
>> ipython session:
>>
>> ----------------------------------------------------------------------------------------------------------------------------------
>> In [39]: fpga = corr.katcp_wrapper.FpgaClient('10.0.0.102', logger=logger)
>>
>> In [40]: fpga.progdev('tut2_2011_Mar_17_1503.bof')
>> Out[40]: 'ok'
>>
>> In [41]: fpga.tap_start('tap0',rx_core_name,mac_base+dest_ip,dest_ip,fabric_port)
>>
>> In [42]: fpga.tap_start('tap3',tx_core_name,mac_base+source_ip,source_ip,fabric_port)
>> ERROR: An unexpected error occurred while tokenizing input
>> The following traceback may be corrupted or invalid
>> The error message is: ('EOF in multi-line statement', (167, 0))
>>
>> ---------------------------------------------------------------------------
>> RuntimeError                              Traceback (most recent call last)
>>
>> /home/jlosh/fftt/models/test/tut2_short.py in <module>()
>> ----> 1
>>       2
>>       3
>>       4
>>       5
>>
>> /usr/lib/python2.6/site-packages/corr-0.6.5-py2.6.egg/corr/katcp_wrapper.pyc in tap_start(self, tap_dev, device, mac, ip, port)
>>     176
>>     177         self._logger.info("Starting tgtap driver instance for %s: %s %s %s %s %s"%("tap-start", tap_dev, device, ip_str, port_str, mac_str))
>> --> 178         reply, informs = self._request("tap-start", tap_dev, device, ip_str, port_str, mac_str)
>>     179         if reply.arguments[0]=='ok': return
>>     180         else: raise RuntimeError("Failure starting tap device %s with mac %s, %s:%s"%(device,mac_str,ip_str,port_str))
>>
>> /usr/lib/python2.6/site-packages/corr-0.6.5-py2.6.egg/corr/katcp_wrapper.pyc in _request(self, name, *args)
>>      59         """
>>      60         request = Message.request(name, *args)
>> ---> 61         reply, informs = self.blocking_request(request,keepalive=True)
>>      62
>>      63         if reply.arguments[0] != Message.OK:
>>
>> /usr/lib/python2.6/site-packages/katcp-0.2.6-py2.6.egg/katcp/client.pyc in blocking_request(self, msg, timeout, keepalive)
>>     605         else:
>>     606             raise RuntimeError("Request %s timed out after %s seconds." %
>> --> 607                                (msg.name, timeout))
>>     608
>>     609     def handle_inform(self, msg):
>>
>> RuntimeError: Request tap-start timed out after 10.0 seconds.
>>
>> ----------------------------------------------------------------------------------------------------------------------------------
>>
>> I don't understand why running the tut2.py script will start both cores just
>> fine but manually doing it won't. The commands I put in manually were done
>> after running the tut2.py script once, so all of the variables like mac_base
>> were already set properly. Do these tests shed light on anything?
>>
>> On Mon, Mar 28, 2011 at 6:00 AM, Jason Manley <jasonman...@gmail.com> wrote:
>> Here is the snippet of code that is breaking things:
>>
>> print 'Resetting cores and counters...',
>> sys.stdout.flush()
>> fpga.write_int('rst',3)
>> fpga.write_int('rst',0)
>> print 'done'
>>
>> time.sleep(2)
>>
>> if opts.arp:
>>     print '\n\n==============================='
>>     print '10GbE Transmitter core details:'
>>     print '==============================='
>>     print "Note that for some IP address values, only the lower 8 bits are valid!"
>>     fpga.print_10gbe_core_details(tx_core_name,arp=True)
>>     print '\n\n============================'
>>     print '10GbE Receiver core details:'
>>     print '============================'
>>     print "Note that for some IP address values, only the lower 8 bits are valid!"
>>     fpga.print_10gbe_core_details(rx_core_name,arp=True)
>>
>> print 'Sent %i packets already.'%fpga.read_int('gbe0_tx_cnt')
>> print 'Received %i packets already.'%fpga.read_int('gbe3_rx_frame_cnt')
>>
>> I assume you're executing the script without the "-a" option? In this case
>> it should be waiting for two seconds and then printing the current tx packet
>> count next. It seems it's having trouble with something here.
>>
>> I'd suggest opening an interactive python command-line session (ipython or
>> just python from the command line) and trying the command manually to see
>> what's wrong. Try an fpga.listdev() to make sure that the register actually
>> exists and then fpga.read_uint('gbe0_tx_cnt') to make sure you can read it
>> ok.
>>
>> In the python file itself, you can also try increase the logging level: find
>> the logger.setLevel(10) line and change to logger.setLevel(logging.DEBUG)
>>
>> Jason
>>
>> On 28 Mar 2011, at 02:58, Jon Losh wrote:
>>
>> > Okay, we extensively went through and made sure everything was up to date,
>> > and now tap_start seems to work fine. The new problem is a little harder
>> > to pin down though. Here's a log:
>> >
>> > -----------------------------------------------------------------------------------------------------
>> > In [15]: run tut2.py 10.0.0.102
>> > Connecting to server 10.0.0.102... ok
>> >
>> > ------------------------
>> > Programming FPGA... ok
>> > ---------------------------
>> > Port 0 linkup: True
>> > Port 3 linkup: True
>> > ---------------------------
>> > Configuring receiver core... done
>> > Configuring transmitter core... done
>> > ---------------------------
>> > Setting-up packet source... done
>> > Setting-up destination addresses... done
>> > Resetting cores and counters... done
>> > FAILURE DETECTED. Log entries:
>> > 10.0.0.102: Starting thread Thread-5
>> > 10.0.0.102: #client-connected 10.0.0.4:59996
>> > 10.0.0.102: #client-connected 10.0.0.4:59997
>> > 10.0.0.102: #version raw-0.1
>> > 10.0.0.102: #build-state tcpborphserver-2.3398
>> > 10.0.0.102: ?progdev tut2_2011_Mar_17_1503.bof
>> >
>> > 10.0.0.102: !progdev ok 229
>> > 10.0.0.102: ?read gbe0_linkup 0 4
>> >
>> > 10.0.0.102: !read ok \0\0\0
>> > 10.0.0.102: ?read gbe3_linkup 0 4
>> >
>> > 10.0.0.102: !read ok \0\0\0
>> > 10.0.0.102: Starting tgtap driver instance for tap-start: tap0 gbe3 10.0.0.30 60000 02:02:0A:00:00:1E
>> > 10.0.0.102: ?tap-start tap0 gbe3 10.0.0.30 60000 02:02:0A:00:00:1E
>> >
>> > 10.0.0.102: !tap-start ok
>> > 10.0.0.102: Starting tgtap driver instance for tap-start: tap3 gbe0 10.0.0.20 60000 02:02:0A:00:00:14
>> > 10.0.0.102: ?tap-start tap3 gbe0 10.0.0.20 60000 02:02:0A:00:00:14
>> >
>> > 10.0.0.102: !tap-start ok
>> > 10.0.0.102: ?write pkt_sim_period 0 \0\0@\0
>> >
>> > 10.0.0.102: !write ok
>> > 10.0.0.102: ?read pkt_sim_period 0 4
>> >
>> > 10.0.0.102: !read ok \0\0@\0
>> > 10.0.0.102: Write 4000 to register pkt_sim_period at offset 0 ok.
>> > 10.0.0.102: ?write pkt_sim_payload_len 0 \0\0\0�
>> >
>> > 10.0.0.102: !write ok
>> > 10.0.0.102: ?read pkt_sim_payload_len 0 4
>> >
>> > 10.0.0.102: !read ok \0\0\0�
>> > 10.0.0.102: Write 80 to register pkt_sim_payload_len at offset 0 ok.
>> > 10.0.0.102: ?write dest_ip 0 \n\0\0
>> >
>> > 10.0.0.102: !write ok
>> > 10.0.0.102: ?read dest_ip 0 4
>> >
>> > 10.0.0.102: !read ok \n\0\0
>> > 10.0.0.102: Write a00001e to register dest_ip at offset 0 ok.
>> > 10.0.0.102: ?write dest_port 0 \0\0�`
>> >
>> > 10.0.0.102: !write ok
>> > 10.0.0.102: ?read dest_port 0 4
>> >
>> > 10.0.0.102: !read ok \0\0�`
>> > 10.0.0.102: Write ea60 to register dest_port at offset 0 ok.
>> > 10.0.0.102: ?write rst 0 \0\0\0
>> >
>> > 10.0.0.102: !write ok
>> > 10.0.0.102: ?read rst 0 4
>> >
>> > 10.0.0.102: !read ok \0\0\0
>> > 10.0.0.102: Write 3 to register rst at offset 0 ok.
>> > 10.0.0.102: ?write rst 0 \0\0\0\0
>> >
>> > 10.0.0.102: !write ok
>> > 10.0.0.102: ?read rst 0 4
>> >
>> > 10.0.0.102: !read ok \0\0\0\0
>> > 10.0.0.102: Write 0 to register rst at offset 0 ok.
>> > 10.0.0.102: ?read gbe0_tx_cnt 0 4
>> >
>> > None
>> > -----------------------------------------------------------------------------------------------------
>> >
>> > After running this, fpga.is_connected() still returns True, but any
>> > requests made time out. Also, the roach becomes unresponsive to pings.
>> > I've been told before that the requests timeouts are just sort of
>> > mysterious and the best bet is to reboot the roach, but I've tried that a
>> > few times (and with different roaches) and had no success. Where should I
>> > go from here?
>> >
>> > On Fri, Mar 18, 2011 at 2:10 AM, Jason Manley <jasonman...@gmail.com> wrote:
>> > I think the version in your filesystem might be outdated. I'd suggest
>> > downloading this tarball
>> > http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/filesystem/filesystem_etch_2010-03-24_sd_shipping.tar.gz
>> > and then manually updating the components listed below...
>> >
>> > Ensure you're running the latest of these things; any one that's
>> > broken/old could cause a failure like what you're seeing:
>> > * kernel: http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/linux/uImage-fallback-20110303
>> > * tcpborphserver: http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/tcpborphserver/tcpborphserver2-2011-03-10-r3405-fansilent
>> > * tgtap: http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/tgtap/tgtap_2010-03-24
>> >
>> > Jason
>> >
>> > On 18 Mar 2011, at 01:28, Mark Wagner wrote:
>> >
>> > > Well, that depends. Are you simply using the mmc card that came with
>> > > your roach? If so, how long ago did it arrive? If you're using NFS or
>> > > USB stick, then you probably would have downloaded the tarball, in which
>> > > case, the version would be in the name. Maybe some else knows a better
>> > > way to tell.
>> > >
>> > > You should be able to get the newest version here:
>> > >
>> > > http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/filesystem/
>> > >
>> > > Also, have you tried running tgtap from the roach itself? Once you've
>> > > already loaded the design? Find the process ID, and the name you're
>> > > using for the device... for ex:
>> > >
>> > > tgtap -b /proc/959/hw/ioreg/ten_GbE -a 10.0.0.31 -t ten_GbE -m 02:02:0A:00:00:1F -p 33107
>> > >
>> > > Mark
>> > >
>> > > On Thu, Mar 17, 2011 at 1:10 PM, Jon Losh <jl...@mit.edu> wrote:
>> > > How do I check what version of the filesystem I have?
>> > >
>> > > On Thu, Mar 17, 2011 at 4:04 PM, Mark Wagner <mwag...@ssl.berkeley.edu> wrote:
>> > > Hi Jon,
>> > >
>> > > You're right, if you've just run the script, tgtap should be running on
>> > > the roach.
>> > >
>> > > Which version of the roach filesystem are you using? There were some
>> > > changes to tgtap awhile back. If you're using an older version of the
>> > > filesystem tgtap may not be compatible with the latest version of tut2.
>> > >
>> > > Mark
>> > >
>> > > On Thu, Mar 17, 2011 at 11:38 AM, Jon Losh <jl...@mit.edu> wrote:
>> > > Hi,
>> > >
>> > > So I've been trying to get 10gbe working for tutorial 2, and the script
>> > > keeps failing when it tries to call fpga.tap_start(). I get the
>> > > following error:
>> > >
>> > > Connecting to server 10.0.0.105... ok
>> > >
>> > > ------------------------
>> > > Programming FPGA... ok
>> > > ---------------------------
>> > > Port 0 linkup: True
>> > > Port 3 linkup: True
>> > > ---------------------------
>> > > Configuring receiver core... FAILURE DETECTED. Log entries:
>> > > 10.0.0.105: Starting thread Thread-1
>> > > 10.0.0.105: #version poco-0.1
>> > > 10.0.0.105: #build-state poco-0.1775
>> > > 10.0.0.105: ?progdev tut2_2011_Mar_16_1608.bof
>> > >
>> > > 10.0.0.105: !progdev ok
>> > > 10.0.0.105: ?read gbe0_linkup 0 4
>> > >
>> > > 10.0.0.105: !read ok \0\0\0
>> > > 10.0.0.105: ?read gbe3_linkup 0 4
>> > >
>> > > 10.0.0.105: !read ok \0\0\0
>> > > None
>> > >
>> > > My sysadmin tells me that we have the latest versions of of the ROACH
>> > > kernel, tcpborphserver, and corr, so I'm not totally sure what's up. One
>> > > of the threads in the mail archive said to try checking the processes
>> > > running on the roach to see if tgtap was running; a "ps aux" gives:
>> > >
>> > > root@10:~# ps aux
>> > > USER  PID %CPU %MEM   VSZ  RSS TTY   STAT START TIME COMMAND
>> > > root    1  0.0  0.0  2428  744 ?     Ss   Oct01 0:02 init [2]
>> > > root    2  0.0  0.0     0    0 ?     S<   Oct01 0:00 [kthreadd]
>> > > root    3  0.0  0.0     0    0 ?     S<   Oct01 0:00 [ksoftirqd/0]
>> > > root    4  0.0  0.0     0    0 ?     S<   Oct01 0:03 [events/0]
>> > > root    5  0.0  0.0     0    0 ?     S<   Oct01 0:00 [khelper]
>> > > root   51  0.0  0.0     0    0 ?     S<   Oct01 0:00 [kblockd/0]
>> > > root   61  0.0  0.0     0    0 ?     S<   Oct01 0:00 [khubd]
>> > > root   68  0.0  0.0     0    0 ?     S<   Oct01 0:00 [kmmcd]
>> > > root   88  0.0  0.0     0    0 ?     S    Oct01 0:01 [bkexecd]
>> > > root   89  0.0  0.0     0    0 ?     S    Oct01 0:00 [pdflush]
>> > > root   90  0.0  0.0     0    0 ?     S    Oct01 0:00 [pdflush]
>> > > root   91  0.0  0.0     0    0 ?     S<   Oct01 0:00 [kswapd0]
>> > > root   92  0.0  0.0     0    0 ?     S<   Oct01 0:00 [aio/0]
>> > > root  150  0.0  0.0     0    0 ?     S<   Oct01 0:00 [mtdblockd]
>> > > root  196  0.0  0.0     0    0 ?     S<   Oct01 0:00 [krmond]
>> > > root  203  0.0  0.0     0    0 ?     S<   Oct01 0:00 [rpciod/0]
>> > > root  208  0.0  0.0     0    0 ?     S<   Oct01 0:00 [mmcqd]
>> > > root  214  0.0  0.0     0    0 ?     SN   Oct01 0:00 [jffs2_gcd_mtd3]
>> > > root  238  0.0  0.1  6700 1164 ?     Ss   Oct01 0:00 /usr/sbin/sshd
>> > > ntp   247  0.0  0.1  5432 1312 ?     Ss   Oct01 0:03 /usr/sbin/ntpd -p /var/run/ntpd.pid -u 101:103 -g -b -l /tmp/nt
>> > > root  256  0.0  0.0  1908  648 ?     S    Oct01 0:00 tcpborphserver2
>> > > root  264  0.0  0.0  1788  576 ttyS0 Ss+  Oct01 0:00 /sbin/getty -L ttyS0 115200 vt100
>> > > root  292  0.0  0.0  1632  304 ?     S    02:32 0:00 /boffiles/tut2_2011_Mar_16_1608.bof
>> > > root  293  7.3  0.2 10000 2668 ?     Ss   02:38 0:00 sshd: root@pts/0
>> > > root  296  1.0  0.1  3552 1792 pts/0 Ss   02:38 0:00 -bash
>> > > root  300  0.0  0.0  2780  996 pts/0 R+   02:38 0:00 ps aux
>> > >
>> > > Should tgtap be one of the processes running under "command"? If so, it
>> > > doesn't appear to be there. Any ideas on a good angle to attack this
>> > > from?
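
For anyone decoding the logs quoted in this thread, here is a minimal sketch (not taken from tut2.py or corr) of how the logged addresses and register payloads relate to each other. It assumes mac_base is 0x020200000000, a value inferred from the logged MAC 02:02:0A:00:00:1E rather than quoted anywhere in the thread, and that registers are written as big-endian 32-bit words, which is what the logged bytes suggest. The helper names ip_to_int, mac_to_str and register_bytes are made up for illustration.

```python
import struct

# Assumption: inferred from the log line
# "tap0 gbe3 10.0.0.30 60000 02:02:0A:00:00:1E"; the real value in
# tut2.py is not shown in the thread.
MAC_BASE = 0x020200000000


def ip_to_int(dotted):
    """Pack a dotted-quad IPv4 string into a 32-bit integer."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def mac_to_str(mac):
    """Format a 48-bit integer as colon-separated upper-case hex."""
    return ":".join("%02X" % ((mac >> shift) & 0xFF)
                    for shift in range(40, -8, -8))


def register_bytes(value):
    """Encode a value as a big-endian unsigned 32-bit word.

    Assumption: this mirrors the byte order seen in the ?write payloads.
    """
    return struct.pack(">I", value)


dest_ip = ip_to_int("10.0.0.30")
print("%x" % dest_ip)                   # compare with "Write a00001e to register dest_ip"
print(mac_to_str(MAC_BASE + dest_ip))   # compare with the tap0 MAC in the log
print(repr(register_bytes(60000)))      # compare with the dest_port payload
```

If those assumptions hold, they would also explain the odd characters in the log: 60000 is 0xea60, so the dest_port payload ends in a non-printable 0xea byte followed by a backtick (0x60), and the "Write ea60" and "Write 4000" messages are printing hexadecimal rather than decimal values.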