Re: [casper] tut2 10gbe not configuring

Andrew Lutomirski Mon, 28 Mar 2011 14:29:51 -0700

*sigh*

tgtap does three things.


It resolves ARP, which is great.

It creates a tap interface, which is of dubious value.  (Does anyone
know which packets received on 10gbe are passed along to the tap
interface?  It better not be a lot of them, because there's no way
that the PPC could keep up at full speed.)

It runs ifconfig, which is a disaster when booted over NFS.

I commented out the ifconfig and life is good.

That being said, should I expect ARP to still work when the 10Gb/sec
is coming in?  And will the transmissions *from* the BORPH side use up
buffer and cause tx overruns?
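
To put the worry about the PPC keeping up into numbers, here is a rough
back-of-the-envelope sketch of the packet rate at 10 Gb/sec line rate. It
assumes standard Ethernet framing overhead (preamble, header, FCS and
inter-frame gap, no VLAN tag); the function name is made up for
illustration, and nothing here says how many of those packets tgtap
actually hands to the tap interface.

```python
# Per-frame overhead on the wire, in bytes (standard Ethernet, no VLAN tag).
PREAMBLE = 8   # preamble plus start-of-frame delimiter
HEADER = 14    # destination MAC, source MAC, ethertype
FCS = 4        # frame check sequence
IFG = 12       # minimum inter-frame gap
LINE_RATE_BPS = 10e9


def packets_per_second(payload_bytes):
    """Approximate frames per second at full line rate for a given payload size."""
    payload = max(payload_bytes, 46)  # short payloads are padded to 46 bytes
    wire_bytes = PREAMBLE + HEADER + payload + FCS + IFG
    return LINE_RATE_BPS / (wire_bytes * 8)


for size in (46, 1500):
    print("%5d-byte payload: about %d packets/sec" % (size, packets_per_second(size)))
```

Even with full-size frames that is on the order of 800 thousand packets
per second, so the tap path is only plausible if a small fraction of the
traffic reaches it.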

--Andy

On Mon, Mar 28, 2011 at 12:24 PM, Jason Manley <jasonman...@gmail.com> wrote:
> The fact that the board is not responding even to ssh isn't good. It sounds 
> like a possible fault with the kernel or tcpborphserver. Which versions are 
> you running?
>
> http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/tcpborphserver/tcpborphserver2-2011-03-10-r3405-fansilent
> http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/linux/uImage-fallback-20110303
>
> I would suggest connecting a serial cable to the ROACH, starting 
> tcpborphserver manually (non-daemon) and watching what goes wrong.
>
> Perhaps Marc can comment?
>
> Jason
>
> On 28 Mar 2011, at 18:16, Jon Losh wrote:
>
>> I'm not running the code with the '-a' option, so it should skip the ARP 
>> printing. I tried setting the logger level to debug, but the error messages 
>> are the exact same.
>>
>> When I run the code and try to interact with the fpga, any requests like 
>> fpga.listdev() or fpga.listbof() give me a "Request timed out after 10.0 
>> seconds." fpga.is_connected() returns True, but I'm beginning to suspect 
>> that's more of a relic of it not being set to false due to something else 
>> being gummed up.
>>
>> Also, after running this code, the roach itself is unresponsive to pings or 
>> attempts to ssh into it, and I have to go manually reset it. I've never seen 
>> something that would lock up a roach like this...
>>
>> I tried stepping through the code manually, and the problems seem to start 
>> when I try to start tgtap a second time. I thought there was something in 
>> the mail archive about tgtap being stupid about starting a second instance, 
>> but wasn't that fixed? I've compiled this .bof with the latest stable 
>> version of the CASPER git, if that's useful to know. Here's a log of the 
>> ipython session:
>>
>> ----------------------------------------------------------------------------------------------------------------------------------
>> In [39]: fpga = corr.katcp_wrapper.FpgaClient('10.0.0.102', logger=logger)
>>
>> In [40]: fpga.progdev('tut2_2011_Mar_17_1503.bof')
>> Out[40]: 'ok'
>>
>> In [41]: fpga.tap_start('tap0',rx_core_name,mac_base+dest_ip,dest_ip,fabric_port)
>>
>> In [42]: fpga.tap_start('tap3',tx_core_name,mac_base+source_ip,source_ip,fabric_port)
>> ERROR: An unexpected error occurred while tokenizing input
>> The following traceback may be corrupted or invalid
>> The error message is: ('EOF in multi-line statement', (167, 0))
>>
>> ---------------------------------------------------------------------------
>> RuntimeError                              Traceback (most recent call last)
>>
>> /home/jlosh/fftt/models/test/tut2_short.py in <module>()
>> ----> 1
>>       2
>>       3
>>       4
>>       5
>>
>> /usr/lib/python2.6/site-packages/corr-0.6.5-py2.6.egg/corr/katcp_wrapper.pyc 
>> in tap_start(self, tap_dev, device, mac, ip, port)
>>     176
>>     177         self._logger.info("Starting tgtap driver instance for %s: %s 
>> %s %s %s %s"%("tap-start", tap_dev, device, ip_str, port_str, mac_str))
>> --> 178         reply, informs = self._request("tap-start", tap_dev, device, 
>> ip_str, port_str, mac_str)
>>     179         if reply.arguments[0]=='ok': return
>>     180         else: raise RuntimeError("Failure starting tap device %s 
>> with mac %s, %s:%s"%(device,mac_str,ip_str,port_str))
>>
>> /usr/lib/python2.6/site-packages/corr-0.6.5-py2.6.egg/corr/katcp_wrapper.pyc 
>> in _request(self, name, *args)
>>      59            """
>>      60         request = Message.request(name, *args)
>> ---> 61         reply, informs = 
>> self.blocking_request(request,keepalive=True)
>>      62
>>      63         if reply.arguments[0] != Message.OK:
>>
>> /usr/lib/python2.6/site-packages/katcp-0.2.6-py2.6.egg/katcp/client.pyc in 
>> blocking_request(self, msg, timeout, keepalive)
>>     605         else:
>>     606             raise RuntimeError("Request %s timed out after %s 
>> seconds." %
>> --> 607                                 (msg.name, timeout))
>>     608
>>     609     def handle_inform(self, msg):
>>
>> RuntimeError: Request tap-start timed out after 10.0 seconds.
>>
>> ----------------------------------------------------------------------------------------------------------------------------------
>>
>> I don't understand why running the tut2.py script will start both cores just 
>> fine but manually doing it won't. The commands I put in manually were done 
>> after running the tut2.py script once, so all of the variables like mac_base 
>> were already set properly. Do these tests shed light on anything?
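
For reference, the MAC addresses in the 28 Mar log quoted further down are
consistent with the IP address simply being added to a fixed base. A
minimal sketch, assuming the base is 02:02:00:00:00:00 (inferred from the
logged 02:02:0A:00:00:1E for 10.0.0.30, not taken from tut2.py itself);
the helper names are hypothetical:

```python
import socket
import struct

# Assumed base, inferred from the logged MAC/IP pairs rather than from tut2.py.
MAC_BASE = 0x020200000000


def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to a 32-bit integer."""
    return struct.unpack(">I", socket.inet_aton(ip))[0]


def mac_to_str(mac):
    """Format a 48-bit integer as a colon-separated MAC address."""
    return ":".join("%02X" % ((mac >> shift) & 0xFF) for shift in range(40, -8, -8))


dest_ip = ip_to_int("10.0.0.30")
print(mac_to_str(MAC_BASE + dest_ip))
```

This makes it easy to sanity-check that mac_base+dest_ip and dest_ip in an
interactive session still hold the values the script originally set.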
>>
>> On Mon, Mar 28, 2011 at 6:00 AM, Jason Manley <jasonman...@gmail.com> wrote:
>> Here is the snippet of code that is breaking things:
>>
>>    print 'Resetting cores and counters...',
>>    sys.stdout.flush()
>>    fpga.write_int('rst',3)
>>    fpga.write_int('rst',0)
>>    print 'done'
>>
>>    time.sleep(2)
>>
>>    if opts.arp:
>>        print '\n\n==============================='
>>        print '10GbE Transmitter core details:'
>>        print '==============================='
>>        print "Note that for some IP address values, only the lower 8 bits are valid!"
>>        fpga.print_10gbe_core_details(tx_core_name,arp=True)
>>        print '\n\n============================'
>>        print '10GbE Receiver core details:'
>>        print '============================'
>>        print "Note that for some IP address values, only the lower 8 bits are valid!"
>>        fpga.print_10gbe_core_details(rx_core_name,arp=True)
>>
>>    print 'Sent %i packets already.'%fpga.read_int('gbe0_tx_cnt')
>>    print 'Received %i packets already.'%fpga.read_int('gbe3_rx_frame_cnt')
>>
>> I assume you're executing the script without the "-a" option? In this case 
>> it should be waiting for two seconds and then printing the current tx packet 
>> count next. It seems it's having trouble with something here.
>>
>> I'd suggest opening an interactive python command-line session (ipython or 
>> just python from the command line) and trying the command manually to see 
>> what's wrong. Try an fpga.listdev() to make sure that the register actually 
>> exists and then fpga.read_uint('gbe0_tx_cnt') to make sure you can read it 
>> ok.
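
The manual check described above might look like the following sketch. It
relies only on the listdev() and read_uint() calls named in this message;
the helper function and the stand-in client class are hypothetical and
exist only so the snippet runs without a ROACH attached. With real
hardware you would pass the corr.katcp_wrapper.FpgaClient object instead.

```python
def check_register(fpga, name):
    """Return the register's value, or None if the design has no such register."""
    if name not in fpga.listdev():
        return None
    return fpga.read_uint(name)


class FakeFpga(object):
    """Hypothetical stand-in for the real client, for illustration only."""

    def __init__(self, registers):
        self._registers = registers

    def listdev(self):
        return list(self._registers)

    def read_uint(self, name):
        return self._registers[name]


fpga = FakeFpga({"gbe0_tx_cnt": 1234})
print(check_register(fpga, "gbe0_tx_cnt"))
print(check_register(fpga, "no_such_reg"))
```

Checking listdev() first separates "the register is missing from this
design" from "the request itself is hanging", which are different failures.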
>>
>> In the python file itself, you can also try increasing the logging level: find 
>> the logger.setLevel(10) line and change to     logger.setLevel(logging.DEBUG)
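
One thing worth knowing when trying this: in Python's standard logging
module the DEBUG level is numerically 10, so the two calls are equivalent
and the amount of output should not change. A quick check (the logger
name here is arbitrary):

```python
import logging

logger = logging.getLogger("tut2_example")  # arbitrary name for this check

logger.setLevel(10)
level_from_number = logger.getEffectiveLevel()

logger.setLevel(logging.DEBUG)
level_from_name = logger.getEffectiveLevel()

# Both calls select the same level.
print(level_from_number == level_from_name, logging.getLevelName(10))
```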
>>
>> Jason
>>
>>
>> On 28 Mar 2011, at 02:58, Jon Losh wrote:
>>
>> > Okay, we extensively went through and made sure everything was up to date, 
>> > and now tap_start seems to work fine. The new problem is a little harder 
>> > to pin down though. Here's a log:
>> >
>> >
>> > -----------------------------------------------------------------------------------------------------
>> > In [15]: run tut2.py 10.0.0.102
>> > Connecting to server 10.0.0.102...  ok
>> >
>> > ------------------------
>> > Programming FPGA... ok
>> > ---------------------------
>> > Port 0 linkup:  True
>> > Port 3 linkup:  True
>> > ---------------------------
>> > Configuring receiver core... done
>> > Configuring transmitter core... done
>> > ---------------------------
>> > Setting-up packet source... done
>> > Setting-up destination addresses... done
>> > Resetting cores and counters... done
>> > FAILURE DETECTED. Log entries:
>> > 10.0.0.102: Starting thread Thread-5
>> > 10.0.0.102: #client-connected 10.0.0.4:59996
>> > 10.0.0.102: #client-connected 10.0.0.4:59997
>> > 10.0.0.102: #version raw-0.1
>> > 10.0.0.102: #build-state tcpborphserver-2.3398
>> > 10.0.0.102: ?progdev tut2_2011_Mar_17_1503.bof
>> >
>> > 10.0.0.102: !progdev ok 229
>> > 10.0.0.102: ?read gbe0_linkup 0 4
>> >
>> > 10.0.0.102: !read ok \0\0\0
>> > 10.0.0.102: ?read gbe3_linkup 0 4
>> >
>> > 10.0.0.102: !read ok \0\0\0
>> > 10.0.0.102: Starting tgtap driver instance for tap-start: tap0 gbe3 
>> > 10.0.0.30 60000 02:02:0A:00:00:1E
>> > 10.0.0.102: ?tap-start tap0 gbe3 10.0.0.30 60000 02:02:0A:00:00:1E
>> >
>> > 10.0.0.102: !tap-start ok
>> > 10.0.0.102: Starting tgtap driver instance for tap-start: tap3 gbe0 
>> > 10.0.0.20 60000 02:02:0A:00:00:14
>> > 10.0.0.102: ?tap-start tap3 gbe0 10.0.0.20 60000 02:02:0A:00:00:14
>> >
>> > 10.0.0.102: !tap-start ok
>> > 10.0.0.102: ?write pkt_sim_period 0 \0\0@\0
>> >
>> > 10.0.0.102: !write ok
>> > 10.0.0.102: ?read pkt_sim_period 0 4
>> >
>> > 10.0.0.102: !read ok \0\0@\0
>> > 10.0.0.102: Write     4000 to register pkt_sim_period at offset 0 ok.
>> > 10.0.0.102: ?write pkt_sim_payload_len 0 \0\0\0\x80
>> >
>> > 10.0.0.102: !write ok
>> > 10.0.0.102: ?read pkt_sim_payload_len 0 4
>> >
>> > 10.0.0.102: !read ok \0\0\0\x80
>> > 10.0.0.102: Write       80 to register pkt_sim_payload_len at offset 0 ok.
>> > 10.0.0.102: ?write dest_ip 0 \n\0\0
>> >
>> > 10.0.0.102: !write ok
>> > 10.0.0.102: ?read dest_ip 0 4
>> >
>> > 10.0.0.102: !read ok \n\0\0
>> > 10.0.0.102: Write  a00001e to register dest_ip at offset 0 ok.
>> > 10.0.0.102: ?write dest_port 0 \0\0\xea`
>> >
>> > 10.0.0.102: !write ok
>> > 10.0.0.102: ?read dest_port 0 4
>> >
>> > 10.0.0.102: !read ok \0\0\xea`
>> > 10.0.0.102: Write     ea60 to register dest_port at offset 0 ok.
>> > 10.0.0.102: ?write rst 0 \0\0\0
>> >
>> > 10.0.0.102: !write ok
>> > 10.0.0.102: ?read rst 0 4
>> >
>> > 10.0.0.102: !read ok \0\0\0
>> > 10.0.0.102: Write        3 to register rst at offset 0 ok.
>> > 10.0.0.102: ?write rst 0 \0\0\0\0
>> >
>> > 10.0.0.102: !write ok
>> > 10.0.0.102: ?read rst 0 4
>> >
>> > 10.0.0.102: !read ok \0\0\0\0
>> > 10.0.0.102: Write        0 to register rst at offset 0 ok.
>> > 10.0.0.102: ?read gbe0_tx_cnt 0 4
>> >
>> > None
>> > -----------------------------------------------------------------------------------------------------
>> >
>> > After running this, fpga.is_connected() still returns True, but any 
>> > requests made time out. Also, the roach becomes unresponsive to pings. 
>> > I've been told before that the requests timeouts are just sort of 
>> > mysterious and the best bet is to reboot the roach, but I've tried that a 
>> > few times (and with different roaches) and had no success. Where should I 
>> > go from here?
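
When reading a log like the one above, it helps to know that the register
payloads are 4-byte big-endian words, which is why some bytes show up as
escapes such as \0 or \n, as ordinary characters such as @, or not at all.
A small sketch of the decoding, using values that appear in the log (the
function name is made up):

```python
import socket
import struct


def decode_reg(raw):
    """Interpret 4 raw bytes as an unsigned big-endian integer."""
    return struct.unpack(">I", raw)[0]


# pkt_sim_period: logged as \0\0@\0 and reported as hex 4000
print(hex(decode_reg(b"\x00\x00@\x00")))

# dest_port: reported as hex ea60, i.e. the port 60000 given to tap-start
print(decode_reg(b"\x00\x00\xea\x60"))

# dest_ip: reported as hex a00001e, i.e. the receiver's address
print(socket.inet_ntoa(struct.pack(">I", 0x0A00001E)))
```

Note that the "Write ... to register" lines report the value in hex, so
"Write 80" is 128 decimal.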
>> >
>> > On Fri, Mar 18, 2011 at 2:10 AM, Jason Manley <jasonman...@gmail.com> 
>> > wrote:
>> > I think the version in your filesystem might be outdated. I'd suggest 
>> > downloading this tarball 
>> > http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/filesystem/filesystem_etch_2010-03-24_sd_shipping.tar.gz
>> >  and then manually updating the components listed below...
>> >
>> > Ensure you're running the latest of these things; any one that's 
>> > broken/old could cause a failure like what you're seeing:
>> >  * kernel: 
>> > http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/linux/uImage-fallback-20110303
>> >  * tcpborphserver: 
>> > http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/tcpborphserver/tcpborphserver2-2011-03-10-r3405-fansilent
>> >  * tgtap: 
>> > http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/tgtap/tgtap_2010-03-24
>> >
>> > Jason
>> >
>> > On 18 Mar 2011, at 01:28, Mark Wagner wrote:
>> >
>> > > Well, that depends.  Are you simply using the mmc card that came with 
>> > > your roach?  If so,  how long ago did it arrive?  If you're using NFS or 
>> > > USB stick, then you probably would have downloaded the tarball, in which 
>> > > case, the version would be in the name. Maybe someone else knows a better 
>> > > way to tell.
>> > >
>> > > You should be able to get the newest version here:
>> > >
>> > > http://casper.berkeley.edu/svn/trunk/roach/sw/binaries/filesystem/
>> > >
>> > > Also, have you tried running tgtap from the roach itself?  Once you've 
>> > > already loaded the design?  Find the process ID, and the name you're 
>> > > using for the device... for ex:
>> > >
>> > > tgtap -b /proc/959/hw/ioreg/ten_GbE -a 10.0.0.31 -t ten_GbE -m 
>> > > 02:02:0A:00:00:1F -p 33107
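
If you want to script that invocation, a sketch along these lines builds
the same argument list. The flags are copied from the example command
above and nothing more is assumed about tgtap, except that -t takes the
same device name that appears in the ioreg path, as it does in the
example. The function name is hypothetical, and the PID has to be that of
the running .bof process.

```python
def tgtap_argv(pid, device, ip, mac, port):
    """Build a tgtap argument list mirroring the example command."""
    ioreg = "/proc/%d/hw/ioreg/%s" % (pid, device)
    return ["tgtap", "-b", ioreg, "-a", ip, "-t", device, "-m", mac, "-p", str(port)]


argv = tgtap_argv(959, "ten_GbE", "10.0.0.31", "02:02:0A:00:00:1F", 33107)
print(" ".join(argv))
```

On the roach the resulting list could be handed to subprocess.call(argv).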
>> > >
>> > > Mark
>> > >
>> > >
>> > > On Thu, Mar 17, 2011 at 1:10 PM, Jon Losh <jl...@mit.edu> wrote:
>> > > How do I check what version of the filesystem I have?
>> > >
>> > >
>> > > On Thu, Mar 17, 2011 at 4:04 PM, Mark Wagner <mwag...@ssl.berkeley.edu> 
>> > > wrote:
>> > > Hi Jon,
>> > >
>> > > You're right, if you've just run the script, tgtap should be running on 
>> > > the roach.
>> > >
>> > > Which version of the roach filesystem are you using?  There were some 
>> > > changes to tgtap a while back.  If you're using an older version of the 
>> > > filesystem, tgtap may not be compatible with the latest version of tut2.
>> > >
>> > > Mark
>> > >
>> > >
>> > > On Thu, Mar 17, 2011 at 11:38 AM, Jon Losh <jl...@mit.edu> wrote:
>> > > Hi,
>> > >
>> > > So I've been trying to get 10gbe working for tutorial 2, and the script 
>> > > keeps failing when it tries to call fpga.tap_start(). I get the 
>> > > following error:
>> > >
>> > > Connecting to server 10.0.0.105...  ok
>> > >
>> > > ------------------------
>> > > Programming FPGA... ok
>> > > ---------------------------
>> > > Port 0 linkup:  True
>> > > Port 3 linkup:  True
>> > > ---------------------------
>> > > Configuring receiver core... FAILURE DETECTED. Log entries:
>> > > 10.0.0.105: Starting thread Thread-1
>> > > 10.0.0.105: #version poco-0.1
>> > > 10.0.0.105: #build-state poco-0.1775
>> > > 10.0.0.105: ?progdev tut2_2011_Mar_16_1608.bof
>> > >
>> > > 10.0.0.105: !progdev ok
>> > > 10.0.0.105: ?read gbe0_linkup 0 4
>> > >
>> > > 10.0.0.105: !read ok \0\0\0
>> > > 10.0.0.105: ?read gbe3_linkup 0 4
>> > >
>> > > 10.0.0.105: !read ok \0\0\0
>> > > None
>> > >
>> > > My sysadmin tells me that we have the latest versions of the ROACH 
>> > > kernel, tcpborphserver, and corr, so I'm not totally sure what's up. One 
>> > > of the threads in the mail archive said to try checking the processes 
>> > > running on the roach to see if tgtap was running; a "ps aux" gives:
>> > >
>> > > root@10:~# ps aux
>> > > USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
>> > > root         1  0.0  0.0   2428   744 ?        Ss   Oct01   0:02 init [2]
>> > > root         2  0.0  0.0      0     0 ?        S<   Oct01   0:00 
>> > > [kthreadd]
>> > > root         3  0.0  0.0      0     0 ?        S<   Oct01   0:00 
>> > > [ksoftirqd/0]
>> > > root         4  0.0  0.0      0     0 ?        S<   Oct01   0:03 
>> > > [events/0]
>> > > root         5  0.0  0.0      0     0 ?        S<   Oct01   0:00 
>> > > [khelper]
>> > > root        51  0.0  0.0      0     0 ?        S<   Oct01   0:00 
>> > > [kblockd/0]
>> > > root        61  0.0  0.0      0     0 ?        S<   Oct01   0:00 [khubd]
>> > > root        68  0.0  0.0      0     0 ?        S<   Oct01   0:00 [kmmcd]
>> > > root        88  0.0  0.0      0     0 ?        S    Oct01   0:01 
>> > > [bkexecd]
>> > > root        89  0.0  0.0      0     0 ?        S    Oct01   0:00 
>> > > [pdflush]
>> > > root        90  0.0  0.0      0     0 ?        S    Oct01   0:00 
>> > > [pdflush]
>> > > root        91  0.0  0.0      0     0 ?        S<   Oct01   0:00 
>> > > [kswapd0]
>> > > root        92  0.0  0.0      0     0 ?        S<   Oct01   0:00 [aio/0]
>> > > root       150  0.0  0.0      0     0 ?        S<   Oct01   0:00 
>> > > [mtdblockd]
>> > > root       196  0.0  0.0      0     0 ?        S<   Oct01   0:00 [krmond]
>> > > root       203  0.0  0.0      0     0 ?        S<   Oct01   0:00 
>> > > [rpciod/0]
>> > > root       208  0.0  0.0      0     0 ?        S<   Oct01   0:00 [mmcqd]
>> > > root       214  0.0  0.0      0     0 ?        SN   Oct01   0:00 
>> > > [jffs2_gcd_mtd3]
>> > > root       238  0.0  0.1   6700  1164 ?        Ss   Oct01   0:00 
>> > > /usr/sbin/sshd
>> > > ntp        247  0.0  0.1   5432  1312 ?        Ss   Oct01   0:03 
>> > > /usr/sbin/ntpd -p /var/run/ntpd.pid -u 101:103 -g -b -l /tmp/nt
>> > > root       256  0.0  0.0   1908   648 ?        S    Oct01   0:00 
>> > > tcpborphserver2
>> > > root       264  0.0  0.0   1788   576 ttyS0    Ss+  Oct01   0:00 
>> > > /sbin/getty -L ttyS0 115200 vt100
>> > > root       292  0.0  0.0   1632   304 ?        S    02:32   0:00 
>> > > /boffiles/tut2_2011_Mar_16_1608.bof
>> > > root       293  7.3  0.2  10000  2668 ?        Ss   02:38   0:00 sshd: 
>> > > root@pts/0
>> > > root       296  1.0  0.1   3552  1792 pts/0    Ss   02:38   0:00 -bash
>> > > root       300  0.0  0.0   2780   996 pts/0    R+   02:38   0:00 ps aux
>> > >
>> > > Should tgtap be one of the processes running under "command"? If so, it 
>> > > doesn't appear to be there. Any ideas on a good angle to attack this 
>> > > from?
>> > >
>> > >
>> > >
>> >
>> >
>>
>>
>
>
>
