Right, so after a day spent with Daviey and a bunch of 30MB pcap files,
we think we've figured this out.

The key exchange that failed happens here:


 7418   112.051626      10.55.200.99    10.55.200.1     TFTP    Read Request, File: amd64/generic/quantal/commissioning/initrd.gz, Transfer type: octet, tsize\000=0\000, blksize\000=1408\000
 7419   112.053444      10.55.200.1     10.55.200.99    TFTP    Option Acknowledgement, tsize\000=18988167\000, blksize\000=1400\000
 7420   113.053489      10.55.200.1     10.55.200.99    TFTP    Option Acknowledgement, tsize\000=18988167\000, blksize\000=1400\000
 7423   116.053542      10.55.200.1     10.55.200.99    TFTP    Option Acknowledgement, tsize\000=18988167\000, blksize\000=1400\000
 7425   116.832761      10.55.200.99    10.55.200.1     TFTP    Acknowledgement, Block: 0

The client requests the initrd, but something in the firmware or
pxelinux itself hangs for almost five seconds.  During that time, the
maas tftpd sends three option acknowledgements (OACKs) and then times
out.  By the time the client sends the ACK-0 to start the data transfer,
the session state has been discarded, and the tftpd just logs the
exception as an OOPS and waits for the next session to start.
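
To put numbers on that, here is a small sketch that works out the
intervals from the timestamps in the capture above.  The variable names
are mine; only the timestamps come from the pcap.

```python
# Timestamps (in seconds) copied from the capture lines above.
rrq = 112.051626                                # client read request
oacks = [112.053444, 113.053489, 116.053542]    # server option acknowledgements
client_ack = 116.832761                         # client ACK, Block: 0

# Gaps between consecutive OACKs sent by the server.
oack_gaps = [round(later - earlier, 3) for earlier, later in zip(oacks, oacks[1:])]

# How long the client took to answer, measured from its own request.
client_stall = round(client_ack - rrq, 3)

# How long after the last OACK the client's ACK finally arrived.
after_last_oack = round(client_ack - oacks[-1], 3)

print("gaps between OACKs:", oack_gaps)
print("client stall:", client_stall)
print("ACK arrived after last OACK by:", after_last_oack)
```

So the server resent the OACK after roughly one second and again after
roughly three more, and the client took about 4.8 seconds to reply.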

Incidentally, we spent a lot of time correlating requested/actual block
sizes between this tftpd and the HPA tftpd.  That turned out to be a red
herring, of course, but it seemed like a compelling lead for a while.
The solution did come from a comparison to tftpd-hpa, though.

In a few places in tftp/bootstrap.py and tftp/session.py there are
timeout tuples set to (1, 3, 7).  The iterable is consumed by the
watchdog code every time a packet is sent out, and once the iterable is
empty the watchdog tells the state machine to give up on the request.
We never dug too far into the units or where in the conversation these
things are read, but the fact that there are three times in the tuple
and that the daemon gave up after three ACKs is a compelling
coïncidence.
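
As a rough illustration of that behaviour, here is a minimal model of a
watchdog that consumes one timeout per packet sent and gives up when the
iterable runs dry.  This is not the python-tx-tftp code: the class and
function names are hypothetical, and it deliberately says nothing about
what the numbers mean as durations, since we never established that.

```python
class TimeoutWatchdog:
    """Hypothetical model: one timeout value is consumed per packet sent."""

    def __init__(self, timeouts):
        self._remaining = iter(timeouts)
        self.gave_up = False

    def packet_sent(self):
        """Return the next timeout, or None once the iterable is exhausted."""
        try:
            return next(self._remaining)
        except StopIteration:
            # Nothing left to wait on: tell the caller to abandon the request.
            self.gave_up = True
            return None


def sends_before_giving_up(timeouts):
    """Count how many packets go out before the watchdog gives up."""
    watchdog = TimeoutWatchdog(timeouts)
    sends = 0
    while watchdog.packet_sent() is not None:
        sends += 1
    return sends


print(sends_before_giving_up((1, 3, 7)))
print(sends_before_giving_up((1, 1, 1, 1, 1, 1)))
```

Under this model a three-element tuple allows three sends, which lines
up with the three OACKs in the capture.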

The tftpd-hpa code tries six times, waiting one second each:

    <Daviey> Spads: #define TIMEOUT 1000000         /* Default timeout (us) */
    <Daviey> #define TRIES   6               /* Number of attempts to send each packet */
    <Daviey> #define TIMEOUT_LIMIT ((1 << TRIES)-1)
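
For reference, this is just the arithmetic behind those three macros,
redone in Python.  The Python names are mine, and how tftpd-hpa actually
applies TIMEOUT_LIMIT isn't visible in the quoted lines, so this only
evaluates the constants.

```python
# Values copied from the quoted tftpd-hpa #defines.
TIMEOUT_US = 1000000                # default timeout, in microseconds
TRIES = 6                           # attempts to send each packet
TIMEOUT_LIMIT = (1 << TRIES) - 1    # same expression as the C macro

# Convert the default timeout to seconds.
timeout_seconds = TIMEOUT_US / 1_000_000

print("timeout per try (s):", timeout_seconds)
print("tries:", TRIES)
print("TIMEOUT_LIMIT:", TIMEOUT_LIMIT)
```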

Extending the tuple at line 346 of bootstrap.py solved this situation
for us, and the maas tftpd succeeded just as tftpd-hpa does.  In the end
we settled on:

    class RemoteOriginReadSession(TFTPBootstrap):
        """Bootstraps a L{ReadSession}, that was started remotely, - we've received
        a RRQ.

        """
        timeout = (1, 1, 1, 1, 1, 1)

...as this more closely mimics what Daviey found in the tftpd-hpa
source.

This timeout tuple appears in a few places, so any adjustments to this
code should probably be made to all of the timeout iterables in
bootstrap.py and session.py.

Finally, while it's true that this seems to be a workaround for a fault
on the client side (whether the fault is in firmware or in pxelinux.0 I
can't say), I believe it is also a regression against the precise maas,
which used cobbler.

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to python-tx-tftp in Ubuntu.
https://bugs.launchpad.net/bugs/1155556

Title:
  HP ProLiant DL380 G7 tftps kernel, but initrd tracebacks in tftp
  server.  DL380 G6 succeeds.
