The core dumps we've seen have been baffling (at least to me), but
once I was able to reproduce the problem, it was relatively trivial to
track it down: calling script_stop() causes the pending callback to
run, and if that callback is dhcp_drop() or dhcp_release(), then we're
doomed.

The change (just 47 lines) recognizes script invocation for those two
cases as a terminal request: the state machine is on its way down,
nothing can stop it (all roads lead to finished_smach() and state
machine shutdown), and there's no reason to launch a new script at
that point.

Internal:       http://zhadum.east/ws/carlsonj/6592155-fix/webrev/

External:       http://cr.opensolaris.org/~carlsonj/webrev-6592155/

For external folks (who inexplicably cannot see the 'Evaluation' field
in bugster!), here are the evaluation comments in that CR.


The problem that causes the dump has its roots in this code in
agent.c:

                if ((isv6 && !check_main_lif(dsmp, &msg.ifam, msglen)) ||
                    (!isv6 && !verify_lif(dsmp->dsm_lif))) {
                        if (dsmp->dsm_script_pid != -1)
                                script_stop(dsmp);
                        (void) script_start(dsmp, EVENT_DROP6, dhcp_drop, NULL,
                            NULL);
                        continue;
                }

There are two problems with that sequence.  The first one is obvious
-- it issues an IPv6 event even when IPv4 is in use -- but the second
one is subtle.

If drop or release events are already running when we get here, then
we're set up to call dhcp_drop() or dhcp_release() when the script is
done running.  Invoking script_stop() *causes* the callbacks to occur,
which tear down the state machine.  By the time we get to
script_start(), we're toast.  We're using a freed state machine
pointer in dsmp, and havoc breaks loose.
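
To make that sequence concrete, here is a minimal model of the hazard.
It is only a sketch: the model_* names are hypothetical and are not
the agent's real types or functions, and a flag stands in for the
actual free() so that the example stays well-defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the agent's state machine structure. */
typedef struct model_smach model_smach_t;
typedef void model_callback_t(model_smach_t *);

struct model_smach {
	int			ms_script_pid;	/* -1 when no script runs */
	model_callback_t	*ms_callback;	/* runs when script is done */
	bool			ms_torn_down;	/* stands in for free() */
};

/* Models dhcp_drop(): the real routine ends in state machine shutdown. */
void
model_drop(model_smach_t *msp)
{
	msp->ms_torn_down = true;
}

/* Models script_stop(): stopping the script fires the pending callback. */
void
model_script_stop(model_smach_t *msp)
{
	model_callback_t *cb = msp->ms_callback;

	assert(msp->ms_script_pid != -1);
	msp->ms_script_pid = -1;
	msp->ms_callback = NULL;
	if (cb != NULL)
		cb(msp);
}

/*
 * Mirrors the first half of the agent.c sequence.  Returns true when
 * the state machine is gone by the time the caller would go on to
 * start a new script.
 */
bool
model_stop_invalidates(model_smach_t *msp)
{
	if (msp->ms_script_pid != -1)
		model_script_stop(msp);
	return (msp->ms_torn_down);
}
```

With a drop script pending, model_stop_invalidates() returns true, and
any later use of the pointer corresponds to the use-after-free
described above.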

The fix is simple: just as in nuke_smach_list(), if the event that's
running already leads to the teardown, then do *nothing*.  As this
ends up being a duplicate of that code, we should put it in a single
common function.
*** (#1 of 2): 2008-04-15 17:27:39 EDT [EMAIL PROTECTED]
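
As a rough illustration of the "do nothing" rule in that evaluation,
the check might look like the sketch below.  This is not the code in
the webrev: every sketch_* name is hypothetical, the script machinery
is reduced to a pid and a callback pointer, and counters replace the
real teardown work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the agent's state machine structure. */
typedef struct sketch_smach sketch_smach_t;
typedef void sketch_callback_t(sketch_smach_t *);

struct sketch_smach {
	int			ss_script_pid;	/* -1 when no script runs */
	sketch_callback_t	*ss_callback;	/* pending completion callback */
	int			ss_teardowns;	/* times teardown has run */
	int			ss_others;	/* non-terminal callbacks run */
};

/* Terminal callbacks: in the real agent these end in shutdown. */
void sketch_drop(sketch_smach_t *ssp) { ssp->ss_teardowns++; }
void sketch_release(sketch_smach_t *ssp) { ssp->ss_teardowns++; }

/* A non-terminal callback, for contrast. */
void sketch_other(sketch_smach_t *ssp) { ssp->ss_others++; }

/* True if the running script's callback already leads to teardown. */
bool
sketch_teardown_pending(const sketch_smach_t *ssp)
{
	return (ssp->ss_script_pid != -1 &&
	    (ssp->ss_callback == sketch_drop ||
	    ssp->ss_callback == sketch_release));
}

/* Stopping a script fires its pending callback, as in the agent. */
void
sketch_script_stop(sketch_smach_t *ssp)
{
	sketch_callback_t *cb = ssp->ss_callback;

	assert(ssp->ss_script_pid != -1);
	ssp->ss_script_pid = -1;
	ssp->ss_callback = NULL;
	if (cb != NULL)
		cb(ssp);
}

/* Records a pretend script; the pid value 1 is arbitrary. */
void
sketch_script_start(sketch_smach_t *ssp, sketch_callback_t *cb)
{
	ssp->ss_script_pid = 1;
	ssp->ss_callback = cb;
}

/* Requests a drop; returns false if teardown is already underway. */
bool
sketch_request_drop(sketch_smach_t *ssp)
{
	if (sketch_teardown_pending(ssp))
		return (false);		/* do *nothing* */
	if (ssp->ss_script_pid != -1)
		sketch_script_stop(ssp);
	sketch_script_start(ssp, sketch_drop);
	return (true);
}
```

A second sketch_request_drop() on the same state machine returns false
without touching it, so the teardown callback runs only once, when the
first script finishes.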

It's easy to reproduce this problem.  You need an eventhook script --
preferably one that does some work and doesn't return immediately --
and then a user manipulating the interface so that DHCP gets
frightened away.  The combination causes (a) DHCP to drop the
interface and (b) subsequent routing socket events to come in which
trigger a second attempt at dropping.  That's fatal.

# cat > /etc/dhcp/eventhook
#!/bin/sh
sleep 3
# chmod +x /etc/dhcp/eventhook
# ifconfig nge0
nge0: flags=201004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4,CoS> mtu 1500 index 2
        inet 10.8.57.140 netmask ffffff00 broadcast 10.8.57.255
        ether 0:14:4f:3b:88:e2
# ifconfig nge0 10.0.0.1
# Apr 16 14:01:45 purple-140.East.Sun.COM genunix: NOTICE: core_log: dhcpagent[65] core dumped: /export/core/core.dhcpagent.65
# pstack /export/core/core.dhcpagent.65
core '/export/core/core.dhcpagent.65' of 65:    /sbin/dhcpagent
 fee8a930 strlen   (80476b8, 8047d24, 8046f10, 0) + 30
 feece240 vsnprintf (80471bf, 4b9, 8047694, 8047d24) + 70
 feebe006 vsyslog  (1e, 8047b00, 8047d24) + 3de
 fed64e80 dhcpmsg  (2, 8064ffc, 0) + b4
 0805d86d dhcp_drop (807a268, 0) + 1d
 0805f787 script_cleanup (807a268) + ab
 0805f81c script_exit (8079070, 4, 1, 4, 807a268) + 68
 fed41ff5 iu_handle_events (8079070, 8078fd8) + 12d
 08055d6f main     (1, 8047e5c, 8047e64) + 343
 0805549a _start   (1, 8047ef4, 0, 8047f04, 8047f19, 8047f31) + 7a

The fixed code no longer exhibits this problem, because it runs
through the drop/release logic just once.
*** (#2 of 2): 2008-04-16 14:43:57 EDT [EMAIL PROTECTED]

-- 
James Carlson, Solaris Networking              <[EMAIL PROTECTED]>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677