The core dumps we've seen have been baffling (at least to me), but once I was able to reproduce the problem, it was relatively trivial to track it down: calling script_stop() causes the pending script callback to run, and if that callback is dhcp_drop() or dhcp_release(), the state machine is torn down underneath the caller and we're doomed.
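To make the hazard concrete, here is a minimal, self-contained C model of it. This is only a sketch and not the dhcpagent source: the smach_t layout and every model_* name, along with teardown_pending(), are hypothetical stand-ins invented for illustration. It shows a stop routine that runs the pending callback (which may free the state machine), and a guard that treats an already-pending drop or release as terminal and does nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the agent's state machine (not the real type). */
typedef struct smach smach_t;
typedef void script_callback_t(smach_t *, void *);

struct smach {
	int			script_pid;	  /* -1 when no script runs */
	script_callback_t	*script_callback; /* runs when script ends */
	bool			*freed_flag;	  /* set when torn down */
};

/* Models dhcp_drop(): ends in state machine teardown. */
static void
model_drop(smach_t *sm, void *arg)
{
	(void) arg;
	*sm->freed_flag = true;
	free(sm);
}

/* Models dhcp_release(): also ends in teardown. */
static void
model_release(smach_t *sm, void *arg)
{
	model_drop(sm, arg);
}

/* A non-terminal callback: the state machine survives it. */
static void
model_renew_done(smach_t *sm, void *arg)
{
	(void) sm;
	(void) arg;
}

/* Models script_stop(): stopping the script runs the pending callback. */
static void
model_script_stop(smach_t *sm)
{
	script_callback_t *cb = sm->script_callback;

	sm->script_pid = -1;
	sm->script_callback = NULL;
	if (cb != NULL)
		cb(sm, NULL);	/* sm may already be freed on return */
}

/* Hypothetical common check: is a terminal event already in flight? */
static bool
teardown_pending(const smach_t *sm)
{
	assert(sm != NULL);
	return (sm->script_pid != -1 &&
	    (sm->script_callback == model_drop ||
	    sm->script_callback == model_release));
}

/* Returns true if a new drop was started, false if one is already underway. */
static bool
model_request_drop(smach_t *sm)
{
	if (teardown_pending(sm))
		return (false);		/* do nothing; sm is still valid */
	if (sm->script_pid != -1)
		model_script_stop(sm);	/* safe: callback is non-terminal */
	sm->script_pid = 1234;		/* pretend a drop script was started */
	sm->script_callback = model_drop;
	return (true);
}

/* Allocates a model state machine with no script running. */
static smach_t *
model_new(bool *freed_flag)
{
	smach_t *sm = calloc(1, sizeof (*sm));

	assert(sm != NULL);
	sm->script_pid = -1;
	sm->freed_flag = freed_flag;
	*freed_flag = false;
	return (sm);
}
```

In this model, a second model_request_drop() on the same state machine returns false and leaves the pointer untouched. Without the teardown_pending() check, the second request would call model_script_stop(), run model_drop(), and then write through a freed pointer, which is the same shape as the crash.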
The change (just 47 lines) recognizes script invocation for those two cases as a terminal request: the state machine is on its way down, nothing can stop it (all roads lead to finished_smach() and state machine shutdown), and there's no reason to launch a new script at that point.

Internal: http://zhadum.east/ws/carlsonj/6592155-fix/webrev/
External: http://cr.opensolaris.org/~carlsonj/webrev-6592155/

For external folks (who inexplicably cannot see the 'Evaluation' field in bugster!), here are the evaluation comments in that CR.

The problem that causes the dump has its roots in this code in agent.c:

    if ((isv6 && !check_main_lif(dsmp, &msg.ifam, msglen)) ||
        (!isv6 && !verify_lif(dsmp->dsm_lif))) {
            if (dsmp->dsm_script_pid != -1)
                    script_stop(dsmp);
            (void) script_start(dsmp, EVENT_DROP6, dhcp_drop, NULL, NULL);
            continue;
    }

There are two problems with that sequence. The first one is obvious -- it issues an IPv6 event even when IPv4 is in use -- but the second one is subtle.

If drop or release events are already running when we get here, then we're set up to call dhcp_drop() or dhcp_release() when the script is done running. Invoking script_stop() *causes* the callbacks to occur, which tear down the state machine.

By the time we get to script_start(), we're toast. We're using a freed state machine pointer in dsmp, and havoc breaks loose.

The fix is simple: just as in nuke_smach_list(), if the event that's running already leads to the teardown, then do *nothing*. As this ends up being a duplicate of that code, we should put it in a single common function.

*** (#1 of 2): 2008-04-15 17:27:39 EDT [EMAIL PROTECTED]

It's easy to reproduce this problem. You need an eventhook script -- preferably one that does some work and doesn't return immediately -- and then a user manipulating the interface so that DHCP gets frightened away. The combination causes (a) DHCP to drop the interface and (b) subsequent routing socket events to come in which trigger a second attempt at dropping.
That's fatal.

    # cat > /etc/dhcp/eventhook
    #!/bin/sh
    sleep 3
    # chmod +x /etc/dhcp/eventhook
    # ifconfig nge0
    nge0: flags=201004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4,CoS> mtu 1500 index 2
            inet 10.8.57.140 netmask ffffff00 broadcast 10.8.57.255
            ether 0:14:4f:3b:88:e2
    # ifconfig nge0 10.0.0.1
    # Apr 16 14:01:45 purple-140.East.Sun.COM genunix: NOTICE: core_log: dhcpagent[65] core dumped: /export/core/core.dhcpagent.65

    # pstack /export/core/core.dhcpagent.65
    core '/export/core/core.dhcpagent.65' of 65: /sbin/dhcpagent
     fee8a930 strlen (80476b8, 8047d24, 8046f10, 0) + 30
     feece240 vsnprintf (80471bf, 4b9, 8047694, 8047d24) + 70
     feebe006 vsyslog (1e, 8047b00, 8047d24) + 3de
     fed64e80 dhcpmsg (2, 8064ffc, 0) + b4
     0805d86d dhcp_drop (807a268, 0) + 1d
     0805f787 script_cleanup (807a268) + ab
     0805f81c script_exit (8079070, 4, 1, 4, 807a268) + 68
     fed41ff5 iu_handle_events (8079070, 8078fd8) + 12d
     08055d6f main (1, 8047e5c, 8047e64) + 343
     0805549a _start (1, 8047ef4, 0, 8047f04, 8047f19, 8047f31) + 7a

The fixed code no longer exhibits this problem, because it runs through the drop/release logic just once.

*** (#2 of 2): 2008-04-16 14:43:57 EDT [EMAIL PROTECTED]

-- 
James Carlson, Solaris Networking <[EMAIL PROTECTED]>
Sun Microsystems / 35 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
_______________________________________________
networking-discuss mailing list
[email protected]
