Understandable ;)

I was able to produce another panic; however, this time the backtrace is different. This round resulted in a GP fault:
(Unfortunately, this machine does not support serial, so I will have to reproduce it manually below)

[0]> $c
kmdb_enter+0xb()
debug_enter+0x37(...)
panicsys+0x3fd(...)
vpanic+0x15d()
panic+0x9c()
die+0xea(...)
trap+0x3d0(...)
dnet`dnet_send+0x7f(...)
dnet`dnet_m_tx+0x85(...)
dls`dls_tx+0x1d(...)
...

Steve

Garrett D'Amore wrote:
> Please review very carefully the mutexes. I suspect a potential out of
> order mutex allocation.
>
> It would be informative to post the stack backtrace from the panic
> here. (It's a lot easier to look at a back trace than to try to
> reproduce myself. :-)
>
>    -- Garrett
>
> Steven Stallion wrote:
>> Garrett/All,
>>
>> Looks like the fix wasn't much of a fix; I may have just stumbled on a
>> pre-existing issue.
>>
>> I erred on the safe side and updated the mutex handling in dnet_send
>> to be a bit more aggressive; the behavior matches precisely what
>> existed in dnet prior to any of my changes.
>>
>> The panic occurs less frequently, but the race condition still exists.
>>
>> Essentially, the panic is raised in mutex_vector_enter as a result of
>> trying to obtain a lock on dnetp->intrlock via mutex_enter.
>>
>> debug64 and debug32 builds do not exhibit this behavior (I am at a
>> loss as to why this is occurring in obj64 builds only).
>>
>> I suspect something is running afoul in the ISR (dnet_intr). A
>> possible solution is to move the code which kicks the transmitter up
>> into dnet_m_tx - this will result in a single interrupt per packet
>> chain rather than once per packet.
>>
>> At this point, I would like to have someone else verify that this is
>> indeed an issue (see below) before I do much more. The device I am
>> testing this on is known for being a bit difficult (Cogent chipset).
>>
>> To reproduce:
>>
>> Apply the dnet patch provided in the webrev and build an obj64 version
>> of the driver. Plumb the interface and start pushing traffic (I was
>> issuing 'rsh <host> find /' to the NICDRV client). A panic should
>> result within a couple of minutes.
>>
>> Any ideas?
>>
>> Steve
>>
>> Steven Stallion wrote:
>>> A quick update:
>>>
>>> Yesterday, while switching over to the auto nicdrv scripts Alan
>>> mentioned, I also changed over to the non-debug version of the driver
>>> and almost immediately ran into a panic.
>>>
>>> I managed to create an interesting race condition in dnet_send that
>>> only shows up in the non-debug version of the driver. I am a bit
>>> surprised since this really should affect the debug version equally,
>>> however I was never able to duplicate the condition.
>>>
>>> Long story short, I was attempting to be cute with my mutex handling.
>>>
>>> Everything is now back on track, and I should have a new set of
>>> NICDRV results later this evening.
>>>
>>> Steve
>>>
>>
>

--
Yet magic and hierarchy arise from the same source, and this source has
a null pointer.

Reference the NULL within NULL, it is the gateway to all wizardry.

_______________________________________________
driver-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/driver-discuss
