I have made some progress understanding the behavior, but I am nowhere close to a solution.
First of all, I think the fix in snv_127 for the PCIe cards does not address the real issue. It simply slows down transmission to the point where the bug doesn't show up. I fixed the card detection as masa suggested, but I commented out the trigger commands in the send() function. When I tested it, the driver worked fine. When I then reduced the counter iterations from 10 to 4, the bug appeared. This is a strong indication that the fix works by changing the timing of events rather than by the extra trigger commands restarting transmissions.

Something else I noticed is that the interface does come back after 5 minutes or so, once the watchdog expires. I tried reducing the value of the watchdog from 64K to 256, but that didn't change how fast the recovery is. I suspect that the watchdog will not trigger until we run out of transmit buffers.

Finally, the other weird thing is that while the card is stuck no packets seem to be received. Here is the kstat -m rge output from samples taken a few seconds apart while the driver was stuck. Look at rbytes. Also, as time went by and I took more samples, I started seeing norcvbuf go up. Any suggestions would be welcome.
First sample:

module: rge    instance: 0
name:   mac    class:    net
    adv_cap_1000fdx     1
    adv_cap_1000hdx     0
    adv_cap_100fdx      1
    adv_cap_100hdx      1
    adv_cap_100T4       0
    adv_cap_10fdx       1
    adv_cap_10gfdx      0
    adv_cap_10hdx       1
    adv_cap_asmpause    1
    adv_cap_autoneg     1
    adv_cap_pause       1
    adv_rem_fault       0
    align_errors        62207
    brdcstrcv           4308
    brdcstxmt           0
    cap_1000fdx         1
    cap_1000hdx         0
    cap_100fdx          1
    cap_100hdx          1
    cap_100T4           0
    cap_10fdx           1
    cap_10gfdx          0
    cap_10hdx           1
    cap_asmpause        1
    cap_autoneg         1
    cap_pause           1
    cap_rem_fault       0
    carrier_errors      0
    collisions          7452
    crtime              42351.518238685
    defer_xmts          0
    ex_collisions       0
    fcs_errors          0
    first_collisions    770
    ierrors             116109
    ifspeed             1000000000
    ipackets            59187601
    ipackets64          59187601
    jabber_errors       0
    link_asmpause       0
    link_autoneg        0
    link_duplex         2
    link_pause          0
    link_state          1
    link_up             1
    lp_cap_1000fdx      0
    lp_cap_1000hdx      0
    lp_cap_100fdx       0
    lp_cap_100hdx       0
    lp_cap_100T4        0
    lp_cap_10fdx        0
    lp_cap_10gfdx       0
    lp_cap_10hdx        0
    lp_cap_asmpause     0
    lp_cap_autoneg      0
    lp_cap_pause        0
    lp_rem_fault        0
    macrcv_errors       0
    macxmt_errors       0
    multi_collisions    6682
    multircv            324
    multixmt            0
    norcvbuf            0
    noxmtbuf            0
    obytes              1824883714
    obytes64            19004752898
    oerrors             120177
    oflo                0
    opackets            143451801
    opackets64          143451801
    promisc             0
    rbytes              151565154
    rbytes64            151565154
    runt_errors         0
    snaptime            42701.466558939
    sqe_errors          0
    toolong_errors      0
    tx_late_collisions  0
    uflo                0
    unknowns            0
    xcvr_addr           1
    xcvr_id             1886482
    xcvr_inuse          7

Second sample:

module: rge    instance: 0
name:   mac    class:    net
    adv_cap_1000fdx     1
    adv_cap_1000hdx     0
    adv_cap_100fdx      1
    adv_cap_100hdx      1
    adv_cap_100T4       0
    adv_cap_10fdx       1
    adv_cap_10gfdx      0
    adv_cap_10hdx       1
    adv_cap_asmpause    1
    adv_cap_autoneg     1
    adv_cap_pause       1
    adv_rem_fault       0
    align_errors        62207
    brdcstrcv           4317
    brdcstxmt           0
    cap_1000fdx         1
    cap_1000hdx         0
    cap_100fdx          1
    cap_100hdx          1
    cap_100T4           0
    cap_10fdx           1
    cap_10gfdx          0
    cap_10hdx           1
    cap_asmpause        1
    cap_autoneg         1
    cap_pause           1
    cap_rem_fault       0
    carrier_errors      0
    collisions          7452
    crtime              42351.518238685
    defer_xmts          0
    ex_collisions       0
    fcs_errors          0
    first_collisions    770
    ierrors             116109
    ifspeed             1000000000
    ipackets            59187622
    ipackets64          59187622
    jabber_errors       0
    link_asmpause       0
    link_autoneg        0
    link_duplex         2
    link_pause          0
    link_state          1
    link_up             1
    lp_cap_1000fdx      0
    lp_cap_1000hdx      0
    lp_cap_100fdx       0
    lp_cap_100hdx       0
    lp_cap_100T4        0
    lp_cap_10fdx        0
    lp_cap_10gfdx       0
    lp_cap_10hdx        0
    lp_cap_asmpause     0
    lp_cap_autoneg      0
    lp_cap_pause        0
    lp_rem_fault        0
    macrcv_errors       0
    macxmt_errors       0
    multi_collisions    6682
    multircv            324
    multixmt            0
    norcvbuf            0
    noxmtbuf            0
    obytes              1824884552
    obytes64            19004753736
    oerrors             120177
    oflo                0
    opackets            143451801
    opackets64          143451801
    promisc             0
    rbytes              151565154
    rbytes64            151565154
    runt_errors         0
    snaptime            42727.678195414
    sqe_errors          0
    toolong_errors      0
    tx_late_collisions  0
    uflo                0
    unknowns            0
    xcvr_addr           1
    xcvr_id             1886482
    xcvr_inuse          7

Third sample:

module: rge    instance: 0
name:   mac    class:    net
    adv_cap_1000fdx     1
    adv_cap_1000hdx     0
    adv_cap_100fdx      1
    adv_cap_100hdx      1
    adv_cap_100T4       0
    adv_cap_10fdx       1
    adv_cap_10gfdx      0
    adv_cap_10hdx       1
    adv_cap_asmpause    1
    adv_cap_autoneg     1
    adv_cap_pause       1
    adv_rem_fault       0
    align_errors        62207
    brdcstrcv           4317
    brdcstxmt           0
    cap_1000fdx         1
    cap_1000hdx         0
    cap_100fdx          1
    cap_100hdx          1
    cap_100T4           0
    cap_10fdx           1
    cap_10gfdx          0
    cap_10hdx           1
    cap_asmpause        1
    cap_autoneg         1
    cap_pause           1
    cap_rem_fault       0
    carrier_errors      0
    collisions          7452
    crtime              42351.518238685
    defer_xmts          0
    ex_collisions       0
    fcs_errors          0
    first_collisions    770
    ierrors             116109
    ifspeed             1000000000
    ipackets            59187622
    ipackets64          59187622
    jabber_errors       0
    link_asmpause       0
    link_autoneg        0
    link_duplex         2
    link_pause          0
    link_state          1
    link_up             1
    lp_cap_1000fdx      0
    lp_cap_1000hdx      0
    lp_cap_100fdx       0
    lp_cap_100hdx       0
    lp_cap_100T4        0
    lp_cap_10fdx        0
    lp_cap_10gfdx       0
    lp_cap_10hdx        0
    lp_cap_asmpause     0
    lp_cap_autoneg      0
    lp_cap_pause        0
    lp_rem_fault        0
    macrcv_errors       0
    macxmt_errors       0
    multi_collisions    6682
    multircv            324
    multixmt            0
    norcvbuf            13
    noxmtbuf            0
    obytes              1824886004
    obytes64            19004755188
    oerrors             120177
    oflo                0
    opackets            143451801
    opackets64          143451801
    promisc             0
    rbytes              151565154
    rbytes64            151565154
    runt_errors         0
    snaptime            42747.590540636
    sqe_errors          0
    toolong_errors      0
    tx_late_collisions  0
    uflo                0
    unknowns            0
    xcvr_addr           1
    xcvr_id             1886482
    xcvr_inuse          7
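
In case it helps anyone compare the samples without eyeballing them, here is a rough Python sketch that parses "name value" lines and reports per-counter deltas. It is only an illustration: parse_kstat and kstat_delta are names I made up (they are not part of kstat or the rge driver), it assumes one statistic per line, and only a handful of counters from the first and third samples are embedded.

```python
def parse_kstat(text):
    """Return {name: int} for every 'name value' line with an integer value."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank lines and headers like "module: rge  instance: 0"
        name, value = fields
        try:
            stats[name] = int(value)
        except ValueError:
            continue  # skip non-integer values such as crtime and snaptime
    return stats


def kstat_delta(before, after):
    """Return {name: after - before} for counters present in both samples."""
    return {name: after[name] - before[name] for name in before if name in after}


# Subset of the first sample (values copied from the post).
SAMPLE_1 = """\
module: rge    instance: 0
name:   mac    class:    net
brdcstrcv 4308
ipackets 59187601
norcvbuf 0
obytes 1824883714
opackets 143451801
rbytes 151565154
snaptime 42701.466558939
"""

# Subset of the third sample (values copied from the post).
SAMPLE_3 = """\
module: rge    instance: 0
name:   mac    class:    net
brdcstrcv 4317
ipackets 59187622
norcvbuf 13
obytes 1824886004
opackets 143451801
rbytes 151565154
snaptime 42747.590540636
"""

if __name__ == "__main__":
    delta = kstat_delta(parse_kstat(SAMPLE_1), parse_kstat(SAMPLE_3))
    for name in sorted(delta):
        print(f"{name:12s} {delta[name]:+d}")
```

On these values rbytes and opackets come out unchanged between the first and third samples, while norcvbuf goes from 0 to 13.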