Hi,
An update on this, with a patch included at the end.
I would very much appreciate feedback on it and on what the chances are
of getting this patch into the driver. I spent the last week tracking
down why this new problem showed up, and it looks like a hardware
problem that is not new. It apparently also shows up on some Apple
hardware, which I can't check as I don't have that hardware, but the
patch should work there too, as well as on the Sun ERI type of hardware
handled by the GEM driver. It looks like it would also help the newer
Gb Ethernet hardware.
I dug into it and found that both FreeBSD and NetBSD also suffer from
the same problem. Both created workarounds for the hardware problem and
even disabled the built-in checksum because of the same issue.
The documentation is available here:
http://www.sun.com/processors/manuals/ge.pdf
I adapted the NetBSD version here:
http://cvsweb.netbsd.org/bsdweb.cgi/~checkout~/src/sys/dev/ic/gem.c?rev=1.69&content-type=text/plain
and here:
http://cvsweb.netbsd.org/bsdweb.cgi/~checkout~/src/sys/dev/ic/gemvar.h?rev=1.15.16.1&content-type=text/plain
to OpenBSD to make it work.
I also tracked down when it started to show up in OpenBSD, and that
was with this commit:
http://www.openbsd.org/cgi-bin/cvsweb/src/sys/dev/ic/gem.c.diff?r1=1.85;r2=1.86;f=h
specifically the change around line 1135, this code here:
/*
* On some chip revisions GEM_MAC_RX_OVERFLOW happen often
* due to a silicon bug so handle them silently.
*/
if (rxstat & GEM_MAC_RX_OVERFLOW) {
ifp->if_ierrors++;
gem_init(ifp);
}
What actually triggers the new total loss of connectivity is the removal
of the driver reset when the hardware bug is triggered. Removing
"gem_init(ifp);" leaves the server inaccessible, and it only needs
ifconfig gem0 down
ifconfig gem0 up
to bring it back, and you do see
gem0: device timeout
in the logs as well, as I posted before.
I isolated it to this after compiling and testing over 50 kernels.
Now, instead of just sending a patch to put this back, I looked at what
other projects did and came up with the following patch, which is
actually more elegant. It only addresses the DMA issue: it does not
bring the link down/up while it is processed, nor does it empty the FIFO
of data that may still need to be sent.
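To make the intended control flow easier to follow, here is a minimal
userland sketch of it. It is not driver code: struct fake_gem and the
fake_* functions are hypothetical stand-ins for gem_softc, gem_init(),
gem_reset_rx() and gem_reset_rxdma(), and the fields only model the side
effects described above (link flap, TX FIFO flush).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the device state; not the real gem_softc. */
struct fake_gem {
	bool rx_reset_fails;	/* simulate the RX reset timing out */
	int tx_fifo_pending;	/* frames still waiting to be sent */
	int link_flaps;		/* times the link went down/up */
	int rx_resets;		/* RX-DMA-only resets performed */
	int full_resets;	/* full reinitialisations performed */
	unsigned long ierrors;	/* input error counter */
};

/* Stand-in for gem_init(): a full reset flaps the link and clears FIFOs. */
static void
fake_full_init(struct fake_gem *dev)
{
	dev->link_flaps++;
	dev->tx_fifo_pending = 0;
	dev->full_resets++;
}

/* Stand-in for gem_reset_rx(): returns non-zero on failure. */
static int
fake_reset_rx(struct fake_gem *dev)
{
	return (dev->rx_reset_fails ? 1 : 0);
}

/* Reset only the receive side; fall back to a full init on failure. */
static void
fake_reset_rxdma(struct fake_gem *dev)
{
	if (fake_reset_rx(dev) != 0) {
		fake_full_init(dev);
		return;
	}
	/* The real code re-arms the RX descriptors and RX config here. */
	dev->rx_resets++;
}

/* What the interrupt handler does when it sees an RX overflow. */
static void
fake_rx_overflow(struct fake_gem *dev)
{
	assert(dev != NULL);
	dev->ierrors++;
	fake_reset_rxdma(dev);
}
```

In this model, calling fake_rx_overflow() on a device with frames
pending leaves tx_fifo_pending and link_flaps untouched unless the RX
reset itself fails, in which case it degrades to the old full-reset
behaviour.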
NetBSD also has receive buffer handling which OpenBSD does not have, so
I didn't include that. It would mean many more changes that I am not
sure would be welcome, would definitely take considerably more time, and
would add a new feature as opposed to fixing the problem at hand.
My issue is that I can't identify what actually triggers the hardware
bug, so I can't provoke it to be 100% sure there isn't a better
solution. There might be one, or not; I can't say.
Nevertheless, this has now been running for 24 hours on a production
server without issue and with no loss of connectivity, which would
indicate success. Without it, I lose access to the server within 5 to
15 minutes, in some cases up to 60 minutes, but I always lose it
eventually. It all depends on when the hardware bug is triggered but,
like I said, I can't trigger the bug. I can only patch, test for some
time, hope the bug is triggered, and see the results.
In any case, I would very much appreciate feedback on this, and maybe
seeing it included in the tree if it is judged to be an acceptable
workaround for the hardware bug.
I am still running it and will obviously know more over time, but
everything indicates success so far. Based on Google research, this is
not a new issue; it just needs to be worked around, which I did below.
Thanks for your time.
Best,
Daniel
=========================================================
Index: dev/ic/gem.c
===================================================================
RCS file: /cvs/src/sys/dev/ic/gem.c,v
retrieving revision 1.87
diff -u -p -r1.87 gem.c
--- dev/ic/gem.c 27 Jan 2009 09:17:51 -0000 1.87
+++ dev/ic/gem.c 15 Mar 2009 10:21:58 -0000
@@ -94,6 +94,8 @@ int gem_bitwait(struct gem_softc *, bus
u_int32_t, u_int32_t);
void gem_reset(struct gem_softc *);
int gem_reset_rx(struct gem_softc *);
+void gem_reset_rxdma(struct gem_softc *sc);
+void gem_rx_common(struct gem_softc *sc);
int gem_reset_tx(struct gem_softc *);
int gem_disable_rx(struct gem_softc *);
int gem_disable_tx(struct gem_softc *);
@@ -558,6 +560,70 @@ gem_reset_rx(struct gem_softc *sc)
return (0);
}
+/*
+ * Reset the receiver DMA engine.
+ *
+ * Intended to be used in case of GEM_INTR_RX_TAG_ERR, GEM_MAC_RX_OVERFLOW
+ * etc in order to reset the receiver DMA engine only and not do a full
+ * reset which amongst others also downs the link and clears the FIFOs.
+ */
+void
+gem_reset_rxdma(struct gem_softc *sc)
+{
+ struct ifnet *ifp = &sc->sc_arpcom.ac_if;
+ bus_space_tag_t t = sc->sc_bustag;
+ bus_space_handle_t h = sc->sc_h1;
+ int i;
+
+ if (gem_reset_rx(sc) != 0) {
+ gem_init(ifp);
+ return;
+ }
+ for (i = 0; i < GEM_NRXDESC; i++)
+ if (sc->sc_rxsoft[i].rxs_mbuf != NULL)
+ GEM_UPDATE_RXDESC(sc, i);
+ GEM_CDSYNC(sc, BUS_DMASYNC_PREWRITE);
+ GEM_CDSYNC(sc, BUS_DMASYNC_PREREAD);
+
+ /* Reprogram Descriptor Ring Base Addresses */
+ /* NOTE: we use only 32-bit DMA addresses here. */
+ bus_space_write_4(t, h, GEM_RX_RING_PTR_HI, 0);
+ bus_space_write_4(t, h, GEM_RX_RING_PTR_LO, GEM_CDRXADDR(sc, 0));
+
+ /* Redo ERX Configuration */
+ gem_rx_common(sc);
+
+ /* Give the receiver a swift kick */
+ bus_space_write_4(t, h, GEM_RX_KICK, GEM_NRXDESC - 4);
+}
+
+/*
+ * Common RX configuration for gem_init() and gem_reset_rxdma().
+ */
+void
+gem_rx_common(struct gem_softc *sc)
+{
+ bus_space_tag_t t = sc->sc_bustag;
+ bus_space_handle_t h = sc->sc_h1;
+ u_int32_t v;
+
+ /* Encode Receive Descriptor ring size: four possible values */
+ v = gem_ringsize(GEM_NRXDESC /*XXX*/);
+
+ /* Enable DMA */
+ bus_space_write_4(t, h, GEM_RX_CONFIG,
+ v|(GEM_THRSH_1024<<GEM_RX_CONFIG_FIFO_THRS_SHIFT)|
+ (2<<GEM_RX_CONFIG_FBOFF_SHFT)|GEM_RX_CONFIG_RXDMA_EN|
+ (0<<GEM_RX_CONFIG_CXM_START_SHFT));
+ /*
+ * The following value is for an OFF Threshold of about 3/4 full
+ * and an ON Threshold of 1/4 full.
+ */
+ bus_space_write_4(t, h, GEM_RX_PAUSE_THRESH,
+ (3 * sc->sc_rxfifosize / 256) |
+ ( (sc->sc_rxfifosize / 256) << 12));
+ bus_space_write_4(t, h, GEM_RX_BLANKING, (6<<12)|6);
+}
/*
* Reset the transmitter
@@ -769,23 +835,7 @@ gem_init(struct ifnet *ifp)
bus_space_write_4(t, h, GEM_TX_KICK, 0);
/* step 10. ERX Configuration */
-
- /* Encode Receive Descriptor ring size: four possible values */
- v = gem_ringsize(GEM_NRXDESC /*XXX*/);
-
- /* Enable DMA */
- bus_space_write_4(t, h, GEM_RX_CONFIG,
- v|(GEM_THRSH_1024<<GEM_RX_CONFIG_FIFO_THRS_SHIFT)|
- (2<<GEM_RX_CONFIG_FBOFF_SHFT)|GEM_RX_CONFIG_RXDMA_EN|
- (0<<GEM_RX_CONFIG_CXM_START_SHFT));
- /*
- * The following value is for an OFF Threshold of about 3/4 full
- * and an ON Threshold of 1/4 full.
- */
- bus_space_write_4(t, h, GEM_RX_PAUSE_THRESH,
- (3 * sc->sc_rxfifosize / 256) |
- ( (sc->sc_rxfifosize / 256) << 12));
- bus_space_write_4(t, h, GEM_RX_BLANKING, (6<<12)|6);
+ gem_rx_common(sc);
/* step 11. Configure Media */
mii_mediachg(&sc->sc_mii);
@@ -1123,8 +1173,17 @@ gem_intr(void *v)
printf("%s: MAC rx fault, status %x\n",
sc->sc_dev.dv_xname, rxstat);
#endif
- if (rxstat & GEM_MAC_RX_OVERFLOW)
+ /*
+ * At least with GEM_SUN_GEM and some GEM_SUN_ERI
+ * revisions GEM_MAC_RX_OVERFLOW happen often due to a
+ * silicon bug so handle them silently. Moreover, it's
+ * likely that the receiver has hung so we reset it.
+ */
+ if (rxstat & GEM_MAC_RX_OVERFLOW) {
ifp->if_ierrors++;
+ gem_reset_rxdma(sc);
+ }
+
#ifdef GEM_DEBUG
else if (rxstat & ~(GEM_MAC_RX_DONE | GEM_MAC_RX_FRAME_CNT))
printf("%s: MAC rx fault, status %x\n",
Index: dev/ic/gemvar.h
===================================================================
RCS file: /cvs/src/sys/dev/ic/gemvar.h,v
retrieving revision 1.20
diff -u -p -r1.20 gemvar.h
--- dev/ic/gemvar.h 14 Dec 2008 21:31:50 -0000 1.20
+++ dev/ic/gemvar.h 15 Mar 2009 10:21:58 -0000
@@ -252,6 +252,10 @@ do { \
bus_dmamap_sync((sc)->sc_dmatag, (sc)->sc_cddmamap, \
GEM_CDRXOFF((x)), sizeof(struct gem_desc), (ops))
+#define GEM_CDSYNC(sc, ops) \
+ bus_dmamap_sync((sc)->sc_dmatag, (sc)->sc_cddmamap, \
+ 0, sizeof(struct gem_control_data), (ops))
+
#define GEM_CDSPSYNC(sc, ops) \
bus_dmamap_sync((sc)->sc_dmatag, (sc)->sc_cddmamap, \
GEM_CDSPOFF, GEM_SETUP_PACKET_LEN, (ops))
@@ -269,6 +273,18 @@ do { \
(((__m->m_ext.ext_size)<<GEM_RD_BUFSHIFT) \
& GEM_RD_BUFSIZE) | GEM_RD_OWN); \
GEM_CDRXSYNC((sc), (x), BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE); \
+} while (0)
+
+#define GEM_UPDATE_RXDESC(sc, x) \
+do { \
+ struct gem_rxsoft *__rxs = &sc->sc_rxsoft[(x)]; \
+ struct gem_desc *__rxd = &sc->sc_rxdescs[(x)]; \
+ struct mbuf *__m = __rxs->rxs_mbuf; \
+ \
+ __rxd->gd_flags = \
+ GEM_DMA_WRITE((sc), \
+ (((__m->m_ext.ext_size)<<GEM_RD_BUFSHIFT) \
+ & GEM_RD_BUFSIZE) | GEM_RD_OWN); \
} while (0)
#ifdef _KERNEL