Hi,

Here is an update on this, with a patch included at the end.

I would very much appreciate feedback on this, and on the chances of getting this patch included in the driver. I spent the last week tracking down why this new problem showed up. It looks like a hardware problem that is not new: it shows up on the Sun ERI type of hardware handled by the GEM driver, and apparently on some Apple hardware too, which I can't check as I don't have that hardware, but the patch should work for it as well. It looks like it would also help the newer Gb Ethernet chips.

I dug into it and found that both FreeBSD and NetBSD suffer from the same problem and have created workarounds for the hardware bug, and have even disabled the built-in checksum because of the same issue.

The documentation is available here:

http://www.sun.com/processors/manuals/ge.pdf

I adapted the NetBSD version here:

http://cvsweb.netbsd.org/bsdweb.cgi/~checkout~/src/sys/dev/ic/gem.c?rev=1.69&content-type=text/plain

and here:

http://cvsweb.netbsd.org/bsdweb.cgi/~checkout~/src/sys/dev/ic/gemvar.h?rev=1.15.16.1&content-type=text/plain

to OpenBSD to make it work.

I also tracked down when it started to show up in OpenBSD, and that was with this commit:

http://www.openbsd.org/cgi-bin/cvsweb/src/sys/dev/ic/gem.c.diff?r1=1.85;r2=1.86;f=h

specifically the change at line 1135, in this block:

        /*
         * On some chip revisions GEM_MAC_RX_OVERFLOW happen often
         * due to a silicon bug so handle them silently.
         */
        if (rxstat & GEM_MAC_RX_OVERFLOW) {
                ifp->if_ierrors++;
                gem_init(ifp);
        }

What actually triggers the new total loss of connectivity is the removal of the driver reset when the hardware bug is triggered. Removing "gem_init(ifp);" leaves the server inaccessible, and it only needs

ifconfig gem0 down
ifconfig gem0 up

to bring it back, and you do see

gem0: device timeout

in the logs as well, as I posted before.

I isolated it to this after compiling and testing over 50 kernels.

Now, instead of just sending a patch to put this back, I looked at what other projects did and came to the following patch, which is actually more elegant. It addresses only the DMA issue and does not bring the link down/up in the process, nor does it empty the FIFO of data that may still need to be sent.
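
To illustrate the recovery order described above, here is a minimal standalone sketch in C. It is only a model of the idea, not driver code: struct fake_nic, light_rx_reset(), full_reinit() and handle_rx_overflow() are hypothetical names standing in for the softc, gem_reset_rx(), gem_init() and the overflow branch of gem_intr(). The counters exist only so the behaviour can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver state; NOT the real struct gem_softc. */
struct fake_nic {
	int rx_reset_fails;	/* set to 1 to simulate a failed RX reset */
	int ierrors;		/* input error counter, like if_ierrors */
	int rxdma_resets;	/* successful RX-DMA-only recoveries */
	int link_resets;	/* full re-initialisations (link goes down/up) */
};

/* Models gem_reset_rx(): returns 0 on success, non-zero on failure. */
static int
light_rx_reset(struct fake_nic *n)
{
	return (n->rx_reset_fails ? 1 : 0);
}

/* Models gem_init(): the heavy path that also bounces the link. */
static void
full_reinit(struct fake_nic *n)
{
	n->link_resets++;
}

/*
 * Models the overflow branch: count the error, try the light reset
 * first, and fall back to the full re-init only if that fails.
 */
static void
handle_rx_overflow(struct fake_nic *n)
{
	assert(n != NULL);

	n->ierrors++;
	if (light_rx_reset(n) != 0) {
		full_reinit(n);
		return;
	}
	/* The real patch re-arms descriptors and RX config here. */
	n->rxdma_resets++;
}
```

Calling handle_rx_overflow() on a zeroed struct fake_nic takes the light path and leaves link_resets at 0; setting rx_reset_fails to 1 first takes the full re-init path instead.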

NetBSD also has receive buffer support which OpenBSD does not have, so I didn't include that: it would mean many more changes that I am not sure would be welcome, it would definitely take considerably more time, and it would add a new feature as opposed to fixing the problem at hand.

My issue is that I can't identify what actually triggers the hardware bug, so I can't provoke it to be 100% sure there isn't a better solution, which there might be, or not; I can't say.

Nevertheless, this has now been running for 24 hours on a production server without issue and with no loss of connectivity, which would indicate success. Without it, I lose access to the server within 5 to 15 minutes, in some cases up to 60 minutes, but I always lose it nevertheless. It all depends on when the hardware bug is triggered, but like I said, I can't trigger the bug; I can only patch, test for some time, hope the bug is triggered, and see the results.

In any case, I would very much appreciate feedback on this, and maybe seeing it included in the tree if it is judged to be an acceptable workaround for the hardware bug.

I am still running it and obviously will know more over time, but everything indicates success so far. Based on Google research, this is not a new issue; it just needs to be worked around, which I did below.

Thanks for your time.

Best,

Daniel


=========================================================
Index: dev/ic/gem.c
===================================================================
RCS file: /cvs/src/sys/dev/ic/gem.c,v
retrieving revision 1.87
diff -u -p -r1.87 gem.c
--- dev/ic/gem.c        27 Jan 2009 09:17:51 -0000      1.87
+++ dev/ic/gem.c        15 Mar 2009 10:21:58 -0000
@@ -94,6 +94,8 @@ int           gem_bitwait(struct gem_softc *, bus
                    u_int32_t, u_int32_t);
 void           gem_reset(struct gem_softc *);
 int            gem_reset_rx(struct gem_softc *);
+void           gem_reset_rxdma(struct gem_softc *sc);
+void           gem_rx_common(struct gem_softc *sc);
 int            gem_reset_tx(struct gem_softc *);
 int            gem_disable_rx(struct gem_softc *);
 int            gem_disable_tx(struct gem_softc *);
@@ -558,6 +560,70 @@ gem_reset_rx(struct gem_softc *sc)
        return (0);
 }

+/*
+ * Reset the receiver DMA engine.
+ *
+ * Intended to be used in case of GEM_INTR_RX_TAG_ERR, GEM_MAC_RX_OVERFLOW
+ * etc in order to reset the receiver DMA engine only and not do a full
+ * reset which amongst others also downs the link and clears the FIFOs.
+ */
+void
+gem_reset_rxdma(struct gem_softc *sc)
+{
+       struct ifnet *ifp = &sc->sc_arpcom.ac_if;
+       bus_space_tag_t t = sc->sc_bustag;
+       bus_space_handle_t h = sc->sc_h1;
+       int i;
+
+       if (gem_reset_rx(sc) != 0) {
+               gem_init(ifp);
+               return;
+       }
+       for (i = 0; i < GEM_NRXDESC; i++)
+               if (sc->sc_rxsoft[i].rxs_mbuf != NULL)
+                       GEM_UPDATE_RXDESC(sc, i);
+       GEM_CDSYNC(sc, BUS_DMASYNC_PREWRITE);
+       GEM_CDSYNC(sc, BUS_DMASYNC_PREREAD);
+
+       /* Reprogram Descriptor Ring Base Addresses */
+       /* NOTE: we use only 32-bit DMA addresses here. */
+       bus_space_write_4(t, h, GEM_RX_RING_PTR_HI, 0);
+       bus_space_write_4(t, h, GEM_RX_RING_PTR_LO, GEM_CDRXADDR(sc, 0));
+
+       /* Redo ERX Configuration */
+       gem_rx_common(sc);
+
+       /* Give the receiver a swift kick */
+       bus_space_write_4(t, h, GEM_RX_KICK, GEM_NRXDESC - 4);
+}
+
+/*
+ * Common RX configuration for gem_init() and gem_reset_rxdma().
+ */
+void
+gem_rx_common(struct gem_softc *sc)
+{
+       bus_space_tag_t t = sc->sc_bustag;
+       bus_space_handle_t h = sc->sc_h1;
+       u_int32_t v;
+
+       /* Encode Receive Descriptor ring size: four possible values */
+       v = gem_ringsize(GEM_NRXDESC /*XXX*/);
+
+       /* Enable DMA */
+       bus_space_write_4(t, h, GEM_RX_CONFIG,
+           v|(GEM_THRSH_1024<<GEM_RX_CONFIG_FIFO_THRS_SHIFT)|
+           (2<<GEM_RX_CONFIG_FBOFF_SHFT)|GEM_RX_CONFIG_RXDMA_EN|
+           (0<<GEM_RX_CONFIG_CXM_START_SHFT));
+       /*
+        * The following value is for an OFF Threshold of about 3/4 full
+        * and an ON Threshold of 1/4 full.
+        */
+       bus_space_write_4(t, h, GEM_RX_PAUSE_THRESH,
+           (3 * sc->sc_rxfifosize / 256) |
+           (   (sc->sc_rxfifosize / 256) << 12));
+       bus_space_write_4(t, h, GEM_RX_BLANKING, (6<<12)|6);
+}

 /*
  * Reset the transmitter
@@ -769,23 +835,7 @@ gem_init(struct ifnet *ifp)
        bus_space_write_4(t, h, GEM_TX_KICK, 0);

        /* step 10. ERX Configuration */
-
-       /* Encode Receive Descriptor ring size: four possible values */
-       v = gem_ringsize(GEM_NRXDESC /*XXX*/);
-
-       /* Enable DMA */
-       bus_space_write_4(t, h, GEM_RX_CONFIG,
-               v|(GEM_THRSH_1024<<GEM_RX_CONFIG_FIFO_THRS_SHIFT)|
-               (2<<GEM_RX_CONFIG_FBOFF_SHFT)|GEM_RX_CONFIG_RXDMA_EN|
-               (0<<GEM_RX_CONFIG_CXM_START_SHFT));
-       /*
-        * The following value is for an OFF Threshold of about 3/4 full
-        * and an ON Threshold of 1/4 full.
-        */
-       bus_space_write_4(t, h, GEM_RX_PAUSE_THRESH,
-           (3 * sc->sc_rxfifosize / 256) |
-           (   (sc->sc_rxfifosize / 256) << 12));
-       bus_space_write_4(t, h, GEM_RX_BLANKING, (6<<12)|6);
+       gem_rx_common(sc);

        /* step 11. Configure Media */
        mii_mediachg(&sc->sc_mii);
@@ -1123,8 +1173,17 @@ gem_intr(void *v)
                        printf("%s: MAC rx fault, status %x\n",
                            sc->sc_dev.dv_xname, rxstat);
 #endif
-               if (rxstat & GEM_MAC_RX_OVERFLOW)
+               /*
+                * At least with GEM_SUN_GEM and some GEM_SUN_ERI
+                * revisions GEM_MAC_RX_OVERFLOW happen often due to a
+                * silicon bug so handle them silently. Moreover, it's
+                * likely that the receiver has hung so we reset it.
+                */
+               if (rxstat & GEM_MAC_RX_OVERFLOW) {
                        ifp->if_ierrors++;
+                       gem_reset_rxdma(sc);
+               }
+
 #ifdef GEM_DEBUG
                else if (rxstat & ~(GEM_MAC_RX_DONE | GEM_MAC_RX_FRAME_CNT))
                        printf("%s: MAC rx fault, status %x\n",
Index: dev/ic/gemvar.h
===================================================================
RCS file: /cvs/src/sys/dev/ic/gemvar.h,v
retrieving revision 1.20
diff -u -p -r1.20 gemvar.h
--- dev/ic/gemvar.h     14 Dec 2008 21:31:50 -0000      1.20
+++ dev/ic/gemvar.h     15 Mar 2009 10:21:58 -0000
@@ -252,6 +252,10 @@ do {                                                       \
        bus_dmamap_sync((sc)->sc_dmatag, (sc)->sc_cddmamap,               \
            GEM_CDRXOFF((x)), sizeof(struct gem_desc), (ops))

+#define        GEM_CDSYNC(sc, ops)                                             \
+       bus_dmamap_sync((sc)->sc_dmatag, (sc)->sc_cddmamap,               \
+           0, sizeof(struct gem_control_data), (ops))
+
 #define        GEM_CDSPSYNC(sc, ops)                                           \
        bus_dmamap_sync((sc)->sc_dmatag, (sc)->sc_cddmamap,               \
            GEM_CDSPOFF, GEM_SETUP_PACKET_LEN, (ops))
@@ -269,6 +273,18 @@ do {                                                       \
                (((__m->m_ext.ext_size)<<GEM_RD_BUFSHIFT)              \
            & GEM_RD_BUFSIZE) | GEM_RD_OWN);                                \
        GEM_CDRXSYNC((sc), (x), BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE); \
+} while (0)
+
+#define GEM_UPDATE_RXDESC(sc, x)                                       \
+do {                                                                   \
+       struct gem_rxsoft *__rxs = &sc->sc_rxsoft[(x)];                  \
+       struct gem_desc *__rxd = &sc->sc_rxdescs[(x)];                   \
+       struct mbuf *__m = __rxs->rxs_mbuf;                          \
+                                                                       \
+       __rxd->gd_flags =                                            \
+           GEM_DMA_WRITE((sc),                                         \
+                       (((__m->m_ext.ext_size)<<GEM_RD_BUFSHIFT)      \
+                               & GEM_RD_BUFSIZE) | GEM_RD_OWN);    \
 } while (0)

 #ifdef _KERNEL
