> I am attempting to use NFS over RDMA (over infiniband), but there is some
> problem.  The NFS filesystem can be mounted on the client, and things
> will work for some time (can read, modify, etc. the files over the mount),
> but then (at a seemingly random time) the NFS server will dump these
> lines to the logs:
>
> [ 4380.623922] svcrdma: Error fast registering memory for xprt ffff8803307d7400
> [ 4413.343161] svcrdma: error fast registering xdr for xprt ffff8803319edc00

Digging into this further, it seems the Mellanox InfiniBand driver
could somehow be involved.  After adding some traces to the code, it's
clear that something like this is happening:

At some point sq_cq_reap() is called, and the call chain looks like this:

  sq_cq_reap()
    ib_poll_cq()
      mlx4_ib_poll_cq()
        mlx4_ib_poll_one()
          mlx4_ib_handle_error_cqe()
            - Which then sets wc->status to IB_WC_WR_FLUSH_ERR rather
              often, but the killer blow seems to be when
              IB_WC_REM_ACCESS_ERR is set.
    - Because of the error previously, sq_cq_reap sets the XPT_CLOSE
      flag

Then, sometime later:

  fast_reg_read_chunks()
    svc_rdma_fastreg()
      svc_rdma_send()
        svc_rdma_send()
          - XPT_CLOSE is set and hence -ENOTCONN is returned
    - Since svc_rdma_fastreg() had an error fast_reg_read_chunks() bails
      and the client seems to then hang.
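
The sequence above can be modelled in a few lines of userspace C.  This is
only a sketch of the control flow as I understand it from the traces: the
names (struct xprt_model, reap_completion, post_send, fast_register) and
the enum values are made up for illustration and are not the svcrdma or
mlx4 code.

```c
/* Userspace model of the traced sequence; NOT kernel code. */
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the work completion statuses seen in the traces. */
enum wc_status_model {
    WC_MODEL_SUCCESS,
    WC_MODEL_WR_FLUSH_ERR,
    WC_MODEL_REM_ACCESS_ERR
};

/* Hypothetical stand-in for the transport; it only models the XPT_CLOSE flag. */
struct xprt_model {
    bool close_flag;
};

/* Models sq_cq_reap(): a failed completion marks the transport for close. */
void reap_completion(struct xprt_model *x, enum wc_status_model status)
{
    if (status != WC_MODEL_SUCCESS)
        x->close_flag = true;
}

/* Models svc_rdma_send(): refuses to post once the close flag is set. */
int post_send(const struct xprt_model *x)
{
    if (x->close_flag)
        return -ENOTCONN;
    return 0;
}

/* Models svc_rdma_fastreg(): simply propagates the send result upward. */
int fast_register(const struct xprt_model *x)
{
    return post_send(x);
}
```

In this model, fast_register() returns 0 on a fresh transport, and returns
-ENOTCONN on every call after reap_completion() has seen a failed
completion, which matches the point where fast_reg_read_chunks() bails.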

I'd ask the InfiniBand guys: what do IB_WC_WR_FLUSH_ERR and
IB_WC_REM_ACCESS_ERR mean?  Are they drastic enough that they should
result in hangs?

nog.

> Both client and server are running the latest vanilla 2.6.34.1 kernel
> with Mellanox Connect-X infiniband cards.  If more information is
> required, please do ask.
>
> BTW: I can reproduce the problem quite reliably by running the bonnie++
> "benchmark" on the NFS mounted filesystem.
>
> nog.
>
> ps: I'm not subscribed to the list, please CC me on all replies.