RE: [PATCH] rsocket: Return ECONNRESET when socket in recv is disconnected

2014-10-09 Thread Hefty, Sean
> Upon successful completion, recv() shall return the length of the
> message in bytes. If no messages are available to be received and the
> peer has performed an orderly shutdown, recv() shall return 0.
> Otherwise, -1 shall be returned and errno set to indicate the error.
> 
> Errors
> 
> The recv() function shall fail if:
> 
> ECONNRESET
>  A connection was forcibly closed by a peer.
> ==
> 
> So the current behavior of returning 0 is wrong as there was no orderly
> shutdown performed on the other end before a close() is issued [
> http://stackoverflow.com/questions/5879560/how-can-i-cause-an-econnreset-
> in-recv-from-a-client

The patch is incorrect then.  It always returns ECONNRESET.  Rsockets does not 
distinguish between the app calling shutdown versus close.  Because everything 
is in user space, the disconnect message itself comes from the app.  The rclose 
call invokes rshutdown, otherwise the remote side will may never know that the 
peer is gone (which can happen if the app exits without calling rclose).

I'm inclined to just leave the current behavior as is.  I don't see why a real 
app would care if close performed an orderly shutdown or not, as this errors on 
the side of acting better than expected.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] rsocket: Return ECONNRESET when socket in recv is disconnected

2014-10-09 Thread Sreedhar Kodali

Hi Jason & Sean,

POSIX recv() manual page describes behavior as follows (reproduced here 
for convenience) -


http://linux.die.net/man/3/recv

==
Return Value

Upon successful completion, recv() shall return the length of the 
message in bytes. If no messages are available to be received and the 
peer has performed an orderly shutdown, recv() shall return 0. 
Otherwise, -1 shall be returned and errno set to indicate the error.


Errors

The recv() function shall fail if:

ECONNRESET
A connection was forcibly closed by a peer.
==

So the current behavior of returning 0 is wrong as there was no orderly 
shutdown performed on the other end before a close() is issued [ 
http://stackoverflow.com/questions/5879560/how-can-i-cause-an-econnreset-in-recv-from-a-client 
].


Thank You.

- Sreedhar


On 2014-10-09 22:14, Hefty, Sean wrote:

Thanks for the fix.  Now, we are able to see the similar behavior
between tcp/ip and rsockets when the other end is suddenly closed
while waiting on data.


Can you read and respond to Jason's comments on this patch?


That isn't the whole story though - you only get ECONNRESET in some
cases, and it is OS dependent.



stackoverflow suggets:



http://stackoverflow.com/questions/2974021/what-does-econnreset-mean-in-the-context-of-an-af-local-socket



Which looks reasonable to me and matches my experience with socket
programming on Linux. The Steven's book might have some authoritative
clarifications on the subject as well.


Based on this information, it's not clear that the current behavior of
returning 0 is wrong.

- Sean


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ib_ipoib: CSUM support in connected mode

2014-10-09 Thread Yuval Shaia
On Wed, Oct 08, 2014 at 02:34:07PM -0400, Chuck Lever wrote:
> Hi Yuval-
> 
> On Oct 4, 2014, at 7:45 AM, Yuval Shaia  wrote:
> 
> > This enhancement suggest the usage of IB CRC instead of CSUM in IPoIB CM.
> > IPoIB Connected Mode driver uses RC (Reliable Connection) which guarantees 
> > the
> > corruption free delivery of the packet.
> 
> Some naïve questions:
> 
> Is this interoperable with other IPoIB implementations, such as Solaris?
Yes, part of the implementation deals with how driver interact with legacy peer 
and this includes Solaris as well as legacy Linux.
> 
> Are there implications for IPv6, where checksumming is not optional?
> Maybe I misunderstood what is being replaced.
This is not a removal of checksumming, just replacement with other (and better) 
mechanism.
The good part is that no module will be affected by it as the driver will fake 
checksumming to upper networking layer.
> 
> What happens if IPoIB uses UD instead of RC?
You mean CM-UD? 
> 
> When IPoIB traffic is routed off the IB fabric, isn’t end-to-end checksumming
> lost because the router node has to fabricate an IP checksum that wasn’t
> provided by the sender?
Yes, it is lost.
This feature should be disabled in such setups.
This feature can be used only in LAN.
> 
> 
> 
> > InfiniBand uses 32b CRC which provides stronger data integrity protection
> > compare to 16b IP Checksum.
> > So, there is no added value that IP Checksum provides in the IB world.
> > The proposal is to tell that network stack that IPoIB-CM supports IP 
> > Checksum
> > offload.
> > This enables Linux to save the time of checksum calculation of IPoIB CM 
> > packets.
> > Network sends the IP packet without adding the IP Checksum to the header.
> > On the receive side, IPoIB driver again tells the network stack that IP
> > Checksum is good for the incoming packets and network stack avoids the IP
> > Checksum calculations.
> > During connection establishment the driver determine if the other end 
> > supports
> > IB CRC as checksum.
> > This is done so driver will be able to calculate checksum before transmiting
> > the packet in case the other end does not support this feature.
> > A support for fragmented skb is added to transmit path.
> > 
> > Signed-off-by: Yuval Shaia 
> > ---
> > drivers/infiniband/ulp/ipoib/ipoib.h  |   25 ++
> > drivers/infiniband/ulp/ipoib/ipoib_cm.c   |  117 
> > 
> > drivers/infiniband/ulp/ipoib/ipoib_ib.c   |3 +-
> > drivers/infiniband/ulp/ipoib/ipoib_main.c |   19 -
> > 4 files changed, 143 insertions(+), 21 deletions(-)
> > 
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
> > b/drivers/infiniband/ulp/ipoib/ipoib.h
> > index 0ed9aed..6fedf83 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib.h
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h
> > @@ -100,6 +100,7 @@ enum {
> > IPOIB_FLAG_AUTO_MODER = 13, /*indicates moderation is running*/
> > /*indicates if event handler was registered*/
> > IPOIB_FLAG_EVENTS_REGISTERED= 14,
> > +   IPOIB_FLAG_CSUM   = 15,
> > 
> > IPOIB_MAX_BACKOFF_SECONDS = 16,
> > 
> > @@ -192,9 +193,28 @@ struct ipoib_pmtu_update {
> > 
> > struct ib_cm_id;
> > 
> > +#define IPOIB_CM_PROTO_SIG 0x2211
> > +#define IPOIB_CM_PROTO_VER (1UL << 12)
> > +
> > +static inline int ipoib_cm_check_proto_sig(u16 proto_sig)
> > +{
> > +   return proto_sig & IPOIB_CM_PROTO_SIG;
> > +};
> > +
> > +static inline int ipoib_cm_check_proto_ver(u16 caps)
> > +{
> > +   return caps & IPOIB_CM_PROTO_VER;
> > +};
> > +
> > +enum ipoib_cm_data_caps {
> > +   IPOIB_CM_CAPS_IBCRC_AS_CSUM = 1UL << 0,
> > +};
> > +
> > struct ipoib_cm_data {
> > __be32 qpn; /* High byte MUST be ignored on receive */
> > __be32 mtu;
> > +   __be16 sig; /* must be IPOIB_CM_PROTO_SIG */
> > +   __be16 caps; /* 4 bits proto ver and 12 bits capabilities */
> > };
> > 
> > /*
> > @@ -241,6 +261,7 @@ struct ipoib_cm_rx {
> > int recv_count;
> > u32 qpn;
> > int index; /* For ring counters */
> > +   u16 caps;
> > };
> > 
> > struct ipoib_cm_tx {
> > @@ -255,6 +276,7 @@ struct ipoib_cm_tx {
> > unsigned tx_tail;
> > unsigned longflags;
> > u32  mtu;
> > +   u16  caps;
> > };
> > 
> > struct ipoib_cm_rx_buf {
> > @@ -558,6 +580,8 @@ void ipoib_del_neighs_by_gid(struct net_device *dev, u8 
> > *gid);
> > extern struct workqueue_struct *ipoib_workqueue;
> > extern struct workqueue_struct *ipoib_auto_moder_workqueue;
> > 
> > +extern int cm_ibcrc_as_csum;
> > +
> > /* functions */
> > 
> > int ipoib_poll(struct napi_struct *napi, int budget);
> > @@ -613,6 +637,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int 
> > flush);
> > 
> > void ipoib_mcast_dev_down(struct net_device *dev);
> > void ipoib_mcast_dev_flush(struct net_device *dev);
> > +int ipoib_dma_map_tx(struct ib_device *ca, struct ipoib_tx_buf *tx_r

Re: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API

2014-10-09 Thread Or Gerlitz
On Tue, Oct 7, 2014, Sagi Grimberg  wrote:
[...]
>  enum ib_signature_prot_cap {
> @@ -182,6 +183,7 @@ struct ib_device_attr {
> int max_srq_wr;
> int max_srq_sge;
> unsigned intmax_fast_reg_page_list_len;
> +   unsigned intmax_indir_reg_mr_list_len;

The indirection registration list is basically made of  struct ib_sge
objects which are posted on a send-like work-request, any reason  to
have a dedicated dev cap attribute for that and not use the already
existing one (max_sge)?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] rsocket: Return ECONNRESET when socket in recv is disconnected

2014-10-09 Thread Hefty, Sean
> Thanks for the fix.  Now, we are able to see the similar behavior
> between tcp/ip and rsockets when the other end is suddenly closed
> while waiting on data.

Can you read and respond to Jason's comments on this patch?

> That isn't the whole story though - you only get ECONNRESET in some
> cases, and it is OS dependent.
 
> stackoverflow suggets:

> http://stackoverflow.com/questions/2974021/what-does-econnreset-mean-in-the-context-of-an-af-local-socket

> Which looks reasonable to me and matches my experience with socket
> programming on Linux. The Steven's book might have some authoritative
> clarifications on the subject as well.

Based on this information, it's not clear that the current behavior of 
returning 0 is wrong.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


NFSoRDMA developers bi-weekly meeting minutes (10/9)

2014-10-09 Thread Shirley Ma
Attendees:

Yan Burman (Mellanox)
Rupert Dance (Soft Forge)
Steve Dickson (Red Hat)
Chuck Lever (Oracle)
Doug Ledford (RedHat)
Shirley Ma (Oracle)
Sachin Prabhu (RedHat)
Devesh Sharma (Emulex)
Anna Schumaker (Net App)
Steve Wise (OpenGridComputing, Chelsio)

Moderator:
Shirley Ma (Oracle)

NFSoRDMA developers bi-weekly meeting is to help organizing NFSoRDMA 
development and test effort from different resources to speed up NFSoRDMA 
upstream kernel work and NFSoRDMA diagnosing/debugging tools development. 
Hopefully the quality of NFSoRDMA upstream patches can be improved by being 
tested with a quorum of HW vendors.

Today's meeting notes: (Chuck, SteveD, Yan)
1. Follow-ups for Engineer resource allocation to speed up IB stack review 
process: in progress.

2. OFED update and bug status (Rupert):
OFED 3.12-1-RC3 is expected to be released next Monday after an update for 
infinipatch-psm which prevented RHEL 7.0 build.
Thanks for Steve Wise to resolve some NFSoRDMA issues. Two outstanding bugs:

http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2489
"Bug 2489 - System crash during cable pull test with Active NFS-RDMA share"
Bug 2489 is outstanding but is resolved in 3.17 RC6, need to bisect the right 
patch for OFED 3.12-1.

http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2507
The panic stack reported it's backport patch issue. To confirm that Steve Wise 
suggested to reproduce it with upstream 3.12 kernel. Devesh will build 3.12 
kernel and test it.
 
http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2502
Bug 2502 is RDS bug, which will talk to Oracle directly.

3. mainline kernel update and bug status: (Chuck, SteveW, Devesh)

https://bugzilla.linux-nfs.org/show_bug.cgi?id=269
Bug 269 xfstests generic/113 on NFSv4.1 causes connection drops
It was found in ConnectX-2. The number of outstanding completions is limited to 
87. When exceeding this, post_send will fail. The SQ depth is 256. Whether this 
is a limitation on ConnectX-2? It's better to try on different HCAs to see the 
difference.

https://bugzilla.linux-nfs.org/show_bug.cgi?id=271
Steve Wise is making progress on this one.

Devesh suggested to use same approach on client side to reduce Server side 
signaling, he filed a bug to track this to see any performance difference.
https://bugzilla.linux-nfs.org/show_bug.cgi?id=272

4. Bake-a-thon NFSoRDMA conclave update: Chuck, SteveD gave update on 10/8 
Linux Enterprise NFSoRDMA
-- RHEL7.0 NFSoRDMA server is disabled, we still couldn't locate resource for 
NFSoRDMA server maintainer. NFSoRDMA client is supported.
-- NFSoRDMA test strategies and utilities: add NFSoRDMA test
-- NFSoRDMA future directions and features

5. 3.17-rc5 NFSoRDMA performance discussion: (Shirley)
Shirley has presented IOZone WRITE NFS performance numbers for NFSoIPoIB, 
NFSoRDMA FMR and FRWR mode on connectX-2. The discuss focus was on NFS WRITE. 
There are couple areas need to do further research:
-- NFS WRITE overhead: what limits NFS WRITE performance in NFS protocol?
-- Unexpected latency increase and BW drop in large I/O write
-- How much gain from IPoIB-CM SG, cheksum offloading patch
-- Yan suggested the test to move to ConnectX-3 since ConnectX-2 is out.

Feel free to reply here for anything missing. See you on Oct.23.

10/9/2014
@7:30am PDT
@8:30am MDT
@9:30am CDT
@10:30am EDT
@Bangalore @8:00pm
@Israel @5:30pm

Duration: 1 hour

Call-in number:
Israel: +972 37219638
Bangalore: +91 8039890080 (180030109800)
France  Colombes +33 1 5760 +33 176728936
US: 8666824770,  408-7744073

Conference Code: 2308833
Passcode: 63767362 (it's NFSoRDMA, in case you couldn't remember)

Thanks everyone for joining the call and providing valuable inputs/work to the 
community to make NFSoRDMA better.

Shirley
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] rsocket: Return ECONNRESET when socket in recv is disconnected

2014-10-09 Thread Sreedhar Kodali

Hi Sean,

Thanks for the fix.  Now, we are able to see the similar behavior
between tcp/ip and rsockets when the other end is suddenly closed
while waiting on data.

Thank You.

- Sreedhar

On 2014-10-09 03:49, sean.he...@intel.com wrote:

From: Sean Hefty 

The following behavior difference was reported between rsockets and
sockets:

when remote end is suddenly closed, recv() waiting on it receives
tcp/ip => ECONNRESET error
rsockets => 0 value

Update rrecv() to return ECONNRESET if no data is available and
the connection is no longer readable.

Problem reported by: srkod...@linux.vnet.ibm.com

Signed-off-by: Sean Hefty 
---
 src/rsocket.c |   10 ++
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/rsocket.c b/src/rsocket.c
index 074fb18..7c7af52 100644
--- a/src/rsocket.c
+++ b/src/rsocket.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2008-2013 Intel Corporation.  All rights reserved.
+ * Copyright (c) 2008-2014 Intel Corporation.  All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -2394,7 +2394,7 @@ ssize_t rrecv(int socket, void *buf, size_t len,
int flags)
struct rsocket *rs;
size_t left = len;
uint32_t end_size, rsize;
-   int ret;
+   int ret = 0;

rs = idm_at(&idm, socket);
if (rs->type == SOCK_DGRAM) {
@@ -2419,9 +2419,11 @@ ssize_t rrecv(int socket, void *buf, size_t
len, int flags)
  rs_conn_have_rdata);
if (ret)
break;
+
+   if (!(rs->state & rs_readable))
+   ret = ERR(ECONNRESET);
}

-   ret = 0;
if (flags & MSG_PEEK) {
left = len - rs_peek(rs, buf, left);
break;
@@ -2456,7 +2458,7 @@ ssize_t rrecv(int socket, void *buf, size_t len,
int flags)
} while (left && (flags & MSG_WAITALL) && (rs->state & rs_readable));

fastlock_release(&rs->rlock);
-   return ret ? ret : len - left;
+   return (ret && left == len) ? ret : len - left;
 }

 ssize_t rrecvfrom(int socket, void *buf, size_t len, int flags,


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html