Re: [PATCH 20/22] IB/iser: Support up to 8MB data transfer in a single command
On Aug 2, 2015, at 4:01 AM, Sagi Grimberg sa...@dev.mellanox.co.il wrote:

+static void
+iser_calc_scsi_params(struct iser_conn *iser_conn,
+		      unsigned int max_sectors)
+{
+	struct iser_device *device = iser_conn->ib_conn.device;
+	unsigned short sg_tablesize, sup_sg_tablesize;
+
+	sg_tablesize = DIV_ROUND_UP(max_sectors * 512, SIZE_4K);
+	sup_sg_tablesize = min_t(unsigned, ISCSI_ISER_MAX_SG_TABLESIZE,
+				 device->dev_attr.max_fast_reg_page_list_len);
+
+	if (sg_tablesize > sup_sg_tablesize) {
+		sg_tablesize = sup_sg_tablesize;
+		iser_conn->scsi_max_sectors = sg_tablesize * SIZE_4K / 512;
+	} else {
+		iser_conn->scsi_max_sectors = max_sectors;
+	}

Why SIZE_4K and not PAGE_SIZE?

Yes, I'll change that to PAGE_SIZE. Thanks.

Would non-4KB pages (e.g. PPC 64KB) be an issue? Would this work between hosts with different page sizes?

Scott

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/3] scsi_cmnd: Introduce scsi_transfer_length helper
On Jun 26, 2014, at 10:55 AM, James Bottomley james.bottom...@hansenpartnership.com wrote:

On Thu, 2014-06-26 at 16:53 +0200, Bart Van Assche wrote: On 06/11/14 11:09, Sagi Grimberg wrote:

+	return xfer_len + (xfer_len >> ilog2(sector_size)) * 8;

Sorry that I just noticed this now, but why is a shift-right and ilog2() used in the above expression instead of just dividing the transfer length by the sector size?

It's a performance thing. Division is really slow on most CPUs. However, we know the divisor is a power of two so we re-express the division as a shift, which the processor can do really fast.

James

I have done this in the past as well, but have you benchmarked it? Compilers typically do the right thing in this case (i.e. replace division with shift).

Scott
Re: [PATCH v2 1/3] scsi_cmnd: Introduce scsi_transfer_length helper
On Jun 26, 2014, at 12:38 PM, James Bottomley james.bottom...@hansenpartnership.com wrote:

On June 26, 2014 11:41:48 AM EDT, Atchley, Scott atchle...@ornl.gov wrote: On Jun 26, 2014, at 10:55 AM, James Bottomley james.bottom...@hansenpartnership.com wrote: On Thu, 2014-06-26 at 16:53 +0200, Bart Van Assche wrote: On 06/11/14 11:09, Sagi Grimberg wrote:

+	return xfer_len + (xfer_len >> ilog2(sector_size)) * 8;

Sorry that I just noticed this now, but why is a shift-right and ilog2() used in the above expression instead of just dividing the transfer length by the sector size?

It's a performance thing. Division is really slow on most CPUs. However, we know the divisor is a power of two so we re-express the division as a shift, which the processor can do really fast.

James

I have done this in the past as well, but have you benchmarked it? Compilers typically do the right thing in this case (i.e. replace division with shift).

The compiler can only do that for values which are reducible to constants at compile time. This is a runtime value; the compiler has no way of deducing that it will be a power of 2.

James

You're right, I should have said runtime. However, gcc on Intel seems to choose the right algorithm at runtime. On a trivial app with -O0, I see the same performance for shift and division if the divisor is a power of two. I see a ~38% penalty if the divisor is not a power of 2. With -O3, shift is faster than division by ~17% when the divisor is a power of two.

Scott
Re: Slow performance with librspreload.so
On Aug 30, 2013, at 1:38 PM, Hefty, Sean sean.he...@intel.com wrote:

Another strange issue:

$ sudo LD_PRELOAD=/usr/local/lib/rsocket/librspreload.so iperf -c 172.17.0.2
Client connecting to 172.17.0.2, TCP port 5001
TCP window size: 128 KByte (default)

Increasing the window size may improve the results. E.g. on my systems I go from 17.7 Gbps at 128 KB to 24.3 Gbps for 512 KB.

[  3] local 172.17.0.1 port 57926 connected with 172.17.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  12.2 GBytes  10.4 Gbits/sec

$ iperf -c 172.17.0.2
Client connecting to 172.17.0.2, TCP port 5001
TCP window size: 648 KByte (default)
[  3] local 172.17.0.1 port 58113 connected with 172.17.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  14.5 GBytes  12.5 Gbits/sec

rsockets slower than IPoIB?

This is surprising to me - just getting 12.5 Gbps out of ipoib is surprising. Does iperf use sendfile()?

I have a pair of nodes connected by QDR via a switch. Using normal IPoIB, a single netperf can reach 18.4 Gb/s if I bind to the same core that the IRQ handler is bound to. With four concurrent netperfs, I can reach 23 Gb/s. This is in datagram mode. Connected mode is slower. I have not tried rsockets on these nodes.

Scott

My results with iperf (version 2.0.5) over ipoib (default configuration) vary considerably based on the TCP window size. (Note that this is a 40 Gbps link.) Results summarized:

TCP window size: 27.9 KByte (default)
[  3]  0.0-10.0 sec  12.8 GBytes  11.0 Gbits/sec
TCP window size:  416 KByte (WARNING: requested 500 KByte)
[  3]  0.0-10.0 sec  8.19 GBytes  7.03 Gbits/sec
TCP window size:  250 KByte (WARNING: requested 125 KByte)
[  3]  0.0-10.0 sec  4.99 GBytes  4.29 Gbits/sec

I'm guessing that there are some settings I can change to increase the ipoib performance on my systems.

Using rspreload, I get:

LD_PRELOAD=/usr/local/lib/rsocket/librspreload.so iperf -c 192.168.0.103
TCP window size:  512 KByte (default)
[  3]  0.0-10.0 sec  28.3 GBytes  24.3 Gbits/sec

It seems that ipoib bandwidth should be close to rsockets, similar to what you see. I also don't understand the effect that the TCP window size is having on the results. The smallest window gives the best bandwidth for ipoib?!

- Sean
Re: [ceph-users] Help needed porting Ceph to RSockets
On Aug 14, 2013, at 3:21 AM, Andreas Bluemle andreas.blue...@itxperts.de wrote:

Hi,

maybe some information about the environment I am working in:
- CentOS 6.4 with custom kernel 3.8.13
- librdmacm / librspreload from git, tag 1.0.17
- application started with librspreload in LD_PRELOAD environment

Currently, I have increased the value of the spin time by setting the default value for polling_time in the source code. I guess that the correct way to do this is via configuration in /etc/rdma/rsocket/polling_time?

Concerning the rpoll() itself, some more comments/questions embedded below.

On Tue, 13 Aug 2013 21:44:42 + Hefty, Sean sean.he...@intel.com wrote:

I found a workaround for my (our) problem: in the librdmacm code, rsocket.c, there is a global constant polling_time, which is set to 10 microseconds at the moment. I raise this to 10000 - and all of a sudden things work nicely.

I am adding the linux-rdma list to CC so Sean might see this. If I understand what you are describing, the caller to rpoll() spins for up to 10 ms (10,000 us) before calling the real poll(). What is the purpose of the real poll() call? Is it simply a means to block the caller and avoid spinning? Or does it actually expect to detect an event?

When the real poll() is called, an event is expected on an fd associated with the CQ's completion channel.

The first question I would have is: why is the rpoll() split into these two pieces? There must have been some reason to do a busy loop on some local state information rather than just call the real poll() directly.

Sean can answer specifically, but this is a typical HPC technique. The worst thing you can do is handle an event and then block when the next event is available. This adds 1-3 us to latency, which is unacceptable in HPC. In HPC, we poll. If we worry about power, we poll until we get no more events and then we poll a little more before blocking. Determining the "little more" is the fun part. ;-)

Scott
Re: [ceph-users] Help needed porting Ceph to RSockets
On Aug 13, 2013, at 10:06 AM, Andreas Bluemle andreas.blue...@itxperts.de wrote:

Hi Matthew,

I found a workaround for my (our) problem: in the librdmacm code, rsocket.c, there is a global constant polling_time, which is set to 10 microseconds at the moment. I raise this to 10000 - and all of a sudden things work nicely.

I am adding the linux-rdma list to CC so Sean might see this. If I understand what you are describing, the caller to rpoll() spins for up to 10 ms (10,000 us) before calling the real poll(). What is the purpose of the real poll() call? Is it simply a means to block the caller and avoid spinning? Or does it actually expect to detect an event?

I think we are looking at two issues here:

1. the thread structure of ceph messenger

For a given socket connection, there are 3 threads of interest here: the main messenger thread, the Pipe::reader and the Pipe::writer. For a ceph client like the ceph admin command, I see the following sequence:
- the connection to the ceph monitor is created by the main messenger thread, the Pipe::reader and Pipe::writer are instantiated
- the requested command is sent to the ceph monitor, the answer is read and printed
- at this point the Pipe::reader already has called tcp_read_wait(), polling for more data or connection termination
- after the response has been printed, the main loop calls the shutdown routines which in turn shutdown() the socket

There is some time between the last two steps - and this gap is long enough to open a race:

2. rpoll, ibv and poll

The rpoll implementation in rsockets is split in 2 phases:
- a busy loop which checks the state of the underlying ibv queue pair
- the call to the real poll() system call (i.e. the uverbs(?) implementation of poll() inside the kernel)

The busy loop has a maximum duration of polling_time (10 microseconds by default) and is able to detect the local shutdown and return a POLLHUP. The poll() system call (i.e. the uverbs implementation of poll() in the kernel) does not detect the local shutdown - and only returns after the caller-supplied timeout expires.

Increasing the rsockets polling_time from 10 to 10000 microseconds results in the rpoll detecting the local shutdown within the busy loop. Decreasing the ceph "ms tcp read timeout" from the default of 900 to 5 seconds serves a similar purpose, but is much coarser. From my understanding, the real issue is neither at the ceph nor at the rsockets level: it is related to the uverbs kernel module.

An alternative way to address the current problem at the rsockets level would be a re-write of rpoll(): instead of the busy loop at the beginning followed by the real poll() call with the full user-specified timeout value ("ms tcp read timeout" in our case), I would embed the real poll() into a loop, splitting the user-specified timeout into smaller portions and doing the rsockets-specific rs_poll_check() on every timeout of the real poll().

I have not looked at the rsocket code, so take the following with a grain of salt. If the purpose of the real poll() is to simply block the user for a specified time, then you can simply make it a short duration (taking into consideration what granularity the OS provides) and then call ibv_poll_cq(). Keep in mind, polling will prevent your CPU from reducing power. If the real poll() is actually checking for something (e.g. checking on the RDMA channel's fd or the IB channel's fd), then you may not want to spin too much.

Scott

Best Regards
Andreas Bluemle

On Tue, 13 Aug 2013 07:53:12 +0200 Andreas Bluemle andreas.blue...@itxperts.de wrote:

Hi Matthew,

I can confirm the behaviour which you describe. I too believe that the problem is on the client side (ceph command). My log files show the very same symptom, i.e. the client side not being able to shutdown the pipes properly.

(Q: I had problems yesterday sending a mail to the ceph-users list with the log files attached because the size of the attachments exceeded some limit; I hadn't been subscribed to the list at that point. Is the use of pastebin.com the better way to provide such lengthy information in general?)

Best Regards
Andreas Bluemle

On Tue, 13 Aug 2013 11:59:36 +0800 Matthew Anderson manderson8...@gmail.com wrote:

Moving this conversation to ceph-devel where the devs might be able to shed some light on this. I've added some additional debug to my code to narrow the issue down a bit and the reader thread appears to be getting locked by tcp_read_wait() because rpoll never returns an event when the socket is shutdown. A hack way of proving this was to lower the timeout in rpoll to 5 seconds. When a command like 'ceph osd tree' completes you can see it block for 5 seconds until rpoll times out and returns 0. The reader thread is then able to join and the
Re: [PATCH V2] libibverbs: Allow arbitrary int values for MTU
On Jul 30, 2013, at 2:31 PM, Christoph Lameter c...@gentwo.org wrote:

On Tue, 30 Jul 2013, Jeff Squyres (jsquyres) wrote:

On Jul 30, 2013, at 12:44 PM, Christoph Lameter c...@gentwo.org wrote: What in the world does that mean? I am an oldtimer I guess. Seems that this is something that can be done in the newfangled forum? How does this affect mailing lists?

I'm not sure what you're asking me; please see the prior posts on this thread that describe the MTU issue and why we still need a solution.

What does "bump" mean? You keep sending replies that just say "bump".

http://en.wikipedia.org/wiki/Internet_forum#Thread

When a member posts in a thread, it will jump to the top since it is the latest updated thread. Similarly, other threads will jump in front of it when they receive posts. When a member posts in a thread for no reason but to have it go to the top, it is referred to as a "bump" or "bumping". He is trying to bring it back to everyone's attention.

Scott
Re: NFS over RDMA benchmark
On Apr 18, 2013, at 3:15 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:

On Thu, Apr 18, 2013 at 10:50 AM, Spencer Shepler spencer.shep...@gmail.com wrote: Note that SPEC SFS does not support RDMA.

IIRC, the benchmark comes with source code - wondering if anyone has modified it to run on RDMA? Or is there any real user to share the experience?

I am not familiar with SPEC SFS, but if it exercises the filesystem, it does not know which RPC layer NFS uses, no? Or does it implement its own client and directly access the RPC layer?

-- Wendy

From: Wendy Cheng
Sent: 4/18/2013 9:16 AM
To: Yan Burman
Cc: Atchley, Scott; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz
Subject: Re: NFS over RDMA benchmark

On Thu, Apr 18, 2013 at 5:47 AM, Yan Burman y...@mellanox.com wrote: What do you suggest for benchmarking NFS?

I believe SPECsfs has been widely used by NFS (server) vendors to position their product lines. Its workload was based on a real-life NFS deployment. I think it is more toward an office type of workload (large client/user count with smaller file sizes, e.g. software development with build, compile, etc.).

BTW, we're experimenting with a similar project and would be interested to know your findings.

-- Wendy
Re: NFS over RDMA benchmark
On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:

On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com wrote: Hi. I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me. My setup consists of 2 servers each with 16 cores, 32Gb of memory, and a Mellanox ConnectX3 QDR card over PCI-e gen3. These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime. I am running kernel 3.5.7. When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K. When I run fio over rdma mounted nfs, I get 260-2200 MB/sec for the same block sizes (4-512K). Running over IPoIB-CM, I get 200-980 MB/sec.

Yan,

Are you trying to optimize single-client performance or server performance with multiple clients? Remember there are always gaps between wire speed (what ib_send_bw measures) and real-world applications.

That being said, does your server use the default export (sync) option? Exporting the share with the async option can bring you closer to wire speed. However, the practice (async) is generally not recommended in a real production system - as it can cause data integrity issues, e.g. you have more chances to lose data when the boxes crash.

-- Wendy

Wendy,

It has been a few years since I looked at RPCRDMA, but I seem to remember that RPCs were limited to 32KB, which means that you have to pipeline them to get line rate. In addition to requiring pipelining, the argument from the authors was that the goal was to maximize server performance and not single-client performance.

Scott
Re: Sharing MR Between Multiple Connections
On Nov 13, 2012, at 11:36 PM, Christopher Mitchell christop...@cemetech.net wrote:

Hi,

I am working on building an Infiniband application with a server that can handle many simultaneous clients. The server exposes a chunk of memory that each of the clients can read via RDMA. I was previously creating a new MR on the server for each client (and of course in that connection's PD). However, under stress testing, I realized that ibv_reg_mr() started failing after I simultaneously MRed the same area enough times to cover 20.0 GB. I presume that the problem is reaching some pinning limit, although ulimit reports unlimited for all relevant possibilities. I tried creating a single global PD and a single MR to be shared among the multiple connections, but rdma_create_qp() fails with an invalid argument when I try to do that. I therefore deduce that the PD specified in rdma_create_qp() must correspond to an active connection, not simply be created by opening a device. Long question short: is there any way I can share the same MR among multiple clients, so that my shared memory region is limited to N bytes instead of N/C (clients) bytes?

Christopher,

Yes, it is possible. You have to use the same PD for all QPs/connections. We do this in CCI when using the Verbs transport.

Scott
Re: how to preserve QP over HA events for librdmacm applications
On Sep 20, 2012, at 1:37 PM, Pradeep Satyanarayana prade...@linux.vnet.ibm.com wrote:

On 09/19/2012 11:14 AM, Atchley, Scott wrote: On Sep 19, 2012, at 1:05 PM, Hefty, Sean sean.he...@intel.com wrote: I too would be interested in bringing a QP from error back to a usable state. I have been debating whether to reconnect using the current RDMA calls versus trying to transition the existing RC QP. I assumed to transition the existing QP that I would need to open a socket to coordinate the two sides. Is that correct? If I were instead to use rdma_connect(), does it require a new CM id or just a new QP within the same id?

What if you, say, pre-created a second (fail-over) QP for HA purposes, all under the covers of a single socket? And both QPs were connected before the failure. Not sure if that would work with the same CM id though. If not, we will need to rdma_connect() the second QP after failure. By having a second QP bound to, say, a different port/device, one could survive not just link up/down events, but device failures too. Would that be more generic?

Hi Pradeep,

What is the memory cost of a QP? I assume it will require a second CM id as well. Involving a second device and/or port is not an option for my usage.

Scott
Re: how to preserve QP over HA events for librdmacm applications
On Sep 19, 2012, at 11:58 AM, Alex Rosenbaum al...@mellanox.com wrote:

Since we use the RDMA_PS_IPOIB, we need librdmacm to help get the correct pkey_index and qkey (in the INIT-RTR transition) to match IPoIB's UD QP own values. If not, then our user space UD QP will not be able to send/recv from IPoIB on remote machines (which is what we want to gain by using the IPOIB port space). Maybe we can save the values used from rdma_create_qp and reuse them once we modify the UD QP state by libibverbs (ibv_modify_qp). It would be nice if we had access to rdma's modify qp wrapper to do this nicely from the application level.

I too would be interested in bringing a QP from error back to a usable state. I have been debating whether to reconnect using the current RDMA calls versus trying to transition the existing RC QP. I assumed to transition the existing QP that I would need to open a socket to coordinate the two sides. Is that correct? If I were instead to use rdma_connect(), does it require a new CM id or just a new QP within the same id?

Thanks,

Scott

-----Original Message-----
From: Or Gerlitz
Sent: Wednesday, September 19, 2012 6:53 PM
To: Alex Rosenbaum
Cc: Hefty, Sean; linux-rdma (linux-rdma@vger.kernel.org)
Subject: Re: how to preserve QP over HA events for librdmacm applications

On 19/09/2012 18:48, Hefty, Sean wrote: Can this flushing be somehow done with the current librdmacm/libibverbs APIs or do we need some enhancement? You can call verbs directly to transition the QP state. That leaves the CM state unchanged, which doesn't really matter for UD QPs anyway.

Alex,

Any reason we can't deploy this hack? Is it that for the IPoIB port space it would require copying some low-level code from librdmacm or even from the kernel, e.g. the IPoIB qkey, etc.?

Or.
Re: how to preserve QP over HA events for librdmacm applications
On Sep 19, 2012, at 1:05 PM, Hefty, Sean sean.he...@intel.com wrote:

I too would be interested in bringing a QP from error back to a usable state. I have been debating whether to reconnect using the current RDMA calls versus trying to transition the existing RC QP. I assumed to transition the existing QP that I would need to open a socket to coordinate the two sides. Is that correct? If I were instead to use rdma_connect(), does it require a new CM id or just a new QP within the same id?

What do you gain by transitioning an RC QP from error to RTS, versus just establishing a new connection?

I have a certain amount of state regarding a peer. I look up that state based on the qp_num returned within a work completion, for example. If I reconnect, I will need to migrate the state from the old qp_num to the new qp_num. I have no preference, which is why I asked about the two options (opening a socket to coordinate state transitions versus connecting with a new QP).

Scott
Re: how to preserve QP over HA events for librdmacm applications
On Sep 19, 2012, at 2:14 PM, Atchley, Scott atchle...@ornl.gov wrote:

On Sep 19, 2012, at 1:05 PM, Hefty, Sean sean.he...@intel.com wrote: I too would be interested in bringing a QP from error back to a usable state. I have been debating whether to reconnect using the current RDMA calls versus trying to transition the existing RC QP. I assumed to transition the existing QP that I would need to open a socket to coordinate the two sides. Is that correct? If I were instead to use rdma_connect(), does it require a new CM id or just a new QP within the same id?

What do you gain by transitioning an RC QP from error to RTS, versus just establishing a new connection?

I have a certain amount of state regarding a peer. I lookup that state based on the qp_num returned within a work completion, for example. If I reconnect, I will need to migrate the state from the old qp_num to the new qp_num. I have no preference, which is why I asked about the two options (opening a socket to coordinate state transitions versus connecting with a new QP).

I don't know if it matters to the conversation or not, but I use an SRQ. I am unclear how to remove a QP from the SRQ. Is ibv_destroy_qp() sufficient? Or do I need to use rdma_destroy_qp()? Basically, I use the rdma_* calls for connection setup. After that, I use only ibv_* calls for communication (Send/Recv and RDMA).

Scott
Re: IPoIB performance
On Sep 5, 2012, at 11:51 AM, Christoph Lameter wrote:

On Wed, 29 Aug 2012, Atchley, Scott wrote: I am benchmarking a sockets based application and I want a sanity check on IPoIB performance expectations when using connected mode (65520 MTU). I am using the tuning tips in Documentation/infiniband/ipoib.txt. The machines have Mellanox QDR cards (see below for the verbose ibv_devinfo output). I am using a 2.6.36 kernel. The hosts have single-socket Intel E5520s (4 cores with hyper-threading on) at 2.27 GHz. I am using netperf's TCP_STREAM and binding cores. The best I have seen is ~13 Gbps. Is this the best I can expect from these cards?

Sounds about right. This is not a hardware limitation but a limitation of the socket I/O layer / PCI-E bus. The cards generally can process more data than the PCI bus and the OS can handle. PCI-E on PCI 2.0 should give you up to about 2.3 GBytes/sec with these NICs. So there is likely something that the network layer does to you that limits the bandwidth.

First, thanks for the reply. I am not sure where you are getting the 2.3 GB/s value. When using verbs natively, I can get ~3.4 GB/s. I am assuming that these HCAs lack certain TCP offloads that might allow higher socket performance. Ethtool reports:

# ethtool -k ib0
Offload parameters for ib0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: off

There is no checksum support, which I would expect to lower performance. Since checksums need to be calculated in the host, I would expect faster processors to help performance some.

So basically, am I in the ballpark given this hardware? What should I expect as a max for ipoib with FDR cards?

More of the same.

You may want to A) increase the block size handled by the socket layer

Do you mean altering sysctl with something like:

# increase TCP max buffer size settable using setsockopt()
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# increase the length of the processor input queue
net.core.netdev_max_backlog = 3

or something increasing the SO_SNDBUF and SO_RCVBUF sizes or something else?

B) Increase the bandwidth by using PCI-E 3 or more PCI-E lanes.

C) Bypass the socket layer. Look at Sean's rsockets layer f.e.

We actually want to test the socket stack and not bypass it.

Thanks again!

Scott
Re: IPoIB performance
On Sep 5, 2012, at 1:50 PM, Reeted wrote:

On 08/29/12 21:35, Atchley, Scott wrote: Hi all, I am benchmarking a sockets based application and I want a sanity check on IPoIB performance expectations when using connected mode (65520 MTU).

I have read that with newer cards the datagram (unconnected) mode is faster at IPoIB than connected mode. Do you want to check?

I have read that the latency is lower (better) but the bandwidth is lower. Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s on these machines/cards. Connected mode at the same MTU performs roughly the same. The win in connected mode comes with larger MTUs. With a 9000 MTU, I see ~6 Gb/s. Pushing the MTU to 65520 (the maximum for ipoib), I can get ~13 Gb/s.

What benchmark program are you using?

netperf with process binding (-T). I tune sysctl per the DOE FasterData specs: http://fasterdata.es.net/host-tuning/linux/

Scott
Re: IPoIB performance
On Sep 5, 2012, at 2:20 PM, Christoph Lameter wrote:

On Wed, 5 Sep 2012, Atchley, Scott wrote:

# ethtool -k ib0
Offload parameters for ib0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: off

There is no checksum support which I would expect to lower performance. Since checksums need to be calculated in the host, I would expect faster processors to help performance some.

K, that is a major problem. Both are on by default here. What NIC is this?

These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output of ibv_devinfo is in my original post.

A) increase the block size handled by the socket layer

Do you mean altering sysctl with something like:

Nope, increase the MTU. Connected mode supports up to a 64k MTU size, I believe.

Yes, I am using the max MTU (65520).

or something increasing the SO_SNDBUF and SO_RCVBUF sizes or something else?

That does nothing for performance. The problem is that the handling of the data by the kernel causes too much latency, so that you cannot reach the full bandwidth of the hardware.

We actually want to test the socket stack and not bypass it.

AFAICT the network stack is useful up to 1 Gbps and after that more and more band-aid comes into play.

Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 40G Ethernet NICs, but I hope that they will get close to line rate. If not, what is the point? ;-)

Scott
Re: IPoIB performance
On Sep 5, 2012, at 3:04 PM, Reeted wrote:

On 09/05/12 19:59, Atchley, Scott wrote: On Sep 5, 2012, at 1:50 PM, Reeted wrote: I have read that with newer cards the datagram (unconnected) mode is faster at IPoIB than connected mode. Do you want to check?

I have read that the latency is lower (better) but the bandwidth is lower. Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s on these machines/cards. Connected mode at the same MTU performs roughly the same. The win in connected mode comes with larger MTUs. With a 9000 MTU, I see ~6 Gb/s. Pushing the MTU to 65520 (the maximum for ipoib), I can get ~13 Gb/s.

Have a look at an old thread in this ML by Sebastien Dugue, "IPoIB to Ethernet routing performance". He had numbers much higher than yours on similar hardware, and was suggested to use datagram mode to achieve offloading and even higher speeds. Keep me informed if you can fix this; I am interested but can't test infiniband myself right now.

He claims 20 Gb/s and Or replies that one should also get near 20 Gb/s using datagram mode. I checked, and datagram mode shows support via ethtool for more offloads. In my case, I still see better performance with connected mode.

Thanks,

Scott
Re: IPoIB performance
On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:

> On Wed, 5 Sep 2012, Atchley, Scott wrote:
>
>>> AFAICT the network stack is useful up to 1Gbps and after that more and more band-aid comes into play.
>>
>> Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 40G Ethernet NICs, but I hope that they will get close to line rate. If not, what is the point? ;-)
>
> Oh yes they can under restricted circumstances. Large packets, multiple cores, etc. With the band-aids....

With Myricom 10G NICs, for example, you just need one core and it can do line rate with a 1500-byte MTU. Do you count the stateless offloads as band-aids? Or something else?

I have not tested any 40G NICs yet, but I imagine that one core will not be enough.

Thanks,

Scott
Re: IPoIB performance
On Sep 5, 2012, at 3:13 PM, Christoph Lameter wrote:

> On Wed, 5 Sep 2012, Atchley, Scott wrote:
>
>> These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output of ibv_devinfo is in my original post.
>
> Hmmm... You are running an old kernel. What version of OFED do you use?

Hah, if you think my kernel is old, you should see my userland (RHEL 5.5). ;-)

Does the version of OFED impact the kernel modules? I am using the modules that came with the kernel. I don't believe that libibverbs or librdmacm are used by the kernel's socket stack. That said, I am using source builds with tags libibverbs-1.1.6 and v1.0.16 (librdmacm).

Scott
Re: IPoIB performance
On Sep 5, 2012, at 4:12 PM, Ezra Kissel wrote:

> On 9/5/2012 3:48 PM, Atchley, Scott wrote:
>
>> With Myricom 10G NICs, for example, you just need one core and it can do line rate with a 1500-byte MTU. Do you count the stateless offloads as band-aids? Or something else? I have not tested any 40G NICs yet, but I imagine that one core will not be enough.
>
> Since you are using netperf, you might also consider experimenting with the TCP_SENDFILE test. Using sendfile/splice calls can have a significant impact for sockets-based apps. Using 40G NICs (Mellanox ConnectX-3 EN), I've seen our applications hit 22 Gb/s single core/stream while fully CPU bound. With sendfile/splice, there is no issue saturating a 40G link with about 40-50% core utilization. That being said, binding to the right core/node, message size and memory alignment, interrupt handling, and proper host/NIC tuning all have an impact on the performance. The state of high-performance networking is certainly not plug-and-play.

Thanks for the tip. The app we want to test does not use sendfile() or splice(). I do bind to the best core (determined by testing all combinations on client and server).

I have heard others within DOE reach ~16 Gb/s on a 40G Mellanox NIC. I'm glad to hear that you got to 22 Gb/s for a single stream. That is more reassuring.

Scott
IPoIB performance
Hi all,

I am benchmarking a sockets-based application and I want a sanity check on IPoIB performance expectations when using connected mode (65520 MTU). I am using the tuning tips in Documentation/infiniband/ipoib.txt.

The machines have Mellanox QDR cards (see below for the verbose ibv_devinfo output). I am using a 2.6.36 kernel. The hosts have a single-socket Intel E5520 (4 cores with hyper-threading on) at 2.27 GHz. I am using netperf's TCP_STREAM and binding cores.

The best I have seen is ~13 Gbps. Is this the best I can expect from these cards? What should I expect as a max for IPoIB with FDR cards?

Thanks,

Scott

hca_id: mlx4_0
    transport:                  InfiniBand (0)
    fw_ver:                     2.7.626
    node_guid:                  0002:c903:000b:6520
    sys_image_guid:             0002:c903:000b:6523
    vendor_id:                  0x02c9
    vendor_part_id:             26428
    hw_ver:                     0xB0
    board_id:                   MT_0D90110009
    phys_port_cnt:              1
    max_mr_size:                0x
    page_size_cap:              0xfe00
    max_qp:                     65464
    max_qp_wr:                  16384
    device_cap_flags:           0x006c9c76
    max_sge:                    32
    max_sge_rd:                 0
    max_cq:                     65408
    max_cqe:                    4194303
    max_mr:                     131056
    max_pd:                     32764
    max_qp_rd_atom:             16
    max_ee_rd_atom:             0
    max_res_rd_atom:            1047424
    max_qp_init_rd_atom:        128
    max_ee_init_rd_atom:        0
    atomic_cap:                 ATOMIC_HCA (1)
    max_ee:                     0
    max_rdd:                    0
    max_mw:                     0
    max_raw_ipv6_qp:            0
    max_raw_ethy_qp:            0
    max_mcast_grp:              8192
    max_mcast_qp_attach:        56
    max_total_mcast_qp_attach:  458752
    max_ah:                     0
    max_fmr:                    0
    max_srq:                    65472
    max_srq_wr:                 16383
    max_srq_sge:                31
    max_pkeys:                  128
    local_ca_ack_delay:         15
    port: 1
        state:                  PORT_ACTIVE (4)
        max_mtu:                2048 (4)
        active_mtu:             2048 (4)
        sm_lid:                 6
        port_lid:               8
        port_lmc:               0x00
        link_layer:             InfiniBand
        max_msg_sz:             0x4000
        port_cap_flags:         0x02510868
        max_vl_num:             8 (4)
        bad_pkey_cntr:          0x0
        qkey_viol_cntr:         0x0
        sm_sl:                  0
        pkey_tbl_len:           128
        gid_tbl_len:            128
        subnet_timeout:         18
        init_type_reply:        0
        active_width:           4X (2)
        active_speed:           10.0 Gbps (4)
        phys_state:             LINK_UP (5)
        GID[0]:                 fe80::::0002:c903:000b:6521
Re: OT: netmap - a novel framework for fast packet I/O
On Jan 20, 2012, at 11:20 AM, Ira Weiny wrote:

> On Fri, 20 Jan 2012 06:18:44 -0800, Atchley, Scott atchle...@ornl.gov wrote:
>
>> Interesting. It totally hijacks the NIC; all traffic is captured. You would have to implement your own IP stack, Verbs stack, etc.
>
> Can multiple user space processes share the card? If so, how is security handled between them? It is not clear from the paper I scanned.

There does seem to be a mechanism to send selected packets up the host stack.

Scott

>> On Jan 19, 2012, at 11:50 AM, Yann Droneaud wrote:
>>
>>> Hi,
>>>
>>> I have discovered today the netmap project [1] through an ACM Queue article [2]. Netmap is a new interface to send and receive packets through an Ethernet interface (NIC). It seems to provide raw access to a network interface in order to process packets at a high rate with low overhead.
>>>
>>> This is another example of the kernel-bypass/zero-copy that are core features of InfiniBand Verbs/RDMA. But unlike InfiniBand Verbs/RDMA, netmap seems to have a very small API. Such an API could be enough to build an unreliable datagram messaging system on low-cost hardware (without concerns of determinism, flow control, etc.).
>>>
>>> I'm asking myself if the way netmap exposes internal NIC rings could be applicable to IB/IBoE HCAs? E.g., beyond 10GbE NICs, is netmap relevant?
>>>
>>> Regards.
[1] http://info.iet.unipi.it/~luigi/netmap/ -- netmap: a novel framework for fast packet I/O, Luigi Rizzo, Università di Pisa
[2] http://queue.acm.org/detail.cfm?id=2103536 -- Revisiting Network I/O APIs: The netmap Framework, Luigi Rizzo, 2012-01-17

--
Yann Droneaud

--
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
wei...@llnl.gov
Re: Send with immediate data completion
On Jan 11, 2012, at 5:22 PM, Hefty, Sean wrote:

> I'm still waiting on feedback from the IBTA, but they are looking into the matter. The intent is for immediate data only to be provided on receive work completions. The IBTA will clarify the spec on this. I'll submit patches that remove setting the wc flag, which may help avoid this confusion some.

Sean,

Thanks for looking into this.

Scott
Re: Send with immediate data completion
On Jan 3, 2012, at 12:35 PM, Atchley, Scott wrote:

> On Jan 3, 2012, at 11:55 AM, Hefty, Sean wrote:
>
>>> I have a question about a completion for a send with immediate data. The IB spec (1.2.1) only mentions that the WC's immediate data be present at the receiver. It is silent on the value on the sender at completion. It does say that it is only valid if the WC's immediate data indicator is set.
>>
>> Can you provide a section reference to the spec on the areas that you're looking at? Looking quickly, section 11.4.2.1 reads like immediate data should be available in either case. I've never checked imm data on the send wc. I'm just trying to determine if there's an issue in the spec that should be addressed, or if this is simply a bug in the hca/driver.
>
> There is the definition in the glossary:
>
>     Immediate Data: Data contained in a Work Queue Element that is sent along with the payload to the remote Channel Adapter and placed in a Receive Work Completion.
>
> Section 3.7.4 Transport Layer:
>
>     The Immediate Data (IMMDT) field is optionally present in RDMA WRITE and SEND messages. It contains data that the consumer placed in the Send or RDMA Write request and the receiving QP will place that value in the current receive WQE. An RDMA Write with immediate data will consume a receive WQE even though the QP did not place any data into the receive buffer, since the IMMDT is placed in a CQE that references the receive WQE and indicates that the WQE has completed.
>
> Section 11.4.1.1 Post Send Request:
>
>     Immediate Data Indicator. This is set if Immediate Data is to be included in the outgoing request. Valid only for Send or RDMA Write operations.
>     4-byte Immediate Data. Valid only for Send or RDMA Write operations.
>
> Section 11.4.2.1 Poll for Completion:
>
>     Immediate data indicator. This is set if immediate data is present.
>     4-byte immediate data.
>
> None specifically mention the sender's completion event.

Sean,

Any thoughts? Personally, I would like to have it in the send completion, but it might not be possible for all drivers to implement. If not, then the spec should be clarified.

Scott
Send with immediate data completion
Hi all,

I have a question about a completion for a send with immediate data. The IB spec (1.2.1) only mentions that the WC's immediate data be present at the receiver. It is silent on the value on the sender at completion. It does say that it is only valid if the WC's immediate data indicator is set.

When I test using a 2.6.38 kernel with the kernel.org libibverbs git tree, I see a send completion's wc_flags set with IBV_WC_WITH_IMM, yet the imm_data is not what I passed in. Since the spec is silent on setting imm_data on the sender, I assume that I should not rely on looking at the imm_data on a send completion. Given that, should IBV_WC_WITH_IMM ever be set on the sender?

Thanks,

Scott

-
Scott Atchley
HPC Systems Engineer
Center for Computational Sciences
Oak Ridge National Laboratory
atchle...@ornl.gov
Re: Send with immediate data completion
On Jan 3, 2012, at 11:55 AM, Hefty, Sean wrote:

>> I have a question about a completion for a send with immediate data. The IB spec (1.2.1) only mentions that the WC's immediate data be present at the receiver. It is silent on the value on the sender at completion. It does say that it is only valid if the WC's immediate data indicator is set.
>
> Can you provide a section reference to the spec on the areas that you're looking at? Looking quickly, section 11.4.2.1 reads like immediate data should be available in either case. I've never checked imm data on the send wc. I'm just trying to determine if there's an issue in the spec that should be addressed, or if this is simply a bug in the hca/driver.

For the record, I am using:

hca_id: mlx4_0
    transport:      InfiniBand (0)
    fw_ver:         2.7.626
    node_guid:      0002:c903:000b:64e8
    sys_image_guid: 0002:c903:000b:64eb
    vendor_id:      0x02c9
    vendor_part_id: 26428
    hw_ver:         0xB0
    board_id:       MT_0D90110009

Scott