Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)
From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]

It would be difficult to believe that 10Gbit Ethernet offers better bandwidth than 56Gbit Infiniband (the current offering). The switching model is quite similar. The main reason why IB offers better latency is a better HBA hardware interface and a specialized stack.

5X is 5X. Put another way, the reason InfiniBand has so much higher throughput and lower latency than Ethernet is that the switching (at the physical layer) is completely different, and messages are passed directly from user space to user space in remote system RAM via RDMA, bypassing the OSI layer model and other kernel overhead. I read a paper from VMware in which they implemented RDMA over Ethernet and doubled the speed of vMotion (though it was still roughly 4x slower than InfiniBand).

Besides bypassing the OSI layers and kernel latency, IB latency is lower because Ethernet switches use store-and-forward buffering managed by the switch backplane: a sender sends a packet into a buffer on the switch, which pushes it through the backplane and finally into another buffer toward the destination. IB uses cross-bar, or cut-through, switching: the sending host channel adapter signals the destination address to the switch, then waits for the channel to be opened. Once opened, the channel stays open, and the switch in between does little more than signal amplification (plus virtual lanes for congestion management and other functions). The sender writes directly to RAM on the destination via RDMA, with no buffering in between and no OSI layer model in the way. Hence much lower latency.

IB also has native link aggregation into data-striped lanes, hence the 1x, 4x, 8x, and 12x width designations and the 40Gbit specifications. Something similar is quasi-possible in Ethernet via LACP, but it is not as good and not the same.
IB guarantees packets are delivered in order, with native congestion control, whereas Ethernet may drop packets that TCP must then detect and retransmit. Ethernet also carries a lot of support for IP addressing and for variable link speeds (10Gbit, 10/100, 1G, etc.), all of it asynchronous. For these reasons, IB is not a suitable replacement for IP communications done on Ethernet with a lot of variable peer-to-peer and broadcast traffic. IB is designed for networks where systems establish connections to other systems and those connections remain mostly static: primarily clustered storage networks, not primarily TCP/IP.

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss
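The store-and-forward versus cut-through argument above can be made concrete with a back-of-the-envelope calculation. This is an illustrative sketch, not a vendor measurement: the 1500-byte frame and 64-byte header-wait figures are assumptions, and real switches add fixed pipeline latency on top.

```python
# Per-hop latency sketch: a store-and-forward switch must receive the
# entire frame before forwarding, while a cut-through switch forwards
# as soon as it has parsed the destination (roughly a header's worth).

def serialization_delay_us(num_bytes, gbit_per_s):
    """Time to clock num_bytes onto the wire, in microseconds."""
    return num_bytes * 8 / (gbit_per_s * 1e3)

FRAME = 1500   # bytes: a full Ethernet payload frame (assumed size)
HEADER = 64    # bytes: roughly what a cut-through switch waits for

sf_hop = serialization_delay_us(FRAME, 10)   # store-and-forward @ 10Gbit
ct_hop = serialization_delay_us(HEADER, 10)  # cut-through @ 10Gbit

print(f"store-and-forward hop: {sf_hop:.2f} us")   # 1.20 us
print(f"cut-through hop:       {ct_hop:.3f} us")   # 0.051 us
```

The gap compounds per hop, which is why cut-through fabrics quote port-to-port latencies in nanoseconds while buffered paths sit in the microseconds.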
Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)
Some of these points are a bit dated; allow me to make some updates.

I'm sure you are aware that most 10gig switches these days are cut-through, not store-and-forward. That's Arista, HP, Dell Force10, Mellanox, and IBM/Blade. Cisco has a mix of things, but they aren't really in the low-latency space. The 10g and 40g port-to-port forwarding is in nanoseconds. Deep buffering is mostly reserved for carrier operations these days, and even there it is becoming less common because of the toll it takes on things like IP video and VoIP. Buffers are still good for web farms, and to a certain extent for storage servers or WAN links where there is a high degree of contention from disparate traffic.

At a physical level, the signalling of IB compared to Ethernet (10g+) is very similar, which is why Mellanox can make a single chip that does 10Gbit and 40Gbit Ethernet as well as QDR and FDR InfiniBand on any port. There are also a fair number of vendors that support RDMA in Ethernet NICs now, along with kernel-bypass NICs like SolarFlare with its OpenOnload technology.

The main reason for the lowest achievable latency is higher speed: latency is roughly inversely proportional to bandwidth. But the higher-level protocols you stack on top contribute much more than the hardware's theoretical minimums or maximums. TCP/IP is a killer in terms of added overhead; that's why there are protocols like iSER, SRP, and friends. RDMA is much faster than the kernel overhead induced by TCP session setup and the other host-side user/kernel boundary crossings and buffering. PCI latency is also higher than the port-to-port latency on a good 10g switch, never mind 40g or FDR InfiniBand. There is even a special layer on InfiniBand, called Verbs, that you can write custom protocols against to lower latency further.

InfiniBand is inherently a layer-1 and layer-2 protocol; the subnet manager (software) is responsible for setting up all virtual circuits (routes between hosts on the fabric) and rerouting when a path goes bad.
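The "latency is roughly inversely proportional to bandwidth" point above can be illustrated with wire-serialization times for a single message at the link rates discussed in this thread. The 4 KiB message size is an assumption for illustration; protocol and stack overhead (TCP, kernel crossings) sits on top of this floor.

```python
# Serialization delay for one 4 KiB message at various link rates,
# showing the hardware latency floor scaling as 1/bandwidth.
# Rates: 10GbE, 40Gbit (40GbE / QDR-class), 56Gbit (FDR-class).

def wire_time_us(num_bytes, gbit_per_s):
    """Microseconds to put num_bytes onto a link of the given rate."""
    return num_bytes * 8 / (gbit_per_s * 1e3)

MSG = 4096  # bytes (assumed message size for illustration)

for rate in (10, 40, 56):
    print(f"{rate:2d} Gbit/s: {wire_time_us(MSG, rate):.3f} us")
```

Doubling the link rate halves this floor, but a TCP stack can add tens of microseconds per message regardless, which is why the protocol layers dominate in practice.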
Also, the link aggregation, as you mention, is rock solid and amazingly good. Auto-rerouting is fabulous and super fast. But you don't get layer 3. TCP over IB works out of the box, but adds large overhead. Still, it makes it possible to run IB natively and IP over IB, with gateways to a TCP network, over a single cable. That's pretty cool.

Sent from my android device.

-Original Message-
From: Edward Ned Harvey (openindiana) openindi...@nedharvey.com
To: Discussion list for OpenIndiana openindiana-discuss@openindiana.org
Sent: Tue, 16 Apr 2013 10:49 AM
Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)

[snip]
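The lane aggregation discussed in this thread follows a simple arithmetic: link width (1x/4x/8x/12x lanes) times per-lane signalling rate, minus encoding overhead. A sketch of how the familiar aggregate figures fall out (the per-lane rates and encodings below are the standard IB generations; treat this as a summary, not a spec quote):

```python
# How IB link width and per-lane signalling combine into aggregate
# rates. Encoding overhead: 8b/10b for SDR/DDR/QDR, 64b/66b for FDR.

LANE_GBIT = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0, "FDR": 14.0625}
ENCODING  = {"SDR": 8/10, "DDR": 8/10, "QDR": 8/10, "FDR": 64/66}

def link_rate(gen, lanes):
    """Return (signalling Gbit/s, usable data Gbit/s) for a link."""
    sig = LANE_GBIT[gen] * lanes
    return sig, sig * ENCODING[gen]

sig, data = link_rate("QDR", 4)
print(f"4x QDR: {sig:g} Gbit/s signalling, {data:g} Gbit/s data")    # 40, 32
sig, data = link_rate("FDR", 4)
print(f"4x FDR: {sig:g} Gbit/s signalling, {data:.1f} Gbit/s data")  # 56.25, 54.5
```

This is why "40Gbit" QDR and "56Gbit" FDR both refer to 4x links, and why the striping is transparent to the application, unlike LACP's per-flow hashing on Ethernet.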
Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)
On Mon, 15 Apr 2013, Ong Yu-Phing wrote:

Working set of ~50% is quite large; when you say data analysis I'd assume some sort of OLTP or real-time BI situation, but do you know the nature of your processing, i.e. is it latency dependent or bandwidth dependent? Reason I ask is because I think 10GB delivers better overall B/W, but 4GB infiniband delivers better latency.

It would be difficult to believe that 10Gbit Ethernet offers better bandwidth than 56Gbit Infiniband (the current offering). The switching model is quite similar. The main reason why IB offers better latency is a better HBA hardware interface and a specialized stack. 5X is 5X.

If a 3-disk raidz1 is too expensive, then put more SSDs in each raidz1.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)
A heads up that storing 10-12TB means you'd need 11.5-13TB usable, assuming you keep used storage at or below 90% of total usable storage (or is that old news now?). So, using Saso's RAID5 config of Intel DC S3700s in 3-disk raidz1 groups, you'd need 21x Intel DC S3700 at 800GB (21 x 800 / 3 x 2 x 0.9 = 10.08TB) to get 10TB, or 27x to get 12.96TB usable, excluding root/cache etc. That means $50K+ for SSDs, leaving you only $10K for the server platform, which might not be enough to get 0.5TB of RAM etc. (unless you can get a bulk discount on the Intel DC S3700s!).

Working set of ~50% is quite large; when you say data analysis I'd assume some sort of OLTP or real-time BI situation, but do you know the nature of your processing, i.e. is it latency dependent or bandwidth dependent? Reason I ask is because I think 10GB delivers better overall B/W, but 4GB infiniband delivers better latency. 10 years ago I worked with 30+TB data sets which were preloaded into an Oracle database, with data structures highly optimized for the types of reads the applications required (a 2-3 day window for complex analysis of monthly data). No SSDs and fancy stuff in those days. But if your data is live/realtime and constantly streaming in, then the work profile can be dramatically different.

On 15/04/2013 07:17, Sašo Kiselkov wrote:
On 04/14/2013 05:15 PM, Wim van den Berge wrote:

Hello,

We have been running OpenIndiana (and its various predecessors) as storage servers in production for the last couple of years. Over that time the majority of our storage infrastructure has been moved to OpenIndiana, to the point where we currently serve (iSCSI, NFS and CIFS) about 1.2PB from 10+ servers in three datacenters. All of these systems are pretty much the same: large pool of disks; SSD for root, ZIL and L2ARC; 64-128GB RAM; multiple 10Gb uplinks. All of these work like a charm. However, the next system is going to be a little different. It needs to be the absolute fastest iSCSI target we can create/afford.
We'll need about 10-12TB of capacity; the working set will be 5-6TB, and IO over time is 90% reads and 10% writes using 32K blocks, but this is a data analysis scenario so all the writes are up front. Contrary to previous installs, money is a secondary (but not unimportant) issue for this one. I'd like to stick with a SuperMicro platform, and we've been thinking of trying the new Intel S3700 800GB SSDs, which seem to run about $2K. Ideally I'd like to keep system cost below $60K. This is new ground for us. Before this one, the game has always been primarily about capacity/data integrity, and anything we designed based on ZFS/OpenSolaris has always more than delivered in the performance arena. This time we're looking to fill up the dedicated 10GbE connections to each of the four to eight processing nodes as much as possible. The processing nodes have been designed so that they will consume whatever storage bandwidth they can get. Any ideas/thoughts/recommendations/caveats would be much appreciated.

Hi Wim,

Interesting project. You should definitely look at all-SSD pools here. With the 800GB DC S3700 running in 3-drive raidz1's you're looking at approximately $34k CAPEX (for the 10TB capacity point) just for the SSDs. That leaves you ~$25k you can spend on the rest of the box, which is *a lot*. Be sure to put lots of RAM (512GB+) into the box. Also consider ditching 10GE and going straight to IB. A dual-port QDR card can be had nowadays for about $1k (SuperMicro even makes motherboards with QDR IB on-board) and a 36-port Mellanox QDR switch can be had for about $8k (this integrates the IB subnet manager, so it is all you need to set up an IB network):

http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=158

Cheers,
--
Saso
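The drive-count arithmetic from earlier in the thread (800GB drives in 3-disk raidz1 groups, 2 data + 1 parity, staying under a 90% fill ceiling) can be sanity-checked with a few lines. The price and fill-factor assumptions are the thread's, not mine:

```python
# Hedged re-check of the thread's sizing: usable TB and SSD cost for
# n drives of 800GB arranged in 3-disk raidz1 vdevs (2 data disks
# each), keeping used space at or below 90% of pool capacity.

DRIVE_GB   = 800
GROUP      = 3           # disks per raidz1 vdev
DATA_DISKS = GROUP - 1   # usable disks per vdev (1 disk of parity)
FILL       = 0.90        # stay <= 90% full
PRICE      = 2000        # ~$2K per Intel DC S3700 800GB (thread figure)

def usable_tb(n_drives):
    """Usable capacity in TB after parity and the fill ceiling."""
    groups = n_drives // GROUP
    return groups * DATA_DISKS * DRIVE_GB * FILL / 1000

for n in (21, 27):
    print(f"{n} drives: {usable_tb(n):.2f} TB usable, "
          f"${n * PRICE // 1000}K in SSDs")
```

Under these assumptions 21 drives land right at the 10TB target and 27 drives at ~13TB, which is where the "$50K+ for SSDs" concern in the thread comes from.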