Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)

2013-04-16 Thread Edward Ned Harvey (openindiana)
 From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
 
 It would be difficult to believe that 10Gbit Ethernet offers better
 bandwidth than 56Gbit Infiniband (the current offering).  The switching
 model is quite similar.  The main reason why IB offers better latency
 is a better HBA hardware interface and a specialized stack.  5X is 5X.

Put another way, the reason InfiniBand delivers so much higher throughput and lower 
latency than Ethernet is that the switching (at the physical layer) is completely 
different, and messages are passed directly from user space on one host into RAM on 
the remote host via RDMA, bypassing the OSI layer model and other kernel overhead.  
I read a paper from VMware in which they implemented RDMA over Ethernet and roughly 
doubled the speed of vMotion (though it was still slower than InfiniBand, by 
something like 4x).

Besides bypassing OSI layers and kernel latency, IB latency is lower because 
Ethernet switches have traditionally used store-and-forward buffering managed by 
the switch backplane: the sender sends a packet into a buffer on the switch, the 
switch pushes it through the backplane, and it lands in another buffer on the way 
to the destination.  IB uses cross-bar, cut-through switching: the sending host 
channel adapter signals the destination address to the switch and waits for the 
channel to be opened.  Once opened, the channel stays open, and the switch in 
between does little more than signal amplification (plus additional virtual lanes 
for congestion management and other functions).  The sender writes directly to RAM 
on the destination via RDMA, with no buffering and no OSI layer model in between.  
Hence much lower latency.
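To put a rough number on the store-and-forward penalty (back-of-envelope figures, 
not from the original post): a store-and-forward switch has to receive the entire 
frame before it can start transmitting it on the egress port, so every hop adds at 
least one full frame-serialization time, e.g.

    1500 bytes x 8 / 10 Gb/s  = 1.2 us per hop
    9000 bytes x 8 / 10 Gb/s  = 7.2 us per hop (jumbo frames)

whereas a cut-through switch starts forwarding as soon as the destination header 
has been read, so its port-to-port latency is largely independent of frame size.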

IB also has native link aggregation into data-striped lanes, hence the 1x, 4x, 8x, 
and 12x width designations and the 40Gbit (4x QDR) specification.  Something like 
this is quasi-possible in Ethernet via LACP, but it is not as good and not the same 
thing.  IB guarantees that packets are delivered in order, with native congestion 
control, whereas Ethernet may drop packets and leave TCP to detect the loss and 
retransmit.
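For reference, the standard published IB rates (not figures from this thread) work 
out as per-lane signalling rate times lane count times encoding efficiency:

    4x QDR: 4 lanes x 10 Gb/s      x 8/10 (8b/10b coding)   = 32 Gb/s of data   ("40Gbit")
    4x FDR: 4 lanes x 14.0625 Gb/s x 64/66 (64b/66b coding) ~= 54.5 Gb/s of data ("56Gbit")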

Ethernet includes a lot of support for IP addressing and for mixed link speeds 
(some ports at 10Gbit, some at 1G, some at 10/100, etc.), all of it asynchronous.  
For these reasons, IB is not a suitable replacement for the IP communications 
typically done over Ethernet, with a lot of variable peer-to-peer and broadcast 
traffic.  IB is designed for networks where systems establish connections to other 
systems and those connections remain mostly static: primarily clustering and 
storage networks, not primarily TCP/IP.




Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)

2013-04-16 Thread Doug Hughes
Some of these points are a bit dated; allow me to make some updates. I'm sure 
you are aware that most 10Gig switches these days are cut-through and not 
store-and-forward. That's Arista, HP, Dell Force10, Mellanox, and IBM/Blade. 
Cisco has a mix of things, but they aren't really in the low-latency space. The 
10G and 40G port-to-port forwarding times are in nanoseconds. Buffering is mostly 
reserved for carrier operations these days, and even there it is becoming less 
common because of the toll it takes on things like IP video and VoIP. Buffers 
are still good for web farms, and to a certain extent for storage servers or WAN 
links where there is a high degree of contention from disparate traffic.
  
At a physical level, the signalling of IB compared to Ethernet (10G+) is very 
similar, which is why Mellanox can make a single chip that does 10Gbit and 40Gbit 
Ethernet as well as QDR and FDR InfiniBand on any port.
There are also a fair number of vendors that support RDMA in Ethernet NICs now, 
such as SolarFlare with its Onload technology.

The main reason for the lowest achievable latency is higher speed: latency is 
roughly the inverse of bandwidth.  But the protocol layers you stack on top 
contribute much more than the hardware's theoretical minimums or maximums. TCP/IP 
is a killer in terms of added overhead; that's why there are protocols like iSER, 
SRP, and friends. RDMA is much faster than the kernel overhead induced by TCP 
session setup and the other host-side user/kernel boundary crossings and buffering. 
PCI latency is also higher than the port-to-port latency of a good 10G switch, 
never mind 40G or FDR InfiniBand.
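To illustrate the "latency is roughly the inverse of bandwidth" point with 
back-of-envelope numbers (not measurements from this thread), the wire time for a 
single 32KB block is

    32768 bytes x 8 / 10 Gb/s  ~= 26 us
    32768 bytes x 8 / 40 Gb/s  ~= 6.5 us

while a cut-through switch adds only a few hundred nanoseconds port to port, so 
once the link is fast enough the host-side stack (TCP setup, user/kernel copies, 
PCI round trips) is what dominates.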

There is even a special layer on InfiniBand that you can write custom protocols 
against, called Verbs, for lowering latency further.
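For anyone curious what the Verbs layer looks like from user space, below is a 
minimal sketch using libibverbs (OFED). It only opens the first RDMA device and 
registers a buffer with the HCA, which is the step that later makes zero-copy RDMA 
reads/writes possible; queue-pair setup and the out-of-band exchange of the 
rkey/address with the peer are omitted, so treat it as an illustration rather than 
a working transfer.

/* Minimal libibverbs sketch: open an RDMA device and register memory.
 * Build (roughly): cc rdma_sketch.c -libverbs
 * Assumes libibverbs is installed and an HCA is present. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (devs == NULL || n == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */

    size_t len = 1 << 20;                           /* 1 MiB buffer */
    void *buf = malloc(len);

    /* Pin and register the buffer so the HCA can DMA into it directly,
     * bypassing the kernel on the data path.  The rkey is what a remote
     * peer would use to RDMA-write into this memory. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (mr == NULL) {
        fprintf(stderr, "ibv_reg_mr failed\n");
        return 1;
    }

    printf("device %s: registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           ibv_get_device_name(devs[0]), len, mr->lkey, mr->rkey);

    /* Real code would now create a completion queue and queue pair,
     * exchange QP numbers/rkeys out of band, and post IBV_WR_RDMA_WRITE
     * work requests; none of that is shown here. */
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}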

InfiniBand is inherently a layer 1 and layer 2 protocol, and the subnet manager 
(software) is responsible for setting up all virtual circuits (routes between 
hosts on the fabric) and rerouting when a path goes bad. Also, the link 
aggregation, as you mention, is rock solid and amazingly good. Auto-rerouting is 
fabulous and super fast. But you don't get layer 3. TCP over IB works out of the 
box, but adds a lot of overhead. Still, it makes it possible to run native IB and 
IP over IB, with gateways to a TCP network, over a single cable. That's pretty 
cool.


Sent from my android device.


Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)

2013-04-15 Thread Bob Friesenhahn

On Mon, 15 Apr 2013, Ong Yu-Phing wrote:
Working set of ~50% is quite large; when you say data analysis I'd assume 
some sort of OLTP or real-time BI situation, but do you know the nature of 
your processing, i.e. is it latency dependent or bandwidth dependent?  The reason 
I ask is that I think 10GB delivers better overall B/W, but 4GB 
infiniband delivers better latency.


It would be difficult to believe that 10Gbit Ethernet offers better 
bandwidth than 56Gbit Infiniband (the current offering).  The switching 
model is quite similar.  The main reason why IB offers better latency 
is a better HBA hardware interface and a specialized stack.  5X is 5X.


If 3xdisk raidz1 is too expensive, then put more SSDs in each raidz1.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)

2013-04-14 Thread Ong Yu-Phing
A heads up that 10-12TB means you'd need 11.5-13TB usable, assuming 
you'd need to keep used storage < 90% of total usable storage (or is 
that old news now?).


So, using Saso's RAID5 config of Intel DC S3700s in 3xdisk raidz1, that 
means you'd need 21x Intel DC S3700 at 800GB (21 x 800 / 3 x 2 x 0.9 = 10,080GB, 
i.e. ~10.08TB) to get 10TB, or 27x to get 12.9TB usable, excluding root/cache 
etc.  Which means $50+K for SSDs, leaving you only ~$10K for the server platform, 
which might not be enough to get 0.5TB of RAM etc. (unless you can get a bulk 
discount on the Intel DC S3700s!).
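For what it's worth, here is the same sizing arithmetic as a tiny C program (a 
sketch; the 800GB drive size, 3-drive raidz1 layout and 90% fill ceiling are taken 
from the discussion above, the rest is just arithmetic):

/* Rough usable-capacity estimate for N SSDs arranged in 3-drive raidz1
 * groups, keeping utilization below a fill ceiling. */
#include <stdio.h>

int main(void)
{
    const double drive_gb   = 800.0;   /* Intel DC S3700 800GB */
    const int    group_size = 3;       /* 3-drive raidz1: 2 data + 1 parity */
    const double fill_limit = 0.9;     /* keep used space below 90% */

    for (int drives = 21; drives <= 27; drives += group_size) {
        int groups = drives / group_size;
        double usable_gb = groups * (group_size - 1) * drive_gb * fill_limit;
        printf("%2d drives -> %d raidz1 groups -> %.2f TB usable at <%.0f%% full\n",
               drives, groups, usable_gb / 1000.0, fill_limit * 100.0);
    }
    return 0;
}

Running it reproduces the figures above: 21 drives give about 10.08TB and 27 drives 
about 12.96TB of usable, <90%-full space.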


Working set of ~50% is quite large; when you say data analysis I'd 
assume some sort of OLTP or real-time BI situation, but do you know the 
nature of your processing, i.e. is it latency dependent or bandwidth 
dependent?  The reason I ask is that I think 10GB delivers better 
overall B/W, but 4GB infiniband delivers better latency.


10 years ago I worked with 30+TB data sets which were preloaded into 
an Oracle database, with data structures highly optimized for the types 
of reads the applications required (a 2-3 day window for complex 
analysis of monthly data).  No SSDs or fancy stuff in those days.  But 
if your data is live/realtime and constantly streaming in, then the work 
profile can be dramatically different.


On 15/04/2013 07:17, Sašo Kiselkov wrote:

On 04/14/2013 05:15 PM, Wim van den Berge wrote:

Hello,

We have been running OpenIndiana (and its various predecessors) as storage
servers in production for the last couple of years. Over that time the
majority of our storage infrastructure has been moved to OpenIndiana, to the
point where we currently serve (iSCSI, NFS and CIFS) about 1.2PB from 10+
servers in three datacenters. All of these systems are pretty much the
same: large pool of disks, SSD for root, ZIL and L2ARC, 64-128GB RAM,
multiple 10Gb uplinks. All of these work like a charm.

However, the next system is going to be a little different. It needs to be
the absolute fastest iSCSI target we can create/afford. We'll need about
10-12TB of capacity, the working set will be 5-6TB, and I/O over time is
90% reads and 10% writes using 32K blocks, but this is a data analysis
scenario so all the writes are up front. Contrary to previous installs, money
is a secondary (but not unimportant) issue for this one. I'd like to stick
with a SuperMicro platform, and we've been thinking of trying the new Intel
S3700 800GB SSDs, which seem to run about $2K. Ideally I'd like to keep
the system cost below $60K.

This is new ground for us. Before this one, the game has always been
primarily about capacity/data integrity, and anything we designed based on
ZFS/OpenSolaris has always more than delivered in the performance arena.
This time we're looking to fill up the dedicated 10GbE connections to each
of the four to eight processing nodes as much as possible. The processing
nodes have been designed so that they will consume whatever storage bandwidth
they can get.

Any ideas/thoughts/recommendations/caveats would be much appreciated.

Hi Wim,

Interesting project. You should definitely look at all-SSD pools here.
With the 800GB DC S3700 running in 3-drive raidz1's you're looking at
approximately $34k CAPEX (for the 10TB capacity point) just for the
SSDs. That leaves you ~$25k you can spend on the rest of the box, which
is *a lot*. Be sure to put lots of RAM (512GB+) into the box.

Also consider ditching 10GE and going straight to IB. A dual-port QDR card
can be had nowadays for about $1k (SuperMicro even makes motherboards
with QDR-IB on-board) and a 36-port Mellanox QDR switch can be had for
about $8k (this integrates the IB subnet manager, so this is all you
need to set up an IB network):
http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7idproduct=158
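As a rough sanity check on the original requirement (back-of-envelope arithmetic 
using the 32K block size and the four to eight 10GbE-connected nodes stated above, 
not measured numbers): saturating a single 10GbE link with 32KB reads takes about

    10e9 b/s / (32768 bytes x 8) ~= 38,000 IOPS

so eight nodes at line rate is on the order of 300,000 read IOPS and roughly 
10GB/s of aggregate bandwidth out of the target, which is exactly the regime where 
an all-SSD pool plus either several 10GbE ports or QDR IB starts to matter.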

Cheers,
--
Saso





