On Sat, Aug 07, 2010 at 08:06:02PM +1000, Daniel Pittman spake thusly:
> I was wondering what sort of experiences y'all had with ATA over Ethernet in a
> primarily Linux environment.  This would be operated as backing storage for
> KVM based virtual machines, Linux host and guest.

For the last 3 years I have been using AoE extensively to back Xen-based
VMs, although soon I will be moving to KVM.

> We would be looking, at this stage, to attach the AOE devices to the host,
> then use an LVM layer atop that, then the KVM guest devices stored as raw LVM
> logical volumes.[1]

I use LVM on the AoE target to carve out volumes to export to the VMs
and then the VMs use LVM on their volume to divvy up the space however
they like.

I need to be able to do VM migration; it has been a great help in
avoiding VM service interruptions when I need to do something with a
particular piece of hardware. This means multiple Xen machines need to
be able to see the same disk, so I cannot export the whole disk to the
multiple initiators and let them each use LVM to carve it up. I need to
look into cluster LVM (CLVM) for this. So far I have not played with it
at all, wanting to keep things simple. But there is some added
complexity in having to log into the target to do LVM there and keep
track of which LV goes to which VM, so CLVM may be worth it.
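
The target-side carving itself is just ordinary LVM. A minimal sketch,
assuming a volume group named diskb (matching the /dev/diskb/e1.1 path
in the vblade invocation further down) and an illustrative 20G size:

# lvcreate -L 20G -n e1.1 diskb
# lvs diskb

The resulting /dev/diskb/e1.1 is what gets handed to vblade to export.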

> The deployment would be Gigabit Ethernet, typically using a dedicated SAN NIC
> in each host — but we have a bunch of single NIC legacy hardware that I would
> be loath to throw out just for this, so comments on the cost of sharing that
> port with regular TCP would be interesting.

I find minimal cost in sharing that port with regular TCP. In fact,
all of my machines have two interfaces which I bond using 802.3ad
channel bonding. Then I run VLANs on the switch and configure each
VLAN on the Xen dom0 using vconfig, with the appropriate bridges set
up via brctl. Actually, RedHat does this for me in its init scripts if
I do things like this:

# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MASTER=bond0
SLAVE=yes

# cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MASTER=bond0
SLAVE=yes

# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MTU=9000

# cat /etc/sysconfig/network-scripts/ifcfg-bond0.2 
DEVICE=bond0.2
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
VLAN=yes
BRIDGE=external
MTU=1500

# cat /etc/sysconfig/network-scripts/ifcfg-bond0.3
DEVICE=bond0.3
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
VLAN=yes
BRIDGE=dmz
MTU=1500

# cat /etc/sysconfig/network-scripts/ifcfg-bond0.5 
DEVICE=bond0.5
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
VLAN=yes
MTU=9000
BRIDGE=san
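
One thing the files above don't show is where the 802.3ad mode itself
gets set. On RHEL that typically lives either in BONDING_OPTS in
ifcfg-bond0 or in /etc/modprobe.conf; something along these lines (the
miimon value is illustrative):

# grep bond /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=802.3ad miimon=100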

You need to make sure you use a 9000-byte MTU (jumbo frames) and a
quality switch to get good performance. Also make sure your NICs can
do a 9000 MTU, of course. By using bonding in this way I can get
200MB/s when talking to enough disks spread across different machines.
This form of bonding does not really get you 2Gb/s to each machine,
because it uses a MAC-hashing algorithm to decide which link to send
data over, but given at least two machines to talk to you can get a
combined total of 200MB/s if the disks in the machines can support it.
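
You can sanity-check that the bond actually came up in 802.3ad mode,
and see which transmit hash policy it is using, via the bonding
driver's proc interface; the output will look roughly like this:

# grep -i 'mode\|hash' /proc/net/bonding/bond0
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)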

I always deploy AoE target machines in pairs. I internally RAID0 the
disks (just two disks per RAID0 so far) and then the virtual machine
does software mirroring across the two AoE machines. This provides
reliability in the event that one goes down (which happened to one of
my machines just this past week; all of the VMs kicked the failed disk
out of the mirror and continued uninterrupted, as expected). So
technically, all of my VMs are sitting on RAID 0+1 (mirrored stripes).
Deploying in pairs also helps increase the chances of being able to
take full advantage of the 2Gb/s channel bond.
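
Inside the VM the mirroring is plain md RAID1 across the two AoE
devices. A sketch, assuming the pair of targets show up on the
initiator as /dev/etherd/e1.1 and /dev/etherd/e2.1 (the device names
are illustrative):

# mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/etherd/e1.1 /dev/etherd/e2.1
# mdadm --detail /dev/md0

If one target dies, md kicks that half out and the VM keeps running on
the surviving one.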

The RAM in the AoE target ends up being used as disk cache, which can
be very nice for performance. Just make sure you have the machine on a
UPS, because if it loses power you could lose writes that it has
cached. This has never caused me any problems, as the kernel is pretty
good about getting writes committed quickly, but it is something to
bear in mind. Retrieving something from the RAM of another machine via
the network is often faster than retrieving it from local disk.
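
If you are nervous about how long dirty data can sit in the target's
RAM, the writeback behavior is tunable through standard Linux sysctls;
the values here are purely illustrative, not a recommendation:

# sysctl -w vm.dirty_expire_centisecs=1500  # write dirty pages out after 15s
# sysctl -w vm.dirty_background_ratio=5     # start background writeback sooner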

> Performance of the Linux ATAoE client for "extra" disk storage, rather than
> storage needed to actually boot the system.  I am happy with a couple of local
> disks for storing the OS.

Aside from a couple of milliseconds of extra latency due to the
network round-trip time, you should be able to utilize the full
potential of your disks up to the 1Gb/s each link provides. AoE and
even the Ethernet network get in the way very little compared to the
mechanical limitations of the disks.
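
A quick way to measure that round-trip time, and to confirm jumbo
frames survive the whole path, is a don't-fragment ping at the maximum
payload (8972 bytes of data plus 28 bytes of headers = 9000), using a
hypothetical target hostname:

# ping -M do -s 8972 aoe-target1

If anything in the path can't handle a 9000-byte frame, the ping fails
rather than being silently fragmented.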

Don't make the same mistake I have made (a couple of times) in setting
this stuff up: you can very easily throw so many virtual machines onto
a single 1TB disk that you totally max out the disk's IOPS. If you
think your SAN is running slow, take a look at how much bandwidth you
are passing. Odds are it won't be nearly as much as you think, nowhere
near 1Gb/s, and your disk heads are racing back and forth trying to
satisfy all of the IO requests. Lots of spindles are what you need. Of
course, this goes for pretty much any SAN or disk system in general.
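
iostat from the sysstat package makes this easy to see on the target;
the interesting columns are r/s and w/s (the IOPS) versus the actual
throughput:

# iostat -xk 5

A disk pinned at 100 %util while moving only a few MB/s is the
seek-bound signature described above.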

> Performance of the vblade and ggaoed software ATAoE target implementations,
> hosted on Linux.  Backing would be solid local storage on LSI hardware RAID,
> and currently these give great performance, so I am happy we have plenty of
> IOPS and disk bandwidth to play with there.

The modern vblade stuff works great. Make sure you provide a
larger-than-default buffer count using the -b option; especially with
gigabit, the default buffer count will get you very poor performance.
They really need to raise this default to something reasonable, as it
may be turning off people who get poor performance. I invoke vblade
like so:

/usr/sbin/vbladed -b 16384 1 1 bond0 /dev/diskb/e1.1 # admin1

This forks off a vblade process with 16384 buffers, shelf 1 slot 1
(the AoE equivalent of a LUN), listening on interface bond0 and
exporting the logical volume /dev/diskb/e1.1. I currently name my LVs
after their AoE device names, which show up as /dev/etherd/e1.1 on the
initiator, but I may change that if I go with cluster LVM; then maybe
I can just name the LV after the virtual machine it goes with.
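
On the initiator side, once the aoe module is loaded, the aoetools
package provides the discovery and status commands:

# aoe-discover   # probe the network for AoE targets
# aoe-stat       # list discovered devices with size and interface
# ls /dev/etherd/

The exported volumes then show up under /dev/etherd/ ready for LVM,
md, or direct use.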

The AoE kernel module that ships with the kernel is horribly out of
date and totally unusable. Be sure to upgrade the module to the latest
version.
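
Coraid distributes an updated standalone driver (the aoe6 series). A
sketch of the upgrade, assuming the tarball builds with the usual make
targets against your running kernel:

# tar xzf aoe6-<version>.tar.gz && cd aoe6-<version>
# make && make install
# rmmod aoe; modprobe aoe

Check modinfo aoe afterward to confirm you are loading the new version
rather than the stale in-tree one.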

> Performance and manageability of the Coraid hardware — and, ideally, how well
> it plays in a mixed software and dedicated environment with the software
> targets.

I have only used the Coraid hardware once, a little over 3 years ago.
It has changed a lot since then, so I can't comment much on it. Since
Coraid builds the hardware and distributes vblade, I would bet things
work well in a mixed hardware/software AoE target environment. Coraid
should be able to give you guidance on this.

-- 
Tracy Reed
http://tracyreed.org
