On Sat, Aug 07, 2010 at 08:06:02PM +1000, Daniel Pittman spake thusly:

> I was wondering what sort of experiences y'all had with ATA over Ethernet in
> a primarily Linux environment. This would be operated as backing storage for
> KVM based virtual machines, Linux host and guest.
For the last 3 years I have been using AoE extensively to back Xen based VMs, although soon I will be using KVM.

> We would be looking, at this stage, to attach the AOE devices to the host,
> then use an LVM layer atop that, then the KVM guest devices stored as raw LVM
> logical volumes.[1]

I use LVM on the AoE target to carve out volumes to export to the VMs, and then the VMs use LVM on their volume to divvy up the space however they like.

I need to be able to do VM migration. It has been a great help in avoiding VM service interruptions when I need to do something with a particular piece of hardware. This means multiple Xen machines need to be able to see the same disk, so I cannot export the whole disk to the multiple initiators and let them use LVM to carve it up. I need to look into cluster LVM (clvm) for this. So far I have not played with it at all, wanting to keep things simple. But there is some added complexity in having to log into the target to do LVM there and keep track of which LV goes with which VM, so clvm may be worth it.

> The deployment would be Gigabit Ethernet, typically using a dedicated SAN NIC
> in each host — but we have a bunch of single NIC legacy hardware that I would
> be loath to throw out just for this, so comments on the cost of sharing that
> port with regular TCP would be interesting.

I find minimal cost in sharing that port with regular TCP. In fact, all of my machines have two interfaces which I bond using 802.3ad channel bonding. Then I run VLANs in the switch, configure each VLAN on the Xen Dom0 using vconfig, and set up the appropriate bridges with brctl. Actually, RedHat does this for me in their init scripts if I do things like this:

# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MASTER=bond0
SLAVE=yes

# cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MASTER=bond0
SLAVE=yes

# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MTU=9000

# cat /etc/sysconfig/network-scripts/ifcfg-bond0.2
DEVICE=bond0.2
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
VLAN=yes
BRIDGE=external
MTU=1500

# cat /etc/sysconfig/network-scripts/ifcfg-bond0.3
DEVICE=bond0.3
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
VLAN=yes
BRIDGE=dmz
MTU=1500

# cat /etc/sysconfig/network-scripts/ifcfg-bond0.5
DEVICE=bond0.5
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
VLAN=yes
MTU=9000
BRIDGE=san

You need to make sure you use 9000 MTU packets and a quality switch to get good performance. Also make sure your NICs can do 9000 MTU, of course. (There is a quick way to verify this sketched just below.)

By using bonding in this way I can get 200MB/s if talking to enough disks spread across different machines. This form of bonding does not really get you 2Gb/s to any single machine, due to the MAC hashing algorithm it uses to decide which link to send data over, but given at least two machines to talk to you can get a combined total of 200MB/s if the disks in the machines can support it.

I always deploy AoE target machines in pairs. I internally RAID0 the disks (just two disks per RAID0 so far) and then the virtual machine does software mirroring over the two AoE machines. This provides reliability in the event that one goes down (which happened to one of my machines just this past week: all of the VMs kicked the failed disk out of the mirror and continued uninterrupted, as expected). So technically, all of my VMs are sitting on RAID 10.
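On the jumbo frame point above: a quick sanity check is to ping a neighbor on the SAN VLAN with the don't-fragment bit set. The 10.0.5.2 address here is just a made-up example of another host on the SAN VLAN, not from my actual config:

# ip link show bond0.5 | grep -o 'mtu [0-9]*'   # confirm the interface really came up at mtu 9000
# ping -M do -s 8972 -c 3 10.0.5.2              # 8972 = 9000 - 20 (IP header) - 8 (ICMP header)

If the replies come back instead of "Frag needed" errors, your NICs and switch are passing 9000 byte frames end to end.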
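The mirroring itself is nothing exotic, just md RAID1 across the two targets. A rough sketch, assuming the pair is exported as shelf 1 and shelf 2 (the shelf numbers are illustrative, with slot 1 holding this machine's volume on each):

# modprobe aoe        # load the AoE initiator; targets appear under /dev/etherd/
# aoe-discover        # probe the network for exported devices
# mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/etherd/e1.1 /dev/etherd/e2.1

If one target machine dies, md kicks that half out of the mirror and everything carries on against the surviving target.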
Deploying in pairs also helps to increase the chances of being able to take full advantage of the 2Gb/s channel bond.

The RAM in the AoE target ends up being used like disk cache, which can be very nice for performance. Just make sure you have the thing on a UPS, because if it loses power you could lose writes that it has cached. This has never caused me any problems, as the kernel is pretty good about getting writes committed quickly, but it is something to bear in mind. Retrieving something from the RAM of the other machine via the network is often faster than retrieving it from local disk.

> Performance of the Linux ATAoE client for "extra" disk storage, rather than
> storage needed to actually boot the system. I am happy with a couple of local
> disks for storing the OS.

Aside from a couple of milliseconds of extra latency due to the network round trip time, you should be able to utilize the full potential of your disks, up to the 1Gb/s each link provides you. AoE and even the Ethernet network get in the way very little compared to the mechanical limitations of the disks.

Don't make the same mistake I have made (a couple of times) in setting this stuff up: you can very easily throw so many virtual machines onto a single 1TB disk that you totally max out the disk for IOPS. If you think your SAN is running slow, take a look at how much bandwidth you are passing. Odds are it won't be nearly as much as you think, nowhere near 1Gb/s, and that your disk heads are racing back and forth trying to satisfy all of the IO requests. Lots of spindles are what you need. (See the iostat note at the end of this message.) Of course, this goes for pretty much any SAN or disk system in general.

> Performance of the vblade and ggaoed software ATAoE target implementations,
> hosted on Linux. Backing would be solid local storage on LSI hardware RAID,
> and currently these give great performance, so I am happy we have plenty of
> IOPS and disk bandwidth to play with there.

The modern vblade stuff works great. Make sure you provide a larger-than-default buffer count using the -b option. Especially with gigabit, the default buffer count will get you very poor performance. They really need to up this default to something reasonable, as it may be turning off people who only ever see that poor performance. I invoke vblade like so:

/usr/sbin/vbladed -b 16384 1 1 bond0 /dev/diskb/e1.1 # admin1

This forks off a vblade process with 16384 buffers, shelf 1 slot 1 (the AoE equivalent of a LUN), listening on interface bond0 and exporting the logical volume /dev/diskb/e1.1. I currently name my LVs for their AoE device name, which will be /dev/etherd/e1.1 on the initiator, but I may change that if I go with cluster LVM; then maybe I can just name the LV for the virtual machine it goes with. (A full create-and-attach example follows at the end of this message.)

The AoE kernel module that comes with the kernel is horribly out of date and totally unusable. Be sure to upgrade the kernel module to the latest.

> Performance and manageability of the Coraid hardware — and, ideally, how well
> it plays in a mixed software and dedicated environment with the software
> targets.

I have only used the Coraid hardware once, a little over 3 years ago. It has changed a lot since then, so I can't comment much on it. Since Coraid builds the hardware and distributes vblade, I would bet things work well in a mixed hardware/software AoE target environment. Coraid should be able to give you guidance on this.
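To tie the vblade pieces together, adding a new volume end to end looks roughly like this; the 50G size and the e1.2 name are made-up examples following the naming scheme I described above:

On the target:

# lvcreate -L 50G -n e1.2 diskb                          # carve a new LV out of the diskb VG
# /usr/sbin/vbladed -b 16384 1 2 bond0 /dev/diskb/e1.2   # export it as shelf 1, slot 2

On the initiator:

# aoe-discover           # re-probe the network for new targets
# aoe-stat               # the new device should be listed as e1.2
# ls -l /dev/etherd/e1.2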
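And the promised note on the IOPS trap: iostat from the sysstat package makes it easy to spot on the target.

# iostat -x 5            # extended per-disk stats every 5 seconds

Watch the r/s and w/s columns (that is your IOPS) against %util: a disk sitting near 100% utilized while moving only a few MB/s is seeking itself to death, and the cure is more spindles, not more network.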
-- 
Tracy Reed
http://tracyreed.org
