2011/7/6 Adi Kriegisch <a...@cg.tuwien.ac.at>:
> Hi!
>
>> >> I'm using a LSI MegaRAID SAS 9280 RAID controller that is exposing a
>> >> single block device.
>> >> The RAID itself is a RAID6 configuration, using default settings.
>> >> MegaCLI says that the virtual drive has a "Strip Size" of 64KB.
>> > Hmmm... too bad, then reads are "hidden" by the controller and Linux cannot
>> > see them.
>> >
>>
>> I'm not too happy about this either.
>> My intention from the start was to get the RAID controller to just
>> expose the disks and let Linux handle the RAID side of things.
>> However, I was unsuccessful in convincing the RAID controller to do so.
> Too bad... I'd prefer a Linux software RAID too...
> btw. there are hw-raid management tools available for linux. You probably
> want to check out http://hwraid.le-vert.net/wiki.
>

Unfortunately, there doesn't seem to be any free or open tool available
for the line of cards I'm using.
http://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS

>> > Could you try to create the file system with "-E 
>> > stride=16,stripe-width=16*(N-2)"
>> > where N is the number of disks in the array. There are plenty of sites out
>> > there about finding good parameters for mkfs and RAID (like
>> > http://www.altechnative.net/?p=96 or
>> > http://h1svr.anu.edu.au/wiki/H1DataStorage/MD1220Setup for example).
>> >
>>
>> The RAID setup is 5 disks, so I guess that means 3 for data and 2 for parity.
> correct.
>
>> I created the filesystem as you suggested, the resulting output from mkfs 
>> was:
>> root@storage01:~# mkfs.ext4 -E stride=16,stripe-width=48 
>> /dev/aoepool0/aoetest
> [SNIP]
>> I then mounted the newly created filesystem on the server, and gave it
>> a run with bonnie.
>> Bonnie reported a sequential write rate of ~225 MB/s, down from ~370 MB/s
>> with the default settings.
>>
>> When I exported it using AoE, the throughput on the client was ~60
>> MB/s, down from ~70 MB/s.
> The values you used are correct for 3 data disks with 64K chunk size.
> Probably this issue is related to a misalignment of LVM. LVM adds a header
> which has a default size of 192K -- that would perfectly match your
> RAID: 3*64K = 192K...
> but the default "physical extent" size does not match your RAID: 4MB cannot
> be divided by 192K: (4*1024)/192 = 21.333. That means your LVM chunks
> aren't properly aligned -- and I doubt you can align them, because the
> physical extent size needs to be a power of two (and > 1K), while alignment
> with the RAID would require it to be divisible by 192K... The only way out
> would be to change the number of disks in the array to 4 or 6. :-(
> Could you just once try the raw device with the stride and stripe-width
> values used above? (without LVM in between)
>
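
In case I go back to an LVM setup, I gather one way to check whether the
data area is aligned to the 192K stripe would be something like this
(device names are placeholders):

  # report where the data area (first physical extent) starts on the PV
  pvs -o +pe_start --units k /dev/sdX
  # create the PV with the data area explicitly aligned to the full stripe
  pvcreate --dataalignment 192k /dev/sdX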

I've reinstalled the server, so that I can easily try different configurations
on the RAID controller.
However, none of the settings I have tried goes any faster than 70 MB/s.
I've tried adjusting the stripe size and creating filesystems accordingly,
but I haven't seen any improvement in throughput.
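
For each stripe size I derived the mkfs parameters the same way as above;
as a rough sketch (assuming the default 4K ext4 block size; the device
name is a placeholder):

  # stride       = RAID chunk size / filesystem block size = 64K / 4K = 16
  # stripe-width = stride * number of data disks           = 16 * (5 - 2) = 48
  mkfs.ext4 -b 4096 -E stride=16,stripe-width=48 /dev/sdX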

In my latest test, the RAID volume is just a simple two-disk stripe.
This volume is then exported directly with AoE, no LVM or mdadm.
With this test I hoped to eliminate any problem related to having
the RAID controller generate parity for unaligned writes.
However, I'm still seeing writes of ~70 MB/s.

I also tested the network with iperf, which reported a throughput of
~960 Mbit/s, as expected.
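The test was just the standard iperf client/server pair, roughly:

  iperf -s                     # on the AoE target (storage01)
  iperf -c storage01 -t 30     # on the initiator, 30 second run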

>> Thanks, I appreciate the help from you and all the others
>> who have been very helpful here on aoetools-discuss.
> You're welcome! And thank you very much for always reporting back the
> results.
>
>> What I'm not quite understanding is how exporting a device via AoE
>> would introduce new alignment problems or similar.
>> When I can write to the local filesystem at ~370 MB/s, what kind of
>> problem is introduced by using AoE or another network storage solution?
>>
>> I did a quick iSCSI setup using iscsitarget and open-iscsi, but I saw the
>> exact same ~70 MB/s throughput there, so I guess this isn't related to
>> AoE in itself.
> There are two root causes for these issues:
> * SAN protocols force a "commit" of unwritten data, be it a "sync", direct
>  i/o or whatever, way more often than local disks -- for the sake of data
>  integrity. (actually write barriers should be enabled for all those AoE
>  devices -- especially with newer kernels.)

I guess this is different from doing everything with "sync" enabled, though?
If I mount the filesystem with the "sync" option, I get a different throughput.
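For what it's worth, the comparison I have in mind is roughly the two mount
variants below (the mount point is a placeholder):

  # every write committed synchronously before the call returns
  mount -o remount,sync /mnt/aoetest
  # the default async behaviour, with write barriers (the ext4 default
  # on newer kernels, as mentioned above)
  mount -o remount,async,barrier=1 /mnt/aoetest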

> * AoE (and iSCSI too) uses a block size of 4K (a size that perfectly fits
>  into a jumbo frame). So all I/O is aligned around this size. When using a
>  filesystem like ext4 or xfs one can influence the block sizes by creating
>  the file system properly.
>
> And now for some ASCII art:
> let's say a simple hard disk has the following physical blocks:
> +----+----+----+----+----+----+----+----+----+----+-..-+
> | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | .. |
> +----+----+----+----+----+----+----+----+----+----+-..-+
>
> then a RAID 5 with a chunk size of 2 hard disk blocks, consisting of 4 disks,
> looks like this (D1 1-2 means disk 1, blocks 1 and 2):
> +----+----+----+----+----+----+----+----+----+----+-..-+
> | D1 1-2  | D2 1-2  | D3 1-2  | D4 1-2  | D1 3-4  | .. |
> +----+----+----+----+----+----+----+----+----+----+-..-+
> \------------ DATA -----------/\-PARITY-/
> \                                      / \
>  ----------- RAID block 1 -------------   --------- ..
>
> One data block of this RAID can only be written as a whole. So whenever only
> one bit within that block changes, the whole block has to be written again
> (because the parity is only valid for the block as a whole).
>
> Now imagine you have an LVM header that is half the size of a RAID block: it
> will fill the first half of the block, and the first LVM volume will then fill
> the rest of that block plus some more blocks and a half at the end. Write
> operations are then not aligned and cause massive rewrites in the backend.
>
> From my point of view there are several ways to find the root cause of the
> issues:
> * try a different RAID level (like 10 or so)
> * (re)-try to export the disks to Linux as JBODs.
> * try different filesystem and lvm parameters (actually you'd better write a
>  script for that... ;-)
>
> And, let us know about the results!
> Thanks,
>        Adi
>

Thank you for that very thorough explanation; I've just learned a lot
about I/O and alignment.
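On the 4K-per-jumbo-frame point: making sure jumbo frames are in use on
both ends is just a matter of the following (the interface name is a
placeholder, and the switch in between has to support it too):

  ip link set eth1 mtu 9000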

As I mentioned, I have tried different configurations, trying to avoid
any source of alignment issues.
My last attempt has no parity in the RAID setup; the virtual device
from the controller is partitioned and exported via AoE.

With this setup, I get the same ~70 MB/s I have been fighting with for
a while now.
It seems curious that I get ~70 MB/s no matter what changes I make
to the configuration, so I'm beginning to suspect my testing method is broken.
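
One way to take the filesystem and page cache out of the picture entirely
would be raw sequential writes with direct I/O on both ends, something like
this (a sketch; device paths are placeholders, and this overwrites data):

  # on the client: write straight to the AoE device (path as reported by aoe-stat)
  dd if=/dev/zero of=/dev/etherd/e0.0 bs=1M count=2048 oflag=direct
  # on the server: the same against the raw RAID volume, for comparison
  dd if=/dev/zero of=/dev/sdX bs=1M count=2048 oflag=direct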

-- 
Best regards
Torbjørn Thorsen
Developer / operations technician

Trollweb Solutions AS
- Professional Magento Partner
www.trollweb.no
