2011/7/6 Adi Kriegisch <a...@cg.tuwien.ac.at>:
> Hi!
>
>> >> What filesystem do you use? XFS is known to be the recommended
>> >> filesystem for AoE.
>> > Actually I think this could be due to RAID block sizes: most AoE
>> > implementations assume a block size of 512 bytes. If you're using a Linux
>> > software RAID5 with a default chunk size of 512K and you're using 4 disks,
>> > a single "block" is 3*512K in size. This is what has to be written when
>> > changing data in a file, for example.
>> > mkfs.ext4 or mkfs.xfs respect those block sizes, stride sizes, stripe
>> > width and so on (see man pages) when the information is available (which is
>> > not the case when creating a file system on an AoE device).
>>
>> I'm using a LSI MegaRAID SAS 9280 RAID controller that is exposing a
>> single block device.
>> The RAID itself is a RAID6 configuration, using default settings.
>> MegaCLI says that the virtual drive has a "Strip Size" of 64KB.
> Hmmm... too bad, then reads are "hidden" by the controller and Linux cannot
> see them.
>

I'm not too happy about this either.
My intention from the start was to have the RAID controller just
expose the disks and let Linux handle the RAID side of things.
However, I was unable to convince the controller to do so.

>> It seems I get the same filesystem settings if I create the filesystem
>> right on the LVM volume,
>> or if I create it on the AoE volume.
> Hmmm... that means that the controller does not expose its chunk size to
> the operating system. The most important parameters here are:
> * stride = number of blocks on one raid disk (aka chunk-size/block-size)
> * stripe-width = number of strides of one data block in the raid
>
> Could you try to create the file system with
> "-E stride=16,stripe-width=16*(N-2)"
> where N is the number of disks in the array. There are plenty of sites out
> there about finding good parameters for mkfs and RAID (like
> http://www.altechnative.net/?p=96 or
> http://h1svr.anu.edu.au/wiki/H1DataStorage/MD1220Setup for example).
>

The RAID setup is 5 disks, so I guess that means 3 for data and 2 for parity.
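
If my arithmetic is right, the numbers for my array (64KB strips, 4KB
filesystem blocks, 3 data disks) come out as:

  stride       = 64KB strip / 4KB block   = 16 blocks
  stripe-width = 16 blocks * 3 data disks = 48 blocks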

I created the filesystem as you suggested; the resulting output from mkfs was:
root@storage01:~# mkfs.ext4 -E stride=16,stripe-width=48 /dev/aoepool0/aoetest
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=16 blocks, Stripe width=48 blocks
1310720 inodes, 5242880 blocks

I then mounted the newly created filesystem on the server, and gave it
a run with bonnie.
Bonnie reported a sequential write rate of ~225 MB/s, down from ~370 MB/s
with the default settings.

When I exported it using AoE, the throughput on the client was ~60
MB/s, down from ~70 MB/s.

So these particular filesystem settings don't seem to be right on the
money, but I guess it's a matter of tuning them.
I didn't see a massive increase in read operations with these
settings, though there was a bit more read activity going on.
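
(For the record, I'm watching the disk activity with extended iostat
output, something like "iostat -xk 5" on the server, where the rkB/s
column shows the reads actually hitting the array.)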

>> I have iostat running continually, and I have seen that "massive read"
>> problem earlier.
> The "problem" with AoE (or whatever intermediate network protocol iscsi,
> fcoe, ... you will use) is, that it needs to force writes to happen. The
> Linux kernel tries to assume the physical layout of the underlaying disk by
> at least using the file system layout on disk and tries to write one
> "physical block" at a time. (blockdev --report /dev/sdX reports what the
> kernel thinks how the physical layout looks like)
> Let's assume you have 6 disks in RAID6: 4 disks contain data; chunk size is
> 64K that means one "physical block" has a size of 4*64K = 256K. The file
> systems you created had a block size of 4K -- so in case AoE forces the
> kernel to commit every 4K, the RAID-Controller needs to read 256K, update
> 4K, calculate checksums and write 256K again. This is what is behind the
> "massive read" issue.
>
> Write rate should improve by creating the file system with correct stride
> size and stripe width. But there are other factors for this as well:
> * You're using lvm (which is an excellent tool). You need to create your
>  physical volumes with parameters that fit your RAID too, that is, use
>  "--dataalignmentoffset" and "--dataalignment". (The issue with LVM is
>  that it exports "physical extents" which need to be aligned to the
>  beginning of your RAID's boundaries. For testing purposes you might start
>  without LVM and try to align and export the filesystem via AoE first.
>  That way you get better reference numbers for further experiments.)
> * For real world scenarios it might be a better idea to recreate the RAID
>  with a smaller chunk size. This -- of course -- depends on what kind of
>  files you intend to store on that RAID. You should try to fit an average
>  file in more than just one "physical block"...
>

I haven't looked into storage at this level of detail before, so this
is the first time I'm tuning a system like this for production.
I'll read up and see if I can't get all these settings to align.
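
If I understand the LVM part correctly, redoing my physical volume
with explicit alignment would look something like this (untested on my
end; /dev/sda is just a stand-in for the MegaRAID device, and 192k is
one full stripe, 3 data disks * 64KB):

root@storage01:~# blockdev --report /dev/sda
root@storage01:~# pvcreate --dataalignment 192k /dev/sda
root@storage01:~# vgcreate aoepool0 /dev/sda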

>> However, when I'm doing these tests, I have a bare minimum of reads,
>> it's mostly all writes.
> As mentioned above: this is due to the controller "hiding" the real disk
> operations.
>
> Hope this helps... and please send back results!
>
> -- Adi
>

Thanks, I appreciate the help from you and all the others
here on aoetools-discuss.

What I'm not quite understanding is how exporting a device via AoE
would introduce new alignment problems or the like.
When I can write to the local filesystem at ~370 MB/s, what kind of
problem is introduced by using AoE or another network storage solution?

I did a quick iSCSI setup using iscsitarget and open-iscsi, but I saw the
exact same ~70 MB/s throughput there, so I guess this isn't related to
AoE in itself.
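
Next I'll probably take the filesystem out of the picture entirely and
write straight to the raw AoE device with O_DIRECT, to see how much the
request size alone matters (e0.1 is just a guess at the device name on
the client):

root@client:~# dd if=/dev/zero of=/dev/etherd/e0.1 bs=4K count=100000 oflag=direct
root@client:~# dd if=/dev/zero of=/dev/etherd/e0.1 bs=192K count=2000 oflag=direct

If the 192K run is much faster than the 4K run, that points at the
read-modify-write cycle Adi described rather than at AoE itself.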

-- 
Best regards
Torbjørn Thorsen
Developer / operations technician

Trollweb Solutions AS
- Professional Magento Partner
www.trollweb.no
