Hi!

> >> What filesystem do you use? XFS is known to be the recommended
> >> filesystem for AoE.
> > Actually I think this could be due to RAID block sizes: most AoE
> > implementations assume a block size of 512 bytes. If you're using Linux
> > software RAID5 with the default chunk size of 512K across 4 disks, a
> > single "block" spans 3*512K = 1.5M of data. That is what has to be
> > written when changing data in a file, for example.
> > mkfs.ext4 or mkfs.xfs respects those block sizes, stride sizes, stripe
> > widths and so on (see the man pages) when the information is available
> > (which is not the case when creating a file system on an AoE device).
> 
> I'm using a LSI MegaRAID SAS 9280 RAID controller that is exposing a
> single block device.
> The RAID itself is a RAID6 configuration, using default settings.
> MegaCLI says that the virtual drive has a "Strip Size" of 64KB.
Hmmm... too bad; the reads then happen "hidden" inside the controller and
Linux cannot see them.

> It seems I get the same filesystem settings if I create the filesystem
> right on the LVM volume,
> or if I create it on the AoE volume.
Hmmm... that means the controller does not expose its chunk size to
the operating system. The most important parameters here are:
* stride = the number of file system blocks per chunk on one RAID disk
  (i.e. chunk-size / block-size)
* stripe-width = stride * number of data disks, i.e. the number of file
  system blocks in one full stripe

Could you try to create the file system with
"-E stride=16,stripe-width=16*(N-2)", where N is the number of disks in
the array (compute 16*(N-2) yourself; mkfs does not evaluate the
expression)? There are plenty of sites out there about finding good
mkfs parameters for RAID (like http://www.altechnative.net/?p=96 or
http://h1svr.anu.edu.au/wiki/H1DataStorage/MD1220Setup for example).
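
As a concrete sketch, assuming a 6-disk RAID6 (so 4 data disks), the 64K
strip size MegaCLI reports and 4K file system blocks (the device name is
only a placeholder):

  # stride       = 64K chunk / 4K block = 16
  # stripe-width = 16 * (6 - 2)         = 64
  mkfs.ext4 -b 4096 -E stride=16,stripe-width=64 /dev/etherd/e0.0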

> I have iostat running continually, and I have seen that "massive read"
> problem earlier.
The "problem" with AoE (or whatever intermediate network protocol iscsi,
fcoe, ... you will use) is, that it needs to force writes to happen. The
Linux kernel tries to assume the physical layout of the underlaying disk by
at least using the file system layout on disk and tries to write one
"physical block" at a time. (blockdev --report /dev/sdX reports what the
kernel thinks how the physical layout looks like)
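
For example (sdX is a placeholder):

  # prints read-only flag, read-ahead, sector size, block size,
  # start sector and size for each listed device
  blockdev --report /dev/sdX
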
Let's assume you have 6 disks in RAID6: 4 disks contain data and the
chunk size is 64K, so one "physical block" (a full stripe) is 4*64K =
256K. The file systems you created had a block size of 4K -- so if AoE
forces the kernel to commit every 4K, the RAID controller needs to read
256K, update 4K, recalculate the parity and write 256K again. This is
what is behind the "massive read" issue.

The write rate should improve once the file system is created with the
correct stride size and stripe width. But there are other factors as
well:
* You're using LVM (which is an excellent tool). You need to create your
  physical volumes with parameters that fit your RAID too, that is, use
  "--dataalignment" and "--dataalignmentoffset" -- see the sketch after
  this list. (The issue with LVM is that it exports "physical extents"
  which need to be aligned to your RAID's stripe boundaries. For testing
  purposes you might start without LVM and try to align and export the
  file system via AoE first. That way you get better reference numbers
  for further experiments.)
* For real-world scenarios it might be a better idea to recreate the
  RAID with a smaller chunk size. This -- of course -- depends on what
  kind of files you intend to store on that RAID. You should try to fit
  an average file into more than just one "physical block"...
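
A minimal sketch for the 6-disk RAID6 example above (4 data disks * 64K
chunk = 256K full stripe; /dev/sdb and the volume names are
placeholders):

  # align the start of the LVM data area (and thus the physical
  # extents) to the 256K stripe boundary
  pvcreate --dataalignment 256K /dev/sdb
  vgcreate vg_aoe /dev/sdb
  lvcreate -L 100G -n lv_export vg_aoe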

> However, when I'm doing these tests, I have a bare minimum of reads,
> it's mostly all writes.
As mentioned above: this is because the controller "hides" the real disk
operations.

Hope this helps... and please send back your results!

-- Adi
