On Jan 28, 2008 6:41 PM, Ciro Iriarte <[EMAIL PROTECTED]> wrote:
>
> 2008/1/28, Greg Freemyer <[EMAIL PROTECTED]>:
> > On Jan 28, 2008 3:51 PM, Ciro Iriarte <[EMAIL PROTECTED]> wrote:
> > > 2008/1/28, Greg Freemyer <[EMAIL PROTECTED]>:
> > >
> > > > On Jan 28, 2008 11:25 AM, Ciro Iriarte <[EMAIL PROTECTED]> wrote:
> > > > > Hi, does anybody have some notes about tuning md raid5, lvm and xfs? I'm
> > > > > getting 20 MB/s with dd and I think it can be improved. I'll add config
> > > > > parameters as soon as I get home. I'm using md raid5 on a motherboard
> > > > > with an nvidia sata controller, 4x 500 GB Samsung SATA2 disks and lvm with
> > > > > OpenSUSE [EMAIL PROTECTED]
> > > > >
> > > > > Regards,
> > > > > Ciro
> > > > > --
> > > >
> > > > I have not done any raid 5 perf. testing: 20 MB/sec seems pretty bad,
> > > > but not outrageous I suppose.  I can get about 4-5 GB/min from new sata
> > > > drives, so about 75 MB/sec from a single raw drive (ie. dd
> > > > if=/dev/zero of=/dev/sdb bs=4k).
> > > >
> > > > You don't say how you're invoking dd.  The default bs is only 512 bytes
> > > > I think, and that is totally inefficient with the Linux kernel.
> > > >
> > > > I typically use 4k, which maps to what the kernel uses.  ie. dd
> > > > if=/dev/zero of=big-file bs=4k count=1000 should give you a simple but
> > > > meaningful test.
> > > >
> > > > I think the default stride is 64k per drive, so if you're writing 3x 64K
> > > > at a time, you may get perfect alignment and avoid the overhead of
> > > > having to recalculate the parity all the time.
> > > >
> > > > As another data point, I would bump that up to 30x 64K and see if you
> > > > continue to get speed improvements.
> > > >
> > > > So tell us the write speed for
> > > > bs=512
> > > > bs=4k
> > > > bs=192k
> > > > bs=1920k
> > > >
> > > > And the read speeds for the same.  ie.  dd if=big-file of=/dev/null 
> > > > bs=4k, etc.
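> > > >
> > > > Something along these lines should cover it (just a rough sketch; the
> > > > /mnt/custom/ddtest path is only an example, count is scaled so each run
> > > > writes roughly 1 GB, and conv=fdatasync makes dd include the final flush
> > > > in its timing if your dd supports it):
> > > >
> > > > # write tests at several block sizes, ~1 GB each
> > > > dd if=/dev/zero of=/mnt/custom/ddtest bs=512   count=2000000 conv=fdatasync
> > > > dd if=/dev/zero of=/mnt/custom/ddtest bs=4k    count=250000  conv=fdatasync
> > > > dd if=/dev/zero of=/mnt/custom/ddtest bs=192k  count=5000    conv=fdatasync
> > > > dd if=/dev/zero of=/mnt/custom/ddtest bs=1920k count=500     conv=fdatasync
> > > > # read the same file back; drop the page cache first (if your kernel has
> > > > # /proc/sys/vm/drop_caches) so you measure the disks, not RAM
> > > > echo 3 > /proc/sys/vm/drop_caches
> > > > dd if=/mnt/custom/ddtest of=/dev/null bs=4k
> > > > dd if=/mnt/custom/ddtest of=/dev/null bs=1920k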
> > > >
> > > > I would expect the write speed to go up with each increase in bs, but
> > > > the read speed to be more or less constant.  Then you need to figure
> > > > out what sort of real-world block sizes you're going to be using.  Once
> > > > you have a bs, or a collection of bs sizes, that matches your needs, then
> > > > you can start tuning your stack.
> > > >
> > > > Greg
> > >
> > > Hi, I posted the first mail from my cell phone, so I couldn't add more
> > > info...
> > >
> > > - I created the raid with a chunk size of 256k.
> > >
> > > mainwks:~ # mdadm --misc --detail /dev/md2
> > > /dev/md2:
> > >         Version : 01.00.03
> > >   Creation Time : Sun Jan 27 20:08:48 2008
> > >      Raid Level : raid5
> > >      Array Size : 1465151232 (1397.28 GiB 1500.31 GB)
> > >   Used Dev Size : 976767488 (465.76 GiB 500.10 GB)
> > >    Raid Devices : 4
> > >   Total Devices : 4
> > > Preferred Minor : 2
> > >     Persistence : Superblock is persistent
> > >
> > >   Intent Bitmap : Internal
> > >
> > >     Update Time : Mon Jan 28 17:42:51 2008
> > >           State : active
> > >  Active Devices : 4
> > > Working Devices : 4
> > >  Failed Devices : 0
> > >   Spare Devices : 0
> > >
> > >          Layout : left-symmetric
> > >      Chunk Size : 256K
> > >
> > >            Name : 2
> > >            UUID : 65cb16de:d89af60e:6cac47da:88828cfe
> > >          Events : 12
> > >
> > >     Number   Major   Minor   RaidDevice State
> > >        0       8       33        0      active sync   /dev/sdc1
> > >        1       8       49        1      active sync   /dev/sdd1
> > >        2       8       65        2      active sync   /dev/sde1
> > >        4       8       81        3      active sync   /dev/sdf1
> > >
> > > - Speed reported by hdparm:
> > >
> > > mainwks:~ # hdparm -tT /dev/sdc
> > >
> > > /dev/sdc:
> > >  Timing cached reads:   1754 MB in  2.00 seconds = 877.60 MB/sec
> > >  Timing buffered disk reads:  226 MB in  3.02 seconds =  74.76 MB/sec
> > > mainwks:~ # hdparm -tT /dev/md2
> > >
> > > /dev/md2:
> > >  Timing cached reads:   1250 MB in  2.00 seconds = 624.82 MB/sec
> > >  Timing buffered disk reads:  620 MB in  3.01 seconds = 206.09 MB/sec
> > >
> > > - LVM:
> > >
> > > mainwks:~ # vgdisplay data
> > >   Incorrect metadata area header checksum
> > >   --- Volume group ---
> > >   VG Name               data
> > >   System ID
> > >   Format                lvm2
> > >   Metadata Areas        1
> > >   Metadata Sequence No  5
> > >   VG Access             read/write
> > >   VG Status             resizable
> > >   MAX LV                0
> > >   Cur LV                2
> > >   Open LV               2
> > >   Max PV                0
> > >   Cur PV                1
> > >   Act PV                1
> > >   VG Size               1.36 TB
> > >   PE Size               4.00 MB
> > >   Total PE              357702
> > >   Alloc PE / Size       51200 / 200.00 GB
> > >   Free  PE / Size       306502 / 1.17 TB
> > >   VG UUID               KpUAeN-mPjO-2K8t-hiLX-FF0C-93R2-IP3aFI
> > >
> > > mainwks:~ # pvdisplay /dev/sdc1
> > >   Incorrect metadata area header checksum
> > >   --- Physical volume ---
> > >   PV Name               /dev/md2
> > >   VG Name               data
> > >   PV Size               1.36 TB / not usable 3.75 MB
> > >   Allocatable           yes
> > >   PE Size (KByte)       4096
> > >   Total PE              357702
> > >   Free PE               306502
> > >   Allocated PE          51200
> > >   PV UUID               Axl2c0-RP95-WwO0-inHP-aJEF-6SYJ-Fqhnga
> > >
> > > - XFS:
> > >
> > > mainwks:~ # xfs_info /dev/data/test
> > > meta-data=/dev/mapper/data-test  isize=256    agcount=16, agsize=1638400 blks
> > >          =                       sectsz=512   attr=0
> > > data     =                       bsize=4096   blocks=26214400, imaxpct=25
> > >          =                       sunit=16     swidth=48 blks, unwritten=1
> > > naming   =version 2              bsize=4096
> > > log      =internal               bsize=4096   blocks=16384, version=1
> > >          =                       sectsz=512   sunit=0 blks, lazy-count=0
> > > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > >
> > > - The reported dd
> > > mainwks:~ # dd if=/dev/zero bs=1024k count=100 of=/mnt/custom/t3
> > > 100+0 records in
> > > 100+0 records out
> > > 104857600 bytes (105 MB) copied, 5.11596 s, 20.5 MB/s
> > >
> > >
> > > - New dd (seems to give a better result)
> > > mainwks:~ # dd if=/dev/zero bs=1024k count=1000 of=/mnt/custom/t0
> > > 1000+0 records in
> > > 1000+0 records out
> > > 1048576000 bytes (1.0 GB) copied, 13.6218 s, 77.0 MB/s
> > >
> > > Ciro
> > >
> >
> > I'm not sure why the old and new dd runs were so different.  I do
> > see the old one only ran for about 5 seconds, which is not much data
> > to base a test run on.
> >
> > If you really have 1MB average write sizes, you should read
> > http://oss.sgi.com/archives/xfs/2007-06/msg00411.html for a tuning
> > sample.
> >
> > Basically that post recommends:
> >
> > chunk size = 256KB
> > LVM align  = 3x chunk size = 768KB  (assumes a 4-disk raid5)
> >
> > And tune the XFS bsize/sunit/swidth to match.
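> >
> > For the XFS end of that, the mkfs call would look roughly like this (just a
> > sketch; the LV path is only an example, su is the md chunk size and sw is
> > the number of data disks, ie. 3 on a 4-disk raid5):
> >
> > mkfs.xfs -d su=256k,sw=3 /dev/data/test
> >
> > Note that the xfs_info you posted shows sunit=16, swidth=48 blks (64KB /
> > 192KB with a 4K bsize), which doesn't match your 256KB chunk.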
> >
> > But that all _assumes_ a large data write size.  If you have a more
> > typical desktop load, then the average write is way below that and you
> > need to really reduce all of the above (except bsize; I think a 4K
> > bsize is always best with Linux, but I'm not positive about that).
> >
> > Also, dd is only able to simulate a sequential data stream.  If you
> > don't have that kind of load, once again you need to reduce the chunk
> > size.  I think the generically preferred chunk size is 64KB, and with
> > some database apps, that can drop down to 4KB.
> >
> > So really and truly, you need to characterize your workload before you
> > start tuning.
> >
> > OTOH, if you just want bragging rights, test with and tune for a big
> > average write, but be warned that your typical performance will go
> > down at the same time that your large-write performance goes up.
> >
> > Greg
> > --
> > Greg Freemyer
> > Litigation Triage Solutions Specialist
> > http://www.linkedin.com/in/gregfreemyer
> > First 99 Days Litigation White Paper -
> > http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
> >
> > The Norcross Group
> > The Intersection of Evidence & Technology
> > http://www.norcrossgroup.com
> >
> Hi, I found that thread too; the problem is I'm not sure how to tune
> the LVM alignment, maybe --stripes & --stripesize at LV creation
> time? I can't find an option for pvcreate or vgcreate. It will
> basically be a repository for media files, movies, backups, ISO images,
> etc... For the rest (documents, ebooks and music) I'll create other
> LVs with Ext3.
>
> Regards,
> Ciro

Ok, I guess you know reads are not significantly impacted by the
tuning we're talking about.  This is mostly about tuning for raid5
write performance.

Anyway, are you planning to stripe together multiple md raid5 arrays via
LVM?  I believe that is what --stripes and --stripesize are for.  (ie.
If you have 8 drives, you could create 2 raid5 arrays and use LVM to
interleave them by using --stripes = 2.)  I've never used that
feature.
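
If you did go that route, I believe the invocation would be something like
this (purely illustrative; it assumes two raid5 arrays are both PVs in the
data VG, and the size and LV name are made up):

lvcreate --stripes 2 --stripesize 256k -L 200G -n striped_lv data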

You need to worry about the vg extents.  I think vgcreate
--physicalextentsize is what you need to tune.  I would make each
extent a whole number of stripes in size, ie. 768KB * N.  Maybe use
N=10, so -s 7680K.
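
On the command line that would be something like this (untested; be aware
that some LVM2 versions only accept power-of-two extent sizes and would
reject 7680K):

vgcreate -s 7680K data /dev/md2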

Assuming you're not using lvm stripes, and since this appears to be a new
setup, I would also use -C or --contiguous to ensure all the data is
sequential.  It may be overkill, but it will further ensure you _avoid_
LV extents that don't end on a stripe boundary.  (A stripe == 3 raid5
chunks for you.)
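
For example (size and LV name are just placeholders):

lvcreate -C y -L 200G -n media data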

Then if you are going to use the snapshot feature, you need to set
your chunksize efficiently.  If you are only going to have large
files, then I would use a large LVM snapshot chunksize.  256KB seems
like a good choice, but I have not benchmarked snapshot chunksizes.
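
e.g. something like this (again, names and sizes are only placeholders;
-c is the snapshot chunksize):

lvcreate -s -c 256k -L 20G -n media-snap /dev/data/media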

Greg
-- 
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
