2008/1/28, Greg Freemyer <[EMAIL PROTECTED]>:
> On Jan 28, 2008 6:41 PM, Ciro Iriarte <[EMAIL PROTECTED]> wrote:
> >
> > 2008/1/28, Greg Freemyer <[EMAIL PROTECTED]>:
> > > On Jan 28, 2008 3:51 PM, Ciro Iriarte <[EMAIL PROTECTED]> wrote:
> > > > 2008/1/28, Greg Freemyer <[EMAIL PROTECTED]>:
> > > >
> > > > > On Jan 28, 2008 11:25 AM, Ciro Iriarte <[EMAIL PROTECTED]> wrote:
> > > > > > Hi, does anybody have notes on tuning md raid5, lvm and xfs? I'm
> > > > > > getting 20 MB/s with dd and I think it can be improved. I'll add
> > > > > > config parameters as soon as I get home. I'm using md raid5 on a
> > > > > > motherboard with an nvidia sata controller, 4x500GB Samsung SATA2
> > > > > > disks and lvm with OpenSUSE [EMAIL PROTECTED]
> > > > > >
> > > > > > Regards,
> > > > > > Ciro
> > > > > > --
> > > > >
> > > > > I have not done any raid 5 perf. testing: 20 MB/sec seems pretty bad,
> > > > > but not outrageous I suppose.  I can get about 4-5 GB/min from new sata
> > > > > drives, so about 75 MB/sec from a single raw drive (ie. dd
> > > > > if=/dev/zero of=/dev/sdb bs=4k).
> > > > >
> > > > > You don't say how you're invoking dd.  The default bs is only 512 bytes
> > > > > I think, and that is very inefficient with the Linux kernel.
> > > > >
> > > > > I typically use 4k, which maps to what the kernel uses.  ie. dd
> > > > > if=/dev/zero of=big-file bs=4k count=1000 should give you a simple but
> > > > > meaningful test.
> > > > >
> > > > > I think the default chunk is 64k per drive, so if you're writing 3x 64K
> > > > > at a time, you may get perfect alignment and avoid the overhead of
> > > > > having to recalculate the parity all the time.
> > > > >
> > > > > As another data point, I would bump that up to 30x 64K and see if you
> > > > > continue to get speed improvements.
> > > > >
> > > > > So tell us the write speed for
> > > > > bs=512
> > > > > bs=4k
> > > > > bs=192k
> > > > > bs=1920k
> > > > >
> > > > > And the read speeds for the same.  ie. dd if=big-file of=/dev/null bs=4k, etc.
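> > > > >
> > > > > Something along these lines should cover it (untested sketch; the file
> > > > > name and ~1GB test sizes are just placeholders, and conv=fdatasync plus
> > > > > dropping the cache are optional, they just keep the page cache from
> > > > > inflating the numbers):
> > > > >
> > > > > # write speeds, each run roughly 1GB
> > > > > dd if=/dev/zero of=big-file bs=512   count=2000000 conv=fdatasync
> > > > > dd if=/dev/zero of=big-file bs=4k    count=250000  conv=fdatasync
> > > > > dd if=/dev/zero of=big-file bs=192k  count=5000    conv=fdatasync
> > > > > dd if=/dev/zero of=big-file bs=1920k count=500     conv=fdatasync
> > > > >
> > > > > # read speeds on the same file, dropping the cache before each run
> > > > > for bs in 512 4k 192k 1920k; do
> > > > >   echo 3 > /proc/sys/vm/drop_caches
> > > > >   dd if=big-file of=/dev/null bs=$bs
> > > > > done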
> > > > >
> > > > > I would expect the write speed to go up with each increase in bs, but
> > > > > the read speed to be more or less constant.  Then you need to figure
> > > > > out what sort of real world block sizes you're going to be using.  Once
> > > > > you have a bs, or collection of bs sizes that match your needs, then
> > > > > you can start tuning your stack.
> > > > >
> > > > > Greg
> > > >
> > > > Hi, I posted the first mail from my cell phone, so I couldn't add
> > > > more info.
> > > >
> > > > - I created the raid with a chunk size of 256K.
> > > >
> > > > mainwks:~ # mdadm --misc --detail /dev/md2
> > > > /dev/md2:
> > > >         Version : 01.00.03
> > > >   Creation Time : Sun Jan 27 20:08:48 2008
> > > >      Raid Level : raid5
> > > >      Array Size : 1465151232 (1397.28 GiB 1500.31 GB)
> > > >   Used Dev Size : 976767488 (465.76 GiB 500.10 GB)
> > > >    Raid Devices : 4
> > > >   Total Devices : 4
> > > > Preferred Minor : 2
> > > >     Persistence : Superblock is persistent
> > > >
> > > >   Intent Bitmap : Internal
> > > >
> > > >     Update Time : Mon Jan 28 17:42:51 2008
> > > >           State : active
> > > >  Active Devices : 4
> > > > Working Devices : 4
> > > >  Failed Devices : 0
> > > >   Spare Devices : 0
> > > >
> > > >          Layout : left-symmetric
> > > >      Chunk Size : 256K
> > > >
> > > >            Name : 2
> > > >            UUID : 65cb16de:d89af60e:6cac47da:88828cfe
> > > >          Events : 12
> > > >
> > > >     Number   Major   Minor   RaidDevice State
> > > >        0       8       33        0      active sync   /dev/sdc1
> > > >        1       8       49        1      active sync   /dev/sdd1
> > > >        2       8       65        2      active sync   /dev/sde1
> > > >        4       8       81        3      active sync   /dev/sdf1
> > > >
> > > > - Speed reported by hdparm:
> > > >
> > > > mainwks:~ # hdparm -tT /dev/sdc
> > > >
> > > > /dev/sdc:
> > > >  Timing cached reads:   1754 MB in  2.00 seconds = 877.60 MB/sec
> > > >  Timing buffered disk reads:  226 MB in  3.02 seconds =  74.76 MB/sec
> > > > mainwks:~ # hdparm -tT /dev/md2
> > > >
> > > > /dev/md2:
> > > >  Timing cached reads:   1250 MB in  2.00 seconds = 624.82 MB/sec
> > > >  Timing buffered disk reads:  620 MB in  3.01 seconds = 206.09 MB/sec
> > > >
> > > > - LVM:
> > > >
> > > > mainwks:~ # vgdisplay data
> > > >   Incorrect metadata area header checksum
> > > >   --- Volume group ---
> > > >   VG Name               data
> > > >   System ID
> > > >   Format                lvm2
> > > >   Metadata Areas        1
> > > >   Metadata Sequence No  5
> > > >   VG Access             read/write
> > > >   VG Status             resizable
> > > >   MAX LV                0
> > > >   Cur LV                2
> > > >   Open LV               2
> > > >   Max PV                0
> > > >   Cur PV                1
> > > >   Act PV                1
> > > >   VG Size               1.36 TB
> > > >   PE Size               4.00 MB
> > > >   Total PE              357702
> > > >   Alloc PE / Size       51200 / 200.00 GB
> > > >   Free  PE / Size       306502 / 1.17 TB
> > > >   VG UUID               KpUAeN-mPjO-2K8t-hiLX-FF0C-93R2-IP3aFI
> > > >
> > > > mainwks:~ # pvdisplay /dev/md2
> > > >   Incorrect metadata area header checksum
> > > >   --- Physical volume ---
> > > >   PV Name               /dev/md2
> > > >   VG Name               data
> > > >   PV Size               1.36 TB / not usable 3.75 MB
> > > >   Allocatable           yes
> > > >   PE Size (KByte)       4096
> > > >   Total PE              357702
> > > >   Free PE               306502
> > > >   Allocated PE          51200
> > > >   PV UUID               Axl2c0-RP95-WwO0-inHP-aJEF-6SYJ-Fqhnga
> > > >
> > > > - XFS:
> > > >
> > > > mainwks:~ # xfs_info /dev/data/test
> > > > meta-data=/dev/mapper/data-test  isize=256    agcount=16, agsize=1638400 blks
> > > >          =                       sectsz=512   attr=0
> > > > data     =                       bsize=4096   blocks=26214400, imaxpct=25
> > > >          =                       sunit=16     swidth=48 blks, unwritten=1
> > > > naming   =version 2              bsize=4096
> > > > log      =internal               bsize=4096   blocks=16384, version=1
> > > >          =                       sectsz=512   sunit=0 blks, lazy-count=0
> > > > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > > >
> > > > - The reported dd
> > > > mainwks:~ # dd if=/dev/zero bs=1024k count=100 of=/mnt/custom/t3
> > > > 100+0 records in
> > > > 100+0 records out
> > > > 104857600 bytes (105 MB) copied, 5.11596 s, 20.5 MB/s
> > > >
> > > >
> > > > - New dd (seems to give a better result)
> > > > mainwks:~ # dd if=/dev/zero bs=1024k count=1000 of=/mnt/custom/t0
> > > > 1000+0 records in
> > > > 1000+0 records out
> > > > 1048576000 bytes (1.0 GB) copied, 13.6218 s, 77.0 MB/s
> > > >
> > > > Ciro
> > > >
> > >
> > > I'm not sure why the old and new dd runs were so different.  I do
> > > see the old one only had 5 seconds' worth of data, which is not much
> > > to base a test run on.
> > >
> > > IF you really have 1MB avg. write sizes, you should read
> > > http://oss.sgi.com/archives/xfs/2007-06/msg00411.html for a tuning
> > > sample
> > >
> > > Basically that post recommends:
> > >
> > > chunk size = 256KB
> > > LVM align  = 3x Chunk Size = 768KB  (assumes a 4-disk raid5)
> > >
> > > And tune the XFS  bsize/sunit/swidth  to match.
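> > >
> > > For your 4-disk array (3 data disks, 256KB chunk) that works out to
> > > roughly the following -- untested, and the LV name is only an example:
> > >
> > > # 256KB stripe unit x 3 data disks = 768KB stripe width
> > > mkfs.xfs -d su=256k,sw=3 /dev/data/media
> > >
> > > # verify: with bsize=4096, sunit should report 64 blks and swidth 192 blks
> > > xfs_info /dev/data/media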
> > >
> > > But that all _assumes_ a large data write size.  If you have a more
> > > typical desktop load, then the average write is way below that and you
> > > need to really reduce all of the above (except bsize.  I think 4K
> > > bsize is always best with Linux, but I'm not positive about that.).
> > >
> > > Also, dd is only able to simulate a sequential data stream.  If you
> > > don't have that kind of load, once again you need to reduce the chunk
> > > size.  I think the generically preferred chunk size is 64KB.  With
> > > some database apps, that can drop down to 4KB.
> > >
> > > So really and truly, you need to characterize your workload before you
> > > start tuning.
> > >
> > > OTOH, if you just want bragging rights, test with and tune for a big
> > > average write, but be warned your typical performance will be going
> > > down at the same time that your large write performance is going up.
> > >
> > > Greg
> > > --
> > > Greg Freemyer
> > > Litigation Triage Solutions Specialist
> > > http://www.linkedin.com/in/gregfreemyer
> > > First 99 Days Litigation White Paper -
> > > http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
> > >
> > > The Norcross Group
> > > The Intersection of Evidence & Technology
> > > http://www.norcrossgroup.com
> > >
> > Hi, I found that thread too; the problem is I'm not sure how to tune
> > the LVM alignment. Maybe --stripes & --stripesize at LV creation
> > time? I can't find an option for pvcreate or vgcreate. It will
> > basically be a repository for media files: movies, backups, iso
> > images, etc. For the rest (documents, ebooks and music) I'll create
> > other LVs with Ext3.
> >
> > Regards,
> > Ciro
>
> Ok, I guess you know reads are not significantly impacted by the
> tuning we're talking about.  This is mostly about tuning for raid5
> write performance.

Yep, I know...

>
> Anyway, are you planning to stripe together multiple md raid5 arrays via
> LVM?  I believe that is what --stripes and --stripesize are for.  (ie.
> If you have 8 drives, you could create 2 raid5 arrays, and use LVM to
> interleave them by using --stripes = 2.)  I've never used that
> feature.
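>
> (Purely for illustration -- the LV name and size are made up -- that
> would look something like:
>
> # stripe one LV across two PVs, 256KB per stripe
> lvcreate -i 2 -I 256 -L 200G -n media data
>
> but with a single md device in the VG there is nothing to interleave.)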
>
No, I don't plan to use anything like that.

> You need to worry about the VG extents.  I think vgcreate
> --physicalextentsize is what you need to tune.  I would make each
> extent an exact multiple of the stripe width in size, ie. 768KB * N.
> Maybe use N=10, so -s 7680K.
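>
> ie. something like this (just a sketch; if your LVM version only
> accepts power-of-two extent sizes it will refuse this and you'd have
> to stay with a default that doesn't line up exactly):
>
> vgcreate -s 7680K data /dev/md2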
>
Well, I'm not sure about the PE parameter. As far as I know it doesn't
affect every write operation; a large extent size just speeds up the
allocation process (LV creation/grow), while a small one gives finer
allocation granularity (at the cost of slower LV creation/grow).

> Assuming you're not using LVM stripes, and since this appears to be a
> new setup, I would also use -C or --contiguous to ensure all the data
> is sequential.  It may be overkill, but it will further ensure you
> _avoid_ LV extents that don't end on a stripe boundary.  (A stripe ==
> 3 raid5 chunks for you.)
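>
> For example (the LV name and size are only placeholders):
>
> lvcreate -C y -L 200G -n media data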

Taking note...

>
> Then if you are going to use the snapshot feature, you need to set
> your chunksize efficiently.  If you are only going to have large
> files, then I would use a large LVM snapshot chunksize.  256KB seems
> like a good choice, but I have not benchmarked snapshot chunksizes.
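>
> If you ever do, it would look roughly like this (the snapshot name,
> size and origin LV are made up):
>
> lvcreate -s -c 256k -L 20G -n media-snap /dev/data/media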

I read about that, but I probably won't use snapshots with this VG.

>
> Greg
> --

Thanks,
Ciro
-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
