Disk IO issues

2008-12-31 Thread Mike McGrath
Let's pool some knowledge together, because at this point I'm missing
something.

I've been doing all measurements with sar, since bonnie etc. causes builds to
time out.

Problem: We're seeing slower than normal disk IO.  At least I think we
are.  This is a PERC5/E and MD1000 array.

When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
around 4-6 MBytes/s.

When I do a cp of a large file "cp /mnt/koji/out /tmp/" I get
30-40 MBytes/s.

When I "dd if=/dev/sde of=/dev/null" I get around 60-70 MBytes/s read.

If I "cat /dev/sde > /dev/null" I get between 225-300 MBytes/s read.

The above tests are pretty consistent.  /dev/sde is a raid5 array,
hardware raid.

So my question here is: wtf?  I've been working on a backup, which I
would think would either cause network utilization to max out or disk IO
to max out.  I'm not seeing either.  Sar says the disks are 100% utilized,
but I can cause major increases in actual disk reads and writes just by
running additional commands.  Also, if the disks were 100% utilized I'd
expect we would see lots more iowait.  We're not seeing that, though; iowait
on the box is only 0.06% today.

So, long story short, we're seeing much better performance when just
reading or writing lots of data (though dd is many times slower than cat).
But with our real-world traffic, we're just seeing crappy, crappy IO.

Thoughts, theories or opinions?  Some of the sysadmin noc guys have access
to run diagnostic commands, so if you want more info about a setting, let me
know.

I should also mention there's lots going on with this box: for example, it's
hardware RAID plus LVM, and I've got Xen running on it (though the tests above
were not run in a Xen guest).

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread Xavier Lamien
On Wed, Dec 31, 2008 at 9:42 PM, Mike McGrath  wrote:
> Lets pool some knowledge together because at this point, I'm missing
> something.
>
> I've been doing all measurements with sar as bonnie, etc, causes builds to
> timeout.
>
> Problem: We're seeing slower then normal disk IO.  At least I think we
> are.  This is a PERC5/E and MD1000 array.
>
> When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
> around 4-6MBytes/s
>
> When I do a cp of a large file "cp /mnt/koji/out /tmp/" I get
> 30-40MBytes/s.
>
> Then I "dd if=/dev/sde of=/dev/null" I get around 60-70 MBytes/s read.
>
> If I "cat /dev/sde > /dev/null" I get between 225-300MBytes/s read.
>
> The above tests are pretty consistent.  /dev/sde is a raid5 array,
> hardware raid.
>
> So my question here is, wtf?  I've been working to do a backup which I
> would think would either cause network utilization to max out, or disk io
> to max out.  I'm not seeing either.  Sar says the disks are 100% utilized
> but I can cause major increases in actual disk reads and writes by just
> running additional commands.  Also, if the disks were 100% utilized I'd
> expect we would see lots more iowait.  We're not though, iowait on the box
> is only %0.06 today.
>
> So, long story short, we're seeing much better performance when just
> reading or writing lots of data (though dd is many times slower then cat).
> But with our real-world traffic, we're just seeing crappy crappy IO.
>
> Thoughts, theories or opinions?  Some of the sysadmin noc guys have access
> to run diagnostic commands, if you want more info about a setting, let me
> know.
>
> I should also mention there's lots going on with this box, for example its
> hardware raid, lvm and I've got xen running on it (though the tests above
> were not in a xen guest).
>

Could you perform an hdparm -tT on that disk?
Also, post an strace of your cat & dd commands.

If my memory serves, cat uses mmap(), which is faster than the
read() used by dd.
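
Something along these lines should do it (just a sketch; /dev/sde is the
device from your mail, and the straces only need to run for a few seconds
before being interrupted):

  hdparm -tT /dev/sde
  strace -o /tmp/cat.trace cat /dev/sde > /dev/null      # ^C after a few seconds
  strace -o /tmp/dd.trace  dd if=/dev/sde of=/dev/null   # ^C after a few seconds
  grep -c 'read(' /tmp/cat.trace /tmp/dd.trace           # count the read() calls; the sizes are visible in the traces themselves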

-- 
Xavier.t Lamien
--

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread Sascha Thomas Spreitzer
Hello Mike,

maybe the RAID array lost a disk and is auto-healing in the background.  Is
there a way to determine the RAID state?
Is the RAID controller showing any errors?  Are the physical disks reporting
seek errors?
I suspect a problem with either the hardware buffers of the disks or the RAID
controller, or faulty disk or RAID hardware.

regards,
Sascha

2008/12/31 Mike McGrath :
> Lets pool some knowledge together because at this point, I'm missing
> something.
>
> I've been doing all measurements with sar as bonnie, etc, causes builds to
> timeout.
>
> Problem: We're seeing slower then normal disk IO.  At least I think we
> are.  This is a PERC5/E and MD1000 array.
>
> When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
> around 4-6MBytes/s
>
> When I do a cp of a large file "cp /mnt/koji/out /tmp/" I get
> 30-40MBytes/s.
>
> Then I "dd if=/dev/sde of=/dev/null" I get around 60-70 MBytes/s read.
>
> If I "cat /dev/sde > /dev/null" I get between 225-300MBytes/s read.
>
> The above tests are pretty consistent.  /dev/sde is a raid5 array,
> hardware raid.
>
> So my question here is, wtf?  I've been working to do a backup which I
> would think would either cause network utilization to max out, or disk io
> to max out.  I'm not seeing either.  Sar says the disks are 100% utilized
> but I can cause major increases in actual disk reads and writes by just
> running additional commands.  Also, if the disks were 100% utilized I'd
> expect we would see lots more iowait.  We're not though, iowait on the box
> is only %0.06 today.
>
> So, long story short, we're seeing much better performance when just
> reading or writing lots of data (though dd is many times slower then cat).
> But with our real-world traffic, we're just seeing crappy crappy IO.
>
> Thoughts, theories or opinions?  Some of the sysadmin noc guys have access
> to run diagnostic commands, if you want more info about a setting, let me
> know.
>
> I should also mention there's lots going on with this box, for example its
> hardware raid, lvm and I've got xen running on it (though the tests above
> were not in a xen guest).
>
>-Mike
>
> ___
> Fedora-infrastructure-list mailing list
> Fedora-infrastructure-list@redhat.com
> https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list
>



-- 
Mit freundlichen Grüßen, / with kind regards,
Sascha Thomas Spreitzer
http://spreitzer.name/

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread Ricky Zhou
On 2008-12-31 10:49:56 PM, Xavier Lamien wrote:
> Could you perform an hdparm -tT on that disk ?
/dev/sde:
 Timing cached reads:   2668 MB in  2.00 seconds = 1336.06 MB/sec
 Timing buffered disk reads:  1024 MB in  3.01 seconds = 340.69 MB/sec

> Also, output an strace against your cat & dd commands.
> 
> if my memory is good enough, cat use mmap() which is faster than
> read() (which is used by dd)
I just straced dd and cat, and it looks like cat is using a block size
of 4096 bytes while dd is using 512 bytes.  I *think* they were both
just using read().  Perhaps dd if=/dev/sde of=/dev/null bs=4096 would be
a better command to compare against.
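
Something like this would make the comparison fair, a sketch with the counts
chosen to keep each run to roughly 1 GB:

  dd if=/dev/sde of=/dev/null bs=512  count=2000000   # dd's default read size
  dd if=/dev/sde of=/dev/null bs=4096 count=250000    # the 4 KiB reads cat is doing
  dd if=/dev/sde of=/dev/null bs=1M   count=1000      # much larger requests, for comparison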

Thanks,
Ricky



___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread Corey Chandler

Mike McGrath wrote:
> Lets pool some knowledge together because at this point, I'm missing
> something.
>
> I've been doing all measurements with sar as bonnie, etc, causes builds to
> timeout.
>
> Problem: We're seeing slower then normal disk IO.  At least I think we
> are.  This is a PERC5/E and MD1000 array.


1. Are we sure the array hasn't lost a drive?
2. What's your scheduler set to?  CFQ tends not to work well in many
applications where the deadline scheduler works better...
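
Checking and switching it is cheap enough to try live; roughly (assuming the
array device is still sde):

  cat /sys/block/sde/queue/scheduler              # the active one is shown in brackets
  echo deadline > /sys/block/sde/queue/scheduler
  echo noop > /sys/block/sde/queue/scheduler      # also worth a try; leaves the ordering to the RAID controller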



-- Corey "Jay" Chandler

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread Mike McGrath
On Wed, 31 Dec 2008, Ricky Zhou wrote:

> On 2008-12-31 10:49:56 PM, Xavier Lamien wrote:
> > Could you perform an hdparm -tT on that disk ?
> /dev/sde:
>  Timing cached reads:   2668 MB in  2.00 seconds = 1336.06 MB/sec
>  Timing buffered disk reads:  1024 MB in  3.01 seconds = 340.69 MB/sec
>
> > Also, output an strace against your cat & dd commands.
> >
> > if my memory is good enough, cat use mmap() which is faster than
> > read() (which is used by dd)
> I just straced dd and cat, and it looks like cat is using a block size
> of 4096 bytes while dd is using 512 bytes.  I *think* they were both
> just using read().  Perhaps dd if=/dev/sde of=/dev/null bs=4096 would be
> a better command to compare against.
>

Ok, that explains the difference between dd and cat.  Now why is the rest
of it so bad.

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread Stephen John Smoogen
On Wed, Dec 31, 2008 at 1:42 PM, Mike McGrath  wrote:
> Lets pool some knowledge together because at this point, I'm missing
> something.
>
> I've been doing all measurements with sar as bonnie, etc, causes builds to
> timeout.
>
> Problem: We're seeing slower then normal disk IO.  At least I think we
> are.  This is a PERC5/E and MD1000 array.
>
> When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
> around 4-6MBytes/s
>
> When I do a cp of a large file "cp /mnt/koji/out /tmp/" I get
> 30-40MBytes/s.
>
> Then I "dd if=/dev/sde of=/dev/null" I get around 60-70 MBytes/s read.
>
> If I "cat /dev/sde > /dev/null" I get between 225-300MBytes/s read.

I thought the /dev/null numbers were not good indicators.  (I remember
Stephen Tweedie or someone on the RH kernel team lecturing that those
numbers, while consistent, would not show real-world issues and could be
much higher than what really happens.)  The lesson was always to send it
to a real file that it's going to open/close/deal with, even if the
file is in a ram disk.

I do know that dd defaults to a 512-byte block size, which gives it different
speeds for copies (whoops, Ricky confirms this).  Also, what will
fit inside the PERC cache, and how the journal is going to be
written/committed, will make a difference...

The next difference is how the system sees the disk versus how the
disk sees itself.  The /dev/xxx numbers are always going to be much higher
because there is no filesystem interaction and the controller is just
pulling from hardware; it might even optimize doing that
(raw-partition style) so that you get insane speeds, but as soon as you
put a filesystem on it, poof.





-- 
Stephen J Smoogen. -- BSD/GNU/Linux
How far that little candle throws his beams! So shines a good deed
in a naughty world. = Shakespeare. "The Merchant of Venice"

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread Mike McGrath
On Wed, 31 Dec 2008, Corey Chandler wrote:

> Mike McGrath wrote:
> > Lets pool some knowledge together because at this point, I'm missing
> > something.
> >
> > I've been doing all measurements with sar as bonnie, etc, causes builds to
> > timeout.
> >
> > Problem: We're seeing slower then normal disk IO.  At least I think we
> > are.  This is a PERC5/E and MD1000 array.
> >
>
> 1. Are we sure the array hasn't lost a drive?

I can't physically look at the drive (they're a couple hundred miles away)
but we've seen no reports of it (via the drac anyway).  I'll have to get
the raid software on there to be sure.  I'd think a degraded raid
array would affect both direct block access and file level access.

> 2. What's your scheduler set to?  CFQ tends to not work in many applications
> where the deadline scheduler works better...
>

I'd tried other schedulers earlier but they didn't seem to make much of a
difference.  Even so, I'll get deadline set up and take a look.

At least we've got the dd and cat problem figured out.  Now to figure out
why there's such a discrepancy between file level reads and block level
reads.  Anyone else have an array of this type and size to run those tests
on?  I'd be curious to see what others are getting.

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread Greg Swift
On Wed, Dec 31, 2008 at 17:35, Mike McGrath  wrote:

> On Wed, 31 Dec 2008, Corey Chandler wrote:
>
> > Mike McGrath wrote:
> > > Lets pool some knowledge together because at this point, I'm missing
> > > something.
> > >
> > > I've been doing all measurements with sar as bonnie, etc, causes builds
> to
> > > timeout.
> > >
> > > Problem: We're seeing slower then normal disk IO.  At least I think we
> > > are.  This is a PERC5/E and MD1000 array.
> > >
> >
> > 1. Are we sure the array hasn't lost a drive?
>
> I can't physically look at the drive (they're a couple hundred miles away)
> but we've seen no reports of it (via the drac anyway).  I'll have to get
> the raid software on there to be for sure.  I'd think a degraded raid
> array would affect both direct block access and file level access.
>
> > 2. What's your scheduler set to?  CFQ tends to not work in many
> applications
> > where the deadline scheduler works better...
> >
>
> I'd tried other schedulers earlier but they didn't seem to make much of a
> difference.  Even still, I'll get dealine setup and take a look.
>
> At least we've got the dd and cat problem figured out.  Now to figure out
> why there's such a discrepancy between file level reads and block level
> reads.  Anyone else have an array of this type and size to run those tests
> on?  I'd be curious to see what others are getting.
>

We are working on a RHEL3 to RHEL5 migration at my job.  We have two primary
filesystems: one is large database files and the other is lots of small
documents.  As we were testing backup software for RHEL5 we noticed a 60%
decrease in speed moving from RHEL3 to RHEL5 with the same file system, but
only on the document filesystem; the db filesystem was perfectly snappy.

After a lot of troubleshooting it was deemed to be related to the dir_index
btree hash.  The path was too long before there was a difference in the names
of the files, making the index incredibly slow.  Removing dir_index
recovered a bit of the difference, but didn't resolve the issue.  A quick
rename of one of the base directories recovered almost the entire 60%.
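
For reference, checking for the same thing on your end would look roughly like
this (the e2fsck run wants the filesystem unmounted, or a snapshot):

  tune2fs -l /dev/sde | grep -i features   # is dir_index in the list?
  tune2fs -O ^dir_index /dev/sde           # turns it off, if it ever comes to that
  e2fsck -fD /dev/sde                      # rebuilds/optimizes the directory indexes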

Thought I'd at least throw it out there; although I'm not sure that it is
the exact issue, it doesn't hurt to have it floating in the background.

-greg/xaeth
___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread Sascha Thomas Spreitzer
If it's related to the FS driver (inode table or algorithms), the
program "slabtop" might give an indication of which kernel slab caches
are eating system resources.
Slabtop is in the procps suite and should be on any major Linux distribution.
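
A one-shot dump is enough to see where the memory is going; something like:

  slabtop -o | head -n 20                              # single snapshot, sorted by object count
  grep -E 'ext3_inode_cache|dentry' /proc/slabinfo     # or just the interesting lines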

2009/1/1 Greg Swift :
> On Wed, Dec 31, 2008 at 17:35, Mike McGrath  wrote:
>>
>> On Wed, 31 Dec 2008, Corey Chandler wrote:
>>
>> > Mike McGrath wrote:
>> > > Lets pool some knowledge together because at this point, I'm missing
>> > > something.
>> > >
>> > > I've been doing all measurements with sar as bonnie, etc, causes
>> > > builds to
>> > > timeout.
>> > >
>> > > Problem: We're seeing slower then normal disk IO.  At least I think we
>> > > are.  This is a PERC5/E and MD1000 array.
>> > >
>> >
>> > 1. Are we sure the array hasn't lost a drive?
>>
>> I can't physically look at the drive (they're a couple hundred miles away)
>> but we've seen no reports of it (via the drac anyway).  I'll have to get
>> the raid software on there to be for sure.  I'd think a degraded raid
>> array would affect both direct block access and file level access.
>>
>> > 2. What's your scheduler set to?  CFQ tends to not work in many
>> > applications
>> > where the deadline scheduler works better...
>> >
>>
>> I'd tried other schedulers earlier but they didn't seem to make much of a
>> difference.  Even still, I'll get dealine setup and take a look.
>>
>> At least we've got the dd and cat problem figured out.  Now to figure out
>> why there's such a discrepancy between file level reads and block level
>> reads.  Anyone else have an array of this type and size to run those tests
>> on?  I'd be curious to see what others are getting.
>
> we are working on a rhel3 to 5 migration at my job.  We have 2 primary
> filesystems.  one is large database files and the other is lots of small
> documents.  As we were testing backup software for rhel5 we noticed a 60%
> decrease in speed moving from rhel3 to rhel5 with the same file system, but
> only on the document filesystem, the db file system was perfectly snappy.
>
> After a lot of troubleshooting it was deemed to be related to the dir_index
> btree hash.  The path was to long before there was a difference in the names
> of the files, making the index incredibly slow.  Removing dir_index
> recovered a bit of the difference, but didn't resolve the issue.  A quick
> rename of one of the base directories recovered almost the entire 60%.
>
> Thought I'd at least throw it out there, although I'm not sure that it is
> the exact issue, it doesn't hurt to have it floating in the background.
>
> -greg/xaeth
>
> ___
> Fedora-infrastructure-list mailing list
> Fedora-infrastructure-list@redhat.com
> https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list
>
>



-- 
Mit freundlichen Grüßen, / with kind regards,
Sascha Thomas Spreitzer
http://spreitzer.name/

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread Mike McGrath
On Thu, 1 Jan 2009, Sascha Thomas Spreitzer wrote:

> If its related to the FS driver ( inode table or algorithms ) the
> program "slabtop" might give an indication of the kernel processes
> eating system performance.
> Slabtop is in the ps-tools suite, should be on any major linux distribution.
>

Interesting, never used slabtop.  I'm not quite sure what I'm looking for
but I'll read up on it.

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread Mike McGrath
On Wed, 31 Dec 2008, Greg Swift wrote:

> On Wed, Dec 31, 2008 at 17:35, Mike McGrath  wrote:
>   On Wed, 31 Dec 2008, Corey Chandler wrote:
>
>   > Mike McGrath wrote:
>   > > Lets pool some knowledge together because at this point, I'm missing
>   > > something.
>   > >
>   > > I've been doing all measurements with sar as bonnie, etc, causes 
> builds to
>   > > timeout.
>   > >
>   > > Problem: We're seeing slower then normal disk IO.  At least I think 
> we
>   > > are.  This is a PERC5/E and MD1000 array.
>   > >
>   >
>   > 1. Are we sure the array hasn't lost a drive?
>
> I can't physically look at the drive (they're a couple hundred miles away)
> but we've seen no reports of it (via the drac anyway).  I'll have to get
> the raid software on there to be for sure.  I'd think a degraded raid
> array would affect both direct block access and file level access.
>
> > 2. What's your scheduler set to?  CFQ tends to not work in many applications
> > where the deadline scheduler works better...
> >
>
> I'd tried other schedulers earlier but they didn't seem to make much of a
> difference.  Even still, I'll get dealine setup and take a look.
>
> At least we've got the dd and cat problem figured out.  Now to figure out
> why there's such a discrepancy between file level reads and block level
> reads.  Anyone else have an array of this type and size to run those tests
> on?  I'd be curious to see what others are getting.
>
>
> we are working on a rhel3 to 5 migration at my job.  We have 2 primary 
> filesystems.  one is large database files and the
> other is lots of small documents.  As we were testing backup software for 
> rhel5 we noticed a 60% decrease in speed moving
> from rhel3 to rhel5 with the same file system, but only on the document 
> filesystem, the db file system was perfectly
> snappy.
>

Our files are some smaller logs, but mostly rpms.

> After a lot of troubleshooting it was deemed to be related to the dir_index 
> btree hash.  The path was to long before
> there was a difference in the names of the files, making the index incredibly 
> slow.  Removing dir_index recovered a bit
> of the difference, but didn't resolve the issue.  A quick rename of one of 
> the base directories recovered almost the
> entire 60%.
>

I'd be curious to hear more about this.  How long was your path?  Our
paths aren't short but I don't think they'd be approaching any limits.
For example:

/mnt/koji/packages/nagios/3.0.5/1.fc11/x86_64/nagios-3.0.5-1.fc11.x86_64.rpm

> Thought I'd at least throw it out there, although I'm not sure that it is the 
> exact issue, it doesn't hurt to have it
> floating in the background.
>

thanks.

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread Mike McGrath
On Wed, 31 Dec 2008, Mike McGrath wrote:

> Lets pool some knowledge together because at this point, I'm missing
> something.
>
> I've been doing all measurements with sar as bonnie, etc, causes builds to
> timeout.
>
> Problem: We're seeing slower then normal disk IO.  At least I think we
> are.  This is a PERC5/E and MD1000 array.
>
> When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
> around 4-6MBytes/s
>
> When I do a cp of a large file "cp /mnt/koji/out /tmp/" I get
> 30-40MBytes/s.
>
> Then I "dd if=/dev/sde of=/dev/null" I get around 60-70 MBytes/s read.
>
> If I "cat /dev/sde > /dev/null" I get between 225-300MBytes/s read.
>
> The above tests are pretty consistent.  /dev/sde is a raid5 array,
> hardware raid.
>
> So my question here is, wtf?  I've been working to do a backup which I
> would think would either cause network utilization to max out, or disk io
> to max out.  I'm not seeing either.  Sar says the disks are 100% utilized
> but I can cause major increases in actual disk reads and writes by just
> running additional commands.  Also, if the disks were 100% utilized I'd
> expect we would see lots more iowait.  We're not though, iowait on the box
> is only %0.06 today.
>
> So, long story short, we're seeing much better performance when just
> reading or writing lots of data (though dd is many times slower then cat).
> But with our real-world traffic, we're just seeing crappy crappy IO.
>
> Thoughts, theories or opinions?  Some of the sysadmin noc guys have access
> to run diagnostic commands, if you want more info about a setting, let me
> know.
>
> I should also mention there's lots going on with this box, for example its
> hardware raid, lvm and I've got xen running on it (though the tests above
> were not in a xen guest).
>

Also for the curious:

dumpe2fs 1.39 (29-May-2006)
Filesystem volume name:   
Last mounted on:  
Filesystem magic number:  0xEF53
Filesystem revision #:1 (dynamic)
Filesystem features:  has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Default mount options:(none)
Filesystem state: clean
Errors behavior:  Continue
Filesystem OS type:   Linux
Inode count:  1342177280
Block count:  2684354560
Reserved block count: 134217728
Free blocks:  1407579323
Free inodes:  1336866363
First block:  0
Block size:   4096
Fragment size:4096
Reserved GDT blocks:  384
Blocks per group: 32768
Fragments per group:  32768
Inodes per group: 16384
Inode blocks per group:   512
Filesystem created:   Thu Jan 17 14:52:03 2008
Last mount time:  Fri Dec  5 18:51:44 2008
Last write time:  Fri Dec  5 18:51:44 2008
Mount count:  17
Maximum mount count:  24
Last checked: Sat May 24 03:14:41 2008
Check interval:   15552000 (6 months)
Next check after: Thu Nov 20 03:14:41 2008
Reserved blocks uid:  0 (user root)
Reserved blocks gid:  0 (group root)
First inode:  11
Inode size:   128
Journal inode:8
Default directory hash:   tea
Directory Hash Seed:  1b6393b1-472c-4005-ae87-9603eea9f45b
Journal backup:   inode blocks
Journal size: 128M

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread James Antill
On Wed, 2008-12-31 at 14:42 -0600, Mike McGrath wrote:
> Lets pool some knowledge together because at this point, I'm missing
> something.
> 
> I've been doing all measurements with sar as bonnie, etc, causes builds to
> timeout.
> 
> Problem: We're seeing slower then normal disk IO.  At least I think we
> are.  This is a PERC5/E and MD1000 array.
> 
> When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
> around 4-6MBytes/s

 This _might_ not be "IO" in a normal sense, -a to cp means:

 file data + file inode + ACLs + selinux + xattrs [+ file capabilities]

...esp. given that you aren't getting large IOWait times, you might want
to strace -T the cp and do some perl/whatever on the result to see what
is eating up the time.
 This is a straight 5.2, yeh?
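
 Roughly, that would be something like this (the awk is only a rough cut at
summing the -T timings per syscall, and the nagios directory is just one
example path from the box):

  strace -c -o /tmp/cp.summary cp -a /mnt/koji/packages/nagios /tmp/cp-test    # per-syscall time totals
  strace -T -o /tmp/cp.trace   cp -a /mnt/koji/packages/nagios /tmp/cp-test2   # per-call timings
  awk -F'[<>]' 'NF > 2 { name = $1; sub(/\(.*/, "", name); t[name] += $(NF-1) }
                END { for (s in t) printf "%10.6f  %s\n", t[s], s }' /tmp/cp.trace | sort -rn | head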

-- 
James Antill 
Fedora

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2008-12-31 Thread Mike McGrath
On Thu, 1 Jan 2009, James Antill wrote:

> On Wed, 2008-12-31 at 14:42 -0600, Mike McGrath wrote:
> > Lets pool some knowledge together because at this point, I'm missing
> > something.
> >
> > I've been doing all measurements with sar as bonnie, etc, causes builds to
> > timeout.
> >
> > Problem: We're seeing slower then normal disk IO.  At least I think we
> > are.  This is a PERC5/E and MD1000 array.
> >
> > When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
> > around 4-6MBytes/s
>
>  This _might_ not be "IO" in a normal sense, -a to cp means:
>
>  file data + file inode + ACLs + selinux + xattrs [+ file capabilities]
>
> ...esp. given that you aren't getting large IOWait times, you might want
> to strace -T the cp and do some perl/whatever on the result to see what
> is eating up the time.

Even with non-cp type things (like a bacula backup) it just doesn't seem
as fast as I would expect it to be.  I've never actually done trending at
this level / scale on a filesystem / drive before, so I really don't have
a good baseline except that it just seems slow to me.

Other than the much faster direct block access and the large file reads, I
don't have much else to go on that makes me think it's slow.

>  This is a straight 5.2, yeh?
>

Correct.

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-01 Thread Kostas Georgiou
On Thu, Jan 01, 2009 at 01:17:38AM -0600, Mike McGrath wrote:

> On Thu, 1 Jan 2009, James Antill wrote:
> 
> > On Wed, 2008-12-31 at 14:42 -0600, Mike McGrath wrote:
> > > Lets pool some knowledge together because at this point, I'm missing
> > > something.
> > >
> > > I've been doing all measurements with sar as bonnie, etc, causes builds to
> > > timeout.
> > >
> > > Problem: We're seeing slower then normal disk IO.  At least I think we
> > > are.  This is a PERC5/E and MD1000 array.
> > >
> > > When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
> > > around 4-6MBytes/s
> >
> >  This _might_ not be "IO" in a normal sense, -a to cp means:
> >
> >  file data + file inode + ACLs + selinux + xattrs [+ file capabilities]
> >
> > ...esp. given that you aren't getting large IOWait times, you might want
> > to strace -T the cp and do some perl/whatever on the result to see what
> > is eating up the time.
> 
> Even with non cp type things (like a bacula backup) it just doesn't seem
> as fast as I would expect it to be.  I've never actually done trending at
> this level / scale on a filesystem / drive before.  So I really don't have
> a good baseline except that it just seems slow to me.
> 
> Other then the much faster direct block access and the large file reads, I
> don't have much else to go on that makes me think its slow.

Do writes show the same pattern?  If you use selinux/ACLs/xattrs, the default
inode size of 128 bytes can cause slowdowns (#205161, for example).

Can you run blktrace+seekwatcher (both in EPEL) to get an idea of
what is going on?  An "iostat -x -k /dev/sde 1" output would also be
helpful.
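
If it helps, the capture would be something like this (a sketch; I think these
are the right seekwatcher flags, and 60 seconds of trace during a slow copy
should be plenty):

  iostat -x -k /dev/sde 1
  blktrace -d /dev/sde -o koji-trace -w 60
  seekwatcher -t koji-trace -o koji-trace.png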

Kostas

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-01 Thread Mike McGrath
On Thu, 1 Jan 2009, Kostas Georgiou wrote:

> On Thu, Jan 01, 2009 at 01:17:38AM -0600, Mike McGrath wrote:
>
> > On Thu, 1 Jan 2009, James Antill wrote:
> >
> > > On Wed, 2008-12-31 at 14:42 -0600, Mike McGrath wrote:
> > > > Lets pool some knowledge together because at this point, I'm missing
> > > > something.
> > > >
> > > > I've been doing all measurements with sar as bonnie, etc, causes builds 
> > > > to
> > > > timeout.
> > > >
> > > > Problem: We're seeing slower then normal disk IO.  At least I think we
> > > > are.  This is a PERC5/E and MD1000 array.
> > > >
> > > > When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
> > > > around 4-6MBytes/s
> > >
> > >  This _might_ not be "IO" in a normal sense, -a to cp means:
> > >
> > >  file data + file inode + ACLs + selinux + xattrs [+ file capabilities]
> > >
> > > ...esp. given that you aren't getting large IOWait times, you might want
> > > to strace -T the cp and do some perl/whatever on the result to see what
> > > is eating up the time.
> >
> > Even with non cp type things (like a bacula backup) it just doesn't seem
> > as fast as I would expect it to be.  I've never actually done trending at
> > this level / scale on a filesystem / drive before.  So I really don't have
> > a good baseline except that it just seems slow to me.
> >
> > Other then the much faster direct block access and the large file reads, I
> > don't have much else to go on that makes me think its slow.
>
> Do writes show the same pattern? If you use selinux/ACLs/xattrs the default
> inode size of 128 can cause slowdowns (#205161 for example).
>

One reason I'm trying to ramp this up now is because the koji share is
still under 50% utilized.  If it turns out to be something in the
filesystem, it's not too late for us to shrink the main filesystem, create
a new one, copy, and grow the new one.

> Can you run blktrace+seekwatcher (both in EPEL) to get an idea on
> what is going on? An iostat -x -k /dev/sde 1 output will also be
> helpfull.
>

I'll take a look at those two applications as well; here's the iostat:

Linux 2.6.18-92.1.18.el5xen (xen2.fedora.phx.redhat.com)   01/01/2009

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.55    0.01    1.35    0.10    6.28   91.71

Device:    rrqm/s  wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sde       1389.22   95.13  161.74  270.46  6693.75  1670.16    38.70     1.09    2.51   1.48  64.04


-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-01 Thread Stephen John Smoogen
On Thu, Jan 1, 2009 at 12:17 AM, Mike McGrath  wrote:
> On Thu, 1 Jan 2009, James Antill wrote:
>
>> On Wed, 2008-12-31 at 14:42 -0600, Mike McGrath wrote:
>> > Lets pool some knowledge together because at this point, I'm missing
>> > something.
>> >
>> > I've been doing all measurements with sar as bonnie, etc, causes builds to
>> > timeout.
>> >
>> > Problem: We're seeing slower then normal disk IO.  At least I think we
>> > are.  This is a PERC5/E and MD1000 array.
>> >
>> > When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
>> > around 4-6MBytes/s
>>
>>  This _might_ not be "IO" in a normal sense, -a to cp means:
>>
>>  file data + file inode + ACLs + selinux + xattrs [+ file capabilities]
>>
>> ...esp. given that you aren't getting large IOWait times, you might want
>> to strace -T the cp and do some perl/whatever on the result to see what
>> is eating up the time.
>
> Even with non cp type things (like a bacula backup) it just doesn't seem
> as fast as I would expect it to be.  I've never actually done trending at
> this level / scale on a filesystem / drive before.  So I really don't have
> a good baseline except that it just seems slow to me.

Well, bacula should be doing the same thing as a cp in that it needs to
log all those things (ACLs, selinux, xattrs, mother's maiden name, etc.).
Normally I have found that the bigger the disk, the slower the copies
on journaled file systems.  I don't currently have anything as big as
you have (this is over a TB, correct?), but the speed fixes used to be
changing block sizes and journal parameters to allow for better throughput
(oh, and turning off certain hardware parameters in the raid controller
to allow for writethroughs there).


> Other then the much faster direct block access and the large file reads, I
> don't have much else to go on that makes me think its slow.
>
>>  This is a straight 5.2, yeh?
>>
>
> Correct.
>
>-Mike
>
> ___
> Fedora-infrastructure-list mailing list
> Fedora-infrastructure-list@redhat.com
> https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list
>



-- 
Stephen J Smoogen. -- BSD/GNU/Linux
How far that little candle throws his beams! So shines a good deed
in a naughty world. = Shakespeare. "The Merchant of Venice"

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-01 Thread Mike McGrath
On Wed, 31 Dec 2008, Sascha Thomas Spreitzer wrote:

> Hello Mike,
>
> maybe the RAID mirror failed and is auto healing in background. Is
> there a way to determine the RAID state?

It dawns on me I never answered these questions.  The raid array is fine;
it's got 14 drives in a raid5 configuration and one hot spare (I double-checked
this just now).

> Is the RAID controller showing any errors? Physical disks reporting seek 
> errors?
> I assume a problem with either hardware buffers of disk or RAID
> controller or faulty disk or RAID hardware.
>

No errors that I could find on the drives or the controller.

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-01 Thread Jon Stanley
On Thu, Jan 1, 2009 at 7:17 AM, Kostas Georgiou
 wrote:

> Can you run blktrace+seekwatcher (both in EPEL) to get an idea on
> what is going on? An iostat -x -k /dev/sde 1 output will also be
> helpfull.

Here's the slabinfo that someone else requested, and the iostat.  I don't
have access to the xen dom0, though I don't suspect it'd show much
different:

I put it up on a webserver since gmail loves to chop up my lines and
make something like this unusable.  See
http://palladium.jds2001.org/pub/nfs1-stats.txt

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-01 Thread Mike McGrath
On Thu, 1 Jan 2009, Kostas Georgiou wrote:

> On Thu, Jan 01, 2009 at 01:17:38AM -0600, Mike McGrath wrote:
>
> > On Thu, 1 Jan 2009, James Antill wrote:
> >
> > > On Wed, 2008-12-31 at 14:42 -0600, Mike McGrath wrote:
> > > > Lets pool some knowledge together because at this point, I'm missing
> > > > something.
> > > >
> > > > I've been doing all measurements with sar as bonnie, etc, causes builds 
> > > > to
> > > > timeout.
> > > >
> > > > Problem: We're seeing slower then normal disk IO.  At least I think we
> > > > are.  This is a PERC5/E and MD1000 array.
> > > >
> > > > When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
> > > > around 4-6MBytes/s
> > >
> > >  This _might_ not be "IO" in a normal sense, -a to cp means:
> > >
> > >  file data + file inode + ACLs + selinux + xattrs [+ file capabilities]
> > >
> > > ...esp. given that you aren't getting large IOWait times, you might want
> > > to strace -T the cp and do some perl/whatever on the result to see what
> > > is eating up the time.
> >
> > Even with non cp type things (like a bacula backup) it just doesn't seem
> > as fast as I would expect it to be.  I've never actually done trending at
> > this level / scale on a filesystem / drive before.  So I really don't have
> > a good baseline except that it just seems slow to me.
> >
> > Other then the much faster direct block access and the large file reads, I
> > don't have much else to go on that makes me think its slow.
>
> Do writes show the same pattern? If you use selinux/ACLs/xattrs the default
> inode size of 128 can cause slowdowns (#205161 for example).
>
> Can you run blktrace+seekwatcher (both in EPEL) to get an idea on
> what is going on? An iostat -x -k /dev/sde 1 output will also be
> helpfull.
>

Here's a seekwatcher of a find I ran:

http://mmcgrath.fedorapeople.org/find2.png

I had to kill it, I'll have a more full run soon.  Doing some other tests
now.

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-02 Thread Sascha Thomas Spreitzer
Hello again,

this line looks suspicious to me:

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ext3_inode_cache   98472 150260    760    5    1 : tunables   54   27    8 : slabdata  30052  30052    189

Is it one big filesystem with about 1,342,177,280 inodes?  Has this
amount ever been tested in the wild?
The filesystem is, btw, marked as needs_recovery.

regards,
Sascha

2009/1/2 Jon Stanley :
> On Thu, Jan 1, 2009 at 7:17 AM, Kostas Georgiou
>  wrote:
>
>> Can you run blktrace+seekwatcher (both in EPEL) to get an idea on
>> what is going on? An iostat -x -k /dev/sde 1 output will also be
>> helpfull.
>
> Here's a slabinfo that someone else requested and the iostat.  I don't
> have access to the xen dom0 though, but I don't suspect it'd show much
> different:
>
> I put it up on a webserver since gmail loves to chop up my lines and
> make something like this unusable.  See
> http://palladium.jds2001.org/pub/nfs1-stats.txt
>
> ___
> Fedora-infrastructure-list mailing list
> Fedora-infrastructure-list@redhat.com
> https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list
>



-- 
Mit freundlichen Grüßen, / with kind regards,
Sascha Thomas Spreitzer
http://spreitzer.name/

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-02 Thread Michael Schwendt
On Fri, 2 Jan 2009 09:38:43 +0100, Sascha wrote:

> The Filesystem is btw. marked as needs_recovery.

Which can be harmless, because it is a feature flag that is also
set if dumpe2fs is run on a mounted fs. It means that there are blocks
that still need to be committed, which is pretty normal for a mounted
active fs.

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-02 Thread Mike McGrath
On Fri, 2 Jan 2009, Sascha Thomas Spreitzer wrote:

> Hello again,
>
> this line looks suspicious to me:
>
> # name   
>  : tunables:
> slabdata   
> ext3_inode_cache   98472 15026076051 : tunables   54   27
>   8 : slabdata  30052  30052189
>
> Is it 1 big filesystem with about 1,342,177,280 inodes. Has this
> amount ever be tested in the wild?

Not sure if it has been tested in the wild or not but the filesystem
itself contains a _TON_ of hardlinks.  Creation of hardlinks is one of the
big purposes of this filesystem.

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-02 Thread Mike McGrath
On Fri, 2 Jan 2009, Mike McGrath wrote:

> On Fri, 2 Jan 2009, Sascha Thomas Spreitzer wrote:
>
> > Hello again,
> >
> > this line looks suspicious to me:
> >
> > # name   
> >  : tunables:
> > slabdata   
> > ext3_inode_cache   98472 15026076051 : tunables   54   27
> >   8 : slabdata  30052  30052189
> >
> > Is it 1 big filesystem with about 1,342,177,280 inodes. Has this
> > amount ever be tested in the wild?
>
> Not sure if it has been tested in the wild or not but the filesystem
> itself contains a _TON_ of hardlinks.  Creation of hardlinks is one of the
> big purposes of this filesystem.
>

Just as a side note, this is the real problem I'm trying to fix:

  Elapsed time:   17 days 15 hours 8 mins 7 secs
  Priority:   10
  FD Files Written:   9,284,599
  SD Files Written:   9,284,599
  FD Bytes Written:   4,890,877,712,334 (4.890 TB)
  SD Bytes Written:   4,892,855,186,414 (4.892 TB)
  Rate:   3210.7 KB/s


-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-02 Thread Stephen John Smoogen
On Fri, Jan 2, 2009 at 10:57 AM, Mike McGrath  wrote:
> On Fri, 2 Jan 2009, Sascha Thomas Spreitzer wrote:
>
>> Hello again,
>>
>> this line looks suspicious to me:
>>
>> # name   
>>  : tunables:
>> slabdata   
>> ext3_inode_cache   98472 15026076051 : tunables   54   27
>>   8 : slabdata  30052  30052189
>>
>> Is it 1 big filesystem with about 1,342,177,280 inodes. Has this
>> amount ever be tested in the wild?
>
> Not sure if it has been tested in the wild or not but the filesystem
> itself contains a _TON_ of hardlinks.  Creation of hardlinks is one of the
> big purposes of this filesystem.
>

Well, then my idea of making smaller filesystems would break that...
hmmm, I would say that it's time to escalate this to Level 2
support :). What do the filesystem kernel people think?  I would bring
them in to see if there is something we are missing.  Maybe something
about dealing with that many inodes per filesystem is causing a problem (or
maybe this is just known behaviour for large filesystems).  By the way,
this is a 64-bit OS, correct?


-- 
Stephen J Smoogen. -- BSD/GNU/Linux
How far that little candle throws his beams! So shines a good deed
in a naughty world. = Shakespeare. "The Merchant of Venice"

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-02 Thread James Antill
On Fri, 2009-01-02 at 11:57 -0600, Mike McGrath wrote:
> On Fri, 2 Jan 2009, Sascha Thomas Spreitzer wrote:
> 
> > Hello again,
> >
> > this line looks suspicious to me:
> >
> > # name   
> >  : tunables:
> > slabdata   
> > ext3_inode_cache   98472 15026076051 : tunables   54   27
> >   8 : slabdata  30052  30052189
> >
> > Is it 1 big filesystem with about 1,342,177,280 inodes. Has this
> > amount ever be tested in the wild?
> 
> Not sure if it has been tested in the wild or not but the filesystem
> itself contains a _TON_ of hardlinks.  Creation of hardlinks is one of the
> big purposes of this filesystem.

 Ah ha ... I bet that you'll find tar/cp-a/whatever is having a major
problem keeping tabs on which inodes it's "seen", so it doesn't copy the
same data N times. Try running: cp -a --no-preserve=links ... and see if
that is much faster?

-- 
James Antill 
Fedora

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-02 Thread Mike McGrath
On Fri, 2 Jan 2009, James Antill wrote:

> On Fri, 2009-01-02 at 11:57 -0600, Mike McGrath wrote:
> > On Fri, 2 Jan 2009, Sascha Thomas Spreitzer wrote:
> >
> > > Hello again,
> > >
> > > this line looks suspicious to me:
> > >
> > > # name   
> > >  : tunables:
> > > slabdata   
> > > ext3_inode_cache   98472 15026076051 : tunables   54   27
> > >   8 : slabdata  30052  30052189
> > >
> > > Is it 1 big filesystem with about 1,342,177,280 inodes. Has this
> > > amount ever be tested in the wild?
> >
> > Not sure if it has been tested in the wild or not but the filesystem
> > itself contains a _TON_ of hardlinks.  Creation of hardlinks is one of the
> > big purposes of this filesystem.
>
>  Ah ha ... I bet that you'll find tar/cp-a/whatever is having a major
> problem keeping tabs on which inodes it's "seen", so it doesn't copy the
> same data N times. Try running: cp -a --no-preserve=links ... and see if
> that is much faster?
>

Naw, I've been testing on the non-link portions.  Dennis, Jesse, etc,
correct me if I'm wrong on this:

We've got a dir /mnt/koji/packages/ that contains all of the packages.
You can actually view this dir yourself at:

http://kojipkgs.fedoraproject.org/packages/glibc/

There are other directories at /mnt/koji/static-repos/.  A directory like
static-repos contains almost exclusively hardlinks to those packages.

Since many of those hardlink oriented directories can be recreated, we
don't bother backing them up so I haven't been testing with them.

One thing I'm going to try to do is re-index the filesystem (e2fsck -D).
I figure it's a worthwhile thing to do.  I'm testing on a snapshot first.
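
For the curious, the snapshot dance looks roughly like this (the volume group
and LV names here are placeholders, not our real ones):

  lvcreate -s -L 50G -n koji-snap /dev/VolGroup00/koji   # snapshot of the koji LV
  e2fsck -fD /dev/VolGroup00/koji-snap                   # the directory re-index happens here; time it
  lvremove /dev/VolGroup00/koji-snap                     # throw the snapshot away afterwards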

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-02 Thread Mike McGrath
On Fri, 2 Jan 2009, Stephen John Smoogen wrote:

> On Fri, Jan 2, 2009 at 10:57 AM, Mike McGrath  wrote:
> > On Fri, 2 Jan 2009, Sascha Thomas Spreitzer wrote:
> >
> >> Hello again,
> >>
> >> this line looks suspicious to me:
> >>
> >> # name   
> >>  : tunables:
> >> slabdata   
> >> ext3_inode_cache   98472 15026076051 : tunables   54   27
> >>   8 : slabdata  30052  30052189
> >>
> >> Is it 1 big filesystem with about 1,342,177,280 inodes. Has this
> >> amount ever be tested in the wild?
> >
> > Not sure if it has been tested in the wild or not but the filesystem
> > itself contains a _TON_ of hardlinks.  Creation of hardlinks is one of the
> > big purposes of this filesystem.
> >
>
> Well then my idea of making smaller filesystems would break that
> then... hmmm I would say that its time to escalate this to Level 2
> support :). What do the filesystem kernel people think? I would bring
> them in to see if there is something we are missing. Maybe something
> in the dealing with that many inodes per file is causing a problem (or
> maybe this is just known behaviour for large filesystems.) By the way,
> this is a 64 bit OS correct?
>

Correct, 64 bit OS.  I'm going to get some of our FS guys on the horn as
soon as RH is back to work.  I think most of them will return on Monday.

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-02 Thread Stephen John Smoogen
On Fri, Jan 2, 2009 at 12:29 PM, Mike McGrath  wrote:
> On Fri, 2 Jan 2009, Stephen John Smoogen wrote:
>
>> On Fri, Jan 2, 2009 at 10:57 AM, Mike McGrath  wrote:
>> > On Fri, 2 Jan 2009, Sascha Thomas Spreitzer wrote:
>> >
>> >> Hello again,
>> >>
>> >> this line looks suspicious to me:
>> >>
>> >> # name   
>> >>  : tunables:
>> >> slabdata   
>> >> ext3_inode_cache   98472 15026076051 : tunables   54   27
>> >>   8 : slabdata  30052  30052189
>> >>
>> >> Is it 1 big filesystem with about 1,342,177,280 inodes. Has this
>> >> amount ever be tested in the wild?
>> >
>> > Not sure if it has been tested in the wild or not but the filesystem
>> > itself contains a _TON_ of hardlinks.  Creation of hardlinks is one of the
>> > big purposes of this filesystem.
>> >
>>
>> Well then my idea of making smaller filesystems would break that
>> then... hmmm I would say that its time to escalate this to Level 2
>> support :). What do the filesystem kernel people think? I would bring
>> them in to see if there is something we are missing. Maybe something
>> in the dealing with that many inodes per file is causing a problem (or
>> maybe this is just known behaviour for large filesystems.) By the way,
>> this is a 64 bit OS correct?
>>
>
> Correct, 64 bit OS.  I'm going to get some of our FS guys on the horn as
> soon as RH is back to work.  I think most of them will return on Monday.
>

Slackers... in my day... oh, it's time for my applesauce at the old
sysadmin home.  Back later.

I think the inode/hardlink count might actually be an issue even if
the files being tested aren't multiples of hardlinks.  The journalling
and filesystem are going to want to optimize how they are getting
data.  Hmmm, if you want to completely break things... what are the
speedups/slowdowns if you mount it as ext2 instead of ext3 :).



-- 
Stephen J Smoogen. -- BSD/GNU/Linux
How far that little candle throws his beams! So shines a good deed
in a naughty world. = Shakespeare. "The Merchant of Venice"

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-02 Thread Kostas Georgiou
On Fri, Jan 02, 2009 at 01:28:43PM -0600, Mike McGrath wrote:

> On Fri, 2 Jan 2009, James Antill wrote:
> 
> > On Fri, 2009-01-02 at 11:57 -0600, Mike McGrath wrote:
> > > On Fri, 2 Jan 2009, Sascha Thomas Spreitzer wrote:
> > >
> > > > Hello again,
> > > >
> > > > this line looks suspicious to me:
> > > >
> > > > # name   
> > > >  : tunables:
> > > > slabdata   
> > > > ext3_inode_cache   98472 15026076051 : tunables   54   27
> > > >   8 : slabdata  30052  30052189
> > > >
> > > > Is it 1 big filesystem with about 1,342,177,280 inodes. Has this
> > > > amount ever be tested in the wild?
> > >
> > > Not sure if it has been tested in the wild or not but the filesystem
> > > itself contains a _TON_ of hardlinks.  Creation of hardlinks is one of the
> > > big purposes of this filesystem.
> >
> >  Ah ha ... I bet that you'll find tar/cp-a/whatever is having a major
> > problem keeping tabs on which inodes it's "seen", so it doesn't copy the
> > same data N times. Try running: cp -a --no-preserve=links ... and see if
> > that is much faster?
> >
> 
> Naw, I've been testing on the non-link portions.  Dennis, Jesse, etc,
> correct me if I'm wrong on this:
> 
> We've got a dir /mnt/koji/packages/ that contains all of the packages.
> You can actually view this dir yourself at:
> 
> http://kojipkgs.fedoraproject.org/packages/glibc/
> 
> There are other directories at /mnt/koji/static-repos/.  A directory like
> static-repos contains almost exclusively hardlinks to those packages.
> 
> Since many of those hardlink oriented directories can be recreated, we
> don't bother backing them up so I haven't been testing with them.
> 
> One thing I'm going to try to do is re-index the filesystem (e2fsck -D).
> I figure its a worthwhile thing to do.  I'm testing on a snapshot first.

A lower vm.vfs_cache_pressure might help as well, though you might need quite a
bit of memory to keep everything in cache.
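
i.e. something like (100 is the default; lower values make the kernel hang on
to dentries and inodes longer):

  sysctl vm.vfs_cache_pressure
  sysctl -w vm.vfs_cache_pressure=50
  # or persistently, add to /etc/sysctl.conf:
  #   vm.vfs_cache_pressure = 50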

Kostas

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-02 Thread Jesse Keating
On Fri, 2009-01-02 at 13:28 -0600, Mike McGrath wrote:
> 
> There are other directories at /mnt/koji/static-repos/.  A directory like
> static-repos contains almost exclusively hardlinks to those packages.
> 
> Since many of those hardlink oriented directories can be recreated, we
> don't bother backing them up so I haven't been testing with them.

We stopped making hardlinks in those directories a while back, during
the last round of "make it faster".  /mnt/koji/repos/ contains a number
of directories that just have repodata in them, that reference the
relative path back to /mnt/koji/packages.

The /mnt/koji/mash/ tree is where all the hardlinks are.  These are
composes of koji tags for things like rawhide and releases.  It's here
that we make hardlinks back to /mnt/koji/packages/ for the individual
rpms.

-- 
Jesse Keating
Fedora -- Freedom² is a feature!
identi.ca: http://identi.ca/jkeating


___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-03 Thread Matt Domsch
On Wed, Dec 31, 2008 at 02:42:27PM -0600, Mike McGrath wrote:
> Lets pool some knowledge together because at this point, I'm missing
> something.
> 
> I've been doing all measurements with sar as bonnie, etc, causes builds to
> timeout.
> 
> Problem: We're seeing slower then normal disk IO.  At least I think we
> are.  This is a PERC5/E and MD1000 array.
> 
> When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
> around 4-6MBytes/s

That's sucky.
 
> When I do a cp of a large file "cp /mnt/koji/out /tmp/" I get
> 30-40MBytes/s.
> 
> If I "cat /dev/sde > /dev/null" I get between 225-300MBytes/s read.

That's about what I would expect for straight block reads.

> The above tests are pretty consistent.  /dev/sde is a raid5 array,
> hardware raid.

Remember, RAID 5's worst performance is for writes.  In your 14-drive
array, it has to calculate parity across all the drives, then write
the data across all the drives.  As long as it's pure writes (e.g. not
read/modify/write) it's not so bad, but still slower than you might
think.

What ext3 journaling options are enabled (e.g. what does 'mount' say)?
If it's data=ordered (the default), that's OK.  If it's data=journal,
then all the data gets written twice (first to the journal, then the
journal to the disk), which is really really slow, and the size of the
journal would really make a difference too.

RAID controllers also tend to benefit from using the noop scheduler,
which effectively defers the scheduling to the RAID controller.

Note that cp doesn't fdatasync(), so the I/Os will be scheduled, but
not necessarily completed, when cp returns.  Which might make your
numbers even more optimistic than they really are. :-(
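
Concretely, something like this would cover both checks and fold the flush
time into the measurement (just a sketch):

  mount | grep koji                            # any explicit data= option shows up here
  cat /sys/block/sde/queue/scheduler           # noop is worth comparing against cfq/deadline
  time sh -c 'cp /mnt/koji/out /tmp/ && sync'  # include the time to actually hit the disks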

-- 
Matt Domsch
Linux Technology Strategist, Dell Office of the CTO
linux.dell.com & www.dell.com/linux

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-03 Thread Kostas Georgiou
On Sat, Jan 03, 2009 at 06:32:38PM -0600, Matt Domsch wrote:

> What ext3 journaling options are enabled (e.g. what does 'mount' say)?
> If it's data=ordered (the default), that's OK.  If it's data=journal,
> then all the data gets written twice (first to the journal, then the
> journal to the disk), which is really really slow, and the size of the
> journal would really make a difference too.

For an NFS server (assuming that you aren't exporting as async)
data=journal can give you better performance than anything else
actually. The NFS howto has a brief note in the performance section
about this.

Kostas

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-03 Thread Matt Domsch
On Sun, Jan 04, 2009 at 03:02:55AM +, Kostas Georgiou wrote:
> On Sat, Jan 03, 2009 at 06:32:38PM -0600, Matt Domsch wrote:
> 
> > What ext3 journaling options are enabled (e.g. what does 'mount' say)?
> > If it's data=ordered (the default), that's OK.  If it's data=journal,
> > then all the data gets written twice (first to the journal, then the
> > journal to the disk), which is really really slow, and the size of the
> > journal would really make a difference too.
> 
> For an NFS server (assuming that you aren't exporting as async)
> data=journal can give you better performance than anything else
> actually. The NFS howto has a brief note in the performance section
> about this.

Yes, if the slowness is seen by applications on the client side of the
NFS server, data=journal on the NFS server can help.

Mike, your tests were all on the local file system, not across an NFS
connection, right?

data=journal can only buffer up to the size of the journal.  Given the
comments about speed with "large files", unless the journal is
specifically tuned to be large enough to handle them, no dice.
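
For what it's worth, resizing an ext3 journal means removing it and
recreating it, and that has to happen with the filesystem unmounted
(device path is a placeholder; the usual backup caveats apply):

  umount /mnt/koji
  tune2fs -O ^has_journal /dev/VolGroup00/koji   # drop the existing journal
  e2fsck -f /dev/VolGroup00/koji                 # sanity check before recreating
  tune2fs -J size=400 /dev/VolGroup00/koji       # new journal, in MB (400 MB is,
                                                 # I believe, the max with 4k blocks)
  mount /mnt/koji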


-- 
Matt Domsch
Linux Technology Strategist, Dell Office of the CTO
linux.dell.com & www.dell.com/linux

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-04 Thread Mike McGrath
On Sat, 3 Jan 2009, Matt Domsch wrote:

> On Sun, Jan 04, 2009 at 03:02:55AM +, Kostas Georgiou wrote:
> > On Sat, Jan 03, 2009 at 06:32:38PM -0600, Matt Domsch wrote:
> >
> > > What ext3 journaling options are enabled (e.g. what does 'mount' say)?
> > > If it's data=ordered (the default), that's OK.  If it's data=journal,
> > > then all the data gets written twice (first to the journal, then the
> > > journal to the disk), which is really really slow, and the size of the
> > > journal would really make a difference too.
> >
> > For an NFS server (assuming that you aren't exporting as async)
> > data=journal can give you better performance than anything else
> > actually. The NFS howto has a brief note in the performance section
> > about this.
>
> Yes, if the slowness is seen by applications on the client side of the
> NFS server, data=journal on the NFS server can help.
>
> Mike, your tests were all on the local file system, not across an NFS
> connection, right?
>

Correct, though (obviously) we're seeing the slowness remotely as well.

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-04 Thread Ramez Hanna
On Sun, Jan 4, 2009 at 7:59 PM, Mike McGrath  wrote:

> On Sat, 3 Jan 2009, Matt Domsch wrote:
>
> > On Sun, Jan 04, 2009 at 03:02:55AM +, Kostas Georgiou wrote:
> > > On Sat, Jan 03, 2009 at 06:32:38PM -0600, Matt Domsch wrote:
> > >
> > > > What ext3 journaling options are enabled (e.g. what does 'mount'
> say)?
> > > > If it's data=ordered (the default), that's OK.  If it's data=journal,
> > > > then all the data gets written twice (first to the journal, then the
> > > > journal to the disk), which is really really slow, and the size of
> the
> > > > journal would really make a difference too.
> > >
> > > For an NFS server (assuming that you aren't exporting as async)
> > > data=journal can give you better performance than anything else
> > > actually. The NFS howto has a brief note in the performance section
> > > about this.
> >
> > Yes, if the slowness is seen by applications on the client side of the
> > NFS server, data=journal on the NFS server can help.
> >
> > Mike, your tests were all on the local file system, not across an NFS
> > connection, right?
> >
>
> Correct, though (obviously) we're seeing the slowness remotely as well.
>
>-Mike
>
> ___
> Fedora-infrastructure-list mailing list
> Fedora-infrastructure-list@redhat.com
> https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list
>

Hi,

I've had no previous experience with such issues, but here are my 2 cents.
IMHO, if the slowness is seen locally as well as remotely, then I would start
thinking about filesystem options, or even consider a different filesystem.
I think you first need to rule out the HW issues (RAID, disk speed, etc.)
and then look more into the fs-specific options which were discussed in
several previous emails.
___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-04 Thread FD Cami
On Sun, 4 Jan 2009 11:59:38 -0600 (CST)
Mike McGrath  wrote:

> On Sat, 3 Jan 2009, Matt Domsch wrote:
> 
> > On Sun, Jan 04, 2009 at 03:02:55AM +, Kostas Georgiou wrote:
> > 
> > Mike, your tests were all on the local file system, not across an NFS
> > connection, right?
> >
> 
> Correct, though (obviously) we're seeing the slowness remotely as well.


Hi Mike, list,

The dd and cat numbers in your email are consistent with what I get from
both my RAID5 arrays (PERC5/i controllers), with 4 and 6 15kRPM drives
(in PowerEdge 2900s).

Have you tried experimenting with stride and stripe_width?
stride should match the per-disk chunk size the RAID array was configured
with (that should show up in the PERC5/E BIOS at least), expressed in
filesystem blocks, and stripe_width is stride*N, with N being the number of
data disks, i.e. excluding parity.
I'm paraphrasing "man tune2fs"; it's probably better explained there.
Those can be tuned with tune2fs (-E), although I've never done that to a
live FS, so the usual caveats about backups apply.
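
As a concrete (hypothetical) example: with a 64 KB per-disk chunk, 4 KB
filesystem blocks and 13 data disks (14-drive RAID5), the numbers would
work out as below.  Note the option spelling differs between tools
(tune2fs uses stripe_width, mke2fs uses stripe-width), so check the man
page for your version:

  # stride = chunk size / fs block size = 64 KB / 4 KB = 16
  # stripe width = stride * data disks  = 16 * 13      = 208
  tune2fs -E stride=16,stripe_width=208 /dev/sdeX    # device is a placeholder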

Sorry about the noise if this was already done or discussed; I've just read
back through the thread and didn't find anything related to it.

Best,

Francois Cami

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-19 Thread Mike McGrath
On Wed, 31 Dec 2008, Mike McGrath wrote:

> Lets pool some knowledge together because at this point, I'm missing
> something.
>
> I've been doing all measurements with sar as bonnie, etc, causes builds to
> timeout.
>
> Problem: We're seeing slower then normal disk IO.  At least I think we
> are.  This is a PERC5/E and MD1000 array.
>
> When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
> around 4-6MBytes/s
>
> When I do a cp of a large file "cp /mnt/koji/out /tmp/" I get
> 30-40MBytes/s.
>
> Then I "dd if=/dev/sde of=/dev/null" I get around 60-70 MBytes/s read.
>
> If I "cat /dev/sde > /dev/null" I get between 225-300MBytes/s read.
>
> The above tests are pretty consistent.  /dev/sde is a raid5 array,
> hardware raid.
>
> So my question here is, wtf?  I've been working to do a backup which I
> would think would either cause network utilization to max out, or disk io
> to max out.  I'm not seeing either.  Sar says the disks are 100% utilized
> but I can cause major increases in actual disk reads and writes by just
> running additional commands.  Also, if the disks were 100% utilized I'd
> expect we would see lots more iowait.  We're not though, iowait on the box
> is only %0.06 today.
>
> So, long story short, we're seeing much better performance when just
> reading or writing lots of data (though dd is many times slower then cat).
> But with our real-world traffic, we're just seeing crappy crappy IO.
>
> Thoughts, theories or opinions?  Some of the sysadmin noc guys have access
> to run diagnostic commands, if you want more info about a setting, let me
> know.
>
> I should also mention there's lots going on with this box, for example its
> hardware raid, lvm and I've got xen running on it (though the tests above
> were not in a xen guest).
>

We all talked about this quite a bit, so I felt the need to let everyone
know the latest status.  One of our goals was to lower utilization on the
netapp.  While high utilization itself isn't a problem (it's just a
measurement, after all), we did decide other problems could be solved if we
could get utilization to go down.

So after a bunch of tweaking on the share and in the scripts we run,
average utilization has dropped significantly.  Take a look here:

http://mmcgrath.fedorapeople.org/util.html

That's the latest 30-day view (from a couple of days ago).  You'll notice
utilization was around 90-100% pretty much all the time, and it went on
like that for MONTHS.  Even Christmas day was pretty busy, even though we
generally saw low traffic everywhere else in Fedora over that whole period.

Now we're sitting pretty with a 20% average utilization.  You'll also
notice that, generally, our service time and await are lower.  I'm trying
to get a longer view of those numbers over time, so we'll see whether
that's an actual trend or not.

The big changes?  1) Better use of the share in our scripts.  2) A larger
readahead value, set with blockdev (see the sketch below).

Some smaller changes included switching the I/O scheduler from cfq to
deadline (and now noop).
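
For the record, the readahead change is per block device; the value below
is just an example, not necessarily what we settled on:

  blockdev --getra /dev/sde          # current readahead, in 512-byte sectors
  blockdev --setra 16384 /dev/sde    # 16384 sectors = 8 MB of readahead
  # the scheduler is switched the same way as earlier in the thread,
  # via /sys/block/sde/queue/scheduler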

There are still two things I'd like to do longer term:

1) Move our snapshots to different devices to reduce seeks.
2) A full re-index of the filesystem (requiring around 24-36 hours of
downtime), which I'm going to schedule sometime after the Alpha ships.

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-19 Thread Mike McGrath
On Mon, 19 Jan 2009, Mike McGrath wrote:
>
> The big changers?  1) Better use of the share in our scripts.  2) A larger
> readahead value (blockdev)
>

I forgot one more big change: kojipkgs (the web server our builders use to
get packages off the NFS share) now has a squid server on it.  Instead of
pulling from the NFS share for every build, the builders now pull from
squid.  We're seeing a 98% hit rate, so only 2% of the requests for our
builds actually hit the NFS share.
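
For anyone curious, the hit rate falls straight out of squid's access.log;
a rough one-liner (log path is the Fedora/RHEL default, adjust as needed):

  awk '{ split($4, code, "/"); total++; if (code[1] ~ /HIT/) hits++ }
       END { if (total) printf "%.1f%% hits (%d of %d)\n", 100*hits/total, hits, total }' \
      /var/log/squid/access.log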

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-19 Thread Jesse Keating
On Mon, 2009-01-19 at 10:02 -0600, Mike McGrath wrote:
>  on the
> netapp.

Er, this is on nfs1 right, not the netapp?

-- 
Jesse Keating
Fedora -- Freedom² is a feature!
identi.ca: http://identi.ca/jkeating


___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-19 Thread Mike McGrath
On Mon, 19 Jan 2009, Jesse Keating wrote:

> On Mon, 2009-01-19 at 10:02 -0600, Mike McGrath wrote:
> >  on the
> > netapp.
>
> Er, this is on nfs1 right, not the netapp?
>

My mistake, correct.  All this is on nfs1, which has directly attached
storage.

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-21 Thread Ray Van Dolson
On Mon, Jan 19, 2009 at 10:55:35AM -0600, Mike McGrath wrote:
> On Mon, 19 Jan 2009, Jesse Keating wrote:
> 
> > On Mon, 2009-01-19 at 10:02 -0600, Mike McGrath wrote:
> > >  on the
> > > netapp.
> >
> > Er, this is on nfs1 right, not the netapp?
> >
> 
> My mistake, correct.  All this is on nfs1 which has directly attached
> storage.
> 

Which is backed by an MD1000?  MD3000?  The stats you generated in your
link are from sar?

Ray

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: Disk IO issues

2009-01-22 Thread Mike McGrath
On Wed, 21 Jan 2009, Ray Van Dolson wrote:

> On Mon, Jan 19, 2009 at 10:55:35AM -0600, Mike McGrath wrote:
> > On Mon, 19 Jan 2009, Jesse Keating wrote:
> >
> > > On Mon, 2009-01-19 at 10:02 -0600, Mike McGrath wrote:
> > > >  on the
> > > > netapp.
> > >
> > > Er, this is on nfs1 right, not the netapp?
> > >
> >
> > My mistake, correct.  All this is on nfs1 which has directly attached
> > storage.
> >
>
> Which is backed by an MD1000?  MD3000?  The stats you generated in your
> link are from sar?
>

MD1000 and yes.

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list