Re: Linux, gobs of RAM, RAID and performance suckage...

2006-12-01 Thread Bill McGonigle

On Nov 30, 2006, at 11:59, Paul Lussier wrote:


However, now when backups are run, the system becomes completely
unresponsive from an NFS client perspective, and the load average
skyrockets (e.g. into the 40s!).

Does anyone have any ideas?  I'm at a complete loss on this one.


Have you tried taking the RAM out and seeing if performance reverts?

If so... are the RAM sticks the correct type, properly installed for  
memory interleaving, etc?


Did the BIOS decide to reset itself when new RAM was installed?   
Could the BIOS be buggy with 4GB of RAM (that thing with the upper  
640MB of RAM being masked)?  When you're getting desperate you can  
reset the BIOS to known-good/failsafe/optimized defaults and see if  
anything changes.


I've forgotten some 2.4 stuff but there was a big-mem version of the  
2.4 kernel at one point to work around problems with too much RAM.


All guesses...

-Bill

-
Bill McGonigle, Owner   Work: 603.448.4440
BFC Computing, LLC  Home: 603.448.1668
[EMAIL PROTECTED]   Cell: 603.252.2606
http://www.bfccomputing.com/    Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf

For fastest support contact, please follow:
http://bfccomputing.com/support_contact.html



Re: Linux, gobs of RAM, RAID and performance suckage...

2006-12-01 Thread Dave Johnson
Bill McGonigle writes:
 I've forgotten some 2.4 stuff but there was a big-mem version of the  
 2.4 kernel at one point to work around problems with too much RAM.

Ah, that's right. If you're going to run 2.4 with more than 1GB of RAM
you need to apply the rmap patches or performance will get worse the
more RAM you have.  Severely so with more than 2GB.

Google says rmap patches are here, but the server is giving me a 403
at the moment.

http://www.surriel.com/patches/
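
For anyone who hasn't applied a kernel patch before, the dance is
roughly this -- a sketch only; the filenames are hypothetical, and you
need the rmap patch matching your exact 2.4.x version:

  cd /usr/src/linux-2.4.x
  patch -p1 --dry-run < /tmp/rmap-2.4.x.patch   # check it applies cleanly
  patch -p1 < /tmp/rmap-2.4.x.patch
  make oldconfig
  make dep bzImage modules
  make modules_install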


-- 
Dave



Re: Linux, gobs of RAM, RAID and performance suckage...

2006-11-30 Thread Neil Joseph Schelly
On Thursday 30 November 2006 11:59 am, Paul Lussier wrote:
 Before yesterday we were noticing lots of NFS drop-outs on the clients
 (300+ of them) and we correlated this pretty much to the backups
 (amanda).  The theory was that local disk I/O was beating out
 nfs-client requests.

I'm not sure of the topology of your SAN, but can you connect another machine to 
the SAN with read-only access to those filesystems to do backups without 
involving the NFS server at all?
-N


Re: Linux, gobs of RAM, RAID and performance suckage...

2006-11-30 Thread Thomas Charron

On 11/30/06, Paul Lussier [EMAIL PROTECTED] wrote:

 This is bizarre.

Spare memory will ALWAYS be used to cache.  This is fine and 'normal'.

--
-- Thomas


Re: Linux, gobs of RAM, RAID and performance suckage...

2006-11-30 Thread Paul Lussier
Neil Joseph Schelly [EMAIL PROTECTED] writes:

 I'm not sure the topology of your SAN,

It's not a SAN.  It's direct-attached storage.

 but can you connect another machine to the SAN with read-only access
 to those filesystems to do backups without involving the NFS server
 at all?  -N

In theory, yes; in practice, no.  Also, the backups are not
touching the NFS server. 

Cast of characters:
  Playing the part of:Actor:
the NFS server space-monster
the backup server  amanda

Amanda connects to the amanda daemon on space-monster (via inetd) and
requests that he start his backup process.  space-monster in turn
kicks off a gtar process which sends the data back to amanda.  The
gtar process is reading the local disk, not NFS.  Apparently, this
local disk I/O request supersedes NFS disk requests.  But there's much
more than that happening here.  CPU utilization is through the roof on
space-monster.  That wasn't the case yesterday before the memory
upgrade.

-- 
Seeya,
Paul
--
Key fingerprint = 1660 FECC 5D21 D286 F853  E808 BB07 9239 53F1 28EE

A: Yes.   
 Q: Are you sure?
 A: Because it reverses the logical flow of conversation.   
 Q: Why is top posting annoying in email?


Re: Linux, gobs of RAM, RAID and performance suckage...

2006-11-30 Thread Paul Lussier
Thomas Charron [EMAIL PROTECTED] writes:

 On 11/30/06, Paul Lussier [EMAIL PROTECTED] wrote:


 This is bizarre.


   Spare memory will ALWAYS be used to cache.  This is fine and 'normal'.

I was not implying that the use of spare memory as cache was bizarre.
I *know* that spare memory will be used this way, and that in general
this is a good thing.

What is bizarre is that one can *INCREASE* system resources and
performance *GETS WORSE*.  This is counterintuitive and the inverse of
normal expectations, IOW, BIZARRE!

-- 
Seeya,
Paul
--
Key fingerprint = 1660 FECC 5D21 D286 F853  E808 BB07 9239 53F1 28EE

A: Yes.   
 Q: Are you sure?
 A: Because it reverses the logical flow of conversation.   
 Q: Why is top posting annoying in email?


Re: Linux, gobs of RAM, RAID and performance suckage...

2006-11-30 Thread Bruce Dawson
Neil Joseph Schelly wrote:
 On Thursday 30 November 2006 11:59 am, Paul Lussier wrote:
 Before yesterday we were noticing lots of NFS drop-outs on the clients
 (300+ of them) and we correlated this pretty much to the backups
 (amanda).  The theory was that local disk I/O was beating out
 nfs-client requests.

It's been years since I've been in the guts of NFS. Things I remember for
server tuning are:

   * Make sure the user-mode processes aren't running into ulimit problems.
   * Lockd chews up a lot of kernel resources - just which ones depends
on its implementation. Check shared memory and semaphores for resource
depletion.
   * NFS over TCP is a real system hog.
   * The socket queue size (which used to be a kernel-configurable
option; I don't know about today's kernels), if set larger, would
improve performance. (See the sketch after this list.)
   * MTU should be uniform across the network. Things will work if it's
not uniform, but performance will suffer. In general, bigger is better -
up to the maximum size the servers/clients can handle.
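
If it helps, the socket-buffer knobs I'm thinking of look roughly like
this on Linux -- the values are placeholders, not recommendations, and
the right numbers depend on your kernel and NIC:

   sysctl -w net.core.rmem_max=262144       # max socket receive buffer
   sysctl -w net.core.wmem_max=262144       # max socket send buffer
   sysctl -w net.core.rmem_default=131072   # defaults for new sockets
   sysctl -w net.core.wmem_default=131072
   ifconfig eth0 | grep MTU                 # confirm MTU matches the network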

An old problem was that connections would linger after they'd been
closed. I think modern kernels/NFS distributions have fixed that, though.

The fastest way to clog NFS is with a bunch of small random access
read-writes. This is typically manifested in directory updates (for
example: updating a netnews site over NFS), trying to run a database
across an NFS connection with a lot of simultaneous writers, and having
too many umbrellas in your empty rum glass.

Otherwise, if you can wait until I'm able to dig up my old notes on NFS
load-testing, I can get some more ideas. Right now, I'm several thousand
miles from my filing cabinet.

--Bruce


Re: Linux, gobs of RAM, RAID and performance suckage...

2006-11-30 Thread Neil Joseph Schelly
On Thursday 30 November 2006 01:51 pm, Paul Lussier wrote:

 It's not a SAN.  It's direct-attached storage.

  Winchester OpenSAN FC-based RAID array

Isn't this the storage?  I assume the description meant it was a SAN.  What is 
the topology of that FC network?

 In theory, yes, in practicality, no.  Also, the backups are not
 touching the NFS server.

 Amanda connects to the amanda daemon on space-monster (via inetd) and
 requests that he start his backup process.  space-monster in turn
 kicks off a gtar process which sends the data back to amanda.  The
 gtar process is reading the local disk, not NFS.  Apparently, this
 local disk I/O request supersedes NFS disk requests.  But there's much
 more than that happening here.  CPU utilization is through the roof on
 space-monster.  That wasn't the case yesterday before the memory
 upgrade.

It sounds like the NFS server, space-monster in your description, is having to 
read all that data from the filesystem and then send it all over the network to 
the backup server, amanda.

If you could put another server on the FC network, that machine could 
also mount the partition (read-only, obviously) and back up the contents 
without space-monster even knowing anything is happening.
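
If you go that route, the mount on the second machine would look
something like this -- device name hypothetical, and be warned that
reading a filesystem another host is actively writing is only really
safe from a snapshot or a quiesced array:

  # ext3: 'noload' skips journal replay, keeping the mount truly read-only
  mount -o ro,noload /dev/sdb1 /mnt/backup
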
-N


Re: Linux, gobs of RAM, RAID and performance suckage...

2006-11-30 Thread Dave Johnson
Paul Lussier writes:
 
 This is bizarre.
 
 We've got a machine with dual 3GHz Xeon CPUs as our NFS server,
 connected to a Winchester OpenSAN FC-based RAID array.  The array is a
 single 1TB partition (unfortunately).
 
 Before yesterday we were noticing lots of NFS drop-outs on the clients
 (300+ of them) and we correlated this pretty much to the backups
 (amanda).  The theory was that local disk I/O was beating out
 nfs-client requests.
 
 We also noticed that our memory utilization was through the roof.  We
 had 2GB of PC2300, ECC, DDR, Registered RAM.  That RAM was averaging
 the following usage patterns:
 
  active - 532M
  inactive   - 1.2G
  unused -  39M
  cache  - 1.3G
  slab cache - 255M
  swap cache -   6M
  apps   -  78M
  buffers- 350M
  swap   -  11M
 
 We were topping out our memory usage and occasionally dipping into
 swap space on disk.
 
 Yesterday we added 2GB of RAM and our memory utilization now looks like this:
 
  active - 793M
  inactive   - 2.3G
  unused - 213M
  cache  - 2.9G
  slab cache - 194M
  swap cache -   2M
  apps   -  71M
  buffers- 313M
  swap   - 4.5M
 
 So, it appears we really only succeeded in doubling the cache
 available to the system, and a little more than halving the amount of
 swap that was getting touched.
 
 However, now when backups are run, the system becomes completely
 unresponsive from an NFS client perspective, and the load average
 skyrockets (e.g. into the 40s!).
 
 Does anyone have any ideas?  I'm at a complete loss on this one.


General nfs server comments:

1)
Make sure you are exporting the NFS shares async, otherwise most
operations will seem slow from the client's point of view.  Check
/proc/fs/nfs/exports for 'async'.  If it's not there, set it in your
/etc/exports (sketch below).

2)
Make sure your server has enough nfsd threads to handle the client
load.  With 300 clients, you should have at least 100 nfsd threads in
my opinion.  Check your init.d scripts for how to set this.
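
Something like this, roughly -- the export path, network, and thread
count are illustrative, and the init-script variable differs by distro:

  # 1) /etc/exports -- 'async' is set per export:
  /export   192.168.0.0/24(rw,async)

  exportfs -ra                      # re-export
  grep async /proc/fs/nfs/exports   # verify the kernel sees it

  # 2) nfsd thread count -- often RPCNFSDCOUNT in /etc/sysconfig/nfs
  # or /etc/default/nfs-kernel-server, or directly:
  rpc.nfsd 128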


Other stuff:

11MB of swap doesn't mean anything is wrong.  It's actually a good
thing: it means your kernel found 11MB of stuff that wasn't needed and
booted it out to swap space in order to make more room available
for cache.

You should check the disk rates and CPU usage with 'vmstat 5' for a
few minutes.  This will also show how much time the CPUs are spending
in I/O wait.  The full output of /proc/meminfo may also be useful.
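
For example ('wa' only exists on newer kernels/procps; older ones fold
I/O wait into idle):

  vmstat 5             # watch 'b' (blocked on I/O), 'bi'/'bo' (disk
                       # traffic), and 'wa' (I/O wait) during a backup
  cat /proc/meminfo    # full memory breakdown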

I assume you are using an x86_64 kernel?  Using a 32-bit kernel is OK as
long as you don't run out of low memory.  The kernel's heap/slab as
well as filesystem metadata (buffers) must come from low memory, while
filesystem data (cache) and userspace processes can live in high memory.
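
On a 32-bit highmem kernel you can watch low memory directly (these
fields don't exist on x86_64):

  grep -E 'LowTotal|LowFree' /proc/meminfo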

If your SAN is on a LAN dedicated to the server only, you should
investigate converting that subnet to support jumbo frames.
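
For example (interface name hypothetical; every host and switch port
on that subnet has to agree on the MTU):

  ifconfig eth1 mtu 9000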

Since you're using Xeons, not Opterons, you should make sure irqbalance
is installed and running to spread IRQ load across all CPU cores (this
may not be a good idea on a multi-node Opteron system, though).  You can
run top in multi-CPU mode (press '1') to see if any CPUs are
overloaded with irq or wait load while others sit idle.
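
Quick checks, for what they're worth:

  cat /proc/interrupts          # per-CPU interrupt counts; look for one
                                # CPU taking nearly all NIC/FC interrupts
  ps ax | grep -i irqbalance    # is the daemon actually running?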


-- 
Dave



Re: Linux, gobs of RAM, RAID and performance suckage...

2006-11-30 Thread Paul Lussier
Drew Van Zandt [EMAIL PROTECTED] writes:

 Can you ask amanda to do bandwidth limiting?  I'd think this would be
 a standard feature...

Yes, at least with respect to network bandwidth.  I'm not sure about
disk I/O bandwidth.  But I don't think it's an amanda thing, I think
it's a poorly tuned kernel thing.  Possibly combined with an ancient
buggy kernel thing.  And not just a little of a "Linux makes a crappy
NFS server" thing :)
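
For the archives: the network cap lives in amanda.conf on the backup
server -- the value here is made up, not a recommendation:

  netusage 4000 Kbps   # max network bandwidth amanda will schedule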

-- 
Seeya,
Paul
--
Key fingerprint = 1660 FECC 5D21 D286 F853  E808 BB07 9239 53F1 28EE

A: Yes.   
 Q: Are you sure?
 A: Because it reverses the logical flow of conversation.   
 Q: Why is top posting annoying in email?


Re: Linux, gobs of RAM, RAID and performance suckage...

2006-11-30 Thread Dave Johnson
Paul Lussier writes:
 Yesterday we added 2GB of RAM and our memory utilization now looks like this:
 
  active - 793M
  inactive   - 2.3G
  unused - 213M
  cache  - 2.9G
  slab cache - 194M
  swap cache -   2M
  apps   -  71M
  buffers- 313M
  swap   - 4.5M

When you are in this condition, how large are Dirty and Writeback in
/proc/meminfo?
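
E.g. (2.6 kernels break these out in /proc/meminfo; I don't think 2.4
does):

  grep -E 'Dirty|Writeback' /proc/meminfo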

If you have a large amount of Dirty (read: 300MB+) and the underlying
filesystem is using ordered journaling (the default for ext3), you can
see very long delays when a process requests an fsync on a file (vi
does this a lot).  Instead of writing that one file to disk right away,
the filesystem must write all the other outstanding data to disk first!

If your clients are write-heavy and you are using ext3, you should
consider data=writeback when mounting the filesystem.
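
For example, in /etc/fstab (device and mount point hypothetical) --
note that older kernels won't let you switch journal modes on a live
remount, so plan on an unmount:

  /dev/sdb1  /export  ext3  defaults,data=writeback  0 2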

I've run into this issue on a server with 64GB of RAM, with Dirty
commonly exceeding 4GB under load.  Poor users of vi had to keep
waiting for the server to write out 4GB of data every time they saved
a file.  data=writeback helped tremendously, as only the relevant
metadata is forced to disk on an fsync() instead of everything.


 Correct.  The nfsd's on the server get pushed to the bottom of the
 queue (in top) and what I see is [kupdated] and tar (from the backups)
 rotating through the top position.  There were one or two other
 processes up there too, one of which I think was a kernel thread
 ([bdflush] ?).  As far as I know, the tars were not making progress,
 but I can't be totally sure of that.  Nothing *else* was making
 progress, that's for sure.

kupdated?!?! A 2.4 kernel??!?!? 


-- 
Dave



Re: Linux, gobs of RAM, RAID and performance suckage...

2006-11-30 Thread Paul Lussier
Dave Johnson [EMAIL PROTECTED] writes:

 kupdated?!?! A 2.4 kernel??!?!? 

Ahm, yeah. Are you shocked at kupdated running with a 2.4 kernel
because it shouldn't be there with a 2.4 kernel, or shocked I'm still
running a 2.4 kernel?

-- 
Seeya,
Paul
--
Key fingerprint = 1660 FECC 5D21 D286 F853  E808 BB07 9239 53F1 28EE

A: Yes.   
 Q: Are you sure?
 A: Because it reverses the logical flow of conversation.   
 Q: Why is top posting annoying in email?