Re: [ccp4bb] RAID array

2019-11-27 Thread Kay Diederichs
Hi Peter,

indeed I was not trying to say that RAID1 is insurance against disks going bad. It 
is only the first line of defense against sudden and unpredictable failure (and has 
saved us a couple of times). Rather, we regularly inspect /var/log/messages since 
(on RHELx) it carries the mdadm-related messages and also shows non-RAID disk and 
other errors. We also regularly run SMART tests, run mdadm RAID checks, look at 
webserver error logs, read security logs ... I didn't want to expand on this since 
it doesn't solve Vaheh's problem ...
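
For concreteness, the kind of routine I mean looks roughly like the sketch below 
(run from cron; the device names /dev/sda, /dev/sdb and /dev/md0 are placeholders, 
and smartmontools and mdadm are assumed to be installed):

#!/bin/sh
# rough sketch of periodic health checks; device names are placeholders
for disk in /dev/sda /dev/sdb; do
    smartctl -t short "$disk"    # start a short SMART self-test
    smartctl -H "$disk"          # report the drive's overall health assessment
done
cat /proc/mdstat                 # degraded md arrays show up here
echo check > /sys/block/md0/md/sync_action    # ask md to verify the whole array
grep -iE 'md[0-9]|i/o error' /var/log/messages | tail -n 20   # recent disk/RAID complaints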


Best,
Kay
On 27 November 2019 at 17:53:37 CET, Peter Keller wrote:

Dear all,

On 27/11/2019 14:03, Kay Diederichs wrote:
> As an example, by default in my lab we have the operating system on mdadm 
> RAID1 which consists of two disks that mirror each other. If one of the disks 
> fails, typically we only notice this when inspecting the system log files.

This won't help Vaheh, but I highly recommend configuring notifications of 
changes in the state of the RAID in this kind of setup:

(1) configure /etc/mdadm.conf with (as a minimum) the MAILADDR keyword (see 
'man mdadm.conf' for a complete list of keywords). Alternatively, use the 
PROGRAM keyword and write a script that uses something like 'wall' to notify 
users (see the sketch after step (3) below).

(2) test that notifications get through to their intended destination with 
'mdadm --monitor --test'. If the local MTA (postfix, sendmail,...) isn't set up 
correctly, you may need to do that too.

(3) make sure that 'mdadm --monitor --scan' is running. Depending on 
distro, this will be done with the usual service enable and startup commands, 
something like

   systemctl enable --now mdmonitor

or

   chkconfig --add mdmonitor

   service mdmonitor start
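
For concreteness, on a systemd-based distro the whole of (1)-(3) might boil down 
to something like the lines below; the e-mail address is a placeholder and the 
exact service name can differ between distributions:

   mdadm --detail --scan >> /etc/mdadm.conf          # record ARRAY lines for the existing arrays
   echo 'MAILADDR raid-alerts@example.org' >> /etc/mdadm.conf   # step (1), placeholder address

   mdadm --monitor --scan --test --oneshot           # step (2): one test mail per array, then exit

   systemctl status mdmonitor                        # step (3): confirm the monitor is running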

And yes, you've guessed it, we got bitten once by a software RAID with 
multiple disk failures, and we only noticed it when an application complained 
that it couldn't write to a file ;-)

Regards,

Peter.

-- 
Peter Keller Tel.: +44 (0)1223 353033
Global Phasing Ltd., Fax.: +44 (0)1223 366889
Sheraton House,
Castle Park,
Cambridge CB3 0AX
United Kingdom







Re: [ccp4bb] RAID array

2019-11-27 Thread George N. Reeke
On Wed, 2019-11-27 at 14:03 +, Kay Diederichs wrote:
> Hi Vaheh,
> 
> RAID on Linux comes in different flavours and levels; the flavours are 
> software RAID (mdadm) and hardware RAID (dedicated RAID controller or 
> motherboard), and the levels are RAID0 RAID1 RAID5 RAID6 RAID10 and a few 
> others. These details influence what the user will notice when a disk goes 
> bad. Without knowing what you have, it is difficult to help.
> 
> As an example, by default in my lab we have the operating system on mdadm 
> RAID1 which consists of two disks that mirror each other. If one of the disks 
> fails, typically we only notice this when inspecting the system log files. 
> Replacing the disk, and re-silvering the RAID1 is not trivial and requires 
> some reading of material on the web. 
> 
> It sounds like you don't have this type of RAID1, or maybe there is some 
> mis-configuration.
> 
> good luck,
> Kay
> 
Just to add a bit of advice based on long experience:  Unless you really
do it every day, looking at system logs can be an unreliable way to see
when you have a bad disk.  Assuming you have root access on your Linux
system, there is a tool called logwatch that can be set to look at your
logs for you and send you an email every day with selected excerpts.
So I add a script (I call it runmdadm) to root's /etc/cron.daily to
run mdadm every night and get the RAID info into the log.  I use egrep
in the script to pick out the important output.  Part of my script looks
like this (details changed):

#!/bin/sh
# Run mdadm to check status of RAID arrays
echo "mdadm /dev/md0" > /var/tmp/mdadm.log
/sbin/mdadm --detail /dev/md0 | egrep 'active|removed' >> \
/var/tmp/mdadm.log

Then you can follow the instructions for logwatch to add this new
/var/tmp/mdadm.log to the list of logs it looks at.  (Or change the
above script to add the mdadm info to a log logwatch already looks at.)
If you have hardware RAID, the idea is the same, but you need to use
a command to the hardware controller instead of mdadm.
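
In case it saves someone some reading: the logwatch side of this usually needs only 
two small config files plus a trivial filter script. The layout below is a sketch 
(the file names and the 'mdadm' group name are mine, and details may differ between 
logwatch versions):

# /etc/logwatch/conf/logfiles/mdadm.conf -- tell logwatch where the log file lives
LogFile = /var/tmp/mdadm.log

# /etc/logwatch/conf/services/mdadm.conf -- define a service that reads that log group
Title = "Software RAID status"
LogFile = mdadm

# /etc/logwatch/scripts/services/mdadm -- filter script; passing everything through is enough
#!/bin/sh
cat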

George Reeke (an old lurker)





Re: [ccp4bb] RAID array

2019-11-27 Thread Peter Keller

Dear all,

On 27/11/2019 14:03, Kay Diederichs wrote:

As an example, by default in my lab we have the operating system on mdadm RAID1 
which consists of two disks that mirror each other. If one of the disks fails, 
typically we only notice this when inspecting the system log files.


This won't help Vaheh, but I highly recommend configuring notifications 
of changes in the state of the RAID in this kind of setup:


(1) configure /etc/mdadm.conf with (as a minimum) the MAILADDR keyword 
(see 'man mdadm.conf' for a complete list of keywords). Alternatively, 
use the PROGRAM keyword and write a script that uses something like 
'wall' to notify users.


(2) test that notifications get through to their intended destination 
with 'mdadm --monitor --test'. If the local MTA (postfix, sendmail,...) 
isn't set up correctly, you may need to do that too.


(3) make sure that 'mdadm --monitor --scan' is running. Depending on 
distro, this will be done with the usual service enable and startup 
commands, something like


   systemctl enable --now mdmonitor

or

   chkconfig --add mdmonitor

   service mdmonitor start

And yes, you've guessed it, we got bitten once by a software RAID with 
multiple disk failures, and we only noticed it when an application 
complained that it couldn't write to a file ;-)


Regards,

Peter.

--
Peter Keller Tel.: +44 (0)1223 353033
Global Phasing Ltd., Fax.: +44 (0)1223 366889
Sheraton House,
Castle Park,
Cambridge CB3 0AX
United Kingdom






Re: [ccp4bb] RAID array

2019-11-27 Thread David J. Schuller
I agree about the complexity of the RAID situation. But it can be narrowed down 
a bit.

Since the claim is that there were only two hard drives, the only possibilities are:

RAID 0 - "striping": in which case his data will probably not be recoverable, and 
he would be unable to boot.

RAID 1 - "mirroring": in which case the data may be recoverable. This is probably 
the case, since from the description the machine boots but the login process fails.

If the RAID is set up with hardware, there may be entry into the controller 
from the BIOS boot process.

If the RAID is set up in Linux software, then you need to get logged in somehow 
with root privileges. Details may vary a bit depending on which Linux 
distribution is involved (which was not mentioned).

You could try to boot into "rescue mode" from the Linux installation media (DVD 
or USB stick, usually). Or you could try to boot into "single user mode" by 
altering the boot command during the boot procedure (details will vary with 
distribution). Or you could try setting the system to boot in non-graphical mode.
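
To make "altering the boot command" a little more concrete, on a GRUB-based system 
it usually amounts to something like this (the exact parameter depends on the init 
system, so treat this as a sketch): at the GRUB menu, press 'e' on the default 
entry, append one of the following to the line starting with 'linux', and boot it 
with Ctrl-x:

   systemd.unit=rescue.target       # single-user/rescue mode on systemd distros
   systemd.unit=multi-user.target   # boot to a text console, skipping the graphical login
   single                           # single-user mode on older SysV-style setups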

Like I say, this varies with the Linux distribution so I won't try to give a 
line-by-line set of instructions. My experience is all with Fedora and Scientific 
Linux.

Once you are in, you need to assess the situation. If RAID is set up with 
software, there will be a configuration file, typically /etc/mdadm.conf, and a 
status file, typically /proc/mdstat. You can look for these files for clues to 
how the RAID is set up, and what its status is.
You would need to figure out how your existing drive is configured by checking 
out the partition arrangement with /usr/sbin/fdisk or /usr/sbin/parted.
Then you would need to set up partitions on the brand new drive in the same way.
Once the partitions exist, you would try to add the partitions into the RAID 
volume with /usr/sbin/mdadm. You are going to need to read the man pages for 
mdadm and probably search for help online. Once the RAID set is successfully 
reconfigured, it would take probably a few hours to copy the data from the good 
device to the new one.
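
For a plain two-disk mdadm RAID1 that sequence often boils down to something like 
the following sketch; /dev/sda (the surviving disk), /dev/sdb (the new disk) and 
/dev/md0 are placeholders, and the sfdisk line assumes MBR partition tables (for 
GPT you would use sgdisk instead):

   cat /proc/mdstat                          # which array is degraded, which member is missing
   mdadm --detail /dev/md0                   # confirm which partition was removed
   sfdisk -d /dev/sda | sfdisk /dev/sdb      # copy the partition layout from the good disk
   mdadm --manage /dev/md0 --add /dev/sdb1   # add the new partition; the rebuild starts
   cat /proc/mdstat                          # watch the resync progress (typically hours)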

BTW, if anyone is using HPE SSDs they need to be aware of this:
https://www.zdnet.com/article/hpe-tells-users-to-patch-ssds-to-prevent-failure-after-32768-hours-of-operation/
HPE tells users to patch SSDs to prevent failure after 32,768 hours of operation



On 2019-11-27 09:03, Kay Diederichs wrote:

Hi Vaheh,

RAID on Linux comes in different flavours and levels; the flavours are software 
RAID (mdadm) and hardware RAID (dedicated RAID controller or motherboard), and 
the levels are RAID0 RAID1 RAID5 RAID6 RAID10 and a few others. These details 
influence what the user will notice when a disk goes bad. Without knowing what 
you have, it is difficult to help.

As an example, by default in my lab we have the operating system on mdadm RAID1 
which consists of two disks that mirror each other. If one of the disks fails, 
typically we only notice this when inspecting the system log files. Replacing 
the disk, and re-silvering the RAID1 is not trivial and requires some reading 
of material on the web.

It sounds like you don't have this type of RAID1, or maybe there is some 
mis-configuration.

good luck,
Kay



On Tue, 26 Nov 2019 21:05:54 +, Oganesyan, Vaheh 
 wrote:



Hello ccp4-ers,

A bit off topic (actually a lot off topic) question regarding a RAID array 
system. On a Linux box one of two hard drives failed. I've found an identical one 
and replaced it. Can someone point me in the direction where I can get 
instructions on what to do next to be able to log in? Currently the computer starts 
with no error messages, comes to the login window, takes the username and password 
from me and then flashes and comes back to the login window.

Thank you for your help.
Regards,

Vaheh Oganesyan, Ph.D.
Scientist, Biologic Therapeutics

AstraZeneca
R&D | Antibody Discovery and Protein Engineering
One Medimmune Way, Gaithersburg, MD 20878
T:  301-398-5851
vaheh.oganes...@astrazeneca.com

















--
===
All Things Serve the Beam
===
   David J. Schuller

Re: [ccp4bb] RAID array

2019-11-27 Thread Kay Diederichs
Hi Vaheh,

RAID on Linux comes in different flavours and levels; the flavours are software 
RAID (mdadm) and hardware RAID (dedicated RAID controller or motherboard), and 
the levels are RAID0 RAID1 RAID5 RAID6 RAID10 and a few others. These details 
influence what the user will notice when a disk goes bad. Without knowing what 
you have, it is difficult to help.

As an example, by default in my lab we have the operating system on mdadm RAID1 
which consists of two disks that mirror each other. If one of the disks fails, 
typically we only notice this when inspecting the system log files. Replacing 
the disk, and re-silvering the RAID1 is not trivial and requires some reading 
of material on the web. 

It sounds like you don't have this type of RAID1, or maybe there is some 
mis-configuration.

good luck,
Kay



On Tue, 26 Nov 2019 21:05:54 +, Oganesyan, Vaheh 
 wrote:

>Hello ccp4-ers,
>
>A bit off topic (actually a lot off topic) question regarding a RAID array 
>system. On a Linux box one of two hard drives failed. I've found an identical one 
>and replaced it. Can someone point me in the direction where I can get 
>instructions on what to do next to be able to log in? Currently the computer starts 
>with no error messages, comes to the login window, takes the username and password 
>from me and then flashes and comes back to the login window.
>
>Thank you for your help.
>Regards,
>
>Vaheh Oganesyan, Ph.D.
>Scientist, Biologic Therapeutics
>
>AstraZeneca
>R&D | Antibody Discovery and Protein Engineering
>One Medimmune Way, Gaithersburg, MD 20878
>T:  301-398-5851
>vaheh.oganes...@astrazeneca.com
>
>
>
>
>





Re: [ccp4bb] raid array load question

2008-01-15 Thread Tim Gruene
Interesting and simple way to test the write performance. Simultaneous 
writes could then be tested by putting an ampersand ('&') at the end of 
the 'dd' command, couldn't they? And if you get tired of typing all the 
numbers, you could use the 'seq' command instead.


Cheers, Tim


/bin/tcsh
set time
foreach file ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 )
dd if=/dev/zero bs=2G count=1 of=/home/username/deleteme$file
end

-James Holton
MAD Scientist


Harry M. Greenblatt wrote:

BSD

To those hardware oriented:

  We have a compute cluster with 23 nodes (dual socket, dual core Intel 
servers).  Users run simulation jobs on the nodes from the head node.  At 
the end of each simulation, a result file is compressed to 2GB, and copied 
to the file server for the cluster (not the head node) via NFS.   Each node 
is connected via a Gigabit line to a switch.  The file server has a 4-link 
aggregated Ethernet trunk (4Gb/S) to the switch.  The file server also has 
two sockets, with Dual Core Xeon 2.1GHz CPU's and 4 GB of memory, running 
RH4.  There are two raid arrays (RAID 5), each consisting of 8x500GB SATA 
II WD server drives, with one file system on each.  The raid cards are AMCC 
3WARE  9550 and 9650SE (PCI-Express) with 256 MB of cache memory . 
When several (~10)  jobs finish at once, and the nodes start copying the 
compressed file to the file server, the load on the file server gets very 
high (~10), and the users whose home directory are on the file server 
cannot work at their stations.  Using nmon to locate the bottleneck, it 
appears that disk I/O is the problem.  But the numbers being reported are a 
bit strange.  It reports a throughput of only about 50MB/s, and claims the 
disk is 100% busy.  These raid cards should give throughput in the 
several hundred MB/s range, especially the 9650 which is rated at 600MB/s 
RAID 6 write (and we have RAID 5).


1)  Is there a more friendly system load monitoring tool we can use?

2)  The users may be able to stagger the output schedule of their jobs, but 
based on the numbers, we get the feeling the RAID arrays are not performing 
as they should.  Any suggestions?


Thanks

Harry


-

Harry M. Greenblatt

Staff Scientist

Dept of Structural Biology   [EMAIL PROTECTED]


Weizmann Institute of SciencePhone:  972-8-934-3625

Rehovot, 76100   Facsimile:   972-8-934-4159

Israel 







Re: [ccp4bb] raid array load question

2008-01-15 Thread James Holton
Woops!  Yes, of course you would want an ampersand in my little 
pseudo-script to background the dd jobs.  My mistake.  seq is also 
one of my favorite commands, but some systems are so stripped-down that 
they don't have it!
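
For what it's worth, a version with both suggestions folded in might look like the 
sketch below; the target path is a placeholder, and it writes each 2GB file as 2048 
1MB blocks so that the backgrounded dd processes don't each try to hold a 2GB 
buffer in memory:

#!/bin/bash
# write 23 x 2GB files in parallel and time the whole run (illustrative path)
time {
    for i in $(seq 1 23); do
        dd if=/dev/zero bs=1M count=2048 of=/home/username/deleteme$i &
    done
    wait    # let every background dd finish before the clock stops
}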


-James

Tim Gruene wrote:
Interesting and simple way to test the write performance. Simultaneous 
writes could then be tested by putting an ampersand ('&') at the end 
of the 'dd' command, couldn't they? And if you get tired of typing all 
the numbers, you could use the 'seq' command instead.


Cheers, Tim


/bin/tcsh
set time
foreach file ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 
22 23 )

dd if=/dev/zero bs=2G count=1 of=/home/username/deleteme$file
end

-James Holton
MAD Scientist


Harry M. Greenblatt wrote:

BSD

To those hardware oriented:

  We have a compute cluster with 23 nodes (dual socket, dual core 
Intel servers).  Users run simulation jobs on the nodes from the 
head node.  At the end of each simulation, a result file is 
compressed to 2GB, and copied to the file server for the cluster 
(not the head node) via NFS.   Each node is connected via a Gigabit 
line to a switch.  The file server has a 4-link aggregated Ethernet 
trunk (4Gb/S) to the switch.  The file server also has two sockets, 
with Dual Core Xeon 2.1GHz CPU's and 4 GB of memory, running RH4.  
There are two raid arrays (RAID 5), each consisting of 8x500GB SATA 
II WD server drives, with one file system on each.  The raid cards 
are AMCC 3WARE  9550 and 9650SE (PCI-Express) with 256 MB of cache 
memory . When several (~10)  jobs finish at once, and the nodes 
start copying the compressed file to the file server, the load on 
the file server gets very high (~10), and the users whose home 
directory are on the file server cannot work at their stations.  
Using nmon to locate the bottleneck, it appears that disk I/O is the 
problem.  But the numbers being reported are a bit strange.  It 
reports a throughput of only about 50MB/s, and claims the disk is 
100% busy.  These raid cards should give throughput in the several 
hundred MB/s range, especially the 9650 which is rated at 600MB/s 
RAID 6 write (and we have RAID 5).


1)  Is there a more friendly system load monitoring tool we can use?

2)  The users may be able to stagger the output schedule of their 
jobs, but based on the numbers, we get the feeling the RAID arrays 
are not performing as they should.  Any suggestions?


Thanks

Harry


- 



Harry M. Greenblatt

Staff Scientist

Dept of Structural Biology   [EMAIL PROTECTED]


Weizmann Institute of SciencePhone:  972-8-934-3625

Rehovot, 76100   Facsimile:   972-8-934-4159

Israel





[ccp4bb] raid array load comments

2008-01-15 Thread Harry M. Greenblatt

BSD

Thank you to all the respondents.

Some comments:

1.  Some believe that the write performance in RAID 5 is only as good  
as performance to one disk.  This is true only in RAID 3 (under  
certain conditions), where parity is written as a separate operation  
to one dedicated parity disk.  With distributed parity in RAID 5 this  
bottleneck is overcome, and write performance is better than single  
disk limits.


2.   These are SATA II drives which are rated at 3Gb/s, or 375MB/s;  
even if single-drive I/O limits were a problem, we are not reaching  
this value.


I think James Holton's suggestions are more on the mark.  We will try  
and investigate.


Thanks

Harry

On Jan 15, 2008, at 9:54 AM, James Holton wrote:

Ahh, there is nothing quite like a nice big cluster to bring any  
file server to its knees.





 
-

Harry M. Greenblatt
Staff Scientist
Dept of Structural Biology   [EMAIL PROTECTED]
Weizmann Institute of SciencePhone:  972-8-934-3625
Rehovot, 76100   Facsimile:   972-8-934-4159
Israel




[ccp4bb] raid array load question

2008-01-14 Thread Harry M. Greenblatt

BSD

To those hardware oriented:

  We have a compute cluster with 23 nodes (dual socket, dual core  
Intel servers).  Users run simulation jobs on the nodes from the head  
node.  At the end of each simulation, a result file is compressed to  
2GB, and copied to the file server for the cluster (not the head  
node) via NFS.   Each node is connected via a Gigabit line to a  
switch.  The file server has a 4-link aggregated Ethernet trunk (4Gb/ 
S) to the switch.  The file server also has two sockets, with Dual  
Core Xeon 2.1GHz CPU's and 4 GB of memory, running RH4.  There are  
two raid arrays (RAID 5), each consisting of 8x500GB SATA II WD  
server drives, with one file system on each.  The raid cards are AMCC  
3WARE  9550 and 9650SE (PCI-Express) with 256 MB of cache memory .


When several (~10)  jobs finish at once, and the nodes start copying  
the compressed file to the file server, the load on the file server  
gets very high (~10), and the users whose home directory are on the  
file server cannot work at their stations.  Using nmon to locate the  
bottleneck, it appears that disk I/O is the problem.  But the numbers  
being reported are a bit strange.  It reports a throughput of only  
about 50MB/s, and claims the disk is 100% busy.  These raid cards  
should give throughput in the several hundred MB/s range, especially  
the 9650 which is rated at 600MB/s RAID 6 write (and we have RAID 5).


1)  Is there a more friendly system load monitoring tool we can use?

2)  The users may be able to stagger the output schedule of their  
jobs, but based on the numbers, we get the feeling the RAID arrays  
are not performing as they should.  Any suggestions?


Thanks

Harry


 
-

Harry M. Greenblatt
Staff Scientist
Dept of Structural Biology   [EMAIL PROTECTED]
Weizmann Institute of SciencePhone:  972-8-934-3625
Rehovot, 76100   Facsimile:   972-8-934-4159
Israel




Re: [ccp4bb] raid array load question

2008-01-14 Thread Kay Diederichs

Harry M. Greenblatt wrote:

BSD

To those hardware oriented:

  We have a compute cluster with 23 nodes (dual socket, dual core Intel 
servers).  Users run simulation jobs on the nodes from the head node. 
 At the end of each simulation, a result file is compressed to 2GB, and 
copied to the file server for the cluster (not the head node) via NFS.   
Each node is connected via a Gigabit line to a switch.  The file server 
has a 4-link aggregated Ethernet trunk (4Gb/S) to the switch.  The file 
server also has two sockets, with Dual Core Xeon 2.1GHz CPU's and 4 GB 
of memory, running RH4.  There are two raid arrays (RAID 5), each 
consisting of 8x500GB SATA II WD server drives, with one file system on 
each.  The raid cards are AMCC 3WARE  9550 and 9650SE (PCI-Express) with 
256 MB of cache memory . 

When several (~10)  jobs finish at once, and the nodes start copying the 
compressed file to the file server, the load on the file server gets 
very high (~10), and the users whose home directory are on the file 
server cannot work at their stations.  Using nmon to locate the 
bottleneck, it appears that disk I/O is the problem.  But the numbers 
being reported are a bit strange.  It reports a throughput of only about 
50MB/s, and claims the disk is 100% busy.  These raid cards should 
give throughput in the several hundred MB/s range, especially the 9650 
which is rated at 600MB/s RAID 6 write (and we have RAID 5).


1)  Is there a more friendly system load monitoring tool we can use?

2)  The users may be able to stagger the output schedule of their jobs, 
but based on the numbers, we get the feeling the RAID arrays are not 
performing as they should.  Any suggestions?


Thanks

Harry


-

Harry M. Greenblatt

Staff Scientist

Dept of Structural Biology   [EMAIL PROTECTED]


Weizmann Institute of SciencePhone:  972-8-934-3625

Rehovot, 76100   Facsimile:   972-8-934-4159

Israel 






Harry,

to my understanding, the WRITE performance of RAID5 is no more than what 
a _single_ disk gives (essentially because almost the _same_ data have 
to be written to _all_ disks at the same time). This is different from 
the READ situation - here RAID5 should give (maybe much) more than a 
single disk.


Thus I don't find it surprising that your RAID5 write operation has 
only 50 MB/s. If you need more, you should use RAID0, or RAID10 (twice 
the number of disks compared to RAID0).


HTH,

Kay
--
Kay Diederichs                  http://strucbio.biologie.uni-konstanz.de
email: [EMAIL PROTECTED]        Tel +49 7531 88 4049 Fax 3183
Fachbereich Biologie, Universität Konstanz, Box M647, D-78457 Konstanz




Re: [ccp4bb] raid array load question

2008-01-14 Thread James Holton
Ahh, there is nothing quite like a nice big cluster to bring any file 
server to its knees.


My experience with cases like this is that the culprit is usually NFS, 
or the disk file system being used on the RAID array.  It is generally 
NOT a bandwidth problem.  Basically, I suspect the bottleneck is the 
kernel on your file server trying to figure out where to put all these 
blocks that are flying in from the cluster.  One of the most expensive 
things you can do to a file server is an NFS write.  This is the reason 
why NFS has the noatime option, since reading a file normally involves 
updating the access time for that file (and this is a write 
operation).  Alternatively, the disk file system itself (ext3?) can also 
get bogged down with many simultaneous writes.  XFS is supposed to be 
less prone to this problem, but I have heard mixed reviews.  In 
addition, writing a large number of files simultaneously seems to be a 
great way to fragment your file system.  I don't know why this is so, 
but I once used it as a protocol to create a small, heavily fragmented 
file system for testing purposes!  If the file system is fragmented, 
then access to it just starts getting slow.  No warnings, no CPU 
utilization, just really really long delays to get an ls.


That is my guess.  What I suggest doing is to come up with a series of 
tests that simulate this event at various stages.  That is, first use 
the timeless classic unix dd command to generate a crunch of 2GB files 
pseudo-simultaneously LOCALLY on the file server:

/bin/tcsh
set time
foreach file ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 )
dd if=/dev/zero bs=2G count=1 of=/home/username/deleteme$file
end
If this hoses the home directories, then you know the cluster and 
network have nothing to do with your problem.  You can do a 
quick-and-dirty benchmark of your RAID performance like this too.  
Setting the time variable above means you will get timing statistics 
for each command.  If you divide the total number of bytes by the total 
time, then you get the average write performance.  It will be 
interesting to play with the number of files as well as the division of 
the 2GB into blocks with the bs and count options.  Large block 
sizes will eat a lot of memory and small block sizes will eat a lot of 
CPU.  Setting your NFS block size to the maximum (in /etc/fstab) is 
generally a good idea for scientific computing.  Also, network blocks 
are typically 1500 bytes (the MTU size).  If this block size is a 
problem, then you might want to consider jumbo frames on your cluster 
subnet.
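
As an illustration of the /etc/fstab tuning mentioned above, a client-side NFS 
mount entry might look like the line below; the server name, export path and block 
sizes are placeholders, and the useful maximum for rsize/wsize depends on the 
kernel and NFS version:

   fileserver:/export/home  /home  nfs  rw,hard,intr,noatime,rsize=32768,wsize=32768  0 0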


However, I expect the best thing is to find a way to avoid a large 
number of simultaneous NFS writes.  Either use a different transfer 
protocol (such as rcp or nc), or use some kind of lock file to prevent 
your completing jobs from copying their files all at the same time.


HTH

-James Holton
MAD Scientist


Harry M. Greenblatt wrote:

BSD

To those hardware oriented:

  We have a compute cluster with 23 nodes (dual socket, dual core 
Intel servers).  Users run simulation jobs on the nodes from the head 
node.  At the end of each simulation, a result file is compressed to 
2GB, and copied to the file server for the cluster (not the head node) 
via NFS.   Each node is connected via a Gigabit line to a switch.  The 
file server has a 4-link aggregated Ethernet trunk (4Gb/S) to the 
switch.  The file server also has two sockets, with Dual Core Xeon 
2.1GHz CPU's and 4 GB of memory, running RH4.  There are two raid 
arrays (RAID 5), each consisting of 8x500GB SATA II WD server drives, 
with one file system on each.  The raid cards are AMCC 3WARE  9550 and 
9650SE (PCI-Express) with 256 MB of cache memory . 

When several (~10)  jobs finish at once, and the nodes start copying 
the compressed file to the file server, the load on the file server 
gets very high (~10), and the users whose home directory are on the 
file server cannot work at their stations.  Using nmon to locate the 
bottleneck, it appears that disk I/O is the problem.  But the numbers 
being reported are a bit strange.  It reports a throughput of only 
about 50MB/s, and claims the disk is 100% busy.  These raid cards 
should give throughput in the several hundred MB/s range, especially 
the 9650 which is rated at 600MB/s RAID 6 write (and we have RAID 5).


1)  Is there a more friendly system load monitoring tool we can use?

2)  The users may be able to stagger the output schedule of their 
jobs, but based on the numbers, we get the feeling the RAID arrays are 
not performing as they should.  Any suggestions?


Thanks

Harry


-

Harry M. Greenblatt

Staff Scientist

Dept of Structural Biology   [EMAIL PROTECTED]


Weizmann Institute of SciencePhone:  972-8-934-3625

Rehovot, 76100   Facsimile:   972-8-934-4159

Israel