Howdy y'all,

Earlier this month I downloaded and installed the latest copy of OpenSolaris 
(2008.05) so that I could test out some of the newer features I've heard so 
much about, primarily ZFS. 

My goal was to replace our aging Linux-based (SuSE 10.1) file and media server 
with a new machine running Sun's OpenSolaris and ZFS. Our old server ran your 
typical RAID5 setup with four 500GB disks (3 data, 1 parity), used LVM, mdadm, 
and XFS to help keep things in order, and relied on NFS to export users' 
shares. It was solid, stable, and worked wonderfully well.
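
For reference, the capacity arithmetic behind that layout can be sketched as below. This is only a rough illustration: the function name is made up for this post, and it ignores filesystem and metadata overhead. RAID5 actually spreads parity across all disks, so "1 parity" really means one disk's worth of space.

```python
def raid5_usable_gb(disk_count, disk_size_gb):
    """Approximate usable space of a single-parity array.

    Assumption: all disks are the same size, and exactly one disk's
    worth of space goes to parity (overhead is ignored).
    """
    if disk_count < 3:
        raise ValueError("single-parity RAID needs at least 3 disks")
    return (disk_count - 1) * disk_size_gb


# The old server: four 500GB disks.
print(raid5_usable_gb(4, 500))
```

So the old array offered roughly three disks' worth of space, and that is what I am trying to match on the new machine.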

I would like to replicate this experience using the tools OpenSolaris has to 
offer, taking advantage of ZFS. However, there are enough differences between 
the two OSes - especially with respect to the filesystems and (for lack of a 
better phrase) "RAID managers" - to cause me to consult (on numerous occasions) 
the likes of Google, these forums, and other places for help.

I've been successful in troubleshooting all problems up until now.

On our old media server (the SuSE 10.1 one), when a disk failed, the machine 
would send out an e-mail detailing the type of failure, and gracefully fall 
into a degraded state, but would otherwise continue to operate using the 
remaining 3 disks in the system. After the faulty disk was replaced, all of the 
data from the old disk would be replicated onto the new one (I think the term 
is "resilvered" around here?), and after a few hours, the RAID5 array would be 
seamlessly promoted from "degraded" back up to a healthy "clean" (or "online") 
state.

Throughout the entire process, there would be no interruptions to the end user: 
all NFS shares still remained mounted, there were no noticeable drops in I/O, 
files, directories, and any other user-created data still remained available, 
and if everything went smoothly, no one would notice a failure had even 
occurred.

I've tried my best to recreate something similar in OpenSolaris, but I'm stuck 
on making it all happen seamlessly.

For example, I have a standard beige box machine running OpenSolaris 2008.05 
with a zpool that contains 4 disks, similar to what the old SuSE 10.1 server had. 
However, whenever I unplug the SATA cable from one of the drives (to simulate a 
catastrophic drive failure) while doing moderate reading from the zpool (such 
as streaming HD video), not only does the video hang on the remote machine 
(which is accessing the zpool via NFS), but the server running OpenSolaris 
seems to either hang, or become incredibly unresponsive. 

And when I write unresponsive, I mean that when I type the command "zpool 
status" to see what's going on, the command hangs, followed by a frozen 
Terminal a few seconds later. After just a few more seconds, the entire GUI - 
mouse included - locks up or freezes, and all NFS shares become unavailable 
from the perspective of the remote machines. The whole machine locks up hard.
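
To at least keep a hung "zpool status" from taking my terminal with it, something like the following could wrap the command in a timeout. This is a diagnostic sketch, not a fix. The helper names and the 10-second limit are arbitrary choices of mine. I am also assuming the stuck process can actually be killed, and if it is blocked inside the kernel on the missing disk, the wrapper may hang too.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=10.0):
    """Run cmd and return its stdout, or None if it did not finish in time.

    Assumption: the child can be killed once the timeout expires. A
    process stuck in an uninterruptible kernel wait may not go away.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,  # collect stdout/stderr instead of printing
            text=True,            # decode bytes to str
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout


def check_pools(timeout_s=10.0):
    """Ask ZFS for pools with problems; report a hang instead of blocking."""
    out = run_with_timeout(["zpool", "status", "-x"], timeout_s)
    if out is None:
        return "zpool status did not answer within %.0f seconds" % timeout_s
    return out.strip()
```

Calling check_pools() from a cron job or a loop while the cable is unplugged would show whether the command ever returns, and how long it takes once the disk comes back.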

The machine then stays in this frozen state until I plug the hard disk back in, 
at which point everything, quite literally, pops back into existence all at 
once: the output of the "zpool status" command flies by (with all disks listed 
as "ONLINE" and all "READ," "WRITE," and "CKSUM," fields listed as "0"), the 
mouse jumps to a different part of the screen, the NFS share becomes available 
again, and the movie resumes right where it had left off.

While such a quick resume is encouraging, I'd like to avoid the freeze in the 
first place.

How can I keep any hardware failures like the above transparent to my users?

-Todd

PS: I've done some research, and while my problem is similar to the 
following:

http://opensolaris.org/jive/thread.jspa?messageID=151719&#151719
http://opensolaris.org/jive/thread.jspa?messageID=240481&#240481

most of these posts are quite old, and do not offer any solutions.

PPS: I know I haven't provided any details on hardware, but I feel like this is 
more likely a higher-level issue (like some sort of configuration file or 
setting is needed) rather than a lower-level one (like faulty hardware). 
However, if someone were to give me a command to run, I'd gladly do it... I'm 
just not sure which ones would be helpful, or if I even know which ones to run. 
It took me half an hour of searching just to find out how to list the disks 
installed in this system (it's "format") so that I could build my zpool in the 
first place. It's not quite as simple as writing out /dev/hda, /dev/hdb, 
/dev/hdc, /dev/hdd. ;)
 
 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
