Re: [zfs-discuss] I/O Read starvation
Hi, it seems you might have some kind of hardware issue there; I have no way of reproducing this.

Yours,
Markus Kovero

-----Original Message-----
From: bank kus
Sent: 10 January 2010 7:21
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] I/O Read starvation

Btw, FWIW, if I redo the dd + two-cp experiment on /tmp the result is far more disastrous: the GUI stops moving and Caps Lock stops responding for long intervals. No clue why.
[zfs-discuss] possible to remove a mirror pair from a zpool?
Suppose the requirements for storage shrink (it can happen). Is it possible to remove a mirror set from a zpool? Given this:

# zpool status array03
  pool: array03
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: resilver completed after 0h41m with 0 errors on Sat Jan 9 22:54:11 2010
config:

        NAME         STATE     READ WRITE CKSUM
        array03      ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t16d0  ONLINE       0     0     0
            c5t0d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t1d0   ONLINE       0     0     0
            c2t17d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t2d0   ONLINE       0     0     0
            c2t18d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t20d0  ONLINE       0     0     0
            c5t4d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t21d0  ONLINE       0     0     0
            c5t6d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t19d0  ONLINE       0     0     0
            c5t5d0   ONLINE       0     0     0
        spares
          c2t22d0    AVAIL

errors: No known data errors

Suppose I want to power down the disks c2t19d0 and c5t5d0 because they are not needed. One can easily picture a Thumper with many disks unused and see reasons why one would want to power off disks.

-- Dennis
Re: [zfs-discuss] possible to remove a mirror pair from a zpool?
No, sorry Dennis, this functionality doesn't exist yet. It is being worked on, but it will take a while; there are lots of corner cases to handle.

James Dickens
uadmin.blogspot.com

On Sun, Jan 10, 2010 at 3:23 AM, Dennis Clarke dcla...@blastwave.org wrote:
> Suppose the requirements for storage shrink (it can happen). Is it
> possible to remove a mirror set from a zpool? [...]
> Suppose I want to power down the disks c2t19d0 and c5t5d0 because they
> are not needed. One can easily picture a Thumper with many disks unused
> and see reasons why one would want to power off disks.
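While a whole top-level mirror vdev cannot be removed, one half of a mirror can be detached if the goal is only to spin down a single disk; a minimal sketch, at the cost of redundancy for that vdev:

  # detach one side of the last mirror; c2t19d0 stays in the pool, unmirrored
  zpool detach array03 c5t5d0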
Re: [zfs-discuss] I/O Read starvation
What version of Solaris / OpenSolaris are you using? Older versions use mmap(2) for reads in cp(1). Sadly, mmap(2) does not jibe well with ZFS. To be sure, you could check how your cp(1) is implemented using truss(1) (i.e. does it do mmap/write or read/write?).

(Aside: I find it interesting that ZFS's mmap(2) deficiencies are now dictating the implementation of utilities which may benefit from mmap(2) on other filesystems. And whilst some might argue that mmap(2) is dead for file I/O, I think it's interesting to note that Linux appears to have a relatively efficient mmap(2) implementation. Sadly, this means that some commercial apps which are mmap(2)-heavy currently perform much better on Linux than on Solaris, especially on ZFS. However, I doubt that Linux uses mmap(2) for reads in cp(1).)

You could also try using dd(1) instead of cp(1). However, it seems to me that you are using bs=1G count=8 as a lazy way to generate 8GB (because you don't want to do the math on smaller block sizes?). Did you know that you are asking dd(1) to do 1GB read(2) and write(2) system calls using a 1GB buffer? This will cause further pressure on the memory system. In performance terms, you'll probably find that block sizes beyond 128K add little benefit. So I'd suggest something like:

dd if=/dev/urandom of=largefile.txt bs=128k count=65536
dd if=largefile.txt of=./test/1.txt bs=128k
dd if=largefile.txt of=./test/2.txt bs=128k

Phil
http://harmanholistix.com

bank kus wrote:
> dd if=/dev/urandom of=largefile.txt bs=1G count=8
> cp largefile.txt ./test/1.txt
> cp largefile.txt ./test/2.txt
>
> That's it. Now the system is totally unusable after launching the two 8G
> copies. Until these copies finish, no other application is able to launch
> completely. Checking prstat shows them to be in the sleep state.
>
> Question: I'm guessing this is because ZFS doesn't use CFQ, and one
> process is allowed to queue up all its I/O reads ahead of other
> processes? Is there a concept of priority among I/O reads? I only ask
> because if root were to launch some GUI application, it doesn't start up
> until both copies are done. So there is no concept of priority? Needless
> to say, this problem does not exist on Linux 2.6...
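To check the cp(1) implementation as Phil suggests, a minimal truss invocation (a sketch; the trace-file name is arbitrary, and the syscall list just covers the two candidate code paths):

  # record only the syscalls that distinguish mmap/write from read/write
  truss -t open,open64,mmap,mmap64,read,write -o cp.truss \
      cp largefile.txt ./test/1.txt
  grep -c mmap cp.truss   # a nonzero count points at the mmap path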
Re: [zfs-discuss] possible to remove a mirror pair from a zpool?
> No, sorry Dennis, this functionality doesn't exist yet. It is being
> worked on, but it will take a while; there are lots of corner cases to
> handle.
>
> James Dickens
> uadmin.blogspot.com

1) Dammit.
2) Looks like I need to do a full offline backup and then restore to shrink a zpool.

As usual, thanks for always being there James.

Dennis
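For reference, a minimal sketch of that backup-and-restore route using replication streams (the target pool name is made up; assumes enough space on the destination and a quiesced source):

  zfs snapshot -r array03@migrate
  # -R sends all child datasets, snapshots and properties in one stream
  zfs send -R array03@migrate | zfs receive -F -d newpool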
Re: [zfs-discuss] zfs, raidz, spare and jbod
We had a similar problem on an Areca 1680. It was caused by a drive that didn't properly reset (took ~2 seconds each time, according to the drive tray's LED). Replacing the drive solved this problem, but then we hit another problem, which you can see in this thread:

http://opensolaris.org/jive/thread.jspa?threadID=121335&tstart=0

I'm curious whether you have a similar setup and encounter the same problems. How did you set up your pools? Please tell me if you have any luck setting the drives to pass-through.

Thanks,
Arnaud

On 09/01/10 14:26, Rob wrote:
> I have installed OpenSolaris build 129 on our server. It has 12 disks on
> an Areca 1130 controller, using the latest firmware. I have put all the
> disks in JBOD and am running them in raidz2. After a while the system
> hangs with:
>
>   arcmsr0: tran reset (level 0x1) called for target 4 lun 0
>   target reset not supported
>
> What can I do? I really want it to work! (I am going to set all the
> disks to pass-through Monday.)
Re: [zfs-discuss] I/O Read starvation
Hi Phil,

You make some interesting points here:

- Yes, bs=1G was a lazy thing.
- The GNU cp I'm using does __not__ appear to use mmap;
  open64, open64, read, write, close, close is the relevant sequence.
- Replacing cp with dd (128K * 64K) does not help; no new apps can be
  launched until the copies complete.

Regards
banks
Re: [zfs-discuss] I/O Read starvation
Hello again,

On Jan 10, 2010, at 5:39 AM, bank kus wrote:
> Hi Henrik,
>
> I have 16GB RAM on my system; on a system with less RAM, dd does cause
> problems, as I mentioned above. My __guess__ is that dd is probably
> sitting in some in-memory cache, since du -sh doesn't show the full file
> size until I do a sync.
>
> At this point I'm less looking for QA-type repro questions and/or
> speculation, and more looking for the ZFS design expectations. What is
> the expected behaviour: if one thread queues 100 reads and another
> thread comes along later with 50 reads, are those 50 reads
> __guaranteed__ to fall behind the first 100, or is timeslicing / fair
> share done between the two streams?
>
> Btw, this problem is pretty serious: with 3 users using the system, one
> of them initiating a large copy grinds the other 2 to a halt. Linux
> doesn't have this problem, and this is almost a switch-O/S moment for
> us, unfortunately :-(

Have you reproduced the problem without using /dev/urandom? I can only get this behavior when using dd from urandom, not when using files with cp, and not even with files and dd. This could then be related to the random driver spending kernel time in high-priority threads. So while I agree that this is not optimal, there is a huge difference in how bad it is: if it's urandom generation, there is no problem with copying files.

Since you also found that it's not related to ZFS (it also happens on tmpfs, and perhaps only with urandom?), we are on the wrong list. Please isolate the problem: if we can put aside any filesystem, then we are on the wrong list. I've added perf-discuss also.

Regards
Henrik
http://sparcv9.blogspot.com
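One way to take urandom out of the picture when reproducing, as Henrik asks (a sketch; mkfile ships with Solaris, file names as in the original experiment):

  # create the 8G source file without touching /dev/urandom
  mkfile 8g largefile.txt
  # then rerun the two copies
  cp largefile.txt ./test/1.txt
  cp largefile.txt ./test/2.txt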
[zfs-discuss] Repeating scrub does random fixes
I've been using a 5-disk raidz for years on an SXCE machine which I converted to OSOL. The only time I ever had ZFS problems in SXCE was with snv_120, which was fixed.

So, now I'm at OSOL snv_111b and I'm finding that scrub repairs errors on random disks. If I repeat the scrub, it will fix errors on other disks. Occasionally it runs cleanly. That it doesn't happen in a consistent manner makes me believe it's not hardware related.

fmdump reports only three types of errors:

ereport.fs.zfs.checksum
ereport.io.scsi.cmd.disk.tran
ereport.io.scsi.cmd.disk.recovered

The middle one seems to be the issue; I'd like to track down its source. Any docs on how to do this?

Thanks,
Gary
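A starting point for tracing those ereports back to a device (a sketch using the stock FMA and driver tools):

  # dump the FMA error log verbosely; each ereport carries the device path
  fmdump -eV | less
  # cross-check against per-device soft/hard/transport error counters
  iostat -En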
Re: [zfs-discuss] I/O Read starvation
On Sun, 10 Jan 2010, Phil Harman wrote:
> In performance terms, you'll probably find that block sizes beyond 128K
> add little benefit. So I'd suggest something like:
>
> dd if=/dev/urandom of=largefile.txt bs=128k count=65536
> dd if=largefile.txt of=./test/1.txt bs=128k
> dd if=largefile.txt of=./test/2.txt bs=128k

As an interesting aside, on my Solaris 10U8 system (plus a zfs IDR), dd (Solaris or GNU) does not produce the expected file size when using /dev/urandom as input:

% /bin/dd if=/dev/urandom of=largefile.txt bs=131072 count=65536
0+65536 records in
0+65536 records out
% ls -lh largefile.txt
-rw-r--r-- 1 bfriesen home 65M Jan 10 09:32 largefile.txt

% /opt/sfw/bin/dd if=/dev/urandom of=largefile.txt bs=131072 count=65536
0+65536 records in
0+65536 records out
68157440 bytes (68 MB) copied, 1.9741 seconds, 34.5 MB/s
% ls -lh largefile.txt
-rw-r--r-- 1 bfriesen home 65M Jan 10 09:33 largefile.txt
% df -h .
Filesystem                 Size  Used Avail Use% Mounted on
Sun_2540/zfstest/defaults  1.2T   66M  1.2T   1% /Sun_2540/zfstest/defaults

However:

% dd if=/dev/urandom of=largefile.txt bs=1024 count=8388608
8388608+0 records in
8388608+0 records out
8589934592 bytes (8.6 GB) copied, 255.06 seconds, 33.7 MB/s
% ls -lh largefile.txt
-rw-r--r-- 1 bfriesen home 8.0G Jan 10 09:40 largefile.txt

% dd if=/dev/urandom of=largefile.txt bs=8192 count=1048576
0+1048576 records in
0+1048576 records out
1090519040 bytes (1.1 GB) copied, 31.8846 seconds, 34.2 MB/s

It seems that on my system dd + /dev/urandom is willing to read 1K blocks from /dev/urandom, but with even 8K blocks the actual block size is getting truncated down (without warning), producing much less data than requested. Testing with /dev/zero produces different results:

% dd if=/dev/zero of=largefile.txt bs=8192 count=1048576
1048576+0 records in
1048576+0 records out
8589934592 bytes (8.6 GB) copied, 20.7434 seconds, 414 MB/s

WTF?

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] I/O Read starvation
Place a sync call after dd?
Re: [zfs-discuss] Repeating scrub does random fixes
On Sun, Jan 10, 2010 at 16:40, Gary Gendel g...@genashor.com wrote:
> I've been using a 5-disk raidz for years on an SXCE machine which I
> converted to OSOL. [...] So, now I'm at OSOL snv_111b and I'm finding
> that scrub repairs errors on random disks. If I repeat the scrub, it
> will fix errors on other disks. Occasionally it runs cleanly. That it
> doesn't happen in a consistent manner makes me believe it's not
> hardware related.

That is actually a good indication of hardware-related errors: software will do the same thing every time, but hardware errors are often random. And you are running an older version now; I would recommend an upgrade.
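For the record, the upgrade itself on an OpenSolaris image is short (a sketch; assumes the pkg publisher already points at the repository you want, and image-update creates a new boot environment you can fall back to):

  pkg image-update -nv   # dry run: show what would change
  pkg image-update -v    # actually update, into a new boot environment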
[zfs-discuss] ZFS cache flush ignored by certain devices ?
A very interesting thread (http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/) and some thinking about the design of SSDs led to an experiment I did with the Intel X25-M SSD.

The question was: is my data safe once it has reached the disk and has been committed to my application?

All transactional safety in ZFS requires the correct implementation of the synchronize-cache command (see http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg27264.html, where someone used OpenSolaris within VirtualBox, which by default ignores the cache flush command). Thus qualified hardware is VERY essential (also see http://www.snia.org/events/storage-developer2009/presentations/monday/JeffBonwick_zfs-What_Next-SDC09.pdf).

What I did (for an Intel X25-M G2, default settings = write cache on, and a Seagate SATA drive, ST3500418AS):

a) Create a pool
b) Create a program that opens a file synchronously and writes to the file. It also prints the latest record written successfully.
c) Pull the power of the SATA disk
d) Power cycle everything
e) Open the pool again and verify that the content of the file is the one that was committed to the application
e1) If it is the same - nice hardware
e2) If it is NOT the same - BAD hardware

What I found out was:

Intel X25-M G2:
- If I pull the power cable, much data is lost although committed to the app (some hundred records)
- If I pull the SATA cable, no data is lost

ST3500418AS:
- If I pull the power cable, almost no data is lost, but still the last write is lost (strange!)
- If I pull the SATA cable, no data is lost

Actually this result was partially expected. However, the one missing transaction on my SATA HDD (Seagate) is strange. Unfortunately I do not have enterprise SAS hardware handy to verify that my test procedure is correct. Maybe someone can run this test on a SAS test machine? (See script attached.)

--- Attachments ---

--- script (call it with script.pl --file /mypool/testfile) ---

#!/usr/bin/env perl
# Open a file with O_SYNC and keep rewriting a numbered record in place;
# after a power cut, the file shows the last record the drive really kept.
use Fcntl qw(:DEFAULT :flock SEEK_CUR SEEK_SET SEEK_END);   # for O_SYNC
use IO::File;
use Getopt::Long;

my $pool      = "disk";
my $mountroot = "/volumes";
my $file;          # default derived from --pool below unless --file is given
my $abort     = 0;
my $count     = 0;

GetOptions(
    "pool=s"          => \$pool,
    "testfile|file=s" => \$file,
    "count=i"         => \$count,
);
$file = "$mountroot/$pool/testfile" unless defined $file;

my $dir = $file;
$dir =~ s/[^\/]+$//g;

if (-e $file)  { print "ERROR: File $file already exists\n";     exit 1; }
if (! -d $dir) { print "ERROR: Directory $dir does not exist\n"; exit 1; }

# O_SYNC, O_CREAT
sysopen(FILE, $file, O_RDWR | O_CREAT | O_EXCL | O_SYNC)
    or die "ERROR Opening file $file: $!\n";

$SIG{INT} = sub { print "... signalling Abort ... (file: $file)\n"; $abort = 1; };
$| = 1;

my $lastok = undef;
my $i      = 0;
my $msg    = sprintf("This is round number %20s", $i);

while (!$abort) {
    $i++;
    last if ($count && $i > $count);
    $msg = sprintf("This is round number %20s", $i);
    sysseek(FILE, 0, SEEK_SET);
    print $msg;
    my $rc = syswrite FILE, $msg;
    if (!defined($rc)) {
        print " ERROR\n";
        print "ERROR While writing $msg\n";
        print "ERROR: $!\n";
        last;
    } else {
        print " DONE \n";
        $lastok = $msg;
    }
}
close(FILE);

print "\nTHE LAST MESSAGE WRITTEN to file $file was:\n\n\t\"$lastok\"\n\n";

Here are the logs of my tests:

1) Test the SATA SSD (Intel X25-M)
----------------------------------

.. start write.pl

This is round number 67482
This is round number 67483
This is round number 67484
This is round number 67485
This is round number 67486
This is round number 67487
This is round number 67488
This is round number 67489
This is round number 67490

( .. I pull the POWER CABLE of the SATA SSD .. )

.. I/O hangs ..
zpool status shows:

# zpool status -v
  pool: ssd
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-JQ
 scrub: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        ssd       UNAVAIL      0    11     0  insufficient replicas
          c3t5d0  UNAVAIL      3     2     0  cannot open

errors: Permanent errors have been detected in the following files:

        ssd:<0x0>
        /volumes/ssd/
        /volumes/ssd/testfile

... now I power cycled the machine and put back the power cable ... let's see the pool status:

  pool: ssd
 state: ONLINE
 scrub: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        ssd       ONLINE       0     0     0
          c3t5d0  ONLINE       0     0     0

errors: No known data errors

let's
Re: [zfs-discuss] I/O Read starvation
Hello Bob,

On Jan 10, 2010, at 4:54 PM, Bob Friesenhahn wrote:
> On Sun, 10 Jan 2010, Phil Harman wrote:
>> In performance terms, you'll probably find that block sizes beyond 128K
>> add little benefit. So I'd suggest something like:
>>
>> dd if=/dev/urandom of=largefile.txt bs=128k count=65536
>> dd if=largefile.txt of=./test/1.txt bs=128k
>> dd if=largefile.txt of=./test/2.txt bs=128k
>
> As an interesting aside, on my Solaris 10U8 system (plus a zfs IDR), dd
> (Solaris or GNU) does not produce the expected file size when using
> /dev/urandom as input:

Do you feel this is related to the filesystem? Is there any difference between putting the data in a file on ZFS and just throwing it away?

  dd if=/dev/urandom of=/dev/null bs=1048576k count=16

gives me a quite unresponsive system too.

Henrik
http://sparcv9.blogspot.com
Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
I managed to disable the write cache (I did not know of a tool on Solaris; however, hdadm from the EON NAS binary kit does the job).

Same power-disruption test with the Seagate HDD and write cache disabled:

r...@nexenta:/volumes# .sc/bin/hdadm write_cache display c3t5
c3t5 write_cache disabled

... pull power cable of the Seagate SATA disk ...

This is round number 4543 DONE
This is round number 4544 DONE
This is round number 4545 DONE
This is round number 4546 DONE
This is round number 4547 DONE
This is round number 4548 DONE
This is round number 4549 DONE
This is round number 4550 ... hangs here

... power cycle everything ...

node1:/mnt/disk# cat testfile
This is round number 4549

... this looks good. So disabling the write cache helps, but it really limits performance (not for synchronous writes, but for async ones).

Test with the Intel X25-M:

r...@nexenta:/volumes# hdadm write_cache off c3t5
c3t5 write_cache disabled
r...@nexenta:/volumes# hdadm write_cache display c3t5
c3t5 write_cache disabled

... pull SSD power cable ...

This is round number 9249 DONE
This is round number 9250 DONE
This is round number 9251 DONE
This is round number 9252 DONE
This is round number 9253 DONE
This is round number 9254 DONE
This is round number 9255 DONE
This is round number 9256 DONE
This is round number 9257 ... hangs here

... power cycle everything ... test:

node1:/mnt/ssd# cat testfile
This is round number 9256

So without a write cache the device works correctly. However, be warned: on boot the cache is enabled again:

Device    Serial        Vendor  Model             Rev   Temperature
------    ------        ------  -----             ---   -----------
c3t5d0p0  7200Y5160AGN  ATA     INTEL SSDSA2M160  02HD  255 C (491 F)

r...@nexenta:/volumes# hdadm write_cache display c3t5
c3t5 write_cache enabled

Question: does anyone know the impact of disabling the write cache on the write-amplification factor of the Intel SSDs? I would deploy the Intel X25-M only for mostly-read workloads anyway, so the performance impact of disabling the write cache can be ignored. However, if the life expectancy of the device goes down without a write cache (and it is MLC already!), bummer.

And another question: how can I permanently disable the write cache on the Intel X25-M SSDs?

Regards
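On the persistence question, one untested sketch is to reissue the hdadm command at boot from a legacy rc script (the script path and install location of hdadm are hypothetical; assumes the binary from the EON kit):

  #!/sbin/sh
  # /etc/rc3.d/S99wcache (hypothetical): re-disable the SSD write cache at
  # every boot, since the drive re-enables it on power-up.
  /usr/bin/hdadm write_cache off c3t5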
Re: [zfs-discuss] I/O Read starvation
On Sun, 10 Jan 2010, Henrik Johansson wrote:
>> As an interesting aside, on my Solaris 10U8 system (plus a zfs IDR), dd
>> (Solaris or GNU) does not produce the expected file size when using
>> /dev/urandom as input:
>
> Do you feel this is related to the filesystem? Is there any difference
> between putting the data in a file on ZFS and just throwing it away?

My guess is that it is due to the implementation of /dev/urandom. It seems to be blocked up at 1024 bytes, and 'dd' is just using that block size. It is interesting that OpenSolaris is different; this seems like a new bug in Solaris 10.

The /dev/random and /dev/urandom devices are rather special, since reading from them consumes a precious resource: entropy. Entropy is created based on other activities of the system which are expected to be random. Using up all the available entropy could dramatically slow down software which uses /dev/random, such as ssh or ssl. The /dev/random device will block completely when the system runs out of entropy.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] zfs, raidz, spare and jbod
Hello Arnaud,

Thanks for your reply.

We have a system (2 x Xeon 5410, Intel S5000PSL mobo and 8 GB memory) with 12 x 500 GB SATA disks on an Areca 1130 controller. rpool is a mirror over 2 disks; 8 disks are in raidz2, plus 1 spare. We have 2 aggregated links. Our goal is an ESX storage system; I am using iSCSI and NFS to serve space to our ESX 4.0 servers.

We can remove a disk with no problem. I can do a replace and the disk is resilvered. That works fine here. Our problem comes when we push the server a little bit harder! When we give the server a hard time (copy 60G+ of data or do some other stuff to give the system some load), it hangs. This happens after 5 minutes, or after 30 minutes, or later, but it hangs. Then we get the problems shown in the attached pictures.

I have also emailed Areca. I hope they can fix it.

Regards,
Rob
Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
Actually, the performance decrease when disabling the write cache on the SSD is approximately 3x (i.e. 66%).

Setup:

node1  = Linux client with open-iscsi
server = COMSTAR (cache=write through) + zvol (recordsize=8k, compression=off)

With the SSD's write cache disabled:

node1:/mnt/ssd# iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I

        Iozone: Performance Test of File I/O
                Version $Revision: 3.327 $
                Compiled for 32 bit mode.
                Build: linux

        Run began: Sun Jan 10 20:14:46 2010

        Include fsync in write timing
        Include close in write timing
        Record Size 8 KB
        File size set to 131072 KB
        SYNC Mode.
        O_DIRECT feature enabled
        Command line used: iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I
        Output is in Kbytes/sec
        Time Resolution = 0.02 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
        Min process = 2
        Max process = 2
        Throughput test with 2 processes
        Each process writes a 131072 Kbyte file in 8 Kbyte records

        Children see throughput for 2 initial writers  =    1324.45 KB/sec
        Parent sees throughput for 2 initial writers   =    1291.27 KB/sec
        Min throughput per process                     =     646.07 KB/sec
        Max throughput per process                     =     678.38 KB/sec
        Avg throughput per process                     =     662.23 KB/sec
        Min xfer                                       =  124832.00 KB

        Children see throughput for 2 rewriters        =    4360.29 KB/sec
        Parent sees throughput for 2 rewriters         =    4360.08 KB/sec
        Min throughput per process                     =    2158.82 KB/sec
        Max throughput per process                     =    2201.47 KB/sec
        Avg throughput per process                     =    2180.15 KB/sec
        Min xfer                                       =  128536.00 KB

        Children see throughput for 2 random readers   =   43930.41 KB/sec
        Parent sees throughput for 2 random readers    =   43914.01 KB/sec
        Min throughput per process                     =   21768.16 KB/sec
        Max throughput per process                     =   22162.25 KB/sec
        Avg throughput per process                     =   21965.21 KB/sec
        Min xfer                                       =  128760.00 KB

        Children see throughput for 2 random writers   =    5561.01 KB/sec
        Parent sees throughput for 2 random writers    =    5560.41 KB/sec
        Min throughput per process                     =    2780.37 KB/sec
        Max throughput per process                     =    2780.64 KB/sec
        Avg throughput per process                     =    2780.50 KB/sec
        Min xfer                                       =  131064.00 KB

With the SSD's write cache enabled:

node1:/mnt/ssd# iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I

        (same iozone settings; run began Sun Jan 10 20:22:14 2010)

        Min process = 2
        Max process = 2
        Throughput test with 2 processes
        Each process writes a 131072 Kbyte file in 8 Kbyte records

        Children see throughput for 2 initial writers  =    3387.15 KB/sec
        Parent sees throughput for 2 initial writers   =    3258.90 KB/sec
Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
On Sun, 10 Jan 2010, Lutz Schumann wrote:
> Talking about read performance... Assuming a reliable ZIL disk (cache
> flush = working): the ZIL can guarantee data integrity, but if the
> backend disks (aka pool disks) do not properly implement cache flush, a
> reliable ZIL device does not work around the bad backend disks, right?
> (Meaning: having a reliable ZIL + some MLC SSD with write cache enabled
> is not reliable in the end.)

As soon as there is more than one disk in the pool, it is necessary for cache flush to work, or else the devices may contain content from entirely different transaction groups, resulting in a scrambled pool. If you just had one disk which tended to ignore cache flush requests, then you should be OK as long as the disk writes the data in order. In that case any unwritten data would be lost, but the pool should not be lost. If the device ignores cache flush requests and writes data in some random order, then the pool is likely to eventually fail.

I think that zfs mirrors should be safer than raidz when faced with devices which fail to flush (it should be similar to the single-disk case), but only if there is one mirror pair. A scary thing about SSDs is that they may re-write old data while writing new data, which could result in corruption of the old data if the power fails while it is being re-written.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] I/O Read starvation
On Sun, Jan 10, 2010 at 09:54:56AM -0600, Bob Friesenhahn wrote:
> WTF?

urandom is a character device and is returning short reads (note the 0+n vs n+0 record counts). dd is not padding these out to the full block size (conv=sync) or making multiple reads to fill blocks (conv=fullblock).

Evidently the urandom device changed behaviour along the way with regard to producing/buffering additional requested data, possibly as a result of a changed source implementation that stretches entropy better/faster.

No bug here, just bad assumptions.

--
Dan.
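Applied to the earlier experiment, either flag makes dd produce full-sized files (a sketch; conv=fullblock is GNU dd only, while conv=sync pads short input blocks with NULs):

  # GNU dd: keep reading until each 128K input block is actually full
  dd if=/dev/urandom of=largefile.txt bs=128k count=65536 conv=fullblock
  # Solaris /bin/dd: pad each short read out to 128K with zero bytes
  dd if=/dev/urandom of=largefile.txt bs=128k count=65536 conv=sync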
Re: [zfs-discuss] I/O Read starvation
On Jan 8, 2010, at 7:49 PM, bank kus wrote:
> dd if=/dev/urandom of=largefile.txt bs=1G count=8
> cp largefile.txt ./test/1.txt
> cp largefile.txt ./test/2.txt
>
> That's it. Now the system is totally unusable after launching the two 8G
> copies. Until these copies finish, no other application is able to
> launch completely. Checking prstat shows them to be in the sleep state.
> [...]

What disk drivers are you using? IDE?
-- richard
Re: [zfs-discuss] Repeating scrub does random fixes
Mattias Pantzare wrote:
> That is a good indication of hardware-related errors: software will do
> the same thing every time, but hardware errors are often random. And you
> are running an older version now; I would recommend an upgrade.

I would have thought that too if it didn't start right after the switch from SXCE to OSOL.

As for an upgrade, I use the dev repository on my laptop, and I find that OSOL updates aren't nearly as stable as SXCE was. I tried for a bit, but always had to go back to 111b because something crucial broke. I was hoping to wait until the official release in March to let things stabilize. This is my main web/mail/file/etc. server and I don't really want to muck with it too much.

That said, I may take a gamble on upgrading as we get closer to the 2010.x release.

Gary
[zfs-discuss] ZFS, xVM and NFS file serving
I'm hoping someone has already come across this problem and solved it.

I'm using xVM to create 2 NFS fileservers. The Dom0 has a zpool with 8 zvols in it. 4 of the zvols are used for 4 DomUs; 2 of the DomUs are fileservers, and each fileserver attaches 2 more of the zvols for the user filespace. The user filespace is then exported via NFS. The zvols attached as the user filespace are formatted as UFS.

For the first 25 or 30 NFS clients that mount the exported filespace, performance is OK, but it then drops off dramatically: 90 seconds to do an 'ls -l'. I've done no tuning; everything is stock.

Would using ZFS for the user filespace rather than UFS help? Any other ideas for a) determining where the problem is, and b) improving the throughput?

Thanks
John Landamore
Dept Computer Science
University of Leicester UK
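A first-pass look at where the time goes, before any tuning (a sketch; run on a fileserver DomU while clients are slow):

  # server-side NFS operation counts and mix
  nfsstat -s
  # per-device latency and queueing on the attached zvols, 5-second samples
  iostat -xnz 5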
[zfs-discuss] ZFS Cache + ZIL on single SSD
Hi all,

Sorry for spamming your mailing list, but since I could not find a direct answer on the internet and in the archives, I'll give this a try!

I am building a ZFS filesystem to export iSCSI LUNs. Now I was wondering: does the L2ARC have the ability to cache non-filesystem iSCSI LUNs? Or does it only work in combination with a mounted ZFS filesystem?

Next to that, I am reading about all kinds of performance benefits from using separate devices for the ZIL (write) and the cache (read). I was wondering if I could share a single SSD between both the ZIL and the cache device? Or is this not recommended? This because the ZIL needs 1 GB tops, from what I read, but since SSDs are not cheap, I would like to make use of the other GBs on the disk.

Thank you.

Armand
Re: [zfs-discuss] ZFS Cache + ZIL on single SSD
> Next to that, I am reading about all kinds of performance benefits from
> using separate devices for the ZIL (write) and the cache (read). I was
> wondering if I could share a single SSD between both the ZIL and the
> cache device? Or is this not recommended?

I asked something similar recently. The answers I got were along these lines: you can use a single SSD, but it's not a great idea. If you DO need to use a single SSD for such a thing, make sure it's one of the more expensive SLC variety, like the Intel X25-E. The MLC variety of SSD works well for the L2ARC and is much cheaper (you can pick some up for less than 100 bucks), while the ZIL really should have the SLC variety. I'm not an expert though; I'm just passing on advice I've been given.

For reads and dedup the L2ARC does make a dramatic difference, and for NFS and database workloads an SSD ZIL will make a huge difference. I've heard of people getting a 5-10x performance increase on reads just by adding a cheap SSD, so I'd say it's worth it if that's the type of dataset you have.

Another thing I was told on more than one occasion is that you may not even NEED an SSD for your ZIL (basically you should run a script to see if you need a separate ZIL at all; I don't remember where this script is, but I'm SURE someone will reply with it).
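Mechanically, sharing one SSD is just a matter of slicing it and adding the slices separately (a sketch; the pool and device names are made up, and the slices would first be laid out with format(1M)):

  # a few GB as the separate log device (ZIL)
  zpool add tank log c4t0d0s0
  # the remainder as L2ARC
  zpool add tank cache c4t0d0s1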
Re: [zfs-discuss] ZFS Cache + ZIL on single SSD
Hi;

I don't think anyone owns the list, and like anyone else you are very welcome to ask any question.

The L2ARC caches the zpool, so if your iSCSI LUN is a zvol or a file on ZFS, it will be cached. Please use COMSTAR if you need performance.

You are correct that you will only need a couple of GB as ZIL. The performance gain from using a portion of an SSD for L2ARC will depend on your pool configuration and your average load mix.

A side note: use RAID-10 if you are using ZFS and iSCSI and need random IOPS.

Best regards
Mertol

Sent from a mobile device
Mertol Ozyoney

On 11 Jan 2010, at 01:05, A. Krijgsman a.krijgs...@draftsman.nl wrote:
> Hi all,
>
> Sorry for spamming your mailing list, but since I could not find a
> direct answer on the internet and in the archives, I'll give this a
> try! [...]