Re: [zfs-discuss] ZFS performance falls off a cliff
~# uname -a
SunOS nas01a 5.11 oi_147 i86pc i386 i86pc Solaris

~# zfs get version pool0
NAME   PROPERTY  VALUE  SOURCE
pool0  version   5      -

~# zpool get version pool0
NAME   PROPERTY  VALUE  SOURCE
pool0  version   28     default

-- This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance problem suggestions?
This is a slow operation which can only be done about 180-250 times per second for very random I/Os (perhaps more with HDD/controller caching, queuing, and faster spindles). I'm afraid that seeking to very dispersed metadata blocks, such as traversing the tree during a scrub on a fragmented drive, may qualify as very random I/O.

And that's the thing- I would understand if my scrub was slow because the disks were being hammered by IOPS, but- all joking aside- my pool is almost entirely idle according to iostat -xn.
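As a back-of-envelope check, that seek rate bounds scrub throughput whenever every metadata read costs a full seek. A minimal sketch (the 180-250 IOPS range is from the figure quoted above; the ~4 KiB per metadata read is an assumed value for illustration only):

```python
def seek_bound_throughput(seeks_per_sec: float, bytes_per_seek: int) -> float:
    """Upper bound on scrub throughput (bytes/s) when each read is one seek."""
    return seeks_per_sec * bytes_per_seek

# 180-250 random IOPS per spindle (quoted above); assume ~4 KiB per metadata
# read -- a hypothetical block size, not a measured one.
low = seek_bound_throughput(180, 4096)
high = seek_bound_throughput(250, 4096)
print(f"{low / 1024:.0f}-{high / 1024:.0f} KiB/s")  # prints "720-1000 KiB/s"
```

That lands in the same order of magnitude as the 400K/s scrub rate reported later in this thread, which is why a seek-bound scrub can look "idle" in iostat: the disks spend their time seeking, not transferring.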
Re: [zfs-discuss] Modify stmf_sbd_lu properties
I can't actually disable the STMF framework to do this, but I can try renaming things and dumping the properties from one device to another to see if it works- it might actually do the trick. I will let you know.
Re: [zfs-discuss] Performance problem suggestions?
It sent a series of blocks to write from the queue; the newer disks wrote them and sat dormant, while the older disks seeked around to fit that piece of data... When the old disks completed the writes, ZFS batched them a new set of tasks.

The thing is- as far as I know, the OS doesn't ask the disk to find a place to fit the data. Instead, the OS tracks what space on the disk is free and then tells the disk where to write the data. Even if ZFS were waiting for the I/O to complete, I would expect to see that delay reflected in the disk service times. In our case we see no high service times, no busy disks, nothing. It seems like ZFS is just sitting there quietly, thinking to itself. If the processor were busy that might make sense, but even there our processor seems largely idle. At the same time, even a scrub on this system is a joke right now, and that's a read-intensive operation. I'm seeing a scrub speed of 400K/s but almost no I/Os to my disks.
Re: [zfs-discuss] Modify stmf_sbd_lu properties
It turns out this was actually as simple as:

stmfadm create-lu -p guid=XXX..

I kept looking at modify-lu to change this and never thought to check the create-lu options. Thanks to Evaldas for the suggestion.
Re: [zfs-discuss] Performance problem suggestions?
I've been going through my iostat, zilstat, and other outputs, all to no avail. None of my disks ever seem to show outrageous service times, the load on the box is never high, and if the darned thing is CPU bound, I'm not even sure where to look.

(traversing DDT blocks even if in memory, etc - and kernel times indeed are above 50%) as I'm zeroing deleted blocks inside the internal pool. This took several days already, but recovered lots of space in my main pool also...

When you say you are zeroing deleted blocks- how are you going about doing that?

Despite claims to the contrary, I can understand ZFS needing some tuning. What I can't understand are the baffling differences in performance I see. For example, after deleting a large volume my performance will suddenly skyrocket, then gradually degrade- but the question is why? I'm not running dedup. My disks seem to be largely idle. I have 8 3GHz cores that also seem to be idle. I seem to have enough memory. What is ZFS doing during this time?

Everything I've read suggests one of two possible causes: too full, or bad hardware. Is there anything else that might be an issue here? Another ZFS factor I haven't taken into account? Space seems to be the biggest factor in my performance difference- more free space = more performance- but as my fullest disks are less than 70% full, and my emptiest disks are less than 10% full, I can't understand why space is an issue.

I have a few hardware errors on one of my pool disks, but we're talking about a very small number of errors over a long period of time. I'm considering replacing this disk, but the pool is so slow at times that I'm loath to slow it down further by doing a replace unless I can be more certain that will fix the problem.
Re: [zfs-discuss] Performance problem suggestions?
# dd if=/dev/zero of=/dcpool/nodedup/bigzerofile

Ahh- I misunderstood your pool layout earlier. Now I see what you were doing.

People on this forum have seen and reported that adding a 100Mb file tanked their multiterabyte pool's performance, and removing the file boosted it back up.

Sadly, I think several of those posts were mine or those of coworkers.

Disks that have been in use for a longer time may have very fragmented free space on one hand, and not so much of it on the other, but ZFS is still trying to push bits around evenly. And while it's waiting on some disks, others may be blocked as well. Something like that...

This could explain why performance would go up after a large delete, but I've not seen large wait times for any of my disks. The service time, percent busy, and every other metric continue to show nearly idle disks. If this is the problem, it would be nice if there were a simple zfs or dtrace query that would show it to you.
Re: [zfs-discuss] zfs send receive problem/questions
Try using the -d option to zfs receive. The ability to do zfs send -R ... | zfs receive [without -d] was added relatively recently, and you may be encountering a bug that is specific to receiving a send of a whole pool.

I just tried this; it didn't work, with a new error:

# zfs send -R naspool/open...@xfer-11292010 | zfs recv -d npool/openbsd
cannot receive new filesystem stream: out of space

The destination pool is much larger (by several TB) than the source pool, so I don't see how it can be out of space:

# zfs list -r npool/openbsd
NAME                                                     USED  AVAIL  REFER  MOUNTPOINT
npool/openbsd                                           82.5G  7.18T  23.5G  /npool/openbsd
npool/open...@xfer-11292010                                 0      -  23.5G  -
npool/openbsd/openbsd                                   59.0G  7.18T  23.5G  /npool/openbsd/openbsd
npool/openbsd/open...@xfer-11292010                         0      -  23.5G  -
npool/openbsd/openbsd/4.5                               22.3G  7.18T  1.54G  /npool/openbsd/openbsd/4.5
npool/openbsd/openbsd/4...@xfer-11292010                    0      -  1.54G  -
npool/openbsd/openbsd/4.5/packages                      18.7G  7.18T  18.7G  /npool/openbsd/openbsd/4.5/packages
npool/openbsd/openbsd/4.5/packa...@xfer-11292010            0      -  18.7G  -
npool/openbsd/openbsd/4.5/packages-local                49.7K  7.18T  49.7K  /npool/openbsd/openbsd/4.5/packages-local
npool/openbsd/openbsd/4.5/packages-lo...@xfer-11292010      0      -  49.7K  -
npool/openbsd/openbsd/4.5/ports                          288M  7.18T   259M  /npool/openbsd/openbsd/4.5/ports
npool/openbsd/openbsd/4.5/po...@patch000                47.2K      -  49.7K  -
npool/openbsd/openbsd/4.5/po...@patch005                29.0M      -   261M  -
npool/openbsd/openbsd/4.5/po...@xfer-11292010               0      -   259M  -
npool/openbsd/openbsd/4.5/release                        462M  7.18T   462M  /npool/openbsd/openbsd/4.5/release
npool/openbsd/openbsd/4.5/rele...@xfer-11292010             0      -   462M  -
npool/openbsd/openbsd/4.5/src                            728M  7.18T   703M  /npool/openbsd/openbsd/4.5/src
npool/openbsd/openbsd/4.5/s...@patch000                 47.2K      -  49.7K  -
npool/openbsd/openbsd/4.5/s...@patch005                 25.1M      -   709M  -
npool/openbsd/openbsd/4.5/s...@xfer-11292010                0      -   703M  -
npool/openbsd/openbsd/4.5/xenocara                       572M  7.18T   565M  /npool/openbsd/openbsd/4.5/xenocara
npool/openbsd/openbsd/4.5/xenoc...@patch000             47.2K      -  49.7K  -
npool/openbsd/openbsd/4.5/xenoc...@patch005             6.52M      -   565M  -
npool/openbsd/openbsd/4.5/xenoc...@xfer-11292010            0      -   565M  -
npool/openbsd/openbsd/4.8                               13.2G  7.18T   413M  /npool/openbsd/openbsd/4.8
npool/openbsd/openbsd/4...@xfer-11292010                    0      -   413M  -
npool/openbsd/openbsd/4.8/packages                      11.9G  7.18T  11.9G  /npool/openbsd/openbsd/4.8/packages
npool/openbsd/openbsd/4.8/packa...@xfer-11292010            0      -  11.9G  -
npool/openbsd/openbsd/4.8/packages-local                49.7K  7.18T  49.7K  /npool/openbsd/openbsd/4.8/packages-local
npool/openbsd/openbsd/4.8/packages-lo...@xfer-11292010      0      -  49.7K  -
npool/openbsd/openbsd/4.8/ports                          277M  7.18T   277M  /npool/openbsd/openbsd/4.8/ports
npool/openbsd/openbsd/4.8/po...@patch000                47.2K      -  49.7K  -
npool/openbsd/openbsd/4.8/po...@xfer-11292010               0      -   277M  -
npool/openbsd/openbsd/4.8/release                        577M  7.18T   577M  /npool/openbsd/openbsd/4.8/release
npool/openbsd/openbsd/4.8/rele...@xfer-11292010             0      -   577M  -
npool/openbsd/openbsd/4.8/src                           96.9K  7.18T  49.7K  /npool/openbsd/openbsd/4.8/src
npool/openbsd/openbsd/4.8/s...@patch000                 47.2K      -  49.7K  -
npool/openbsd/openbsd/4.8/s...@xfer-11292010                0      -  49.7K  -
npool/openbsd/openbsd/4.8/xenocara                      96.9K  7.18T  49.7K  /npool/openbsd/openbsd/4.8/xenocara
npool/openbsd/openbsd/4.8/xenoc...@patch000             47.2K      -  49.7K  -
npool/openbsd/openbsd/4.8/xenoc...@xfer-11292010            0      -  49.7K  -
[zfs-discuss] zfs send receive problem/questions
Hello, I am attempting to move a bunch of ZFS filesystems from one pool to another. Mostly this is working fine, but one collection of filesystems is causing me problems, and repeated re-reading of man zfs and the ZFS Administration Guide is not helping. I would really appreciate some help/advice.

Here is the scenario. I have a nested hierarchy of ZFS filesystems; some of the deeper filesystems are snapshotted. All of this exists on the source zpool. First I recursively snapshotted the whole subtree:

# zfs snapshot -r nasp...@xfer-11292010

Here is a subset of the source zpool:

# zfs list -r naspool
NAME                                       USED  AVAIL  REFER  MOUNTPOINT
naspool                                   1.74T  42.4G  37.4K  /naspool
nasp...@xfer-11292010                         0      -  37.4K  -
naspool/openbsd                            113G  42.4G  23.3G  /naspool/openbsd
naspool/open...@xfer-11292010                 0      -  23.3G  -
naspool/openbsd/4.4                       21.6G  42.4G  2.33G  /naspool/openbsd/4.4
naspool/openbsd/4...@xfer-11292010            0      -  2.33G  -
naspool/openbsd/4.4/ports                  592M  42.4G   200M  /naspool/openbsd/4.4/ports
naspool/openbsd/4.4/po...@patch000        52.5M      -   169M  -
naspool/openbsd/4.4/po...@patch006        54.7M      -   194M  -
naspool/openbsd/4.4/po...@patch007        54.9M      -   194M  -
naspool/openbsd/4.4/po...@patch013        55.1M      -   194M  -
naspool/openbsd/4.4/po...@patch016        35.1M      -   200M  -
naspool/openbsd/4.4/po...@xfer-11292010       0      -   200M  -

Now I want to send this whole hierarchy to a new pool:

# zfs create npool/openbsd
# zfs send -R naspool/open...@xfer-11292010 | zfs receive -Fv npool/openbsd
receiving full stream of naspool/open...@xfer-11292010 into npool/open...@xfer-11292010
received 23.5GB stream in 883 seconds (27.3MB/sec)
cannot receive new filesystem stream: destination has snapshots (eg. npool/open...@xfer-11292010)
must destroy them to overwrite it

What am I doing wrong? What is the proper way to accomplish my goal here?

And I have a follow-up question: I had to snapshot the source zpool filesystems in order to zfs send them. Once they are received on the new zpool, I really don't need nor want this snapshot on the receiving side.
Is it OK to zfs destroy that snapshot? I've been pounding my head against this problem for a couple of days, and I would definitely appreciate any tips/pointers/advice. Don
Re: [zfs-discuss] zfs send receive problem/questions
Here is some more info on my system: this machine is running Solaris 10 U9, with all the patches as of 11/10/2010. The source zpool I am attempting to transfer from was originally created on an older OpenSolaris (specifically Nevada) release, I think build 111. I did a zpool export on that zpool, physically transferred those drives to the new machine, did a zpool import, and then upgraded the ZFS version on the imported zpool, now:

# zpool upgrade
This system is currently running ZFS pool version 22.
All pools are formatted using this version.

The reference to OpenBSD in the directory paths in the listings I provided refers only to the data that is stored therein; the actual OS I am running here is Solaris 10.

# zpool status naspool npool
  pool: naspool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        naspool     ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0

errors: No known data errors

  pool: npool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        npool       ONLINE       0     0     0
          raidz3-0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     0

errors: No known data errors
[zfs-discuss] Resizing ZFS block devices and sbdadm
sbdadm can be used with a regular ZFS file or a ZFS block device. Is there an advantage to using a ZFS block device and exporting it to COMSTAR via sbdadm, as opposed to using a file and exporting it? (e.g. performance or manageability?)

Also, let's say you have a 5G block device called pool/test. You can resize it by doing:

zfs set volsize=10G pool/test

However, if the device was already imported into COMSTAR, then stmfadm list-lu -v guid will still only report the original 5G block size. You can use:

sbdadm modify-lu -s 10G /path_to_block_device

but I'm not sure if there is a chance you might run into a size difference between ZFS and sbd. i.e., if I specify 10G in ZFS, and I do an sbdadm modify-lu -s 10G, is there any chance they won't align and I'll try to write past the end of the zvol?

Thanks in advance-
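One plausible source of the skew being asked about is unit interpretation, since zfs parses "10G" as gibibytes (2^30 bytes). A toy comparison, assuming a hypothetical tool that read "G" as decimal 10^9 bytes instead (I am not asserting sbdadm does this- it's just the kind of mismatch worth checking for before writing past the end of the zvol):

```python
GIB = 2**30  # zfs interprets "10G" as gibibytes
GB = 10**9   # a tool using decimal gigabytes would mean this instead

volsize = 10 * GIB   # zfs set volsize=10G
lu_size = 10 * GB    # hypothetical decimal reading of "-s 10G"

shortfall = volsize - lu_size
print(f"{shortfall} bytes = {shortfall / 2**20:.0f} MiB")  # ~703 MiB of skew
```

If both tools agree on binary units the sizes match exactly; comparing `stmfadm list-lu -v` byte counts against `zfs get -p volsize` on a live system would settle it.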
Re: [zfs-discuss] zpool lockup and dedupratio meanings
I've previously posted about some lockups I've experienced with ZFS. There were two suspected causes at the time: one was deduplication, and one was the 2009.06 code we were running. After upgrading the zpools and adding some more disks to the pool, I initiated a zpool scrub and was rewarded with an immediate ZFS lockup. I switched to my backup head, killed and restarted the scrub, and- poof- lockup.

Anyone have any ideas why a scrub would lock up my pool? The system itself and the root pool have no problems. The lockup occurs whether I try to write directly to the pool from the system or to the pool via COMSTAR.
[zfs-discuss] ZFS performance questions
I have an OpenSolaris (technically OI 147) box running ZFS with COMSTAR (zpool version 28, zfs version 5). The box is a 2950 with 32 GB of RAM and a Dell SAS5/e card connected to 6 Promise VTrak J610sD (dual controller SAS) disk shelves spread across both channels of the card (2 chains of 3 shelves). We currently have:

4 x OCZ Vertex 2 SSDs configured as a ZIL (we've been experimenting without a dedicated ZIL, with 2 mirrors, and with 4 individual drives- these are not meant to be a permanent part of the array- they were installed to evaluate limited SSD benefits)
2 x 300GB 15k RPM hot spare drives- one on each channel
2 x 600GB 15k RPM hot spare drives- one on each channel
52 x 300GB 15k RPM disks configured as 4-disk RAIDz (13 vdevs)
20 x 600GB 15k RPM disks configured as 4-disk RAIDz (5 vdevs)
(Eventually there will be 16 more 600GB disks- 4 more vdevs for a total of 22 vdevs)

Most of our disk access is through COMSTAR via iSCSI. That said, even performance tests direct to the local disks reveal good, but not great, performance. Most of our sequential write performance tests show about 200 MB/sec to the storage, which seems pretty low given the disks and their individual performance. I'd love to have configured the disks as mirrors, but I needed a minimum of 20 TB in the space provided and I could not achieve that when using mirrors.

Can anyone provide a link to good performance analysis resources so I can try to track down where my limited write performance is coming from?
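For a rough sense of whether 200 MB/sec is in fact low, the streaming-write ceiling of a RAIDz pool is roughly (disks per vdev - 1) data spindles per vdev times the per-disk rate, summed over vdevs. A sketch with an assumed ~100 MB/s sustained per 15k spindle (a hypothetical figure, not a measurement of these drives):

```python
def raidz_stream_ceiling(vdevs: int, disks_per_vdev: int, mb_per_disk: float) -> float:
    """Rough sequential-write ceiling in MB/s for single-parity RAIDz:
    each vdev contributes (disks_per_vdev - 1) data spindles."""
    return vdevs * (disks_per_vdev - 1) * mb_per_disk

# 13 x 4-disk RAIDz (300GB) + 5 x 4-disk RAIDz (600GB) = 18 vdevs total
ceiling = raidz_stream_ceiling(18, 4, 100.0)
print(ceiling, "MB/s")  # 5400.0 MB/s
```

Even with generous derating, observed throughput at ~4% of that ceiling suggests the bottleneck is somewhere other than the spindles themselves- HBA bandwidth, the shelf daisy-chains, or sync-write semantics are the usual suspects.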
[zfs-discuss] zpool lockup and dedupratio meanings
I've previously posted about some lockups I've experienced with ZFS. There were two suspected causes at the time: one was deduplication, and one was the 2009.06 code we were running. I upgraded to OpenIndiana 147 (initially without upgrading the zpool and zfs disk versions). The lockups reduced in frequency but still occurred. I've since upgraded the zpool and zfs versions, and we'll see what happens. Dedup was the more likely cause, so we turned it off and recreated all the iSCSI LUNs that were being exported, so as to eliminate the deduplicated data. That almost entirely eliminated the lockups.

Having said all that, I have two questions. First: when I query for the dedupratio, I still see a value of 2.37x:

r...@nas:~# zpool get dedupratio pool0
NAME   PROPERTY    VALUE  SOURCE
pool0  dedupratio  2.37x  -

Considering that all of the iSCSI LUNs that were created when dedup was on were subsequently deleted and recreated with dedup disabled, I don't understand why the value is still 2.37x. It should be near zero (there are probably a couple of small LUNs that were not removed, but they are rarely used). Am I misinterpreting the meaning of this number?

Second question: the most recent pool lockup was caused when a zpool scrub was kicked off. Initially we see 0 values for the write bandwidth in zpool iostat and average numbers for the read. After a few minutes we see the read numbers jump to several hundred megs/second and the write performance fluctuate between 0 and a few kilobytes/second. Has anyone else seen this behavior who can provide some insight?
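On the first question: as I understand it, dedupratio is a ratio of referenced bytes to allocated bytes for deduplicated data, so its floor is 1.00x rather than zero, and it only covers blocks still tracked in the dedup table. A toy model of that accounting (a deliberate simplification, not the actual ZFS code):

```python
def dedupratio(referenced: int, allocated: int) -> float:
    """dedupratio as referenced/allocated; 1.00x means no dedup savings.
    Byte counts are over deduped blocks only, not the whole pool."""
    return referenced / allocated if allocated else 1.0

# A couple of small leftover deduped LUNs can keep the ratio high even after
# the big deduped LUNs are destroyed, because the ratio is computed over the
# (now tiny) remaining deduped data, not over pool size.
print(round(dedupratio(237, 100), 2))  # 2.37
print(round(dedupratio(100, 100), 2))  # 1.0 -- no duplicates left
```

Under this reading the 2.37x is consistent with the few small deduped LUNs that were never recreated; the number would be expected to drop to 1.00x only once every deduped block is gone.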
Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?
Now, if someone would make a battery FOB that gives a broken SSD 60 seconds of power, then we could use consumer SSDs in servers again with real value instead of CYA value.

You know, it would probably be sufficient to provide the SSD with _just_ a big capacitor bank. If the host lost power it would stop writing, and if the SSD still had power it would probably use the idle time to flush its buffers. Then there would be world peace! Yeah- got a little carried away there. Still, this seems like an experiment I'm going to have to try on my home server, out of curiosity more than anything else :)
Re: [zfs-discuss] New SSD options
I just spoke with a co-worker about doing something about it. He says he can design a small in-line UPS that will deliver 20-30 seconds of 3.3V, 5V, and 12V to the SATA power connector for about $50 in parts. It would be even less if only one voltage was needed. That should be enough for most any SSD to finish any pending writes.

Oh, I wasn't kidding when I said I was going to have to try this with my home server. I actually do some circuit board design, and this would be an amusing project. All you probably need is 5V- I'll look into it.
Re: [zfs-discuss] New SSD options
The SATA power connector supplies 3.3, 5 and 12V. A complete solution will have all three. Most drives use just the 5V, so you can probably ignore 3.3V and 12V.

I'm not interested in building something that's going to work for every possible drive config- just my config :) Both the Intel X25-E and the OCZ use only the 5V rail.

You'll need to use a step-up DC-DC converter and be able to supply ~100mA at 5V. It's actually easier/cheaper to use a LiPoly battery charger and get a few minutes of power than to use an ultracap for a few seconds of power. Most ultracaps are ~2.5V and LiPoly is 3.7V, so you'll need a step-up converter in either case.

Ultracapacitors are available in voltage ratings beyond 12 volts, so there is no reason to use a boost converter with them. That eliminates high-frequency switching transients right next to our SSD, which is always helpful. In this case we have lots of room: a 3.5 x 1 drive bay, but a 2.5 x 1/4 hard drive. There is ample room for several of the 6.3V ELNA 1F capacitors (and our SATA power rail is a 5V regulated rail, so they should suffice)- either in series or parallel, depending on voltage or runtime requirements.

http://www.elna.co.jp/en/capacitor/double_layer/catalog/pdf/dk_e.pdf

You could put 2 caps in series for better voltage tolerance, or in parallel for longer runtimes. Either way you probably don't need a charge controller, a boost or buck converter, or in fact any ICs at all. It's just a small board with some caps on it.

Cost for a 5V-only system should be $30-$35 in one-off prototype-ready components with a 1100mAh battery (using prices from Sparkfun.com).

You could literally split a SATA cable and add in some capacitors for just the cost of the caps themselves. The issue there is whether the caps would present too large a current drain on initial charge-up. If they do, then you need to add in charge controllers and you've got the same problems as with a LiPo battery- although without the shorter service life.

At the end of the day, the real problem is whether we believe the drives themselves will actually use the quiet period on the now-dead bus to write out their caches. This is something we should ask the manufacturers, and test for ourselves.
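The hold-up time such a cap bank buys is easy to estimate: t = C x dV / I, where dV is how far the rail can sag before the drive browns out. A sketch with assumed numbers (the ~100mA draw is from the discussion above; the 0.5V allowable sag is a guess- check the drive's input voltage tolerance):

```python
def holdup_seconds(cap_farads: float, delta_v: float, current_a: float) -> float:
    """Seconds a capacitor bank can hold the rail up: t = C * dV / I."""
    return cap_farads * delta_v / current_a

# One 1F/6.3V cap directly on the 5V rail, drive tolerating a sag to 4.5V,
# drawing ~100mA while it flushes its cache:
print(holdup_seconds(1.0, 0.5, 0.1), "s")   # ~5 s
# Two caps in series halve the capacitance (but double the voltage rating):
print(holdup_seconds(0.5, 0.5, 0.1), "s")   # ~2.5 s
```

A few seconds is plenty of margin if the drive really does flush within tens of milliseconds of the bus going quiet- which is exactly the manufacturer question raised above.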
Re: [zfs-discuss] New SSD options
So, IMHO, a cheap consumer SSD used as a ZIL may still be worth it (for some use cases) to narrow the window of data loss from ~30 seconds to a sub-second value.

There are lots of reasons to enable the ZIL now- I can throw four very inexpensive SSDs in there now in a pair of mirrors, and then when a better drive comes along I can replace each half of a mirror without bringing anything down. My slots are already allocated, and it would be nice to save a few extra seconds of writes- just in case. It's not a great solution- but nothing is. I don't have access to a ZEUS- and even if I did- I wouldn't pay that kind of money for what amounts to a Vertex 2 Pro but with SLC flash. I'm kind of flabbergasted that no one has simply stuck a capacitor on a more reasonable drive. I guess the market just isn't big enough- but I find that hard to believe. Right now it seems like the options are all or nothing. There's just no %^$#^ middle ground.
Re: [zfs-discuss] New SSD options
Well- 40k IOPS is the current claim from ZEUS- and they're the benchmark. It used to be 17k IOPS. How real any of these numbers are, from any manufacturer, is a guess. Given Intel's refusal to honor a cache flush, and their performance problems with the cache disabled, I don't trust them any more than anyone else right now. As for the Vertex drives- if they are within +/-10% of the Intel, they're still doing it for half of what the Intel drive costs- so it's an option. Not a great option- but still an option.
Re: [zfs-discuss] New SSD options
Well, the larger size of the Vertex, coupled with its smaller claimed write amplification, should result in sufficient service life for my needs. Its claimed MTBF also matches the Intel X25-E's.
Re: [zfs-discuss] New SSD options
Since it ignores the Cache Flush command and it doesn't have any persistent buffer storage, disabling the write cache is the best you can do.

This actually brings up another question I had: what is the risk, beyond a few seconds of lost writes, if I lose power, there is no capacitor, and the cache is not disabled? My ZFS system is shared storage for a large VMware-based QA farm. If I lose power, then a few seconds of writes are the least of my concerns. All of the QA tests will need to be restarted, and all of the file systems will need to be checked. A few seconds of writes won't make any difference unless it has the potential to affect the integrity of the pool itself. Considering the performance trade-off, I'd happily give up a few seconds' worth of writes for significantly improved IOPS.
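The exposure being weighed here is roughly the sync-write rate times the seconds of acknowledged-but-unflushed cache. A toy estimate with purely hypothetical numbers (neither rate is a measurement of this system):

```python
def mb_at_risk(sync_write_mb_per_s: float, unflushed_window_s: float) -> float:
    """MB of acknowledged-but-unflushed writes lost if power dies mid-window."""
    return sync_write_mb_per_s * unflushed_window_s

# e.g. 50 MB/s of sync writes with ~5 s of SSD cache not yet on flash:
print(mb_at_risk(50.0, 5.0), "MB")  # 250.0 MB
```

The absolute amount is small either way; as noted above, the real question is whether those lost-but-acknowledged writes can ever damage pool metadata rather than just client data.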
Re: [zfs-discuss] New SSD options
You can lose all writes from the last committed transaction (i.e., the one before the currently open transaction).

And I don't think that bothers me. As long as the array itself doesn't go belly up, a few seconds of lost transactions are largely irrelevant- all of the QA virtual machines are going to have to be rolled back to their initial states anyway.
Re: [zfs-discuss] New SSD options
You can lose all writes from the last committed transaction (i.e., the one before the currently open transaction).

I'll pick one- performance :) Honestly, I wish I had a better grasp on the real-world performance of these drives. 50k IOPS is nice- and considering the incredible likelihood of data duplication in my environment, the SandForce controller seems like a win. That said- does anyone have a good set of real-world performance numbers for these drives that you can link to?
[zfs-discuss] New SSD options
I'm looking for alternative SSD options to the Intel X25-E and the ZEUS IOPS. The ZEUS IOPS would probably cost as much as my entire current disk system (80 15k SAS drives)- and that's just silly. The Intel is much less expensive, and while fast, pales in comparison to the ZEUS. I've allocated 4 disk slots in my array for ZIL SSDs, and I'm trying to find the best performance for my dollar. With that in mind- is anyone using the new OCZ Vertex 2 SSDs as a ZIL?

http://www.ocztechnology.com/products/solid-state-drives/2-5--sata-ii/performance-enterprise-solid-state-drives/ocz-vertex-2-sata-ii-2-5--ssd.html

They're claiming 50k IOPS (4k write, aligned), 2 million hour MTBF, TRIM support, etc. That's more write IOPS than the ZEUS (40k IOPS, $) but at half the price of an Intel X25-E (3.3k IOPS, $400). Needless to say, I'd love to know if anyone has evaluated these drives to see if they make sense as a ZIL- for example, do they honor cache flush requests? Are those sustained IOPS numbers?
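Framed as cost per claimed write IOPS, the comparison is stark. A quick calculation using the $400 X25-E figure above, and assuming the Vertex 2 at roughly half that (~$200- an assumption from "half the price", not a quoted price; the ZEUS price is omitted above, so it's left out here):

```python
def dollars_per_kiops(price_usd: float, iops: float) -> float:
    """Cost per thousand claimed 4k write IOPS."""
    return price_usd / (iops / 1000.0)

intel = dollars_per_kiops(400.0, 3300.0)    # Intel X25-E, 3.3k IOPS
vertex = dollars_per_kiops(200.0, 50000.0)  # OCZ Vertex 2 (assumed ~$200)
print(f"X25-E: ${intel:.2f}/kIOPS, Vertex 2: ${vertex:.2f}/kIOPS")
```

On claimed numbers that's roughly a 30x gap- which is exactly why the sustained-IOPS and cache-flush questions at the end of the post matter before trusting the marketing figure.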
Re: [zfs-discuss] Oracle to no longer support ZFS on OpenSolaris?
Not to be a conspiracy nut, but anyone anywhere could have registered that gmail account and supplied that answer. It would be a lot more believable from Mr. Kay's Oracle or Sun account.

On 4/20/2010 9:40 AM, Ken Gunderson wrote:
On Tue, 2010-04-20 at 13:57 +0100, Dominic Kay wrote:
Oracle has no plan to move from ZFS as the principal storage platform for Solaris 10 and OpenSolaris. It remains key to both data management and to the OS infrastructure such as root/boot, install and upgrade.
Thanks
Dominic Kay
Product Manager, Filesystems
Oracle

I'll take that as a definitive answer;) Much appreciated. Thank you.
Re: [zfs-discuss] SSD best practices
Yes yes- /etc/zfs/zpool.cache - we all hate typos :)
Re: [zfs-discuss] SSD best practices
I must note that you haven't answered my question... If the zpool.cache file differs between the two heads for some reason, how do I ensure that the second head has an accurate copy without importing the ZFS pool?
Re: [zfs-discuss] SSD best practices
I'm not certain if I'm misunderstanding you, or if you didn't read my post carefully. Why would the zpool.cache file be current on the _second_ node? The first node is where I've added my zpools and so on. The second node isn't going to have an updated cache file until I export the zpool from the first system and import it on the second system, no? In my case, I believe both nodes have exactly the same view of the disks- all the controllers and targets are identical- but there is no reason they have to be, as far as I know. As such, simply backing up the primary system's zpool.cache to the secondary could cause problems. I'm simply curious whether there is a way for a node to keep its zpool.cache up to date without actually importing the zpool- i.e. is there a scandisks command that can scan for a zpool without importing it? Am I misunderstanding something here?
Re: [zfs-discuss] SSD best practices
Ok- I think perhaps I'm failing to explain myself. I want to know if there is a way for a second node, connected to a set of shared disks, to keep its zpool.cache up to date _without_ actually importing the ZFS pool. As I understand it, keeping the zpool.cache up to date on the second node would provide additional protection should the slog fail at the same time my primary head failed (it should also improve import times, if what I've read is true). I understand that importing the disks on the second node will update the cache file- but by that time it may be too late. I'd like to update the cache file _before_ then. I see no reason why the second node couldn't scan the disks being used by the first node and then update its zpool.cache.
Re: [zfs-discuss] SSD best practices
That section of the man page is actually helpful- as I wasn't sure what I was going to do to ensure the nodes didn't try to bring up the zpool on their own- outside of clustering software or my own intervention. That said- it still doesn't explain how I would keep the secondary node's zpool.cache up to date. If I create a zpool on the first node, import it on the second, then move it back to the first- now they both have a current zpool.cache. If I add additional disks to the first node- how do I get the second node's cache file current without first importing the disks?
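For what it's worth, here is a sketch of the cachefile mechanics being discussed (pool and device names are hypothetical; this shows the cachefile property itself, not a way to refresh a passive node's cache without importing):

```shell
# On the active head: point the shared pool at a non-default cachefile
# so the system does not auto-import it at boot.
zpool create -o cachefile=/etc/cluster/zpool.cache pool0 mirror c2t0d0 c2t1d0
zpool set cachefile=/etc/cluster/zpool.cache pool0

# On the passive head: a later import can be driven from a copied
# cachefile, or fall back to scanning the devices directly.
zpool import -c /etc/cluster/zpool.cache pool0
zpool import -d /dev/dsk pool0
```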
Re: [zfs-discuss] SSD best practices
Now I'm simply confused. Do you mean one cachefile shared between the two nodes for this zpool? How, may I ask, would this work? The rpool should be in /etc/zfs/zpool.cache. The shared pool should be in /etc/cluster/zpool.cache (or wherever you prefer to put it) so it won't come up on system start. What I don't understand is how the second node is either a) supposed to share the first node's cachefile or b) create its own without importing the pool. You say this is the job of the cluster software- does ha-cluster already handle this with its ZFS modules? I've asked this question 5 different ways and I either still haven't gotten an answer- or still don't understand the problem. Is there a way for a passive node to generate its _own_ zpool.cache without importing the file system? If so- how? If not- why is this unimportant?
Re: [zfs-discuss] SSD best practices
I apologize- I didn't mean to come across as rude- I'm just not sure if I'm asking the right question. I'm not ready to use the ha-cluster software yet as I haven't finished testing it. For now I'm manually failing over from the primary to the backup node. That will change- but I'm not ready to go there yet. As such I'm trying to make sure both my nodes have a current cache file so that the targets and GUIDs are ready.
Re: [zfs-discuss] SSD best practices
I understand that the important bit about having the cachefile is the GUIDs (although the disk record is, I believe, helpful in improving import speeds) so we can recover in certain oddball cases. As such- I'm still confused why you say it's unimportant. Is it enough to simply copy the /etc/cluster/zpool.cache file from the primary node to the secondary so that I at least have the GUIDs even if the disk references (the /dev/dsk sections) might not match?
Re: [zfs-discuss] SSD best practices
Continuing on the best practices theme- how big should the ZIL slog disk be? The ZFS Evil Tuning Guide suggests enough space for 10 seconds of my synchronous write load. Even assuming I could cram 20 gigabits/sec into the host (two 10 GigE NICs), that only comes out to 200 gigabits, which = 25 gigabytes. I'm currently planning to use four 32 GB SSDs arranged in two 2-way mirrors, which should give me 64 GB of log space. Is there any reason to believe that this would be insufficient (especially considering I can't begin to imagine being able to cram 5 Gb/s into the host- let alone 20)? Are there any guidelines on how much ZIL performance should increase with 2 SSD slogs (4 disks with mirrors) over a single SSD slog (2 disks mirrored)?
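The arithmetic above can be written out explicitly (a worst-case sketch, assuming the 10-second figure from the tuning guide and fully saturated NICs):

```shell
# Worst-case slog sizing: two 10 GigE NICs saturated with synchronous
# writes for the ~10 s window suggested by the Evil Tuning Guide.
nics=2
gbit_per_nic=10
seconds=10
gb=$(( nics * gbit_per_nic * seconds / 8 ))   # gigabits -> gigabytes
echo "worst-case slog usage: ${gb} GB"
```

which prints 25 GB, comfortably under the 64 GB of mirrored log space in the plan above.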
Re: [zfs-discuss] SSD best practices
"I think the size of the ZIL log is basically irrelevant" That was the understanding I got from reading the various blog posts and the tuning guide. "only a single SSD, just due to the fact that you've probably got dozens of disks attached, and you'll probably use multiple log devices striped just for the sake of performance." I've got 72 (possibly 76) 15k RPM 300 GB and 600 GB SAS drives, and my head has 16 GB of RAM, though that can be increased at any time to 32 GB. My current plan is to use 4 x 32 GB SLC write-optimized SSDs in a striped-mirrors configuration. I'm curious if anyone knows how ZIL slog performance scales. For example- how much benefit would you expect from 2 SSD slogs over 1? Would there be a significant benefit to 3 over 2, or does it begin to taper off? I'm sure a lot of this is dependent on the environment- but rough ideas are good to know. Is it safe to assume that a stripe across two mirrored write-optimized SSDs is going to give me the best performance for 4 available drive bays (assuming I want the ZIL to remain safe)? "Is it even physically possible to write 4G to any device in less than 10 seconds?" I wasn't actually sure the 10 second number was still accurate- that was definitely part of my question. If it is- then yes- I could never fill a 32 GB ZIL, let alone a 64 GB one. Thanks for all of the help and advice.
Re: [zfs-discuss] SSD best practices
I always try to plan for the worst case- I just wasn't sure how to arrive at the worst case. Thanks for providing the information- and I will definitely check out the dtrace zilstat script. Considering the smallest SSD I can buy from a manufacturer that I trust seems to be 32 GB- that's probably going to be my choice. As for the choice of striping across two mirrored pairs- I want every last IOP I can get my hands on. An extra $700 isn't going to make much of a difference in a system involving 2 heads, 5 storage shelves, and 76 SAS drives- if I could think of something better to spend that money on- I would- but right now- it seems like the best option.
Re: [zfs-discuss] SSD best practices
"A STEC Zeus IOPS SSD (45K IOPS) will behave quite differently than an Intel X-25E (~3.3K IOPS)." Where can you even get the Zeus drives? I thought they were only in the OEM market and last time I checked they were ludicrously expensive. I'm looking for between 5k and 10k IOPS using up to 4 drive bays (so a 2 x 2 striped mirror would be fine). Right now we peak at about 3k IOPS (though that's not to a ZFS system) but I would like to be able to burst to double that. We do have a lot of small-size burst writes, hence our ZIL concerns. "An SRAM or DRAM-based drive (with FLASH backup) will behave dramatically differently than a typical SSD." As long as it can speak SAS or SATA and I can put it in a drive shelf I'd happily consider using it. All the DRAM devices I know are host-based and that won't help my cluster. On that note- what write-optimized SSDs do you recommend? I don't actually know where to buy the Zeus drives even if they've become more reasonably priced. Thanks for taking the time to share- it's been very informative.
Re: [zfs-discuss] SSD best practices
So if the Intel X25E is a bad device- can anyone recommend an SLC device with good firmware? (Or an MLC drive that performs as well?) I've got 80 spindles in 5 16-bay drive shelves (76 15k RPM SAS drives in 19 four-disk raidz sets, 2 hot spares, and 2 bays set aside for a mirrored ZIL) connected to two servers (so if one fails I can import on the other one). Host-based cards are not an option for my ZIL- I need something that sits in the array and can be imported by the other system. I was planning on using a pair of mirrored SLC-based Intel X25Es because of their superior write performance but if it's going to destroy my pool- then it's useless. Does anyone else have something that can match their write performance without breaking ZFS?
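As a sketch, assuming hypothetical device names for two bays in the shared shelf, a mirrored slog that both heads can reach would be added like this:

```shell
# The log vdev lives in the shared shelf, so a failover import on the
# second head picks it up along with the data disks.
zpool add tank log mirror c5t14d0 c5t15d0
zpool status tank    # the pair appears under a "logs" section
```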
Re: [zfs-discuss] SSD best practices
If you have a pair of heads talking to shared disks with ZFS- what can you do to ensure the second head always has a current copy of the zpool.cache file? I'd prefer not to lose the ZIL, fail over, and then suddenly find out I can't import the pool on my second head.
Re: [zfs-discuss] SSD best practices
But if the X25E doesn't honor cache flushes then it really doesn't matter if they are mirrored- they both may cache the data, not write it out, and leave me screwed. I'm running 2009.06 and not one of the newer developer candidates that handle ZIL losses gracefully (or at all- at least as far as I understand things). As for the optimal performance- a single pair probably won't give me optimal performance- but based on all the numbers I've seen it's still going to beat using the pool disks. If I find the ZIL is still a bottleneck I'll definitely add a second set of SSDs- but I've got a lot of testing to do before I get there.
Re: [zfs-discuss] SSD best practices
I'm not sure to what you are referring when you say "my running BE". I haven't looked at the zpool.cache file too closely, but if the devices don't match between the two systems for some reason- isn't that going to cause a problem? I was really asking if there is a way to build the cache file without importing the disks.
Re: [zfs-discuss] Replacing faulty disk in ZFS pool
I believe there are a couple of ways that work. The commands I've always used are to attach the new disk as a spare (if not already) and then replace the failed disk with the spare. I don't know if there are advantages or disadvantages but I also have never had a problem doing it this way. Andreas Höschler wrote: Dear managers, one of our servers (X4240) shows a faulty disk:

-bash-3.00# zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror    ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
          mirror    DEGRADED     0     0     0
            c1t6d0  FAULTED      0    19     0  too many errors
            c1t7d0  ONLINE       0     0     0

errors: No known data errors

I derived the following possible approaches to solve the problem: 1) A way to reestablish redundancy would be to use the command zpool attach tank c1t7d0 c1t15d0 to add c1t15d0 to the virtual device c1t6d0 + c1t7d0. We would still have the faulty disk in the virtual device. We could then detach the faulty disk with the command zpool detach tank c1t6d0 2) Another approach would be to add a spare disk to tank zpool add tank spare c1t15d0 and then replace the faulty disk: zpool replace tank c1t6d0 c1t15d0 In theory that is easy, but since I have never done that and since this is a production server I would appreciate it if someone with more experience would look over my agenda before I issue these commands. What is the difference between the two approaches? Which one do you recommend?
And is that really all that has to be done, or am I missing a bit? I mean, can c1t6d0 be physically replaced after issuing zpool detach tank c1t6d0 or zpool replace tank c1t6d0 c1t15d0? I also found the command zpool offline tank ... but am not sure whether this should be used in my case. Hints are greatly appreciated! Thanks a lot, Andreas
Re: [zfs-discuss] Replacing faulty disk in ZFS pool
If he adds the spare and then manually forces a replace, it will take no more time than any other way. I do this quite frequently and without the scrub, which does take quite a lot of time. cindy.swearin...@sun.com wrote: Hi Andreas, Good job for using a mirrored configuration. :-) Your various approaches would work. My only comment about #2 is that it might take some time for the spare to kick in for the faulted disk. Both 1 and 2 would take a bit more time than just replacing the faulted disk with a spare disk, like this: # zpool replace tank c1t6d0 c1t15d0 Then you could physically replace c1t6d0 and add it back to the pool as a spare, like this: # zpool add tank spare c1t6d0 For a production system, the steps above might be the most efficient. Get the faulted disk replaced with a known good disk so the pool is no longer degraded, then physically replace the bad disk when you have the time and add it back to the pool as a spare. It is also good practice to run a zpool scrub to ensure the replacement is operational and use zpool clear to clear the previous errors on the pool. If the system is used heavily, then you might want to run the zpool scrub when system use is reduced. If you were going to physically replace c1t6d0 while it was still attached to the pool, then you might offline it first. Cindy On 08/06/09 13:17, Andreas Höschler wrote: [original message, with the zpool status output, quoted in full above]
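Condensing the advice in this thread into one sequence (device names as in the thread; check zpool status at each step before moving on):

```shell
zpool replace tank c1t6d0 c1t15d0   # resilver a known-good disk in place of the faulted one
zpool status tank                   # wait until the resilver completes
zpool scrub tank                    # verify the replacement end to end
zpool clear tank                    # clear the old error counters
# after physically swapping out the bad drive:
zpool add tank spare c1t6d0
```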
Re: [zfs-discuss] Fed up with ZFS causing data loss
This may have been mentioned elsewhere and, if so, I apologize for repeating. Is it possible your difficulty here is with the Marvell driver and not, strictly speaking, ZFS? The Solaris Marvell driver has had many, MANY bug fixes and continues to this day to be supported by IDR patches and other quick-fix workarounds. It is the source of many problems. Granted, ZFS handles these poorly at times (it got a lot better with ZFS v10) but it is difficult to expect the file system to deal well with underlying instability in the hardware driver, I think. I'd be interested to hear if your experiences are the same using the LSI controllers, which have a much better driver in Solaris. Ross wrote: Supermicro AOC-SAT2-MV8, based on the Marvell chipset. I figured it was the best available at the time since it's using the same chipset as the x4500 Thumper servers. Our next machine will be using LSI controllers, but I'm still not entirely happy with the way ZFS handles timeout-type errors. It seems that it handles drive-reported read or write errors fine, and also handles checksum errors, but it completely misses drive timeout errors as used by hardware RAID controllers. Personally, I feel that when a pool usually responds to requests in the order of milliseconds, a timeout of even a tenth of a second is too long. Several minutes before a pool responds is just a joke. I'm still a big fan of ZFS, and modern hardware may have better error handling, but I can't help but feel this is a little short-sighted.
[zfs-discuss] Lots of small files vs fewer big files
I work with Greenplum, which is essentially a number of Postgres database instances clustered together. Being Postgres, the data is held in a lot of individual files which can each be fairly big (hundreds of MB or several GB) or very small (50 MB or less). We've noticed a performance difference when our database files are many and small versus few and large. To test this outside the database, we built a zpool using RAID-10 (the effect occurs with RAID-Z too) and filled it with 800 5 MB files. Then we used 4 concurrent dd processes to read 1/4 of the files each. This required 123 seconds. Then we destroyed the pool, recreated it, and filled it with 20 files of 200 MB each and 780 files of 0 bytes (same number of files, same total space consumed). The same dd reads took 15 seconds. Any idea why this is? Various configurations of our product can divide data in the databases into an enormous number of small files. Varying the ARC cache size limit did not have any effect. Are there other tunables available to Solaris 10 U7 (not OpenSolaris) that might affect this behavior? Thanks! -dt
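A scaled-down sketch of the comparison (ordinary files in a temp directory rather than a dedicated pool, and tiny sizes so it runs anywhere; it reproduces only the methodology, not the ZFS-specific effect):

```shell
dir=$(mktemp -d)
# same total bytes, split two ways: 16 x 64 KiB vs 1 x 1 MiB
for i in $(seq 1 16); do
  dd if=/dev/zero of="$dir/small.$i" bs=64k count=1 2>/dev/null
done
dd if=/dev/zero of="$dir/large" bs=64k count=16 2>/dev/null
echo "small files total: $(cat "$dir"/small.* | wc -c) bytes"
echo "large file total:  $(cat "$dir/large" | wc -c) bytes"
rm -rf "$dir"
```

Timing the two read paths (e.g. with `time dd` against each set, after exporting and re-importing the pool to defeat the cache) is what exposes the gap described above.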
Re: [zfs-discuss] Lots of small files vs fewer big files
Thanks for the suggestion! We've fiddled with this in the past. Our app uses 32k blocks instead of 8k, and it is data warehousing, so the I/O model is mostly long sequential reads. Changing the blocksize has very little effect on us. I'll have to look at fsync; I hadn't considered that. Compression is a killer; it costs us up to 50% of the performance, sadly. CPU is not always a problem for us but it can be, depending on the query workload and the servers involved. Bryan Allen wrote: Have you set the recordsize for the filesystem to the blocksize Postgres is using (8K)? Note this has to be done before any files are created. Other thoughts: disable Postgres's fsync, enable filesystem compression if disk I/O is your bottleneck as opposed to CPU. I do this with MySQL and it has proven useful. My rule of thumb there is 60% for InnoDB cache, 40% for ZFS ARC, but YMMV. http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
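A sketch of the tuning being discussed, with a hypothetical dataset name (recordsize only applies to files written after it is set, which is why it has to happen before the data is loaded):

```shell
zfs set recordsize=32k tank/gpdata      # match the app's 32k block size
zfs get recordsize tank/gpdata          # confirm the setting
zfs set compression=off tank/gpdata     # per the 50% hit reported above
```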
[zfs-discuss] Large zpool design considerations
Hi, I am looking for some best practice advice on a project that I am working on. We are looking at migrating ~40 TB of backup data to ZFS, with an annual data growth of 20-25%. Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs (7+2) with one hot spare per 10 drives and just continue to expand that pool as needed. Between calculating the MTTDL and performance models I was hit by a rather scary thought. A pool comprised of X vdevs is no more resilient to data loss than the weakest vdev, since loss of a vdev would render the entire pool unusable. This means that I could potentially lose 40 TB+ of data if three disks within the same RAIDZ-2 vdev should die before the resilvering of at least one disk is complete. Since most disks will be filled I do expect rather long resilvering times. We are using 750 GB Seagate (Enterprise Grade) SATA disks for this project with as much hardware redundancy as we can get (multiple controllers, dual cabling, I/O multipathing, redundant PSUs, etc.). I could use multiple pools, but that would make data management harder, which in itself is a lengthy process in our shop. The MTTDL figures seem OK, so how much do I need to worry? Anyone having experience with this kind of setup? /Don E.
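For reference, the proposed layout sketched with hypothetical device names. Note that both raidz2 vdevs sit in the same pool, which is exactly why the loss of either vdev loses everything:

```shell
zpool create backup \
  raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c2t8d0 \
  raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c3t8d0 \
  spare  c2t9d0 c3t9d0
# later growth: append another 7+2 vdev to the same pool
zpool add backup raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t7d0 c4t8d0
```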
Re: [zfs-discuss] Large zpool design considerations
Don Enrique wrote: Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs (7+2) with one hot spare per 10 drives and just continue to expand that pool as needed. Between calculating the MTTDL and performance models I was hit by a rather scary thought. A pool comprised of X vdevs is no more resilient to data loss than the weakest vdev, since loss of a vdev would render the entire pool unusable. This means that I could potentially lose 40 TB+ of data if three disks within the same RAIDZ-2 vdev should die before the resilvering of at least one disk is complete. Since most disks will be filled I do expect rather long resilvering times. "Why are you planning on using RAIDZ-2 rather than mirroring?" Mirroring would increase the cost significantly and is not within the budget of this project. -- Darren J Moffat