Re: [zfs-discuss] System started crashing hard after zpool reconfigure and OI upgrade
How about crash dumps?

michael

On Wed, Mar 20, 2013 at 4:50 PM, Peter Wood wrote:
> I'm sorry. I should have mentioned that I can't find any errors in the
> logs. The last entry in /var/adm/messages is that I removed the keyboard
> after the last reboot, and then it shows the new boot-up messages when I
> boot the system after the crash. The BIOS log is empty. I'm not sure how
> to check the IPMI, but IPMI is not configured and I'm not using it.
>
> Just another observation - the crashes are more intense the more data the
> system serves (NFS).
>
> I'm looking into firmware upgrades for the LSI now.
>
> On Wed, Mar 20, 2013 at 8:40 AM, Will Murnane wrote:
>> Does the Supermicro IPMI show anything when it crashes? Does anything
>> show up in event logs in the BIOS, or in system logs under OI?
>>
>> On Wed, Mar 20, 2013 at 11:34 AM, Peter Wood wrote:
>>> I have two identical Supermicro boxes with 32GB RAM. Hardware details
>>> at the end of the message.
>>>
>>> They were running OI 151.a.5 for months. The zpool configuration was
>>> one storage zpool with 3 vdevs of 8 disks in RAIDZ2.
>>>
>>> The OI installation is absolutely clean. Just next-next-next until
>>> done. All I do is configure the network after install. I don't install
>>> or enable any other services.
>>>
>>> Then I added more disks and rebuilt the systems with OI 151.a.7, and
>>> this time configured the zpool with 6 vdevs of 5 disks in RAIDZ.
>>>
>>> The systems started crashing really badly. They just disappear from
>>> the network: black and unresponsive console, no error lights, but no
>>> activity indication either. The only way out is to power cycle the
>>> system.
>>>
>>> There is no pattern in the crashes. It may crash in 2 days, it may
>>> crash in 2 hours.
>>>
>>> I upgraded the memory on both systems to 128GB, to no avail. This is
>>> the max memory they can take.
>>>
>>> In summary, all I did was upgrade to OI 151.a.7 and reconfigure the
>>> zpool.
>>>
>>> Any idea what could be the problem?
>>>
>>> Thank you
>>>
>>> -- Peter
>>>
>>> Supermicro X9DRH-iF
>>> Xeon E5-2620 @ 2.0 GHz 6-Core
>>> LSI SAS9211-8i HBA
>>> 32x 3TB Hitachi HUS723030ALS640, SAS, 7.2K

-- Michael Schuster
http://recursiveramblings.wordpress.com/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] System started crashing hard after zpool reconfigure and OI upgrade
Peter, sorry if this is so obvious that you didn't mention it: have you checked /var/adm/messages and other diagnostic tool output?

regards
Michael

On Wed, Mar 20, 2013 at 4:34 PM, Peter Wood wrote:
> I have two identical Supermicro boxes with 32GB RAM. Hardware details at
> the end of the message.
>
> They were running OI 151.a.5 for months. The zpool configuration was one
> storage zpool with 3 vdevs of 8 disks in RAIDZ2.
>
> The OI installation is absolutely clean. Just next-next-next until done.
> All I do is configure the network after install. I don't install or
> enable any other services.
>
> Then I added more disks and rebuilt the systems with OI 151.a.7, and
> this time configured the zpool with 6 vdevs of 5 disks in RAIDZ.
>
> The systems started crashing really badly. They just disappear from the
> network: black and unresponsive console, no error lights, but no
> activity indication either. The only way out is to power cycle the
> system.
>
> There is no pattern in the crashes. It may crash in 2 days, it may crash
> in 2 hours.
>
> I upgraded the memory on both systems to 128GB, to no avail. This is the
> max memory they can take.
>
> In summary, all I did was upgrade to OI 151.a.7 and reconfigure the
> zpool.
>
> Any idea what could be the problem?
>
> Thank you
>
> -- Peter
>
> Supermicro X9DRH-iF
> Xeon E5-2620 @ 2.0 GHz 6-Core
> LSI SAS9211-8i HBA
> 32x 3TB Hitachi HUS723030ALS640, SAS, 7.2K

-- Michael Schuster
http://recursiveramblings.wordpress.com/
Re: [zfs-discuss] help zfs pool with duplicated and missing entry of hdd
On Thu, 10 Jan 2013, Jim Klimov wrote:

> On 2013-01-10 08:51, Jason wrote:
>> Hi,
>>
>> One of my server's zfs faulted and it shows the following:
>>
>>   NAME        STATE    READ WRITE CKSUM
>>   backup      UNAVAIL     0     0     0  insufficient replicas
>>     raidz2-0  UNAVAIL     0     0     0  insufficient replicas
>>       c4t0d0  ONLINE      0     0     0
>>       c4t0d1  ONLINE      0     0     0
>>       c4t0d0  FAULTED     0     0     0  corrupted data
>>       c4t0d3  FAULTED     0     0     0  too many errors
>>       c4t0d4  FAULTED     0     0     0  too many errors
>>   ... (omit the rest)
>>
>> My question is why c4t0d0 appeared twice, and c4t0d2 is missing. I have
>> checked the controller card and hard disks; they are all working fine.
>
> This renaming does seem like an error in detecting (and further naming)
> the disks - i.e. if a connector got loose and one of the disks is not
> seen by the system, the numbering can shift in such a manner. It is
> indeed strange, however, that only "d2" got shifted or went missing, and
> not all the numbers after it.
>
> So, did you verify that the controller sees all the disks in the
> "format" command (and perhaps, after a cold reboot, in the BIOS)? Just
> in case, try to unplug and replug all cables (power, data) in case their
> pins got oxidized over time.

Usually the disk numbering in any Solaris-based OS stays the same if one disk is offline/missing; it's fixed to the controller port, or SCSI target, or WWN. Imho that's a huge advantage of the c0t0d0 pattern over the Linux or FreeBSD numbering. I once had an old Sun 5200 hooked up to a Linux box, and when one of the 22 disks failed, every disk after the bad one shifted - what a mess.

To me the c4t0d0, c4t0d1, ... numbering looks like either a hardware RAID controller not in JBOD mode, or even an external SAN. JBODs normally show up as LUN 0 (d0) with different target numbers (t1, t2, ...). Maybe something is wrong with the LUN numbering on your box?

-- Michael
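The cross-check Jim suggests - comparing what the OS actually sees against what the pool expects - can be sketched as follows (pool name taken from the status output above; `format` fed EOF so it lists disks and exits):

```shell
# List every disk the controller currently presents, non-interactively:
format < /dev/null

# Compare the device names above with the ones the pool is built from:
zpool status backup
```

If a device the pool expects is absent from the format listing, the problem is below ZFS (cabling, controller, or LUN mapping), not in the pool itself.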
Re: [zfs-discuss] Using L2ARC on an AdHoc basis.
Ok, so it is possible to remove. Good to know, thanks. I move the pool maybe once a month for a few days; otherwise it sits in a fixed location in daily use, so I thought the warm-up allowance might be worth it. I guess I just wanted to know whether adding a cache device was a one-way operation, and whether or not it risked integrity.

Sent from my iPhone

On 13 Oct 2012, at 23:02, Ian Collins wrote:
> On 10/14/12 10:02, Michael Armstrong wrote:
>> Hi Guys,
>>
>> I have a "portable pool", i.e. one that I carry around in an enclosure.
>> However, any SSD I add for L2ARC will not be carried around... meaning
>> the cache drive will become unavailable from time to time.
>>
>> My question is: will random removal of the cache drive put the pool
>> into a "degraded" state or affect the integrity of the pool at all?
>> Additionally, how adversely will this affect "warm up"? Or will moving
>> the enclosure between machines with and without cache just
>> automatically work, offering benefits when cache is available and fewer
>> benefits when it isn't?
>
> Why bother with cache devices at all if you are moving the pool around?
> As you hinted above, the cache can take a while to warm up and become
> useful.
>
> You should zpool remove the cache device before exporting the pool.
>
> --
> Ian.
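The round trip discussed above can be sketched as follows - pool and device names here are hypothetical, but adding a cache vdev is indeed reversible and does not affect pool redundancy:

```shell
# Attach an SSD as L2ARC (read cache only; no pool data depends on it):
zpool add tank cache c5t0d0

# Before moving the enclosure, detach the cache device and export:
zpool remove tank c5t0d0
zpool export tank

# On a machine without the SSD, the pool imports normally,
# just without the read-cache benefit:
zpool import tank
```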
[zfs-discuss] Using L2ARC on an AdHoc basis.
Hi Guys,

I have a "portable pool", i.e. one that I carry around in an enclosure. However, any SSD I add for L2ARC will not be carried around... meaning the cache drive will become unavailable from time to time.

My question is: will random removal of the cache drive put the pool into a "degraded" state or affect the integrity of the pool at all? Additionally, how adversely will this affect "warm up"? Or will moving the enclosure between machines with and without cache just automatically work, offering benefits when cache is available and fewer benefits when it isn't?

I hope this question isn't too much of a rambling :) thanks.
Re: [zfs-discuss] what have you been buying for slog and l2arc?
On Mon, 6 Aug 2012, Christopher George wrote:

>> I mean this as constructive criticism, not as angry bickering. I
>> totally respect you guys doing your own thing.
>
> Thanks, I'll try my best to address your comments...
>
>> *) At least update the benchmarks on your site to compare against
>> modern flash-based competition (not the Intel X25-E, which is seriously
>> stone age by now...)
>
> I completely agree we need to refresh the website; not even the photos
> are representative of our shipping product (we now offer VLP DIMMs). We
> are engineers first and foremost, but an updated website is in the
> works. In the meantime, we have benchmarked against both the Intel
> 320/710 in my OpenStorage Summit 2011 presentation, which can be found
> at: http://www.ddrdrive.com/zil_rw_revelation.pdf

Very impressive iops numbers, although I have some thoughts on the benchmarking method itself. Imho the comparison shouldn't be raw iops numbers on the DDRdrive itself as tested with iometer (it's only 4GB), but real-world numbers on a real pool of spinning disks with the DDRdrive acting as ZIL accelerator.

I just introduced an Intel 320 120GB as ZIL accelerator for a simple zpool with two SAS disks in raid0 configuration, and it's not as bad as in your presentation. It shows about 50% of the possible NFS ops with the SSD as ZIL versus no ZIL (sync=disabled on oi151), and about 6x-8x the performance compared to the pool without any accelerator and sync=standard. The no-ZIL case is the upper limit one can achieve on a given pool - in my case, creation of about 750 small files/sec via NFS. With the SSD it's 380 files/sec (the NFS stack is a limiting factor, too). Or about 2400 8k write iops with the SSD vs. 11900 iops with ZIL disabled, and 250 iops without an accelerator (GNU dd with oflag=sync). Not bad at all. This could be just good enough for small businesses and moderately sized pools.

Michael

-- Michael Hase
edition-software GmbH
http://edition-software.de
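The sync-write figure mentioned above comes from a plain GNU dd run with oflag=sync. A minimal sketch against a scratch file (path hypothetical; the resulting records/sec depend entirely on the pool's log device):

```shell
# With oflag=sync, each 8k block is committed to stable storage before
# dd continues, so records/sec approximates synchronous write iops:
dd if=/dev/zero of=/tmp/synctest.dat bs=8k count=1000 oflag=sync

# Clean up the scratch file:
rm -f /tmp/synctest.dat
```

Divide the record count by the elapsed time dd reports to get the sync iops number for the filesystem under test.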
[zfs-discuss] Very poor small-block random write performance
I have an 8-drive ZFS array (RAIDZ2 - 1 spare) using 5900rpm 2TB SATA drives with an hpt27xx controller under FreeBSD 10 (but I've seen the same issue with FreeBSD 9). The system has 8GB of RAM and I'm letting FreeBSD auto-size the ARC.

Running iozone (from ports), everything is fine for file sizes up to 8GB, but when it runs with a 16GB file the random write performance plummets at 64K record sizes:

  8G  - 64K  ->  52 MB/s
  8G  - 128K -> 713 MB/s
  8G  - 256K -> 442 MB/s
  16G - 64K  ->   7 MB/s
  16G - 128K -> 380 MB/s
  16G - 256K -> 392 MB/s

Also, sequential small-block performance doesn't show such a dramatic slowdown:

  16G - 64K  -> 108 MB/s (sequential)

There's nothing else using the zpool at the moment; the system is on a separate SSD. I was expecting performance to drop off at 16GB because that's well above the available ARC, but that dramatic a drop-off - and then the sharp improvement at 128K and 256K - is surprising. Are there any configuration settings I should be looking at?

Mike
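For reference, an iozone invocation that isolates the problematic case above might look like the following - a sketch only, assuming iozone from ports and a scratch file path on the pool under test:

```shell
# -s: file size, -r: record size, -i 0: write/rewrite test,
# -i 2: random read/write test, -f: scratch file on the pool
iozone -s 16g -r 64k -i 0 -i 2 -f /tank/iozone.tmp
```

Running the same command with -r 128k makes it easy to reproduce the 64K-vs-128K gap in isolation, without waiting for a full `iozone -a` sweep.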
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Tue, 17 Jul 2012, Bob Friesenhahn wrote:

> On Tue, 17 Jul 2012, Michael Hase wrote:
>>> If you were to add a second vdev (i.e. stripe) then you should see
>>> very close to 200% due to the default round-robin scheduling of the
>>> writes.
>>
>> My expectation would be > 200%, as 4 disks are involved. It may not be
>> the perfect 4x scaling, but imho it should be (and is for a scsi
>> system) more than half of the theoretical throughput. This is solaris
>> or a solaris derivative, not linux ;-)
>
> Here are some results from my own machine based on the 'virgin mount'
> test approach. The results show less boost than is reported by a
> benchmark tool like 'iozone' which sees benefits from caching. I get an
> initial sequential read speed of 657 MB/s on my new pool, which has
> 1200 MB/s of raw bandwidth (if mirrors could produce 100% boost).
> Reading the file a second time reports 6.9 GB/s. The below is with a
> 2.6 GB test file, but with a 26 GB test file (just add another zero to
> 'count' and wait longer) I see an initial read rate of 618 MB/s and a
> re-read rate of 8.2 GB/s. The raw disk can transfer 150 MB/s.

To work around these caching effects, just use a file > 2 times the size of RAM; iostat then shows the numbers really coming from disk. I always test like this. A re-read rate of 8.2 GB/s is really just memory bandwidth, but quite impressive ;-)

> % pfexec zfs create tank/zfstest/defaults
> % cd /tank/zfstest/defaults
> % pfexec dd if=/dev/urandom of=random.dat bs=128k count=20000
> 20000+0 records in
> 20000+0 records out
> 2621440000 bytes (2.6 GB) copied, 36.8133 s, 71.2 MB/s
> % cd ..
> % pfexec zfs umount tank/zfstest/defaults
> % pfexec zfs mount tank/zfstest/defaults
> % cd defaults
> % dd if=random.dat of=/dev/null bs=128k count=20000
> 20000+0 records in
> 20000+0 records out
> 2621440000 bytes (2.6 GB) copied, 3.99229 s, 657 MB/s
> % pfexec dd if=/dev/rdsk/c7t5393E8CA21FAd0p0 of=/dev/null bs=128k count=2000
> 2000+0 records in
> 2000+0 records out
> 262144000 bytes (262 MB) copied, 1.74532 s, 150 MB/s
> % bc
> scale=8
> 657/150
> 4.38000000
>
> It is very difficult to benchmark with a cache which works so well:
>
> % dd if=random.dat of=/dev/null bs=128k count=20000
> 20000+0 records in
> 20000+0 records out
> 2621440000 bytes (2.6 GB) copied, 0.379147 s, 6.9 GB/s

This is not my point; I'm pretty sure I did not measure any arc effects - maybe with the one exception of the raid0 test on the scsi array. Don't know why the arc had this effect; the file size was 2x RAM. The point is: I'm searching for an explanation for the relative slowness of a mirrored pair of sata disks, or some tuning knobs, or something like "the disks are plain crap", or maybe: zfs throttles sata disks in general (I don't know the internals).

In the range of > 600 MB/s other issues may show up (PCIe bus contention, HBA contention, CPU load). And performance at this level could be just good enough, not requiring any further tuning.

Could you recheck with only 4 disks (2 mirror pairs)? If you just get some 350 MB/s it could be the same problem as with my boxes. All sata disks?

Michael

> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
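The bc step in the transcript above is just the ratio of mirrored-pool read speed to a single raw disk; the same arithmetic with awk:

```shell
# 657 MB/s from the freshly mounted pool vs. 150 MB/s from one raw disk:
awk 'BEGIN { printf "%.2f\n", 657 / 150 }'
```

A result of 4.38 against 8 data spindles is where the "less than 100% mirror boost" observation in the thread comes from.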
Re: [zfs-discuss] zfs sata mirror slower than single disk
sorry to insist, but still no real answer...

On Mon, 16 Jul 2012, Bob Friesenhahn wrote:

> On Tue, 17 Jul 2012, Michael Hase wrote:
>> So only one thing left: mirror should read 2x
>
> I don't think that a mirror should necessarily read 2x faster, even
> though the potential is there to do so. Last I heard, zfs did not
> include a special read scheduler for sequential reads from a mirrored
> pair. As a result, 50% of the time, a read will be scheduled for a
> device which already has a read scheduled. If this is indeed true, the
> typical performance would be 150%. There may be some other scheduling
> factor (e.g. an estimate of busyness) which might still allow zfs to
> select the right side and do better than that.
>
> If you were to add a second vdev (i.e. stripe) then you should see very
> close to 200% due to the default round-robin scheduling of the writes.

My expectation would be > 200%, as 4 disks are involved. It may not be the perfect 4x scaling, but imho it should be (and is for a scsi system) more than half of the theoretical throughput. This is solaris or a solaris derivative, not linux ;-)

> It is really difficult to measure zfs read performance due to caching
> effects. One way to do it is to write a large file (containing random
> data such as returned from /dev/urandom) to a zfs filesystem, unmount
> the filesystem, remount the filesystem, and then time how long it takes
> to read the file once. The reason why this works is because remounting
> the filesystem restarts the filesystem cache.

Ok, I did a zpool export/import cycle between the dd write and read tests. This really empties the arc; I checked with arc_summary.pl. The test even uses two processes in parallel (doesn't make a difference). The result is still the same:

  dd write: 2x 58 MB/sec --> perfect, each disk does > 110 MB/sec
  dd read:  2x 68 MB/sec --> imho too slow, about 68 MB/sec per disk

For writes each disk gets 900 128k io requests/sec with asvc_t in the 8-9 msec range. For reads each disk only gets 500 io requests/sec, asvc_t 18-20 msec with the default zfs_vdev_max_pending=10. When reducing zfs_vdev_max_pending the asvc_t drops accordingly; the i/o rate remains at 500/sec per disk, and throughput stays the same. I think the iostat values should be reliable here. These high iops numbers make sense as we work on empty pools, so there aren't very high seek times.

All benchmarks (dd, bonnie++, will try iozone) lead to the same result: on the sata mirror pair, read performance is in the range of a single disk. For the sas disks (only two available for testing) and for the scsi system there is quite good throughput scaling. Here, for comparison, a table for 1-4 36GB 15k U320 scsi disks on an old SXDE box (Nevada b130):

              seq write  factor    seq read  factor
              MB/sec               MB/sec
  single          82       1           78      1
  mirror          79       1          137      1.75
  2x mirror      120       1.5        251      3.2

This is exactly what's imho to be expected from mirrors and striped mirrors. It just doesn't happen for my sata pool. I still have no reference numbers for other sata pools, just one with the 4k/512-byte sector problem, which is even slower than mine. It seems the zfs performance people just use sas disks and are done.
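The export/import cycle used above to empty the ARC can be sketched as follows (pool name hypothetical):

```shell
# After the dd write pass, drop every cached block belonging to the pool:
zpool export ptest
zpool import ptest

# Confirm the cache is cold before timing the read pass:
arc_summary.pl | head
```

Unlike a plain unmount/remount of one filesystem, export/import restarts caching for the whole pool, which is why it is a reliable way to measure cold reads.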
Michael Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ old ibm dual opteron intellistation with external hp msa30, 36gb 15k u320 scsi disks pool: scsi1 state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM scsi1 ONLINE 0 0 0 c3t4d0ONLINE 0 0 0 errors: No known data errors Version 1.96 --Sequential Output-- --Sequential Input- --Random- Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP zfssingle 16G 137 99 82739 20 39453 9 314 99 78251 7 856.9 8 Latency 160ms4799ms5292ms 43210us3274ms2069ms Version 1.96 --Sequential Create-- Random Create zfssingle -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16 8819 34 + +++ 26318 68 20390 73 + +++ 26846 72 Latency 16413us 108us 231us 12206us 46us 124us 1.96,1.96,zfssingle,1,1342514790,16G,,137,99,82739,20,39453,9,314,99,78251,7,856.9,8,16,8819,34,+,+++,26318,68,20390,73,+,+++,26846,72,160ms,4799ms,5292ms,43210us,3274ms,2069ms,16413us,108us,231us,12206us,46us,124us ## pool: scsi1 state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKS
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Mon, 16 Jul 2012, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Michael Hase
>>
>> got some strange results, please see attachments for exact numbers and
>> pool config:
>>
>>           seq write  factor    seq read  factor
>>           MB/sec               MB/sec
>>   single     123       1          135      1
>>   raid0      114       1          249      2
>>   mirror      57       0.5        129      1
>
> I agree with you, these look wrong. Here is what you should expect:
>
>           seq W  seq R
>   single   1.0    1.0
>   stripe   2.0    2.0
>   mirror   1.0    2.0
>
> You have three things wrong:
> (a) stripe should write 2x
> (b) mirror should write 1x
> (c) mirror should read 2x
>
> I would have simply said "for some reason your drives are unable to
> operate concurrently," but you have the stripe reading 2x. I cannot
> think of a single reason that the stripe should be able to read 2x, and
> the mirror only 1x.

Yes, I think so too. In the meantime I switched the two disks to another box (HP xw8400, 2 Xeon 5150 CPUs, 16GB RAM). On this machine I did the previous sas tests. The OS is now OpenIndiana 151a (vs. OpenSolaris b130 before); the mirror pool was upgraded from version 22 to 28, and the raid0 pool was newly created. The results look quite different:

          seq write  factor    seq read  factor
          MB/sec               MB/sec
  raid0      236       2          330      2.5
  mirror     111       1          128      1

Now the raid0 case shows excellent performance; the 330 MB/sec are a bit on the optimistic side, maybe some arc cache effects (file size 32GB, 16GB RAM). iostat during a sequential read shows about 115 MB/sec from each disk, which is great. The (really desired) mirror case still has a problem with sequential reads. Sequential writes to the mirror are twice as fast as before and show the expected performance for a single disk. So only one thing is left: mirror should read 2x.

I suspect the difference is not the hardware; both boxes should have enough horsepower to easily do sequential reads at way more than 200 MB/sec. In all tests CPU time (user and system) remained quite low. I think it's an OS issue: OpenSolaris b130 is over 2 years old, OI 151a dates from 11/2011.

Could someone please send me some bonnie++ results for a 2-disk mirror or a 2x2-disk mirror pool with sata disks?

Michael

-- Michael Hase
http://edition-software.de
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Mon, 16 Jul 2012, Bob Friesenhahn wrote:

> On Mon, 16 Jul 2012, Michael Hase wrote:
>> This is my understanding of zfs: it should load balance read requests
>> even for a single sequential reader. zfs_prefetch_disable is at the
>> default of 0. And I can see exactly this scaling behaviour with sas
>> disks and with scsi disks, just not on this sata pool.
>
> Is the BIOS configured to use AHCI mode or is it using IDE mode?

Not relevant here; the disks are connected to an onboard sas hba (LSI 1068, see first post), and the hardware is a Primergy RX330 with 2 quad-core Opterons.

> Are the disks 512 bytes/sector or 4K?

512 bytes/sector, HDS721010CLA330

>> Maybe it's a corner case which doesn't matter in real-world
>> applications? The random seek values in my bonnie output show the
>> expected performance boost when going from one disk to a mirrored
>> configuration. It's just the sequential read/write case that's
>> different for sata and sas disks.
>
> I don't have a whole lot of experience with SATA disks, but it is my
> impression that you might see this sort of performance if the BIOS was
> configured so that the drives were used as IDE disks. If not that, then
> there must be a bottleneck in your hardware somewhere.

With early Nevada releases I had indeed the IDE/AHCI problem, albeit on different hardware. Solaris only ran in IDE mode, and the disks were 4 times slower than on Linux, see http://www.oracle.com/webfolder/technetwork/hcl/data/components/details/intel/sol_10_05_08/2999.html

Wouldn't a hardware bottleneck show up on raw dd tests as well? I can stream > 130 MB/sec from each of the two disks in parallel. dd reading from more than these two disks at the same time results in a slight slowdown, but here we're talking about nearly 400 MB/sec aggregated bandwidth through the onboard hba (the box has 6 disk slots):

                  extended device statistics
    r/s   w/s   Mr/s   Mw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
   94.5   0.0   94.5    0.0   0.0   1.0     0.0    10.5   0  100  c13t6d0
   94.5   0.0   94.5    0.0   0.0   1.0     0.0    10.6   0  100  c13t1d0
   93.0   0.0   93.0    0.0   0.0   1.0     0.0    10.7   0  100  c13t2d0
   94.5   0.0   94.5    0.0   0.0   1.0     0.0    10.5   0  100  c13t5d0

Don't know why this is a bit slower - maybe some PCIe bottleneck, or something with the mpt driver (intrstat shows only one CPU handles all mpt interrupts), or even the slow CPUs: these are 1.8GHz Opterons. During sequential reads from the zfs mirror I see > 1000 interrupts/sec on one CPU. So it could really be a bottleneck somewhere, triggered by the "smallish" 128k i/o requests from the zfs side. I think I'll benchmark again on a Xeon box with faster CPUs; my tests with sas disks were done on that other box.

Michael

> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Mon, 16 Jul 2012, Bob Friesenhahn wrote:

> On Mon, 16 Jul 2012, Stefan Ring wrote:
>>> It is normal for reads from mirrors to be faster than for a single
>>> disk because reads can be scheduled from either disk, with different
>>> I/Os being handled in parallel.
>>
>> That assumes that there *are* outstanding requests to be scheduled in
>> parallel, which would only happen with multiple readers or a large
>> read-ahead buffer.
>
> That is true. Zfs tries to detect the case of sequential reads and
> requests to read more data than the application has already requested.
> In this case the data may be prefetched from the other disk before the
> application has requested it.

This is my understanding of zfs: it should load balance read requests even for a single sequential reader. zfs_prefetch_disable is at the default of 0. And I can see exactly this scaling behaviour with sas disks and with scsi disks, just not on this sata pool.

zfs_vdev_max_pending is already tuned down to 3 as recommended for sata disks; iostat -Mxnz 2 looks something like this when reading from the zfs mirror:

     r/s   w/s   Mr/s   Mw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
   507.1   0.0   63.4    0.0   0.0   2.9     0.0     5.8   1  99  c13t5d0
   477.6   0.0   59.7    0.0   0.0   2.8     0.0     5.8   1  94  c13t4d0

The default zfs_vdev_max_pending=10 leads to much higher service times in the 20-30 msec range; throughput remains roughly the same. I can read from the dsk or rdsk devices in parallel with real platter speeds:

  dd if=/dev/dsk/c13t4d0s0 of=/dev/null bs=1024k count=8192 &
  dd if=/dev/dsk/c13t5d0s0 of=/dev/null bs=1024k count=8192 &

                  extended device statistics
     r/s    w/s   Mr/s   Mw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
   2467.5   0.0  134.9    0.0   0.0   0.9     0.0     0.4   1  87  c13t5d0
   2546.5   0.0  139.3    0.0   0.0   0.8     0.0     0.3   1  84  c13t4d0

So I think there is no problem with the disks. Maybe it's a corner case which doesn't matter in real-world applications? The random seek values in my bonnie output show the expected performance boost when going from one disk to a mirrored configuration. It's just the sequential read/write case that's different for sata and sas disks.

Michael

> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
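The queue-depth tuning mentioned above uses the zfs_vdev_max_pending kernel tunable; on Solaris-derived systems it can be changed at runtime with mdb or persistently via /etc/system (the value 3 matches what the thread uses):

```shell
# Runtime change, takes effect immediately but is lost on reboot
# (W0t3 writes the decimal value 3 to the 32-bit tunable):
echo zfs_vdev_max_pending/W0t3 | mdb -kw
```

For a persistent setting, add `set zfs:zfs_vdev_max_pending = 3` to /etc/system and reboot. Lower values trade per-disk queueing (and thus asvc_t) against the chance to reorder requests, which is why they tend to help sata disks more than sas ones.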
[zfs-discuss] zfs sata mirror slower than single disk
Hello list,

I did some bonnie++ benchmarks for different zpool configurations consisting of one or two 1TB sata disks (Hitachi HDS721010CLA332, 512 bytes/sector, 7.2k), and got some strange results. Please see the attachments for exact numbers and pool config:

          seq write  factor    seq read  factor
          MB/sec               MB/sec
  single     123       1          135      1
  raid0      114       1          249      2
  mirror      57       0.5        129      1

Each of the disks is capable of about 135 MB/sec sequential reads and about 120 MB/sec sequential writes; iostat -En shows no defects. The disks are 100% busy in all tests and show normal service times. This is on OpenSolaris b130; rebooting with the OpenIndiana 151a live CD gives the same results, and dd tests give the same results, too. The storage controller is an LSI 1068 using the mpt driver. The pools are newly created and empty. atime on/off doesn't make a difference.

Is there an explanation why

1) in the raid0 case the write speed is more or less the same as a single disk, and

2) in the mirror case the write speed is cut in half, and the read speed is the same as a single disk?

I'd expect about twice the performance for both reading and writing - maybe a bit less, but definitely more than measured. For comparison I did the same tests with 2 old 2.5" 36GB sas 10k disks maxing out at about 50-60 MB/sec on the outer tracks:

          seq write  factor    seq read  factor
          MB/sec               MB/sec
  single      38       1           50      1
  raid0       89       2          111      2
  mirror      36       1           92      2

Here we get the expected behaviour: raid0 with about double the performance for reading and writing, and the mirror with about the same performance for writing and double the speed for reading, compared to a single disk. An old scsi system with 4x2 mirror pairs also shows these scaling characteristics: about 450-500 MB/sec seq read and 250 MB/sec write, each disk capable of 80 MB/sec. I don't care about absolute numbers, I just don't get why the sata system is so much slower than expected, especially for a simple mirror. Any ideas?
Thanks, Michael -- Michael Hase http://edition-software.de pool: ptest state: ONLINE scan: none requested config: NAMESTATE READ WRITE CKSUM ptest ONLINE 0 0 0 c13t4d0 ONLINE 0 0 0 Version 1.96 --Sequential Output-- --Sequential Input- --Random- Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP zfssingle 32G79 98 123866 51 63626 35 255 99 135359 25 530.6 13 Latency 333ms 111ms5283ms 73791us 465ms2535ms Version 1.96 --Sequential Create-- Random Create zfssingle -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16 4536 40 + +++ 14140 50 10382 69 + +++ 6260 73 Latency 21655us 154us 206us 24539us 46us 405us 1.96,1.96,zfssingle,1,1342165334,32G,,79,98,123866,51,63626,35,255,99,135359,25,530.6,13,16,4536,40,+,+++,14140,50,10382,69,+,+++,6260,73,333ms,111ms,5283ms,73791us,465ms,2535ms,21655us,154us,206us,24539us,46us,405us ### pool: ptest state: ONLINE scan: none requested config: NAMESTATE READ WRITE CKSUM ptest ONLINE 0 0 0 c13t4d0 ONLINE 0 0 0 c13t5d0 ONLINE 0 0 0 Version 1.96 --Sequential Output-- --Sequential Input- --Random- Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP zfsstripe 32G78 98 114243 46 72938 37 192 77 249022 44 815.1 20 Latency 483ms 106ms5179ms3613ms 259ms1567ms Version 1.96 --Sequential Create-- Random Create zfsstripe -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16 6474 53 + +++ 15505 47 8562 81 + +++ 10839 65 Latency 21894us 131us 208us 22203us 52us 230us 1.96,1.96,zfsstripe,1,1342172768,32G,,78,98,114243,46,72938,37,192,77,249022,44,815.1,20,16,6474,53,+,+++,15505,47,8562,81,+,+++,10839,65,483ms,106ms,5179ms,3613ms,259ms,1567ms,21894us,131us,208us,22203us,52us,230us pool: ptest state: ONLINE scrub: none requested 
config: NAME STATE READ WRITE CKSUM ptestONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c13t4d0 O
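The attached results come from bonnie++. A comparable run - pool mount point hypothetical, file size set to 2x RAM as discussed in the thread to defeat the ARC - would be something like:

```shell
# -d: test directory on the pool, -s: file size (2x RAM),
# -n: number of files (in multiples of 1024) for the create/delete tests
bonnie++ -d /ptest -s 32g -n 16
```

The "Sequential Output"/"Sequential Input" block columns in the output correspond to the seq write/read MB/sec figures in the tables above.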
Re: [zfs-discuss] Drive upgrades
Yes this Is another thing im weary of... I should have slightly under provisioned at the start or mixed manufacturers... Now i may have to replace 2tb fails with 2.5 for the sake of a block Sent from my iPhone On 13 Apr 2012, at 17:30, Tim Cook wrote: > > > On Fri, Apr 13, 2012 at 9:35 AM, Edward Ned Harvey > wrote: > > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > > boun...@opensolaris.org] On Behalf Of Michael Armstrong > > > > Is there a way to quickly ascertain if my seagate/hitachi drives are as > large as > > the 2.0tb samsungs? I'd like to avoid the situation of replacing all > drives and > > then not being able to grow the pool... > > It doesn't matter. If you have a bunch of drives that are all approx the > same size but vary slightly, and you make (for example) a raidz out of them, > then the raidz will only be limited by the size of the smallest one. So you > will only be wasting 1% of the drives that are slightly larger. > > Also, given that you have a pool currently made up of 13x2T and 5x1T ... I > presume these are separate vdev's. You don't have one huge 18-disk raidz3, > do you? that would be bad. And it would also mean that you're currently > wasting 13x1T. I assume the 5x1T are a single raidzN. You can increase the > size of these disks, without any cares about the size of the other 13. > > Just make sure you have the autoexpand property set. > > But most of all, make sure you do a scrub first, and make sure you complete > the resilver in between each disk swap. Do not pull out more than one disk > (or whatever your redundancy level is) while it's still resilvering from the > previously replaced disk. If you're very thorough, you would also do a > scrub in between each disk swap, but if it's just a bunch of home movies > that are replaceable, you will probably skip that step. > > > You will however have an issue replacing them if one should fail. 
You need > to have the same block count to replace a device, which is why I asked for a > "right-sizing" years ago. Deaf ears :/ > > --Tim > > > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
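The one-disk-at-a-time workflow Edward describes (autoexpand on, replace, let the resilver finish, scrub between swaps) can be sketched as a dry run. Here `zpool` is stubbed to print each command instead of executing it, and the pool name "tank" and disk names c0t5d0/c0t6d0 are hypothetical placeholders:

```shell
# Dry-run sketch of the one-disk-at-a-time upgrade described above.
# The zpool stub records and prints each command instead of running it;
# "tank", c0t5d0 and c0t6d0 are hypothetical names.
RAN=""
zpool() { RAN="$RAN zpool $*;"; echo "zpool $*"; }

zpool set autoexpand=on tank    # let the pool grow once every member is bigger

for disk in c0t5d0 c0t6d0; do
  zpool replace tank "$disk"    # swap in the new, larger drive for this slot
  # wait until 'zpool status' shows the resilver has finished, then
  # (if the data matters) scrub before touching the next disk
  zpool scrub tank
done
```

Remove the stub and the sequence is exactly what you would type on a live system, one disk per iteration.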
[zfs-discuss] Drive upgrades
Hi Guys, I currently have an 18 drive system built from 13x 2.0TB Samsungs and 5x WD 1TB's... I'm about to swap out all of my 1TB drives with 2TB ones to grow the pool a bit... My question is: the replacement 2TB drives are from various manufacturers (Seagate/Hitachi/Samsung) and I know from previous experience that the geometry/boundaries of each manufacturer's 2TB offerings are different. Is there a way to quickly ascertain if my Seagate/Hitachi drives are as large as the 2.0TB Samsungs? I'd like to avoid the situation of replacing all drives and then not being able to grow the pool... Thanks, Michael
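What decides replaceability is the raw block count, which on Solaris you can read per drive from `prtvtoc /dev/rdsk/...` or `format`. A tiny sketch of the comparison, using made-up sector counts (the real numbers come from your own drives):

```shell
# Compare raw block counts before committing to a replacement drive.
# Both counts below are made-up example values; read the real ones
# from 'prtvtoc /dev/rdsk/...' or 'format' on Solaris.
SAMSUNG_BLOCKS=3907029168
SEAGATE_BLOCKS=3907027055

# A drive can only replace another if its block count is at least as large.
if [ "$SEAGATE_BLOCKS" -ge "$SAMSUNG_BLOCKS" ]; then
  VERDICT="ok to replace"
else
  VERDICT="too small by $((SAMSUNG_BLOCKS - SEAGATE_BLOCKS)) blocks"
fi
echo "$VERDICT"
```

Running this over each candidate drive up front avoids discovering a short drive mid-swap.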
Re: [zfs-discuss] S11 vs illumos zfs compatiblity
On 3 Jan 12, at 04:22 , Darren J Moffat wrote: > On 12/28/11 06:27, Richard Elling wrote: >> On Dec 27, 2011, at 7:46 PM, Tim Cook wrote: >>> On Tue, Dec 27, 2011 at 9:34 PM, Nico Williams >>> wrote: >>> On Tue, Dec 27, 2011 at 8:44 PM, Frank Cusack wrote: >>>> So with a de facto fork (illumos) now in place, is it possible that two >>>> zpools will report the same version yet be incompatible across >>>> implementations? >> >> This was already broken by Sun/Oracle when the deduplication feature was not >> backported to Solaris 10. If you are running Solaris 10, then zpool version >> 29 features >> are not implemented. > > Solaris 10 does have some deduplication support, it can import and read > datasets in a deduped pool just fine. You can't enable dedup on a dataset > and any writes won't dedup they will "rehydrate". > > So it is more like partial dedup support rather than it not being there at > all. "rehydrate"??? Is it instant or freeze dried? Mike --- Michael Sullivan m...@axsh.us http://www.axsh.us/ Phone: +1-662-259- Mobile: +1-662-202-7716
Re: [zfs-discuss] about btrfs and zfs
On Mon, Nov 14, 2011 at 14:40, Paul Kraus wrote: > On Fri, Nov 11, 2011 at 9:25 PM, Edward Ned Harvey > wrote: > >> LOL. Well, for what it's worth, there are three common pronunciations for >> btrfs. Butterfs, Betterfs, and B-Tree FS (because it's based on b-trees.) >> Check wikipedia. (This isn't really true, but I like to joke, after saying >> something like that, I wrote the wikipedia page just now.) ;-) > > Is it really B-Tree based? Apple's HFS+ is B-Tree based and falls > apart (in terms of performance) when you get too many objects in one > FS, which is specifically what drove us to ZFS. We had 4.5 TB of data > in about 60 million files/directories on an Apple X-Serve and X-RAID > and the overall response was terrible. We moved the data to ZFS and > the performance was limited by the Windows client at that point. > >> Speaking of which. zettabyte filesystem. ;-) Is it just a dumb filesystem >> with a lot of address bits? Or is it something that offers functionality >> that other filesystems don't have? ;-) > > The stories I have heard indicate that the name came after the TLA. > "zfs" came first and "zettabyte" later. as Jeff told it (IIRC), the "expanded" version of zfs underwent several changes during the development phase, until it was decided one day to attach none of them to "zfs" and just have it be "the last word in filesystems". (perhaps he even replied to a similar message on this list ... check the archives :-) regards -- Michael Schuster http://recursiveramblings.wordpress.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Remove corrupt files from snapshot
Hi, snapshots are read-only by design; you can clone them and manipulate the clone, but the snapshot itself remains r/o. HTH Michael On Thu, Nov 3, 2011 at 13:35, wrote: > > Hello, > > I have got a bunch of corrupted files in various snapshots on my ZFS file > backing store. I was not able to recover them so decided to remove all, > otherwise the continuously make trouble for my incremental backup (rsync, > diff etc. fails). > > However, snapshots seem to be read-only: > > # zpool status -v > pool: backups > state: ONLINE > status: One or more devices has experienced an error resulting in data > corruption. Applications may be affected. > action: Restore the file in question if possible. Otherwise restore the > entire pool from backup. > see: http://www.sun.com/msg/ZFS-8000-8A > scrub: none requested > config: > NAME STATE READ WRITE CKSUM > backups ONLINE 0 0 13 > md0 ONLINE 0 0 13 > errors: Permanent errors have been detected in the following files: > /backups/memory_card/.zfs/snapshot/20110218230726/Backup/Backup.arc > ... > > # rm /backups/memory_card/.zfs/snapshot/20110218230726/Backup/Backup.arc > rm: /backups/memory_card/.zfs/snapshot/20110218230726/Backup/Backup.arc: > Read-only file system > > > Is there any way to force the file removal? > > > Cheers, > B. > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Michael Schuster http://recursiveramblings.wordpress.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
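Since the snapshot itself is immutable, the only ways to get rid of the corrupt copies are destroying the snapshots that reference them, or cloning a snapshot and cleaning up the writable clone. A stubbed dry run using the dataset names from the post (`zfs` prints instead of executing):

```shell
# Dry run: the zfs stub prints each command instead of running it.
# Option 1: destroy the snapshot -- the only way to drop a file it holds.
# Option 2: clone it, delete the bad file from the writable clone.
RAN=""
zfs() { RAN="$RAN zfs $*;"; echo "zfs $*"; }

# Option 1:
zfs destroy backups/memory_card@20110218230726

# Option 2 (if the rest of the snapshot's contents must be kept;
# the clone name is a hypothetical placeholder):
zfs clone backups/memory_card@20110218230726 backups/memory_card_fixed
# ...then rm the corrupt file inside the clone, which is read-write.
```

Note the two options conflict on a real system: a snapshot with a live clone cannot be destroyed until the clone is promoted or destroyed.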
Re: [zfs-discuss] about btrfs and zfs
Or, if you absolutely must run linux for the operating system, see: http://zfsonlinux.org/ On Oct 17, 2011, at 8:55 AM, Freddie Cash wrote: > If you absolutely must run Linux on your storage server, for whatever reason, > then you probably won't be running ZFS. For the next year or two, it would > probably be safer to run software RAID (md), with LVM on top, with XFS or > Ext4 on top. It's not the easiest setup to manage, but it would be safer > than btrfs. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] commercial zfs-based storage replication software?
On 1 Oct 11, at 08:01 , Edward Ned Harvey wrote: >> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- >> boun...@opensolaris.org] On Behalf Of Fajar A. Nugraha >> >> On Sat, Oct 1, 2011 at 5:06 AM, Edward Ned Harvey >> wrote: >>> Have you looked at Sun Unified Storage, AKA the 7000 series? >> >> Thanks, that would be my fallback plan (along with nexentastor and > netapp). > > So you're basically looking for installable 3rd party software that > replicates that functionality? I don't know of any, but that's not saying > much, because when it comes to ZFS, I'm not very platform explorative. As I said before, hack an open source job scheduler, or find one which allows creating jobs with parameters or panels for custom fields, to put together the crontab command and wrap it in something which preserves the output of cron rather than emailing it, but stores it in a database or something, as well as keeps track of success or failure and notifies someone in the event of failure and/or restarts. Which also probably means it needs to do distributed process management to kick off everything it needs to. It should probably be ZFS aware so it can present filesystems and select based on filesystem rather than job. Oracle Enterprise Manager does this. It's commercial, and I'm sure they would negotiate on price for you and give you a good deal if you are good at bargaining with your Oracle Sales Rep. I think his requirements are being driven by a PHB who wants to see a "GUI". crontab, ssh - functionality already there, simple and not many "moving parts", but obviously too obfuscated for the PHB to understand. Good luck. Mike --- Michael Sullivan m...@axsh.us http://www.axsh.us/ Phone: +1-662-259- Mobile: +1-662-202-7716
Re: [zfs-discuss] commercial zfs-based storage replication software?
Maybe I'm missing something here, but Amanda has a whole bunch of bells and whistles, and scans the filesystem to determine what should be backed up. Way overkill for this task I think. Seems to me like zfs send blah | ssh replicatehost zfs receive … more than meets the requirement when combined with just plain old crontab. If it's a graphical interface you're looking for, I'm sure someone has hacked together something in Tcl/Tk or Perl/Tk as an interface to cron which you could probably hack to construct your particular crontab entry. Just a thought, Mike --- Michael Sullivan m...@axsh.us http://www.axsh.us/ Phone: +1-662-259- Mobile: +1-662-202-7716 On 30 Sep 11, at 07:33 , Fajar A. Nugraha wrote: > On Fri, Sep 30, 2011 at 7:22 PM, Edward Ned Harvey > wrote: >>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- >>> boun...@opensolaris.org] On Behalf Of Fajar A. Nugraha >>> >>> Does anyone know a good commercial zfs-based storage replication >>> software that runs on Solaris (i.e. not an appliance, not another OS >>> based on solaris)? >>> Kinda like Amanda, but for replication (not backup). >> >> Please define replication, not backup? To me, your question is unclear what >> you want to accomplish. What don't you like about zfs send | zfs receive? > > Basically I need something that does zfs send | zfs receive, plus > GUI/web interface to configure stuff (e.g. which fs to backup, > schedule, etc.), support, and a price tag. > > Believe it or not the last two requirement are actually important > (don't ask :P ), and are the main reasons why I can't use automated > send - receive scripts already available from the internet. > > CMIIW, Amanda can use "zfs send" but it only store the resulting > stream somewhere, while the requirement for this one is that the send > stream must be received on a different server (e.g. DR site) and be > accessible there. 
> > -- > Fajar > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Mike --- Michael Sullivan m...@axsh.us http://www.axsh.us/ Phone: +1-662-259- Mobile: +1-662-202-7716 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
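The "plain old crontab" pipeline Mike refers to boils down to an incremental send/receive. A stubbed dry-run sketch: `zfs` and `ssh` print instead of executing, and every host, dataset, and snapshot name here is a hypothetical placeholder:

```shell
# Incremental replication sketch: snapshot, then ship the delta since the
# previous snapshot to the DR host. zfs/ssh are stubbed to print; all of
# tank/data, drhost, backup/data and the snapshot names are hypothetical.
RAN=""
zfs() { RAN="$RAN zfs $*;"; echo "zfs $*"; }
ssh() { cat >/dev/null; echo "ssh $*"; }   # stub: swallow stdin, print args

SRC=tank/data DST_HOST=drhost DST=backup/data
PREV=daily-2011-09-29 CUR=daily-2011-09-30

zfs snapshot "$SRC@$CUR"
zfs send -i "$SRC@$PREV" "$SRC@$CUR" | ssh "$DST_HOST" zfs receive -F "$DST"

# Wrapped in a script, the crontab entry is a single line, e.g.:
# 0 2 * * * /usr/local/bin/replicate.sh
```

The real version needs bookkeeping to rotate $PREV/$CUR and to notice a failed send — which is exactly the part the GUI products are selling.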
Re: [zfs-discuss] I'm back!
Warm welcomes back. So whats neXt? - Mike DeMan On Sep 2, 2011, at 6:30 PM, Erik Trimble wrote: > Hi folks. > > I'm now no longer at Oracle, and the past couple of weeks have been a bit of > a mess for me as I disentangle myself from it. > > I apologize to those who may have tried to contact me during August, as my > @oracle.com email is no longer being read by myself, and I didn't have a lot > of extra time to devote to things like making sure my email subscription > lists pointed to my personal email. I've done that now. > > I now have a free(er) hand to do some work in IllumOS (hopefully, in ZFS in > particular), so I'm looking forward to getting back into the swing of things. > And, hopefully, not be too much of a PITA. > > :-) > > -Erik Trimble > tr...@netdemons.com > > > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Advice with SSD, ZIL and L2ARC
Are you truly new to ZFS? Or do you work for NetApp or EMC or somebody else that is curious? - Mike On Aug 29, 2011, at 9:15 PM, Jesus Cea wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Hi all. Sorry if I am asking a FAQ, but I haven't found a really > authorizative answer to this. Most references are old, incomplete or > of "I have heard of" kind. > > I am running Solaris 10 Update 9, and my pool is v22. > > I recently got two 40GB SSD I plan to add to my pool. My idea is this: > > 1. Format each SSD as 39GB+1GB. > 2. Use the TWO 39GB's as L2ARC, with no redundancy. > 3. Use the TWO 1GB's as mirrored ZIL. > > 1GB of ZIL seems more than enough for my needs. I have synchronous > writes, but they are, 99.9% of the time, <1MB/s, with occasional bursts. > > My main concern here is about pool stability if there have any kind of > problem with the SSD's. Especifically: > > 1. Is the L2ARC data stored in the SSD checksummed?. If so, can I > expect that ZFS goes directly to the disk if the checksum is wrong?. > > 2. Can I import a POOL if one/both L2ARC's are not available?. > > 3. What happend if a L2ARC device, suddenly, "dissappears"?. > > 4. Any idea if L2ARC content will be persistent across system > rebooting "eventually"? > > 5. Can I import a POOL if one/both ZIL devices are not available?. My > pool is v22. I know that I can remove ZIL devices since v19, but I > don't know if I can remove them AFTER they are physically unavailable, > of before importing the pool (after a reboot). > > 6. Can I remove a ZIL device after ZFS consider it "faulty"?. > > 7. What if a ZIL device "dissapears", suddenly?. I know that I could > lose "committed" transactions in-fight, but will the machine crash?. > Will it fallback to ZIL on harddisk?. > > 8. 
Since my ZIL will be mirrored, I assume that the OS will actually > will look for transactions to be replayed in both devices (AFAIK, the > ZIL chain is considered done when the checksum of the last block is > not valid, and I wonder how this interacts with ZIL device mirroring). > > 9. If a ZIL device mirrored goes offline/online, will it resilver from > the other side, or it will simply get new transactions, since old > transactions are irrelevant after ¿30? seconds?. > > 10. What happens if my 1GB of ZIL is too optimistic?. Will ZFS use the > disks or it will stop writers until flushing ZIL to the HDs?. > > Anything else I should consider?. > > As you can see, my concerns concentrate in what happens if the SSDs go > bad or "somebody" unplugs them "live". > > I have backup of (most) of my data, but rebuilding a 12TB pool from > backups, in a production machine, in a remote hosting, would be > something I rather avoid :-p. > > I know that hybrid HD+SSD pools were a bit flacky in the past (you > lost the ZIL device, you kiss goodbye to your ZPOOL, in the pre-v19 > days), and I want to know what terrain I am getting into. > > PS: I plan to upgrade to S10 U10 when available, and I will upgrade > the ZPOOL version after a while. > > - -- > Jesus Cea Avion _/_/ _/_/_/_/_/_/ > j...@jcea.es - http://www.jcea.es/ _/_/_/_/ _/_/_/_/ _/_/ > jabber / xmpp:j...@jabber.org _/_/_/_/ _/_/_/_/_/ > . 
_/_/ _/_/_/_/ _/_/ _/_/ > "Things are not so easy" _/_/ _/_/_/_/ _/_/_/_/ _/_/ > "My name is Dump, Core Dump" _/_/_/_/_/_/ _/_/ _/_/ > "El amor es poner tu felicidad en la felicidad de otro" - Leibniz > -BEGIN PGP SIGNATURE- > Version: GnuPG v1.4.10 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iQCVAwUBTlxjxplgi5GaxT1NAQLi9AP/VW2LQqij6y25KQ3c5EDBWvnnL1Z7R65j > BJ0N1EbWW6ZdkQ9uFoLNJBVb8xPgwpTOKuy5g8FTwrjs1Sc5a3E3DbRDUg75faE5 > 4IOgCi0gtIVyrxGEQ2AAhnKHGcto/2gB9Y5KRiibBeysbqNvr0HXQsko7WRauP96 > N1L1TqFsN8E= > =sDRY > -END PGP SIGNATURE- > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
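Jesus's planned layout (two SSDs, each split into a 1GB slice for a mirrored ZIL and a 39GB slice for unmirrored L2ARC) would be attached roughly as below. Stubbed dry run; the pool name "tank" and the slice names are hypothetical:

```shell
# Dry run of the planned layout: mirrored 1GB log slices, two plain
# 39GB cache slices. zpool is stubbed to print; "tank" and the
# c2t*d0s* slice names are hypothetical.
RAN=""
zpool() { RAN="$RAN zpool $*;"; echo "zpool $*"; }

zpool add tank log mirror c2t0d0s0 c2t1d0s0   # mirrored ZIL (slog)
zpool add tank cache c2t0d0s1 c2t1d0s1        # L2ARC needs no redundancy
```

The asymmetry matches the failure modes he asks about: L2ARC is a checksummed read cache, so a dead cache device costs performance only, while the log holds not-yet-committed synchronous writes and so is the part worth mirroring.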
Re: [zfs-discuss] Kernel panic on zpool import. 200G of data inaccessible!
I can not help but agree with Tim's comment below. If you want a free version of ZFS, in which case you are still responsible for things yourself - like having backups, then maybe: www.freenas.org www.linuxonzfs.org www.openindiana.org Meanwhile, it is grossly inappropriate to be complaining about lack of support when using an operating system / file system that you know has no support. Doubly so if your data is important and doubly so again if did not already back it up. - mike On Aug 19, 2011, at 6:54 AM, Tim Cook wrote: > > > You digitally signed a license agreement stating the following: > No Technical Support > Our technical support organization will not provide technical support, phone > support, or updates to you for the Programs licensed under this agreement. > > To turn around and keep repeating that they're "holding your data hostage" is > disingenuous at best. Nobody is holding your data hostage. You voluntarily > put it on an operating system that explicitly states doesn't offer support > from the parent company. Nobody from Oracle is going to show up with a patch > for you on this mailing list because none of the Oracle employees want to > lose their job and subsequently be subjected to a lawsuit. If that's what > you're planning on waiting for, I'd suggest you take a new approach. > > Sorry to be a downer, but that's reality. > > --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disable ZIL - persistent
On 5 Aug 11, at 08:14 , Darren J Moffat wrote: > On 08/05/11 13:11, Edward Ned Harvey wrote: >> >> My question: Is there any way to make Disabled ZIL a normal mode of >> operations in solaris 10? Particularly: >> >> If I do this "echo zil_disable/W0t1 | mdb -kw" then I have to remount >> the filesystem. It's kind of difficult to do this automatically at boot >> time, and impossible (as far as I know) for rpool. The only solution I >> see is to write some startup script which applies it to filesystems >> other than rpool. Which feels kludgy. Is there a better way? > > echo "set zfs:zil_disable = 1" > /etc/system

Make that an append rather than an overwrite, so the rest of /etc/system survives:

echo "set zfs:zil_disable = 1" >> /etc/system

Mike --- Michael Sullivan m...@axsh.us http://www.axsh.us/ Phone: +1-662-259- Mobile: +1-662-202-7716
Re: [zfs-discuss] Zil on multiple usb keys
+1 on the below, and in addition... the flash in USB sticks is not designed to survive very many writes. Commonly it is used to store a bootable image that maybe once a year gets an upgrade. Basically, if you try to use those devices for a ZIL, even if they are mirrored, you should be prepared to have one die and be replaced very, very regularly. Performance is generally going to be pretty bad as well - USB sticks are not made to be written to rapidly. They are entirely different animals than SSDs. I would not be surprised (but would be curious to know, if you still move forward on this) if you find performance even worse trying to do this. On Jul 18, 2011, at 1:54 AM, Fajar A. Nugraha wrote: > First of all, using USB disks for permanent storage is a bad idea. Go > for e-sata instead (http://en.wikipedia.org/wiki/Serial_ata#eSATA). It
Re: [zfs-discuss] Resizing ZFS partition, shrinking NTFS?
On 17 Jun 11, at 21:14 , Bob Friesenhahn wrote: > On Fri, 17 Jun 2011, Jim Klimov wrote: >> I gather that he is trying to expand his root pool, and you can >> not add a vdev to one. Though, true, it might be possible to >> create a second, data pool, in the partition. I am not sure if >> zfs can make two pools in different partitions of the same >> device though - underneath it still uses Solaris slices, and >> I think those can be used on one partition. That was my >> assumption for a long time, though never really tested. > > This would be a bad assumption. Zfs should not care and you are able to do > apparently silly things with it. Sometimes allowing potentially silly things > is quite useful. > This is true. If one has mirrored disks, you could do something like I explain here WRT partitioning and resizing pools. http://www.kamiogi.net/Kamiogi/Frame_Dragging/Entries/2009/5/19_Everything_in_Its_Place_-_Moving_and_Reorganizing_ZFS_Storage.html I did some shuffling using Solaris partitions here on a home server, but it was using mirrors of the same geometry disks. You might be able to o a similar shuffle using an external USB drive which was appropriately sized and turn on autoexpand. Mike --- Michael Sullivan m...@axsh.us http://www.axsh.us/ Phone: +1-662-259- Mobile: +1-662-202-7716 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] question about COW and snapshots
On 17 Jun 11, at 21:02 , Ross Walker wrote: > On Jun 16, 2011, at 7:23 PM, Erik Trimble wrote: > >> On 6/16/2011 1:32 PM, Paul Kraus wrote: >>> On Thu, Jun 16, 2011 at 4:20 PM, Richard Elling >>> wrote: >>> >>>> You can run OpenVMS :-) >>> Since *you* brought it up (I was not going to :-), how does VMS' >>> versioning FS handle those issues ? >>> >> It doesn't, per se. VMS's filesystem has a "versioning" concept (i.e. every >> time you do a close() on a file, it creates a new file with the version >> number appended, e.g. foo;1 and foo;2 are the same file, different >> versions). However, it is completely missing the rest of the features we're >> talking about, like data *consistency* in that file. It's still up to the >> app using the file to figure out what data consistency means, and such. >> Really, all VMS adds is versioning, nothing else (no API, no additional >> features, etc.). > > I believe NTFS was built on the same concept of file streams the VMS FS used > for versioning. > > It's a very simple versioning system. > > Personnally I use Sharepoint, but there are other content management systems > out there that provide what your looking for, so no need to bring out the > crypt keeper. > I think from following this whole discussion people are wanting "Versions" which will be offered by OS X Lion soon. However, it is dependent upon applications playing nice,behaving and using the "standard" API's. It would likely take a major overhaul in the way ZFS handles snapshots to create them at the object level rather than the filesystems level. Might be a nice exploratory exercise for those in the know with the ZFS roadmap, but then there are two "roadmaps" right? Also consistency and integrity cannot be guaranteed on the object level since an application may have more than a single filesystem object in use at a time and operations would need to be transaction based with commits and rollbacks. 
Way off-topic, but Smalltalk and its variants do this by maintaining the state of everything in an operating environment image. But then again, I could be wrong. Mike --- Michael Sullivan m...@axsh.us http://www.axsh.us/ Phone: +1-662-259- Mobile: +1-662-202-7716 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Resizing ZFS partition, shrinking NTFS?
On 17.06.2011 01:44, John D Groenveld wrote: In message<444915109.61308252125289.JavaMail.Twebapp@sf-app1>, Clive Meredith writes: I currently run a dual boot machine with a 45GB partition for Win7 Ultimate and a 25GB partition for OpenSolaris 10 (134). I need to shrink NTFS to 20GB and increase the ZFS partition to 45GB. Is this possible please? I have looked at using the partition tool in OpenSolaris but both partitions are locked, even under admin. Win7 won't allow me to shrink the dynamic volume, as the Finish button is always greyed out, so no luck in that direction. Shrink the NTFS filesystem first. I've used the Knoppix LiveCD against a defragmented NTFS. Then use beadm(1M) to duplicate your OpenSolaris BE to a USB drive and also send snapshots of any other rpool ZFS there. I'd suggest a somewhat different approach:

1) boot a live CD and use something like parted to shrink the NTFS partition
2) create a new, empty partition in the space freed from NTFS
3) boot OpenSolaris and add the partition from 2) as a vdev to your zpool.

HTH Michael -- Michael Schuster http://recursiveramblings.wordpress.com/
Re: [zfs-discuss] question about COW and snapshots
On 15.06.2011 14:30, Simon Walter wrote: Another one is that snapshots are per-filesystem, while the intention here is to capture a document in one user session. Taking a snapshot will of course say nothing about the state of other user sessions. Any document in the process of being saved by another user, for example, will be corrupt. Would it be? I think that's pretty lame for ZFS to corrupt data. I think "corrupt" is not the right word to use here - "inconsistent" is probably better. ZFS has no idea when a document is "OK", so if your snapshot happens between two writes (even from a single user), it will be consistent from the POV of the FS, but may not be from the POV of the application. HTH Michael -- Michael Schuster http://recursiveramblings.wordpress.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Have my RMA... Now what??
Yes, particularly if you have older drives with 512 sectors and then buy a newer drive that seems the same, but is not, because it has 4k sectors. Looks like it works, and will work, but performance drops. On May 28, 2011, at 4:59 PM, Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D. wrote: > yes good idea, another things to keep in mind > technology change so fast, by the time you want a replacement, may be HDD > does exist any more > or the supplier changed, so the drives are not exactly like your original > drive > > > > > On 5/28/2011 6:05 PM, Michael DeMan wrote: >> Always pre-purchase one extra drive to have on hand. When you get it, >> confirm it was not dead-on-arrival by hooking up on an external USB to a >> workstation and running whatever your favorite tools are to validate it is >> okay. Then put it back in its original packaging, and put a label on it >> about what it is, and that it is a spare for box(s) XYZ disk system. >> >> When a drive fails, use that one off the shelf to do your replacement >> immediately then deal with the RMA, paperwork, and snailmail to get the bad >> drive replaced. >> >> Also, depending how many disks you have in your array - keeping multiple >> spares can be a good idea as well to cover another disk dying while waiting >> on that replacement one. >> >> In my opinion, the above goes whether you have your disk system configured >> with hot spare or not. And the technique is applicable to both >> personal/home-use and commercial uses if your data is important. >> >> >> - Mike >> >> On May 28, 2011, at 9:30 AM, Brian wrote: >> >>> I have a raidz2 pool with one disk that seems to be going bad, several >>> errors are noted in iostat. I have an RMA for the drive, however - no I am >>> wondering how I proceed. I need to send the drive in and then they will >>> send me one back. If I had the drive on hand, I could do a zpool replace. >>> >>> Do I do a zpool offline? zpool detach? >>> Once I get the drive back and put it in the same drive bay.. 
Is it just a >>> zpool replace? >>> -- >>> This message posted from opensolaris.org >>> ___ >>> zfs-discuss mailing list >>> zfs-discuss@opensolaris.org >>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >> ___ >> zfs-discuss mailing list >> zfs-discuss@opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Have my RMA... Now what??
Always pre-purchase one extra drive to have on hand. When you get it, confirm it was not dead-on-arrival by hooking up on an external USB to a workstation and running whatever your favorite tools are to validate it is okay. Then put it back in its original packaging, and put a label on it about what it is, and that it is a spare for box(s) XYZ disk system. When a drive fails, use that one off the shelf to do your replacement immediately then deal with the RMA, paperwork, and snailmail to get the bad drive replaced. Also, depending how many disks you have in your array - keeping multiple spares can be a good idea as well to cover another disk dying while waiting on that replacement one. In my opinion, the above goes whether you have your disk system configured with hot spare or not. And the technique is applicable to both personal/home-use and commercial uses if your data is important. - Mike On May 28, 2011, at 9:30 AM, Brian wrote: > I have a raidz2 pool with one disk that seems to be going bad, several errors > are noted in iostat. I have an RMA for the drive, however - no I am > wondering how I proceed. I need to send the drive in and then they will send > me one back. If I had the drive on hand, I could do a zpool replace. > > Do I do a zpool offline? zpool detach? > Once I get the drive back and put it in the same drive bay.. Is it just a > zpool replace ? > -- > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
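With a spare already on the shelf, the swap Brian asks about reduces to offline, physical swap, replace — no detach needed when the new disk goes into the same bay. Stubbed dry run; the pool name "tank" and disk name c0t5d0 are hypothetical:

```shell
# Dry run of replacing a failing raidz2 member with the on-hand spare.
# zpool is stubbed to print; "tank" and c0t5d0 are hypothetical names.
RAN=""
zpool() { RAN="$RAN zpool $*;"; echo "zpool $*"; }

zpool offline tank c0t5d0   # take the failing disk out of service
# ...physically swap the drive in the same bay, then:
zpool replace tank c0t5d0   # resilver onto the new disk in that slot
zpool status tank           # watch until the resilver completes
```

Redundancy is degraded from the offline until the resilver finishes, which is why the RMA round-trip is done afterwards, with the pool already healthy again.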
Re: [zfs-discuss] best migration path from Solaris 10
I think on this, the big question is going to be whether Oracle continues to release ZFS updates under CDDL after their commercial releases. Overall, in the past it has obviously and necessarily been the case that FreeBSD has been a '2nd class citizen'. Moving forward, that 2nd class idea becomes very mutable - and ironically it becomes more so in regards to dealing with organizations that have longevity. Moving forward... If Oracle continues to release critical ZFS feature sets under CDDL to the community, then: A) They are no longer pre-releasing those features to OpenSolaris B) FreeBSD gets them at the same time. If Oracle does not continue to release ZFS feature sets under CDDL, then the game changes. Pick your choice of operating systems - one that has a history of surviving for nearly two decades on its own with community support, or the 'green leaf off the dead tree' that just decided to jump into the willy-nilly world without direct/giant corporate support. The 2nd class citizen issue for FreeBSD disappears either way. The only remaining question would be the remaining cruft of legal disposition. I could for instance see NetApp or somebody try and sue ixSystems, but I have a really, really rough time seeing Oracle/LarryEllison suing the FreeBSD foundation overall or something? Oh yeah - plus BTRFS on the horizon? Honestly - I am not here to start a flame war - I am asking these questions because businesses both big and small need to know what to do. My hunch is, we all have to wait and see if Oracle releases ZFS updates after Solaris 11, and if so, whether that is a subset of functionality or full functionality. - mike On Mar 19, 2011, at 11:54 PM, Fajar A. Nugraha wrote: > On Sun, Mar 20, 2011 at 4:05 AM, Pawel Jakub Dawidek wrote: >> On Fri, Mar 18, 2011 at 06:22:01PM -0700, Garrett D'Amore wrote: >>> Newer versions of FreeBSD have newer ZFS code. >> >> Yes, we are at v28 at this point (the lastest open-source version). 
>> >>> That said, ZFS on FreeBSD is kind of a 2nd class citizen still. [...] >> >> That's actually not true. There are more FreeBSD committers working on >> ZFS than on UFS. > > How is the performance of ZFS under FreeBSD? Is it comparable to that > in Solaris, or still slower due to some needed compatibility layer? > > -- > Fajar > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] best migration path from Solaris 10
ZFSv28 is in HEAD now and will be out in 8.3. ZFS + HAST in 9.x means being able to cluster off different hardware. With regard to OpenSolaris and Indiana - can somebody clarify the relationship there? It was clear with OpenSolaris that the latest/greatest ZFS would always be available, since it was a guinea-pig product for cost-conscious folks and served as an excellent area for Sun to get marketplace feedback and bug fixes done before rolling updates into full Solaris. To me it seems that OpenIndiana is basically a green branch off of a dead tree - if I am wrong, please enlighten me. On Mar 18, 2011, at 6:16 PM, Roy Sigurd Karlsbakk wrote: >> I think we all feel the same pain with Oracle's purchase of Sun. >> >> FreeBSD that has commercial support for ZFS maybe? > > FreeBSD currently has a very old zpool version, not suitable for running with > SLOGs, since if you lose it, you may lose the pool, which isn't very > amusing... > > Vennlige hilsener / Best regards > > roy > -- > Roy Sigurd Karlsbakk > (+47) 97542685 > r...@karlsbakk.net > http://blogg.karlsbakk.net/ > -- > In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Re: [zfs-discuss] [OpenIndiana-discuss] best migration path from Solaris 10
Hi David, Caught your note about bonnie - actually doing some testing myself over the weekend. All on older hardware for fun - dual Opteron 285 with 16GB RAM. Disk system is off a pair of SuperMicro SATA cards, with a combination of WD enterprise and Seagate ES 1TB drives. No ZIL, no L2ARC, no tuning at all from the base FreeNAS install. 10 drives total. I'm going to be running tests as below, mostly curious about IOPS and to sort out a little debate with a co-worker:
- all 10 in one raidz2 (running now)
- 5 by 2-way mirrors
- 2 by 5-disk raidz1
Script is as below - if folks would find the data I collect useful at all, let me know and I will post it publicly somewhere.

freenas# cat test.sh
#!/bin/sh
# Basic test for file I/O. We run lots and lots of the traditional
# 'bonnie' tool at 50GB file size, starting one every minute. Resulting
# data should give us a good work mixture in the middle given all the different
# tests that bonnie runs, 100 instances running at the same time, and at different
# stages of their processing.
MAX=100
COUNT=0
FILESYSTEM=testrz2
LOG=${FILESYSTEM}.log
date > ${LOG}
echo "Test with file system named ${FILESYSTEM} and Configuration of..." >> ${LOG}
zpool status >> ${LOG}
# DEMAN grab zfs and regular dev iostats every 10 minutes during test
zpool iostat -v 600 >> ${LOG} &
iostat -w 600 ada0 ada1 ada2 ada3 ada4 ada5 ada6 ada7 ada8 ada9 > ${LOG}.iostat &
while [ ${COUNT} -le ${MAX} ]; do
    echo kicking off bonnie
    bonnie -d /mnt/${FILESYSTEM} -s 5 &
    sleep 60
    COUNT=$((COUNT+1))
done

On Mar 18, 2011, at 3:26 PM, David Brodbeck wrote: > I'm in a similar position, so I'll be curious what kinds of responses you > get. I can give you a thumbnail sketch of what I've looked at so far: > > I evaluated FreeBSD, and ruled it out because I need NFSv4, and FreeBSD's > NFSv4 support is still in an early stage. The NFS stability and performance > just isn't there yet, in my opinion. 
> > Nexenta Core looked promising, but locked up in bonnie++ NFS testing with our > RedHat nodes, so its stability is a bit of a question mark for me. > > I haven't gotten the opportunity to thoroughly evaluate OpenIndiana, yet. > It's only available as a DVD ISO, and my test machine currently has only a > CD-ROM drive. Changing that is on my to-do list, but other things keep > slipping in ahead of it. > > For now I'm running OpenSolaris, with a locally-compiled version of Samba. > (The OpenSolaris Samba package is very old and has several unpatched security > holes, at this point.) > > -- > David Brodbeck > System Administrator, Linguistics > University of Washington > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] best migration path from Solaris 10
I think we all feel the same pain with Oracle's purchase of Sun. FreeBSD that has commercial support for ZFS maybe? Not here quite yet, but it is something being looked at by an F500 that I am currently on contract with. www.freenas.org, www.ixsystems.com. Not saying this would be the right solution by any means, but for that 'corporate barrier', sometimes the option to get both the hardware and ZFS from the same place, with support, helps out. - mike On Mar 18, 2011, at 2:56 PM, Paul B. Henson wrote: > We've been running Solaris 10 for the past couple of years, primarily to > leverage zfs to provide storage for about 40,000 faculty, staff, and students > as well as about 1000 groups. Access is provided via NFSv4, CIFS (by samba), > and http/https (including a local module allowing filesystem acl's to be > respected via web access). This has worked reasonably well barring some > ongoing issues with scalability (approximately a 2 hour reboot window on an > x4500 with ~8000 zfs filesystems, complete breakage of live upgrade) and > acl/chmod interaction madness. > > We were just about to start working on a cutover to OpenSolaris (for the > in-kernel CIFS server, and quicker access to new features/developments) when > Oracle finished assimilating Sun and killed off the OpenSolaris distribution. > We've been sitting pat for a while to see how things ended up shaking out, > and at this point want to start reevaluating our best migration option to > move forward from Solaris 10. > > There's really nothing else available that is comparable to zfs (perhaps > btrfs someday in the indefinite future, but who knows when that day might > come), so our options would appear to be Solaris 11 Express, Nexenta (either > NexentaStor or NexentaCore), and OpenIndiana (FreeBSD is occasionally > mentioned as a possibility, but I don't really see that as suitable for our > enterprise needs). 
> > Solaris 11 is the official successor to OpenSolaris, has commercial support, > and the backing of a huge corporation which historically has contributed the > majority of Solaris forward development. However, that corporation is Oracle, > and frankly, I don't like doing business with Oracle. With no offense > intended to the no doubt numerous talented and goodhearted people that might > work there, Oracle is simply evil. We've dealt with Oracle for a long time > (in addition to their database itself, we're a PeopleSoft shop) and a > positive interaction with them is quite rare. Since they took over Sun, costs > on licensing, support contracts, and hardware have increased dramatically, at > least in the cases where we've actually been able to get a quote. Arguably, > we are not their target market, and they make that quite clear ;). There's > also been significant brain drain of prior Sun employees since the takeover, > so while they might still continue to contribute the most money into Solaris > development, they might not be the future source of the most innovation. Given our needs, and our budget, I really don't consider this a viable option. > > Nexenta, on the other hand, seems to be the kind of company I'd like to deal > with. Relatively small, nimble, with a ton of former Sun zfs talent working > for them, and what appears to be actual consideration for the needs of their > customers. I think I'd more likely get my needs addressed through Nexenta, > they've already started work on adding aclmode back and I've had some initial > discussion with one of their engineers on the possibility of adding > additional options such as denying or ignoring attempted chmod updates on > objects with acls. It looks like they only offer commercial support for > NexentaStor, not NexentaCore. 
Commercial support isn't a strict requirement, > a sizable chunk of our infrastructure runs on a non-commercial linux > distribution and open source software, but it can make management happier. > NexentaStor seems positioned as a storage appliance, which isn't really what > we need. I'm not particularly interested in a web gui or cli interface that > hides the underlying complexity of the operating system and zfs, on the contrary, I want full access to the guts :). We have our zfs deployment integrated into our identity management system, which automatically provisions, destroys, and maintains filespace for our user/groups, as well as providing an API for end-users and administrators to manage quotas and other attributes. We also run apache with some custom modules. I still need to investigate further, but I'm not even sure if NexentaStor provides access into the underlying OS or encapsulates everything and only allows control through its own administrative functionality. > > NexentaCore is more of the raw operating system we're probably looking for, > but with only community-based support. Given that NexentaCore and OpenIndiana > are now both going to be based off of the illumos core, I'm no
Re: [zfs-discuss] zfs-discuss Digest, Vol 64, Issue 21
I obtained smartmontools (which includes smartctl) from the standard apt repository (I'm using Nexenta, however). In addition, it's necessary to use the device type sat,12 with smartctl to get it to read attributes correctly on OpenSolaris, AFAIK. Also, regarding device IDs on the system: from what I've seen they are assigned to ports and therefore do not change; however, upon changing a controller they will most likely change, unless it's the same chipset with exactly the same port configuration. Hope this helps. On 7 Feb 2011, at 18:04, zfs-discuss-requ...@opensolaris.org wrote: > Having managed to muddle through this weekend without loss (though with a > certain amount of angst and duplication of efforts), I'm in the mood to > label things a bit more clearly on my system :-). > > smartctl doesn't seem to be on my system, though. I'm running > snv_134. I'm still pretty badly lost in the whole repository / > package thing with Solaris, most of my brain cells were already > occupied with Red Hat, Debian, and Perl package information :-( . > Where do I look? > > Are the controller port IDs, the "C9T3D0" things that ZFS likes, > reasonably stable? They won't change just because I add or remove > drives, right; only maybe if I change controller cards?
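For anyone searching the archives later, the invocation being described looks something like this - a sketch only; the device path is an illustrative placeholder, not taken from the poster's system, so substitute the cXtYdZ name that `format` or `zpool status` reports:

```shell
# Read full SMART info through the 12-byte SCSI-to-ATA (SAT) pass-through,
# which is what's reportedly needed on Nexenta/OpenSolaris per the note above.
# /dev/rdsk/c9t3d0 is a placeholder device path - substitute your own.
smartctl -a -d sat,12 /dev/rdsk/c9t3d0
```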
[zfs-discuss] deduplication requirements
Hi guys, I'm currently running 2 zpools, each in a raidz1 configuration, totalling around 16TB of usable data. I'm running it all on an OpenSolaris-based box with 2GB of memory and an old Athlon 64 3700 CPU. I understand this is very poor and underpowered for deduplication, so I'm looking at building a new system, but wanted some advice first. Here is what I've planned so far:
- Core i7 2600 CPU
- 16GB DDR3 memory
- 64GB SSD for ZIL (optional)
Would this produce decent results for deduplication of 16TB worth of pools, or would I need more RAM still?
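For a rough sense of the RAM question, the commonly cited rule of thumb is on the order of 320 bytes of dedup table (DDT) per unique block. A back-of-the-envelope sketch - the 320-byte figure and 128K average block size are assumptions; real pools vary with recordsize and dedup ratio:

```shell
#!/bin/sh
# Rough DDT sizing sketch for a 16TB pool. All figures are assumptions:
# ~320 bytes per DDT entry and a 128K average block size.
POOL_BYTES=$((16 * 1024 * 1024 * 1024 * 1024))  # 16 TB of data
BLOCK_SIZE=$((128 * 1024))                      # default 128K recordsize
ENTRY_BYTES=320                                 # per unique block, roughly

BLOCKS=$((POOL_BYTES / BLOCK_SIZE))
DDT_BYTES=$((BLOCKS * ENTRY_BYTES))
DDT_GIB=$((DDT_BYTES / 1024 / 1024 / 1024))

echo "blocks: ${BLOCKS}, estimated DDT: ~${DDT_GIB} GiB"
```

That works out to roughly 40 GiB of DDT for 16TB of 128K blocks, which is why 16GB of RAM plus an L2ARC SSD is usually treated as the practical floor here - and why smaller average block sizes make the picture much worse.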
Re: [zfs-discuss] zfs-discuss Digest, Vol 64, Issue 13
Additionally, the way I do it is to draw a diagram of the drives in the system, labelled with the drive serial numbers. Then when a drive fails, I can find out from smartctl which drive it is and remove/replace without trial and error. On 5 Feb 2011, at 21:54, zfs-discuss-requ...@opensolaris.org wrote: > > Message: 7 > Date: Sat, 5 Feb 2011 15:42:45 -0500 > From: rwali...@washdcmail.com > To: David Dyer-Bennet > Cc: zfs-discuss@opensolaris.org > Subject: Re: [zfs-discuss] Identifying drives (SATA) > Message-ID: <58b53790-323b-4ae4-98cd-575f93b66...@washdcmail.com> > Content-Type: text/plain; charset=us-ascii > > > On Feb 5, 2011, at 2:43 PM, David Dyer-Bennet wrote: > >> Is there a clever way to figure out which drive is which? And if I have to >> fall back on removing a drive I think is right, and seeing if that's true, >> what admin actions will I have to perform to get the pool back to safety? >> (I've got backups, but it's a pain to restore of course.) (Hmmm; in >> single-user mode, use dd to read huge chunks of one disk, and see which >> lights come on? Do I even need to be in single-user mode to do that?) > > Obviously this depends on your lights working to some extent (the right light > doing something when the right disk is accessed), but I've used: > > dd if=/dev/rdsk/c8t3d0s0 of=/dev/null bs=4k count=10 > > which someone mentioned on this list. Assuming you can actually read from > the disk (it isn't completely dead), it should allow you to direct traffic to > each drive individually. > > Good luck, > Ware ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] NFS slow for small files: idle disks
sks each, and 1 channel with 2 disks). Richard Elling's zilstat gives:

   N-Bytes  N-Bytes/s  N-Max-Rate    B-Bytes  B-Bytes/s  B-Max-Rate   ops  <=4kB  4-32kB  >=32kB
      9552       9552        9552     671744     671744      671744   164    164       0       0
     10192      10192       10192     724992     724992      724992   177    177       0       0
      9568       9568        9568     679936     679936      679936   166    166       0       0
     11712      11712       11712     823296     823296      823296   201    201       0       0
     10784      10784       10784     765952     765952      765952   187    187       0       0
     10024      10024       10024     708608     708608      708608   173    173       0       0

About 200 ZIL ops at most, all < 4k. As said, the disks aren't busy during this test. The test zfs is configured with atime off. logbias makes almost no difference; with logbias=latency the IOPS rate is a little lower. Attached are some bonnie++ results to show that all disks and the whole pool are quite healthy. I get > 1000 random reads/sec locally and still nearly 900 reads/sec via NFS. For large files I easily get gbit wirespeed (105 MB/sec read) with NFS. And for random reads in a bonnie or iozone test the disks really are 80%-100% busy. Just for small files the array sits almost idle; the array can do way more. I discovered this on different Solaris versions, not only this test system. Is there any explanation for this behaviour? 
Thanks, Michael -- This message posted from opensolaris.org

local
Version 1.03c     --Sequential Output--           --Sequential Input-  --Random-
                  -Per Chr- --Block-- -Rewrite-   -Per Chr- --Block--  --Seeks--
Machine      Size K/sec %CP K/sec %CP K/sec %CP   K/sec %CP K/sec %CP   /sec %CP
ibmr10        16G           108972 25  89923 21             263540 26    1074  3

                  --Sequential Create--           Random Create
                  -Create-- --Read--- -Delete--   -Create-- --Read--- -Delete--
            files  /sec %CP  /sec %CP  /sec %CP    /sec %CP  /sec %CP  /sec %CP
               16 30359  99     + +++     + +++   24836  99     + +++     + +++
ibmr10,16G,,,108972,25,89923,21,,,263540,26,1073.5,3,16,30359,99,+,+++,+,+++,24836,99,+,+++,+,+++

NFS
Version 1.03d     --Sequential Output--           --Sequential Input-  --Random-
                  -Per Chr- --Block-- -Rewrite-   -Per Chr- --Block--  --Seeks--
Machine      Size K/sec %CP K/sec %CP K/sec %CP   K/sec %CP K/sec %CP   /sec %CP
nfsibmr10     16G            50022 11  42524 14             105335 18   884.8 20

                  --Sequential Create--           Random Create
                  -Create-- --Read--- -Delete--   -Create-- --Read--- -Delete--
            files  /sec %CP  /sec %CP  /sec %CP    /sec %CP  /sec %CP  /sec %CP
               16   152   3     + +++   182   1     151   3     + +++   183   1
nfsibmr10,16G,,,50022,11,42524,14,,,105335,18,884.8,20,16,152,3,+,+++,182,1,151,3,+,+++,183,1
Re: [zfs-discuss] Troubleshooting help on ZFS
On Thu, Jan 20, 2011 at 01:47, Steve Kellam wrote: > I have a home media server set up using OpenSolaris. All my experience with > OpenSolaris has been through setting up and maintaining this server so it is > rather limited. I have run in to some problems recently and I am not sure > how the best way to troubleshoot this. I was hoping to get some feedback on > possible fixes for this. > > I am running SunOS 5.11 snv_134. It is running on a tower with 6 HDD > configured in as raidz2 array. Motherboard: ECS 945GCD-M(1.0) Intel Atom 330 > Intel 945GC Micro ATX Motherboard/CPU Combo. Memory: 4GB. > > I set this up about a year ago and have had very few problems. I was > streaming a movie off the server a few days ago and it all of a sudden lost > connectivity with the server. When I checked the server, there was no output > on the display from the server but the power supply seemed to be running and > the fans were going. > The next day it started working again and I was able to log in. The SMB and > NFS file server was connecting without problems. > > Now I am able to connect remotely via SSH. I am able to bring up a zpool > status screen that shows no problems. It reports no known data errors. I am > able to go to the top level data directories but when I cd into the > sub-directories the SSH connection freezes. > > I have tried to do a ZFS scrub on the pool and it only gets to 0.02% and > never gets beyond that but does not report any errors. Now, also, I am > unable to stop the scrub. I use the zpool scrub -s command but this freezes > the SSH connection. > When I reboot, it is still trying to scrub but not making progress. > > I have the system set up to a battery back up with surge protection and I'm > not aware of any spikes in electricity recently. I have not made any > modifications to the system. All the drives have been run through SpinRite > less than a couple months ago without any data errors. 
> > I can't figure out how this happened all of the sudden and how best to > troubleshoot it. > > If you have any help or technical wisdom to offer, I'd appreciate it as this > has been frustrating. look in /var/adm/messages (.*) to see whether there's anything interesting around the time you saw the loss of connectivity, and also since, then take it from there. HTH Michael -- regards/mit freundlichen Grüssen Michael Schuster ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is my bottleneck RAM?
Ah ok, I won't be using dedup anyway, just wanted to try it. I'll be adding more RAM though - I guess you can't have too much. Thanks Erik Trimble wrote: >You can't really do that. > >Adding an SSD for L2ARC will help a bit, but L2ARC storage also consumes >RAM to maintain a cache table of what's in the L2ARC. Using 2GB of RAM >with an SSD-based L2ARC (even without Dedup) likely won't help you too >much vs not having the SSD. > >If you're going to turn on Dedup, you need at least 8GB of RAM to go >with the SSD. > >-Erik > > >On Tue, 2011-01-18 at 18:35 +, Michael Armstrong wrote: >> Thanks everyone, I think overtime I'm gonna update the system to include an >> ssd for sure. Memory may come later though. Thanks for everyone's responses >> >> Erik Trimble wrote: >> >> >On Tue, 2011-01-18 at 15:11 +, Michael Armstrong wrote: >> >> I've since turned off dedup, added another 3 drives and results have >> >> improved to around 148388K/sec on average, would turning on compression >> >> make things more CPU bound and improve performance further? >> >> >> >> On 18 Jan 2011, at 15:07, Richard Elling wrote: >> >> >> >> > On Jan 15, 2011, at 4:21 PM, Michael Armstrong wrote: >> >> > >> >> >> Hi guys, sorry in advance if this is somewhat a lowly question, I've >> >> >> recently built a zfs test box based on nexentastor with 4x samsung 2tb >> >> >> drives connected via SATA-II in a raidz1 configuration with dedup >> >> >> enabled compression off and pool version 23. 
From running bonnie++ I >> >> >> get the following results: >> >> >> >> >> >> Version 1.03b --Sequential Output-- --Sequential Input- >> >> >> --Random- >> >> >> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- >> >> >> --Seeks-- >> >> >> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP >> >> >> /sec %CP >> >> >> nexentastor 4G 60582 54 20502 4 12385 3 53901 57 105290 10 >> >> >> 429.8 1 >> >> >> --Sequential Create-- Random >> >> >> Create >> >> >> -Create-- --Read--- -Delete-- -Create-- --Read--- >> >> >> -Delete-- >> >> >> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP >> >> >> /sec %CP >> >> >>16 7181 29 + +++ + +++ 21477 97 + +++ >> >> >> + +++ >> >> >> nexentastor,4G,60582,54,20502,4,12385,3,53901,57,105290,10,429.8,1,16,7181,29,+,+++,+,+++,21477,97,+,+++,+,+++ >> >> >> >> >> >> >> >> >> I'd expect more than 105290K/s on a sequential read as a peak for a >> >> >> single drive, let alone a striped set. The system has a relatively >> >> >> decent CPU, however only 2GB memory, do you think increasing this to >> >> >> 4GB would noticeably affect performance of my zpool? The memory is >> >> >> only DDR1. >> >> > >> >> > 2GB or 4GB of RAM + dedup is a recipe for pain. Do yourself a favor, >> >> > turn off dedup >> >> > and enable compression. >> >> > -- richard >> >> > >> >> >> >> ___ >> >> zfs-discuss mailing list >> >> zfs-discuss@opensolaris.org >> >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >> > >> > >> >Compression will help speed things up (I/O, that is), presuming that >> >you're not already CPU-bound, which it doesn't seem you are. >> > >> >If you want Dedup, you pretty much are required to buy an SSD for L2ARC, >> >*and* get more RAM. >> > >> > >> >These days, I really don't recommend running ZFS as a fileserver without >> >a bare minimum of 4GB of RAM (8GB for anything other than light use), >> >even with Dedup turned off. 
>> > >> > >> >-- >> >Erik Trimble >> >Java System Support >> >Mailstop: usca22-317 >> >Phone: x67195 >> >Santa Clara, CA >> >Timezone: US/Pacific (GMT-0800) >> > >-- >Erik Trimble >Java System Support >Mailstop: usca22-317 >Phone: x67195 >Santa Clara, CA >Timezone: US/Pacific (GMT-0800) > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is my bottleneck RAM?
Thanks everyone, I think overtime I'm gonna update the system to include an ssd for sure. Memory may come later though. Thanks for everyone's responses Erik Trimble wrote: >On Tue, 2011-01-18 at 15:11 +, Michael Armstrong wrote: >> I've since turned off dedup, added another 3 drives and results have >> improved to around 148388K/sec on average, would turning on compression make >> things more CPU bound and improve performance further? >> >> On 18 Jan 2011, at 15:07, Richard Elling wrote: >> >> > On Jan 15, 2011, at 4:21 PM, Michael Armstrong wrote: >> > >> >> Hi guys, sorry in advance if this is somewhat a lowly question, I've >> >> recently built a zfs test box based on nexentastor with 4x samsung 2tb >> >> drives connected via SATA-II in a raidz1 configuration with dedup enabled >> >> compression off and pool version 23. From running bonnie++ I get the >> >> following results: >> >> >> >> Version 1.03b --Sequential Output-- --Sequential Input- >> >> --Random- >> >> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- >> >> --Seeks-- >> >> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP >> >> /sec %CP >> >> nexentastor 4G 60582 54 20502 4 12385 3 53901 57 105290 10 >> >> 429.8 1 >> >> --Sequential Create-- Random >> >> Create >> >> -Create-- --Read--- -Delete-- -Create-- --Read--- >> >> -Delete-- >> >> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec >> >> %CP >> >>16 7181 29 + +++ + +++ 21477 97 + +++ + >> >> +++ >> >> nexentastor,4G,60582,54,20502,4,12385,3,53901,57,105290,10,429.8,1,16,7181,29,+,+++,+,+++,21477,97,+,+++,+,+++ >> >> >> >> >> >> I'd expect more than 105290K/s on a sequential read as a peak for a >> >> single drive, let alone a striped set. The system has a relatively decent >> >> CPU, however only 2GB memory, do you think increasing this to 4GB would >> >> noticeably affect performance of my zpool? The memory is only DDR1. >> > >> > 2GB or 4GB of RAM + dedup is a recipe for pain. 
Do yourself a favor, turn >> > off dedup >> > and enable compression. >> > -- richard >> > >> >> ___ >> zfs-discuss mailing list >> zfs-discuss@opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > >Compression will help speed things up (I/O, that is), presuming that >you're not already CPU-bound, which it doesn't seem you are. > >If you want Dedup, you pretty much are required to buy an SSD for L2ARC, >*and* get more RAM. > > >These days, I really don't recommend running ZFS as a fileserver without >a bare minimum of 4GB of RAM (8GB for anything other than light use), >even with Dedup turned off. > > >-- >Erik Trimble >Java System Support >Mailstop: usca22-317 >Phone: x67195 >Santa Clara, CA >Timezone: US/Pacific (GMT-0800) > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is my bottleneck RAM?
I've since turned off dedup, added another 3 drives and results have improved to around 148388K/sec on average, would turning on compression make things more CPU bound and improve performance further? On 18 Jan 2011, at 15:07, Richard Elling wrote: > On Jan 15, 2011, at 4:21 PM, Michael Armstrong wrote: > >> Hi guys, sorry in advance if this is somewhat a lowly question, I've >> recently built a zfs test box based on nexentastor with 4x samsung 2tb >> drives connected via SATA-II in a raidz1 configuration with dedup enabled >> compression off and pool version 23. From running bonnie++ I get the >> following results: >> >> Version 1.03b --Sequential Output-- --Sequential Input- >> --Random- >> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- >> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec >> %CP >> nexentastor 4G 60582 54 20502 4 12385 3 53901 57 105290 10 429.8 >> 1 >> --Sequential Create-- Random Create >> -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- >> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP >>16 7181 29 + +++ + +++ 21477 97 + +++ + +++ >> nexentastor,4G,60582,54,20502,4,12385,3,53901,57,105290,10,429.8,1,16,7181,29,+,+++,+,+++,21477,97,+,+++,+,+++ >> >> >> I'd expect more than 105290K/s on a sequential read as a peak for a single >> drive, let alone a striped set. The system has a relatively decent CPU, >> however only 2GB memory, do you think increasing this to 4GB would >> noticeably affect performance of my zpool? The memory is only DDR1. > > 2GB or 4GB of RAM + dedup is a recipe for pain. Do yourself a favor, turn off > dedup > and enable compression. > -- richard > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Is my bottleneck RAM?
Hi guys, sorry in advance if this is somewhat a lowly question, I've recently built a zfs test box based on nexentastor with 4x samsung 2tb drives connected via SATA-II in a raidz1 configuration with dedup enabled, compression off and pool version 23. From running bonnie++ I get the following results:

Version 1.03b     --Sequential Output--           --Sequential Input-  --Random-
                  -Per Chr- --Block-- -Rewrite-   -Per Chr- --Block--  --Seeks--
Machine      Size K/sec %CP K/sec %CP K/sec %CP   K/sec %CP K/sec %CP   /sec %CP
nexentastor    4G 60582  54 20502   4 12385   3   53901  57 105290 10  429.8   1

                  --Sequential Create--           Random Create
                  -Create-- --Read--- -Delete--   -Create-- --Read--- -Delete--
            files  /sec %CP  /sec %CP  /sec %CP    /sec %CP  /sec %CP  /sec %CP
               16  7181  29     + +++     + +++   21477  97     + +++     + +++
nexentastor,4G,60582,54,20502,4,12385,3,53901,57,105290,10,429.8,1,16,7181,29,+,+++,+,+++,21477,97,+,+++,+,+++

I'd expect more than 105290K/s on a sequential read as a peak for a single drive, let alone a striped set. The system has a relatively decent CPU, however only 2GB memory; do you think increasing this to 4GB would noticeably affect performance of my zpool? The memory is only DDR1. Thanks in advance.
Re: [zfs-discuss] A few questions
Just to add a bit to this, I just love sweeping generalizations... On 9 Jan 2011, at 19:33 , Richard Elling wrote: > On Jan 9, 2011, at 4:19 PM, Edward Ned Harvey > wrote: > >>> From: Pasi Kärkkäinen [mailto:pa...@iki.fi] >>> >>> Other OS's have had problems with the Broadcom NICs aswell.. >> >> Yes. The difference is, when I go to support.dell.com and punch in my >> service tag, I can download updated firmware and drivers for RHEL that (at >> least supposedly) solve the problem. I haven't tested it, but the dell >> support guy told me it has worked for RHEL users. There is nothing >> available to download for solaris. > > The drivers are written by Broadcom and are, AFAIK, closed source. > By going through Dell, you are going through a middle-man. For example, > > http://www.broadcom.com/support/ethernet_nic/netxtremeii10.php > > where you see the release of the Solaris drivers was at the same time > as Windows. > What Richard says is true. Broadcom have been a source of contention in the Linux world as well as the *BSD world due to the proprietary nature of their firmware. OpenSolaris/Solaris users are not the only ones who have complained about this. There's been much uproar in the FOSS community about Broadcom and their drivers. As a result, I've seen some pretty nasty hacks like people using the Windows drivers linked into their kernel - *gack* I forget all the gory details, but it was rather disgusting as I recall, bubblegum, bailing wire, duct tape and all. Dell and Red Hat aren't exactly a marriage made in heaven either. I've had problems getting support from both Dell and Red Hat, them pointing fingers at each other rather than solving the problem. Like most people, I've had to come up with my own work-arounds, like others with the Broadcom issue, using a "known quantity" NIC. When dealing with Dell as a corporate buyer, they have always made it quite clear that they are primarily a Windows platform. Linux, oh yes, we have that too... 
>> Also, the bcom is not the only problem on that server. After I added-on an >> intel network card and disabled the bcom, the weekly crashes stopped, but >> now it's ... I don't know ... once every 3 weeks with a slightly different >> mode of failure. This is yet again, rare enough that the system could very >> well pass a certification test, but not rare enough for me to feel >> comfortable putting into production as a primary mission critical server. I've never been particularly warm and fuzzy with Dell servers. They seem to like to change their chipsets slightly while a model is in production. This can cause all sorts of problems which are difficult to diagnose since an "identical" Dell system will have no problems, and it's mate will crash weekly. >> >> I really think there are only two ways in the world to engineer a good solid >> server: >> (a) Smoke your own crack. Systems engineering teams use the same systems >> that are sold to customers. > > This is rarely practical, not to mention that product development > is often not in the systems engineering organization. > >> or >> (b) Sell millions of 'em. So despite whether or not the engineering team >> uses them, you're still going to have sufficient mass to dedicate engineers >> to the purpose of post-sales bug solving. > > yes, indeed :-) > -- richard As for certified systems, It's my understanding that Nexenta themselves don't "certify" anything. They have systems which are recommended and supported by their network of VAR's. It just so happens that SuperMicro is one of the brands of choice, but even then one must adhere to a fairly tight HCL. The same holds true for Solaris/OpenSolaris with third-party hardware. SATA Controllers and multiplexers are also another example of the drivers being written by the manufacturer and Solaris/OpenSolaris are not a priority over Windows and Linux, in that order. 
Deviating to items which are not somewhat "plain vanilla" and are not listed on the HCL is just asking for trouble.

Mike

---
Michael Sullivan
michael.p.sulli...@me.com
http://www.kamiogi.net/
Mobile: +1-662-202-7716
US Phone: +1-561-283-2034
JP Phone: +81-50-5806-6242

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Jan 7, 2011, at 6:13 AM, David Magda wrote:

> On Fri, January 7, 2011 01:42, Michael DeMan wrote:
>> Then - there is the other side of things. The 'black swan' event. At
>> some point, given percentages on a scenario like the example case above,
>> one simply has to make the business justification case internally at their
>> own company about whether to go SHA-256 only or Fletcher+Verification?
>> Add Murphy's Law to the 'black swan event' and of course the only data
>> that is lost is that .01% of your data that is the most critical?
>
> The other thing to note is that by default (with de-dupe disabled), ZFS
> uses Fletcher checksums to prevent data corruption. Add also the fact that
> most other file systems don't have any checksums, and simply rely on the
> fact that disks have a bit error rate of (at best) 10^-16.

Agreed - but I think it is still missing the point of what the original poster was asking about. In all honesty I think the debate is a business decision - the highly improbable vs. certainty. Somebody somewhere must have written this stuff up, along with simple use cases? Perhaps even a new acronym? MTBC - mean time before collision? And even with the 'certainty' factor being the choice - other things like human error come in to play and are far riskier?
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
At the end of the day this issue essentially is about mathematical improbability versus certainty? To be quite honest, I too am skeptical about using de-dupe just based on SHA256. In prior posts it was asked that the potential adopter of the technology provide the mathematical reason to NOT use SHA-256 only. However, if Oracle believes that it is adequate to do that, would it be possible for somebody to provide:

(A) The theoretical documents and associated mathematics specific to, say, one simple use case?
(A1) Total data size is 1PB (let's say the zpool is 2PB to not worry about that part of it).
(A2) Daily, 10TB of data is updated, 1TB of data is deleted, and 1TB of data is 'new'.
(A3) Out of the dataset, 25% of the data is capable of being de-duplicated.
(A4) Between A2 and A3 above, the 25% rule from A3 also applies to everything in A2.

I think the above would be a pretty 'soft' case for justifying the claim that SHA-256 works? I would presume some simple scenario like this was run mathematically by somebody inside Oracle/Sun long ago, when first proposing that ZFS be funded internally at all?

Then - there is the other side of things. The 'black swan' event. At some point, given percentages on a scenario like the example case above, one simply has to make the business justification case internally at their own company about whether to go SHA-256 only or Fletcher+Verification? Add Murphy's Law to the 'black swan event' and of course the only data that is lost is that .01% of your data that is the most critical?

Not trying to be aggressive or combative here at all against people's opinions and understandings of it all - I would just like to see some hard information about it - it must exist somewhere already?
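For what it's worth, the use case above (A1-A4) is simple to run as a birthday-bound estimate. One assumption here is mine, not given in the scenario: an average ZFS block size of 128 KB (the default recordsize); smaller blocks mean more blocks and a proportionally larger, though still tiny, bound.

```python
# Rough birthday-bound run of the use case above (A1-A4).
# Assumption (mine, not the poster's): 128 KB average block size.

def collision_bound(n_blocks: int, hash_bits: int = 256) -> float:
    """Upper bound on P(any two blocks share a hash):
    n*(n-1)/2 pairs, each colliding with probability 2^-hash_bits."""
    return n_blocks * (n_blocks - 1) / 2 * 2.0 ** -hash_bits

PB = 2 ** 50                  # bytes in 1 PB (binary)
n = PB // (128 * 2 ** 10)     # unique 128 KB blocks in 1 PB
print(n)                      # 8589934592 (2^33 blocks)
print(collision_bound(n))     # ~3.2e-58
```

The daily churn in A2 doesn't change the picture: the bound is governed by the total unique block count, and 10TB/day of rewrites keeps that count in the same 2^33-ish regime.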
Thanks,
- Mike

On Jan 6, 2011, at 10:05 PM, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Peter Taps
>>
>> Perhaps (Sha256+NoVerification) would work 99.99% of the time. But
>
> Append 50 more 9's on there.
> 99.99...(another fifty 9's)%
>
> See below.
>
>> I have been told that the checksum value returned by Sha256 is almost
>> guaranteed to be unique. In fact, if Sha256 fails in some case, we have a
>> bigger problem such as memory corruption, etc. Essentially, adding
>> verification to sha256 is an overkill.
>
> Someone please correct me if I'm wrong. I assume ZFS dedup matches both the
> blocksize and the checksum, right? A simple checksum collision (which is
> astronomically unlikely) is still not sufficient to produce corrupted data.
> It's even more unlikely than that.
>
> Using the above assumption, here's how you calculate the probability of
> corruption if you're not using verification:
>
> Suppose every single block in your whole pool is precisely the same size
> (which is unrealistic in the real world, but I'm trying to calculate worst
> case). Suppose the block is 4K, which is, again, unrealistically worst case.
> Suppose your dataset is purely random or sequential ... with no duplicated
> data ... which is unrealistic, because if your data is like that, then why in
> the world are you enabling dedupe? But again, assuming worst case
> scenario... At this point we'll throw in some evil clowns, spit on a voodoo
> priestess, and curse the heavens for some extra bad luck.
>
> If you have astronomically infinite quantities of data, then your
> probability of corruption approaches 100%. With infinite data, eventually
> you're guaranteed to have a collision. So the probability of corruption is
> directly related to the total amount of data you have, and the new question
> is: for anything Earthly, how near are you to 0% probability of collision
> in reality?
>
> Suppose you have 128TB of data.
> That is ... you have 2^35 unique 4k blocks
> of uniformly sized data. Then the probability you have any collision in
> your whole dataset is (sum(1 thru 2^35)) * 2^-256.
> Note: sum of integers from 1 to N is (N*(N+1))/2
> Note: 2^35 * (2^35+1) = 2^35 * 2^35 + 2^35 = 2^70 + 2^35
> Note: (N*(N+1))/2 in this case = 2^69 + 2^34
> So the probability of data corruption in this case is 2^-187 + 2^-222
> ~= 5.1E-57 + 1.5E-67
> ~= 5.1E-57
>
> In other words, even in the absolute worst case, cursing the gods, running
> without verification, using data that's specifically formulated to try and
> cause errors, on a dataset that I bet is larger than what you're doing ...
>
> Before we go any further ... The total number of bits stored on all the
> storage in the whole planet is a lot smaller than the total number of
> molecules in the planet.
>
> There are an estimated 8.87 * 10^49 molecules in planet Earth.
>
> The probability of a collision in your worst-case unrealistic dataset as
> described is even millions of times less likely than randomly finding a
> single specific molecule in the whole planet Earth by pure luck.
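Ed's figures are easy to re-derive numerically. Running his worst case reproduces the ~5.1E-57, though the closing molecule comparison comes out nearer a couple of million times less likely than a hundred million - same ballpark, same conclusion:

```python
# Quick numeric check of the arithmetic in the post above:
# 128 TB of all-unique 4 KB blocks, SHA-256, no verify.

n = (128 * 2 ** 40) // (4 * 2 ** 10)   # number of blocks = 2^35
p = (n * (n + 1) // 2) * 2.0 ** -256   # sum(1..n) * 2^-256
print(n == 2 ** 35)                    # True
print(p)                               # ~5.1e-57, matching the post

# The molecule comparison: ~8.87e49 molecules in Earth, so picking one
# specific molecule at random has probability ~1.13e-50.
print((1 / 8.87e49) / p)               # ~2.2e6
```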
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
Ed, with all due respect to your math, I've seen rsync bomb out due to an SHA256 collision, so I know it can and does happen. I respect my data, so even with checksumming and comparing the block size, I'll still do a comparison check if those two match. Otherwise you can end up with silent data corruption, which could affect you in so many ways. Do you want to stake your career and reputation on that? With a client or employer's data? I sure don't.

"Those who walk on the razor's edge are destined to be cut to ribbons…" Someone I used to work with said that, not me.

For my home media server, maybe, but even then I'd hate to lose any of my family photos or video due to a hash collision. I'll play it safe if I dedup.

Mike

---
Michael Sullivan
michael.p.sulli...@me.com
http://www.kamiogi.net/
Mobile: +1-662-202-7716
US Phone: +1-561-283-2034
JP Phone: +81-50-5806-6242

On 7 Jan 2011, at 00:05 , Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Peter Taps
>>
>> Perhaps (Sha256+NoVerification) would work 99.99% of the time. But
>
> Append 50 more 9's on there.
> 99.99...(another fifty 9's)%
>
> See below.
>
>> I have been told that the checksum value returned by Sha256 is almost
>> guaranteed to be unique. In fact, if Sha256 fails in some case, we have a
>> bigger problem such as memory corruption, etc. Essentially, adding
>> verification to sha256 is an overkill.
>
> Someone please correct me if I'm wrong. I assume ZFS dedup matches both the
> blocksize and the checksum, right? A simple checksum collision (which is
> astronomically unlikely) is still not sufficient to produce corrupted data.
> It's even more unlikely than that.
>
> Using the above assumption, here's how you calculate the probability of
> corruption if you're not using verification:
>
> Suppose every single block in your whole pool is precisely the same size
> (which is unrealistic in the real world, but I'm trying to calculate worst
> case).
> Suppose the block is 4K, which is, again, unrealistically worst case.
> Suppose your dataset is purely random or sequential ... with no duplicated
> data ... which is unrealistic, because if your data is like that, then why in
> the world are you enabling dedupe? But again, assuming worst case
> scenario... At this point we'll throw in some evil clowns, spit on a voodoo
> priestess, and curse the heavens for some extra bad luck.
>
> If you have astronomically infinite quantities of data, then your
> probability of corruption approaches 100%. With infinite data, eventually
> you're guaranteed to have a collision. So the probability of corruption is
> directly related to the total amount of data you have, and the new question
> is: for anything Earthly, how near are you to 0% probability of collision
> in reality?
>
> Suppose you have 128TB of data. That is ... you have 2^35 unique 4k blocks
> of uniformly sized data. Then the probability you have any collision in
> your whole dataset is (sum(1 thru 2^35)) * 2^-256.
> Note: sum of integers from 1 to N is (N*(N+1))/2
> Note: 2^35 * (2^35+1) = 2^35 * 2^35 + 2^35 = 2^70 + 2^35
> Note: (N*(N+1))/2 in this case = 2^69 + 2^34
> So the probability of data corruption in this case is 2^-187 + 2^-222
> ~= 5.1E-57 + 1.5E-67
> ~= 5.1E-57
>
> In other words, even in the absolute worst case, cursing the gods, running
> without verification, using data that's specifically formulated to try and
> cause errors, on a dataset that I bet is larger than what you're doing ...
>
> Before we go any further ... The total number of bits stored on all the
> storage in the whole planet is a lot smaller than the total number of
> molecules in the planet.
>
> There are an estimated 8.87 * 10^49 molecules in planet Earth.
>
> The probability of a collision in your worst-case unrealistic dataset as
> described is even millions of times less likely than randomly finding a
> single specific molecule in the whole planet Earth by pure luck.
Re: [zfs-discuss] A few questions
On Wed, Jan 5, 2011 at 15:34, Edward Ned Harvey wrote:

>> From: Deano [mailto:de...@rattie.demon.co.uk]
>> Sent: Wednesday, January 05, 2011 9:16 AM
>>
>> So honestly do we want to innovate ZFS (I do) or do we just want to follow
>> Oracle?
>
> Well, you can't follow Oracle. Unless you wait till they release something,
> reverse engineer it, and attempt to reimplement it.

That's not my understanding - while we will have to wait, Oracle is supposed to release *some* source code afterwards, to satisfy some claim or other. I agree, some would argue that that should already have happened with S11 Express... I don't know whether it has, but that's not *the* release of S11, is it? And once the code is released, even if after the fact, it's not reverse-engineering anymore, is it?

Michael

PS: just in case: even while at Oracle, I had no insight into any of these plans, much less do I have now.
--
regards/mit freundlichen Grüssen
Michael Schuster
Re: [zfs-discuss] A couple of quick questions
I can't answer any of these authoritatively(?), but have a comment:

On Wed, Dec 22, 2010 at 10:55, Per Hojmark wrote:

> 1) What's the maximum number of disk devices that can be used to construct
> filesystems?

lots.

> 2) Is there a practical limit on #1? I've seen messages where folks suggested
> 40 physical devices is the practical maximum. That would seem to imply a
> maximum single volume size of 80TB...

how does that follow? Or, in other words, why do you believe zfs can only handle 2 TB per physical disc? (hint: look up GPT or EFI label ;-)

HTH
--
regards/mit freundlichen Grüssen
Michael Schuster
Re: [zfs-discuss] Ideas for ghetto file server data reliability?
Ummm… there's a difference between data integrity and data corruption.

Integrity is enforced programmatically by something like a DBMS. This sets up basic rules that ensure the programmer, program or algorithm adheres to a level of sanity and bounds. Corruption is where cosmic rays, bit rot, malware or some other item writes at the block level. ZFS protects systems from a lot of this by the way it's constructed to keep metadata, checksums, and duplicates of critical data. If the filesystem is given bad data, it will faithfully lay it down on disk. If that data later gets corrupted on disk, ZFS will come in and save the day.

Regards,

Mike

On Nov 16, 2010, at 11:28, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Toby Thain
>>
>> The corruption will at least be detected by a scrub, even in cases where it
>> cannot be repaired.
>
> Not necessarily. Let's suppose you have some bad memory, and no ECC. Your
> application does 1 + 1 = 3. Then your application writes the answer to a
> file. Without ECC, the corruption happened in memory and went undetected.
> Then the corruption was written to file, with a correct checksum. So in
> fact it's not filesystem corruption, and ZFS will correctly mark the
> filesystem as clean and free of checksum errors.
>
> In conclusion:
>
> Use ECC if you care about your data.
> Do backups if you care about your data.
>
> Don't be a cheapskate, or else, don't complain when you get bitten by lack
> of adequate data protection.
Re: [zfs-discuss] Running on Dell hardware?
Congratulations Ed, and welcome to "open systems…"

Ah, but Nexenta is open and has "no vendor lock-in." What you probably should have done is bank everything on Illumos and Nexenta - a winning combination by all accounts. But then again, you could have used Linux on any hardware as well. Then your hardware and software issues would probably be multiplied even more.

Cheers,

Mike

---
Michael Sullivan
michael.p.sulli...@me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034

On 23 Oct 2010, at 12:53 , Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Kyle McDonald
>>
>> I'm currently considering purchasing 1 or 2 Dell R515's.
>>
>> With up to 14 drives, and up to 64GB of RAM, it seems like it's well
>> suited for a low-end ZFS server.
>>
>> I know this box is new, but I wonder if anyone out there has any
>> experience with it?
>>
>> How about the H700 SAS controller?
>>
>> Anyone know where to find the Dell 3.5" sleds that take 2.5" drives? I
>> want to put some SSD's in a box like this, but there's no way I'm
>> going to pay Dell's SSD prices. $1300 for a 50GB 'mainstream' SSD? Are
>> they kidding?
>
> You are asking for a world of hurt. You may luck out, and it may work
> great, thus saving you money. Take my example for example ... I took the
> "safe" approach (as far as any non-Sun hardware is concerned). I bought an
> officially supported Dell server, with all Dell-blessed and Solaris-
> supported components, with support contracts on both the hardware and
> software, fully patched and updated on all fronts, and I am getting system
> failures approx once per week. I have support tickets open with both Dell
> and Oracle right now ... Have no idea how it's all going to turn out. But
> if you have a problem like mine, using unsupported hardware, you have no
> alternative.
> You're up a tree full of bees, naked, with a hunter on the
> ground trying to shoot you. And IMHO, I think the probability of having a
> problem like mine is higher when you use the unsupported hardware. But of
> course there's no definable way to quantify that belief.
>
> My advice to you is: buy the supported hardware, and the support contracts
> for both the hardware and software. But of course, that's all just a
> calculated risk, and I doubt you're going to take my advice. ;-)
Re: [zfs-discuss] [RFC] Backup solution
On Oct 8, 2010, at 4:33 AM, Edward Ned Harvey wrote:

>> From: Peter Jeremy [mailto:peter.jer...@alcatel-lucent.com]
>> Sent: Thursday, October 07, 2010 10:02 PM
>>
>> On 2010-Oct-08 09:07:34 +0800, Edward Ned Harvey wrote:
>>> If you're going raidz3, with 7 disks, then you might as well just make
>>> mirrors instead, and eliminate the slow resilver.
>>
>> There is a difference in reliability: raidzN means _any_ N disks can
>> fail, whereas mirror means one disk in each mirror pair can fail.
>> With a mirror, Murphy's Law says that the second disk to fail will be
>> the pair of the first disk :-).
>
> Maybe. But in reality, you're just guessing the probability of a single
> failure, the probability of multiple failures, and the probability of
> multiple failures within the critical time window and critical redundancy
> set.
>
> The probability of a 2nd failure within the critical time window is smaller
> whenever the critical time window is decreased, and the probability of that
> failure being within the critical redundancy set is smaller whenever your
> critical redundancy set is smaller. So if raidz2 takes twice as long to
> resilver than a mirror, and has a larger critical redundancy set, then you
> haven't gained any probable resiliency over a mirror.
>
> Although it's true with mirrors, it's possible for 2 disks to fail and
> result in loss of pool, I think the probability of that happening is smaller
> than the probability of a 3-disk failure in the raidz2.
>
> How much longer does a 7-disk raidz2 take to resilver as compared to a
> mirror? According to my calculations, it's in the vicinity of 10x longer.

This article has been posted elsewhere, is about 10 months old, but is a good read: http://queue.acm.org/detail.cfm?id=1670144

Really, there should be a ballpark / back-of-the-napkin formula to be able to calculate this? I've been curious about this too, so here goes a 1st cut...
DR = disk reliability, in terms of the chance of the disk dying in any given time period, say any given hour.
DFW = disk full write - time to write every sector on the disk. This will vary depending on system load, but is still an input item that can be determined by some testing.
RSM = resilver time for a mirror of two of the given disks
RSZ1 = resilver time for a 7-disk raidz1 vdev of the given disks
RSZ2 = resilver time for a 7-disk raidz2 vdev of the given disks

Chances of losing all data in a mirror: DLM = RSM * DR.
Chances of losing all data in a raidz1: DLRZ1 = RSZ1 * DR.
Chances of losing all data in a raidz2: DLRZ2 = RSZ2 * DR * DR.

Now, for the above, I'll make some other assumptions... Let's just guess at a 1-year MTBF for our disks and, for purposes here, flat-line that as a constant chance of failure per hour throughout the year. Let's presume rebuilding a mirror takes one hour. Let's presume that a 7-disk raidz1 takes 24 times longer to rebuild one disk than a mirror; I think this would be a 'safe' ratio, to the benefit of the mirror. Let's presume that a 7-disk raidz2 takes 72 times longer to rebuild one disk than a mirror; this should be 'safe' and again benefit the mirror.

DR for a one-hour period = 1 / (24 hours * 365 days) = .000114 - the chance a disk might die in any given hour.
DLM = 1 hour * DR = .000114
DLRZ1 = 24 hours * DR * 6 (x6 because there are six more drives in the pool, and any one of them could fail)
DLRZ2 = 72 hours * (DR * 6) * (DR * 5) = a much tinier chance of losing all that data.

A better way to think about it, maybe: based on our 1-year flat-line MTBF for disks, figure out how much faster the mirror must rebuild for reliability to be the same as a raidz2...
Set DLM = DLRZ2, with X as the fraction of the raidz2's 72-hour resilver time that the mirror would be allowed to take:

X * 72 hours * DR = 72 hours * (DR * 6) * (DR * 5)
X = 30 * DR
X = .00342

So the mirror would have to resilver roughly three hundred times faster than the raidz2 (1 / .00342 ~= 292) in order for it to offer the same level of reliability in regards to the chances of losing the entire vdev due to additional disk failures during a resilver.

The governing thing here is that O(2) level of reliability based on the expected chances of failure of additional disks at any given moment in time, vs. O(1) for mirrors and raidz1. Note that the above is O(2) for raidz2 and O(1) for mirror/raidz1, because we are working on the assumption we have already lost one disk. With raidz3, we would have 1 / (.000114 * 4 disks remaining in the pool), or about another 2,000 times more reliability?

Now, the above does not include things like proper statistics - the chances of that 2nd and 3rd disk failing (even correlations) may be higher than our 'flat-line' %/hr. based on a 1-year MTBF - or stuff like all the disks being purchased in the same lots and at the same time, so their chances of failing around the same time are higher, etc.
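The back-of-the-napkin model above can be run directly. Same inputs as the post - guesses, not measured data: a flat-lined 1-year MTBF, a 1-hour mirror resilver, and 7-disk raidz1/raidz2 resilvering 24x/72x slower than the mirror.

```python
# Numeric version of the napkin math above (inputs are the post's
# guesses, not measurements).

DR = 1 / (24 * 365)                 # chance a disk dies in any given hour

DLM   = 1 * DR                      # mirror: 1-hour window, 1 exposed disk
DLRZ1 = 24 * DR * 6                 # raidz1: 24-hour window, 6 exposed disks
DLRZ2 = 72 * (DR * 6) * (DR * 5)    # raidz2: needs 2 more failures in 72 h

print(f"{DLM:.3g} {DLRZ1:.3g} {DLRZ2:.3g}")   # 0.000114 0.0164 2.81e-05

# How much faster must the mirror resilver to match raidz2's risk?
X = 30 * DR                         # mirror resilver as a fraction of raidz2's
print(1 / X)                        # 292.0 -- "roughly three hundred times"
```

Note the crossover this exposes: under these assumptions the 7-disk raidz1 is actually riskier than a plain mirror, while the raidz2 is several times safer despite its much longer resilver.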
Re: [zfs-discuss] TLER and ZFS
Can you give us release numbers that confirm that this is 'automatic'? It is my understanding that the last available public release of OpenSolaris does not do this.

On Oct 5, 2010, at 8:52 PM, Richard Elling wrote:

> ZFS already aligns the beginning of data areas to 4KB offsets from the label.
> For modern OpenSolaris and Solaris implementations, the default starting
> block for partitions is also aligned to 4KB.
Re: [zfs-discuss] TLER and ZFS
Hi, and thanks upfront for the valuable information.

On Oct 5, 2010, at 4:12 PM, Peter Jeremy wrote:

>> Another annoying thing with the whole 4K sector size, is what happens
>> when you need to replace drives next year, or the year after?
>
> About the only mitigation needed is to ensure that any partitioning is
> based on multiples of 4KB.

I agree, but to be quite honest, I have no clue how to do this with ZFS. It seems like it should be covered under the regular tuning documentation:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

Is it going to be the case that basic information like how to deal with common scenarios like this is no longer going to be publicly available, and Oracle will simply keep it 'close to the vest', with the relevant information available only to those who choose to research it themselves, or to those with certain levels of support contracts from Oracle?

To put it another way - does the community that uses ZFS need to fork 'ZFS Best Practices' and 'ZFS Evil Tuning' to ensure that they stay reasonably up to date?

Sorry for the somewhat hostile tone in the above, but the changes with the merger have demoralized a lot of folks, I think.

- Mike
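The mitigation Peter describes boils down to a divisibility check: every partition or slice should start at an LBA whose byte offset is a multiple of 4096, i.e. a multiple of 8 on drives exposing 512-byte logical sectors. A minimal sketch (the sector numbers below are illustrative examples, not taken from any real label):

```python
# Sketch of the 4 KB alignment rule: start_lba is the partition's
# first logical sector, logical_bytes the drive's logical sector size.

def aligned_4k(start_lba: int, logical_bytes: int = 512) -> bool:
    """True if a partition starting at start_lba lands on a 4 KB boundary."""
    return (start_lba * logical_bytes) % 4096 == 0

print(aligned_4k(34))     # False - a common default GPT data start, misaligned
print(aligned_4k(40))     # True
print(aligned_4k(2048))   # True - the 1 MiB convention many partitioners use
```

The same check applies whatever tool does the partitioning; the only inputs you need from the label are the starting sector and the logical sector size.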
Re: [zfs-discuss] TLER and ZFS
On Oct 5, 2010, at 2:47 PM, casper@sun.com wrote:

> I've seen several important features when selecting a drive for a mirror:
>
> TLER (the ability of the drive to timeout a command)
> sector size (native vs virtual)
> power use (specifically at home)
> performance (mostly for work)
> price
>
> I've heard scary stories about a mismatch of the native sector size and
> unaligned Solaris partitions (4K sectors, unaligned cylinder).

Yes, avoiding the 4K sector sizes is a huge issue right now too - another item I forgot on the list of reasons to absolutely avoid those WD 'green' drives. Three good reasons to avoid WD 'green' drives for ZFS:

- TLER issues
- IntelliPower head-park issues
- 4K sector size issues

...they are an absolute nightmare. The WD 1TB 'enterprise' drives are still 512-byte sector size and safe to use - who knows though, maybe they just started shipping with 4K sector size as I write this e-mail?

Another annoying thing with the whole 4K sector size is what happens when you need to replace drives next year, or the year after? That part has me worried on this whole 4K sector migration thing more than what to buy today. Given the choice, I would prefer to buy 4K sector size now, but operating system support is still limited.

Does anybody know if there are any vendors shipping 4K sector drives that have a jumper option to make them 512 size? WD has a jumper, but it is there explicitly to work with Windows XP, and is not a real way to dumb down the drive to 512. I would presume that any vendor shipping 4K sector size drives now, with a jumper to make them 'real' 512, would be supporting that over the long run?

I would be interested, and probably others would be too, in what the original poster finally decides on this.

- Mike
Re: [zfs-discuss] TLER and ZFS
On Oct 5, 2010, at 1:47 PM, Roy Sigurd Karlsbakk wrote:

>> Western Digital RE3 WD1002FBYS 1TB 7200 RPM SATA 3.0Gb/s 3.5" Internal
>> Hard Drive - Bare Drive
>>
>> are only $129.
>>
>> vs. $89 for the 'regular' black drives.
>>
>> 45% higher price, but it is my understanding that the 'RAID Edition'
>> ones also are physically constructed for longer life, lower vibration
>> levels, etc.
>
> Well, here it's about 60% up, and for 150 drives, that makes a wee
> difference...
>
> Vennlige hilsener / Best regards
>
> roy

Understood on 1.6 times the cost, especially for 150 drives.

I think (and if I am wrong, somebody correct me) that if you are using commodity controllers, which seem to be generally fine for ZFS, then a drive that times out trying to constantly re-read a bad sector could stall out reads on the entire pool. On the other hand, if the drives are exported as JBOD from a RAID controller, I would think the RAID controller itself would just mark the drive as bad and offline it quickly, based on its own internal algorithms.

The above is also relevant to the anticipated usage. For instance, if it is some sort of backup machine, then delays due to some reads stalling without TLER are perhaps not a big deal. If it is for more of an up-front production use, that could be intolerable.
Re: [zfs-discuss] TLER and ZFS
I'm not sure on the TLER issues by themselves, but after the nightmares I have gone through dealing with the 'green' drives, which have both the TLER issue and the IntelliPower head-parking issues, I would just stay away from it all entirely and pay extra for the 'RAID Edition' drives.

Just out of curiosity, I took a peek at Newegg.

Western Digital RE3 WD1002FBYS 1TB 7200 RPM SATA 3.0Gb/s 3.5" Internal Hard Drive - Bare Drive

are only $129, vs. $89 for the 'regular' black drives. That's a 45% higher price, but it is my understanding that the 'RAID Edition' ones are also physically constructed for longer life, lower vibration levels, etc.

On Oct 5, 2010, at 1:30 PM, Roy Sigurd Karlsbakk wrote:

> Hi all
>
> I just discovered WD Black drives are rumored not to be set to allow TLER.
> Does anyone know how much performance impact the lack of TLER might have on a
> large pool? Choosing Enterprise drives will cost about 60% more, and on a
> large install, that means a lot of money...
>
> Vennlige hilsener / Best regards
>
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 97542685
> r...@karlsbakk.net
> http://blogg.karlsbakk.net/
> --
> I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det
> er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av
> idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og
> relevante synonymer på norsk.
Re: [zfs-discuss] "zfs unmount" versus "umount"?
On 30.09.10 15:42, Mark J Musante wrote:

> On Thu, 30 Sep 2010, Linder, Doug wrote:
>
>> Is there any technical difference between using "zfs unmount" to unmount a
>> ZFS filesystem versus the standard unix "umount" command? I always use
>> "zfs unmount" but some of my colleagues still just use umount. Is there
>> any reason to use one over the other?
>
> No, they're identical. If you use 'zfs umount' the code automatically maps
> it to 'unmount'. It also maps 'recv' to 'receive' and '-?' to call into the
> usage function. Here's the relevant code from main():

Mark, I think that wasn't the question; rather, "what's the difference between 'zfs u[n]mount' and '/usr/bin/umount'?"

HTH
Michael
--
michael.schus...@oracle.com http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
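The snippet Mark pasted didn't survive the archive. Purely as a hedged illustration of the alias step he describes - the real table lives in C in the zfs command's source, and the "help" result here is my stand-in for its call into the usage function - the mapping amounts to:

```python
# Illustration only: mimics the subcommand aliasing Mark describes,
# not the actual C source of the zfs command.

ALIASES = {"umount": "unmount", "recv": "receive"}

def canonical_subcommand(cmd: str) -> str:
    """Map a user-typed zfs subcommand to its canonical name."""
    if cmd == "-?":
        return "help"        # stand-in for routing '-?' to usage()
    return ALIASES.get(cmd, cmd)

print(canonical_subcommand("umount"))   # unmount
print(canonical_subcommand("recv"))     # receive
print(canonical_subcommand("mount"))    # mount (unchanged)
```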
Re: [zfs-discuss] file recovery on lost RAIDZ array
I'm sorry to say that I am quite the newbie to ZFS. When you say zfs send/receive, what exactly are you referring to?

I had the ZFS array mounted at a specific location in my file system (/mnt/Share) and I was sharing that location over the network with a Samba server. The directory had read-write-execute permissions set to allow anyone to write to it, and I was copying data from Windows into it. At what point do file changes get committed to the file system? I sort of assumed that any additional files copied over would be committed once the next file began copying.

Thanks for your insight.

-Mike
Re: [zfs-discuss] file recovery on lost RAIDZ array
Oh and yes, raidz1.

-- This message posted from opensolaris.org
Re: [zfs-discuss] file recovery on lost RAIDZ array
I don't know what happened. I was in the process of copying files onto my new file server when the copy process from the other machine failed. I turned on the monitor for the fileserver and found that it had rebooted by itself at some point (machine fault maybe?), and when I remounted the drives every last thing was gone.

I am new to ZFS. How do you take snapshots? Does the system do it automagically for you?
[zfs-discuss] file recovery on lost RAIDZ array
I recently lost all of the data on my single-parity RAIDZ array. Each of the drives was encrypted, with the ZFS array built within the encrypted volumes. I am not exactly sure what happened. The files were there and accessible, and then they were all gone. The server apparently crashed and rebooted, and everything was lost. After the crash I remounted the encrypted drives and the zpool was still reporting that roughly 3TB of the 7TB array were used, but I could not see any of the files through the array's mount point. I unmounted the zpool and then remounted it, and suddenly zpool was reporting 0TB were used. I did not remap the virtual device. The only thing of note that I saw was that the name of the storage pool had changed: originally it was "Movies" and then it became "Movita". I am guessing that the file system became corrupted somehow (zpool status did not report any errors). So, my questions are these... Is there any way to undelete data from a lost RAIDZ array? If I build a new virtual device on top of the old one and the drive topology remains the same, can we scan the drives for files from old arrays? Also, is there any way to repair a corrupted storage pool? Is it possible to back up the file table or whatever partition index ZFS maintains? I imagine that you all are going to suggest that I scrub the array, but that is not an option at this point. I had a backup of all of the data lost, as I am moving between file servers, so at a certain point I gave up and decided to start fresh. This doesn't give me a warm fuzzy feeling about ZFS, though. Thanks, -Mike
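For anyone hitting something similar: newer builds added a recovery mode to 'zpool import' that can rewind a damaged pool to an earlier, consistent transaction group. Whether it applies depends on the pool version, so treat this as a hedged sketch ('tank' is a placeholder pool name):

```shell
zpool export tank       # release the pool from the running system
zpool import -F tank    # attempt a rewind to the last consistent txg (newer builds only)
zpool history tank      # review what operations the pool has actually seen
```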
Re: [zfs-discuss] ZFS with SAN's and HA
Lao, I had a look at HAStoragePlus etc. and from what I understand that's to mirror local storage across 2 nodes for services to be able to access, 'DRBD style'. Having a read through the documentation on the Oracle site, the cluster software from what I gather is about how to cluster services together (Oracle/Apache etc.), and again any documentation I've found on storage is how to duplicate local storage to multiple hosts for HA failover. Can't really see anything on clustering services to use shared storage/ZFS pools.
[zfs-discuss] ZFS with SAN's and HA
Hey all, I currently work for a company that has purchased a number of different SAN solutions (whatever was cheap at the time!) and I want to set up an HA ZFS file store over fiber channel. Basically I've taken slices from each of the SANs and added them to a ZFS pool on this box (which I'm calling a 'ZFS proxy'). I've then carved out LUNs from this pool and assigned them to other servers. I then have snapshots taken on each of the LUNs and replication off site for DR. This all works perfectly (backups for ESXi!). However, I'd like to be able to a) expand and b) make it HA. All the documentation I can find on setting up an HA cluster for file stores replicates data between 2 servers and then serves from those machines (I trust the SANs to take care of the data and don't want to replicate anything -- cost!). Basically all I want is for the node that serves the ZFS pool to be HA (if this were to be put into production we have around 128TB and are looking to expand to a PB). We have a couple of IBM SVCs that seem to handle the HA node setup in some obscure proprietary IBM way, so logically it seems possible. Clients would only be making changes via a single 'ZFS proxy' at a time (multi-pathing set up for failover only), so I don't believe I'd need to OCFS the setup? If I do need to set up OCFS, can I put ZFS on top of that? (I want snapshotting/rollback and replication to an off-site location, as well as all the goodness of thin provisioning and de-duplication.) However, when I imported the ZFS pool onto the 2nd box I got large warnings about it being mounted elsewhere and I needed to force the import; then when importing the LUNs I saw that the GUID was different, so multi-pathing doesn't pick up that the LUNs are the same? Can I change a GUID via stmfadm? Is any of this even possible over fiber channel? Is anyone able to point me at some documentation? Am I simply crazy? Any input would be most welcome.
Thanks in advance,
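On the HA part of the question above: without cluster software, the usual model is that exactly one node imports the pool at a time -- the "mounted elsewhere" warning exists precisely because importing a pool on two nodes at once corrupts it. A minimal manual-failover sketch ('tank' is a placeholder pool name):

```shell
# Active node releases the pool cleanly:
zpool export tank

# Standby node takes over:
zpool import tank

# If the active node died without exporting, force the import --
# ONLY safe if the other node is verifiably down:
zpool import -f tank
```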
[zfs-discuss] zfs/iSCSI: 0000 = SNS Error Type: Current Error (0x70)
Hi, I'm trying to track down an error with a 64-bit x86 OpenSolaris 2009.06 box sharing ZFS via iSCSI to an Ubuntu 10.04 client. The client can successfully log in, but no device node appears. I captured a session with wireshark. When the client attempts a "SCSI: Inquiry LUN: 0x00", OpenSolaris sends a "SCSI Response (Check Condition) LUN: 0x00" that contains the following:

.111 = SNS Error Type: Current Error (0x70)
Filemark: 0, EOM: 0, ILI: 0
0100 = Sense Key: Hardware Error (0x04)

The ZFS being exported is a 400GB chunk of a 1TB ZFS mirror. The underlying OS reports no hardware errors, and "zpool status" looks OK. Why would OpenSolaris give this error? Is there anything I can do about it? Any suggestions would be appreciated. (I discussed this with the open-iscsi people at http://groups.google.com/group/open-iscsi/browse_thread/thread/06b83227ffc6a31a/2e58a163e21ec74e#2e58a163e21ec74e.) Thanks, ==ml
Re: [zfs-discuss] 64-bit vs 32-bit applications
On 17.08.10 04:17, Will Murnane wrote: On Mon, Aug 16, 2010 at 21:58, Kishore Kumar Pusukuri wrote: Hi, I am surprised by the performance of some 64-bit multi-threaded applications on my AMD Opteron machine. For most of the applications, the performance of the 32-bit version is almost the same as the performance of the 64-bit version. However, for a couple of applications, the 32-bit versions provide better performance (running time is around 76 secs) than the 64-bit ones (running time is around 96 secs). Could anyone help me find the reason behind this, please? [...] This list discusses the ZFS filesystem. Perhaps you'd be better off posting to perf-discuss or tools-gcc? That said, you need to provide more information. What compiler and flags did you use? What does your program (broadly speaking) do? What did you measure to conclude that it's slower in 64-bit mode? Add to that: what OS are you using? Michael -- michael.schus...@oracle.com http://blogs.sun.com/recursion Recursion, n.: see 'Recursion'
[zfs-discuss] Degraded Pool, Spontaneous Reboots
Hello, I've been getting warnings that my ZFS pool is degraded. At first it was complaining about a few corrupt files, which were listed as hex numbers instead of filenames, e.g. VOL1:<0x0>. After a scrub, a couple of the filenames appeared - turns out they were in snapshots I don't really need, so I destroyed those snapshots and started a new scrub. Subsequently, I typed "zpool status -v VOL1" ... and the machine rebooted. When I could log on again, I looked at /var/log/messages, but found nothing interesting prior to the reboot. I typed "zpool status -v VOL1" again, whereupon the machine rebooted. When the machine was back up, I stopped the scrub, waited a while, then typed "zpool status -v VOL1" again, and this time got:

r...@nexenta1:~# zpool status -v VOL1
  pool: VOL1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub canceled on Wed Aug 11 11:03:15 2010
config:

        NAME        STATE     READ WRITE CKSUM
        VOL1        DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            c2d0    DEGRADED     0     0     0  too many errors
            c3d0    DEGRADED     0     0     0  too many errors
            c4d0    DEGRADED     0     0     0  too many errors
            c5d0    DEGRADED     0     0     0  too many errors

So, I have the following questions:
1) How do I find out which file is corrupt, when I only get something like "VOL1:<0x0>"?
2) What could be causing these reboots?
3) How can I fix my pool?
Thanks!
Re: [zfs-discuss] ZFS p[erformance drop with new Xeon 55xx and 56xx cpus
On 08/12/10 04:16, Steve Gonczi wrote: Greetings, I am seeing some unexplained performance drop using the above CPUs, on a fairly up-to-date build (late 145). Basically, the system seems to be 98% idle, spending most of its time in this stack:

unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`idle+0x114
unix`thread_start+0x8
455645

Most CPUs seem to be idling most of the time, sitting on the mwait instruction. No lock contention, not waiting on I/O; I am finding myself at a loss explaining what this system is doing. (I am monitoring the system w. lockstat, mpstat, prstat.) Despite the predominantly idle system, I see some latency reported by prstat microstate accounting on the zfs threads. This is a fairly beefy box: 24G memory, 16 CPUs. Doing a local zfs send | receive, I should be getting at least 100MB+, and I am only getting 5-10MB. I see some Intel errata on the 55xx series Xeons, a problem with the monitor/mwait instructions, that could conceivably cause a missed wake-up or mis-reported mwait status. I'd suggest you supply a bit more information (to the list, not to me, I don't know very much about zfs internals):
- zpool/zfs configuration
- history of this issue: has it been like this since you installed the machine?
- if not: what changes were introduced around the time you saw this first?
- does this happen on a busy machine too?
- describe your test in more detail
- provide measurements (lockstat, iostat, maybe some DTrace) before and during the test, and add some timestamps so people can correlate data to events.
- anything else you can think of that might be relevant.
HTH Michael
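Two measurements of the kind suggested above, as a sketch (run as root; the sampling rate and intervals are arbitrary choices, not anything prescribed in the thread):

```shell
# Sample on-CPU kernel stacks at ~997 Hz for 30 seconds,
# then print the hottest stacks:
dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-30s { exit(0); }'

# Per-device I/O latency, timestamped so it can be correlated
# with other logs, every 5 seconds:
iostat -xnz -T d 5
```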
Re: [zfs-discuss] core dumps eating space in snapshots
On 27.07.10 14:21, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of devsk I have many core files stuck in snapshots eating up gigs of my disk space. Most of these are BE's which I don't really want to delete right now. Ok, you don't want to delete them ... Is there a way to get rid of them? I know snapshots are RO but can I do some magic with clones and reclaim my space? You don't want to delete them, but you don't want them to take up space either? Um ... Sorry, can't be done. Move them to a different disk ... Or clarify what it is that you want. If you're saying you have core files in your present filesystem that you don't want to delete ... And you also have core files in snapshots that you *do* want to delete ... As long as the file hasn't been changing, it's not consuming space beyond what's in the current filesystem. (See the output of zfs list, looking at sizes, and you'll see that.) If it has been changing ... the cores in the snapshot are in fact different from the cores in the present filesystem ... then the only way to delete them is to destroy snapshots. Or have I still misunderstood the question? Yes, I think so. Here's how I read it: the snapshots contain lots more than the core files, and the OP wants to remove only the core files (I'm assuming they weren't discovered before the snapshot was taken) but retain the rest. Does that explain it better? HTH Michael -- michael.schus...@oracle.com http://blogs.sun.com/recursion Recursion, n.: see 'Recursion'
Re: [zfs-discuss] Help identify failed drive
On Mon, Jul 19, 2010 at 4:35 PM, Richard Elling wrote:
> It depends on if the problem was fixed or not. What says
> zpool status -xv
> -- richard

[r...@nas01 ~]# zpool status -xv
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 14h2m with 0 errors on Sun Jul 18 18:32:38 2010
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2    ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
          raidz2    DEGRADED     0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  DEGRADED     0     0     0  too many errors
            c2t6d0  ONLINE       0     0     0
            c2t7d0  ONLINE       0     0     0

It was never fixed. I thought I needed to replace the drive. Should I mark it as "resolved" or whatever the syntax is and re-run a scrub?
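Following the 'action' text in the status output above, the clear-and-rescrub path looks like this (pool and device names taken from that output):

```shell
zpool clear tank c2t5d0   # reset the error counters on the degraded disk
zpool scrub tank          # re-read and verify the whole pool
zpool status -v tank      # see whether the errors come back
```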
Re: [zfs-discuss] Help identify failed drive
On Mon, Jul 19, 2010 at 4:26 PM, Richard Elling wrote:
> Aren't you assuming the I/O error comes from the drive?
> fmdump -eV

Okay - I guess I am. Is this just telling me "hey stupid, a checksum failed"? In which case, why did this never resolve itself and the specific device get marked as degraded?

Apr 04 2010 21:52:38.920978339 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0x64350d4040300c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xfd80ebd352cc9271
                vdev = 0x29282dc6fa073a2
        (end detector)
        pool = tank
        pool_guid = 0xfd80ebd352cc9271
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x29282dc6fa073a2
        vdev_type = disk
        vdev_path = /dev/dsk/c2t5d0s0
        vdev_devid = id1,s...@sata_st31500341as9vs077gt/a
        parent_guid = 0xc2d5959dd2c07bf7
        parent_type = raidz
        zio_err = 0
        zio_offset = 0x40abbf2600
        zio_size = 0x200
        zio_objset = 0x10
        zio_object = 0x1c06000
        zio_level = 2
        zio_blkid = 0x0
        __ttl = 0x1
        __tod = 0x4bb96c96 0x36e503a3
Re: [zfs-discuss] Help identify failed drive
On Mon, Jul 19, 2010 at 4:16 PM, Marty Scholes wrote:
> Start a scrub or do an obscure find, e.g. "find /tank_mountpoint -name core", and watch the drive activity lights. The drive in the pool which isn't blinking like crazy is a faulted/offlined drive.

Actually I guess my real question is why iostat hasn't logged any errors in its counters, even though the device has been bad in there for months?
Re: [zfs-discuss] Help identify failed drive
On Mon, Jul 19, 2010 at 4:16 PM, Marty Scholes wrote:
> Start a scrub or do an obscure find, e.g. "find /tank_mountpoint -name core", and watch the drive activity lights. The drive in the pool which isn't blinking like crazy is a faulted/offlined drive.
> Ugly and oh-so-hackerish, but it works.

That was my idea, except figuring out something to make just specific drives write one at a time. Although if it has been offlined or whatever, then it shouldn't receive any requests, so that sounds even easier. :)
Re: [zfs-discuss] Help identify failed drive
On Mon, Jul 19, 2010 at 3:11 PM, Haudy Kazemi wrote:
> 'iostat -Eni' indeed outputs the Device ID on some of the drives, but I still can't understand how it helps me to identify the model of a specific drive.

Curious:

[r...@nas01 ~]# zpool status -x
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 14h2m with 0 errors on Sun Jul 18 18:32:38 2010
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2    ONLINE       0     0     0
            ...
          raidz2    DEGRADED     0     0     0
            ...
            c2t5d0  DEGRADED     0     0     0  too many errors
            ...

c2t5d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST31500341AS Revision: SD1B
Device Id: id1,s...@sata_st31500341as9vs077gt
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

Why has it been reported as bad (for probably 2 months now; I haven't got around to figuring out which disk in the case it is, etc.), but iostat isn't showing me any errors? Note: I do a weekly scrub too. Not sure if that matters or helps reset the device.
Re: [zfs-discuss] Recommended RAM for ZFS on various platforms
Garrett D'Amore wrote:
> On Fri, 2010-07-16 at 10:24 -0700, Michael Johnson wrote:
>> I'm currently planning on running FreeBSD with ZFS, but I wanted to double-check how much memory I'd need for it to be stable. The ZFS wiki currently says you can go as low as 1 GB, but recommends 2 GB; however, elsewhere I've seen someone claim that you need at least 4 GB. Does anyone here know how much RAM FreeBSD would need in this case?
>> Likewise, how much RAM does OpenSolaris need for stability when running ZFS? How about other OpenSolaris-based OSs, like NexentaStor? (My searching found that OpenSolaris recommended at least 1 GB, while NexentaStor said 2 GB was okay, 4 GB was better. I'd be interested in hearing your input, though.)
>
> 1GB isn't enough for a real system. 2GB is a bare minimum. If you're going to use dedup, plan on a *lot* more. I think 4 or 8 GB are good for a typical desktop or home NAS setup. With FreeBSD you may be able to get away with less. (Probably, in fact.)

Fortunately, I don't need deduplication; it's kind of a nice feature, but the extra RAM it would take isn't worth it. Just curious, why do you say I'd be able to get away with less RAM in FreeBSD (as compared to NexentaStor, I'm assuming)? I don't know tons about the OSs in question; is FreeBSD just leaner in general?

>> If it matters, I'm currently planning on RAID-Z2 with 4x500GB consumer-grade SATA drives. (I know that's not a very efficient configuration, but I'd really like the redundancy of RAID-Z2 and I just don't need more than 1 TB of available storage right now, or for the next several years.) This is on an AMD64 system, and the OS in question will be running inside of VirtualBox, with raw access to the drives.
>
> Btw, instead of RAIDZ2, I'd recommend simply using a stripe of mirrors. You'll have better performance, and good resilience against errors. And you can grow later as you need to by just adding additional drive pairs.
A pair of mirrors would be nice, but would only protect against 100% of single-drive failures and about two-thirds of two-drive failures (with four drives in two mirror pairs, only losing both halves of the same mirror is fatal). Performance is less important to me than redundancy; this setup won't be seeing tons of disk activity, but I want it to be as reliable as possible. Michael
[zfs-discuss] Recommended RAM for ZFS on various platforms
I'm currently planning on running FreeBSD with ZFS, but I wanted to double-check how much memory I'd need for it to be stable. The ZFS wiki currently says you can go as low as 1 GB, but recommends 2 GB; however, elsewhere I've seen someone claim that you need at least 4 GB. Does anyone here know how much RAM FreeBSD would need in this case? Likewise, how much RAM does OpenSolaris need for stability when running ZFS? How about other OpenSolaris-based OSs, like NexentaStor? (My searching found that OpenSolaris recommended at least 1 GB, while NexentaStor said 2 GB was okay, 4 GB was better. I'd be interested in hearing your input, though.) If it matters, I'm currently planning on RAID-Z2 with 4x500GB consumer-grade SATA drives. (I know that's not a very efficient configuration, but I'd really like the redundancy of RAID-Z2 and I just don't need more than 1 TB of available storage right now, or for the next several years.) This is on an AMD64 system, and the OS in question will be running inside of VirtualBox, with raw access to the drives. Thanks, Michael
Re: [zfs-discuss] Encryption?
Garrett wrote:
> I don't know about ramifications (though I suspect that a broadening error scope would decrease ZFS' ability to isolate and work around problematic regions on the media), but one thing I do know. If you use FreeBSD disk encryption below ZFS, then you won't be able to import your pools to another implementation -- you will be stuck with FreeBSD.

This is an excellent point. Geli isn't a good option for me, then, though using encryption outside of the VM would still work.

> Btw, if you want a commercially supported and maintained product, have you looked at NexentaStor? Regardless of what happens with OpenSolaris, we aren't going anywhere. (Full disclosure: I'm a Nexenta Systems employee. :-)

I probably ought to consider other OpenSolaris alternatives, like NexentaStor. (Though I'd be looking at the free version, not the commercial one: this is just for personal use, despite how careful I'm being with it. :) ) However (and please correct me if I'm wrong), isn't your future still tied to the future of OpenSolaris? The code is open, of course, but my understanding is that there isn't the same kind of developer community supporting OpenSolaris itself that you see with Linux (or even the BSDs). In other words, if Oracle stops development of OpenSolaris, there wouldn't be enough developers still working on it to keep it from stagnating. Or are you saying that you employ enough kernel hackers to keep up even without Oracle? (I am admittedly ignorant about the OpenSolaris developer community; this is all based on others' statements and opinions that I've read.) Michael
Re: [zfs-discuss] Encryption?
Nikola M wrote:
> Freddie Cash wrote:
>> You definitely want to do the ZFS bits from within FreeBSD.
> Why not using ZFS in OpenSolaris? At least it has the most stable/tested implementation and also the newest one if needed?

I'd love to use OpenSolaris for exactly those reasons, but I'm wary of using an operating system that may not continue to be updated/maintained. If OpenSolaris had continued to be regularly released after Oracle bought Sun I'd be choosing it. As it is, I don't want to be pessimistic, but the doubt about OpenSolaris's future is enough to make me choose FreeBSD instead. (I'm sure that such sentiments won't make me popular here, but so far Oracle has been frustratingly silent on their plans for OpenSolaris.) At the very least, if FreeBSD doesn't do what I want I can switch the system disk to OpenSolaris and keep using the same pool. (Right?) Going back to my original question: does anyone know of any problems that could be caused by using raidz on top of encrypted drives? If there were a physical read error, which would get amplified by the encryption layer (if I'm understanding full-disk encryption correctly, which I may not be), would ZFS still be able to recover?
Re: [zfs-discuss] Encryption?
on 11/07/2010 15:54 Andriy Gapon said the following:
> on 11/07/2010 14:21 Roy Sigurd Karlsbakk said the following:
>>> I'm planning on running FreeBSD in VirtualBox (with a Linux host) and giving it raw disk access to four drives, which I plan to configure as a raidz2 volume.
>> Wouldn't it be better or just as good to use fuse-zfs for such a configuration? I/O from VirtualBox isn't really very good, but then, I haven't tested the linux/fbsd configuration...

Like Freddie already mentioned, I'd heard that fuse-zfs wasn't really all that good of an option, and I wanted something that was more stable/reliable.

> Hmm, an unexpected question IMHO - wouldn't it be better to just install FreeBSD on the hardware? :-)
> If the original poster is using Linux as a host OS, then probably he has some very good reason to do that. But performance- etc.-wise, directly using FreeBSD, of course, should win over fuse-zfs. Right?
> [Installing and maintaining one OS instead of two is the first thing that comes to mind]

I'm going with a virtual machine because the box I ended up building for this was way more powerful than I needed for just my file server; thus, I figured I'd use it as a personal machine too. (I wanted ECC RAM, and there just aren't that many motherboards that support ECC RAM that are also really cheap and low-powered.) And since I'm much more comfortable with Linux, I wanted to use it for the "personal" side of things.
[zfs-discuss] Encryption?
I'm planning on running FreeBSD in VirtualBox (with a Linux host) and giving it raw disk access to four drives, which I plan to configure as a raidz2 volume. On top of that, I'm considering using encryption. I understand that ZFS doesn't yet natively support encryption, so my idea was to set each drive up with full-disk encryption in the Linux host (e.g., using TrueCrypt or dmcrypt), mount the encrypted drives, and then give the virtual machine access to the virtual unencrypted drives. So the encryption would be transparent to FreeBSD. However, I don't know enough about ZFS to know if this is a good idea. I know that I need to specifically configure VirtualBox to respect cache flushes, so that data really is on disk when ZFS expects it to be. Would putting ZFS on top of full-disk encryption like this cause any problems? E.g., if the (encrypted) physical disk has a problem and as a result a larger chunk of the unencrypted data is corrupted, would ZFS handle that well? Are there any other possible consequences of this idea that I should know about? (I'm not too worried about any hits in performance; I won't be reading or writing heavily, nor in time-sensitive applications.) I should add that since this is a desktop I'm not nearly as worried about encryption as if it were a laptop (theft or loss are less likely), but encryption would still be nice. However, data integrity is the most important thing (I'm storing backups of my personal files on this), so if there's a chance that ZFS wouldn't handle errors well when on top of encryption, I'll just go without it. Thanks, Michael
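A hedged sketch of the host-side layering described above. The device names, mapping names, and VM name are all made up, the controller key assumes the default IDE controller, and the IgnoreFlush knob is the VirtualBox setting usually cited for making the guest's cache flushes reach the physical disk:

```shell
# Encrypt and open one physical disk with dm-crypt/LUKS
# (repeat per drive; /dev/sdb is a placeholder):
cryptsetup luksFormat /dev/sdb
cryptsetup luksOpen /dev/sdb zdisk0        # yields /dev/mapper/zdisk0

# Hand the decrypted mapping to VirtualBox as a raw disk:
VBoxManage internalcommands createrawvmdk \
    -filename ~/zdisk0.vmdk -rawdisk /dev/mapper/zdisk0

# Tell VirtualBox to honour flushes on that (assumed) IDE port,
# so ZFS's write ordering is preserved:
VBoxManage setextradata "FreeBSD-NAS" \
    "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0
```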
[zfs-discuss] Consequences of resilvering failure
I'm just about to start using ZFS in a RAIDZ configuration for a home file server (mostly holding backups), and I wasn't clear on what happens if data corruption is detected while resilvering. For example: let's say I'm using RAIDZ1 and a drive fails. I pull it and put in a new one. While resilvering, ZFS detects corrupt data on one of the remaining disks. Will the resilvering continue, with some files marked as containing errors, or will it simply fail? (I found this process[1] to repair damaged data, but I wasn't sure what would happen if it was detected in the middle of resilvering.) I will of course have a backup of the pool, but I may opt for additional backup if the entire pool could be lost due to data corruption (as opposed to just a few files potentially being lost). Thanks, Michael [1] http://dlc.sun.com/osol/docs/content/ZFSADMIN/gbbwl.html
Re: [zfs-discuss] b134 pool borked!
Just in case any stray searches find their way here, this is what happened to my pool: http://phrenetic.to/zfs
Re: [zfs-discuss] Native ZFS for Linux
On Fri, Jun 11, 2010 at 2:50 AM, Alex Blewitt wrote:
> You are sadly mistaken.
> From GNU.org on license compatibilities:
> http://www.gnu.org/licenses/license-list.html
> Common Development and Distribution License (CDDL), version 1.0
> This is a free software license. It has a copyleft with a scope that's similar to the one in the Mozilla Public License, which makes it incompatible with the GNU GPL. This means a module covered by the GPL and a module covered by the CDDL cannot legally be linked together. We urge you not to use the CDDL for this reason.
> Also unfortunate in the CDDL is its use of the term “intellectual property”.
> Whether a license is classified as "Open Source" or not does not imply that all open source licenses are compatible with each other.

Can we stop the license talk *yet again*? Nobody here is a lawyer (IANAL!) and everyone has their own interpretations and is splitting hairs. In my opinion, the source code itself shouldn't be ported; the CONCEPTS should be. Then there's no licensing issue at all. No questions, etc. To me, ZFS is important for bitrot protection; pooled storage and snapshots come in handy in a couple of places. Getting a COW filesystem w/ snapshots and storage pooling would cover a lot of the demand for ZFS as far as I'm concerned. (However, that's when a comparison with Btrfs makes sense, as it is COW too.) The minute I saw "ZFS on Linux" I knew this would degrade into a virtual pissing contest of "my understanding is better than yours" and a licensing fight.
To me, this is what needs to happen:
a) Get a Sun/Oracle attorney involved who understands this and flat out explains what needs to be done to allow ZFS to be used with the Linux kernel, or
b) Port the concepts and not the code (or the portions of code under the restrictive license), or
c) Look at Btrfs or other filesystems which may be extended to give the same capabilities as ZFS without the licensing issue, and focus all this development time on extending those.
Re: [zfs-discuss] zpool replace lockup / replace process now stalled, how to fix?
For the record, in case anyone else experiences this behaviour: I tried various things which failed, and finally, as a last-ditch effort, upgraded my FreeBSD, giving me zpool v14 rather than v13 - and now it's resilvering as it should. Michael

On Monday 17 May 2010 09:26:23 Michael Donaghy wrote:
> Hi,
> I recently moved to a freebsd/zfs system for the sake of data integrity, after losing my data on linux. I've now had my first hard disk failure; the BIOS refused to even boot with the failed drive (ad18) connected, so I removed it. I have another drive, ad16, which had enough space to replace the failed one, so I partitioned it and attempted to use "zpool replace" to replace the failed partitions with new ones, i.e. "zpool replace tank ad18s1d ad16s4d". This seemed to simply hang, with no processor or disk use; any "zpool status" commands also hung. Eventually I attempted to reboot the system, which also eventually hung; after waiting a while, having no other option, rightly or wrongly, I hard-rebooted. Exactly the same behaviour happened with the other zpool replace.
>
> Now, my zpool status looks like:
> arcueid ~ $ zpool status
>   pool: tank
>  state: DEGRADED
>  scrub: none requested
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         tank           DEGRADED     0     0     0
>           raidz2       DEGRADED     0     0     0
>             ad4s1d     ONLINE       0     0     0
>             ad6s1d     ONLINE       0     0     0
>             ad9s1d     ONLINE       0     0     0
>             ad17s1d    ONLINE       0     0     0
>             replacing  DEGRADED     0     0     0
>               ad18s1d  UNAVAIL      0 9.62K     0  cannot open
>               ad16s4d  ONLINE       0     0     0
>             ad20s1d    ONLINE       0     0     0
>           raidz2       DEGRADED     0     0     0
>             ad4s1e     ONLINE       0     0     0
>             ad6s1e     ONLINE       0     0     0
>             ad17s1e    ONLINE       0     0     0
>             replacing  DEGRADED     0     0     0
>               ad18s1e  UNAVAIL      0 11.2K     0  cannot open
>               ad16s4e  ONLINE       0     0     0
>             ad20s1e    ONLINE       0     0     0
>
> errors: No known data errors
>
> It looks like the replace has taken in some sense, but ZFS doesn't seem to be resilvering as it should.
> Attempting to zpool offline doesn't work:
>
> arcueid ~ # zpool offline tank ad18s1d
> cannot offline ad18s1d: no valid replicas
>
> Attempting to scrub causes a similar hang to before. Data is still readable
> (from the zvol which is the only thing actually on this filesystem),
> although slowly.
>
> What should I do to recover this / trigger a proper replace of the failed
> partitions?
>
> Many thanks,
> Michael
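For anyone hitting a similarly stalled replace, the usual escape hatch is to cancel it rather than wait it out. A minimal sketch, reusing the pool and device names from the thread above; note that detaching the failed half of a `replacing` pair only succeeds once the new device holds a full copy of the data, which was exactly what this wedged pool could not finish, so treat these as things to try, not a guaranteed fix:

```shell
# Inspect the stalled replace
zpool status -v tank

# Cancel the stuck operation by detaching the failed half of the
# "replacing" pair (valid once the new device has fully resilvered):
zpool detach tank ad18s1d

# ...or, if the replace never made progress, try re-issuing it:
zpool replace tank ad18s1d ad16s4d
```

As the follow-up in this thread shows, what actually unwedged the resilver was upgrading to a newer zpool version (v13 to v14).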
Re: [zfs-discuss] zfs mount -a kernel panic
On 19.05.10 17:53, John Andrunas wrote:
> Not to my knowledge, how would I go about getting one? (CC'ing discuss)

man savecore and dumpadm.

Michael

> On Wed, May 19, 2010 at 8:46 AM, Mark J Musante wrote:
>> Do you have a coredump? Or a stack trace of the panic?
>>
>> On Wed, 19 May 2010, John Andrunas wrote:
>>> Running ZFS on a Nexenta box, I had a mirror get broken and apparently
>>> the metadata is corrupt now. If I try and mount vol2 it works, but if I
>>> try "mount -a" or mount vol2/vm2 it instantly kernel panics and reboots.
>>> Is it possible to recover from this? I don't care if I lose the file
>>> listed below, but the other data in the volume would be really nice to
>>> get back. I have scrubbed the volume to no avail. Any other thoughts?
>>>
>>> zpool status -xv vol2
>>>   pool: vol2
>>>  state: ONLINE
>>> status: One or more devices has experienced an error resulting in data
>>>         corruption. Applications may be affected.
>>> action: Restore the file in question if possible. Otherwise restore the
>>>         entire pool from backup.
>>>    see: http://www.sun.com/msg/ZFS-8000-8A
>>>  scrub: none requested
>>> config:
>>>
>>>         NAME        STATE     READ WRITE CKSUM
>>>         vol2        ONLINE       0     0     0
>>>           mirror-0  ONLINE       0     0     0
>>>             c3t3d0  ONLINE       0     0     0
>>>             c3t2d0  ONLINE       0     0     0
>>>
>>> errors: Permanent errors have been detected in the following files:
>>>
>>>         vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk
>>>
>>> --
>>> John
>>
>> Regards,
>> markm

--
michael.schus...@oracle.com http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
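To answer the "how would I go about getting one?" question concretely: crash dumps are configured with dumpadm and retrieved with savecore. A minimal sketch for Solaris/OpenSolaris; the dump device and directory shown are the usual defaults, not taken from the poster's system:

```shell
# Show the current crash-dump configuration
dumpadm

# Dump kernel pages to the swap device and save cores under /var/crash
# (device and directory are illustrative; adjust for your system)
dumpadm -d swap -s /var/crash/`hostname`

# After the next panic and reboot, extract the dump from the dump device
savecore

# Examine the saved dump (unix.N / vmcore.N) with the modular debugger
mdb unix.0 vmcore.0
```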
[zfs-discuss] zpool replace lockup / replace process now stalled, how to fix?
Hi,

I recently moved to a freebsd/zfs system for the sake of data integrity, after losing my data on linux. I've now had my first hard disk failure; the BIOS refused to even boot with the failed drive (ad18) connected, so I removed it. I have another drive, ad16, which had enough space to replace the failed one, so I partitioned it and attempted to use "zpool replace" to swap the failed partitions for new ones, i.e. "zpool replace tank ad18s1d ad16s4d". This seemed to simply hang, with no processor or disk use; any "zpool status" commands also hung. Eventually I attempted to reboot the system, which also eventually hung; after waiting a while, having no other option, rightly or wrongly, I hard-rebooted. Exactly the same behaviour happened with the other zpool replace.

Now, my zpool status looks like:

arcueid ~ $ zpool status
  pool: tank
 state: DEGRADED
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED     0     0     0
          raidz2       DEGRADED     0     0     0
            ad4s1d     ONLINE       0     0     0
            ad6s1d     ONLINE       0     0     0
            ad9s1d     ONLINE       0     0     0
            ad17s1d    ONLINE       0     0     0
            replacing  DEGRADED     0     0     0
              ad18s1d  UNAVAIL      0 9.62K     0  cannot open
              ad16s4d  ONLINE       0     0     0
            ad20s1d    ONLINE       0     0     0
          raidz2       DEGRADED     0     0     0
            ad4s1e     ONLINE       0     0     0
            ad6s1e     ONLINE       0     0     0
            ad17s1e    ONLINE       0     0     0
            replacing  DEGRADED     0     0     0
              ad18s1e  UNAVAIL      0 11.2K     0  cannot open
              ad16s4e  ONLINE       0     0     0
            ad20s1e    ONLINE       0     0     0

errors: No known data errors

It looks like the replace has taken in some sense, but ZFS doesn't seem to be resilvering as it should. Attempting to zpool offline doesn't work:

arcueid ~ # zpool offline tank ad18s1d
cannot offline ad18s1d: no valid replicas

Attempting to scrub causes a similar hang to before. Data is still readable (from the zvol which is the only thing actually on this filesystem), although slowly.

What should I do to recover this / trigger a proper replace of the failed partitions?

Many thanks,
Michael
Re: [zfs-discuss] Opteron 6100? Does it work with opensolaris?
I agree on the motherboard and peripheral chipset issue. This, and the last generation of AMD quad/six-core motherboards, all seem to use the AMD SP56x0/SP5100 chipset, for which I can't find much information about support under either OpenSolaris or FreeBSD. Another issue is the LSI SAS2008 SAS controller chipset, which is frequently offered as an onboard option on many motherboards and still seems to be somewhat of a work in progress with regard to being 'production ready'.

On May 11, 2010, at 3:29 PM, Brandon High wrote:
> On Tue, May 11, 2010 at 5:29 AM, Thomas Burgess wrote:
>> I'm specifically looking at this motherboard:
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16813182230
>
> I'd be more concerned that the motherboard and its attached
> peripherals are unsupported than the processor. Solaris can handle 12
> cores with no problems.
>
> -B
>
> --
> Brandon High : bh...@freaks.com
Re: [zfs-discuss] osol monitoring question
On 10.05.10 08:57, Roy Sigurd Karlsbakk wrote:
> Hi all
>
> It seems that when using zfs, the usual tools like vmstat, sar, top etc.
> are quite worthless, since zfs i/o load is not reported as iowait etc.
> Are there any plans to rewrite the old performance monitoring tools, or
> the zfs parts, to allow for standard monitoring tools? If not, what
> other tools exist that can do the same?

"zpool iostat" for one.

Michael
--
michael.schus...@oracle.com http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
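For anyone landing here with the same question, the ZFS-aware counterpart of iostat is built into the zpool command. A minimal sketch; the pool name `tank` and the 5-second interval are illustrative:

```shell
# Pool-wide bandwidth and operations, refreshed every 5 seconds
zpool iostat tank 5

# Per-vdev breakdown, useful for spotting one slow disk
zpool iostat -v tank 5
```

General views such as `iostat -xn 5` still show the underlying disks; it is the iowait-style per-process accounting that ZFS I/O bypasses.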
Re: [zfs-discuss] why both dedup and compression?
This is interesting, but what about iSCSI volumes for virtual machines? Compress or de-dupe? Assuming the virtual machine was made from a clone of the original iSCSI or a master iSCSI volume. Does anyone have any real-world data on this? I would think the iSCSI volumes would diverge quite a bit over time, even with compression and/or de-duplication. Just curious…

On 6 May 2010, at 16:39 , Peter Tribble wrote:
> On Thu, May 6, 2010 at 2:06 AM, Richard Jahnel wrote:
>> I've googled this for a bit, but can't seem to find the answer.
>>
>> What does compression bring to the party that dedupe doesn't cover
>> already?
>
> Compression will reduce the storage requirements for non-duplicate data.
>
> As an example, I have a system that I rsync the web application data
> from a whole bunch of servers (zones) to. There's a fair amount of
> duplication in the application files (java, tomcat, apache, and the
> like) so dedup is a big win. On the other hand, there's essentially no
> duplication whatsoever in the log files, which are pretty big, but
> compress really well. So having both enabled works really well.
>
> --
> -Peter Tribble
> http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/

Mike

---
Michael Sullivan
michael.p.sulli...@me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034
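Since compression and dedup are per-dataset ZFS properties, the split Peter describes maps directly onto `zfs set`. A sketch with hypothetical dataset names (`tank/apps`, `tank/logs`):

```shell
# Duplicate-heavy application trees: dedup pays off
zfs set dedup=on tank/apps

# Unique but highly compressible log data: compression pays off
zfs set compression=on tank/logs

# Check the results after some data has landed
zfs get compressratio tank/logs
zpool list tank    # the DEDUP column shows the pool-wide dedup ratio
```

For cloned iSCSI zvols the same reasoning applies: blocks shared with the master stay deduplicated, while the diverging blocks mostly benefit from compression.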
Re: [zfs-discuss] Loss of L2ARC SSD Behaviour
Hi Marc,

Well, if you are striping over multiple devices, then your I/O should be spread over the devices and you should be reading from them all simultaneously rather than accessing a single device. Traditional striping would give roughly an n-fold performance improvement, where n is the number of disks the stripe is spread across. The round-robin access I am referring to is the way the L2ARC vdevs appear to be accessed: any given object will be taken from a single device rather than from several devices simultaneously, which would otherwise increase the I/O throughput. So, theoretically, a stripe spread over 4 disks would give 4 times the performance of reading from a single disk. This also assumes the controller can handle multiple I/Os, or that you are striped over a different disk controller for each disk in the stripe. SSD's are fast, but if I can read a block from more devices simultaneously, it will cut the latency of the overall read.

On 7 May 2010, at 02:57 , Marc Nicholas wrote:
> Hi Michael,
>
> What makes you think striping the SSDs would be faster than round-robin?
>
> -marc
>
> On Thu, May 6, 2010 at 1:09 PM, Michael Sullivan wrote:
> Everyone,
>
> Thanks for the help. I really appreciate it.
>
> Well, I actually walked through the source code with an associate today
> and we found out how things work by looking at the code.
>
> It appears that L2ARC is just assigned in round-robin fashion. If a
> device goes offline, then it goes to the next one and marks that one as
> offline. The failure to retrieve the requested object is treated like a
> cache miss and everything goes along its merry way, as far as we can
> tell.
>
> I would have hoped it to be different in some way. Like if the L2ARC was
> striped for performance reasons, that would be really cool, using that
> device as an extension of the VM model it is modeled after.
> Which would mean using the L2ARC as an extension of the virtual address
> space and striping it to make it more efficient. Way cool. If it took out
> the bad device and reconfigured the stripe, that would be even cooler,
> and replacing it with a hot spare cooler still. However, it appears from
> the source code that the L2ARC is just a (sort of) jumbled collection of
> ZFS objects. Yes, it gives you better performance if you have it, but it
> doesn't really use it in a way you might expect something as cool as ZFS
> would.
>
> I understand why it is read-only, and that it invalidates its cache when
> a write occurs; to be expected for any object written.
>
> If an object is not there because of a failure, or because it has been
> removed from the cache, it is treated as a cache miss, all well and good:
> go fetch from the pool.
>
> I also understand why the ZIL is important and that it should be mirrored
> if it is to be on a separate device. Though I'm wondering how it is
> handled internally when there is a failure of one of its default devices;
> but then again, it's on a regular pool and should be redundant enough,
> with only some degradation in speed.
>
> Breaking these devices out from their default locations is great for
> performance, and I understand that. I just wish the knowledge of how they
> work and their internal mechanisms were not so much of a black box. Maybe
> that is due to the speed at which ZFS is progressing and the features it
> adds with each subsequent release.
>
> Overall, I am very impressed with ZFS, its flexibility, and even more so
> the way it breaks all the rules about how storage should be managed; I
> really like it. I have yet to see anything come close to its approach to
> disk data management. Let's just hope it keeps moving forward; it is
> truly a unique way to view disk storage.
>
> Anyway, sorry for the ramble, but to everyone, thanks again for the
> answers.
>
> Mike
>
> ---
> Michael Sullivan
> michael.p.sulli...@me.com
> http://www.kamiogi.net/
> Japan Mobile: +81-80-3202-2599
> US Phone: +1-561-283-2034
>
> On 7 May 2010, at 00:00 , Robert Milkowski wrote:
>
> > On 06/05/2010 15:31, Tomas Ögren wrote:
> >> On 06 May, 2010 - Bob Friesenhahn sent me these 0,6K bytes:
> >>
> >>> On Wed, 5 May 2010, Edward Ned Harvey wrote:
> >>>
> >>>> In the L2ARC (cache) there is no ability to mirror, because cache
> >>>> device removal has always been supported. You can't mirror a cache
> >>>> device, because you don't need it.
> >>>>
> >>> How do you know that I don't need it? The ability seems useful to me.
> >>
Re: [zfs-discuss] Loss of L2ARC SSD Behaviour
Everyone,

Thanks for the help. I really appreciate it.

Well, I actually walked through the source code with an associate today and we found out how things work by looking at the code.

It appears that L2ARC is just assigned in round-robin fashion. If a device goes offline, then it goes to the next one and marks that one as offline. The failure to retrieve the requested object is treated like a cache miss and everything goes along its merry way, as far as we can tell.

I would have hoped it to be different in some way. Like if the L2ARC was striped for performance reasons, that would be really cool, using that device as an extension of the VM model it is modeled after. Which would mean using the L2ARC as an extension of the virtual address space and striping it to make it more efficient. Way cool. If it took out the bad device and reconfigured the stripe, that would be even cooler, and replacing it with a hot spare cooler still. However, it appears from the source code that the L2ARC is just a (sort of) jumbled collection of ZFS objects. Yes, it gives you better performance if you have it, but it doesn't really use it in a way you might expect something as cool as ZFS would.

I understand why it is read-only, and that it invalidates its cache when a write occurs; to be expected for any object written.

If an object is not there because of a failure, or because it has been removed from the cache, it is treated as a cache miss, all well and good: go fetch from the pool.

I also understand why the ZIL is important and that it should be mirrored if it is to be on a separate device. Though I'm wondering how it is handled internally when there is a failure of one of its default devices; but then again, it's on a regular pool and should be redundant enough, with only some degradation in speed.

Breaking these devices out from their default locations is great for performance, and I understand that.
I just wish the knowledge of how they work and their internal mechanisms were not so much of a black box. Maybe that is due to the speed at which ZFS is progressing and the features it adds with each subsequent release.

Overall, I am very impressed with ZFS, its flexibility, and even more so the way it breaks all the rules about how storage should be managed; I really like it. I have yet to see anything come close to its approach to disk data management. Let's just hope it keeps moving forward; it is truly a unique way to view disk storage.

Anyway, sorry for the ramble, but to everyone, thanks again for the answers.

Mike

---
Michael Sullivan
michael.p.sulli...@me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034

On 7 May 2010, at 00:00 , Robert Milkowski wrote:

> On 06/05/2010 15:31, Tomas Ögren wrote:
>> On 06 May, 2010 - Bob Friesenhahn sent me these 0,6K bytes:
>>
>>> On Wed, 5 May 2010, Edward Ned Harvey wrote:
>>>
>>>> In the L2ARC (cache) there is no ability to mirror, because cache
>>>> device removal has always been supported. You can't mirror a cache
>>>> device, because you don't need it.
>>>>
>>> How do you know that I don't need it? The ability seems useful to me.
>>>
>> The gain is quite minimal. If the first device fails (which doesn't
>> happen too often, I hope), then it will be read from the normal pool once
>> and then stored in ARC/L2ARC again. It just behaves like a cache miss
>> for that specific block. If this happens often enough to become a
>> performance problem, then you should throw away that L2ARC device
>> because it's broken beyond usability.
>>
> Well, if an L2ARC device fails there might be an unacceptable drop in
> delivered performance. If it were mirrored then the drop usually would be
> much smaller, or there could be no drop if a mirror had an option to read
> only from one side.
>
> Being able to mirror L2ARC might especially be useful once a persistent
> L2ARC is implemented, as after a node restart or a resource failover in a
> cluster the L2ARC will be kept warm. Then the only thing which might
> affect L2 performance considerably would be an L2ARC device failure...
>
> --
> Robert Milkowski
> http://milek.blogspot.com
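For reference, cache devices are added to a pool as a flat list, never as a mirror or raidz vdev, which matches the independent round-robin behaviour described in this thread. A sketch with hypothetical pool and device names:

```shell
# Attach two independent L2ARC devices; ZFS spreads cached blocks
# across them and simply skips a device that fails
zpool add tank cache c6t0d0 c6t1d0

# Mirroring is rejected for cache vdevs by design:
#   zpool add tank cache mirror c6t0d0 c6t1d0   -> errors out

# Watch per-cache-device activity
zpool iostat -v tank 5
```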
Re: [zfs-discuss] Loss of L2ARC SSD Behaviour
On 6 May 2010, at 13:18 , Edward Ned Harvey wrote:
>> From: Michael Sullivan [mailto:michael.p.sulli...@mac.com]
>>
>> While it explains how to implement these, there is no information
>> regarding failure of a device in a striped L2ARC set of SSD's. I have
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Cache_Devices
>
> It is not possible to mirror or use raidz on cache devices, nor is it
> necessary. If a cache device fails, the data will simply be read from the
> main pool storage devices instead.

I understand this.

> I guess I didn't write this part, but: If you have multiple cache devices,
> they are all independent from each other. Failure of one does not negate
> the functionality of the others.

OK, this is what I wanted to know: the L2ARC devices assigned to the pool are not striped but are independent. Loss of one drive will just cause a cache miss and force ZFS to go out to the pool for its objects. But then I'm not talking about using RAIDZ on a cache device; I'm talking about a striped device, which would be RAID-0. If the SSD's are all assigned to L2ARC, then they are not striped in any fashion (RAID-0), but are completely independent, and the L2ARC will continue to operate, just missing a single SSD.

>> I'm running 2009.11 which is the latest OpenSolaris.
>
> Quoi?? 2009.06 is the latest available from opensolaris.com and
> opensolaris.org.
>
> If you want something newer, AFAIK, you have to go to a developer build,
> such as osol-dev-134
>
> Sure you didn't accidentally get 2008.11?

My mistake… snv_111b which is 2009.06. I know it went up to 11 somewhere.

>> I am also well aware that the loss of a ZIL device will cause
>> loss of the entire pool. Which is why I would never have a ZIL device
>> unless it was mirrored and on different controllers.
>
> Um ... the log device is not special. If you lose *any* unmirrored device,
> you lose the pool.
Except for cache devices, or log devices on zpool >= 19.

Well, if I've got a separate ZIL which is mirrored for performance, and mirrored because I think my data is valuable and important, I will have something more than RAID-0 on my main storage pool too. More than likely RAIDZ2, since I plan on using L2ARC to help improve performance along with separate mirrored SSD ZIL devices.

>> From the information I've been reading about the loss of a ZIL device,
>> it will be relocated to the storage pool it is assigned to. I'm not
>> sure which version this is in, but it would be nice if someone could
>> provide the release number it is included in (and actually works).
>
> What the heck? Didn't I just answer that question?
> I know I said this is answered in the ZFS Best Practices Guide.
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices
>
> Prior to pool version 19, if you have an unmirrored log device that fails,
> your whole pool is permanently lost.
> Prior to pool version 19, mirroring the log device is highly recommended.
> In pool version 19 or greater, if an unmirrored log device fails during
> operation, the system reverts to the default behavior, using blocks from
> the main storage pool for the ZIL, just as if the log device had been
> gracefully removed via the "zpool remove" command.

No need to get defensive here; all I'm looking for is the zpool version number which supports it and the version of OpenSolaris which supports that zpool version. I think that if you are building for performance, it would be almost intuitive to have a mirrored ZIL in the event of failure, and perhaps even a hot spare available as well. I don't like the idea of my ZIL being transferred back to the pool, but having it transferred back is better than the alternative, which would be data loss or corruption.

>> Also, will this functionality be included in the mythical 2010.03
>> release?
>
> Zpool 19 was released in build 125, Oct 16, 2009. You can rest assured it
> will be included in 2010.03, or 04, or whenever that thing comes out.

Thanks, build 125.

>> So what you are saying is that if a single device fails in a striped
>> L2ARC VDEV, then the entire VDEV is taken offline and the fallback is to
>> simply use the regular ARC and fetch from the pool whenever there is a
>> cache miss.
>
> It sounds like you're only going to believe it if you test it. Go for it.
> That's what I did before I wrote that section of the ZFS Best Practices
> Guide.
>
> In ZFS, there is no such thing as striping, although
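The behaviour under discussion, mirrored log devices plus the pool version that tolerates a log failure, is easy to express as commands. A sketch with hypothetical pool and device names; `zpool upgrade -v` shows which pool versions the installed ZFS supports (log device removal arrived with version 19):

```shell
# Which pool versions does this system support?
zpool upgrade -v

# Mirrored ZIL on two SSDs, ideally on different controllers
zpool add tank log mirror c4t0d0 c5t0d0

# On pool version >= 19 the whole log vdev can also be removed
# gracefully (the vdev name, e.g. mirror-1, comes from zpool status)
zpool remove tank mirror-1
```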
Re: [zfs-discuss] Loss of L2ARC SSD Behaviour
Hi Ed,

Thanks for your answers. They seem to make sense, sort of…

On 6 May 2010, at 12:21 , Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Michael Sullivan
>>
>> I have a question I cannot seem to find an answer to.
>
> Google for ZFS Best Practices Guide (on solarisinternals). I know this
> answer is there.

My Google is very strong, and I have the Best Practices Guide committed to bookmark, as well as most of it to memory. While it explains how to implement these, there is no information regarding failure of a device in a striped L2ARC set of SSD's. I have been hard pressed to find this information anywhere, short of testing it myself, but I don't have the necessary hardware in a lab to test correctly. If someone has pointers to references, could you please provide them, chapter and verse, rather than the advice to "Go read the manual."

>> I know if I set up ZIL on SSD and the SSD goes bad, then the ZIL will be
>> relocated back to the pool. I'd probably have it mirrored anyway,
>> just in case. However you cannot mirror the L2ARC, so...
>
> Careful. The "log device removal" feature exists, and is present in the
> developer builds of opensolaris today. However, it's not included in
> opensolaris 2009.06, and it's not included in the latest and greatest
> solaris 10 yet. Which means, right now, if you lose an unmirrored ZIL
> (log) device, your whole pool is lost, unless you're running a developer
> build of opensolaris.

I'm running 2009.11, which is the latest OpenSolaris. I should have made that clear, and that I don't intend this to be on a Solaris 10 system, and am waiting for the next production build anyway. As you say, it does not exist in 2009.06, but that is not the latest production OpenSolaris, which is 2009.11, and I'd be more interested in its behavior than an older release's. I am also well aware that the loss of a ZIL device will cause loss of the entire pool.
Which is why I would never have a ZIL device unless it was mirrored and on different controllers.

From the information I've been reading about the loss of a ZIL device, it will be relocated to the storage pool it is assigned to. I'm not sure which version this is in, but it would be nice if someone could provide the release number it is included in (and actually works). Also, will this functionality be included in the mythical 2010.03 release? I'd also be interested to know what features along these lines will be available in 2010.03, if it ever sees the light of day.

>> What I want to know is, what happens if one of those SSD's goes bad?
>> What happens to the L2ARC? Is it just taken offline, or will it
>> continue to perform even with one drive missing?
>
> In the L2ARC (cache) there is no ability to mirror, because cache device
> removal has always been supported. You can't mirror a cache device,
> because you don't need it.
>
> If one of the cache devices fails, no harm is done. That device goes
> offline. The rest stay online.

So what you are saying is that if a single device fails in a striped L2ARC VDEV, then the entire VDEV is taken offline and the fallback is to simply use the regular ARC and fetch from the pool whenever there is a cache miss. Or, does what you are saying here mean that if I have 4 SSD's in a stripe for my L2ARC and one device fails, the L2ARC will be reconfigured dynamically, using the remaining SSD's for L2ARC? It would be good to get an answer to this from someone who has actually tested this, or is more intimately familiar with the ZFS code, rather than all the speculation I've been getting so far.

>> Sorry if these questions have been asked before, but I cannot seem to
>> find an answer.
>
> Since you said this twice, I'll answer it twice. ;-)
> I think the best advice regarding cache/log device mirroring is in the
> ZFS Best Practices Guide.

Been there, read that, many, many times.
It's an invaluable reference, I agree.

Thanks,

Mike

---
Michael Sullivan
michael.p.sulli...@me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034
Re: [zfs-discuss] b134 pool borked!
I got a suggestion to check the output of fmdump -eV for PCI errors, in case the controller is broken. Attached you'll find the fmdump -eV output from the last panic. It indicates that ZFS can't open the drives. That might suggest a broken controller, but my slog is on the motherboard's internal controller. One might think that the motherboard itself is toast, or do we have a case of unstable power?
-- This message posted from opensolaris.org

May 04 2010 19:44:31.716566239 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
        class = ereport.fs.zfs.vdev.open_failed
        ena = 0xeeed67dca00c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x97541c1ea1ad833e
                vdev = 0x645834a4c69584e5
        (end detector)
        pool = tank
        pool_guid = 0x97541c1ea1ad833e
        pool_context = 1
        pool_failmode = wait
        vdev_guid = 0x645834a4c69584e5
        vdev_type = disk
        vdev_path = /dev/dsk/c13t1d0s0
        vdev_devid = id1,s...@sata_wdc_wd5001aals-0_wd-wmasy3260051/a
        parent_guid = 0x6041a7903a345374
        parent_type = raidz
        prev_state = 0x1
        __ttl = 0x1
        __tod = 0x4be05cff 0x2ab5eedf

May 04 2010 19:44:31.716565705 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
        class = ereport.fs.zfs.vdev.open_failed
        ena = 0xeeed67dca00c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x97541c1ea1ad833e
                vdev = 0x928ecd01b281b313
        (end detector)
        pool = tank
        pool_guid = 0x97541c1ea1ad833e
        pool_context = 1
        pool_failmode = wait
        vdev_guid = 0x928ecd01b281b313
        vdev_type = disk
        vdev_path = /dev/dsk/c13t2d0s0
        vdev_devid = id1,s...@sata_samsung_hd103si___s1vsj90sc22634/a
        parent_guid = 0x6041a7903a345374
        parent_type = raidz
        prev_state = 0x1
        __ttl = 0x1
        __tod = 0x4be05cff 0x2ab5ecc9

May 04 2010 19:44:31.716565713 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
        class = ereport.fs.zfs.vdev.open_failed
        ena = 0xeeed67dca00c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x97541c1ea1ad833e
                vdev = 0xc6c893601f1263cb
        (end detector)
        pool = tank
        pool_guid = 0x97541c1ea1ad833e
        pool_context = 1
        pool_failmode = wait
        vdev_guid = 0xc6c893601f1263cb
        vdev_type = disk
        vdev_path = /dev/dsk/c8t0d0s0
        vdev_devid = id1,s...@sata_intel_ssdsa2m080__cvpo003401vt080bgn/a
        parent_guid = 0x97541c1ea1ad833e
        parent_type = root
        prev_state = 0x1
        __ttl = 0x1
        __tod = 0x4be05cff 0x2ab5ecd1

May 04 2010 19:44:31.716566468 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
        class = ereport.fs.zfs.vdev.open_failed
        ena = 0xeeed67dca00c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x97541c1ea1ad833e
                vdev = 0x381e0480469b4ed7
        (end detector)
        pool = tank
        pool_guid = 0x97541c1ea1ad833e
        pool_context = 1
        pool_failmode = wait
        vdev_guid = 0x381e0480469b4ed7
        vdev_type = disk
        vdev_path = /dev/dsk/c13t3d0s0
        vdev_devid = id1,s...@sata_samsung_hd103si___s1vsj90sc22045/a
        parent_guid = 0x6041a7903a345374
        parent_type = raidz
        prev_state = 0x1
        __ttl = 0x1
        __tod = 0x4be05cff 0x2ab5efc4

May 04 2010 19:44:31.716566182 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
        class = ereport.fs.zfs.vdev.open_failed
        ena = 0xeeed67dca00c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x97541c1ea1ad833e
                vdev = 0x6e5ce9b416a3f8a4
        (end detector)
        pool = tank
        pool_guid = 0x97541c1ea1ad833e
        pool_context = 1
        pool_failmode = wait
        vdev_guid = 0x6e5ce9b416a3f8a4
        vdev_type = disk
        vdev_path = /dev/dsk/c13t6d0s0
        vdev_devid = id1,s...@sata_wdc_wd6400aacs-0_wd-wcauf0934679/a
        parent_guid = 0x4491e617ebc26c75
        parent_type = raidz
        prev_state = 0x1
        __ttl = 0x1
        __tod = 0x4be05cff 0x2ab5eea6

May 04 2010 19:44:31.716565740 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
        class = ereport.fs.zfs.vdev.open_failed
        ena = 0xeeed67dca00c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x97541c1ea1ad833e
                vdev = 0x69f0986c92adda53
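For anyone facing a flood of ereports like these, the FMA error log can be summarized instead of dumped verbosely. A sketch using standard Solaris/OpenSolaris fault-management commands; the class filter shown matches the vdev open failures in this post:

```shell
# One line per error report: timestamp and class
fmdump -e

# Count report classes to see what dominates the log
fmdump -e | awk 'NR > 1 {print $NF}' | sort | uniq -c | sort -rn

# Full detail for just the ZFS vdev open failures
fmdump -eV -c ereport.fs.zfs.vdev.open_failed
```

Reports from many vdevs on one controller failing in the same instant (as above, note the near-identical __tod values) point at the controller or its power rather than the individual disks.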
Re: [zfs-discuss] b134 pool borked!
Thanks for your reply! I ran memtest86 and it did not report any errors. The disk controller I've not replaced yet. The server is up in multi-user mode with the broken pool in an un-imported state. format now works and properly lists all my devices without panicking. zpool import panics the box with the same stack trace as above. Could it still be the disk controller? I'd jump through the roof with happiness if that's the case. It's one of those Supermicro thumper controllers. Anyone know any good non-destructive diagnostics to run?
Re: [zfs-discuss] b134 pool borked!
This is what my zpool import looks like; attached you'll find the output of zdb -l for each device.

  pool: tank
    id: 10904371515657913150
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        tank        ONLINE
          raidz1-0  ONLINE
            c13t4d0 ONLINE
            c13t5d0 ONLINE
            c13t6d0 ONLINE
            c13t7d0 ONLINE
          raidz1-1  ONLINE
            c13t3d0 ONLINE
            c13t1d0 ONLINE
            c13t2d0 ONLINE
            c13t0d0 ONLINE
        cache
          c8t2d0
        logs
          c8t0d0    ONLINE

zdbl.gz
Description: Binary data
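For anyone wanting to reproduce the attached label dump, it is just zdb -l run over every device in the pool. A sketch; the device list is taken from the import output above, and slice s0 is an assumption about how these disks were labeled:

```shell
# Dump the four ZFS labels of each device; a label that is unreadable or
# disagrees with the others is a strong hint about what blocks the import
for d in c13t0d0 c13t1d0 c13t2d0 c13t3d0 c13t4d0 c13t5d0 c13t6d0 c13t7d0 \
         c8t0d0 c8t2d0; do
  echo "== $d =="
  zdb -l /dev/dsk/${d}s0
done | gzip > zdbl.gz
```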
Re: [zfs-discuss] b134 pool borked!
90 reads and not a single comment? Not the slightest hint of what's going on?
Re: [zfs-discuss] Loss of L2ARC SSD Behaviour
Ok, thanks. So, if I understand correctly, it will just remove the device from the VDEV and continue to use the good ones in the stripe.

Mike

---
Michael Sullivan
michael.p.sulli...@me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034

On 5 May 2010, at 04:34 , Marc Nicholas wrote:
> The L2ARC will continue to function.
>
> -marc
>
> On 5/4/10, Michael Sullivan wrote:
>> Hi,
>>
>> I have a question I cannot seem to find an answer to.
>>
>> I know I can set up a stripe of L2ARC SSD's with, say, 4 SSD's.
>>
>> I know if I set up ZIL on SSD and the SSD goes bad, then the ZIL will be
>> relocated back to the pool. I'd probably have it mirrored anyway, just
>> in case. However you cannot mirror the L2ARC, so...
>>
>> What I want to know is, what happens if one of those SSD's goes bad?
>> What happens to the L2ARC? Is it just taken offline, or will it continue
>> to perform even with one drive missing?
>>
>> Sorry if these questions have been asked before, but I cannot seem to
>> find an answer.
>>
>> Mike
>
> --
> Sent from my mobile device