Re: [zfs-discuss] New SSD options
40k IOPS sounds like best-case, you'll-never-see-it-in-the-real-world marketing to me. There are a few benchmarks if you google, and they all seem to indicate the performance is probably +/- 10% of an Intel X25-E. I would personally trust Intel over one of these drives. Is it even possible to buy a Zeus IOPS anywhere? I haven't been able to find one. I get the impression they mostly sell to other vendors like Sun? I'd be curious what the price on a 9GB Zeus IOPS is these days. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New SSD options
Don wrote:
With that in mind- Is anyone using the new OCZ Vertex 2 SSD's as a ZIL? They're claiming 50k IOPS (4k Write-Aligned), 2 million hour MTBF, TRIM support, etc. That's more write IOPS than the ZEUS (40k IOPS, $) but at half the price of an Intel X25-E (3.3k IOPS, $400). Needless to say I'd love to know if anyone has evaluated these drives to see if they make sense as a ZIL- for example- do they honor cache flush requests? Are those sustained IOPS numbers?

In my understanding nearly the only relevant number is the number of cache flushes a drive can handle per second, as this determines my single-thread performance. Has anyone an idea what numbers I can expect from an Intel X25-E or an OCZ Vertex 2?

-Arne ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New SSD options
On Tue, May 18, 2010 at 4:28 PM, Don d...@blacksun.org wrote:
With that in mind- Is anyone using the new OCZ Vertex 2 SSD's as a ZIL?

The SandForce drives currently available don't have an ultra-capacitor on them, so they could lose data if the system crashed. Enterprise-class drives based on the same chipset, which do have an ultra-cap, are supposed to be released any day now.

Needless to say I'd love to know if anyone has evaluated these drives to see if they make sense as a ZIL- for example- do they honor cache flush requests? Are those sustained IOPS numbers?

I don't think they do; the chipset was designed to use an ultra-cap to avoid having to honor flushes. Then again, the X25-E has the same problem.

-B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New SSD options
On 2010-05-19 08.32, sensille wrote:
Don wrote: With that in mind- Is anyone using the new OCZ Vertex 2 SSD's as a ZIL? They're claiming 50k IOPS (4k Write-Aligned), 2 million hour MTBF, TRIM support, etc. That's more write IOPS than the ZEUS (40k IOPS, $) but at half the price of an Intel X25-E (3.3k IOPS, $400). Needless to say I'd love to know if anyone has evaluated these drives to see if they make sense as a ZIL- for example- do they honor cache flush requests? Are those sustained IOPS numbers?
In my understanding nearly the only relevant number is the number of cache flushes a drive can handle per second, as this determines my single-thread performance. Has anyone an idea what numbers I can expect from an Intel X25-E or an OCZ Vertex 2?

I don't know about the OCZ Vertex 2, but the Intel X25-E roughly halves its IOPS number when you disable its write cache (IIRC, it was in the range of 1300-1600 writes/s or so). Since it ignores the Cache Flush command and it doesn't have any persistent buffer storage, disabling the write cache is the best you can do. Note that there were reports of the Intel X25-E losing a write even though you had the write cache disabled! Since they still haven't fixed this after more than a year on the market, I believe it rather qualifies for the hardly-usable-toy class. I am very disappointed; I had hopes for a new class of cheap but usable flash drives. Maybe some day... /ragge ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] scsi messages and mpt warning in log - harmless, or indicating a problem?
Willard Korfhage wrote:
This afternoon, messages like the following started appearing in /var/adm/messages:

May 18 13:46:37 fs8 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,2...@1/pci15d9,a...@0 (mpt0):
May 18 13:46:37 fs8 Log info 0x3108 received for target 5.
May 18 13:46:37 fs8 scsi_status=0x0, ioc_status=0x804b, scsi_state=0x1
May 18 13:46:38 fs8 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,2...@1/pci15d9,a...@0 (mpt0):
May 18 13:46:38 fs8 Log info 0x3108 received for target 5.
May 18 13:46:38 fs8 scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0
May 18 13:46:40 fs8 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,2...@1/pci15d9,a...@0 (mpt0):
May 18 13:46:40 fs8 Log info 0x3108 received for target 5.
May 18 13:46:40 fs8 scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0
...

So, is my system in trouble or not? Particulars of my system:
% uname -a
SunOS fs8 5.11 snv_134 i86pc i386 i86pc

Welcome to the mpt driver / firmware / something bug! I forget if your symptoms were indicative of the card not liking the drives (Hitachis in particular, which I fixed by upgrading to larger Seagates) or an issue with MSI support (which I fixed by adding set xpv_psm:xen_support_msi = -1 to /etc/system, but I was running a Xen-enabled kernel). I suggest searching the list archives. -- Carson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
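For anyone chasing the MSI variant of this, a minimal sketch of the workaround Carson mentions (only relevant on a Xen-enabled kernel; the tunable name is taken from his message, and a reboot is required before it takes effect):

  # add the tunable to /etc/system and confirm it is there
  echo 'set xpv_psm:xen_support_msi = -1' >> /etc/system
  grep xen_support_msi /etc/system
  # reboot for the change to take effect
  init 6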
Re: [zfs-discuss] Very serious performance degradation
mm.. Service times of sd3..5 are waay too high to be good working disks. 21 writes shouldn't take 1.3 seconds. Some of your disks are not feeling well, possibly doing block reallocation like mad all the time, or block recovery of some form. Service times should be closer to what sd1 and 2 are doing. sd2,3,4 seem to be getting about the same amount of read+write, but their service time is 15-20 times higher. This will lead to crap performance (and probably a broken array in a while). /Tomas

Hi! It is strange, because I've checked the SMART data of the 4 disks, and everything seems really OK! (on another hardware/controller, because I needed Windows to check it). Maybe it's a problem with the SAS/SATA controller?! One question: if I halt the server and change the order of the disks on the SATA array, will RAIDZ still detect the array fine? The idea is to check whether the results (big service times) depend on the drives' position, or on the hard drives themselves! Thank you! -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Very serious performance degradation
How full is your filesystem? Give us the output of zfs list. You might be having a hardware problem, or maybe it's extremely full.

Hi Edward,
The _db filesystems have a recordsize of 16K (the others have the default 128K):

NAME              USED   AVAIL  REFER  MOUNTPOINT
zfs_raid          1,02T  1,65T  28,4K  /zfs_raid
zfs_raid/fs1_db   8,89G  1,65T  7,73G  /home/fs1_db
zfs_raid/fs2      2,68G  1,65T  1,73G  /home/fs2
zfs_raid/fs3      3,38G  1,65T  3,12G  /home/fs3
zfs_raid/fs4      10,1G  1,65T  10,0G  /home/fs4
zfs_raid/fs5       517G  1,65T   326G  /home/fs5
zfs_raid/fs6_db   35,1G  1,65T  28,0G  /home/fs6_db
zfs_raid/fs7      9,22G  1,65T  7,67G  /home/fs7
zfs_raid/fs8_db   22,7G  1,65T  21,6G  /home/fs8_db
zfs_raid/fs9       179G  1,65T   108G  /home/fs9
zfs_raid/fs10      115G  1,65T  97,0G  /home/fs10
zfs_raid/fs11_db  28,6G  1,65T  17,3G  /home/fs11_db
zfs_raid/fs12     17,1G  1,65T  4,70G  /home/fs12
zfs_raid/fs13     9,66G  1,65T  6,77G  /home/fs13
zfs_raid/fs14     4,13G  1,65T  3,12G  /home/fs14
zfs_raid/fs15     15,2G  1,65T  9,48G  /home/fs15
zfs_raid/fs16     14,7G  1,65T  6,59G  /home/fs16
zfs_raid/fs17     7,49G  1,65T  5,31G  /home/fs17
zfs_raid/fs18     41,0G  1,65T  21,6G  /home/fs18

Also, if you have dedup enabled, on a 3TB filesystem, you surely want more RAM. I don't know if there's any rule of thumb you could follow, but offhand I'd say 16G or 32G. Numbers based on the vapor passing around the room I'm in right now.

It seems that the dedup property doesn't exist on my system! Are you sure this capability is supported by the version of ZFS included in OpenSolaris? Thank you! Philippe -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
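As an aside, a quick way to check whether the installed bits know about dedup at all (a sketch; dedup needs a pool version of roughly 21 or later, so older snv builds simply won't expose the property):

  zpool get version zfs_raid
  zpool upgrade -v | grep -i dedup
  zfs get dedup zfs_raid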
Re: [zfs-discuss] Very serious performance degradation
On 05/19/10 09:34 PM, Philippe wrote:
Hi! It is strange, because I've checked the SMART data of the 4 disks, and everything seems really OK! (on another hardware/controller, because I needed Windows to check it). Maybe it's a problem with the SAS/SATA controller?! One question: if I halt the server and change the order of the disks on the SATA array, will RAIDZ still detect the array fine?

Yes, it will. -- Ian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
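A sketch of that test, using the pool name from the earlier zfs list output (the export/import lets ZFS re-read its labels from whatever slots the disks end up in):

  zpool export zfs_raid
  # power off, physically swap the drives around, power back on
  zpool import zfs_raid
  zpool status zfs_raid
  # re-check service times and see whether they follow the drives or stay with the slots
  iostat -xen 30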
Re: [zfs-discuss] Very serious performance degradation
it looks like your 'sd5' disk is performing horribly bad and except for the horrible performance of 'sd5' (which bottlenecks the I/O), 'sd4' would look just as bad. Regardless, the first step would be to investigate 'sd5'.

Hi Bob!
I've already tried the pool without the sd5 disk (so pool in degraded mode), but the performance was still the same... So the sd5 disk itself is not the (only) bottleneck...

Use 'iostat -xen' to obtain more information, including the number of reported errors.

iostat -xen
                  extended device statistics              ---- errors ----
  r/s   w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
  0.0   0.0     0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   9   0   9 c8t0d0
  0.3   0.4    14.1    3.2  0.0  0.0    0.0   17.9   0   0   0   0   0   0 c7t0d0
 65.5   6.6  1234.3   97.6  0.0  1.0    0.0   14.5   0  14   0   0   0   0 c7t2d0
 70.6   6.1  1229.2   97.6  0.0  1.3    0.0   16.3   0  16   0   0   0   0 c7t3d0
 94.0   6.7  2349.2   97.0  0.0  3.6    0.0   36.1   0  23   0   0   0   0 c7t4d0
 80.4  12.1  2306.5   91.3  0.0 16.6    0.0  179.7   0  68   0   0   0   0 c7t5d0

Thanks! -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
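One caveat on reading those numbers: a single iostat invocation prints averages since boot. To watch the live behaviour, sample over intervals instead, e.g. (interval and count are arbitrary):

  # first report is the since-boot average; each later one covers a 30-second window
  iostat -xen 30 10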
[zfs-discuss] inodes in snapshots
If I create a file in a file system, then snapshot the file system, then delete the file: is it guaranteed that, while the snapshot exists, no new file will be created with the same inode number as the deleted file? --chris -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] inodes in snapshots
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- If I create a file in a file system and then snapshot the file system. Then delete the file. Is it guaranteed that while the snapshot exists no new file will be created with the same inode number as the deleted file? I believe the answer to your question is No. Meaning, Yes an inode number could be recycled, in the present filesystem, although that inode number exists for some other object in a snapshot gone by. AFAIK. Informed guesses aside, here's what I really have to say: You must have a special case, if you care about this. Because a snapshot is treated as a different device, it's allowed for a new inode to be created in the present filesystem, having the same inode number. Generally speaking, there's no reason you should care if an inode number gets recycled. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS memory recommendations
I am currently doing research on how much memory ZFS should have for a storage server. I came across this blog http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance It recommends that for every TB of storage you have you want 1GB of RAM just for the metadata. Is this really the case that ZFS metadata consumes so much RAM? I'm currently building a storage server which will eventually hold up to 20TB of storage, I can't fit in 20GB of RAM on the motherboard! -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS
My work has bought a bunch of IBM servers recently as ESX hosts. They all come with LSI SAS1068E controllers as standard, which we remove and upgrade to a raid 5 controller. So I had a bunch of them lying around. We've bought a 16x SAS hotswap case and I've put in an AMD X4 955 BE with an ASUS M4A89GTD Pro as the mobo. In the two 16x PCI-E slots I've put in the 1068E controllers I had lying around. Everything is still being put together and I still haven't even installed opensolaris yet but I'll see if I can get you some numbers on the controllers when I am done. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New SSD options
Well- 40k IOPS is the current claim from ZEUS- and they're the benchmark. They used to be 17k IOPS. How real any of these numbers are from any manufacturer is a guess. Given Intel's refusal to honor a cache flush, and their performance problems with the cache disabled- I don't trust them any more than anyone else right now. As for the Vertex drives- if they are within +/-10% of the Intel they're still doing it for half of what the Intel drive costs- so it's an option- not a great option- but still an option. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] inodes in snapshots
The reason for wanting to know is to try to find versions of a file. If a file is renamed, then the only way to know that the renamed file was the same as a file in a snapshot would be if the inode numbers matched. However, for that to be reliable it would require that the inodes are not reused. If they are able to be reused, then when an inode number matches I would also have to compare the real creation time, which requires looking at the extended attributes. --chris -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New SSD options
As for the Vertex drives- if they are within +-10% of the Intel they're still doing it for half of what the Intel drive costs- so it's an option- not a great option- but still an option. Yes, but Intel is SLC. Much more endurance. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in campus clusters
On Tue, May 18, 2010 20:45, Edward Ned Harvey wrote: The whole point of a log device is to accelerate sync writes, by providing nonvolatile storage which is faster than the primary storage. You're not going to get this if any part of the log device is at the other side of a WAN. So either add a mirror of log devices locally and not across the WAN, or don't do it at all. A good example of using distant iSCSI with close-by SSDs: http://blogs.sun.com/jkshah/entry/zfs_with_cloud_storage_and ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Review: SuperMicro’s SC847 (SC847A) 4U chassis with 36 drive bays
http://www.natecarlson.com/2010/05/07/review-supermicros-sc847a-4u-chassis-with-36-drive-bays/
Review: SuperMicro’s SC847 (SC847A) 4U chassis with 36 drive bays
May 7, 2010

[Or my quest for the ultimate home-brew storage array.] At my day job, we use a variety of storage solutions based on the type of data we’re hosting. Over the last year, we have started to deploy SuperMicro-based hardware with OpenSolaris and ZFS for storage of some classes of data. The systems we have built previously have not had any strict performance requirements, and were built with SuperMicro’s SC846E2 chassis, which supports 24 total SAS/SATA drives, with an integrated port multiplier in the backplane to support multipath to SAS drives. We’re building out a new system that we hope to be able to promote to tier-1 for some “less critical data”, so we wanted better drive density and more performance. We landed on the relatively new SuperMicro SC847 chassis, which supports 36 total 3.5″ drives (24 front and 12 rear) in a 4U enclosure. While researching this product, I didn’t find many reviews and detailed pictures of the chassis, so figured I’d take some pictures while building the system and post them for the benefit of anyone else interested in such a solution. In the systems we’ve built so far, we’ve only deployed SATA drives since OpenSolaris can still get us decent performance with SSD for read and write cache. This means that in the 4U cases we’ve used with integrated port multipliers, we have only used one of the two SFF-8087 connectors on the backplane; this works fine, but limits the total throughput of all drives in the system to 4 3gbit/s channels (on this chassis, 6 drives would be on each 3gbit channel.) On our most recent build, we built it with the intention of using it both for “nearline”-class storage, and as a test platform to see if we can get the performance we need to store VM images. As part of this decision, we decided to go with a backplane that supports full throughput to each drive. [...] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New SSD options
On Wed, May 19, 2010 02:09, thomas wrote: Is it even possible to buy a zeus iops anywhere? I haven't been able to find one. I get the impression they mostly sell to other vendors like sun? I'd be curious what the price is on a 9GB zeus iops is these days? Correct, their Zeus products are only available to OEMs. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in campus clusters
On Tue, 18 May 2010, Edward Ned Harvey wrote: Either I'm crazy, or I completely miss what you're asking. You want to have one side of a mirror attached locally, and the other side of the mirror attached ... via iscsi or something ... across the WAN? Even if you have a really fast WAN (1Gb or so) your performance is going to be terrible, and I would be very concerned about reliability. What happens if a switch reboots or crashes? Then suddenly half of the mirror isn't available anymore (redundancy is degraded on all pairs) and ... Will it be a degraded mirror? Or will the system just hang, waiting for iscsi IO to timeout? When it comes back online, will it intelligently resilver only the parts which have changed since? Since the mirror is now broken, and local operations can happen faster than the WAN can carry them across, will the resilver ever complete, ever? I don't know. This has been accomplished successfully before. There used to be a fellow posting here (from New Zealand I think) who used distributed storage just like that. If the WAN goes away, then zfs writes will likely hang for the iSCSI timeout period (likely 3 minutes) and then continue normally once iSCSI/zfs decides that the mirror device is not available. When the WAN returns, then zfs will send only the missing updates. The whole point of a log device is to accelerate sync writes, by providing nonvolatile storage which is faster than the primary storage. You're not going to get this if any part of the log device is at the other side of a WAN. So either add a mirror of log devices locally and not across the WAN, or don't do it at all. This depends on the nature of the WAN. The WAN latency may still be relatively low as compared with drive latency. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in campus clusters
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of John Hoogerdijk
I'm building a campus cluster with identical storage in two locations with ZFS mirrors spanning both storage frames. Data will be mirrored using zfs. I'm looking for the best way to add log devices to this campus cluster.

Either I'm crazy, or I completely miss what you're asking. You want to have one side of a mirror attached locally, and the other side of the mirror attached ... via iscsi or something ... across the WAN? Even if you have a really fast WAN (1Gb or so) your performance is going to be terrible, and I would be very concerned about reliability. What happens if a switch reboots or crashes? Then suddenly half of the mirror isn't available anymore (redundancy is degraded on all pairs) and ... Will it be a degraded mirror? Or will the system just hang, waiting for iscsi IO to time out? When it comes back online, will it intelligently resilver only the parts which have changed since? Since the mirror is now broken, and local operations can happen faster than the WAN can carry them across, will the resilver ever complete, ever? I don't know. Anyway, it just doesn't sound like a good idea to me. It sounds like something that was meant for a clustering filesystem of some kind, not particularly for ZFS. If you are adding log devices to this, I have a couple of things to say: The whole point of a log device is to accelerate sync writes, by providing nonvolatile storage which is faster than the primary storage. You're not going to get this if any part of the log device is at the other side of a WAN. So either add a mirror of log devices locally and not across the WAN, or don't do it at all.

I am considering building a separate mirrored zpool of Flash disks that span the frames, then creating zvols to use as log devices for the data zpool. Will this work? Any other suggestions?

This also sounds nonsensical to me. If your primary pool devices are Flash, then there's no point to add separate log devices. Unless you have another type of even faster nonvolatile storage.

Both frames are FC connected with Flash devices in the frame. Latencies are additive, so there is benefit to a logging device. The cluster is a standard HA cluster about 10km apart with identical storage in both locations, mirrored using ZFS. Think about the potential problems if I don't mirror the log devices across the WAN. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in campus clusters
On Tue, May 18, 2010 20:45, Edward Ned Harvey wrote: The whole point of a log device is to accelerate sync writes, by providing nonvolatile storage which is faster than the primary storage. You're not going to get this if any part of the log device is at the other side of a WAN. So either add a mirror of log devices locally and not across the WAN, or don't do it at all. A good example of using distant iSCSI with close-by SSDs: http://blogs.sun.com/jkshah/entry/zfs_with_cloud_storage_and

Good stuff, but doesn't address HA clusters and consistent storage. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in campus clusters
comment below... On May 19, 2010, at 7:50 AM, John Hoogerdijk wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of John Hoogerdijk I'm building a campus cluster with identical storage in two locations with ZFS mirrors spanning both storage frames. Data will be mirrored using zfs. I'm looking for the best way to add log devices to this campus cluster. Either I'm crazy, or I completely miss what you're asking. You want to have one side of a mirror attached locally, and the other side of the mirror attached ... via iscsi or something ... across the WAN? Even if you have a really fast WAN (1Gb or so) your performance is going to be terrible, and I would be very concerned about reliability. What happens if a switch reboots or crashes? Then suddenly half of the mirror isn't available anymore (redundancy is degraded on all pairs) and ... Will it be a degraded mirror? Or will the system just hang, waiting for iscsi IO to timeout? When it comes back online, will it intelligently resilver only the parts which have changed since? Since the mirror is now broken, and local operations can happen faster than the WAN can carry them across, will the resilver ever complete, ever? I don't know. anyway, it just doesn't sound like a good idea to me. It sounds like omething that was meant for a clustering filesystem of some kind, not particularly for ZFS. If you are adding log devices to this, I have a couple of things to say: The whole point of a log device is to accelerate sync writes, by providing nonvolatile storage which is faster than the primary storage. You're not going to get this if any part of the log device is at the other side of a WAN. So either add a mirror of log devices locally and not across the WAN, or don't do it at all. I am considering building a separate mirrored zpool of Flash disk that span the frames, then creating zvols to use as log devices for the data zpool. Will this work? Any other suggestions? This also sounds nonsensical to me. If your primary pool devices are Flash, then there's no point to add separate log devices. Unless you have another ype of even faster nonvolatile storage. Both frames are FC connected with Flash devices in the frame. Latencies are additive, so there is benefit to a logging device. The cluster is a standard HA cluster about 10km apart with identical storage in both locations, mirrored using ZFS. There are quite a few metro clusters in the world today. Many use traditional mirroring software. Some use array-based sync replication. A ZFS-based solution works and behaves similarly. Think about the potential problems if I don't mirror the log devices across the WAN. If you use log devices, mirror them. -- richard -- Richard Elling rich...@nexenta.com +1-760-896-4422 ZFS and NexentaStor training, Rotterdam, July 13-15, 2010 http://nexenta-rotterdam.eventbrite.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
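For the archives, a sketch of what a mirrored data vdev plus a mirrored log vdev spanning the two frames looks like (device names are placeholders, one device from each frame per mirror):

  zpool create tank mirror c1t0d0 c2t0d0 log mirror c1t10d0 c2t10d0
  # or, for an existing pool:
  zpool add tank log mirror c1t10d0 c2t10d0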
Re: [zfs-discuss] ZFS memory recommendations
On Wed, 19 May 2010, Deon Cui wrote: http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance It recommends that for every TB of storage you have you want 1GB of RAM just for the metadata. Interesting conclusion. Is this really the case that ZFS metadata consumes so much RAM? I'm currently building a storage server which will eventually hold up to 20TB of storage, I can't fit in 20GB of RAM on the motherboard! Unless you do something like enable dedup (which is still risky to use), then there is no rule of thumb that I know of. ZFS will take advantage of available RAM. You should have at least 1GB of RAM available for ZFS to use. Beyond that, it depends entirely on the size of your expected working set. The size of accessed files, the randomness of the access, the number of simultaneous accesses, and the maximum number of files per directory all make a difference to how much RAM you should have for good performance. If you have 200TB of stored data, but only actually access 2GB of it at any one time, then the caching requirements are not very high. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs mount -a kernel panic
Running ZFS on a Nexenta box, I had a mirror get broken and apparently the metadata is corrupt now. If I try and mount vol2 it works, but if I try 'mount -a' or 'mount vol2/vm2' it instantly kernel panics and reboots. Is it possible to recover from this? I don't care if I lose the file listed below, but the other data in the volume would be really nice to get back. I have scrubbed the volume to no avail. Any other thoughts?

zpool status -xv vol2
  pool: vol2
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        vol2        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
        vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk

-- John ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs mount -a kernel panic
Do you have a coredump? Or a stack trace of the panic? On Wed, 19 May 2010, John Andrunas wrote: Running ZFS on a Nexenta box, I had a mirror get broken and apparently the metadata is corrupt now. If I try and mount vol2 it works but if I try and mount -a or mount vol2/vm2 is instantly kernel panics and reboots. Is it possible to recover from this? I don't care if I lose the file listed below, but the other data in the volume would be really nice to get back. I have scrubbed the volume to no avail. Any other thoughts. zpool status -xv vol2 pool: vol2 state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub: none requested config: NAMESTATE READ WRITE CKSUM vol2ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c3t3d0 ONLINE 0 0 0 c3t2d0 ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk -- John ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Regards, markm ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs mount -a kernel panic
Not to my knowledge, how would I go about getting one? (CC'ing discuss) On Wed, May 19, 2010 at 8:46 AM, Mark J Musante mark.musa...@oracle.com wrote: Do you have a coredump? Or a stack trace of the panic? On Wed, 19 May 2010, John Andrunas wrote: Running ZFS on a Nexenta box, I had a mirror get broken and apparently the metadata is corrupt now. If I try and mount vol2 it works but if I try and mount -a or mount vol2/vm2 is instantly kernel panics and reboots. Is it possible to recover from this? I don't care if I lose the file listed below, but the other data in the volume would be really nice to get back. I have scrubbed the volume to no avail. Any other thoughts. zpool status -xv vol2 pool: vol2 state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub: none requested config: NAME STATE READ WRITE CKSUM vol2 ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c3t3d0 ONLINE 0 0 0 c3t2d0 ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk -- John ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Regards, markm -- John ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs mount -a kernel panic
On 19.05.10 17:53, John Andrunas wrote: Not to my knowledge, how would I go about getting one? (CC'ing discuss) man savecore and dumpadm. Michael On Wed, May 19, 2010 at 8:46 AM, Mark J Musantemark.musa...@oracle.com wrote: Do you have a coredump? Or a stack trace of the panic? On Wed, 19 May 2010, John Andrunas wrote: Running ZFS on a Nexenta box, I had a mirror get broken and apparently the metadata is corrupt now. If I try and mount vol2 it works but if I try and mount -a or mount vol2/vm2 is instantly kernel panics and reboots. Is it possible to recover from this? I don't care if I lose the file listed below, but the other data in the volume would be really nice to get back. I have scrubbed the volume to no avail. Any other thoughts. zpool status -xv vol2 pool: vol2 state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub: none requested config: NAMESTATE READ WRITE CKSUM vol2ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c3t3d0 ONLINE 0 0 0 c3t2d0 ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk -- John ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Regards, markm -- michael.schus...@oracle.com http://blogs.sun.com/recursion Recursion, n.: see 'Recursion' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
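A minimal sketch of that setup (the dump zvol path below is the OpenSolaris default and may differ on a Nexenta box; after the next panic, savecore writes the dump into the configured crash directory at boot):

  # show the current dump configuration
  dumpadm
  # set the dump device and the savecore directory, and enable savecore on reboot
  dumpadm -d /dev/zvol/dsk/rpool/dump
  dumpadm -s /var/crash
  dumpadm -y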
Re: [zfs-discuss] inodes in snapshots
On Wed, May 19, 2010 at 05:33:05AM -0700, Chris Gerhard wrote: The reason for wanting to know is to try and find versions of a file. No, there's no such guarantee. The same inode and generation number pair is extremely unlikely to be re-used, but the inode number itself is likely to be re-used. If a file is renamed then the only way to know that the renamed file was the same as a file in a snapshot would be if the inode numbers matched. However for that to be reliable it would require the i-nodes are not reused. There's also the crtime (creation time, not to be confused with ctime), which you can get with ls(1). If they are able to be reused then when an inode number matches I would also have to compare the real creation time which requires looking at the extended attributes. Right, that's what you'll have to do. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
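A sketch of the comparison (paths are made up; the snapshot side is reachable through the hidden .zfs directory, and a matching inode number is only a hint until the creation times are also compared):

  # inode number of the live, possibly renamed, file
  ls -i /tank/home/report-new.txt
  # inode number of the candidate in an old snapshot
  ls -i /tank/home/.zfs/snapshot/snap-2010-05-01/report.txt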
Re: [zfs-discuss] ZFS in campus clusters
On Wed, May 19, 2010 at 07:50:13AM -0700, John Hoogerdijk wrote: Think about the potential problems if I don't mirror the log devices across the WAN.

If you don't mirror the log devices, then your disaster recovery semantics will be that you'll miss any transactions that hadn't been committed to disk yet at the time of the disaster. Which means that the log devices' effect is purely local: for recovery from local power failures (not extending to local disasters) and for acceleration. This may or may not be acceptable to you. If not, then mirror the log devices. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS memory recommendations
- Deon Cui deon@gmail.com wrote: I am currently doing research on how much memory ZFS should have for a storage server. I came across this blog http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance It recommends that for every TB of storage you have you want 1GB of RAM just for the metadata.

That's for dedup: 150 bytes per block, meaning approx 1GB per 1TB if all (or most) blocks are 128kB, and way more memory (or L2ARC) if you have small files.

Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for every pedagogue to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
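For the 128kB-block case the arithmetic works out roughly as: 1TB / 128kB per block is about 8.4 million blocks, and 8.4 million blocks x 150 bytes is about 1.2GB of dedup-table entries per TB, hence the ~1GB-per-TB rule of thumb. With 8kB blocks the same TB needs about 16 times that.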
Re: [zfs-discuss] ZFS memory recommendations
Bob Friesenhahn wrote: On Wed, 19 May 2010, Deon Cui wrote: http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance It recommends that for every TB of storage you have you want 1GB of RAM just for the metadata. Interesting conclusion. Is this really the case that ZFS metadata consumes so much RAM? I'm currently building a storage server which will eventually hold up to 20TB of storage, I can't fit in 20GB of RAM on the motherboard! Unless you do something like enable dedup (which is still risky to use), then there is no rule of thumb that I know of. ZFS will take advantage of available RAM. You should have at least 1GB of RAM available for ZFS to use. Beyond that, it depends entirely on the size of your expected working set. The size of accessed files, the randomness of the access, the number of simultaneous accesses, and the maximum number of files per directory all make a difference to how much RAM you should have for good performance. If you have 200TB of stored data, but only actually access 2GB of it at any one time, then the caching requirements are not very high. Bob

I'd second Bob's notes here - for non-dedup purposes, you need, at a very bare minimum, 512MB of RAM just for ZFS (Bob's recommendation of 1GB is much better; I'm quoting a real basement level beyond which you're effectively crippling ZFS). The primary determinant of RAM consumption for pools without dedup is the size of your active working set (as Bob mentioned). It's unrealistic to expect to cache /all/ metadata for every file for large pools, and I can't really see the worth in it anyhow (you end up with very infrequently-used metadata sitting in RAM, which gets evicted for use by other things in most cases). Storing any more metadata than what you need for your working set isn't going to bring much performance bonus. What you need to have is sufficient RAM to cache your async writes (remember, this amount is relatively small in most cases - it's 3 pending transactions per pool), plus enough RAM to hold all the files (plus metadata) you expect to use (i.e. read more than once or write to) within about 5 minutes. Here are three examples to show the differences (all without dedup):

(1) A 100TB system which contains scientific data used in a data-mining app. The system will need to frequently access very large amounts of the available data, but seldom writes much. As it is doing data-mining, a specific piece of data is read seldom, though the system needs to read large aggregate amounts continuously. In this case, you're pretty much out of luck for caching. You'll need enough RAM to cache your maximum write size, and a little bit for read-ahead, but since you're accessing the pool almost at random for large amounts of data which aren't re-used, caching isn't going to help really at all. In this case, 1-2GB of RAM is likely all that really can be used.

(2) 1TB of data is being used for a Virtual Machine disk server. That is, the machine exports iSCSI (or FCoE, or NFS, or whatever) volumes for use on client hardware to run a VM. Typically in this case, there are lots of effectively random read requests coming in for a bunch of hot files (which tend to be OS files in the VM-hosted OSes). There are also fairly frequent write requests. However, the VMs will do a fair amount of read-caching of their own, so the amount of read requests is lower than one would think. 
For performance and administrative reasons, it is likely that you will want multiple pools, rather than a single large pool. In this case, you need a reasonable amount of write cache for *each* pool, plus enough RAM to cache all of the OS files very often used for ALL the VMs. In this case, dedup would actually really help RAM consumption, since it is highly likely that frequently-accessed files from multiple VMs are in fact identical, and thus with dedup, you'd only need to store one copy in the cache. In any case, here you'd need a few GB for the write caching, plus likely a dozen or more GB for read caching, as your working set is moderately large, and frequently re-used.

(3) 100TB of data for NFS home directory serving. The access pattern here is likely highly random, with only small amounts of re-used data. However, you'll often have non-trivial write sizes. Having a ZIL is probably a good idea, but in any case, you'll want a couple of GB (call it 3-4) for write caching per pool, and then several dozen MB per active user as read cache. That is, in this case, it's likely that your determining factor is not total data size, but the number of simultaneous users, since the latter will dictate your frequency of file access.

I'd say all of the recommendations/insights on the referenced link are good, except for #1. The base amount of RAM is highly variable based on the factors discussed above, and the blanket assumption of 1GB of RAM per TB of storage doesn't hold in general.
Re: [zfs-discuss] zfs mount -a kernel panic
Hmmm... no coredump even though I configured it. Here is the trace, though; I will see what I can do about the coredump.

r...@cluster:/export/home/admin# zfs mount vol2/vm2

panic[cpu3]/thread=ff001f45ec60: BAD TRAP: type=e (#pf Page fault) rp=ff001f45e950 addr=30 occurred in module zfs due to a NULL pointer dereference

zpool-vol2: #pf Page fault
Bad kernel fault at addr=0x30
pid=1469, pc=0xf795d054, sp=0xff001f45ea48, eflags=0x10296
cr0: 8005003b pg,wp,ne,et,ts,mp,pe  cr4: 6f8 xmme,fxsr,pge,mce,pae,pse,de
cr2: 30  cr3: 500  cr8: c
rdi: 0                rsi: ff05208b2388  rdx: ff001f45e888
rcx: 0                 r8: 3000900ff      r9: 198f5ff6
rax: 0                rbx: 200           rbp: ff001f45ea50
r10: c0130803         r11: ff001f45ec60  r12: ff05208b2388
r13: ff0521fc4000     r14: ff050c0167e0  r15: ff050c0167e8
fsb: 0                gsb: ff04eb9b8080   ds: 4b
 es: 4b                fs: 0              gs: 1c3
trp: e                err: 2             rip: f795d054
 cs: 30               rfl: 10296         rsp: ff001f45ea48
 ss: 38

ff001f45e830 unix:die+dd ()
ff001f45e940 unix:trap+177b ()
ff001f45e950 unix:cmntrap+e6 ()
ff001f45ea50 zfs:ddt_phys_decref+c ()
ff001f45ea80 zfs:zio_ddt_free+55 ()
ff001f45eab0 zfs:zio_execute+8d ()
ff001f45eb50 genunix:taskq_thread+248 ()
ff001f45eb60 unix:thread_start+8 ()

syncing file systems... done
skipping system dump - no dump device configured
rebooting...

On Wed, May 19, 2010 at 8:55 AM, Michael Schuster michael.schus...@oracle.com wrote: On 19.05.10 17:53, John Andrunas wrote: Not to my knowledge, how would I go about getting one? (CC'ing discuss) man savecore and dumpadm. Michael On Wed, May 19, 2010 at 8:46 AM, Mark J Musante mark.musa...@oracle.com wrote: Do you have a coredump? Or a stack trace of the panic? On Wed, 19 May 2010, John Andrunas wrote: Running ZFS on a Nexenta box, I had a mirror get broken and apparently the metadata is corrupt now. If I try and mount vol2 it works but if I try and mount -a or mount vol2/vm2 it instantly kernel panics and reboots. Is it possible to recover from this? I don't care if I lose the file listed below, but the other data in the volume would be really nice to get back. I have scrubbed the volume to no avail. Any other thoughts. zpool status -xv vol2 pool: vol2 state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub: none requested config: NAME STATE READ WRITE CKSUM vol2 ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c3t3d0 ONLINE 0 0 0 c3t2d0 ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk -- John ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Regards, markm -- michael.schus...@oracle.com http://blogs.sun.com/recursion Recursion, n.: see 'Recursion' -- John ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS memory recommendations
et == Erik Trimble erik.trim...@oracle.com writes: et frequently-accessed files from multiple VMs are in fact et identical, and thus with dedup, you'd only need to store one et copy in the cache. although counterintuitive I thought this wasn't part of the initial release. Maybe I'm wrong altogether or maybe it got added later? http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup#comment-1257191094000 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS memory recommendations
Miles Nordin wrote: et == Erik Trimble erik.trim...@oracle.com writes: et frequently-accessed files from multiple VMs are in fact et identical, and thus with dedup, you'd only need to store one et copy in the cache. although counterintuitive I thought this wasn't part of the initial release. Maybe I'm wrong altogether or maybe it got added later? http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup#comment-1257191094000

No, you're reading that blog right - dedup is on a per-pool basis. What I was talking about was inside a single pool. Without dedup enabled on a pool, if I have 2 VM images, both of which are say WinXP, then I'd have to cache identical files twice. With dedup, I'd only have to cache those blocks once, even if they were being accessed by both VMs. So, dedup is both harder on RAM (you need the DDT) and easier (it lowers the number of actual data blocks which have to be stored in cache). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs mount -a kernel panic
OK, I got a core dump, what do I do with it now? It is 1.2G in size. On Wed, May 19, 2010 at 10:54 AM, John Andrunas j...@andrunas.net wrote: Hmmm... no coredump even though I configured it. Here is the trace though I will see what I can do about the coredump r...@cluster:/export/home/admin# zfs mount vol2/vm2 panic[cpu3]/thread=ff001f45ec60: BAD TRAP: type=e (#pf Page fault) rp=ff001f45e950 addr=30 occurred in module zfs due to a NULL pointer deree zpool-vol2: #pf Page fault Bad kernel fault at addr=0x30 pid=1469, pc=0xf795d054, sp=0xff001f45ea48, eflags=0x10296 cr0: 8005003bpg,wp,ne,et,ts,mp,pe cr4: 6f8xmme,fxsr,pge,mce,pae,pse,de cr2: 30cr3: 500cr8: c rdi: 0 rsi: ff05208b2388 rdx: ff001f45e888 rcx: 0 r8: 3000900ff r9: 198f5ff6 rax: 0 rbx: 200 rbp: ff001f45ea50 r10: c0130803 r11: ff001f45ec60 r12: ff05208b2388 r13: ff0521fc4000 r14: ff050c0167e0 r15: ff050c0167e8 fsb: 0 gsb: ff04eb9b8080 ds: 4b es: 4b fs: 0 gs: 1c3 trp: e err: 2 rip: f795d054 cs: 30 rfl: 10296 rsp: ff001f45ea48 ss: 38 ff001f45e830 unix:die+dd () ff001f45e940 unix:trap+177b () ff001f45e950 unix:cmntrap+e6 () ff001f45ea50 zfs:ddt_phys_decref+c () ff001f45ea80 zfs:zio_ddt_free+55 () ff001f45eab0 zfs:zio_execute+8d () ff001f45eb50 genunix:taskq_thread+248 () ff001f45eb60 unix:thread_start+8 () syncing file systems... done skipping system dump - no dump device configured rebooting... On Wed, May 19, 2010 at 8:55 AM, Michael Schuster michael.schus...@oracle.com wrote: On 19.05.10 17:53, John Andrunas wrote: Not to my knowledge, how would I go about getting one? (CC'ing discuss) man savecore and dumpadm. Michael On Wed, May 19, 2010 at 8:46 AM, Mark J Musantemark.musa...@oracle.com wrote: Do you have a coredump? Or a stack trace of the panic? On Wed, 19 May 2010, John Andrunas wrote: Running ZFS on a Nexenta box, I had a mirror get broken and apparently the metadata is corrupt now. If I try and mount vol2 it works but if I try and mount -a or mount vol2/vm2 is instantly kernel panics and reboots. Is it possible to recover from this? I don't care if I lose the file listed below, but the other data in the volume would be really nice to get back. I have scrubbed the volume to no avail. Any other thoughts. zpool status -xv vol2 pool: vol2 state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub: none requested config: NAME STATE READ WRITE CKSUM vol2 ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c3t3d0 ONLINE 0 0 0 c3t2d0 ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk -- John ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Regards, markm -- michael.schus...@oracle.com http://blogs.sun.com/recursion Recursion, n.: see 'Recursion' -- John -- John ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zpool import On Fail Over Server Using Shared SAS zpool Storage But Not Shared cache SSD Devices
Hello and good day, I will have two OpenSolaris snv_134 storage servers both connected to a SAS chassis with SAS disks used to store zpool data. One storage server will be the active storage server and the other will be the passive fail over storage server. Both servers will be able to access the same disks in the SAS chassis. I am planning on having unique, non-shared, SSD cache directly connected to each storage node to allow for better performance when utilizing the cache SSDs. Would having unique, non-shared, SSD cache directly connected to each storage node's motherboard actually allow for better performance when utilizing the cache SSDs? Or would having the SSD cache devices be shared going through a shared controller yield just the same performance? If having unique, non-shared, SSD cache directly connected to each storage node's motherboard actually yields better performance how would a zpool import of a zpool utilizing the unique SSD cache devices work on the passive, fail over storage node when a fail over happened? Would OpenSolaris/ZFS use these directly connected cache devices automatically or would we have to add these cache devices into the zpool? If there was data on the cache devices on the active storage node and say the power went out and fail over occurred would the data on the cache devices be lost during a zpool import on the fail over node? Also, if you would like any other details about this storage environment to better provide myself and the list with insight to these questions please just ask! -- Thank you, Preston Connors Atlantic.Net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
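Not an authoritative answer, but a sketch of how the import side usually plays out with node-local cache devices (pool and device names are placeholders; L2ARC contents are disposable, so a missing cache device costs only warm-cache state, not data):

  # on the surviving node: import the shared pool; the dead node's local cache devices show up as unavailable
  zpool import tank
  zpool status tank
  # drop the unreachable cache devices and add this node's own SSDs
  zpool remove tank c9t0d0
  zpool add tank cache c10t0d0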
Re: [zfs-discuss] New SSD options
Well, the larger size of the Vertex, coupled with their smaller claimed write amplification, should result in sufficient service life for my needs. Their claimed MTBF also matches the Intel X25-E's. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New SSD options
Since it ignores the Cache Flush command and it doesn't have any persistent buffer storage, disabling the write cache is the best you can do.

This actually brings up another question I had: what is the risk, beyond a few seconds of lost writes, if I lose power, there is no capacitor, and the cache is not disabled? My ZFS system is shared storage for a large VMWare-based QA farm. If I lose power then a few seconds of writes are the least of my concerns. All of the QA tests will need to be restarted and all of the file systems will need to be checked. A few seconds of writes won't make any difference unless it has the potential to affect the integrity of the pool itself. Considering the performance trade-off, I'd happily give up a few seconds' worth of writes for significantly improved IOPS. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New SSD options
On May 19, 2010, at 2:29 PM, Don wrote: Since it ignores Cache Flush command and it doesn't have any persistant buffer storage, disabling the write cache is the best you can do. This actually brings up another question I had: What is the risk, beyond a few seconds of lost writes, if I lose power, there is no capacitor and the cache is not disabled? The data risk is a few moments of data loss. However, if the order of the uberblock updates is not preserved (which is why the caches are flushed) then recovery from a reboot may require manual intervention. The amount of manual intervention could be significant for builds prior to b128. My ZFS system is shared storage for a large VMWare based QA farm. If I lose power then a few seconds of writes are the least of my concerns. All of the QA tests will need to be restarted and all of the file systems will need to be checked. A few seconds of writes won't make any difference unless it has the potential to affect the integrity of the pool itself. Considering the performance trade-off, I'd happily give up a few seconds worth of writes for significantly improved IOPS. Space, dependability, performance: pick two :-) -- richard -- Richard Elling rich...@nexenta.com +1-760-896-4422 ZFS and NexentaStor training, Rotterdam, July 13-15, 2010 http://nexenta-rotterdam.eventbrite.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
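On the manual-intervention point: builds from roughly b128 onward have a pool recovery mode on import that rolls back to the last consistent transaction group (a sketch; -n is a dry run, -F performs the rollback and discards the last few seconds of writes):

  zpool import -F -n tank
  zpool import -F tank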
Re: [zfs-discuss] New SSD options
On Wed, May 19, 2010 at 02:29:24PM -0700, Don wrote: Since it ignores Cache Flush command and it doesn't have any persistant buffer storage, disabling the write cache is the best you can do. This actually brings up another question I had: What is the risk, beyond a few seconds of lost writes, if I lose power, there is no capacitor and the cache is not disabled? You can lose all writes from the last committed transaction (i.e., the one before the currently open transaction). (You also lose writes from the currently open transaction, but that's unavoidable in any system.) Nowadays the system will let you know at boot time that the last transaction was not committed properly and you'll have a chance to go back to the previous transaction. For me, getting much-better-than-disk performance out of an SSD with cache disabled is enough to make that SSD worthwhile, provided the price is right of course. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New SSD options
You can lose all writes from the last committed transaction (i.e., the one before the currently open transaction). And I don't think that bothers me. As long as the array itself doesn't go belly up- then a few seconds of lost transactions are largely irrelevant- all of the QA virtual machines are going to have to be rolled back to their initial states anyway. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs mount -a kernel panic
First, I suggest you open a bug at https://defect.opensolaris.org/bz and get a bug number. Then, name your core dump something like bug.bugnumber and upload it using the instructions here: http://supportfiles.sun.com/upload Update the bug once you've uploaded the core and supply the name of the core file. Lori On 05/19/10 12:40 PM, John Andrunas wrote: OK, I got a core dump, what do I do with it now? It is 1.2G in size. On Wed, May 19, 2010 at 10:54 AM, John Andrunasj...@andrunas.net wrote: Hmmm... no coredump even though I configured it. Here is the trace though I will see what I can do about the coredump r...@cluster:/export/home/admin# zfs mount vol2/vm2 panic[cpu3]/thread=ff001f45ec60: BAD TRAP: type=e (#pf Page fault) rp=ff001f45e950 addr=30 occurred in module zfs due to a NULL pointer deree zpool-vol2: #pf Page fault Bad kernel fault at addr=0x30 pid=1469, pc=0xf795d054, sp=0xff001f45ea48, eflags=0x10296 cr0: 8005003bpg,wp,ne,et,ts,mp,pe cr4: 6f8xmme,fxsr,pge,mce,pae,pse,de cr2: 30cr3: 500cr8: c rdi:0 rsi: ff05208b2388 rdx: ff001f45e888 rcx:0 r8:3000900ff r9: 198f5ff6 rax:0 rbx: 200 rbp: ff001f45ea50 r10: c0130803 r11: ff001f45ec60 r12: ff05208b2388 r13: ff0521fc4000 r14: ff050c0167e0 r15: ff050c0167e8 fsb:0 gsb: ff04eb9b8080 ds: 4b es: 4b fs:0 gs: 1c3 trp:e err:2 rip: f795d054 cs: 30 rfl:10296 rsp: ff001f45ea48 ss: 38 ff001f45e830 unix:die+dd () ff001f45e940 unix:trap+177b () ff001f45e950 unix:cmntrap+e6 () ff001f45ea50 zfs:ddt_phys_decref+c () ff001f45ea80 zfs:zio_ddt_free+55 () ff001f45eab0 zfs:zio_execute+8d () ff001f45eb50 genunix:taskq_thread+248 () ff001f45eb60 unix:thread_start+8 () syncing file systems... done skipping system dump - no dump device configured rebooting... On Wed, May 19, 2010 at 8:55 AM, Michael Schuster michael.schus...@oracle.com wrote: On 19.05.10 17:53, John Andrunas wrote: Not to my knowledge, how would I go about getting one? (CC'ing discuss) man savecore and dumpadm. Michael On Wed, May 19, 2010 at 8:46 AM, Mark J Musantemark.musa...@oracle.com wrote: Do you have a coredump? Or a stack trace of the panic? On Wed, 19 May 2010, John Andrunas wrote: Running ZFS on a Nexenta box, I had a mirror get broken and apparently the metadata is corrupt now. If I try and mount vol2 it works but if I try and mount -a or mount vol2/vm2 is instantly kernel panics and reboots. Is it possible to recover from this? I don't care if I lose the file listed below, but the other data in the volume would be really nice to get back. I have scrubbed the volume to no avail. Any other thoughts. zpool status -xv vol2 pool: vol2 state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub: none requested config: NAMESTATE READ WRITE CKSUM vol2ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c3t3d0 ONLINE 0 0 0 c3t2d0 ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk -- John ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Regards, markm -- michael.schus...@oracle.com http://blogs.sun.com/recursion Recursion, n.: see 'Recursion' -- John ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
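Before (or instead of) uploading, the dump can be inspected locally with mdb to confirm it shows the same ddt_phys_decref stack (unix.0/vmcore.0 are the default savecore file names; adjust the instance number and crash directory to match your system):

  cd /var/crash
  mdb unix.0 vmcore.0
  # inside mdb:
  ::status
  ::stack
  ::msgbuf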
Re: [zfs-discuss] New SSD options
You can lose all writes from the last committed transaction (i.e., the one before the currently open transaction). I'll pick one: performance :) Honestly, I wish I had a better grasp on the real-world performance of these drives. 50k IOPS is nice, and considering the incredible likelihood of data duplication in my environment, the SandForce controller seems like a win. That said, does anyone have a good set of real-world performance numbers for these drives that you can link to? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
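No numbers to offer, but if someone does get one of these, the spec-sheet IOPS can at least be sanity-checked in place: add the drive as a separate log device on a scratch pool, drive synchronous writes at it (an NFS client doing small writes, or a database load), and watch per-device service times. A sketch, with a hypothetical device name:

  # add the SSD as a separate log (slog) device on a scratch pool
  zpool add testpool log c6t0d0
  # while the sync-heavy workload runs, watch ops/sec and service times per device
  iostat -xzn 1
  # remove the log device again when done
  zpool remove testpool c6t0d0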
[zfs-discuss] vibrations and consumer drives
A recent post on StorageMojo has some interesting numbers on how vibrations can affect disks, especially consumer drives: http://storagemojo.com/2010/05/19/shock-vibe-and-awe/ He mentions a 2005 study that I wasn't aware of. In its conclusion it states: Based on the results of these measurements, it was determined that the effects of vibration can be observed and quantified. Furthermore, it demonstrates that [Consumer Storage (CS)] disk drives are more sensitive to the vibration from physically coupled adjacent disk drives [than Enterprise-class disk drives]. However, even though the CS drives are more sensitive to vibration, there was no evidence of data corruption when the vibration affected write operations. https://dtc.umn.edu/publications/reports/2005_08.pdf Another study reports a 20% decrease in I/O throughput, a 25% increase in completion time, and a 25% increase in energy consumption. Probably not a big deal for home use, but it can certainly add up if you've got lots of shelves. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] mpt hotswap procedure
I'm not having any luck hotswapping a drive attached to my Intel SASUC8I (LSI-based) controller. The commands which work for the AMD AHCI ports don't work for the LSI. Here's what cfgadm -a reports with all drives installed and operational:

  Ap_Id                Type       Receptacle   Occupant     Condition
  c4                   scsi-sas   connected    configured   unknown
  c4::dsk/c4t0d0       disk       connected    configured   unknown
  c4::dsk/c4t1d0       disk       connected    configured   unknown
  c4::dsk/c4t2d0       disk       connected    configured   unknown
  c4::dsk/c4t3d0       disk       connected    configured   unknown
  c4::dsk/c4t4d0       disk       connected    configured   unknown
  c4::dsk/c4t5d0       disk       connected    configured   unknown
  c4::dsk/c4t6d0       disk       connected    configured   unknown
  c4::dsk/c4t7d0       disk       connected    configured   unknown
  sata0/0::dsk/c5t0d0  disk       connected    configured   ok
  sata0/1::dsk/c5t1d0  disk       connected    configured   ok
  sata0/2::dsk/c5t2d0  disk       connected    configured   ok
  sata0/3::dsk/c5t3d0  disk       connected    configured   ok
  sata0/4::dsk/c5t4d0  disk       connected    configured   ok
  sata0/5::dsk/c5t5d0  disk       connected    configured   ok
  [irrelevant USB entries snipped]

Now, if I yank out a drive on one of the AHCI ports (let's use port 3 as an example), I can use:

  cfgadm -c connect sata0/3
  cfgadm -c configure sata0/3

and bring the new drive online. I have had no luck with the SASUC8I; even though I can see messages in the system log that a drive was inserted, the only way I've been able to actually use the drive afterwards has been via a reboot. A command like:

  cfgadm -c connect c4::dsk/c4t4d0

will be greeted with the message:

  cfgadm: Hardware specific failure: operation not supported for SCSI device

Is "cfgadm -c connect c4" sufficient, or is there some other incantation I'm missing? :) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
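For what it's worth, the sequence usually suggested for mpt-attached bays is unconfigure/configure on the device-level Ap_Id rather than connect; whether that is enough without a reboot seems to depend on firmware and driver revision, so treat this as a sketch (using c4t4d0 from the listing above as the example):

  # before pulling the old drive, remove it from the configuration
  cfgadm -c unconfigure c4::dsk/c4t4d0
  # physically swap the drive, then re-list and configure the new one
  cfgadm -al
  cfgadm -c configure c4::dsk/c4t4d0
  # if the new disk still doesn't get /dev links, rebuild and clean them up
  devfsadm -Cv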
Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS
Deon Cui deon.cui at gmail.com writes: So I had a bunch of them lying around. We've bought a 16x SAS hotswap case and I've put in an AMD X4 955 BE with an ASUS M4A89GTD Pro as the mobo. In the two 16x PCI-E slots I've put in the 1068E controllers I had lying around. Everything is still being put together and I still haven't even installed opensolaris yet but I'll see if I can get you some numbers on the controllers when I am done. This is a well-architected config with no bottlenecks on the PCIe links to the 890GX northbridge or on the HT link to the CPU. If you run 16 concurrent dd if=/dev/rdsk/c?t?d?p0 of=/dev/zero bs=1024k and assuming your drives can do ~100MB/s sustained reads at the beginning of the platter, you should literally see an aggregate throughput of ~1.6GB/s... -mrb ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
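A throwaway way to run that test, assuming the usual cXtYdZ device naming and otherwise idle disks (narrow the glob to just the disks on the controller under test):

  # start one sequential reader per disk
  for d in /dev/rdsk/c?t?d?p0; do
      dd if=$d of=/dev/null bs=1024k &
  done
  # watch per-disk and aggregate throughput from another shell
  iostat -xzn 5
  # stop the readers when done (note: this kills every dd on the box)
  pkill -x dd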