Re: [zfs-discuss] X4540 32GB SSD in x4500 as slog
Paul B. Henson wrote:
> On Wed, 13 May 2009, Richard Elling wrote:
>> I didn't find that exact part number, but I notice that manufacturing part 371-4196 32GB Solid State Drive, SATA Interface is showing up in a number of systems. IIRC, this would be an Intel X25-E.
> Hmm, the part number I provided was off an official quote from our authorized reseller; googling it comes up with one sun.com link: http://www.sun.com/executives/iforce/mysun/docs/Support2a_ReleaseContentInfo.html and a bunch of Japanese sites. List price was $1500; if it is actually an OEM'd Intel X25-E, that's quite a markup -- street price on that has dropped below $500. If it's not, it sure would be nice to see some specs.
>> Generally, Sun doesn't qualify new devices with EOLed systems.
> Understood, it just sucks to have bought a system on its deathbed without prior knowledge thereof.
Since it costs real $$ to do such things, given the current state of the economy, I don't think you'll find anyone in the computer business not trying to sell new product.
>> Today, you can remove a cache device, but not a log device. You can replace a log device.
> I guess if we ended up going this way, replacing the log device with a standard hard drive in case of support issues would be the only way to go. Does log device replacement also require the replacement device to be of equal or greater size?
Yes, standard mirror rules apply. This is why I try to make it known that you don't generally need much size for the log device. They are solving a latency problem, not a space or bandwidth problem.
> If I wanted to swap between a 32GB SSD and a 1TB SATA drive, I guess I would need to make a partition/slice on the TB drive of exactly the size of the SSD?
Yes, but note that an SMI label hangs onto the outdated notion of cylinders, and you can't make a slice except on cylinder boundaries.
>> Before you start down this path, you should take a look at the workload using zilstat, which will show you the kind of work the ZIL is doing.
>> If you don't see any ZIL activity, no need to worry about a separate log. http://www.richardelling.com/Home/scripts-and-programs-1/zilstat
> Would a dramatic increase in performance when disabling the ZIL also be sufficient evidence? Even with me as the only person using our test x4500, disabling the ZIL provides markedly better performance as originally described for certain use cases.
Yes. If the latency through the data path to write to the log was zero, then it would perform the same as disabling the ZIL.
>> Usually, the log device does not need to be very big. A good strategy would be to create a small partition or slice, say 1 GByte, on an idle disk.
> If the log device was too small, you potentially could end up bottlenecked waiting for transactions to be committed to free up log device blocks?
zilstat can give you an idea of how much data is being written to the log, so you can make that decision. Of course you can always grow the log, or add another. But I think you will find that if a txg commits in 30 seconds or less (less as it becomes more busy), then the amount of data sent to the log will be substantially less than 1 GByte per txg commit. Once the txg commits, then the log space is freed.
>> Intel claims > 3,300 4kByte random write iops.
> Is that before or after the device gets full and starts needing to erase whole pages to write new blocks 8-/?
Buy two; if you add two log devices, then the data is striped across them (add != attach).
>> My rule of thumb is to have a hot spare. Having lots of hot spares only makes a big difference for sites where you cannot service the systems within a few days, such as remote locations.
> Eh, they're just downstairs, and we have 7x24 gold on them. Plus I have 5, each with 2 hot spares. I wouldn't have an issue trading a hot spare for a log device other than potential issues with the log device failing if not mirrored.
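The sizing argument above (a txg commits every 30 seconds or less, and the log only has to hold what accumulates between commits) can be sketched as a back-of-envelope calculation. The throughput figure below is a hypothetical placeholder, not a measured number; use zilstat to measure your own workload:

```shell
# Worst-case slog usage = peak sync-write throughput * txg commit interval.
# Both numbers are illustrative assumptions, not measurements.
MAX_SYNC_MBPS=100   # hypothetical peak synchronous-write rate, MB/s
TXG_SECONDS=30      # worst-case interval between txg commits
SLOG_MB=$((MAX_SYNC_MBPS * TXG_SECONDS))
echo "worst-case slog usage: ${SLOG_MB} MB"
```

Even this deliberately pessimistic bound stays in the low gigabytes, and on a busy pool the txg interval shrinks well below 30 seconds, which is why a small slice is usually plenty.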
>> Yes, and this is what would happen in the case where the log device completely failed while the pool was operational -- the ZIL will revert to using the main pool.
> But would it then go belly up if the system ever rebooted? You said currently you cannot remove a log device; if the pool reverts to an embedded log upon slog failure, and continues to work after a reboot, you've effectively removed the slog, other than I guess it might keep complaining and showing a dead slog device.
In that case, the pool knows the log device is failed.
>> This is the case where the log device fails completely while the pool is not operational. Upon import, the pool will look for an operational log device and will not find it. This means that any committed transactions that would have been in the log device are not recoverable *and* the pool won't know the extent of this missing information.
> So is there simply no recovery available for such a pool? Presumably the majority of the data in the pool would probably be fine.
Re: [zfs-discuss] What causes slow performance under load?
On Mon, Apr 27, 2009 at 04:47:27PM -0500, Gary Mills wrote: > On Sat, Apr 18, 2009 at 04:27:55PM -0500, Gary Mills wrote: > > We have an IMAP server with ZFS for mailbox storage that has recently > > become extremely slow on most weekday mornings and afternoons. When > > one of these incidents happens, the number of processes increases, the > > load average increases, but ZFS I/O bandwidth decreases. Users notice > > very slow response to IMAP requests. On the server, even `ps' becomes > > slow. > > The cause turned out to be this ZFS bug: > > 6596237: Stop looking and start ganging > > Apparently, the ZFS code was searching the free list looking for the > perfect fit for each write. With a fragmented pool, this search took > a very long time, delaying the write. Eventually, the requests arrived > faster than writes could be sent to the devices, causing the server > to be unresponsive. We also had another problem, due to this ZFS bug: 6591646: Hang while trying to enter a txg while holding a txg open This was a deadlock, with one thread blocking hundreds of other threads. Our symptom was that all zpool I/O would stop and the `ps' command would hang. A reboot was the only way out. If you have a support contract, Sun will supply an IDR that fixes both problems. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4540 32GB SSD in x4500 as slog
On Wed, May 13 at 17:27, Paul B. Henson wrote:
> On Wed, 13 May 2009, Richard Elling wrote:
>> Intel claims > 3,300 4kByte random write iops.
> Is that before or after the device gets full and starts needing to erase whole pages to write new blocks 8-/?
The quoted numbers are minimums, not "up to" like on the X25-M devices. I believe that they're measuring sustained 4k full-pack random writes, long after the device has filled and needs to be doing garbage collection, wear leveling, etc.
--eric
-- Eric D. Mudama edmud...@mail.bounceswoosh.org
Re: [zfs-discuss] X4540 32GB SSD in x4500 as slog
On Wed, 13 May 2009, Richard Elling wrote:
> I didn't find that exact part number, but I notice that manufacturing part 371-4196 32GB Solid State Drive, SATA Interface is showing up in a number of systems. IIRC, this would be an Intel X25-E.
Hmm, the part number I provided was off an official quote from our authorized reseller; googling it comes up with one sun.com link: http://www.sun.com/executives/iforce/mysun/docs/Support2a_ReleaseContentInfo.html and a bunch of Japanese sites. List price was $1500; if it is actually an OEM'd Intel X25-E, that's quite a markup -- street price on that has dropped below $500. If it's not, it sure would be nice to see some specs.
> Generally, Sun doesn't qualify new devices with EOLed systems.
Understood, it just sucks to have bought a system on its deathbed without prior knowledge thereof.
> Today, you can remove a cache device, but not a log device. You can replace a log device.
I guess if we ended up going this way, replacing the log device with a standard hard drive in case of support issues would be the only way to go. Does log device replacement also require the replacement device to be of equal or greater size? If I wanted to swap between a 32GB SSD and a 1TB SATA drive, I guess I would need to make a partition/slice on the TB drive of exactly the size of the SSD?
> Before you start down this path, you should take a look at the workload using zilstat, which will show you the kind of work the ZIL is doing. If you don't see any ZIL activity, no need to worry about a separate log. http://www.richardelling.com/Home/scripts-and-programs-1/zilstat
Would a dramatic increase in performance when disabling the ZIL also be sufficient evidence? Even with me as the only person using our test x4500, disabling the ZIL provides markedly better performance as originally described for certain use cases.
> Usually, the log device does not need to be very big.
> A good strategy would be to create a small partition or slice, say 1 GByte, on an idle disk.
If the log device was too small, you potentially could end up bottlenecked waiting for transactions to be committed to free up log device blocks?
> Intel claims > 3,300 4kByte random write iops.
Is that before or after the device gets full and starts needing to erase whole pages to write new blocks 8-/?
> My rule of thumb is to have a hot spare. Having lots of hot spares only makes a big difference for sites where you cannot service the systems within a few days, such as remote locations.
Eh, they're just downstairs, and we have 7x24 gold on them. Plus I have 5, each with 2 hot spares. I wouldn't have an issue trading a hot spare for a log device other than potential issues with the log device failing if not mirrored.
> Yes, and this is what would happen in the case where the log device completely failed while the pool was operational -- the ZIL will revert to using the main pool.
But would it then go belly up if the system ever rebooted? You said currently you cannot remove a log device; if the pool reverts to an embedded log upon slog failure, and continues to work after a reboot, you've effectively removed the slog, other than I guess it might keep complaining and showing a dead slog device.
> This is the case where the log device fails completely while the pool is not operational. Upon import, the pool will look for an operational log device and will not find it. This means that any committed transactions that would have been in the log device are not recoverable *and* the pool won't know the extent of this missing information.
So is there simply no recovery available for such a pool? Presumably the majority of the data in the pool would probably be fine.
> OTOH, if you are paranoid and feel very strongly about CYA, then by all means, mirror the log :-).
That all depends on the outcome in that rare-as-it-might-be case where the log device fails and the pool is inaccessible. If it's just a matter of some manual intervention to reset the pool to a happy state and the potential loss of any uncommitted transactions (which, according to the evil zfs tuning guide, don't result in a corrupted zfs filesystem, only in potentially unhappy nfs clients), I could live with that. If all of the data in the pool is trashed and must be restored from backup, that would be problematic.
> [editorial comment: it would be to Sun's benefit if Sun people would respond to Sun product questions. Harrrummppff.]
Maybe they're too busy running in circles trying to figure out what life under Oracle dominion is going to be like :(.
-- Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/ Operating Systems and Network Analyst | hen...@csupomona.edu California State Polytechnic University | Pomona CA 91768
Re: [zfs-discuss] X4540 32GB SSD in x4500 as slog
Paul B. Henson wrote:
> I see Sun has recently released part number XRA-ST1CH-32G2SSD, a 32GB SATA SSD for the x4540 server.
I didn't find that exact part number, but I notice that manufacturing part 371-4196 32GB Solid State Drive, SATA Interface is showing up in a number of systems. IIRC, this would be an Intel X25-E. (shock rated at 1,000 Gs @ 0.5ms, so it should still work if I fall off my horse ;-)
> We have five x4500's we purchased last year that we are deploying to provide file and web services to our users. One issue that we have had is horrible performance for the "single threaded process creating lots of small files over NFS" scenario. The bottleneck in that case is fairly clear, and to verify it we temporarily disabled the ZIL on one of the servers. Extraction time for a large tarball into an NFSv4 mounted filesystem dropped from 20 minutes to 2 minutes. Obviously, it is strongly recommended not to run with the ZIL disabled, and we don't particularly want to do so in production. However, for some of our users, performance is simply unacceptable for various usage cases (including not only tar extracts, but other common software development processes such as svn checkouts).
Yep. Same sort of workload.
> As such, we have been investigating the possibility of improving performance via a slog, preferably on some type of NVRAM or SSD. We haven't really found anything appropriate, and now we see Sun has officially released something very possibly like what we have been looking for. My sales rep tells me the drive is only qualified for use in an x4540. However, as a standard SATA interface SSD there is theoretically no reason why it would not work in an x4500; they even share the exact same drive sleds. I was told Sun just didn't want to spend the time/effort to qualify it for the older hardware (kind of sucks that servers we bought less than a year ago are being abandoned).
> We are considering using them anyway; in the worst case, if Sun support complains that they are installed and refuses to continue any diagnostic efforts, presumably we can simply swap them out for standard hard drives. slog devices can be replaced like any other zfs vdev, correct? Or alternatively, what is the state of removing a slog device and reverting back to a pool embedded log?
Generally, Sun doesn't qualify new devices with EOLed systems. Today, you can remove a cache device, but not a log device. You can replace a log device. Before you start down this path, you should take a look at the workload using zilstat, which will show you the kind of work the ZIL is doing. If you don't see any ZIL activity, no need to worry about a separate log. http://www.richardelling.com/Home/scripts-and-programs-1/zilstat If you decide you need a log device... read on. Usually, the log device does not need to be very big. A good strategy would be to create a small partition or slice, say 1 GByte, on an idle disk. Add this as a log device to the pool. If this device is a HDD, then you might not see much of a performance boost. But now that you have a log device setup, you can experiment with replacing the log device with another. You won't be able to remove the log device, but you can relocate or grow it on the fly.
> So, has anyone played with this new SSD in an x4500 and can comment on whether or not they seemed to work okay? I can't imagine that no one inside of Sun, regardless of official support level, has tried it :). Feel free to post anonymously or reply off list if you don't want anything on the record ;). From reviewing the Sun hybrid storage documentation, it describes two different flash devices, the "Logzilla", optimized for blindingly fast writes and intended as a ZIL slog, and the "Cachezilla", optimized for fast reads and intended for use as L2ARC. Is this one of those, or some other device?
> If the latter, what are its technical read/write performance characteristics?
Intel claims > 3,300 4kByte random write iops. A really fast HDD may reach 300 4kByte random write iops, but there are no really fast SATA HDDs. http://www.intel.com/design/flash/nand/extreme/index.htm
> We currently have all 48 drives allocated, 23 mirror pairs and two hot spares. Is there any timeline on the availability of removing an active vdev from a pool, which would allow us to swap out a couple of devices without destroying and having to rebuild our pool?
My rule of thumb is to have a hot spare. Having lots of hot spares only makes a big difference for sites where you cannot service the systems within a few days, such as remote locations. But you can remove a hot spare, so that could be a source of your experimental 1 GByte log.
> What is the current state of behavior in the face of slog failure?
It depends on both the failure and event tree...
> Theoretically, if a dedicated slog device failed, the pool could simply revert to logging embedded in the pool.
Yes, and this is what would happen in the case where the log device completely failed while the pool was operational -- the ZIL will revert to using the main pool.
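The add-then-replace experiment Richard describes can be sketched as follows. This is a dry run that only prints the commands for review (the pool name "tank" and device names are hypothetical); drop the echo in the `run` wrapper to execute them against a real pool:

```shell
# Dry-run wrapper: print each command instead of executing it, so the
# sequence can be reviewed safely before touching a real pool.
run() { echo "+ $*"; }

run zpool add tank log c4t0d0s0          # start with a small 1 GByte slice as the slog
run zpool replace tank c4t0d0s0 c5t0d0   # later, swap the slice for the SSD
run zpool status tank                    # confirm the "logs" vdev looks right
# Paranoid variant: mirror the slog rather than relying on a single device.
run zpool add tank log mirror c4t0d0 c4t1d0
```

Note the thread's distinction: adding two plain log devices stripes data across them, while `mirror` in the add command creates a redundant slog (add != attach).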
Re: [zfs-discuss] ZFS + USB media + mirror/raidz = weird
Bogdan M. Maryniuk wrote:
> Hello, folks. A kind of problem with mirrors and raidz. System & config: — SunOS 5.11, snv_111a — Service system/filesystem/rmvolmgr is disabled. Hardware: — Asus EeePC Box B202 — Two 3.5" USB drive boxes. If I reboot: the mirror or raidz works fine. But if I connect the physical USB cables to different USB ports, then I get my mirror (or raidz) completely screwed up. If I just shut down the box and restore the physical configuration — everything is fubar and no errors are reported.
The current imported configuration is stored in the /etc/zfs/zpool.cache file. You can take a look at it using "zdb -C". Upon reboot, the imported pools (and devices) shown in the zpool.cache will be automatically imported. Others won't. The solution is to clear the pool entries from the zpool.cache. This is easily accomplished by doing "zpool export poolname" even if the pool is not imported. If you know you will be juggling devices around, you can also export prior to reboot.
-- richard
> I took an output of zdb before and after the cable swap, so the diff shows me only a different txg on rpool (not related to my own pool of USB drives). Everything else is just identical. When the zpool with USB drives is corrupted due to the cable swap, "zdb -uuu " won't report anything, yelling that there is no such pool at all (although just "zdb " reports it exists). I understand that there are a lot of steps before ZFS sees the drives. But... how come USB cable swapping can corrupt ZFS like this? It somehow does not sound right to me in general: I assume ZFS must find disks by labels, right? Can anyone help me understand what's going on here, please? If there is any hack/cure/idea/wish or at least a necrology, please share as well. :-) Thanks a lot.
-- Kind regards, bm
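Richard's zpool.cache cleanup can be sketched as below. The pool name "usbpool" is a hypothetical stand-in for the USB pool, and the commands are printed rather than executed so the sequence can be reviewed first:

```shell
# Dry run: print the commands instead of executing them.
run() { echo "+ $*"; }

run zdb -C                # inspect the cached pool configs from /etc/zfs/zpool.cache
run zpool export usbpool  # clear the stale cache entry, even if the pool isn't imported
run zpool import usbpool  # re-import by scanning the on-disk device labels
```

The export/import cycle is the point: import scans device labels rather than trusting the cached device paths, which is how the pool survives the cables landing on different USB ports.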
Re: [zfs-discuss] backup and restore of ZFS root disk using DVD drive and DAT tape drive
On Wed, May 13, 2009 at 11:32 AM, Erik Trimble wrote:
> the "zfs send" and "zfs receive" commands can be used analogously to "ufsdump" and "ufsrestore". You'll have to create the root pool by hand when doing a system restore, but it's not really any different than having to partition the disk under the old ufs-way. So, use "zfs send" to write the zfs filesystems to DAT. To recover:
This page lists detailed step-by-step instructions: http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#ZFS_Root_Pool_Recovery
Regards, Fajar
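A minimal sketch of the send-to-tape step Erik describes, shown as a dry run that only prints the commands. The snapshot name and tape device path are assumptions for illustration; the step-by-step guide linked above covers the full restore procedure:

```shell
# Dry run: print the commands instead of executing them.
run() { echo "+ $*"; }

run zfs snapshot -r rpool@backup              # recursive snapshot of the root pool
run "zfs send -R rpool@backup > /dev/rmt/0n"  # stream the pool and its descendants to DAT
# Restore side, after recreating the root pool by hand as noted above:
run "zfs receive -Fd rpool < /dev/rmt/0n"
```

The `-R` flag sends the snapshot together with all descendant filesystems and their properties, which is what makes this analogous to a full ufsdump of the root disk.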
Re: [zfs-discuss] backup and restore of ZFS root disk using DVD drive and DAT tape drive
Dedhi Sujatmiko wrote:
> Dear all, given a DVD drive and a DAT tape drive, and using Solaris 10 U7 (5/09), how can we plan for a total backup of the ZFS root disk and a procedure to recover it? Previously, using UFS, we just needed to boot from the Solaris OS DVD media and use ufsdump, ufsrestore and installboot.
star includes an implementation of the basic ideas from ufsdump/ufsrestore in an OS- and FS-independent way. Star allows you to do incremental dumps and restores.
Jörg
-- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de (uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Re: [zfs-discuss] What causes slow performance under load?
Rince wrote:
> On Tue, Apr 21, 2009 at 3:20 PM, Joerg Schilling wrote:
>> The license combination used by cdrtools was verified by several lawyers including Sun Legal and Eben Moglen, and no lawyer found a problem.
> [citation needed]
What is the reason for restarting your FUD campaign? Please stop this; it is completely off topic.
Jörg
-- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de (uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
[zfs-discuss] ZFS + USB media + mirror/raidz = weird
Hello, folks. A kind of problem with mirrors and raidz. System & config: — SunOS 5.11, snv_111a — Service system/filesystem/rmvolmgr is disabled. Hardware: — Asus EeePC Box B202 — Two 3.5" USB drive boxes. If I reboot: the mirror or raidz works fine. But if I connect the physical USB cables to different USB ports, then I get my mirror (or raidz) completely screwed up. If I just shut down the box and restore the physical configuration — everything is fubar and no errors are reported. I took an output of zdb before and after the cable swap, so the diff shows me only a different txg on rpool (not related to my own pool of USB drives). Everything else is just identical. When the zpool with USB drives is corrupted due to the cable swap, "zdb -uuu " won't report anything, yelling that there is no such pool at all (although just "zdb " reports it exists). I understand that there are a lot of steps before ZFS sees the drives. But... how come USB cable swapping can corrupt ZFS like this? It somehow does not sound right to me in general: I assume ZFS must find disks by labels, right? Can anyone help me understand what's going on here, please? If there is any hack/cure/idea/wish or at least a necrology, please share as well. :-) Thanks a lot.
-- Kind regards, bm