Re: [zfs-discuss] (Practical) limit on the number of snapshots?
Lutz,

On Mon, Jan 11, 2010 at 09:38:16PM -0800, Lutz Schumann wrote:
> Because you mention the fixed bugs I have a more general question.
>
> Is there a way to see all commits to OSOL that are related to a Bug Report ?

You can go to src.opensolaris.org, enter the bug-id in the history field, select the ON gate and search. That should list all the files that were modified by the fix for that bug. For each file you can then go to the history and get a diff between the version where the fix was integrated and the previous version.

Hope that helps.

Regards,
Sanjeev

--
Sanjeev Bagewadi
Solaris RPE
Bangalore, India
Re: [zfs-discuss] internal backup power supplies?
Actually, for the ZIL you may use the A-card (memory SATA disk + BBU + compact flash write-out). For the data disks there is no solution yet - it would be nice. However, I prefer the "supercapacitor on disk" method. Why? Because the recharge logic is challenging: there needs to be communication between the disk and the power supply. The interesting cases are "fluctuating power" (see below) and battery maintenance. If you are charged you can run fine, but the corner cases are tricky.

Imagine the following scenario:

1) Operations: normal
2) Power outage: 1 hour
3) UPS failing after 30 minutes
4) Power comes back
5) ALL servers power on at the same time (e.g. misconfiguration)
6) Peak -> power goes down again

At 3) your batteries are empty. At 6) your batteries are not fully charged, but because the device does not know the "status" of the local UPS, the write cache is still enabled. Thus a simple design does not solve the problem well enough.

Another thing is maintenance of a battery. You have to check if your battery still works (charge cycle). You have to alarm if it does not (monitoring). You have to replace it online. So in general, batteries are bad if your server lives longer than 3 years :)

For Google it works fine, maybe because the server will live < 3 years anyhow and because they can "just replace" the server thanks to their internal redundancy options (Google's backend technology is designed to handle failure well). For a storage system I don't see that.

The BBU / capacitor needs to implement the same logic a RAID BBU implements:

    if (not_working_fully(BBU)) { disable_write_cache(); } else { enable_write_cache(); }

Or better (explicit state whitelisting, guaranteeing data integrity also for unexpected states):

    if (working_fully(BBU)) { enable_write_cache(); } else { disable_write_cache(); }

p.s. While writing this I'm wondering whether the A-card handles this case well? ... maybe not.
Re: [zfs-discuss] (Practical) limit on the number of snapshots?
> > .. however ... a lot of snaps still have an impact on system
> > performance. After the import of the 10000 snaps volume, I saw
> > "devfsadm" eating up all CPU:
>
> If you are snapshotting ZFS volumes, then each will create an entry in the
> device tree. In other words, if these were file systems instead of volumes,
> you would not see devfsadm so busy.

Ok, nice to know that. For our use case we are focused on zvols (COMSTAR iSCSI to virtualized hosts). And it still works fine for a reasonable number of zvols with a nice backup/snapshot cycle (~ 12 (5 min) + 24 (1h) + 7 (daily) + 4 (weekly) + 12 (monthly) + 1 for each year -> ~ 60 snaps for each zvol).

> > .. another strange issue:
> > r...@osol_dev130:/dev/zvol/dsk/ssd# ls -al
> >
> > load averages: 1.16, 2.17, 1.50;  up 0+00:22:07  18:54:00
> > 97 sleeping, 2 on cpu
> > CPU states: 49.1% idle, 0.1% user, 50.8% kernel,
>
> I don't see the issue, could you elaborate?

How? (I know how to "truss", but I'm not so familiar with debugging at the kernel level :)

> > .. so having 10000 snaps of a single zvol is not nice :)
>
> AIUI, devfsadm creates a database. Could you try the last experiment again,

Will try, but I have no access to test equipment right now.

Thanks for the feedback.
Re: [zfs-discuss] (Practical) limit on the number of snapshots?
Because you mention the fixed bugs I have a more general question.

Is there a way to see all commits to OSOL that are related to a Bug Report ?

Background: I'm interested in how e.g. the zfs import bug was fixed.
Re: [zfs-discuss] internal backup power supplies?
> [google server with batteries]

These are cool, and a clever rethink of the typical data centre power supply paradigm. They keep the server running, until either a generator is started or a graceful shutdown can be done.

Just to be clear, I'm talking about something much smaller, that provides power only for drives, for a few moments after the host powers down (for whatever reason) to let the drives sync their caches safely.

Basically, just wrapping the drive with the supercap (or equivalent) the manufacturer didn't include, plus whatever minimal power supply circuitry is needed (to avoid big inrush recharge currents on startup, to avoid sending power back out into the rest of the case, etc).

Because there's no integration for an emergency "sync now!" signal, we have to rely on timeouts and wait "long enough" for the cache to be sync'ed. It might be larger and need to hold longer than an on-board supercap, but not very long in absolute terms.

There seems to be lots of room for a comfortable niche in the gap between common commodity hardware (that would be plenty good enough otherwise) and the $5k F20's and LogZilla's and similar.

--
Dan.
Re: [zfs-discuss] opensolaris-vmware
On Mon, Jan 11, 2010 at 6:17 PM, Greg wrote:
> Hello All,
> I hope this makes sense. I have two opensolaris machines with a bunch of
> hard disks; one acts as an iSCSI SAN, and the other is identical other than
> the hard disk configuration. The only thing being served are VMWare esxi raw
> disks, which hold either virtual machines or data that the particular
> virtual machine uses. I.E. we have exchange 2007 virtualized and through its
> iSCSI initiator we are mounting two LUNs, one for the database and another
> for the Logs, all on different arrays of course. Anyhow, we are then
> snapshotting this data across the SAN network to the other box using
> snapshot send/recv. In case the other box fails, this box can immediately
> serve all of the iSCSI LUNs. The problem, I don't really know if it's a
> problem... is when I snapshot a running vm, will it come up alive in esxi
> or do I have to accomplish this in a different way? These snapshots will
> then be written to tape with bacula. I hope I am posting this in the
> correct place.
>
> Thanks,
> Greg

What you've got are crash-consistent snapshots. The disks are in the same state they would be in if you pulled the power plug. They may come up just fine, or they may be in a corrupt state. If you take snapshots frequently enough, you should have at least one good snapshot.

Your other option is scripting. You can build custom scripts to leverage the VSS providers in Windows... but it won't be easy.

Any reason in particular you're using iSCSI? I've found NFS to be much more simple to manage, and performance to be equivalent if not better (in large clusters).

--
--Tim
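For reference, a rough sketch of the snapshot-and-replicate cycle described above; the dataset name, target host and the LAST/NOW bookkeeping are made up for illustration, not taken from Greg's setup:

    #!/bin/sh
    # Crash-consistent replication of one iSCSI-backing zvol to a standby box.
    SRC=tank/exchange-db          # hypothetical zvol backing a LUN
    DST=standby-host              # hypothetical receiving OpenSolaris box
    LAST=201001110500             # previous replicated snapshot, tracked between runs
    NOW=$(date +%Y%m%d%H%M)

    # Take the new snapshot; equivalent to the VM losing power at this instant.
    zfs snapshot ${SRC}@repl-${NOW}

    # Ship only the changes since the last replicated snapshot.
    zfs send -i ${SRC}@repl-${LAST} ${SRC}@repl-${NOW} | \
        ssh ${DST} zfs receive -F ${SRC}

The receiving side ends up with the same crash-consistent snapshots, which is what makes the immediate failover described above possible.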
Re: [zfs-discuss] x4500 failed disk, not sure if hot spare took over correctly
On Jan 11, 2010, at 6:35 PM, Paul B. Henson wrote:

> On Mon, 11 Jan 2010, Eric Schrock wrote:
>
>> No, there is no way to tell if a pool has DTL (dirty time log) entries.
>
> Hmm, I hadn't heard that term before, but based on a quick search I take it
> that's the list of data in the pool that is not fully redundant? So if a
> 2-way mirror vdev lost a half, everything written after the loss would be
> on the DTL, and if the same device came back, recovery would entail just
> running through the DTL and writing out what it missed? Although presumably
> if the failed device was replaced with another device entirely all of the
> data would need to be written out.
>
> I'm not quite sure that answered my question. My original question was, for
> example, given a 2-way mirror, one half fails. There is a hot spare
> available, which is pulled in, and while the pool isn't optimal, it does
> have the same number of devices that it's supposed to. On the other hand,
> the same mirror loses a device, there's no hot spare, and the pool is short
> one device. My understanding is that in both scenarios the pool status
> would be "DEGRADED", but it seems there's an important difference. In the
> first case, another device could fail, and the pool would still be ok. In
> the second, another device failing would result in complete loss of data.
>
> While you can tell the difference between these two different states by
> looking at the detailed output and seeing if a hot spare is in use, I was
> just saying that it would be nice for the short status to have some
> distinction between "device failed, hot spare in use" and "device failed,
> keep fingers crossed" ;).
>
> Back to your answer, if the existence of DTL entries means the pool doesn't
> have full redundancy for some data, and you can't tell if a pool has DTL
> entries, are you saying there's no way to tell if the current state of your
> pool could survive a device failure? If a resilver successfully completes,
> barring another device failure, doesn't that mean the pool is restored to
> full redundancy? I feel like I must be misunderstanding something :(.

DTLs are a more specific answer to your question. It implies that a toplevel vdev has a known time when there is invalid data for it or one of its children. This may be because a device failed and is accumulating DTL time, a new replacing or spare vdev was attached, or because a device was unplugged and then plugged back in. Your example (hot spares) is but one of the ways in which this can happen, but in any of the cases it implies that data is not fully replicated. There is obviously a way to detect this in the kernel, it's simply not exported to userland in any useful way.

The reason I focused on DTLs is that if any mechanism were provided to distinguish a pool lacking full redundancy, it would be based on DTLs - nothing else makes sense.

- Eric

> Thanks...
>
> --
> Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
> Operating Systems and Network Analyst  |  hen...@csupomona.edu
> California State Polytechnic University  |  Pomona CA 91768

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock
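Until something DTL-based is exported to userland, a rough stopgap is to parse the detailed output, as Paul suggests. A sketch, with the INUSE match being only a heuristic:

    # Flag DEGRADED pools and note whether a hot spare is currently in use.
    zpool list -H -o name,health | while read pool health; do
        if [ "$health" = "DEGRADED" ]; then
            if zpool status "$pool" | grep -q INUSE; then
                echo "$pool: degraded, hot spare in use"
            else
                echo "$pool: degraded, no spare in use"
            fi
        fi
    done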
Re: [zfs-discuss] x4500 failed disk, not sure if hot spare took over correctly
On Mon, 11 Jan 2010, Eric Schrock wrote:

> No, there is no way to tell if a pool has DTL (dirty time log) entries.

Hmm, I hadn't heard that term before, but based on a quick search I take it that's the list of data in the pool that is not fully redundant? So if a 2-way mirror vdev lost a half, everything written after the loss would be on the DTL, and if the same device came back, recovery would entail just running through the DTL and writing out what it missed? Although presumably if the failed device was replaced with another device entirely all of the data would need to be written out.

I'm not quite sure that answered my question. My original question was, for example, given a 2-way mirror, one half fails. There is a hot spare available, which is pulled in, and while the pool isn't optimal, it does have the same number of devices that it's supposed to. On the other hand, the same mirror loses a device, there's no hot spare, and the pool is short one device. My understanding is that in both scenarios the pool status would be "DEGRADED", but it seems there's an important difference. In the first case, another device could fail, and the pool would still be ok. In the second, another device failing would result in complete loss of data.

While you can tell the difference between these two different states by looking at the detailed output and seeing if a hot spare is in use, I was just saying that it would be nice for the short status to have some distinction between "device failed, hot spare in use" and "device failed, keep fingers crossed" ;).

Back to your answer, if the existence of DTL entries means the pool doesn't have full redundancy for some data, and you can't tell if a pool has DTL entries, are you saying there's no way to tell if the current state of your pool could survive a device failure? If a resilver successfully completes, barring another device failure, doesn't that mean the pool is restored to full redundancy? I feel like I must be misunderstanding something :(.

Thanks...

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
Re: [zfs-discuss] rpool mirror on zvol, can't offline and detach
On Mon, Jan 11, 2010 at 06:03:40PM -0800, Richard Elling wrote:
> IMHO, a split mirror is not as good as a decent backup :-)

I know.. that was more by way of introduction and background. It's not the only method of backup, but since this disk does get plugged into the netbook frequently enough it seemed like a useful measure, at the cost of a couple of setup commands (I hoped).

The technical questions here, however, are "why does zpool offline not close the device", and/or "what else do I have to do to get the device closed".

--
Dan.
Re: [zfs-discuss] (Practical) limit on the number of snapshots?
One thing which may help is that zfs import used to be single-threaded, i.e. it opened every disk (maybe slice) one at a time and processed it. As of 128b it is multi-threaded, i.e. it opens N disks/slices at once and processes N disks/slices at once, where N is the number of threads it decides to use.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844191

This most likely/maybe caused other parts of the process to become multi-threaded as well.

It would be nice to no longer have /etc/zfs/zpool.cache, now that zfs import is fast enough (which is a second reason I logged the bug).
Re: [zfs-discuss] rpool mirror on zvol, can't offline and detach
On Jan 11, 2010, at 4:42 PM, Daniel Carosone wrote:

> I have a netbook with a small internal ssd as rpool. I have an
> external usb HDD with much larger storage, as a separate pool, which
> is sometimes attached to the netbook.
>
> I created a zvol on the external pool, the same size as the internal
> ssd, and attached it as a mirror to rpool for backup. I don't care so
> much about it not being bootable, as long as I can read whatever data
> I might need in case of failure or loss.

IMHO, a split mirror is not as good as a decent backup :-)

At one time, Tim Foster's scripts had a backup feature where you could specify a backup flag on a file system which would automatically back up to removable media. This used send/recv, which is a clean way of managing such things. There has been some talk recently about the future of that feature; see the zfs-auto-snapshot forum to catch up on the conversation.

NB: one reason send/recv in ZFS works like a well-designed split mirror using some other RAID software is that the same method is used to send incremental snapshots and resilver mirrors.

> The mirror works fine, and resilvers properly and selectively when I
> use "zpool offline" and "zpool online" on the zvol submirror.

Not really. If you want to split mirrors for "backup" purposes, then you need "zpool split", which recently integrated into b131. It takes care of the dangling participles.
 -- richard

> I don't want to have the usb disk attached all the time, nor even to
> run with the mirror always active (usb is - just - slower than the
> internal ssd). I'd like to be able to move this external disk between
> hosts, and potentially repeat the rpool mirror for each, having them
> resilver whenever the disk is attached.
>
> However, with the rpool mirror in place, I can't find a way to "zpool
> export black". It complains that the pool is busy, because of the
> zvol in use. This happens regardless of whether I have set the zvol
> submirror offline. I expected that, with the subdevice in the offline
> state, the zvol would be closed.
>
> Any suggestions? Is this worth filing as a bug (is the device really
> offline)? Would it work differently if I used a file on the external
> pool, instead of a zvol? (I haven't tried that yet, but don't really
> expect a difference unless umount -f can help).
>
> --
> Dan.
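For reference, the basic shape of the b131 feature mentioned above; the pool names here are illustrative, not Dan's actual configuration:

    # Split one side of each mirror in rpool off into a new, importable pool.
    zpool split rpool rpool-backup

    # The new pool can later be imported on this or another host.
    zpool import rpool-backup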
Re: [zfs-discuss] rpool mirror on zvol, can't offline and detach
On Tue, Jan 12, 2010 at 02:38:56PM +1300, Ian Collins wrote:
> How did you set the subdevice in the off line state?

# zpool offline rpool /dev/zvol/dsk/

sorry if that wasn't clear.

> Did you detach the device from the mirror?

No, because then:
 - it will have to resilver fully on next attach, no quick update
 - it will be marked detached, and harder to import for recovery.
   I'm not even sure it can be easily recovered, maybe with import -D?

If I do detach the subdevice, of course then the external pool can be exported - but that's not helpful for the original goal.

If I shutdown ungracefully, and boot without the external disk attached, the expected components are faulted, but again that's not especially desirable for normal operations.

Perhaps the new "zpool split" will be better for the second issue, but won't address the first issue. One of the several reasons for wanting to use a zvol (or file on zfs) is to be able to clone the backing store for an "import -f" while keeping the original for further incremental resilvers.

--
Dan.
Re: [zfs-discuss] x4500 failed disk, not sure if hot spare took over correctly
On 01/11/10 17:42, Paul B. Henson wrote:
> On Sat, 9 Jan 2010, Eric Schrock wrote:
>> No, it's fine. DEGRADED just means the pool is not operating at the ideal
>> state. By definition a hot spare is always DEGRADED. As long as the spare
>> itself is ONLINE it's fine.
>
> One more question on this; so there's no way to tell just from the status
> the difference between a pool degraded due to disk failure but still with
> full redundancy from a hot spare vs a pool degraded due to disk failure
> that has lost redundancy due to that failure? I guess you can review the
> pool details for the specifics but for large pools it seems it would be
> valuable to be able to quickly distinguish these states from the short
> status.

No, there is no way to tell if a pool has DTL (dirty time log) entries.

- Eric

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock
Re: [zfs-discuss] x4500 failed disk, not sure if hot spare took over correctly
On Sat, 9 Jan 2010, Eric Schrock wrote:
> No, it's fine. DEGRADED just means the pool is not operating at the
> ideal state. By definition a hot spare is always DEGRADED. As long as
> the spare itself is ONLINE it's fine.

One more question on this; so there's no way to tell just from the status the difference between a pool degraded due to disk failure but still with full redundancy from a hot spare vs a pool degraded due to disk failure that has lost redundancy due to that failure? I guess you can review the pool details for the specifics, but for large pools it seems it would be valuable to be able to quickly distinguish these states from the short status.

Thanks...

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
Re: [zfs-discuss] rpool mirror on zvol, can't offline and detach
Daniel Carosone wrote:
> However, with the rpool mirror in place, I can't find a way to "zpool
> export black". It complains that the pool is busy, because of the
> zvol in use. This happens regardless of whether I have set the zvol
> submirror offline. I expected that, with the subdevice in the offline
> state, the zvol would be closed.

How did you set the subdevice in the off line state?

Did you detach the device from the mirror?

--
Ian.
Re: [zfs-discuss] internal backup power supplies?
On Jan 11, 2010, at 19:00, Toby Thain wrote:
> On 11-Jan-10, at 5:59 PM, Daniel Carosone wrote:
>> Does anyone know of such a device being made and sold? Feel like
>> designing and marketing one, or publishing the design?
>
> FWIW I think Google server farm uses something like this.

It looks slightly "ghetto", but it seems like it works for them:

http://blogs.sun.com/geekism/entry/holy_battery_backup_batman
http://tinyurl.com/cpt4yq
http://arstechnica.com/hardware/news/2009/04/the-beast-unveiled-inside-a-google-server.ars
Re: [zfs-discuss] rpool mirror on zvol, can't offline and detach
I should have mentioned:
 - opensolaris b130
 - of course I could use partitions on the usb disk, but that's so much less flexible.

--
Dan.
[zfs-discuss] rpool mirror on zvol, can't offline and detach
I have a netbook with a small internal ssd as rpool. I have an external usb HDD with much larger storage, as a separate pool, which is sometimes attached to the netbook.

I created a zvol on the external pool, the same size as the internal ssd, and attached it as a mirror to rpool for backup. I don't care so much about it not being bootable, as long as I can read whatever data I might need in case of failure or loss.

The mirror works fine, and resilvers properly and selectively when I use "zpool offline" and "zpool online" on the zvol submirror.

I don't want to have the usb disk attached all the time, nor even to run with the mirror always active (usb is - just - slower than the internal ssd). I'd like to be able to move this external disk between hosts, and potentially repeat the rpool mirror for each, having them resilver whenever the disk is attached.

However, with the rpool mirror in place, I can't find a way to "zpool export black". It complains that the pool is busy, because of the zvol in use. This happens regardless of whether I have set the zvol submirror offline. I expected that, with the subdevice in the offline state, the zvol would be closed.

Any suggestions? Is this worth filing as a bug (is the device really offline)? Would it work differently if I used a file on the external pool, instead of a zvol? (I haven't tried that yet, but don't really expect a difference unless umount -f can help).

--
Dan.
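A rough sketch of the configuration being described, with made-up sizes and device names (the external pool is called "black" above; c7d0s0 stands in for the internal ssd):

    # Create a zvol on the external pool, the same size as the internal ssd.
    zfs create -V 16g black/rpool-mirror

    # Attach it as the second side of the rpool mirror.
    zpool attach rpool c7d0s0 /dev/zvol/dsk/black/rpool-mirror

    # Before unplugging the disk (this is the step that fails):
    zpool offline rpool /dev/zvol/dsk/black/rpool-mirror
    zpool export black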
[zfs-discuss] opensolaris-vmware
Hello All,

I hope this makes sense. I have two opensolaris machines with a bunch of hard disks; one acts as an iSCSI SAN, and the other is identical other than the hard disk configuration. The only thing being served are VMWare esxi raw disks, which hold either virtual machines or data that the particular virtual machine uses. I.E. we have exchange 2007 virtualized and through its iSCSI initiator we are mounting two LUNs, one for the database and another for the Logs, all on different arrays of course. Anyhow, we are then snapshotting this data across the SAN network to the other box using snapshot send/recv. In case the other box fails, this box can immediately serve all of the iSCSI LUNs. The problem, I don't really know if it's a problem... is when I snapshot a running vm, will it come up alive in esxi or do I have to accomplish this in a different way? These snapshots will then be written to tape with bacula. I hope I am posting this in the correct place.

Thanks,
Greg
Re: [zfs-discuss] internal backup power supplies?
On 11-Jan-10, at 5:59 PM, Daniel Carosone wrote:

> With all the recent discussion of SSD's that lack suitable power-failure
> cache protection, surely there's an opportunity for a separate modular
> solution?
>
> I know there used to be (years and years ago) small internal UPS's that
> fit in a few 5.25" drive bays. They were designed to power the motherboard
> and peripherals, with the advantage of simplicity and efficiency that
> comes from being behind the PC PSU and working entirely on DC.
> ...
> Does anyone know of such a device being made and sold? Feel like designing
> and marketing one, or publishing the design?

FWIW I think Google server farm uses something like this.

--Toby

> --
> Dan.
Re: [zfs-discuss] I/O Read starvation
On Jan 11, 2010, at 2:23 PM, Bob Friesenhahn wrote:

> On Mon, 11 Jan 2010, bank kus wrote:
>> Are we still trying to solve the starvation problem? I would argue the
>> disk I/O model is fundamentally broken on Solaris if there is no fair I/O
>> scheduling between multiple read sources; until that is fixed, individual
>> I_am_systemstalled_while_doing_xyz problems will crop up. Started a new
>> thread focussing on just this problem.
>
> While I will readily agree that zfs has an I/O read starvation problem
> (which has been discussed here many times before), I doubt that it is due
> to the reasons you are thinking. A true fair I/O scheduling model would
> severely hinder overall throughput in the same way that true real-time
> task scheduling cripples throughput.
>
> ZFS is very much based on its ARC model. ZFS is designed for maximum
> throughput with minimum disk accesses in server systems. Most reads and
> writes are to and from its ARC. Systems with sufficient memory hardly ever
> do a read from disk and so you will only see writes occurring in 'zpool
> iostat'. The most common complaint is read stalls while zfs writes its
> transaction group, but zfs may write this data up to 30 seconds after the
> application requested the write, and the application might not even be
> running any more.

Maybe an IO scheduler like Linux's 'deadline' IO scheduler, whose only purpose is to reduce the effect of writers starving readers while providing some form of guaranteed latency, would help here.

-Ross
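For comparison, this is how the elevator Ross mentions is selected per block device on a typical Linux system (sdX is a placeholder for a real device name):

    # Show the available schedulers (the active one is bracketed),
    # then switch to the deadline elevator.
    cat /sys/block/sdX/queue/scheduler
    echo deadline > /sys/block/sdX/queue/scheduler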
[zfs-discuss] internal backup power supplies?
With all the recent discussion of SSD's that lack suitable power-failure cache protection, surely there's an opportunity for a separate modular solution?

I know there used to be (years and years ago) small internal UPS's that fit in a few 5.25" drive bays. They were designed to power the motherboard and peripherals, with the advantage of simplicity and efficiency that comes from being behind the PC PSU and working entirely on DC.

Something similar in a smaller form factor, similar to the drive bay sleds that mount one or two 2.5" disks in a 3.5" (or even 5.25") bay, with a small and simple power storage and circuit, would be great. Alternately, something that took up a drive bay and provided power for multiple disks in other bays, though that might be messier for cabling. It wouldn't need to hold power long. We could then use any SSD selected on other design and performance and price criteria.

Does anyone know of such a device being made and sold? Feel like designing and marketing one, or publishing the design?

--
Dan.
Re: [zfs-discuss] Is LSI SAS3081E-R suitable for a ZFS NAS ?
On 11/01/10 11:57 PM, Arnaud Brand wrote:
> According to various posts the LSI SAS3081E-R seems to work well with
> OpenSolaris. But I've got pretty chilled-out from my recent problems with
> Areca-1680's.
>
> Could anyone please confirm that the LSI SAS3081E-R works well ?
> Is hotplug supported ?
> Anything else I should know before buying one of these cards ?

These cards work very well with OpenSolaris, and attach using the mpt(7d) driver - supports hotplugging and MPxIO too.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
Re: [zfs-discuss] abusing zfs boot disk for fun and DR
> Ben,
> I have found that booting from cdrom and importing the pool on the new
> host, then boot the hard disk, will prevent these issues.
> That will reconfigure the zfs to use the new disk device.
> When running, zpool detach the missing mirror device and attach a new one.

Thanks. I'm well versed in dealing with zfs issues. The reason I reported this boot/rpool issue was that it was similar in nature to issues that occurred trying to remediate an x4500 which had suffered many SATA disks going offline (due to the buggy Marvell driver), as well as corruption that occurred while trying to fix said issue. Backline spent a fair amount of time just trying to remediate the issue with hot spares that looked exactly like the faulted config in my rpool.
Re: [zfs-discuss] (Practical) limit on the number of snapshots?
comment below...

On Jan 11, 2010, at 10:00 AM, Lutz Schumann wrote:

> Ok, tested this myself ...
>
> (same hardware used for both tests)
>
> OpenSolaris svn_104 (actually Nexenta Core 2):
>
> 100 snaps:   r...@nexenta:/volumes# time zpool import ssd  ->  real 0m25.053s
> 500 snaps:   r...@nexenta:/volumes# time zpool import ssd  ->  real 3m59.206s
> 1500 snaps:  r...@nexenta:/volumes# time zpool import ssd  ->  real 36m26.765s
>
> you see where this goes - it's exponential !!
>
> Now with svn_130 (same pool, still 1500 snaps on it):
>
> 1500 snaps:  r...@osol_dev130:~# time zpool import ssd  ->  real 0m0.756s
> [...]
> 10000 snaps: r...@osol_dev130:~# time zpool import ssd  ->  real 0m0.405s
>
> Very nice, so volume import is solved.

cool

> .. however ... a lot of snaps still have an impact on system performance.
> After the import of the 10000 snaps volume, I saw "devfsadm" eating up all
> CPU:

If you are snapshotting ZFS volumes, then each will create an entry in the device tree. In other words, if these were file systems instead of volumes, you would not see devfsadm so busy.

> load averages:  5.00,  3.32,  1.58;   up 0+00:18:12   18:50:05
> 99 processes: 95 sleeping, 2 running, 2 on cpu
> CPU states:  0.0% idle,  4.6% user, 95.4% kernel,  0.0% iowait,  0.0% swap
>
>    PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
>    167 root        6  22    0   25M   13M run      3:14 49.41% devfsadm
>
> ... a truss showed that it is the device node allocation eating up the CPU:
>
> /5:  0.0010 xstat(2, "/devices/pseudo/z...@0:8941,raw", 0xFE32FCE0) = 0
> /5:  0.0005 fcntl(7, F_SETLK, 0xFE32FED0) = 0
> [...]
[zfs-discuss] Does ZFS use large memory pages?
Last April we put this in /etc/system on a T2000 server with large ZFS filesystems:

    set pg_contig_disable=1

This was while we were attempting to solve a couple of ZFS problems that were eventually fixed with an IDR. Since then, we've removed the IDR and brought the system up to Solaris 10 10/09 with current patches. It's stable now, but seems slower.

This line was a workaround for bug 6642475 that had to do with searching for large contiguous pages. The result was high system time and slow response. I can't find any public information on this bug, although I assume it's been fixed by now. It may have only affected Oracle database.

I'd like to remove this line from /etc/system now, but I don't know if it will have any adverse effect on ZFS or the Cyrus IMAP server that runs on this machine. Does anyone know if ZFS uses large memory pages?

--
-Gary Mills-    -Unix Group-    -Computer and Network Services-
Re: [zfs-discuss] HW raid vs ZFS
On Mon, 11 Jan 2010, Anil wrote:

>> ZFS will definitely benefit from battery backed RAM on the controller
>> as long as the controller immediately acknowledges cache flushes
>> (rather than waiting for battery-protected data to flush to the
>
> I am a little confused with this. Do we not want the controller to ignore
> these cache flushes (since the cache is battery protected)? What did you
> mean by acknowledge? If it acknowledges and flushes the cache, then what
> is the benefit of the cache at all (if ZFS keeps telling it to flush every
> few seconds)?

We want the controller to flush unwritten data to disk as quickly as it can regardless of whether it receives a cache flush request. If the data is "safely" stored in battery backed RAM, then we would like the controller to acknowledge the flush request immediately. The primary benefit of the battery-protected cache is to reduce latency for small writes.

> I use mostly DAS for my servers. This is a x4170 with 8 drive bays. So,
> what's the final recommendation? Should I just RAID 1 on the hardware and
> put ZFS on top of it?

Unless you have a severe I/O bottleneck through your controller, you should do the mirroring in zfs rather than in the controller. The reason for this is that zfs mirrors are highly resilient, intelligent, and resilver time is reduced. Zfs will be able to detect and correct errors that the controller might not be aware of, or be unable to correct.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] I/O Read starvation
On Mon, 11 Jan 2010, bank kus wrote:

> Are we still trying to solve the starvation problem? I would argue the
> disk I/O model is fundamentally broken on Solaris if there is no fair I/O
> scheduling between multiple read sources; until that is fixed, individual
> I_am_systemstalled_while_doing_xyz problems will crop up. Started a new
> thread focussing on just this problem.

While I will readily agree that zfs has an I/O read starvation problem (which has been discussed here many times before), I doubt that it is due to the reasons you are thinking. A true fair I/O scheduling model would severely hinder overall throughput in the same way that true real-time task scheduling cripples throughput.

ZFS is very much based on its ARC model. ZFS is designed for maximum throughput with minimum disk accesses in server systems. Most reads and writes are to and from its ARC. Systems with sufficient memory hardly ever do a read from disk and so you will only see writes occurring in 'zpool iostat'. The most common complaint is read stalls while zfs writes its transaction group, but zfs may write this data up to 30 seconds after the application requested the write, and the application might not even be running any more.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] HW raid vs ZFS
On 11-Jan-10, at 1:12 PM, Bob Friesenhahn wrote:

> On Mon, 11 Jan 2010, Anil wrote:
>> What is the recommended way to make use of a Hardware RAID controller/HBA
>> along with ZFS?
> ...
> Many people will recommend against using RAID5 in "hardware" since then
> zfs is not as capable of repairing errors, and because most RAID5
> controller cards use a particular format on the drives so that the drives
> become tied to the controller brand/model and it is not possible to move
> the pool to a different system without using an identical controller. If
> the controller fails and is no longer available for purchase, or the
> controller is found to have a design defect, then the pool may be toast.

+1

These drawbacks of proprietary RAID are frequently overlooked. Marty Scholes had a neat summary in a posting here, 21 October 2009:
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg30452.html

  Back when I did storage admin for a smaller company where availability
  was hyper-critical (but we couldn't afford EMC/Veritas), we had a
  hardware RAID5 array. After a few years of service, we ran into some
  problems:

  * Need to restripe the array? Screwed.
  * Need to replace the array because current one is EOL? Screwed.
  * Array controller barfed for whatever reason? Screwed.
  * Need to flash the controller with latest firmware? Screwed.
  * Need to replace a component on the array, e.g. NIC, controller or
    power supply? Screwed.
  * Need to relocate the array? Screwed.

  If we could stomach downtime or short-lived storage solutions, none of
  this would have mattered.
Re: [zfs-discuss] HW raid vs ZFS
> ZFS will definitely benefit from battery backed RAM on the controller
> as long as the controller immediately acknowledges cache flushes
> (rather than waiting for battery-protected data to flush to the

I am a little confused with this. Do we not want the controller to ignore these cache flushes (since the cache is battery protected)? What did you mean by acknowledge? If it acknowledges and flushes the cache, then what is the benefit of the cache at all (if ZFS keeps telling it to flush every few seconds)?

> The notion of "performance" is highly arbitrary since there are
> different types of performance. What sort of performance do you need
> to optimize for?

It won't be a big deal. Just general web/small databases. Just wondering how much of a magnitude in performance difference there is.

> Many people will recommend against using RAID5 in "hardware" since
> then zfs is not as capable of repairing errors, and because most RAID5
> controller cards use a particular format on the drives so that the
> drives become tied to the controller brand/model and it is not
> possible to move the pool to a different system without using an
> identical controller. If the controller fails and is no longer
> available for purchase, or the controller is found to have a design
> defect, then the pool may be toast.

I use mostly DAS for my servers. This is a x4170 with 8 drive bays. So, what's the final recommendation? Should I just RAID 1 on the hardware and put ZFS on top of it?

Got three options. Get the Storagetek Internal HBA (with batteries) and...:

7 disks, 600gb usable (no hardware raid):
 - 2 rpool
 - 3 raidz
 - 1 hot spare
 - 1 ssd

8 disks, 600gb usable:
 - 2 rpool
 - 2 mirror + 2 mirror (the mirrors would be in hardware, with zfs doing striping across the mirrors)
 - 1 hot spare
 - 1 ssd

8 disks, 1200gb usable (no hardware raid):
 - 2 rpool
 - 3 raidz + 3 raidz

The hot spare and the SSD are optional and I can add them at a later point.
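For illustration only, the third (all-ZFS) layout might be created along these lines; the device names are invented and will differ on a real x4170:

    # Mirrored rpool on the first two disks is handled at install time;
    # the remaining six disks form two 3-disk raidz vdevs striped together.
    zpool create tank \
        raidz c0t2d0 c0t3d0 c0t4d0 \
        raidz c0t5d0 c0t6d0 c0t7d0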
Re: [zfs-discuss] ZFS Cache + ZIL on single SSD
Thank you Thomas and Mertol for your feedback.

I was indeed aiming for the X25-E because of their write performance. However, since these are around 350 € for 32 GB, I find it disturbing to only use it for ZIL :-) I will do some tests with a cheap MLC disk.

I also read about the disk cache needing to be disabled to avoid data loss after a power loss. If a journaling FS is run over iSCSI this should not become an issue, right? (Or will ZFS be affected by missing data from cache?)

Since I am hanging here now: do people experience CPU-load trouble when exporting iSCSI zvols because of the TCP/IP calculations? Or is it safe to ignore the TOE network cards?

Regards,
Armand

- Original Message -
From: Thomas Burgess
To: A. Krijgsman
Cc: zfs-discuss@opensolaris.org
Sent: Monday, January 11, 2010 3:02 AM
Subject: Re: [zfs-discuss] ZFS Cache + ZIL on single SSD

> Next to that I am reading all kind of performance benefits using separate
> devices for the ZIL (write) and the Cache (read).
>
> I was wondering if I could share a single SSD between both ZIL and Cache
> device? Or is this not recommended?

i asked something similar recently. The answers i got were along these lines:

you can use a single ssd but it's not a great idea. If you DO need to use a single ssd for such a thing, make sure it's one of the more expensive SLC variety like the intel x25-e. The MLC variety of SSD works well for L2ARC and is much cheaper (you can pick up some for less than 100 bucks) while the ZIL really should have the SLC variety. I'm not an expert though, i'm just passing on advice i've been given.

For reads and dedup L2ARC does make a dramatic difference, and for NFS and database stuff an SSD ZIL will make a huge difference. I've heard of people getting 5-10x's performance increase on reads just by adding a cheap ssd so i'd say it's worth it if that's the type of dataset you have.

Another thing i was told on more than one occasion was that you may not even NEED a ssd for your ZIL (basically you should run a script to see if you need a ZIL at all, i don't remember where this script is but i'm SURE someone will reply with it)
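For what it's worth, a sketch of sharing one SSD between both roles, assuming the disk has been pre-partitioned into two slices (device and pool names are placeholders; whether this is wise is exactly the question above):

    # One slice as a separate ZIL (log) device, the other as L2ARC (cache).
    zpool add tank log c2t1d0s0
    zpool add tank cache c2t1d0s1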
Re: [zfs-discuss] I/O Read starvation
> Are we still trying to solve the starvation problem?

I would argue the disk I/O model is fundamentally broken on Solaris if there is no fair I/O scheduling between multiple read sources; until that is fixed, individual I_am_systemstalled_while_doing_xyz problems will crop up. Started a new thread focussing on just this problem.

http://opensolaris.org/jive/thread.jspa?threadID=121479&tstart=0
Re: [zfs-discuss] I/O Read starvation
Hello,

On Jan 11, 2010, at 6:53 PM, bank kus wrote:

>> For example, you could set it to half your (8GB) memory so that 4GB is
>> immediately available for other uses.
>>
>> * Set maximum ZFS ARC size to 4GB
>
> capping max sounds like a good idea.

Are we still trying to solve the starvation problem?

I filed a bug on the non-ZFS related urandom stall problem yesterday, primarily since it can do nasty things from inside a resource-capped zone:

CR 6915579 solaris-cryp/random Large read from /dev/urandom can stall system

Regards
Henrik
http://sparcv9.blogspot.com
Re: [zfs-discuss] I/O Read starvation
> For example, you could set it to half your (8GB) memory so that 4GB is
> immediately available for other uses.
>
> * Set maximum ZFS ARC size to 4GB

capping max sounds like a good idea

thanks
banks
Re: [zfs-discuss] HW raid vs ZFS
On Mon, 11 Jan 2010, Anil wrote:

> What is the recommended way to make use of a Hardware RAID controller/HBA
> along with ZFS? Does it make sense to do RAID5 on the HW and then RAIDZ on
> the software? OR just stick to ZFS RAIDZ and connect the drives to the
> controller, w/o any HW RAID (to benefit from the batteries). What will
> give the most performance benefit?

ZFS will definitely benefit from battery backed RAM on the controller as long as the controller immediately acknowledges cache flushes (rather than waiting for battery-protected data to flush to the disks). There will be benefit as long as the size of the write data backlog does not exceed controller RAM size.

The notion of "performance" is highly arbitrary since there are different types of performance. What sort of performance do you need to optimize for?

I think that the best performance for most cases is to use mirroring. Use two controllers with battery-backed RAM and split the mirrors across the controllers so that a write to a mirror pair results in a write to each controller. Unfortunately, this is not nearly as space efficient as RAID5 or raidz.

Many people will recommend against using RAID5 in "hardware" since then zfs is not as capable of repairing errors, and because most RAID5 controller cards use a particular format on the drives so that the drives become tied to the controller brand/model and it is not possible to move the pool to a different system without using an identical controller. If the controller fails and is no longer available for purchase, or the controller is found to have a design defect, then the pool may be toast.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
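A minimal sketch of the controller-split mirroring Bob describes, assuming the two controllers show up as c1 and c2 (device names are illustrative only):

    # Each mirror pair takes one disk from each controller, so every write
    # lands in both controllers' battery-backed caches.
    zpool create tank \
        mirror c1t0d0 c2t0d0 \
        mirror c1t1d0 c2t1d0 \
        mirror c1t2d0 c2t2d0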
Re: [zfs-discuss] (Practical) limit on the number of snapshots?
Ok, tested this myself ...

(same hardware used for both tests)

OpenSolaris svn_104 (actually Nexenta Core 2):

1) 100 snaps

r...@nexenta:/volumes# time for i in $(seq 1 100); do zfs snapshot ssd/v...@test1_$i; done

real    0m24.991s
user    0m0.297s
sys     0m0.679s

Import:

r...@nexenta:/volumes# time zpool import ssd

real    0m25.053s
user    0m0.031s
sys     0m0.216s

2) 500 snaps (400 created, 500 imported)

r...@nexenta:/volumes# time for i in $(seq 101 500); do zfs snapshot ssd/v...@test1_$i; done

real    3m6.257s
user    0m1.190s
sys     0m2.896s

r...@nexenta:/volumes# time zpool import ssd

real    3m59.206s
user    0m0.091s
sys     0m0.956s

3) 1500 snaps (1000 created, 1500 imported)

r...@nexenta:/volumes# time for i in $(seq 501 1500); do zfs snapshot ssd/v...@test1_$i; done

real    22m23.206s
user    0m3.041s
sys     0m8.785s

r...@nexenta:/volumes# time zpool import ssd

real    36m26.765s
user    0m0.233s
sys     0m4.545s

You see where this goes - it's exponential !!

Now with svn_130 (same pool, still 1500 snaps on it)

.. now we are booting OpenSolaris svn_130.
Sun Microsystems Inc.   SunOS 5.11   snv_130   November 2008

r...@osol_dev130:~# zpool import
  pool: ssd
    id: 16128137881522033167
 state: ONLINE
status: The pool is formatted using an older on-disk version.
action: The pool can be imported using its name or numeric identifier, though
        some features will not be available without an explicit 'zpool upgrade'.
config:

        ssd     ONLINE
          c9d1  ONLINE

r...@osol_dev130:~# time zpool import ssd

real    0m0.756s
user    0m0.014s
sys     0m0.056s

r...@osol_dev130:~# zfs list -t snapshot | wc -l
    1502

r...@osol_dev130:~# time zpool export ssd

real    0m0.425s
user    0m0.003s
sys     0m0.029s

I like this one :)

... just for fun ... (5K snaps)

r...@osol_dev130:~# time for i in $(seq 1501 5000); do zfs snapshot ssd/v...@test1_$i; done

real    1m18.977s
user    0m9.889s
sys     0m19.969s

r...@osol_dev130:~# zpool export ssd
r...@osol_dev130:~# time zpool import ssd

real    0m0.421s
user    0m0.014s
sys     0m0.055s

... just for fun ... (10K snaps)

r...@osol_dev130:~# time for i in $(seq 5001 10000); do zfs snapshot ssd/v...@test1_$i; done

real    2m6.242s
user    0m14.107s
sys     0m28.573s

r...@osol_dev130:~# time zpool import ssd

real    0m0.405s
user    0m0.014s
sys     0m0.057s

Very nice, so volume import is solved.

.. however ... a lot of snaps still have an impact on system performance. After the import of the 10000 snaps volume, I saw "devfsadm" eating up all CPU:

load averages:  5.00,  3.32,  1.58;   up 0+00:18:12   18:50:05
99 processes: 95 sleeping, 2 running, 2 on cpu
CPU states:  0.0% idle,  4.6% user, 95.4% kernel,  0.0% iowait,  0.0% swap
Kernel: 409 ctxsw, 14 trap, 47665 intr, 1223 syscall
Memory: 8190M phys mem, 5285M free mem, 4087M total swap, 4087M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
   167 root        6  22    0   25M   13M run      3:14 49.41% devfsadm

... a truss showed that it is the device node allocation eating up the CPU:

/5:  0.0010 xstat(2, "/devices/pseudo/z...@0:8941,raw", 0xFE32FCE0) = 0
/5:  0.0005 fcntl(7, F_SETLK, 0xFE32FED0) = 0
/5:  0.     close(7) = 0
/5:  0.0001 lwp_unpark(3) = 0
/3:  0.0200 lwp_park(0xFE61EF58, 0) = 0
/3:  0.     time() = 1263232337
/5:  0.0001 open("/etc/dev/.devfsadm_dev.lock", O_RDWR|O_CREAT, 0644) = 7
/5:  0.0001 fcntl(7, F_SETLK, 0xFE32FEF0) = 0
/5:  0.     read(7, "A7\0\0\0", 4) = 4
/5:  0.0001 getpid() = 167 [1]
/5:  0.     getpid() = 167 [1]
/5:  0.0001 open("/devices/pseudo/devi...@0:devinfo", O_RDONLY) = 10
/5:  0.     ioctl(10, DINFOIDENT, 0x) = 57311
/5:  0.0138 ioctl(10, 0xDF06, 0xFE32FA60) = 2258109
/5:  0.0027 ioctl(10, DINFOUSRLD, 0x086CD000) = 2260992
/5:  0.0001 close(10) = 0
/5:  0.0015 modctl(MODGETNAME, 0xFE32F060, 0x0401, 0xFE32F05C, 0xFD1E0008) = 0
/5:  0.0010 xstat(2, "/devices/pseudo/z...@0:8941", 0xFE32FCE0) = 0
/5:  0.0005 fcntl(7, F_SETLK, 0xFE32FED0) = 0
/5:  0.     close(7) = 0
/5:  0.0001 lwp_unpark(3) = 0
/3:  0.0201 lwp_park(0xFE61EF58, 0) = 0
/3:  0.0001 time() = 1263232337
/5:  0.0001 open("/etc/dev/.devfsadm_dev.lock", O_RDWR
[zfs-discuss] HW raid vs ZFS
I am sure this is not the first discussion related to this... apologies for the duplication.

What is the recommended way to make use of a Hardware RAID controller/HBA along with ZFS? Does it make sense to do RAID5 on the HW and then RAIDZ on the software? OR just stick to ZFS RAIDZ and connect the drives to the controller, w/o any HW RAID (to benefit from the batteries). What will give the most performance benefit?

If there isn't much difference, I just feel like I am spending too much money on a RAID controller but not making 100% use of it. :) Perhaps the route I should take is to look for drives (SAS or SSD) with built-in batteries/capacitors - but then that probably becomes much more expensive than the HW RAID controller?

Thanks
Re: [zfs-discuss] I/O Read starvation
On Mon, 11 Jan 2010, bank kus wrote:

> However I noticed something weird, long after the file operations are done
> the free memory doesn't seem to grow back (below). Essentially ZFS File
> Data claims to use 76% of memory long after the file has been written. How
> does one reclaim it back? Is ZFS File Data a pool that once grown to a
> size doesn't shrink back even though its current contents might not be
> used by any process?

It is normal for the ZFS ARC to retain data as long as there is no other memory pressure. This should not cause a problem other than a small delay when starting an application which does need a lot of memory, since the ARC will give memory back to the kernel.

For better interactive use, you can place a cap on the maximum ARC size via an entry in /etc/system:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE

For example, you could set it to half your (8GB) memory so that 4GB is immediately available for other uses.

* Set maximum ZFS ARC size to 4GB
* http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE
* set zfs:zfs_arc_max = 0x100000000

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] I/O Read starvation
vmstat does show something interesting. The free memory shrinks while doing the first dd (generating the 8G file) from around 10G to 1.5Gish. The copy operations thereafter don't consume much and it stays at 1.2G after all operations have completed. (Btw, at the point of system sluggishness there's 1.5G free RAM, so that shouldn't explain the problem.)

However I noticed something weird: long after the file operations are done the free memory doesn't seem to grow back (below). Essentially ZFS File Data claims to use 76% of memory long after the file has been written. How does one reclaim it back? Is ZFS File Data a pool that once grown to a size doesn't shrink back even though its current contents might not be used by any process?

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     234696               916    7%
ZFS File Data             2384657              9315   76%
Anon                       145915               569    5%
Exec and libs                4250                16    0%
Page cache                  28582               111    1%
Free (cachelist)            53147               207    2%
Free (freelist)            290158              1133    9%

Total                     3141405             12271
Physical                  3141404             12271
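(For anyone wanting to reproduce this: the summary above is mdb's ::memstat dcmd, which can also be captured non-interactively, e.g.:)

    # Run the kernel debugger against the live kernel and print the
    # page-usage summary shown above.
    echo ::memstat | mdb -k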
Re: [zfs-discuss] Is LSI SAS3081E-R suitable for a ZFS NAS ?
According to various posts the LSI SAS3081E-R seems to work well with OpenSolaris, but I've become pretty wary after my recent problems with Areca-1680s. Could anyone please confirm that the LSI SAS3081E-R works well? Is hotplug supported?
It works well in Solaris 10, including hotplugging. -- Maurice Volaski, maurice.vola...@einstein.yu.edu Computing Support, Rose F. Kennedy Center Albert Einstein College of Medicine of Yeshiva University ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Repeating scrub does random fixes
Hi Gary, You might consider running OSOL on a later build, like build 130. Have you reviewed the fmdump -eV output to determine on which devices the ereports below have been generated? This might give you more clues as to what the issues are. I would also be curious whether you have any driver-level errors reported in /var/adm/messages or by the iostat -En command. Repeated random problems across disks make me think of cable or controller issues. Thanks, Cindy On 01/10/10 08:40, Gary Gendel wrote: I've been using a 5-disk raidZ for years on an SXCE machine, which I converted to OSOL. The only time I ever had zfs problems in SXCE was with snv_120, which was fixed. Now I'm at OSOL snv_111b and I'm finding that scrub repairs errors on random disks. If I repeat the scrub, it will fix errors on other disks. Occasionally it runs cleanly. That it doesn't happen in a consistent manner makes me believe it's not hardware related. fmdump only reports three types of errors:
ereport.fs.zfs.checksum
ereport.io.scsi.cmd.disk.tran
ereport.io.scsi.cmd.disk.recovered
The middle one seems to be the issue; I'd like to track down the source. Any docs on how to do this? Thanks, Gary ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
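(A minimal sketch of the checks Cindy suggests - all stock commands; adjust the grep pattern to your controller driver:)
# detailed FMA ereports, including which device generated each one
fmdump -eV | less
# per-device soft/hard/transport error counters
iostat -En
# driver-level complaints (resets, timeouts, cabling)
grep -i scsi /var/adm/messages | tail -50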
Re: [zfs-discuss] x4500 failed disk, not sure if hot spare took over correctly
Hi Paul, Example 11-1 in this section describes how to replace a disk on an x4500 system: http://docs.sun.com/app/docs/doc/819-5461/gbcet?a=view Cindy On 01/09/10 16:17, Paul B. Henson wrote: On Sat, 9 Jan 2010, Eric Schrock wrote: If ZFS removed the drive from the pool, why does the system keep complaining about it? It's not failing in the sense that it's returning I/O errors, but it's flaky, so it's attaching and detaching. Most likely it decided to attach again and then you got transport errors. Ok, how do I make it stop logging messages about the drive until it is replaced? It's still filling up the logs with the same errors about the drive being offline. Looks like hdadm isn't it:
r...@cartman ~ # hdadm offline disk c1t2d0
/usr/bin/hdadm[1762]: /dev/rdsk/c1t2d0d0p0: cannot open
/dev/rdsk/c1t2d0d0p0 is not available
Hmm, I was able to unconfigure it with cfgadm:
r...@cartman ~ # cfgadm -c unconfigure sata1/2::dsk/c1t2d0
It went from:
sata1/2::dsk/c1t2d0   disk   connected   configured     failed
to:
sata1/2               disk   connected   unconfigured   failed
Hopefully that will stop the errors until it's replaced and not break anything else :). No, it's fine. DEGRADED just means the pool is not operating at the ideal state. By definition a hot spare is always DEGRADED. As long as the spare itself is ONLINE it's fine. The spare shows as "INUSE", but I'm guessing that's fine too. Hope that helps That was perfect, thank you very much for the review. Now I can not worry about it until Monday :). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool destroy -f hangs system, now zpool import hangs system.
On Wed, Jan 6, 2010 at 12:11 PM, Carl Rathman wrote: > On Tue, Jan 5, 2010 at 10:35 AM, Carl Rathman wrote: >> On Tue, Jan 5, 2010 at 10:12 AM, Richard Elling >> wrote: >>> On Jan 5, 2010, at 7:54 AM, Carl Rathman wrote: >>> I didn't mean to destroy the pool. I used zpool destroy on a zvol, when I should have used zfs destroy. When I used zpool destroy -f mypool/myvolume the machine hard locked after about 20 minutes. >>> >>> This would be a bug. "zpool destroy" should only destroy pools. >>> Volumes are datasets and are destroyed by "zfs destroy." Using >>> "zpool destroy -f" will attempt to force unmounts of any mounted >>> datasets, but volumes are not mounted, per se. Upon reboot, nothing >>> will be mounted until after the pool is imported. >>> >>> I don't want to destroy the pool, I just wanted to destroy the one volume. -- Which is why I now want to import the pool itself. Does that make sense? >>> >>> If the pool was destroyed, then you can try to import using -D. >>> >>> Are you sure you didn't "zfs destroy" instead? Once the pool is imported, >>> "zpool history" will show all of the commands issued against the pool. >>> -- richard >>> >>> >> >> Hi Richard, >> >> If I could import the pool, I'd love to do a history on it. >> >> At this point, if I attempt to import the pool, the machine will have >> heavy disk activity on the pool for approximately 10 minutes, then the >> machine will hard lock. This will happen when I boot the machine from >> its snv_130 rpool, or if I boot the machine from a snv_130 live cd. >> >> Thanks, >> Carl >> > > Any suggestions on how to begin debugging this, or if data recovery is > possible? > > Thanks, > Carl > Just wanted to update everyone on this... I installed 2009.06 (snv_111b), and gave the import of the pool one last try. After approximately 20 minutes of grinding, the pool imported properly! No clue why, but all seems to be working now. Thanks for the insight from the list, I really appreciate it. -Carl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
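(For anyone who lands on this thread later, a short sketch of the checks Richard suggested, using the pool name from this thread:)
# once the pool is imported, see exactly which commands were run against it
zpool history mypool
# if a whole pool (not just a dataset) really was destroyed, it can usually
# be listed and re-imported with -D
zpool import -D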
Re: [zfs-discuss] Help needed backing ZFS to tape
Good question. Zmanda seems to be a popular open source solution with commercial licenses and support available. We try to keep the Best Practices Guide up to date on this topic: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Using_ZFS_With_Enterprise_Backup_Solutions Additions or corrections are greatly appreciated. -- richard On Jan 11, 2010, at 7:13 AM, Julian Regel wrote: > Hi > > We have a number of customers (~150) that have a single Sun server with > directly attached storage and directly attached tape drive/library. These > servers are currently running UFS, but we are looking at deploying ZFS in > future builds. > > At present, we backup the server to the local tape drive using ufsdump, but > there appears to be no equivalent for ZFS. Is anyone aware why Sun have never > provided a zfsdump/zfsrestore for this sort of configuration? > > Can anyone advise the best way to backup a ZFS-based server to a locally > attached tape drive? It is possible that the filesystems to backup are bigger > than a single tape, so multiple volume support is a must. > > I thought about doing a "zfs send tank/f...@monday > /dev/rmt/0" but this > won't manage multiple tapes. Also, I'm not sure if this contains all the > metadata required to perform a bare metal restore. > > Any help much appreciated! > > Thanks > > JR > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
On Mon, 11 Jan 2010, Kjetil Torgrim Homme wrote: (BTW, thank you for testing forceful removal of power. the result is as expected, but it's good to see that theory and practice match.) Actually, the result is not "as expected" since the device should not have lost any data preceding a cache flush request. This sort of result should be cause for concern for anyone currently using one of these devices as a ZFS log device, or using it for any write-sensitive application at all. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Help needed backing ZFS to tape
Hi We have a number of customers (~150) that have a single Sun server with directly attached storage and a directly attached tape drive/library. These servers are currently running UFS, but we are looking at deploying ZFS in future builds. At present, we back up the server to the local tape drive using ufsdump, but there appears to be no equivalent for ZFS. Is anyone aware why Sun have never provided a zfsdump/zfsrestore for this sort of configuration? Can anyone advise the best way to back up a ZFS-based server to a locally attached tape drive? It is possible that the filesystems to back up are bigger than a single tape, so multiple volume support is a must. I thought about doing a "zfs send tank/f...@monday > /dev/rmt/0" but this won't manage multiple tapes. Also, I'm not sure if this contains all the metadata required to perform a bare metal restore. Any help much appreciated! Thanks JR ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
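(Not an answer to the multi-tape problem, but a minimal sketch of the send-to-stage approach mentioned above; the dataset and staging paths here are made up for illustration:)
# snapshot, then capture the send stream to disk; the stream itself has no
# notion of tape boundaries, so spanning across tapes has to be handled by
# whatever writes the staged file to tape (Amanda/Zmanda, NetBackup, etc.)
zfs snapshot tank/fs@monday
zfs send tank/fs@monday > /backup/staging/tank-fs-monday.zfs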
Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
Maybe it got lost in this much text :) .. thus this re-post. Does anyone know the impact of disabling the write cache on the write amplification factor of the Intel SSDs? How can I permanently disable the write cache on the Intel X25-M SSDs? Thanks, Robert -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
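(For the second question, one approach that is commonly reported to work is the expert-mode cache menu in format; a sketch follows - whether the setting survives a power cycle depends on the drive firmware, so verify it after a reboot:)
# format -e exposes a cache -> write_cache submenu on drives that support it
format -e
    (select the X25-M from the disk list)
    format> cache
    cache> write_cache
    write_cache> disable
    write_cache> display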
[zfs-discuss] Is LSI SAS3081E-R suitable for a ZFS NAS ?
According to various posts the LSI SAS3081E-R seems to work well with OpenSolaris, but I've become pretty wary after my recent problems with Areca-1680s. Could anyone please confirm that the LSI SAS3081E-R works well? Is hotplug supported? Anything else I should know before buying one of these cards? Thanks, Arnaud ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] unable to zfs destroy
On Fri, 8 Jan 2010, Rob Logan wrote: this one has me a little confused. ideas?
j...@opensolaris:~# zpool import z
cannot mount 'z/nukeme': mountpoint or dataset is busy
cannot share 'z/cle2003-1': smb add share failed
j...@opensolaris:~# zfs destroy z/nukeme
internal error: Bad exchange descriptor
EBADE is used by ZFS to indicate checksum errors, which supports the zpool status output:
config:
        NAME          STATE     READ WRITE CKSUM
        z             ONLINE       0     0     2
          c3t0d0s7    ONLINE       0     0     4
          c3t1d0s7    ONLINE       0     0     0
          c2d0        ONLINE       0     0     4
errors: Permanent errors have been detected in the following files:
        z/nukeme:<0x0>
So, yeah, without a way for zfs to repair itself, it looks like the only way forward is to destroy the zpool and restore from a backup. Regards, markm ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
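(Before destroying the pool, it may be worth one more repair pass; a short sketch:)
zpool scrub z          # one more repair attempt
zpool status -v z      # -v lists the objects with permanent errors
zpool clear z          # reset the error counters after any cleanup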
Re: [zfs-discuss] Repeating scrub does random fixes
I've just run a couple of consecutive scrubs; each time it found a couple of checksum errors, but on different drives. No indication of any other errors. That a disk scrubs cleanly on a quiescent pool in one run but fails in the next is puzzling. It reminds me of the snv_120 odd-number-of-disks raidz bug I reported. Looks like I've got to bite the bullet, upgrade to the dev tree, and hope for the best. Gary -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
Lutz Schumann writes: > Actually the performance decrease when disabling the write cache on > the SSD is approx. 3x (aka 66%). for this reason, you want a controller with a battery-backed write cache. in practice this means a RAID controller, even if you don't use the RAID functionality. of course you can buy SSDs with capacitors, too, but I think that will be more expensive, and it will restrict your choice of model severely. (BTW, thank you for testing forceful removal of power. the result is as expected, but it's good to see that theory and practice match.) -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] I/O Read starvation
Hi Banks, Some basic stats might shed some light, e.g. vmstat 5, mpstat 5, iostat -xnz 5, prstat -Lmc 5 ... all running from just before you start the tests until things are "normal" again. Memory starvation is certainly a possibility. The ARC can be greedy and slow to release memory under pressure. Phil Sent from my iPhone On 10 Jan 2010, at 13:29, bank kus wrote: Hi Phil You make some interesting points here:
-> yes, bs=1G was a lazy thing
-> the GNU cp I'm using does __not__ appear to use mmap; open64, open64, read, write, close, close is the relevant sequence
-> replacing cp with dd (128K * 64K) does not help; no new apps can be launched until the copies complete.
Regards banks -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
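(A throwaway way to collect the stats Phil lists, run from a second shell just before starting the test; the output file names are only examples:)
vmstat 5       > /var/tmp/vmstat.out  &
mpstat 5       > /var/tmp/mpstat.out  &
iostat -xnz 5  > /var/tmp/iostat.out  &
prstat -Lmc 5  > /var/tmp/prstat.out  &
# ... run the dd/cp workload, then stop the collectors:
kill %1 %2 %3 %4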
Re: [zfs-discuss] ZFS import hangs with over 66000 context switches shown in top
I should also mention that once the "lock" starts, the disk activity light on my case stays busy for a bit (1-2 minutes MAX), then does nothing. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS import hangs with over 66000 context switches shown in top
Howdy All, I made a 1 TB zfs volume within a 4.5 TB zpool called vault for testing iSCSI. Both dedup and compression were off. After my tests, I issued a zfs destroy to remove the volume. This command hung. After 5 hours, I hard rebooted into single user mode and removed my zfs cache file (I had to do this in order to boot up; with the zfs cache file in place, my system would hang while reading the zfs config). Now I cannot import my pool, as the box always hangs after about 30 minutes. It's not a complete hang: I can still ping the box, but I cannot do anything. The keyboard is still responsive, but the server will do nothing with any input I make. I cannot ssh to the box either. The only thing I can do is hard reboot the box. At first I thought I was running out of RAM, because the hang always happened right when my free RAM hit 0 (still had swap available, however), but I've made tweaks to /etc/system and now I get freezes with over a gig of RAM free (8GB total in the box). The strange thing is, the context switches shown in top skyrocket from about 2000-6000 to over 66,000 just before the freeze. Would anyone know why they would skyrocket like that? If I do a ps -ef before the freeze, there is a normal number of processes running. I have also tried a zpool import -f vault using a snv_130 live CD, as well as trying a zpool import -fFX vault. The same thing happens.
System Specs:
snv_130
AMD Phenom 925
8GB DDR2 RAM
2x 500 GB rpool mirrored drives
4x 1.5TB vault raidz1 drives
I have let the import run for over 24 hours with no luck. Thanks for the assistance. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
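(If it locks up again, grabbing kernel thread stacks from a second terminal before the box stops responding can show where the import is stuck; a sketch, assuming mdb is available on the live environment:)
# kernel thread stacks, to see what the import threads are blocked on
echo "::threadlist -v" | mdb -k > /var/tmp/threads.txt
# watch free memory and the cs (context switch) column leading up to the hang
vmstat 5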