Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On Tue, Jan 22, 2013 at 11:54:53PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Nico Williams
> > As for swap... really, you don't want to swap. If you're swapping you have problems.
> In solaris, I've never seen it swap out idle processes; I've only seen it use swap for the bad bad bad situation. I assume that's all it can do with swap.

You would be wrong. Solaris uses swap space for paging. Paging out unused portions of an executing process from real memory to the swap device is certainly beneficial. Swapping out complete processes is a desperation move, but paging out most of an idle process is a good thing.

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] LUN sizes
On Mon, Oct 29, 2012 at 09:30:47AM -0500, Brian Wilson wrote:
> First I'd like to note that, contrary to the nomenclature, there isn't any one SAN product that all operates the same way. There are a number of different vendor-provided solutions that use an FC SAN to deliver LUNs to hosts, and they each have their own limitations. Forgive my pedantry, please.
>
> > On Sun, Oct 28, 2012 at 04:43:34PM +0700, Fajar A. Nugraha wrote:
> > > On Sat, Oct 27, 2012 at 9:16 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> > > So my suggestion is actually just present one huge 25TB LUN to zfs and let the SAN handle redundancy.
> >
> > You are entering the uncharted waters of ``multi-level disk management'' here. Both ZFS and the SAN use redundancy and error-checking to ensure data integrity. Both of them also do automatic replacement of failing disks. A good SAN will present LUNs that behave as perfectly reliable virtual disks, guaranteed to be error free. Almost all of the time, ZFS will find no errors. If ZFS does find an error, there's no nice way to recover. Most commonly, this happens when the SAN is powered down or rebooted while the ZFS host is still running.
>
> On your host side, there's also the consideration of ssd/scsi queuing. If you're running on only one LUN, you're limiting your IOPS to only one IO queue over your FC paths, and if you have that throttled (per many storage vendors' recommendations about ssd:ssd_max_throttle and zfs:zfs_vdev_max_pending), then one LUN will throttle your IOPS back on your host. That might also motivate you to split into multiple LUNs so your OS doesn't end up bottlenecking your IO before it even gets to your SAN HBA.

That's a performance issue rather than a reliability issue. The other performance issue to consider is block size.
At the last place I worked, we used an iSCSI LUN from a Netapp filer. This LUN reported a block size of 512 bytes, even though the Netapp itself used a 4K block size. This meant that the filer was doing the block-size conversion, resulting in much more I/O than the ZFS layer intended. The fact that Netapp does COW made this situation even worse. My impression was that very few of their customers encountered this performance problem because almost all of them used their Netapp only for NFS or CIFS. Our Netapp was extremely reliable but did not have the iSCSI LUN performance that we needed.

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
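[For reference, the two throttle tunables Brian mentions are set in /etc/system. A sketch only; the values below are illustrative, not recommendations, and your storage vendor's guidance should take precedence:]

```
* /etc/system -- illustrative values only
* Limit the SCSI command queue depth per LUN
* (the driver is ssd on SPARC, sd on x86)
set ssd:ssd_max_throttle = 32
* Limit the outstanding I/Os that ZFS queues per vdev
set zfs:zfs_vdev_max_pending = 10
```

A reboot is required for /etc/system changes to take effect.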
Re: [zfs-discuss] Zpool LUN Sizes
On Sun, Oct 28, 2012 at 04:43:34PM +0700, Fajar A. Nugraha wrote:
> On Sat, Oct 27, 2012 at 9:16 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Fajar A. Nugraha
> > > So my suggestion is actually just present one huge 25TB LUN to zfs and let the SAN handle redundancy.
> > create a bunch of 1-disk volumes and let ZFS handle them as if they're JBOD.
> Last time I used IBM's enterprise storage (which was, admittedly, a long time ago) you couldn't even do that. And looking at Morris' mail address, it should be relevant :) ... or probably it's just me who hasn't found how to do that. Which is why I suggested just using whatever the SAN can present :)

You are entering the uncharted waters of ``multi-level disk management'' here. Both ZFS and the SAN use redundancy and error-checking to ensure data integrity. Both of them also do automatic replacement of failing disks. A good SAN will present LUNs that behave as perfectly reliable virtual disks, guaranteed to be error free. Almost all of the time, ZFS will find no errors. If ZFS does find an error, there's no nice way to recover. Most commonly, this happens when the SAN is powered down or rebooted while the ZFS host is still running.

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [zfs-discuss] What happens when you rm zpool.cache?
On Sun, Oct 21, 2012 at 11:40:31AM +0200, Bogdan Ćulibrk wrote:
> Follow-up question regarding this: is there any way to disable automatic import of any non-rpool on boot without any hacks of removing zpool.cache?

Certainly. Import it with an alternate cache file. You do this by specifying the `cachefile' property on the command line. The `zpool' man page describes how to do this.

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
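[A sketch of what that looks like; the pool name and cache path here are invented for illustration:]

```
# Import with an alternate cache file, so the pool is not
# recorded in the default /etc/zfs/zpool.cache and will not
# be imported automatically at boot
zpool import -o cachefile=/etc/zfs/tank.cache tank

# Or keep the pool out of any cache file entirely
zpool import -o cachefile=none tank
```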
Re: [zfs-discuss] [developer] Re: History of EPERM for unlink() of directories on ZFS?
On Tue, Jun 26, 2012 at 10:41:14AM -0500, Nico Williams wrote:
> On Tue, Jun 26, 2012 at 9:44 AM, Alan Coopersmith alan.coopersm...@oracle.com wrote:
> > On 06/26/12 05:46 AM, Lionel Cons wrote:
> > > On 25 June 2012 11:33, casper@oracle.com wrote:
> > > > To be honest, I think we should also remove this from all other filesystems, and I think ZFS was created this way because all modern filesystems do it that way.
> > > This may be the wrong way to go if it breaks existing applications which rely on this feature. It does break applications in our case.
> > Existing applications rely on the ability to corrupt UFS filesystems? Sounds horrible.
> My guess is that the OP just wants unlink() of an empty directory to be the same as rmdir() of the same. Or perhaps they want unlink() of a non-empty directory to result in a recursive rm... But if they really want hardlinks to directories, then yeah, that's horrible.

This all sounds like a good use for LD_PRELOAD and a tiny library that intercepts and modernizes system calls.

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [zfs-discuss] zfs and iscsi performance help
On Fri, Jan 27, 2012 at 03:25:39PM +1100, Ivan Rodriguez wrote:
> We have a backup server with a zpool size of 20 TB. We transfer information using zfs snapshots every day (we have around 300 filesystems on that pool). The storage is a Dell MD3000i connected by iSCSI, and the pool is currently version 10. The same storage is connected to another server with a smaller pool of 3 TB (zpool version 10); that server is working fine, and speed is good between the storage and the server. However, on the server with the 20 TB pool, performance is an issue. After we restart the server, performance is good, but with time, let's say a week, the performance keeps dropping until we have to bounce the server again (same behavior with the new version of Solaris; in that case performance drops in 2 days). No errors in the logs, on the storage, or in `zpool status -v'.

This sounds like a ZFS cache problem on the server. You might check on how cache statistics change over time. Some tuning may eliminate this degradation. More memory may also help. Does a scrub show any errors? Does the performance drop affect reads or writes or both?

> We suspect that the pool has some issues; probably there is corruption somewhere. We tested Solaris 10 8/11 with zpool 29, although we haven't updated the pool itself. With the new Solaris the performance is even worse, and every time we restart the server we get stuff like this:
>
>     SOURCE: zfs-diagnosis, REV: 1.0
>     EVENT-ID: 0168621d-3f61-c1fc-bc73-c50efaa836f4
>     DESC: All faults associated with an event id have been addressed. Refer to http://sun.com/msg/FMD-8000-4M for more information.
>     AUTO-RESPONSE: Some system components offlined because of the original fault may have been brought back online.
>     IMPACT: Performance degradation of the system due to the original fault may have been recovered.
>     REC-ACTION: Use fmdump -v -u EVENT-ID to identify the repaired components.
>     [ID 377184 daemon.notice] SUNW-MSG-ID: FMD-8000-6U, TYPE: Resolved, VER: 1, SEVERITY: Minor
>
> And we need to export and import the pool in order to be able to access it.

This is a separate problem, introduced with an upgrade to the iSCSI service. The new one has a dependency on the name service (typically DNS), which means that it isn't available when the zpool import is done during the boot. Check with Oracle support to see if they have found a solution.

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
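[One way to check whether the iSCSI service on your release really does wait on the name service is to inspect its SMF dependencies. A sketch; the exact FMRIs may differ between Solaris releases:]

```
# List the services the iSCSI initiator depends on
svcs -d svc:/network/iscsi/initiator:default

# List the services that depend on the DNS client
svcs -D svc:/network/dns/client:default
```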
Re: [zfs-discuss] unable to access the zpool after issue a reboot
On Thu, Jan 26, 2012 at 04:36:58PM +0100, Christian Meier wrote:
> Hi Sudheer
> > 3) bash-3.2# zpool status
> >     pool: <pool name>
> >    state: UNAVAIL
> >   status: One or more devices could not be opened. There are insufficient replicas for the pool to continue functioning.
> >   action: Attach the missing device and online it using 'zpool online'.
> >      see: http://www.sun.com/msg/ZFS-8000-3C
> >     scan: none requested
> >   config:
> >       NAME          STATE    READ WRITE CKSUM
> >       <pool name>   UNAVAIL     0     0     0  insufficient replicas
> >       c5t1d1        UNAVAIL     0     0     0  cannot open

This means that, at the time of that import, device c5t1d1 was not available. What does `ls -l /dev/rdsk/c5t1d1s0' show for the physical path?

> And the important thing is, when I export and import the zpool, then I am able to access it.

Yes, later the device became available. After the boot, `svcs' will show you the services listed in order of their completion times. The ZFS mount is done by this service: svc:/system/filesystem/local:default. The zpool import (without the mount) is done earlier. Check to see if any of the FC services run too late during the boot.

> As Gary and Bob mentioned, I saw this issue with iSCSI devices. Instead of export / import, is a zpool clear also working?
> mpathadm list LU
> mpathadm show LU /dev/rdsk/c5t1d1s2

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [zfs-discuss] unable to access the zpool after issue a reboot
On Tue, Jan 24, 2012 at 05:33:39PM +0530, sureshkumar wrote:
> I am new to Solaris. I am facing an issue with the dynapath [multipath s/w] for Solaris 10u10 x86. I am facing an issue with the zpool. My problem is that I am unable to access the zpool after issuing a reboot.

I've seen this happen when the zpool was built on an iSCSI LUN. At reboot time, the ZFS import was done before the iSCSI driver was able to connect to its target. After the system was up, an export and import was successful. The solution was to add a new service that imported the zpool later during the reboot.

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
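[A minimal sketch of the method script for such a deferred-import service. The pool name `tank' and the script path are invented for illustration, and the accompanying manifest would have to declare a dependency on the iSCSI initiator service:]

```
#!/sbin/sh
# Hypothetical SMF method: import the pool only after iSCSI is up.
. /lib/svc/share/smf_include.sh

case "$1" in
start)
        # Import without recording the pool in the default cache
        # file, so the early boot-time import does not see it
        zpool import -o cachefile=none tank || exit $SMF_EXIT_ERR_FATAL
        ;;
stop)
        zpool export tank
        ;;
*)
        echo "Usage: $0 start|stop"
        exit $SMF_EXIT_ERR_CONFIG
        ;;
esac
exit $SMF_EXIT_OK
```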
Re: [zfs-discuss] zfs defragmentation via resilvering?
On Mon, Jan 16, 2012 at 09:13:03AM -0600, Bob Friesenhahn wrote:
> On Mon, 16 Jan 2012, Jim Klimov wrote:
> > I think that in order to create a truly fragmented ZFS layout, Edward needs to do sync writes (without a ZIL?) so that every block and its metadata go to disk (coalesced as they may be) and no two blocks of the file would be sequenced on disk together. Although creating snapshots should give that effect...
> In my experience, most files on Unix systems are re-written from scratch. For example, when one edits a file in an editor, the editor loads the file into memory, performs the edit, and then writes out the whole file. Given sufficient free disk space, these files are unlikely to be fragmented. The cases of slowly written log files or random-access databases are the worst for causing fragmentation.

The case I've seen was with an IMAP server with many users. E-mail folders were represented as ZFS directories, and e-mail messages as files within those directories. New messages arrived randomly in the INBOX folder, so those files were written all over the place on the storage. Users also deleted many messages from their INBOX folder, but the files were retained in snapshots for two weeks. On IMAP session startup, the server typically had to read all of the messages in the INBOX folder, making this portion slow. The server also had to refresh the folder whenever new messages arrived, making that portion slow as well. Performance degraded when the storage became 50% full. It would improve markedly when the oldest snapshot was deleted.

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
On Sun, Jan 15, 2012 at 04:06:33PM, Peter Tribble wrote:
> On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov jimkli...@cos.ru wrote:
> > Does raidzN actually protect against bitrot? That's a kind of radical, possibly offensive, question formula that I have lately.
> Yup, it does. That's why many of us use it.

There's actually no such thing as bitrot on a disk. Each sector on the disk is accompanied by a CRC that's verified by the disk controller on each read. It will either return correct data or report an unreadable sector. There's nothing in between. Of course, if something outside of ZFS writes to the disk, then data belonging to ZFS will be modified. I've heard of RAID controllers or SAN devices doing this when they modify the disk geometry or reserved areas on the disk.

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [zfs-discuss] Very poor pool performance - no zfs/controller errors?!
On Mon, Dec 19, 2011 at 11:58:57AM, Jan-Aage Frydenbø-Bruvoll wrote:
> 2011/12/19 Hung-Sheng Tsao (laoTsao) laot...@gmail.com:
> > did you run a scrub?
> Yes, as part of the previous drive failure. Nothing reported there. Now, interestingly - I deleted two of the oldest snapshots yesterday, and guess what - the performance went back to normal for a while. Now it is severely dropping again - after a good while at 1.5-2GB/s I am again seeing write performance in the 1-10MB/s range.

That behavior is a symptom of fragmentation. Writes slow down dramatically when there are no contiguous blocks available. Deleting a snapshot provides some of these, but only temporarily.

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
[zfs-discuss] Does the zpool cache file affect import?
I have a system with ZFS root that imports another zpool from a start method. It uses a separate cache file for this zpool, like this:

    if [ -f $CCACHE ]
    then
        echo "Importing $CPOOL with cache $CCACHE"
        zpool import -o cachefile=$CCACHE -c $CCACHE $CPOOL
    else
        echo "Importing $CPOOL with device scan"
        zpool import -o cachefile=$CCACHE $CPOOL
    fi

It also exports that zpool from the stop method, which has the side effect of deleting the cache. This all works nicely when the server is rebooted. What will happen when the server is halted without running the stop method, so that that zpool is not exported? I know that there is a flag in the zpool that indicates when it's been exported cleanly. The cache file will exist when the server reboots. Will the import fail with the `The pool was last accessed by another system.' error, or will the import succeed? Does the cache change the import behavior? Does it recognize that the server is the same system? I don't want to include the `-f' flag in the commands above when it's not needed.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] How create a FAT filesystem on a zvol?
On Sun, Jul 10, 2011 at 11:16:02PM +0700, Fajar A. Nugraha wrote:
> On Sun, Jul 10, 2011 at 10:10 PM, Gary Mills mi...@cc.umanitoba.ca wrote:
> > The `lofiadm' man page describes how to export a file as a block device and then use `mkfs -F pcfs' to create a FAT filesystem on it. Can't I do the same thing by first creating a zvol and then creating a FAT filesystem on it?
> Seems not. [...] Some Solaris tools (like fdisk, or mkfs -F pcfs) need disk geometry to function properly, and zvols don't provide that. If you want to use zvols with such tools, the easiest way would be using lofi, or exporting the zvol as an iSCSI share and importing it again. For example, if you have a 10MB zvol and use lofi, fdisk would show this geometry:
>
>     Total disk size is 34 cylinders
>     Cylinder size is 602 (512 byte) blocks
>
> ... which will then be used if you run mkfs -F pcfs -o nofdisk,size=20480. Without lofi, the same command would fail with:
>
>     Drive geometry lookup (need tracks/cylinder and/or sectors/track): Operation not supported

So, why can I do it with UFS?

    # zfs create -V 10m rpool/vol1
    # newfs /dev/zvol/rdsk/rpool/vol1
    newfs: construct a new file system /dev/zvol/rdsk/rpool/vol1: (y/n)? y
    Warning: 4130 sector(s) in last cylinder unallocated
    /dev/zvol/rdsk/rpool/vol1: 20446 sectors in 4 cylinders of 48 tracks, 128 sectors
        10.0MB in 1 cyl groups (14 c/g, 42.00MB/g, 20160 i/g)
    super-block backups (for fsck -F ufs -o b=#) at: 32,

Why is this different from PCFS?

-- 
-Gary Mills--Unix Group--Computer and Network Services-
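[Putting Fajar's lofi suggestion together, the workaround looks roughly like this. A sketch only: the zvol name is invented, the lofi device number depends on what lofiadm prints, and this assumes lofiadm will accept the zvol device node as its backing store:]

```
# Create the zvol backing store
zfs create -V 10m rpool/fatvol

# Attach it to a lofi device; lofiadm prints the device name,
# /dev/lofi/1 for example
lofiadm -a /dev/zvol/dsk/rpool/fatvol

# Build the FAT filesystem through the lofi device, which
# supplies the geometry that mkfs -F pcfs needs
mkfs -F pcfs -o nofdisk,size=20480 /dev/rlofi/1

# Detach when done
lofiadm -d /dev/lofi/1
```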
[zfs-discuss] How create a FAT filesystem on a zvol?
The `lofiadm' man page describes how to export a file as a block device and then use `mkfs -F pcfs' to create a FAT filesystem on it. Can't I do the same thing by first creating a zvol and then creating a FAT filesystem on it? Nothing I've tried seems to work. Isn't the zvol just another block device?

-- 
-Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Sun, Jun 19, 2011 at 08:03:25AM -0700, Richard Elling wrote:
> On Jun 19, 2011, at 6:28 AM, Edward Ned Harvey wrote:
> > > From: Richard Elling [mailto:richard.ell...@gmail.com] Sent: Saturday, June 18, 2011 7:47 PM
> > > Actually, all of the data I've gathered recently shows that the number of IOPS does not significantly increase for HDDs running random workloads. However the response time does :-(
> > Could you clarify what you mean by that?
> Yes. I've been looking at what the value of zfs_vdev_max_pending should be. The old value was 35 (a guess, but a really bad guess) and the new value is 10 (another guess, but a better guess). I observe that for a fast, modern HDD, with 1-10 threads (outstanding I/Os), the IOPS range from 309 to 333. But as we add threads, the average response time increases from 2.3ms to 137ms. Since the whole idea is to get lower response time, and we know disks are not simple queues so there is no direct IOPS-to-response-time relationship, maybe it is simply better to limit the number of outstanding I/Os.

How would this work for a storage device with an intelligent controller that provides only a few LUNs to the host, even though it contains a much larger number of disks? I would expect the controller to be more efficient with a large number of outstanding I/Os because it could distribute those I/Os across the disks. It would, of course, require a non-volatile cache to provide fast turnaround for writes.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
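[The queue-depth effect Richard describes can be watched on a live system with iostat; a sketch of what to look at:]

```
# Extended device statistics at 5-second intervals.
# The actv column shows the number of commands actively being
# serviced (outstanding I/Os per device), and asvc_t the average
# service time in milliseconds -- watch asvc_t climb as actv grows.
iostat -xn 5
```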
Re: [zfs-discuss] JBOD recommendation for ZFS usage
On Mon, May 30, 2011 at 08:06:31AM +0200, Thomas Nau wrote:
> We are looking for JBOD systems which (1) hold 20+ 3.5" SATA drives, (2) are rack mountable, (3) have all the nice hot-swap stuff, and (4) allow 2 hosts to connect via SAS (4+ lanes per host) and see all available drives as disks, no RAID volume. In a perfect world both hosts would connect, each using two independent SAS connectors. The box will be used in a ZFS Solaris-based fileserver in a fail-over cluster setup. Only one host will access a drive at any given time.

I'm using a J4200 array as shared storage for a cluster. It needs a SAS HBA in each cluster node. The disks in the array are visible to both nodes in the cluster. Here's the feature list; I don't know if it's still available:

    Sun Storage J4200 Array:
    * Scales up to 48 SAS/SATA disk drives
    * Up to 72 Gb/sec of total bandwidth
    * Four x4-wide 3 Gb/sec SAS host/uplink ports (48 Gb/sec bandwidth)
    * Two x4-wide 3 Gb/sec SAS expansion ports (24 Gb/sec bandwidth)

-- 
-Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] Best practice for boot partition layout in ZFS
On Wed, Apr 06, 2011 at 08:08:06AM -0700, Erik Trimble wrote:
> On 4/6/2011 7:50 AM, Lori Alt wrote:
> > On 04/6/11 07:59 AM, Arjun YK wrote:
> > I'm not sure there's a defined best practice. Maybe someone else can answer that question. My guess is that in environments where, before, a separate ufs /var slice was used, a separate zfs /var dataset with a quota might now be appropriate.
> Traditionally, the reason for a separate /var was one of two major items: (a) /var was writable and / wasn't - this was typical of diskless or minimal local-disk configurations; modern packaging systems are making this kind of configuration increasingly difficult. (b) /var held a substantial amount of data which needed to be handled separately from / - mail and news servers are a classic example. For typical machines nowadays, with large root disks, there is very little chance of /var suddenly exploding and filling / (the classic example of being screwed... wink). Outside of the above two cases, about the only other place I can see that having /var separate is a good idea is for certain test machines where you expect frequent memory dumps (in /var/crash) - if you have a large amount of RAM, you'll need a lot of disk space, so it might be good to limit /var in this case by making it a separate dataset.

People forget (c), the ability to set different filesystem options on /var. You might want to have `setuid=off' for improved security, for example.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
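[Point (c) can be sketched in one command; the dataset name and property values here are invented for illustration:]

```
# A separate /var dataset with a quota and tighter options
zfs create -o mountpoint=/var -o quota=8g \
    -o setuid=off -o devices=off rpool/var
```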
[zfs-discuss] One LUN per RAID group
With ZFS on a Solaris server using storage on a SAN device, is it reasonable to configure the storage device to present one LUN for each RAID group? I'm assuming that the SAN and storage device are sufficiently reliable that no additional redundancy is necessary on the Solaris ZFS server. I'm also assuming that all disk management is done on the storage device. I realize that it is possible to configure more than one LUN per RAID group on the storage device, but doesn't ZFS assume that each LUN represents an independent disk, and schedule I/O accordingly? In that case, wouldn't ZFS I/O scheduling interfere with I/O scheduling already done by the storage device? Is there any reason not to use one LUN per RAID group?

-- 
-Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] One LUN per RAID group
On Mon, Feb 14, 2011 at 03:04:18PM -0500, Paul Kraus wrote:
> On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills mi...@cc.umanitoba.ca wrote:
> > Is there any reason not to use one LUN per RAID group?
> [...] In other words, if you build a zpool with one vdev of 10GB and another with two vdevs each of 5GB (both coming from the same array and RAID set), you get almost exactly twice the random read performance from the 2x5 zpool vs. the 1x10 zpool.

This finding is surprising to me. How do you explain it? Is it simply that you get twice as many outstanding I/O requests with two LUNs? Is it limited by the default I/O queue depth in ZFS? After all, all of the I/O requests must be handled by the same RAID group once they reach the storage device.

> Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot spares), you get substantially better random read performance using 10 LUNs vs. 1 LUN.

While inconvenient, this just reflects the scaling of ZFS with the number of vdevs and not spindles.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
[zfs-discuss] zpool-poolname has 99 threads
After an upgrade of a busy server to Oracle Solaris 10 9/10, I notice a process called zpool-poolname that has 99 threads. This seems to be a limit, as it never goes above that. It is lower on workstations. The `zpool' man page says only:

    Processes
        Each imported pool has an associated process, named zpool-poolname.
        The threads in this process are the pool's I/O processing threads,
        which handle the compression, checksumming, and other tasks for
        all I/O associated with the pool. This process exists to provide
        visibility into the CPU utilization of the system's storage pools.
        The existence of this process is an unstable interface.

There are several thousand processes doing ZFS I/O on the busy server. Could this new process be a limitation in any way? I'd just like to rule it out before looking further at I/O performance.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
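[A quick way to watch those threads; the pool name `tank' is an example:]

```
# Thread (LWP) count of the pool's I/O process
ps -o nlwp,comm -p `pgrep -f zpool-tank`

# Per-thread microstate accounting, to see whether the
# threads are actually saturated or mostly idle
prstat -mL -p `pgrep -f zpool-tank`
```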
Re: [zfs-discuss] Sliced iSCSI device for doing RAIDZ?
On Fri, Sep 24, 2010 at 12:01:35AM +0200, Alexander Skwar wrote:
> > Suppose they gave you two huge lumps of storage from the SAN, and you mirrored them with ZFS. What would you do if ZFS reported that one of its two disks had failed and needed to be replaced? You can't do disk management with ZFS in this situation anyway because those aren't real disks. Disk management all has to be done on the SAN storage device.
> Yes. I was rather thinking about RAIDZ instead of mirroring.

I was just using a simpler example.

> Anyway. Without redundancy, ZFS cannot do recovery, can it? As far as I understand, it could detect block-level corruption even if there's no redundancy, but it could not correct such corruption. Or is that a wrong understanding?

That's correct, but it also should never happen.

> If I got the gist of what you wrote, it boils down to how reliable the SAN is? But also SANs could have block-level corruption, no? I'm a bit confused, because of the (perceived?) contradiction to the Best Practices Guide? :)

The real problem is that ZFS was not designed to run in a SAN environment, that is, one where all of the disk management and sufficient redundancy reside in the storage device on the SAN. ZFS certainly can't do any disk management in this situation. Error detection and correction is still a debatable issue, one that quickly becomes exceedingly complex. The decision rests on probabilities rather than certainties.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] Sliced iSCSI device for doing RAIDZ?
On Tue, Sep 21, 2010 at 05:48:09PM +0200, Alexander Skwar wrote:
> We're using ZFS via iSCSI on a S10U8 system. As the ZFS Best Practices Guide http://j.mp/zfs-bp states, it's advisable to use redundancy (ie. RAIDZ, mirroring or whatnot), even if the underlying storage does its own RAID thing. Now, our storage does RAID, and the storage people say it is impossible to have it export iSCSI devices which have no redundancy/RAID.

If you have a reliable iSCSI SAN and a reliable storage device, you don't need the additional redundancy provided by ZFS.

> Actually, where would there be a difference? I mean, those iSCSI devices anyway don't represent real disks/spindles; it's just some sort of abstraction. So, if they'd give me 3x400 GB compared to 1200 GB in one huge lump like they do now, it could be that those would use the same spots on the real hard drives.

Suppose they gave you two huge lumps of storage from the SAN, and you mirrored them with ZFS. What would you do if ZFS reported that one of its two disks had failed and needed to be replaced? You can't do disk management with ZFS in this situation anyway because those aren't real disks. Disk management all has to be done on the SAN storage device.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] ZFS with Equallogic storage
On Sat, Aug 21, 2010 at 06:36:37PM -0400, Toby Thain wrote:
> On 21-Aug-10, at 3:06 PM, Ross Walker wrote:
> > On Aug 21, 2010, at 2:14 PM, Bill Sommerfeld bill.sommerf...@oracle.com wrote:
> > > On 08/21/10 10:14, Ross Walker wrote:
> > > > ... Would I be better off forgoing resiliency for simplicity, putting all my faith into the Equallogic to handle data resiliency?
> > > IMHO, no; the resulting system will be significantly more brittle.
> > Exactly how brittle I guess depends on the Equallogic system.
> If you don't let zfs manage redundancy, Bill is correct: it's a more fragile system that *cannot* self-heal data errors in the (deep) stack. Quantifying the increased risk is a question that Richard Elling could probably answer :)

That's because ZFS does not have a way to handle a large class of storage designs, specifically the ones with raw storage and disk management being provided by reliable SAN devices.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] Solaris startup script location
On Wed, Aug 18, 2010 at 12:16:04AM -0700, Alxen4 wrote:
> Is there any way to run a start-up script before a non-root pool is mounted? For example, I'm trying to use a ramdisk as a ZIL device (ramdiskadm). So I need to create the ramdisk before the actual pool is mounted; otherwise it complains that the log device is missing :)

Yes, it's actually quite easy. You need to create an SMF manifest and method. The manifest should make the ZFS mount dependent on it with the `dependent' and `/dependent' tag pair. It also needs to be dependent on resources it needs, with the `dependency' and `/dependency' pairs. It should also specify a `single_instance/' and `transient' service. The method script can do whatever the mount requires, such as creating the ramdisk.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
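[A skeleton of such a manifest, with the elements described above. The service name, method path, and dependency choices are invented for illustration; adjust them for the actual pool and ramdisk commands:]

```xml
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="manifest" name="ramdisk-zil">
  <service name="site/ramdisk-zil" type="service" version="1">
    <create_default_instance enabled="true"/>
    <single_instance/>
    <!-- Resources this service itself needs before it can run -->
    <dependency name="usr" grouping="require_all" restart_on="none" type="service">
      <service_fmri value="svc:/system/filesystem/usr:default"/>
    </dependency>
    <!-- Make the ZFS mount service wait for this one -->
    <dependent name="ramdisk-zil_fs-local" grouping="optional_all" restart_on="none">
      <service_fmri value="svc:/system/filesystem/local"/>
    </dependent>
    <exec_method type="method" name="start"
        exec="/lib/svc/method/ramdisk-zil start" timeout_seconds="60"/>
    <exec_method type="method" name="stop" exec=":true" timeout_seconds="60"/>
    <property_group name="startd" type="framework">
      <!-- transient: the method runs once; there is no daemon -->
      <propval name="duration" type="astring" value="transient"/>
    </property_group>
  </service>
</service_bundle>
```

The start method would run ramdiskadm to create the ramdisk before the pool's filesystems are mounted.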
Re: [zfs-discuss] Opensolaris is apparently dead
On Fri, Aug 13, 2010 at 01:54:13PM -0700, Erast wrote: On 08/13/2010 01:39 PM, Tim Cook wrote: http://www.theregister.co.uk/2010/08/13/opensolaris_is_dead/ I'm a bit surprised at this development... Oracle really just doesn't get it. The part that's most disturbing to me is the fact they won't be releasing nightly snapshots. It appears they've stopped Illumos in its tracks before it really even got started (perhaps that explains the timing of this press release) Wrong. Be patient, with the pace of current Illumos development it soon will have all the closed binaries liberated and ready to sync up with promised ON code drops as dictated by GPL and CDDL licenses. Is this what you mean, from: http://hub.opensolaris.org/bin/view/Main/opensolaris_license Any Covered Software that You distribute or otherwise make available in Executable form must also be made available in Source Code form and that Source Code form must be distributed only under the terms of this License. You must include a copy of this License with every copy of the Source Code form of the Covered Software You distribute or otherwise make available. You must inform recipients of any such Covered Software in Executable form as to how they can obtain such Covered Software in Source Code form in a reasonable manner on or through a medium customarily used for software exchange. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS development moving behind closed doors
If this information is correct, http://opensolaris.org/jive/thread.jspa?threadID=133043 further development of ZFS will take place behind closed doors. Opensolaris will become the internal development version of Solaris with no public distributions. The community has been abandoned. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs upgrade unmounts filesystems
Zpool upgrade on this system went fine, but zfs upgrade failed:

  # zfs upgrade -a
  cannot unmount '/space/direct': Device busy
  cannot unmount '/space/dcc': Device busy
  cannot unmount '/space/direct': Device busy
  cannot unmount '/space/imap': Device busy
  cannot unmount '/space/log': Device busy
  cannot unmount '/space/mysql': Device busy
  2 filesystems upgraded

Do I have to shut down all the applications before upgrading the filesystems? This is on a Solaris 10 5/09 system.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] zfs upgrade unmounts filesystems
On Thu, Jul 29, 2010 at 10:26:14PM +0200, Pawel Jakub Dawidek wrote: On Thu, Jul 29, 2010 at 12:00:08PM -0600, Cindy Swearingen wrote: I found a similar zfs upgrade failure with the device busy error, which I believe was caused by a file system mounted under another file system. If this is the cause, I will file a bug or find an existing one. No, it was caused by processes active on those filesystems. The workaround is to unmount the nested file systems and upgrade them individually, like this: # zfs upgrade space/direct # zfs upgrade space/dcc Except that I couldn't unmount them because the filesystems were busy. 'zfs upgrade' unmounts file system first, which makes it hard to upgrade for example root file system. The only work-around I found is to clone root file system (clone is created with most recent version), change root file system to newly created clone, reboot, upgrade original root file system, change root file system back, reboot, destroy clone. In this case it wasn't the root filesystem, but I still had to disable twelve services before doing the upgrade and enable them afterwards. `fuser -c' is useful to identify the processes. Mapping them to services can be difficult. The server is essentially down during the upgrade. For a root filesystem, you might have to boot off the failsafe archive or a DVD and import the filesystem in order to upgrade it. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS disks hitting 100% busy
Our e-mail server started to slow down today. One of the disk devices is frequently at 100% usage. The heavy writes seem to cause reads to run quite slowly. In the statistics below, `c0t0d0' is UFS, containing the / and /var slices. `c0t1d0' is ZFS, containing /var/log/syslog, a couple of databases, and the GNU mailman files. It's this latter disk that's been hitting 100% usage.

  $ iostat -xn 5 3
                      extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    8.2   57.8  142.6   538.2  0.0  1.7    0.1   25.2   0  48 c0t0d0
    5.8  273.0  303.4 24115.9  0.0 18.6    0.0   66.7   0  73 c0t1d0
                      extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   57.2    0.0   294.6  0.0  1.3    0.0   22.1   0  64 c0t0d0
    0.2  370.2    1.1 33968.5  0.0 31.4    0.0   84.9   1 100 c0t1d0
                      extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.8   61.0    6.4   503.0  0.0  2.5    0.0   40.0   0  70 c0t0d0
    0.0  295.8    0.0 35273.3  0.0 35.0    0.0  118.3   0 100 c0t1d0

This system is running Solaris 10 5/09 on a Sun 4450 server. Both the disk devices are actually hardware-mirrored pairs of SAS disks, with the Adaptec RAID controller. Can anything be done to either reduce the amount of I/O or to improve the write bandwidth? I assume that adding another disk device to the zpool will double the bandwidth. /var/log/syslog is quite large, reaching about 600 megabytes before it's rotated. This takes place each night, with compression bringing it down to about 70 megabytes. The server handles about 500,000 messages a day.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
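[As a rough sanity check on the figures above, the kw/s column (kilobytes written per second) converts directly to MB/s. A small awk one-liner over an iostat-style data line; the sample row is the busy c0t1d0 device from the second interval above.]

```shell
#!/bin/sh
# Convert an `iostat -xn' data line's kw/s field (KB/s) to MB/s.
# Fields: r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
echo '0.2 370.2 1.1 33968.5 0.0 31.4 0.0 84.9 1 100 c0t1d0' |
awk '{ printf "%s: %.1f MB/s written\n", $11, $4 / 1024 }'
# prints: c0t1d0: 33.2 MB/s written
```

About 33 MB/s of sustained writes to a single mirrored pair of SAS disks, with asvc_t near 85 ms, is consistent with the device being saturated.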
Re: [zfs-discuss] Is the J4200 SAS array suitable for Sun Cluster?
On Sun, May 16, 2010 at 01:14:24PM -0700, Charles Hedrick wrote:
  We use this configuration. It works fine. However I don't know enough about the details to answer all of your questions. The disks are accessible from both systems at the same time. Of course with ZFS you had better not actually use them from both systems.

That's what I wanted to know. I'm not familiar with SAS fabrics, so it's good to know that they operate similarly to multi-initiator SCSI in a cluster.

  Actually, let me be clear about what we do. We have two J4200's and one J4400. One J4200 uses SAS disks, the others SATA. The two with SATA disks are used in Sun Cluster configurations as NFS servers. They fail over just fine, losing no state. The one with SAS is not used with Sun Cluster. Rather, it's a MySQL server with two systems, one of them as a hot spare. (It also acts as a MySQL slave server, but it uses different storage for that.) That means that our actual failover experience is with the SATA configuration. I will say from experience that in the SAS configuration both systems see the disks at the same time. I even managed to get ZFS to mount the same pool from both systems, which shouldn't be possible. Behavior was very strange until we realized what was going on.

Our situation is that we only need a small amount of shared storage in the cluster. It's intended for high-availability of core services, such as DNS and NIS, rather than as a NAS server.

  I get the impression that they have special hardware in the SATA version that simulates SAS dual-interface drives. That's what lets you use SATA drives in a two-node configuration. There's also some additional software setup for that configuration.

That would be the SATA interposer that does that.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] Does ZFS use large memory pages?
On Thu, May 06, 2010 at 07:46:49PM -0700, Rob wrote:
  Hi Gary, I would not remove this line in /etc/system. We have been combatting this bug for a while now on our ZFS file system running JES Commsuite 7. I would be interested in finding out how you were able to pinpoint the problem.

Our problem was a year ago. Careful reading of Sun bug reports helped. Opening a support case with Sun helped even more. Large memory pages were likely not involved.

  We seem to have no worries with the system currently, but when the file system gets above 80% we seem to have quite a number of issues, much the same as what you've had in the past: ps and prstat hanging. Are you able to tell me the IDR number that you applied?

The IDR was only needed last year. Upgrading to Solaris 10 10/09 and applying the latest patches resolved the problem.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
[zfs-discuss] Is the J4200 SAS array suitable for Sun Cluster?
I'm setting up a two-node cluster with 1U x86 servers. It needs a small amount of shared storage, with two or four disks. I understand that the J4200 with SAS disks is approved for this use, although I haven't seen this information in writing. Does anyone have experience with this sort of configuration? I have a few questions. I understand that the J4200 with SATA disks will not do SCSI reservations. Will it with SAS disks? The X4140 seems to require two SAS HBAs, one for the internal disks and one for the external disks. Is this correct? Will the disks in the J4200 be accessible from both nodes, so that the cluster can fail over the storage? I know this works with a multi-initiator SCSI bus, but I don't know about SAS behavior. Is there a smaller, and cheaper, SAS array that can be used in this configuration? It would still need to have redundant power and redundant SAS paths. I plan to use ZFS everywhere, for the root filesystem and the shared storage. The only exception will be UFS for /globaldevices . -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SAS vs SATA: Same size, same speed, why SAS?
On Mon, Apr 26, 2010 at 01:32:33PM -0500, Dave Pooser wrote: On 4/26/10 10:10 AM, Richard Elling richard.ell...@gmail.com wrote: SAS shines with multiple connections to one or more hosts. Hence, SAS is quite popular when implementing HA clusters. So that would be how one builds something like the active/active controller failover in standalone RAID boxes. Is there a good resource on doing something like that with an OpenSolaris storage server? I could see that as a project I might want to attempt. This is interesting. I have a two-node SPARC cluster that uses a multi-initiator SCSI array for shared storage. As an application server, it need only two disks in the array. They are a ZFS mirror. This all works quite nicely under Sun Cluster. I'd like to duplicate this configuration with two small x86 servers and a small SAS array, also with only two disks. It should be easy to find a pair of 1U servers, but what's the smallest SAS array that's available? Does it need an array controller? What's needed on the servers to connect to it? -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshot recycle freezes system activity
On Thu, Mar 04, 2010 at 04:20:10PM -0600, Gary Mills wrote:
  We have an IMAP e-mail server running on a Solaris 10 10/09 system. It uses six ZFS filesystems built on a single zpool with 14 daily snapshots. Every day at 11:56, a cron command destroys the oldest snapshots and creates new ones, both recursively. For about four minutes thereafter, the load average drops and I/O to the disk devices drops to almost zero. Then, the load average shoots up to about ten times normal and then declines to normal over about four minutes, as disk activity resumes. The statistics return to their normal state about ten minutes after the cron command runs.

I'm pleased to report that I found the culprit and the culprit was me! Well, ZFS peculiarities may be involved as well. Let me explain:

We had a single second-level filesystem and five third-level filesystems, all with 14 daily snapshots. The snapshots were maintained by a cron command that did a `zfs list -rH -t snapshot -o name' to get the names of all of the snapshots, extracted the part after the `@', and then sorted them uniquely to get a list of suffixes that were older than 14 days. The suffixes were Julian dates so they sorted correctly. It then did a `zfs destroy -r' to delete them. The recursion was always done from the second-level filesystem. The top-level filesystem was empty and had no snapshots. Here's a portion of the script:

  zfs list -rH -t snapshot -o name $FS | \
      cut -d@ -f2 | \
      sort -ur | \
      sed 1,${NR}d | \
      xargs -I '{}' zfs destroy -r $FS@'{}'
  zfs snapshot -r $...@$jd

Just over two weeks ago, I rearranged the filesystems so that the second-level filesystem was newly-created and initially had no snapshots. It did have a snapshot taken every day thereafter, so that eventually it also had 14 of them. It was during that interval that the complaints started. My statistics clearly showed the performance stall and subsequent recovery.
Once that filesystem reached 14 snapshots, the complaints stopped and the statistics showed only a modest increase in CPU activity, but no stall. During this interval, the script was doing a recursive destroy for a snapshot that didn't exist at the specified level, but only existed in the descendent filesystems. I'm assuming that that unusual situation was the cause of the stall, although I don't have good evidence. By the time the complaints reached my ears, and I was able to refine my statistics gathering sufficiently, the problem had gone away. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
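[For what it's worth, the suffix-selection part of the rotation script discussed above operates on plain text, so it can be exercised without ZFS at all. Here is a sketch with invented snapshot names and NR=3 instead of 14; `printf' stands in for the `zfs list' output.]

```shell
#!/bin/sh
# Sketch of the rotation logic: keep the NR newest suffixes,
# print the rest (the ones the real script would `zfs destroy -r').
NR=3
printf '%s\n' \
    'space@2010060' 'space/imap@2010060' \
    'space@2010061' 'space/imap@2010061' \
    'space@2010062' 'space@2010063' 'space@2010064' |
cut -d@ -f2 |       # keep only the suffix after the @
sort -ur |          # unique suffixes, newest (largest) first
sed 1,${NR}d        # drop the NR newest; what remains is destroyed
# prints the two oldest suffixes: 2010061, then 2010060
```

Because the suffixes are Julian dates, a plain reverse lexical sort puts the newest first, which is what makes the `sed 1,${NR}d' trick work.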
Re: [zfs-discuss] Snapshot recycle freezes system activity
On Mon, Mar 08, 2010 at 03:18:34PM -0500, Miles Nordin wrote: gm == Gary Mills mi...@cc.umanitoba.ca writes: gm destroys the oldest snapshots and creates new ones, both gm recursively. I'd be curious if you try taking the same snapshots non-recursively instead, does the pause go away? I'm still collecting statistics, but that is one of the things I'd like to try. Because recursive snapshots are special: they're supposed to atomically synchronize the cut-point across all the filesystems involved, AIUI. I don't see that recursive destroys should be anything special though. gm Is it destroying old snapshots or creating new ones that gm causes this dead time? sortof seems like you should tell us this, not the other way around. :) Seriously though, isn't that easy to test? And I'm curious myself too. Yes, that's another thing I'd like to try. I'll just put a `sleep' in the script between the two actions to see if the dead time moves later in the day. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshot recycle freezes system activity
On Mon, Mar 08, 2010 at 01:23:10PM -0800, Bill Sommerfeld wrote: On 03/08/10 12:43, Tomas Ögren wrote: So we tried adding 2x 4GB USB sticks (Kingston Data Traveller Mini Slim) as metadata L2ARC and that seems to have pushed the snapshot times down to about 30 seconds. Out of curiosity, how much physical memory does this system have? Mine has 64 GB of memory with the ARC limited to 32 GB. The Cyrus IMAP processes, thousands of them, use memory mapping extensively. I don't know if this design affects the snapshot recycle behavior. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshot recycle freezes system activity
On Thu, Mar 04, 2010 at 04:20:10PM -0600, Gary Mills wrote: We have an IMAP e-mail server running on a Solaris 10 10/09 system. It uses six ZFS filesystems built on a single zpool with 14 daily snapshots. Every day at 11:56, a cron command destroys the oldest snapshots and creates new ones, both recursively. For about four minutes thereafter, the load average drops and I/O to the disk devices drops to almost zero. Then, the load average shoots up to about ten times normal and then declines to normal over about four minutes, as disk activity resumes. The statistics return to their normal state about ten minutes after the cron command runs. I should mention that this seems to be a new problem. We've been using the same scheme to cycle snapshots for several years. The complaints of an unresponsive interval have only happened recently. I'm still waiting for our help desk to report on when the complaints started. It may be the result of some recent change we made, but so far I can't tell what that might have been. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Snapshot recycle freezes system activity
We have an IMAP e-mail server running on a Solaris 10 10/09 system. It uses six ZFS filesystems built on a single zpool with 14 daily snapshots. Every day at 11:56, a cron command destroys the oldest snapshots and creates new ones, both recursively. For about four minutes thereafter, the load average drops and I/O to the disk devices drops to almost zero. Then, the load average shoots up to about ten times normal and then declines to normal over about four minutes, as disk activity resumes. The statistics return to their normal state about ten minutes after the cron command runs. Is it destroying old snapshots or creating new ones that causes this dead time? What does each of these procedures do that could affect the system? What can I do to make this less visible to users? -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshot recycle freezes system activity
On Thu, Mar 04, 2010 at 07:51:13PM -0300, Giovanni Tirloni wrote: On Thu, Mar 4, 2010 at 7:28 PM, Ian Collins [1]...@ianshome.com wrote: Gary Mills wrote: We have an IMAP e-mail server running on a Solaris 10 10/09 system. It uses six ZFS filesystems built on a single zpool with 14 daily snapshots. Every day at 11:56, a cron command destroys the oldest snapshots and creates new ones, both recursively. For about four minutes thereafter, the load average drops and I/O to the disk devices drops to almost zero. Then, the load average shoots up to about ten times normal and then declines to normal over about four minutes, as disk activity resumes. The statistics return to their normal state about ten minutes after the cron command runs. Is it destroying old snapshots or creating new ones that causes this dead time? What does each of these procedures do that could affect the system? What can I do to make this less visible to users? I have a couple of Solaris 10 boxes that do something similar (hourly snaps) and I've never seen any lag in creating and destroying snapshots. One system with 16 filesystems takes 5 seconds to destroy the 16 oldest snaps and create 5 recursive new ones. I logged load average on these boxes and there is a small spike on the hour, but this is down to sending the snaps, not creating them. We've seen the behaviour that Gary describes while destroying datasets recursively (600GB and with 7 snapshots). It seems that close to the end the server stalls for 10-15 minutes and NFS activity stops. For small datasets/snapshots that doesn't happen or is harder to notice. Does ZFS have to do something special when it's done releasing the data blocks at the end of the destroy operation ? That does sound similar to the problem here. The zpool is 3 TB in size with about 1.4 TB used. It does sound as if the stall happens during the `zfs destroy -r' rather than during the `zfs snapshot -r'. 
What can zfs be doing when the CPU load average drops and disk I/O is close to zero? I also had peculiar problem here recently when I was upgrading the ZFS filesystems on our test server from 3 to 4. When I tried `zfs upgrade -a', the command hung for a long time and could not be interrupted, killed, or traced. Eventually it terminated on its own. Only the two upper-level filesystems had been upgraded. I upgraded the lower- level ones individually with `zfs upgrade' with no further problems. I had previously upgraded the zpool with no problems. I don't know if this behavior is related to the stall on the production server. I haven't attempted the upgrades there yet. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How do separate ZFS filesystems affect performance?
On Thu, Jan 14, 2010 at 10:58:48AM +1100, Daniel Carosone wrote: On Wed, Jan 13, 2010 at 08:21:13AM -0600, Gary Mills wrote: Yes, I understand that, but do filesystems have separate queues of any sort within the ZIL? I'm not sure. If you can experiment and measure a benefit, understanding the reasons is helpful but secondary. If you can't experiment so easily, you're stuck asking questions, as now, to see whether the effort of experimenting is potentially worthwhile. Yes, we're stuck asking questions. I appreciate your responses. Some other things to note (not necessarily arguments for or against): * you can have multiple slog devices, in case you're creating so much ZIL traffic that ZIL queueing is a real problem, however shared or structured between filesystems. For the time being, I'd like to stay with the ZIL that's internal to the zpool. * separate filesystems can have different properties which might help tuning and experiments (logbias, copies, compress, *cache), as well the recordsize. Maybe you will find that compress on mailboxes helps, as long as you're not also compressing the db's? Yes, that's a good point in favour of a separate filesystem. * separate filesystems may have different recovery requirements (snapshot cycles). Note that taking snapshots is ~free, but keeping them and deleting them have costs over time. Perhaps you can save some of these costs if the db's are throwaway/rebuildable. Also a good point. If not, would it help to put the database filesystems into a separate zpool? Maybe, if you have the extra devices - but you need to compare with the potential benefit of adding those devices (and their IOPS) to benefit all users of the existing pool. For example, if the databases are a distinctly different enough load, you could compare putting them on a dedicated pool on ssd, vs using those ssd's as additional slog/l2arc. 
Unless you can make quite categorical separations between the workloads, such that an unbalanced configuration matches an unbalanced workload, you may still be better with consolidated IO capacity in the one pool. As well, I'd like to keep all of the ZFS pools on the same external storage device. This makes migrating to a different server quite easy. Note, also, you can only take recursive atomic snapshots within the one pool - this might be important if the db's have to match the mailbox state exactly, for recovery. That's another good point. It's certainly better to have synchronized snapshots. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Does ZFS use large memory pages?
On Mon, Jan 11, 2010 at 01:43:27PM -0600, Gary Mills wrote:
  This line was a workaround for bug 6642475 that had to do with searching for large contiguous pages. The result was high system time and slow response. I can't find any public information on this bug, although I assume it's been fixed by now. It may have only affected Oracle database.

I eventually found it. The bug is not visible from Sunsolve even with a contract, but it is in bugs.opensolaris.org without one. This is extremely confusing.

  I'd like to remove this line from /etc/system now, but I don't know if it will have any adverse effect on ZFS or the Cyrus IMAP server that runs on this machine. Does anyone know if ZFS uses large memory pages?

Bug 6642475 is still outstanding, although related bugs have been fixed. I'm going to leave `set pg_contig_disable=1' in place.

-- 
-Gary Mills--Unix Group--Computer and Network Services-
[zfs-discuss] How do separate ZFS filesystems affect performance?
I'm working with a Cyrus IMAP server running on a T2000 box under Solaris 10 10/09 with current patches. Mailboxes reside on six ZFS filesystems, each containing about 200 gigabytes of data. These are part of a single zpool built on four iSCSI devices from our NetApp filer. One of these ZFS filesystems contains a number of global and per-user databases in addition to one sixth of the mailboxes. I'm thinking of moving these databases to a separate ZFS filesystem. Access to these databases must be quick to ensure responsiveness of the server. We are currently experiencing a slowdown in performance when the number of simultaneous IMAP sessions rises above 3000. These databases are opened and memory-mapped by all processes. They have the usual requirement for locking and synchronous writes whenever they are updated.

Is moving the databases (IMAP metadata) to a separate ZFS filesystem likely to improve performance? I've heard that this is important, but I'm not clear why this is. Does each filesystem have its own queue in the ARC or ZIL?

Here are some statistics taken while the server was busy and access was slow:

  # /usr/local/sbin/zilstat 5 5
   N-Bytes N-Bytes/s N-Max-Rate  B-Bytes B-Bytes/s B-Max-Rate  ops <=4kB 4-32kB >=32kB
   1126664    225332     515872 11485184   2297036    3469312  292   163     51     79
    740536    148107     250896  9535488   1907097    4005888  198   106     24     68
    758344    151668     179104 12546048   2509209    2682880  227    93     45     89
    603304    120660     204344  9179136   1835827    2084864  179    89     23     67
    948896    189779     346520 15880192   3176038    4173824  262   108     32    123

  # /usr/local/sbin/arcstat 5 5
      Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
  10:50:16  191M   31M     16   14M    8   17M   48   18M   12    30G   32G
  10:50:21    1K   148     10    76    5    72   58    78   15    30G   32G
  10:50:26    1K   154     12    88    7    65   72    96   18    30G   32G
  10:50:31   796    61      7    54    7     6   35    25    8    30G   32G
  10:50:36    1K   117      9   105    8    12   53    44   10    30G   32G

-- 
-Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] How do separate ZFS filesystems affect performance?
On Tue, Jan 12, 2010 at 11:11:36AM -0600, Bob Friesenhahn wrote:
  On Tue, 12 Jan 2010, Gary Mills wrote:
  Is moving the databases (IMAP metadata) to a separate ZFS filesystem likely to improve performance? I've heard that this is important, but I'm not clear why this is.
  There is an obvious potential benefit in that you are then able to tune filesystem parameters to best fit the needs of the application which updates the data. For example, if the database uses a small block size, then you can set the filesystem blocksize to match. If the database uses memory-mapped files, then using a filesystem blocksize which is closest to the MMU page size may improve performance.

I found a couple of references that suggest just putting the databases on their own ZFS filesystem has a great benefit. One is an e-mail message to a mailing list from Vincent Fox at UC Davis. They run a similar system to ours at that site. He says:

  Particularly the database is important to get its own filesystem so that its queue/cache are separated.

The second one is from: http://blogs.sun.com/roch/entry/the_dynamics_of_zfs He says:

  For file modification that come with some immediate data integrity constraint (O_DSYNC, fsync etc.) ZFS manages a per-filesystem intent log or ZIL.

This sounds like the ZIL queue mentioned above. Is I/O for each of those handled separately?

-- 
-Gary Mills--Unix Group--Computer and Network Services-
[zfs-discuss] Does ZFS use large memory pages?
Last April we put this in /etc/system on a T2000 server with large ZFS filesystems:

  set pg_contig_disable=1

This was while we were attempting to solve a couple of ZFS problems that were eventually fixed with an IDR. Since then, we've removed the IDR and brought the system up to Solaris 10 10/09 with current patches. It's stable now, but seems slower. This line was a workaround for bug 6642475 that had to do with searching for large contiguous pages. The result was high system time and slow response. I can't find any public information on this bug, although I assume it's been fixed by now. It may have only affected Oracle database. I'd like to remove this line from /etc/system now, but I don't know if it will have any adverse effect on ZFS or the Cyrus IMAP server that runs on this machine. Does anyone know if ZFS uses large memory pages?

-- 
-Gary Mills--Unix Group--Computer and Network Services-
[zfs-discuss] ZFS filesystems not mounted on reboot with Solaris 10 10/09
I have a system that was recently upgraded to Solaris 10 10/09. It has a UFS root on local disk and a separate zpool on iSCSI disk. After a reboot, the ZFS filesystems were not mounted, although the zpool had been imported. `zfs mount' showed nothing. `zfs mount -a' mounted them nicely. The `canmount' property is `on'. Why would they not be mounted at boot? This used to work with earlier releases of Solaris 10.

The `zfs mount -a' at boot is run by the /system/filesystem/local:default service. It didn't record any errors on the console or in the log:

  [ Dec 19 08:09:11 Executing start method (/lib/svc/method/fs-local) ]
  [ Dec 19 08:09:12 Method start exited with status 0 ]

Is a dependency missing?

-- 
-Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] Permanent errors on two files
On Fri, Dec 04, 2009 at 02:52:47PM -0700, Cindy Swearingen wrote: If space/dcc is a dataset, is it mounted? ZFS might not be able to print the filenames if the dataset is not mounted, but I'm not sure if this is why only object numbers are displayed. Yes, it's mounted and is quite an active filesystem. I would also check fmdump -eV to see how frequent the hardware has had problems. That shows ZFS checksum errors in July, but nothing since that time. There were also DIMM errors before that, starting in June. We replaced the failed DIMMs, also in July. This is an X4450 with ECC memory. There were no disk errors reported. I suppose we can blame the memory. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Permanent errors on two files
On Sat, Dec 05, 2009 at 01:52:12AM +0300, Victor Latushkin wrote:
  On Dec 5, 2009, at 0:52, Cindy Swearingen cindy.swearin...@sun.com wrote:
  The zpool status -v command will generally print out filenames, dnode object numbers, or identify metadata corruption problems. These look like object numbers, because they are large, rather than metadata objects, but an expert will have to comment.
  Yes, this is object numbers, and the most likely reason these are not turned into filenames is that the corresponding files no longer exist.

That seems to be the case:

  # zdb -d space/dcc 0x11e887 0xba25aa
  Dataset space/dcc [ZPL], ID 21, cr_txg 19, 20.5G, 3672408 objects

  So I'd run scrub another time; if the files are gone and there are no other corruptions, scrub will reset the error log and zpool status should become clean.

That worked. After the scrub, there are no errors reported.

  You might be able to identify these object numbers with zdb, but I'm not sure how to do that.
  You can try to use zdb this way to check if these objects still exist: zdb -d space/dcc 0x11e887 0xba25aa

-- 
-Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] If you have ZFS in production, willing to share some details (with me)?
On Fri, Sep 18, 2009 at 01:51:52PM -0400, Steffen Weiberle wrote: I am trying to compile some deployment scenarios of ZFS.
# of systems
One, our e-mail server for the entire campus.
amount of storage
2 TB that's 58% used.
application profile(s)
This is our Cyrus IMAP spool. In addition to users' e-mail folders (directories) and messages (files), it contains global, per-folder, and per-user databases. The latter two types are quite small.
type of workload (low, high; random, sequential; read-only, read-write, write-only)
It's quite active. Message files arrive randomly and are deleted randomly. As a result, files in a directory are not located in proximity on the storage. Individual users often read all of their folders and messages in one IMAP session. Databases are quite active. Each incoming message adds a file to a directory and reads or updates several databases. Most IMAP I/O is done with mmap() rather than with read()/write(). So far, IMAP performance is adequate. The backup, done by EMC Networker, is very slow because it must read thousands of small files in directory order.
storage type(s)
We are using an Iscsi SAN with storage on a Netapp filer. It exports four 500 GB LUNs that are striped into one ZFS pool. All disk management is done on the Netapp. We have had several disk failures and replacements on the Netapp, with no effect on the e-mail server.
industry
A University with 35,000 enabled e-mail accounts.
whether it is private or I can share in a summary
anything else that might be of interest
You are welcome to share this information. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS commands hang after several zfs receives
On Tue, Sep 15, 2009 at 08:48:20PM +1200, Ian Collins wrote: Ian Collins wrote: I have a case open for this problem on Solaris 10u7. The case has been identified and I've just received an IDR, which I will test next week. I've been told the issue is fixed in update 8, but I'm not sure if there is an nv fix target. I'll post back once I've abused a test system for a while. The IDR I was sent appears to have fixed the problem. I have been abusing the box for a couple of weeks without any lockups. Roll on update 8! Was that IDR140221-17? That one fixed a deadlock bug for us back in May. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 06, 2009 at 04:54:16PM +0100, Andrew Gabriel wrote: Andre van Eyssen wrote: On Mon, 6 Jul 2009, Gary Mills wrote: As for a business case, we just had an extended and catastrophic performance degradation that was the result of two ZFS bugs. If we have another one like that, our director is likely to instruct us to throw away all our Solaris toys and convert to Microsoft products. If you change platform every time you get two bugs in a product, you must cycle platforms on a pretty regular basis! You often find the change is towards Windows. That very rarely has the same rules applied, so things then stick there. There's a more general principle in operation here. Organizations do sometimes change platforms for peculiar reasons, but once they do that they're not going to do it again for a long time. That's why they disregard problems with the new platform. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, Jul 04, 2009 at 07:18:45PM +0100, Phil Harman wrote: Gary Mills wrote: On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote: ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead of the Solaris page cache. But mmap() uses the latter. So if anyone maps a file, ZFS has to keep the two caches in sync. That's the first I've heard of this issue. Our e-mail server runs Cyrus IMAP with mailboxes on ZFS filesystems. Cyrus uses mmap(2) extensively. I understand that Solaris has an excellent implementation of mmap(2). ZFS has many advantages, snapshots for example, for mailbox storage. Is there anything that we can do to optimize the two caches in this environment? Will mmap(2) one day play nicely with ZFS? [..] Software engineering is always about prioritising resource. Nothing prioritises performance tuning attention quite like compelling competitive data. When Bart Smaalders and I wrote libMicro we generated a lot of very compelling data. I also coined the phrase "If Linux is faster, it's a Solaris bug." You will find quite a few (mostly fixed) bugs with the synopsis "linux is faster than solaris at ..." So, if mmap(2) playing nicely with ZFS is important to you, probably the best thing you can do to help that along is to provide data that will help build the business case for spending engineering resource on the issue. First of all, how significant is the double caching in terms of performance? If the effect is small, I won't worry about it anymore. What sort of data do you need? Would a list of software products that utilize mmap(2) extensively and could benefit from ZFS be suitable? As for a business case, we just had an extended and catastrophic performance degradation that was the result of two ZFS bugs. If we have another one like that, our director is likely to instruct us to throw away all our Solaris toys and convert to Microsoft products. 
-- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote: ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead of the Solaris page cache. But mmap() uses the latter. So if anyone maps a file, ZFS has to keep the two caches in sync. That's the first I've heard of this issue. Our e-mail server runs Cyrus IMAP with mailboxes on ZFS filesystems. Cyrus uses mmap(2) extensively. I understand that Solaris has an excellent implementation of mmap(2). ZFS has many advantages, snapshots for example, for mailbox storage. Is there anything that we can do to optimize the two caches in this environment? Will mmap(2) one day play nicely with ZFS? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
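One partial mitigation sometimes suggested for the double caching described above is to cap the ARC so the page cache retains room for mmap(2)'d files. A sketch for /etc/system on Solaris 10 (the tunable is real, but the 16 GB value below is only an example and must be sized to your workload; a reboot is required):

    * /etc/system -- cap the ZFS ARC at 16 GB (example value only)
    set zfs:zfs_arc_max = 0x400000000

This only limits ARC growth; it does not remove the cost of keeping the two caches in sync for mapped files.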
Re: [zfs-discuss] Lots of metadata overhead on filesystems with 100M files
On Thu, Jun 18, 2009 at 12:12:16PM +0200, Cor Beumer - Storage Solution Architect wrote: What they noticed on the X4500 systems is that when the zpool became filled up to about 50-60%, the performance of the system dropped enormously. They do claim this has to do with the fragmentation of the ZFS filesystem. So we did try over there putting in an S7410 system with about the same config on disks, 44x 1TB SATA BUT 4x 18GB WriteZilla (in a stripe). We were able to get much, much more I/O from the system than the comparable X4500. However, they did put it in production for a couple of weeks, and as soon as the ZFS filesystem came into the range of about 50-60% filled, they did see the same problem. We had a similar problem with a T2000 and 2 TB of ZFS storage. Once the usage reached 1 TB, the write performance dropped considerably and the CPU consumption increased. Our problem was indirectly a result of fragmentation, but it was solved by a ZFS patch. I understand that this patch, which fixes a whole bunch of ZFS bugs, should be released soon. I wonder if this was your problem. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What causes slow performance under load?
On Mon, Apr 27, 2009 at 04:47:27PM -0500, Gary Mills wrote: On Sat, Apr 18, 2009 at 04:27:55PM -0500, Gary Mills wrote: We have an IMAP server with ZFS for mailbox storage that has recently become extremely slow on most weekday mornings and afternoons. When one of these incidents happens, the number of processes increases, the load average increases, but ZFS I/O bandwidth decreases. Users notice very slow response to IMAP requests. On the server, even `ps' becomes slow. The cause turned out to be this ZFS bug: 6596237: Stop looking and start ganging Apparently, the ZFS code was searching the free list looking for the perfect fit for each write. With a fragmented pool, this search took a very long time, delaying the write. Eventually, the requests arrived faster than writes could be sent to the devices, causing the server to be unresponsive. We also had another problem, due to this ZFS bug: 6591646: Hang while trying to enter a txg while holding a txg open This was a deadlock, with one thread blocking hundreds of other threads. Our symptom was that all zpool I/O would stop and the `ps' command would hang. A reboot was the only way out. If you have a support contract, Sun will supply an IDR that fixes both problems. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What causes slow performance under load?
On Sat, Apr 18, 2009 at 04:27:55PM -0500, Gary Mills wrote: We have an IMAP server with ZFS for mailbox storage that has recently become extremely slow on most weekday mornings and afternoons. When one of these incidents happens, the number of processes increases, the load average increases, but ZFS I/O bandwidth decreases. Users notice very slow response to IMAP requests. On the server, even `ps' becomes slow. The cause turned out to be this ZFS bug: 6596237: Stop looking and start ganging Apparently, the ZFS code was searching the free list looking for the perfect fit for each write. With a fragmented pool, this search took a very long time, delaying the write. Eventually, the requests arrived faster than writes could be sent to the devices, causing the server to be unresponsive. There isn't a patch for this one yet, but Sun will supply an IDR if you open a support case. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Peculiarities of COW over COW?
On Sun, Apr 26, 2009 at 05:19:18PM -0400, Ellis, Mike wrote: As soon as you put those zfs blocks on top of iscsi, the netapp won't have a clue as far as how to defrag those iscsi files from the filer's perspective. (It might do some fancy stuff based on read/write patterns, but that's unlikely) Since the LUN is just a large file on the Netapp, I assume that all it can do is to put the blocks back into sequential order. That might have some benefit overall. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Peculiarities of COW over COW?
On Sun, Apr 26, 2009 at 05:02:38PM -0500, Tim wrote: On Sun, Apr 26, 2009 at 3:52 PM, Gary Mills mi...@cc.umanitoba.ca wrote: We run our IMAP spool on ZFS that's derived from LUNs on a Netapp filer. There's a great deal of churn in e-mail folders, with messages appearing and being deleted frequently. Should ZFS and the Netapp be using the same blocksize, so that they cooperate to some extent? Just make sure ZFS is using a block size that is a multiple of 4k, which I believe it does by default. Okay, that's good. I have to ask though... why not just serve NFS off the filer to the Solaris box? ZFS on a LUN served off a filer seems to make about as much sense as sticking a ZFS based lun behind a v-filer (although the latter might actually make sense in a world where it were supported *cough*neverhappen*cough* since you could buy the cheap newegg disk). I prefer NFS too, but the IMAP server requires POSIX semantics. I believe that NFS doesn't support that, at least NFS version 3. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What is the 32 GB 2.5-Inch SATA Solid State Drive?
On Fri, Apr 24, 2009 at 09:08:52PM -0700, Richard Elling wrote: Gary Mills wrote: Does anyone know about this device? SESX3Y11Z 32 GB 2.5-Inch SATA Solid State Drive with Marlin Bracket for Sun SPARC Enterprise T5120, T5220, T5140 and T5240 Servers, RoHS-6 Compliant This is from Sun's catalog for the T5120 server. Would this work well as a separate ZIL device for ZFS? Is there any way I could use this in a T2000 server? The brackets appear to be different. The brackets are different. T2000 uses nemo bracket and T5120 uses marlin. For the part-number details, SunSolve is your friend. http://sunsolve.sun.com/handbook_pub/validateUser.do?target=Systems/SE_T5120/components http://sunsolve.sun.com/handbook_pub/validateUser.do?target=Systems/SunFireT2000_R/components I see also that no SSD is listed for the T2000. Has anyone gotten one to work as a separate ZIL device for ZFS? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] What is the 32 GB 2.5-Inch SATA Solid State Drive?
Does anyone know about this device? SESX3Y11Z 32 GB 2.5-Inch SATA Solid State Drive with Marlin Bracket for Sun SPARC Enterprise T5120, T5220, T5140 and T5240 Servers, RoHS-6 Compliant This is from Sun's catalog for the T5120 server. Would this work well as a separate ZIL device for ZFS? Is there any way I could use this in a T2000 server? The brackets appear to be different. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
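If a bracket-compatible SSD can be sourced, attaching it as a separate ZIL is a single administrative step. A sketch, with a hypothetical pool and device name:

    # zpool add space log c2t1d0
    # zpool status space

The device should then appear under a "logs" section in the status output. Note that on older pool versions a log device cannot be removed once added, so it's worth trying on a scratch pool first.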
Re: [zfs-discuss] What causes slow performance under load?
On Tue, Apr 21, 2009 at 04:09:03PM -0400, Oscar del Rio wrote: There's a similar thread on hied-emailad...@listserv.nd.edu that might help or at least can get you in touch with other University admins in a similar situation. https://listserv.nd.edu/cgi-bin/wa?A1=ind0904&L=HIED-EMAILADMIN Thread: mail systems using ZFS filesystems? Thanks. Those problems do sound similar. I also see positive experiences with T2000 servers, ZFS, and Cyrus IMAP from UC Davis. None of the people involved seem to be active on either the ZFS mailing list or the Cyrus list. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What causes slow performance under load?
On Tue, Apr 21, 2009 at 09:34:57AM -0500, Patrick Skerrett wrote: I'm fighting with an identical problem here and am very interested in this thread. Solaris 10 127112-11 boxes running ZFS on a fiberchannel raid5 device (hardware raid). You are about a year behind in kernel patches. There is one patch that addresses similar problems. I'd recommend installing all of the new patches. This bug seems to be relevant: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6535160 Randomly one lun on a machine will stop writing for about 10-15 minutes (during a busy time of day), and then all of a sudden become active with a burst of activity. Reads will continue to happen. One thing that seems to have solved our hang and stall problems is to set `pg_contig_disable=1' in the kernel. I believe that only systems with Niagara CPUs are affected. It has to do with kernel code for handling two different sizes of memory pages. You can find more information here: http://forums.sun.com/thread.jspa?threadID=5257060 Also, open a support case with Sun if you haven't already. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
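For reference, the workaround above is an /etc/system setting (a sketch; it takes effect only after a reboot, and should be verified against your kernel patch level before use):

    * /etc/system -- work around large-page coalescing stalls on Niagara
    set pg_contig_disable = 1

The current value can be checked on a running system with:

    # echo pg_contig_disable/D | mdb -k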
Re: [zfs-discuss] What causes slow performance under load?
On Sat, Apr 18, 2009 at 04:27:55PM -0500, Gary Mills wrote: We have an IMAP server with ZFS for mailbox storage that has recently become extremely slow on most weekday mornings and afternoons. When one of these incidents happens, the number of processes increases, the load average increases, but ZFS I/O bandwidth decreases. Users notice very slow response to IMAP requests. On the server, even `ps' becomes slow. After I moved a couple of Cyrus databases from ZFS to UFS on Sunday morning, the server seemed to run quite nicely. One of these databases is memory-mapped by all of the lmtpd and pop3d processes. The other is opened by all the lmtpd processes. Both were quite active, with many small writes, so I assumed they'd be better on UFS. All of the IMAP mailboxes were still on ZFS. However, this morning, things went from bad to worse. All writes to the ZFS filesystems stopped completely. Look at this:

$ zpool iostat 5 5
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       1.04T   975G     86     67  4.53M  2.57M
space       1.04T   975G      5      0   159K      0
space       1.04T   975G      7      0   337K      0
space       1.04T   975G      3      0   179K      0
space       1.04T   975G      4      0   167K      0

`fsstat' told me that there were both writes and memory-mapped I/O to UFS, but nothing to ZFS. At the same time, the `ps' command would hang and could not be interrupted. `truss' on `ps' looked like this, but it eventually also stopped and could not be interrupted:

open("/proc/6359/psinfo", O_RDONLY) = 4
read(4, "02\0\0\0\0\0\001\0\018D7..", 416) = 416
close(4) = 0
open("/proc/12782/psinfo", O_RDONLY) = 4
read(4, "02\0\0\0\0\0\001\0\0 1EE..", 416) = 416
close(4) = 0

What could cause this sort of behavior? It happened three times today! -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What causes slow performance under load?
On Sat, Apr 18, 2009 at 09:41:39PM -0500, Tim wrote: On Sat, Apr 18, 2009 at 9:01 PM, Gary Mills mi...@cc.umanitoba.ca wrote: On Sat, Apr 18, 2009 at 06:53:30PM -0400, Ellis, Mike wrote: In case the writes are a problem: When zfs sends a sync-command to the iscsi luns, does the netapp just ack it, or does it wait till it fully destages? Might make sense to disable write/sync in /etc/system to be sure. So far I haven't been able to get an answer to that question from Netapp. I'm assuming that it acks it as soon as it's in the Netapp's non-volatile write cache. IIRC, it should just ack it. What version of ONTAP are you running? It seems to be this one: MODEL: FAS3020-R5 SW VERSION: 7.2.3 -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What causes slow performance under load?
On Sat, Apr 18, 2009 at 11:45:54PM -0500, Mike Gerdts wrote: [perf-discuss cc'd] On Sat, Apr 18, 2009 at 4:27 PM, Gary Mills mi...@cc.umanitoba.ca wrote: Many other layers are involved in this server. We use scsi_vhci for redundant I/O paths and Sun's Iscsi initiator to connect to the storage on our Netapp filer. The kernel plays a part as well. How do we determine which layer is responsible for the slow performance? Have you disabled the nagle algorithm for the iscsi initiator? http://bugs.opensolaris.org/view_bug.do?bug_id=6772828 I tried that on our test IMAP backend the other day. It made no significant difference to read or write times or to ZFS I/O bandwidth. I conclude that the Iscsi initiator has already sized its TCP packets to avoid Nagle delays. Also, you may want to consider doing backups from the NetApp rather than from the Solaris box. I've certainly recommended finding a different way to perform backups. Assuming all of your LUNs are in the same volume on the filer, a snapshot should be a crash-consistent image of the zpool. You could verify this by making the snapshot rw and trying to import the snapshotted LUNs on another host. That part sounds scary! The filer exports four LUNs that are combined into one ZFS pool on the IMAP server. These LUNs are volumes on the filer. How can we safely import them on another host? Anyway, this would remove the backup-related stress on the T2000. You can still do snapshots at the ZFS layer to give you file level restores. If the NetApp caught on fire, you would simply need to restore the volume containing the LUNs (presumably a small collection of large files) which would go a lot quicker than a large collection of small files. Yes, a disaster recovery would be much quicker in this case. Since iSCSI is in the mix, you should also be sure that your network is appropriately tuned. 
Assuming that you are using the onboard e1000g NICs, be sure that none of the bad counters are incrementing:

$ kstat -p e1000g | nawk '$0 ~ /err|drop|fail|no/ && $NF != 0'

If this gives any output, there is likely something amiss with your network. Only this:

e1000g:0:e1000g0:unknowns 1764449

I don't know what those are, but it's e1000g1 and e1000g2 that are used for the Iscsi network. The output from iostat -xCn 10 could be interesting as well. If asvc_t is high (>30?), it means the filer is being slow to respond. If wsvc_t is frequently non-zero, there is some sort of a bottleneck that prevents the server from sending requests to the filer. Perhaps you have tuned ssd_max_throttle or Solaris has backed off because the filer said to slow down. (Assuming that ssd is used with iSCSI LUNs). Here's an example, taken from one of the busy periods:

                      extended device statistics
   r/s    w/s    kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
   0.0    5.0     0.0     7.7   0.0   0.1     4.1    24.8   1   1  c1t2d0
  27.0   13.8  1523.4   172.9   0.0   0.5     0.0    11.8   0  38  c4t60A98000433469764E4A2D456A644A74d0
  42.0   21.4  2027.3   350.0   0.0   0.9     0.0    13.9   0  60  c4t60A98000433469764E4A2D456A696579d0
  40.8   25.0  1993.5   339.1   0.0   0.8     0.0    11.8   0  52  c4t60A98000433469764E4A476D2F664E4Fd0
  42.0   26.6  1968.4   319.1   0.0   0.8     0.0    11.8   0  56  c4t60A98000433469764E4A476D2F6B385Ad0

The service times seem okay to me. There's no `throttle' setting in any of the relevant driver conf files. What else is happening on the filer when mail gets slow? That is, are you experiencing slowness due to a mail peak or due to some research project that happens to be on the same spindles? What does the network look like from the NetApp side? Our Netapp guy tells me that the filer is operating normally when the problem occurs. The Iscsi network is less than 10% utilized. Are the mail server and the NetApp attached to the same switch, or are they at opposite ends of the campus? Is there something between them that is misbehaving? I don't think so. 
We have dedicated ethernet ports on both the IMAP server and the filer for Iscsi, along with a pair of dedicated switches. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
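For anyone wanting to try the Nagle experiment mentioned above, the usual way to disable it system-wide on Solaris 10 is the tcp_naglim_def tunable (a sketch; this affects all TCP connections and does not persist across reboots unless placed in a boot-time script):

    # ndd -set /dev/tcp tcp_naglim_def 1
    # ndd -get /dev/tcp tcp_naglim_def

Setting the Nagle limit to 1 byte effectively turns the algorithm off.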
[zfs-discuss] What causes slow performance under load?
We have an IMAP server with ZFS for mailbox storage that has recently become extremely slow on most weekday mornings and afternoons. When one of these incidents happens, the number of processes increases, the load average increases, but ZFS I/O bandwidth decreases. Users notice very slow response to IMAP requests. On the server, even `ps' becomes slow. We've tried a number of things, each of which made an improvement, but the problem still occurs. The ZFS ARC size was about 10 GB, but was diminishing to 1 GB when the server was busy. In fact, it was unusable when that happened. Upgrading memory from 16 GB to 64 GB certainly made a difference. The ARC size is always over 30 GB now. Next, we limited the number of `lmtpd' (local delivery) processes to 64. With those two changes, the server still became very slow at busy times, but no longer became unresponsive. The final change was to disable ZFS prefetch. It's not clear if this made an improvement. The server is a T2000 running Solaris 10. It's a Cyrus murder back- end, essentially only an IMAP server. We did recently upgrade the front-end, from a 4-CPU SPARC box to a 16-core Intel box with more memory, also running Solaris 10. The front-end runs sendmail and proxies IMAP and POP connections to the back-end, and also forwards SMTP for local deliveries to the back-end, using LMTP. Cyrus runs thousands of `imapd' processes, with many `pop3d', and `lmtpd' processes as well. This should be an ideal workload for a Niagara box. All of these memory-map several moderate-sized databases, both Berkeley DB and skiplist types, and occasionally update those databases. Our EMC Networker client also often runs during the day, doing backups. All of the IMAP mailboxes reside on six ZFS filesystems, using a single 2-TB pool. It's only 51% occupied at the moment. Many other layers are involved in this server. We use scsi_vhci for redundant I/O paths and Sun's Iscsi initiator to connect to the storage on our Netapp filer. 
The kernel plays a part as well. How do we determine which layer is responsible for the slow performance? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What causes slow performance under load?
On Sat, Apr 18, 2009 at 05:25:17PM -0500, Bob Friesenhahn wrote: On Sat, 18 Apr 2009, Gary Mills wrote: How do we determine which layer is responsible for the slow performance? If the ARC size is diminishing under heavy load then there must be excessive pressure for memory from the kernel or applications. A 30GB ARC is quite large. The slowdown likely increases the amount of RAM needed since more simultaneous requests are taking place at once and not completing as quickly as they should. Once the problem starts, it makes itself worse. It was diminishing under load when the server had only 16 GB of memory. There certainly was pressure then, so much so that the server became unresponsive. Once we upgraded that to 64 GB, the ARC size stayed high. I gather then that there's no longer pressure for memory by any of the components that might need it. It is good to make sure that the backup software is not the initial cause of the cascade effect. The backup is also very slow, often running for 24 hours. Since it's spending most of its time reading files, I assume that it must be cycling a cache someplace. I don't know if it's suffering from the same performance problem or if it's interfering with the IMAP service. Certainly, killing the backup doesn't seem to provide any relief. I don't like the idea of backups running in the daytime, but I get overruled in that one. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
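A simple way to watch whether the ARC is being squeezed during an incident is to sample the arcstats kstat (standard on Solaris 10), e.g. once a minute:

    $ while true; do
    >   kstat -p zfs:0:arcstats:size zfs:0:arcstats:c
    >   sleep 60
    > done

Here `size' is the current ARC size in bytes and `c' is its target size; a falling `c' under load suggests memory pressure from outside ZFS.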
Re: [zfs-discuss] What causes slow performance under load?
On Sat, Apr 18, 2009 at 09:58:19PM -0400, Ellis, Mike wrote: I've found that (depending on the backup software) the backup agents tend to run a single thread per filesystem. While they can back up several filesystems concurrently, the single filesystem backup is single-threaded... Yes, they do that. There are two of them running right now, but together they're only using 0.6% CPU. They're sleeping most of the time. I assume you're using zfs snapshots so you don't get fuzzy backups (over the 20-hour period...) That's what I've been recommending. We do have 14 daily snapshots available. I named them by Julian date, but our backups person doesn't like them because the names keep changing. Can you take a snapshot, and then have your backup software, instead of backing up one entire fs/tree, back up a bunch of the high-level filesystems concurrently? That could make a big difference on something like a T2000. Wouldn't there be one recent snapshot for each ZFS filesystem? We've certainly discussed backing up snapshots, but I wouldn't expect it to be much different. Wouldn't it still read all of the same files, except for ones that were added after the snapshot was taken? (You're not by chance using any type of ssh-transfers etc as part of the backups are you) No, Networker uses RPC to connect to the backup server, but there's no encryption or compression on the client side. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
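A sketch of the snapshot-based backup discussed above, using a fixed snapshot name so the backup configuration never changes (the pool name `space' is from this thread; the snapshot name `backup' is arbitrary):

    # zfs snapshot -r space@backup
    (point the backup client at /space/<fs>/.zfs/snapshot/backup for each filesystem)
    # zfs destroy -r space@backup

This gives a consistent, unchanging image for the whole backup window, though it reads the same files and so would not by itself speed the backup up.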
Re: [zfs-discuss] What causes slow performance under load?
On Sat, Apr 18, 2009 at 06:06:49PM -0700, Richard Elling wrote: [CC'ed to perf-discuss] Gary Mills wrote: We have an IMAP server with ZFS for mailbox storage that has recently become extremely slow on most weekday mornings and afternoons. When one of these incidents happens, the number of processes increases, the load average increases, but ZFS I/O bandwidth decreases. Users notice very slow response to IMAP requests. On the server, even `ps' becomes slow. If memory is being stolen from the ARC, then the consumer must be outside of ZFS. I think this is a case for a traditional performance assessment. It was being stolen from the ARC, but once we added memory, that was no longer the case. ZFS is still one of the suspects. The server is a T2000 running Solaris 10. It's a Cyrus murder back- end, essentially only an IMAP server. We did recently upgrade the front-end, from a 4-CPU SPARC box to a 16-core Intel box with more memory, also running Solaris 10. The front-end runs sendmail and proxies IMAP and POP connections to the back-end, and also forwards SMTP for local deliveries to the back-end, using LMTP. Cyrus runs thousands of `imapd' processes, with many `pop3d', and `lmtpd' processes as well. This should be an ideal workload for a Niagara box. All of these memory-map several moderate-sized databases, both Berkeley DB and skiplist types, and occasionally update those databases. Our EMC Networker client also often runs during the day, doing backups. All of the IMAP mailboxes reside on six ZFS filesystems, using a single 2-TB pool. It's only 51% occupied at the moment. Many other layers are involved in this server. We use scsi_vhci for redundant I/O paths and Sun's Iscsi initiator to connect to the storage on our Netapp filer. The kernel plays a part as well. How do we determine which layer is responsible for the slow performance? prstat is your friend. Find out who is consuming the resources and work from there. 
What resources are visible with prstat, other than CPU and memory? Even at the busiest times, all of the processes only add up to about 6% of the CPU. The load average does rise alarmingly. Nothing is using large amounts of memory, although with thousands of processes, it would add up. I've found that it often makes sense to create processor sets and segregate dissimilar apps into different processor sets. mpstat can then clearly show how each processor set consumes its processors. IMAP workloads can be very tricky, because of the sort of I/O generated and because IMAP allows searching to be done on the server, rather than the client (eg POP) What would I look for with mpstat? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
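A minimal sketch of the processor-set approach Richard describes, with hypothetical CPU numbers and the Networker process as the candidate to segregate:

    # psrset -c 24-31               (create a set from CPUs 24-31; prints its id)
    # psrset -b 1 <networker-pid>   (bind the process to set 1)
    $ mpstat 10                     (watch the set's CPUs separately from the rest)

On a T2000 the CPU numbers are hardware threads (8 cores x 4 threads), so whole cores, in groups of four threads, are the natural granularity for a set.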
Re: [zfs-discuss] Any news on ZFS bug 6535172?
On Mon, Apr 13, 2009 at 09:08:09AM +0530, Sanjeev wrote:
 How full is the pool ?

Only 50%, but it started with two 500-gig LUNs initially. We added two more when it got up to 300 gigabytes.

 # zpool list
 NAME    SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
 space  1.99T  1.02T   992G   51%  ONLINE  -

 # zpool status
   pool: space
  state: ONLINE
 status: The pool is formatted using an older on-disk format.  The pool
         can still be used, but some features are unavailable.
 action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
         pool will no longer be accessible on older software versions.
  scrub: none requested
 config:

         NAME                                     STATE   READ WRITE CKSUM
         space                                    ONLINE     0     0     0
           c4t60A98000433469764E4A2D456A644A74d0  ONLINE     0     0     0
           c4t60A98000433469764E4A2D456A696579d0  ONLINE     0     0     0
           c4t60A98000433469764E4A476D2F6B385Ad0  ONLINE     0     0     0
           c4t60A98000433469764E4A476D2F664E4Fd0  ONLINE     0     0     0

 errors: No known data errors

-- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
[zfs-discuss] Any news on ZFS bug 6535172?
We're running a Cyrus IMAP server on a T2000 under Solaris 10 with about 1 TB of mailboxes on ZFS filesystems. Recently, when under load, we've had incidents where IMAP operations became very slow. The general symptoms are that the number of imapd, pop3d, and lmtpd processes increases, the CPU load average increases, but the ZFS I/O bandwidth decreases. At the same time, ZFS filesystem operations become very slow. A rewrite of a small file can take two minutes. We've added memory; this was an improvement, but the incidents continued. The next step is to disable ZFS prefetch and test this under load. If that doesn't help either, we're down to ZFS bugs. Our incidents seem similar to the ones at UC Davis: http://vpiet.ucdavis.edu/docs/EmailReviewCmte.Report_Feb2008.pdf These were attributed to bug 6535160, but this one is fixed on our server with patch 127127-11. Bug 6535172, ``zil_sync causing long hold times on zl_lock'', doesn't have a patch yet: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6535172 Could this bug cause our problem? How do I confirm that it does? Is there a workaround? Cyrus IMAP uses several moderate-sized databases that are memory-mapped by all processes. I can move these from ZFS to UFS if this is likely to help. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] Any news on ZFS bug 6535172?
On Sun, Apr 12, 2009 at 10:49:49AM -0700, Richard Elling wrote:
 Gary Mills wrote:
  We're running a Cyrus IMAP server on a T2000 under Solaris 10 with
  about 1 TB of mailboxes on ZFS filesystems. Recently, when under load,
  we've had incidents where IMAP operations became very slow. The
  general symptoms are that the number of imapd, pop3d, and lmtpd
  processes increases, the CPU load average increases, but the ZFS I/O
  bandwidth decreases. At the same time, ZFS filesystem operations
  become very slow. A rewrite of a small file can take two minutes.
 Bandwidth is likely not the issue. What does the latency to disk
 look like?

Yes, I have statistics! This set was taken during an incident on Thursday. The load average was 12. There were about 5700 Cyrus processes running. Here are the relevant portions of `iostat -xn 5 4':

                     extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  23.8   20.7 1195.0  677.8  0.0  1.0    0.0   22.2   0  37 c4t60A98000433469764E4A2D456A644A74d0
  29.0   23.5 1438.3  626.8  0.0  1.3    0.0   25.4   0  44 c4t60A98000433469764E4A2D456A696579d0
  22.8   26.6 1356.7  822.1  0.0  1.3    0.0   26.2   0  32 c4t60A98000433469764E4A476D2F664E4Fd0
  26.4   27.3 1516.0  850.7  0.0  1.4    0.0   26.5   0  38 c4t60A98000433469764E4A476D2F6B385Ad0
                     extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  39.7   27.0 1395.8  285.5  0.0  1.1    0.0   16.3   0  51 c4t60A98000433469764E4A2D456A644A74d0
  52.5   29.8 1890.8  175.1  0.0  1.8    0.0   22.3   0  63 c4t60A98000433469764E4A2D456A696579d0
  30.0   33.3 1940.2  432.8  0.0  1.2    0.0   19.4   0  34 c4t60A98000433469764E4A476D2F664E4Fd0
  39.9   42.5 2062.1  616.7  0.0  1.9    0.0   22.9   0  50 c4t60A98000433469764E4A476D2F6B385Ad0
                     extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  43.8   47.6 1691.5  504.8  0.0  1.6    0.0   17.3   0  59 c4t60A98000433469764E4A2D456A644A74d0
  55.4   62.4 2027.8  517.0  0.0  2.2    0.0   18.5   0  72 c4t60A98000433469764E4A2D456A696579d0
  18.6   76.8  682.3  843.5  0.0  1.1    0.0   12.0   0  34 c4t60A98000433469764E4A476D2F664E4Fd0
  30.2  115.8  873.6  905.8  0.0  2.2    0.0   15.1   0  52 c4t60A98000433469764E4A476D2F6B385Ad0
                     extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  49.8   21.8 2438.7  400.3  0.0  1.7    0.0   24.0   0  62 c4t60A98000433469764E4A2D456A644A74d0
  53.2   34.0 2741.3  218.0  0.0  2.1    0.0   24.4   0  63 c4t60A98000433469764E4A2D456A696579d0
  14.0   26.8  506.2  482.1  0.0  0.7    0.0   18.2   0  32 c4t60A98000433469764E4A476D2F664E4Fd0
  23.4   38.8  484.5  582.3  0.0  1.1    0.0   18.2   0  42 c4t60A98000433469764E4A476D2F6B385Ad0

-- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
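For what it's worth, asvc_t in that `iostat -xn' output is the average service time per I/O in milliseconds, and 20-26 ms is slow for LUNs that are supposed to be fronted by filer NVRAM. A small awk filter can flag the outliers mechanically; the sample rows here are re-typed from the first interval, with the long c4t... device names shortened to d0-d3 for readability, and the thresholds are illustrative only:

```shell
# Sample rows re-typed from the first iostat interval; device names
# shortened to d0-d3.
iostat_sample='r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
23.8 20.7 1195.0 677.8 0.0 1.0 0.0 22.2 0 37 d0
29.0 23.5 1438.3 626.8 0.0 1.3 0.0 25.4 0 44 d1
22.8 26.6 1356.7 822.1 0.0 1.3 0.0 26.2 0 32 d2
26.4 27.3 1516.0 850.7 0.0 1.4 0.0 26.5 0 38 d3'

# Flag devices averaging more than 25 ms per I/O (field 8, asvc_t)
# or more than 60% busy (field 10, %b).
slow=$(printf '%s\n' "$iostat_sample" |
    awk 'NR > 1 && ($8 > 25 || $10 > 60) { print $11 }')
echo "slow devices:" $slow
```

Running the same filter against each live interval makes it easy to see whether latency is concentrated on one LUN (a path or filer-volume problem) or spread across all four (a shared bottleneck).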
Re: [zfs-discuss] Any news on ZFS bug 6535172?
On Sun, Apr 12, 2009 at 12:23:03PM -0700, Richard Elling wrote:
 These disks are pretty slow. JBOD? They are not 100% busy, which means
 that either the cached data is providing enough response to the apps, or
 the apps are not capable of producing enough load -- which means the
 bottleneck may be elsewhere.

They are four 500-gig Iscsi LUNs exported from a Netapp filer, with Solaris multipathing. Yes, the I/O is normally mostly writes, with reads being satisfied from various caches.

 You can use fsstat to get a better idea of what sort of I/O the
 applications are seeing from the file system. That might be revealing.

Thanks for the suggestion. There are so many `*stat' commands that I forget about some of them. I've run a baseline with `fsstat', but the server is mostly idle now. I'll have to wait for another incident! What option to `fsstat' do you recommend? Here's a sample of the default output:

 $ fsstat zfs 5 5
   new  name   name  attr  attr lookup rddir  read  read write write
  file remov   chng   get   set    ops   ops   ops bytes   ops bytes
 3.56M 1.53M  3.83M 1.07G 1.53M  2.47G 4.09M 56.4M 1.83T 61.1M  306G zfs
    13     1     16 1.40K     5  11.6K     0     5 38.5K   125  127K zfs
    18     0     18 3.61K     6  21.1K     0     6 16.7K    97  244K zfs
    26     4     25 1.73K    10  6.76K     0    18  178K   142  817K zfs
    12     3     13 3.90K     5  9.00K     0     5 32.8K   108  287K zfs
     7     2      7 1.98K     3  5.87K     0     7 67.5K   108 2.34M zfs

-- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] Any news on ZFS bug 6535172?
On Sun, Apr 12, 2009 at 05:01:57PM -0400, Ellis, Mike wrote:
 Is the netapp iscsi-lun forcing a full sync as a part of zfs's 5-second
 sync/flush type of thing? (Not needed since the netapp guarantees the
 write once it acks it)

I've asked that of our Netapp guy, but so far I haven't heard from him. Is there a way to determine this from the Iscsi initiator side? I do have a test mail server that I can play with.

 That could make a big difference... (Perhaps disabling the write-flush
 in zfs will make a big difference here, especially on a write-heavy
 system)

-- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
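If the filer really does guarantee acknowledged writes (battery-backed NVRAM), the SYNCHRONIZE CACHE commands ZFS issues at the end of each transaction group are redundant and can be suppressed. A hedged sketch: on later Solaris 10 updates the tunable is zfs_nocacheflush, but it is global to the machine, so this is only safe when every pool sits on non-volatile storage:

```
* /etc/system -- suppress ZFS cache-flush requests.
* Only safe when ALL pools are backed by non-volatile (NVRAM) cache.
* A reboot is required for the setting to take effect.
set zfs:zfs_nocacheflush = 1
```

With local disks that have volatile write caches this setting risks losing the last few seconds of committed data on power failure, which is why it defaults to off.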
Re: [zfs-discuss] Efficient backup of ZFS filesystems?
On Thu, Apr 09, 2009 at 04:25:58PM +0200, Henk Langeveld wrote: Gary Mills wrote: I've been watching the ZFS ARC cache on our IMAP server while the backups are running, and also when user activity is high. The two seem to conflict. Fast response for users seems to depend on their data being in the cache when it's needed. Most of the disk I/O seems to be writes in this situation. However, the backup needs to stat all files and read many of them. I'm assuming that all of this information is also added to the ARC cache, even though it may never be needed again. It must also evict user data from the cache, causing it to be reloaded every time it's needed. Find out whether you have a problem first. If not, don't worry, but read on. If you do have a problem, add memory or an L2ARC device. We do have a problem, but not with the backup itself. The backup is slow, but I expect that's just because it's reading a very large number of small files. Our problem is with normal IMAP operations becoming quite slow at times. I'm wondering if the backup is contributing to this problem. The ARC was designed to mitigate the effect of any single burst of sequential I/O, but the size of the cache dedicated to more Frequently used pages (the current working set) will still be reduced, depending on the amount of activity on either side of the cache. That's a nice design, better than a simple cache. As the ARC maintains a shadow list of recently evicted pages from both sides of the cache, such pages that are accessed again will then return to the 'Frequent' side of the cache. There will be continuous competition between 'Recent' and 'Frequent' sides of the ARC (and for convenience, I'm glossing over the existence of 'Locked' pages). Several reasons might cause pathological behaviour - a backup process might access the same metadata multiple times, causing that data to be promoted to 'Frequent', flushing out application related data. 
(ZFS does not differentiate between data and metadata for resource allocation, they all use the same I/O mechanism and cache.) That might be possible in our case. On the other hand, you might just not have sufficient memory to keep most of your metadata in the cache, or the backup process is just too aggressive. Adding memory or an L2cache might help. We've added memory. That did seem to help, although the problem's still there. I assume the L2cache is not available in Solaris 10. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
[zfs-discuss] Efficient backup of ZFS filesystems?
I've been watching the ZFS ARC cache on our IMAP server while the backups are running, and also when user activity is high. The two seem to conflict. Fast response for users seems to depend on their data being in the cache when it's needed. Most of the disk I/O seems to be writes in this situation. However, the backup needs to stat all files and read many of them. I'm assuming that all of this information is also added to the ARC cache, even though it may never be needed again. It must also evict user data from the cache, causing it to be reloaded every time it's needed. We use Networker for backups now. Is there some way to configure ZFS so that backups don't churn the cache? Is there a different way to perform backups to avoid this problem? We do keep two weeks of daily ZFS snapshots to use for restores of recently-lost data. We still need something for longer-term backups. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
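One ZFS answer to the cache-churn question eventually appeared as the per-filesystem primarycache and secondarycache properties (zpool version 10 and later, so quite possibly not on the Solaris 10 release discussed here). A hedged sketch, with a hypothetical filesystem name:

```
# Cache only metadata from this filesystem in the ARC, so a full
# backup walk doesn't evict user data cached from other filesystems.
# "space/staging" is a hypothetical filesystem name.
zfs set primarycache=metadata space/staging
zfs get primarycache space/staging
```

The trade-off is that normal reads from that filesystem also stop being cached, so the property fits a dedicated backup-staging or scratch filesystem better than a live mailbox tree.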
[zfs-discuss] How to set a minimum ARC size?
We have an IMAP server that uses ZFS filesystems for all of its mailbox and database files. As the number of users increases, with a consequent increase in the number of processes, the ARC size decreases from 10 gigabytes down to 2 gigabytes. I know that it's supposed to do that, but in this case ZFS is starved for memory and the whole thing slows to a crawl. Is there a way to set a minimum ARC size so that this doesn't happen? We are going to upgrade the memory, but a lower limit on ARC size might still be a good idea. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
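For the record, the usual knobs live in /etc/system: zfs_arc_max caps the ARC, and zfs_arc_min (honored on newer Solaris 10 kernels, so check your release) sets the floor asked about above. A sketch — the values are examples only, in bytes, and take effect at the next reboot:

```
* /etc/system -- ARC sizing (values in bytes; example figures)
* Cap the ARC at 8 GB:
set zfs:zfs_arc_max = 0x200000000
* Keep at least 2 GB in the ARC, where the kernel supports it:
set zfs:zfs_arc_min = 0x80000000
```

A floor is a double-edged sword on a machine with thousands of processes: memory the ARC refuses to give back is memory the page scanner will reclaim from applications instead.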
Re: [zfs-discuss] zfs related google summer of code ideas - your vote
On Tue, Mar 03, 2009 at 11:35:40PM +0200, C. Bergström wrote: Here's more or less what I've collected... [..] 10) Did I miss something.. I suppose my RFE for two-level ZFS should be included, unless nobody intends to attach a ZFS file server to a SAN with ZFS on application servers. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] zfs related google summer of code ideas - your vote
On Wed, Mar 04, 2009 at 01:20:42PM -0500, Miles Nordin wrote: gm == Gary Mills mi...@cc.umanitoba.ca writes: gm I suppose my RFE for two-level ZFS should be included, Not that my opinion counts for much, but I wasn't deaf to it---I did respond. I appreciate that. I thought it was kind of based on mistaken understanding. It included this strangeness of the upper ZFS ``informing'' the lower one when corruption had occurred on the network, and the lower ZFS was supposed to do something with the physical disks...to resolve corruption on the network? why? IIRC several others pointed out the same bogosity. It's simply a consequence of ZFS's end-to-end error detection. There are many different components that could contribute to such errors. Since only the lower ZFS has data redundancy, only it can correct the error. Of course, if something in the data path consistently corrupts the data regardless of its origin, it won't be able to correct the error. The same thing can happen in the simple case, with one ZFS over physical disks. It makes slightly more sense in the write direction than the read direction maybe, but I still don't fully get the plan. Is it a new protocol to replace iSCSI? or NFS? or, what? Is it a re-invention of pNFS or Lustre, but with more work since you're starting from zero, and less architectural foresight? I deliberately did not specify the protocol to keep the concept general. Anything that works and solves the problem would be good. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] zfs related google summer of code ideas - your vote
On Wed, Mar 04, 2009 at 06:31:59PM -0700, Dave wrote: Gary Mills wrote: On Wed, Mar 04, 2009 at 01:20:42PM -0500, Miles Nordin wrote: gm == Gary Mills mi...@cc.umanitoba.ca writes: gm I suppose my RFE for two-level ZFS should be included, It's simply a consequence of ZFS's end-to-end error detection. There are many different components that could contribute to such errors. Since only the lower ZFS has data redundancy, only it can correct the error. Of course, if something in the data path consistently corrupts the data regardless of its origin, it won't be able to correct the error. The same thing can happen in the simple case, with one ZFS over physical disks. I would argue against building this into ZFS. Any corruption happening on the wire should not be the responsibility of ZFS. If you want to make sure your data is not corrupted over the wire, use IPSec. If you want to prevent corruption in RAM, use ECC sticks, etc. But what if the `wire' is a SCSI bus? Would you want ZFS to do error correction in that case? There are many possible wires. Every component does its own error checking of some sort, but in its own domain. This brings us back to end-to-end error checking again. Since we are designing a filesystem, that's where the reliability should reside. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] RFE for two-level ZFS
On Thu, Feb 19, 2009 at 12:36:22PM -0800, Brandon High wrote: On Thu, Feb 19, 2009 at 6:18 AM, Gary Mills mi...@cc.umanitoba.ca wrote: Should I file an RFE for this addition to ZFS? The concept would be to run ZFS on a file server, exporting storage to an application server where ZFS also runs on top of that storage. All storage management would take place on the file server, where the physical disks reside. The application server would still perform end-to-end error checking but would notify the file server when it detected an error. You could accomplish most of this by creating an iSCSI volume on the storage server, then using ZFS with no redundancy on the application server. That's what I'd like to do, and what we do now. The RFE is to take advantage of the end-to-end checksums in ZFS in spite of having no redundancy on the application server. Having all of the disk management in one place is a great benefit. You'll have two layers for checksums, one on the storage server's zpool and a second on the application server's filesystem. The application server won't be able to notify the storage server that it's detected a bad checksum, other than through retries, but you can write a user-space monitor that watches for ZFS checksum errors and sends notification to the storage server. The RFE is to enable the two instances of ZFS to exchange information about checksum failures. To poke a hole in your idea: What if the app server does find an error? What's the storage server to do at that point? Provided that the storage server's zpool already has redundancy, the data written to disk should already be exactly what was received from the client. If you want to have the ability to recover from errors on the app server, you should use a redundant zpool -- either a mirror or a raidz. Yes, if the two instances of ZFS disagree, we have a problem that needs to be resolved: they need to cooperate in this endeavour. 
If you're concerned about data corruption in transit, then it sounds like something akin to T10 DIF (which others mentioned) would fit the bill. You could also tunnel the traffic over a transit layer such as TLS or SSH that provides a measure of validation. Latency should be fun to deal with however. I'm mainly concerned that ZFS on the application server will detect a checksum error and then be unable to preserve the data. Iscsi already has TCP checksums. I assume that FC-AL does as well. Using more reliable checksums has no benefit if ZFS will still detect end-to-end checksum errors. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] RFE for two-level ZFS
On Thu, Feb 19, 2009 at 09:59:01AM -0800, Richard Elling wrote: Gary Mills wrote: Should I file an RFE for this addition to ZFS? The concept would be to run ZFS on a file server, exporting storage to an application server where ZFS also runs on top of that storage. All storage management would take place on the file server, where the physical disks reside. The application server would still perform end-to-end error checking but would notify the file server when it detected an error. Currently, this is done as a retry. But retries can suffer from cached badness. So, ZFS on the application server would retry the read from the storage server. This would be the same as it does from a physical disk, I presume. However, if the checksum failure persisted, it would declare an error. That's where the RFE comes in, because it would then notify the file server to utilize its redundant data source. Perhaps this could be done as part of the retry, using existing protocols. There are several advantages to this configuration. One current recommendation is to export raw disks from the file server. Some storage devices, including I assume Sun's 7000 series, are unable to do this. Another is to build two RAID devices on the file server and to mirror them with ZFS on the application server. This is also sub-optimal as it doubles the space requirement and still does not take full advantage of ZFS error checking. Splitting the responsibilities works around these problems I'm not convinced, but here is how you can change my mind. 1. Determine which faults you are trying to recover from. I don't think this has been clearly identified, except that they are ``those faults that are only detected by end-to-end checksums''. 2. Prioritize these faults based on their observability, impact, and rate. Perhaps the project should be to extend end-to-end checksums in situations that don't have end-to-end redundancy. Redundancy at the storage layer would be required, of course. 
-- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
[zfs-discuss] RFE for two-level ZFS
Should I file an RFE for this addition to ZFS? The concept would be to run ZFS on a file server, exporting storage to an application server where ZFS also runs on top of that storage. All storage management would take place on the file server, where the physical disks reside. The application server would still perform end-to-end error checking but would notify the file server when it detected an error. There are several advantages to this configuration. One current recommendation is to export raw disks from the file server. Some storage devices, including I assume Sun's 7000 series, are unable to do this. Another is to build two RAID devices on the file server and to mirror them with ZFS on the application server. This is also sub-optimal as it doubles the space requirement and still does not take full advantage of ZFS error checking. Splitting the responsibilities works around these problems. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] ZFS: unreliable for professional usage?
On Thu, Feb 12, 2009 at 11:53:40AM -0500, Greg Palmer wrote: Ross wrote: I can also state with confidence that very, very few of the 100 staff working here will even be aware that it's possible to unmount a USB volume in windows. They will all just pull the plug when their work is saved, and since they all come to me when they have problems, I think I can safely say that pulling USB devices really doesn't tend to corrupt filesystems in Windows. Everybody I know just waits for the light on the device to go out. The key here is that Windows does not cache writes to the USB drive unless you go in and specifically enable them. It caches reads but not writes. If you enable them you will lose data if you pull the stick out before all the data is written. This is the type of safety measure that needs to be implemented in ZFS if it is to support the average user instead of just the IT professionals. That implies that ZFS will have to detect removable devices and treat them differently than fixed devices. It might have to be an option that can be enabled for higher performance with reduced data security. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] Two-level ZFS
On Mon, Feb 02, 2009 at 09:53:15PM +0700, Fajar A. Nugraha wrote: On Mon, Feb 2, 2009 at 9:22 PM, Gary Mills mi...@cc.umanitoba.ca wrote: On Sun, Feb 01, 2009 at 11:44:14PM -0500, Jim Dunham wrote: If there are two (or more) instances of ZFS in the end-to-end data path, each instance is responsible for its own redundancy and error recovery. There is no in-band communication between one instance of ZFS and another instance of ZFS located elsewhere in the same end-to-end data path. I must have been unclear when I stated my question. The configuration, with ZFS on both systems, redundancy only on the file server, and end-to-end error detection and correction, does not exist. What additions to ZFS are required to make this work? None. It's simply not possible. You're talking about the existing ZFS implementation; I'm not! Is ZFS now frozen in time, with only bugs being fixed? I have difficulty believing that. Putting a wire between two layers of ZFS should indeed be possible. Think about the Amber Road products, from the Fishworks team. They run ZFS and export Iscsi and FC-AL. Redundancy and disk management is already present in these products. Should it be implemented again in each of the servers that imports LUNs from these products? I think not. I believe Jim already stated that, but let me give some additional comment that might be helpful. (1) zfs can provide end-to-end protection ONLY if you use it end-to-end. This means : - no other filesystem on top of it (e.g. do not use UFS on zvol or something similar) - no RAID/MIRROR under it (i.e. it must have access to the disk as JBOD) Exactly! That leads to my question. What information needs to be exchanged between ZFS on the file server and ZFS on the application server so that end-to-end protection can be maintained with redundancy and disk management only on the file server? 
-- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] Two-level ZFS
On Sun, Feb 01, 2009 at 11:44:14PM -0500, Jim Dunham wrote: I wrote: I realize that this configuration is not supported. The configuration is supported, but not in the manner mentioned below. If there are two (or more) instances of ZFS in the end-to-end data path, each instance is responsible for its own redundancy and error recovery. There is no in-band communication between one instance of ZFS and another instance of ZFS located elsewhere in the same end-to-end data path. I must have been unclear when I stated my question. The configuration, with ZFS on both systems, redundancy only on the file server, and end-to-end error detection and correction, does not exist. What additions to ZFS are required to make this work? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
[zfs-discuss] Two-level ZFS
I realize that this configuration is not supported. What's required to make it work? Consider a file server running ZFS that exports a volume with Iscsi. Consider also an application server that imports the LUN with Iscsi and runs a ZFS filesystem on that LUN. All of the redundancy and disk management takes place on the file server, but end-to-end error detection takes place on the application server. This is a reasonable configuration, is it not? When the application server detects a checksum error, what information does it have to return to the file server so that it can correct the error? The file server could then retry the read from its redundant source, which might be a mirror or might be synthetic data from RAID-5. It might also indicate that a disk must be replaced. Must any information accompany each block of data sent to the application server so that the file server can identify the source of the data in the event of an error? Does this additional exchange of information fit into the Iscsi protocol, or does it have to flow out of band somehow? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
[zfs-discuss] What are the usual suspects in data errors?
I realize that any error can occur in a storage subsystem, but most of these have an extremely low probability. In this discussion, I'm interested only in those that do occur occasionally, and that are not catastrophic. Consider the common configuration of two SCSI disks connected to the same HBA that are configured as a mirror in some manner. In this case, the data path in general consists of:

 o The application
 o The filesystem
 o The drivers
 o The HBA
 o The SCSI bus
 o The controllers
 o The heads and platters

Many of those components have their own error checking. Some have error correction. For example, parity checking is done on a SCSI bus, unless it's specifically disabled. Do SATA and PATA connections also do error checking? Disk sector I/O uses CRC error checking and correction. Memory buffers would often be protected by parity memory. Is there anything more that I've missed? Now, let's consider common errors. To me, the most frequent would be a bit error on a disk sector. In this case, the controller would report a CRC error and would not return bad data. The filesystem would obtain the data from its redundant copy. I assume that ZFS would also rewrite the bad sector to correct it. The application would not see an error. Similar events would happen for a parity error on the SCSI bus. What can go wrong with the disk controller? A simple seek to the wrong track is not a problem because the track number is encoded on the platter. The controller will simply recalibrate the mechanism and retry the seek. If it computes the wrong sector, that would be a problem. Does this happen with any frequency? In this case, ZFS would detect a checksum error and obtain the data from its redundant copy. A logic error in ZFS might result in incorrect metadata being written with a valid checksum. In this case, ZFS might panic on import or might correct the error. How is this sort of error prevented? 
If the application wrote bad data to the filesystem, none of the error checking in lower layers would detect it. This would be strictly an error in the application. Some errors might result from a loss of power if some ZFS data was written to a disk cache but never was written to the disk platter. Again, ZFS might panic on import or might correct the error. How is this sort of error prevented? After all of this discussion, what other errors can ZFS checksums reasonably detect? Certainly if some of the other error checking failed to detect an error, ZFS would still detect one. How likely are these other error checks to fail? Is there anything else I've missed in this analysis? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
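On the question of what the checksums can detect: ZFS's block checksums are Fletcher-style running sums by default (fletcher2/fletcher4, with sha256 as an option), which catch bit flips, reordered or dropped words, and stuck-at patterns that simple parity misses. A much-simplified illustration of the structure — real fletcher4 runs over 32-bit words with four 64-bit accumulators, so this awk version shows only how each accumulator feeds the next and why word order matters:

```shell
# Simplified Fletcher-style checksum over the words 1, 2, 3.
# a is a plain sum; b, c, d are successively position-weighted sums,
# so swapping two words changes b, c, and d even though a is unchanged.
cksum=$(printf '1\n2\n3\n' | awk '
    { a += $1; b += a; c += b; d += c }
    END { print a, b, c, d }')
echo "checksum: $cksum"
```

The plain sum alone would miss a swap of two words; the weighted accumulators are what give Fletcher checksums their sensitivity to ordering errors such as misdirected or torn writes.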
Re: [zfs-discuss] snapshot before patching..
On Tue, Dec 30, 2008 at 02:06:16PM +0100, dick hoogendijk wrote: What kind of snapshot do I need to be on the safe side patching a S10u6 system? rpool? rpool/ROOT? rpool/ROOT/BE? Use Live Upgrade. Create a new boot environment and apply the patches to that. Activate the new BE and `init 6'. And how/what do I do to reverse to the non-patched system in case something goes terribly wrong? ;-) Just revert to the old BE. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
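In outline, the Live Upgrade cycle described above looks like this; the BE name and patch directory are examples, and on a ZFS root each new BE is a cheap clone rather than a full copy:

```
lucreate -n patched                           # clone the current BE
luupgrade -t -n patched -s /var/tmp/patches   # apply patches to the clone
luactivate patched                            # select it for the next boot
init 6                                        # boot into the patched BE
```

If the patched BE misbehaves, `luactivate' the old BE name and `init 6' again; the running system is never patched in place.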
Re: [zfs-discuss] How to create a basic new filesystem?
On Sat, Dec 20, 2008 at 03:52:46AM -0800, Uwe Dippel wrote:
 This might sound sooo simple, but it isn't. I read the ZFS
 Administration Guide and it did not give an answer; at least no simple
 answer, simple enough for me to understand. The intention is to follow
 the thread `Easiest way to replace a boot disk with a larger one'. The
 command given would be
  zpool attach rpool /dev/dsk/c1d0s0 /dev/dsk/c2d0s0
 as far as I understand in my case. What it says is
  cannot open '/dev/dsk/c2d0s0': No such device or address.
 format shows that the partition exists:

The problem is that fdisk partitions are not the same as Solaris partitions. The admin guide refers to a Solaris partition. For Solaris 10 x86, this has to be created inside an fdisk partition.

 # format
 Searching for disks...done
 AVAILABLE DISK SELECTIONS:
        0. c1d0 DEFAULT cyl 17020 alt 2 hd 255 sec 63
           /p...@0,0/pci-...@9/i...@0/c...@0,0
        1. c2d0 DEFAULT cyl 10442 alt 2 hd 255 sec 126
           /p...@0,0/pci-...@9/i...@1/c...@0,0
 Specify disk (enter its number): 1
 selecting c2d0
 Controller working list found
 [disk formatted, defect list found]
 FORMAT MENU:
 [...]
 Total disk size is 38912 cylinders
 Cylinder size is 32130 (512 byte) blocks

                                      Cylinders
 Partition  Status  Type          Start    End  Length   %
 =========  ======  ============  =====  =====  ======  ===
     1              Linux native      0     19      20    0
     2              Solaris2         19  10462   10444   27
     3              Other OS      10463  13074    2612    7
     4              EXT-DOS       13075  38912   25838   66

These are fdisk partitions.

 To my understanding, there is no need to format before using a file
 system in ZFS. The `Creating a Basic ZFS File System' section is not
 clear to me. The first (and only) command it offers creates a mirrored
 storage of a whole disk; none of which I intend to do. (I suggested
 before, to offer a guide as well containing all the *basic* commands.)
 I wonder if I really need to use format-partition first to create slice
 s0 in that second (DOS)partition of c2d0 before ZFS can use it?

The Solaris `format' command is used to create Solaris partitions, and the label that describes them. 
For a ZFS root pool, you have to use a Solaris label, and a partition (slice). This was slice 0 in your example. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to create a basic new filesystem?
On Sat, Dec 20, 2008 at 06:10:10AM -0800, Uwe Dippel wrote: thanks. All my servers run OpenBSD, so I know the difference between a DOS-partition and a slice. :) My background is Solaris SPARC, where things are simpler. Solaris writes a label to a physical disk to define slices (Solaris partitions) on the disk. The `format' command sees the physical disk. In the case of Solaris x86, this command sees one fdisk partition, which it treats as a disk. I generally create a single fdisk partition that occupies the entire disk, to return to simplicity. My confusion is about the labels. I could not label it what I wanted, like zfsed or pool, it had to be root. And since we can have only a single bf-partition per drive (dsk), I was thinking ZFS would take the (existing but unlabeled) s0 to attach to. This does not seem to be the case. The tag that appears on the partition menu isn't used in normal operation of the system. There are only a few valid choices, but `root' is fine. Out of curiosity: how does it matter (to ZFS) if /dsk/c3t1d0s0 is a complete drive or exists in a bf-partition? One way or another, /dev/dsk/c2d0s0 seems to be over-defined now. If you give `zpool' a complete disk, by omitting the slice part, it will write its own label to the drive. If you specify it with a slice, it expects that you have already defined that slice. For a root pool, it has to be a slice. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
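The whole-disk versus slice distinction comes down to two command forms (a sketch; the pool and device names here are placeholders, not taken from the thread):

```shell
# Whole disk: name the device without a slice suffix and zpool writes
# its own label to the drive; no pre-existing slice is needed.
zpool create tank c2d0

# Specific slice: the slice must already exist in the Solaris label,
# created beforehand with format(1M).  Root pools require this form.
zpool create rpool c2d0s0
```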
Re: [zfs-discuss] Split responsibility for data with ZFS
On Thu, Dec 11, 2008 at 10:41:26PM -0600, Bob Friesenhahn wrote: On Thu, 11 Dec 2008, Gary Mills wrote: The split responsibility model is quite appealing. I'd like to see ZFS address this model. Is there not a way that ZFS could delegate responsibility for both error detection and correction to the storage device, at least one more sophisticated than a physical disk? Why is split responsibility appealing? In almost any complex system, whether it be government or computing, split responsibility results in indecision and confusion. Hierarchical decision making based on common rules is another matter entirely. Now this becomes semantics. There still has to be a hierarchy, but it's split into areas of responsibility. In the case of ZFS over SAN storage, the area boundary now is the SAN cable. Unfortunately, SAN equipment is still based on technology developed in the early '80s and simply tries to behave like a more reliable disk drive, rather than a participating intelligent component in a system which may detect, tolerate, and spontaneously correct any faults. That's exactly what I'm asking. How can ZFS and SAN equipment be improved so that they cooperate to make the whole system more reliable? Converting the SAN storage into a JBOD is not a valid solution. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] Split responsibility for data with ZFS
On Wed, Dec 10, 2008 at 12:58:48PM -0800, Richard Elling wrote: Nicolas Williams wrote: On Wed, Dec 10, 2008 at 01:30:30PM -0600, Nicolas Williams wrote: On Wed, Dec 10, 2008 at 12:46:40PM -0600, Gary Mills wrote: On the server, a variety of filesystems can be created on this virtual disk. UFS is most common, but ZFS has a number of advantages over UFS. Two of these are dynamic space management and snapshots. There are also a number of objections to employing ZFS in this manner. ZFS has very strong error detection built-in, and for mirrored and RAID-Zed pools can recover from errors automatically as long as there's a mirror left or enough disks in RAID-Z left to complete the recovery. Oh, but I get it: all the redundancy here would be in the SAN, and the ZFS pools would have no mirrors, no RAID-Z. Note that you'll generally be better off using RAID-Z than HW RAID-5. Precisely because ZFS can reconstruct the correct data if it's responsible for redundancy. But note that the setup you describe puts ZFS in no worse a situation than any other filesystem. Well, actually, it does. ZFS is susceptible to a class of failure modes I classify as kill the canary types. ZFS will detect errors and complain about them, which results in people blaming ZFS (the canary). If you follow this forum, you'll see a kill the canary post about every month or so. By default, ZFS implements the policy that uncorrectable, but important failures may cause it to do an armadillo impression (staying with the animal theme ;-) but for which some other file systems, like UFS, will blissfully ignore -- putting data at risk. Occasionally, arguments will arise over whether this is the best default policy, though most folks seem to agree that data corruption is a bad thing. Later versions of ZFS, particularly that available in Solaris 10 10/08 and all OpenSolaris releases, allow system admins to have better control over these policies. Yes, that's what I was getting at. 
Without redundancy at the ZFS level, ZFS can report errors but not correct them. Of course, with a reliable SAN and storage device, those errors will never happen. Certainly, vendors of these products will claim that they have extremely high standards of data integrity. Data corruption is the worst nightmare of storage designers, after all. It rarely happens, although I have seen it on one occasion in a high-quality storage device. The split responsibility model is quite appealing. I'd like to see ZFS address this model. Is there not a way that ZFS could delegate responsibility for both error detection and correction to the storage device, at least one more sophisticated than a physical disk? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
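For what it's worth, later ZFS releases do offer a middle ground on a single SAN LUN: the `copies' property stores redundant copies of each block within the pool, letting ZFS self-heal isolated corruption without pool-level mirroring (a sketch; `tank/data' is a placeholder dataset name, and this does not protect against whole-LUN loss):

```shell
# Keep two copies of every block in this dataset, even on a single-LUN
# pool, so ZFS can repair single-block checksum errors on read or scrub.
zfs set copies=2 tank/data

# Verify the setting.
zfs get copies tank/data
```

Note that copies=2 roughly doubles the space consumed by the dataset, so it suits important data rather than the whole pool.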
[zfs-discuss] Split responsibility for data with ZFS
Large sites that have centralized their data with a SAN typically have a storage device export block-oriented storage to a server, with a fibre-channel or iSCSI connection between the two. The server sees this as a single virtual disk. On the storage device, the blocks of data may be spread across many physical disks. The storage device looks after redundancy and management of the physical disks. It may even phone home when a disk fails and needs to be replaced. The storage device provides reliability and integrity for the blocks of data that it serves, and does this well. On the server, a variety of filesystems can be created on this virtual disk. UFS is most common, but ZFS has a number of advantages over UFS. Two of these are dynamic space management and snapshots. There are also a number of objections to employing ZFS in this manner. ``ZFS cannot correct errors'' and ``you will lose all of your data'' are two of the alarming ones. Isn't ZFS supposed to ensure that data written to the disk are always correct? What's the real problem here? This is a split responsibility configuration where the storage device is responsible for integrity of the storage and ZFS is responsible for integrity of the filesystem. How can it be made to behave in a reliable manner? Can ZFS be better than UFS in this configuration? Is a different form of communication between the two components necessary in this case? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] Separate /var
On Mon, Dec 01, 2008 at 04:45:16PM -0700, Lori Alt wrote: On 11/27/08 17:18, Gary Mills wrote: On Fri, Nov 28, 2008 at 11:19:14AM +1300, Ian Collins wrote: On Fri 28/11/08 10:53 , Gary Mills [EMAIL PROTECTED] sent: On Fri, Nov 28, 2008 at 07:39:43AM +1100, Edward Irvine wrote: I'm currently working with an organisation who want to use ZFS for their full zones. Storage is SAN attached, and they also want to create a separate /var for each zone, which causes issues when the zone is installed. They believe that a separate /var is still good practice. If your mount options are different for /var and /, you will need a separate filesystem. In our case, we use `setuid=off' and `devices=off' on /var for security reasons. We do the same thing for home directories and /tmp. For zones? Sure, if you require different mount options in the zones. I looked into this and found that, using ufs, you can indeed set up the zone's /var directory as a separate file system. I don't know how LiveUpgrade works with that configuration (I didn't try it), but I was at least able to get the zone to install and boot. But with zfs, I couldn't even get a zone with a separate /var dataset to install, let alone be manageable with LiveUpgrade. I configured the zone like so:

    # zonecfg -z z4
    z4: No such zone configured
    Use 'create' to begin configuring a new zone.
    zonecfg:z4> create
    zonecfg:z4> set zonepath=/zfszones/z4
    zonecfg:z4> add fs
    zonecfg:z4:fs> set dir=/var
    zonecfg:z4:fs> set special=rpool/ROOT/s10x_u6wos_07b/zfszones/z4/var
    zonecfg:z4:fs> set type=zfs
    zonecfg:z4:fs> end
    zonecfg:z4> exit

I then get this result from trying to install the zone:

    prancer# zoneadm -z z4 install
    Preparing to install zone z4.
    ERROR: No such file or directory: cannot mount /zfszones/z4/root/var
    in non-global zone to install: the source block device or directory
    rpool/ROOT/s10x_u6wos_07b/zfszones/z1/var cannot be accessed
    ERROR: cannot setup zone z4 inherited and configured file systems
    ERROR: cannot setup zone z4 file systems inherited and configured
    from the global zone
    ERROR: cannot create zone boot environment z4

You might have to pre-create this filesystem. `special' may not be needed at all. I don't fully understand the failures here. I suspect that there are problems both in the zfs code and the zones code. It SHOULD work, though. The fact that it doesn't seems like a bug. In the meantime, I guess we have to conclude that a separate /var in a non-global zone is not supported on zfs. A separate /var in the global zone is supported, however, even when the root is zfs. I haven't tried ZFS zone roots myself, but I do have a few comments. ZFS filesystems are cheap because they don't require separate disk slices. As well, they are attribute boundaries. Those are necessary or convenient in some cases. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] Separate /var
On Fri, Nov 28, 2008 at 07:39:43AM +1100, Edward Irvine wrote: I'm currently working with an organisation who want to use ZFS for their full zones. Storage is SAN attached, and they also want to create a separate /var for each zone, which causes issues when the zone is installed. They believe that a separate /var is still good practice. If your mount options are different for /var and /, you will need a separate filesystem. In our case, we use `setuid=off' and `devices=off' on /var for security reasons. We do the same thing for home directories and /tmp. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
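The mount options mentioned above map directly onto ZFS dataset properties at creation time (a sketch; `rpool/var' is a placeholder dataset name):

```shell
# Create /var as its own dataset with the security-motivated options
# from the message above: no setuid execution, no device special files.
zfs create -o mountpoint=/var -o setuid=off -o devices=off rpool/var

# Confirm the options took effect.
zfs get setuid,devices rpool/var
```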
Re: [zfs-discuss] Separate /var
On Fri, Nov 28, 2008 at 11:19:14AM +1300, Ian Collins wrote: On Fri 28/11/08 10:53 , Gary Mills [EMAIL PROTECTED] sent: On Fri, Nov 28, 2008 at 07:39:43AM +1100, Edward Irvine wrote: I'm currently working with an organisation who want to use ZFS for their full zones. Storage is SAN attached, and they also want to create a separate /var for each zone, which causes issues when the zone is installed. They believe that a separate /var is still good practice. If your mount options are different for /var and /, you will need a separate filesystem. In our case, we use `setuid=off' and `devices=off' on /var for security reasons. We do the same thing for home directories and /tmp. For zones? Sure, if you require different mount options in the zones. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] Fwd: [osol-announce] IMPT: Do not use SXCE Build 102
On Mon, Nov 17, 2008 at 07:27:50AM +0200, Johan Hartzenberg wrote: Thank you for the details. A few more questions: When booting into build 102, do I `zpool online' the root pool? And the above disable -t is temporary till the next reboot - any specific reason for doing it that way? And last question: What do I lose when I disable sysevent? I can answer that last one, since I tried it. When you reboot without sysevent, you will find that the console login service cannot run. You'll wind up in single-user mode. You can enable sysevent at that point and reboot again. Disabling it with `-t' after the system's up seems to do no harm. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
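The temporary disable being discussed would look something like this (a sketch; the exact service FMRI is my assumption, and `-t' is what makes the change last only until the next reboot):

```shell
# Disable the sysevent service temporarily: the -t flag means the
# service comes back on its own at the next reboot.
svcadm disable -t svc:/system/sysevent:default

# Re-enable it persistently if needed before rebooting.
svcadm enable svc:/system/sysevent:default
```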
Re: [zfs-discuss] Is there a baby thumper?
On Tue, Nov 04, 2008 at 05:48:26PM -0600, Tim wrote: Well, what's the end goal? What are you testing for that you need from the thumper? I/O interfaces? CPU? Chipset? If you need *everything* you don't have any other choice. I suppose that something with SATA disks and the same disk controller would be most suitable. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
[zfs-discuss] Is there a baby thumper?
One of our storage guys would like to put a thumper into service, but he's looking for a smaller model to use for testing. Is there something that has the same CPU, disks, and disk controller as a thumper, but fewer disks? The ones I've seen all have 48 disks. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] Is there a baby thumper?
On Tue, Nov 04, 2008 at 03:31:16PM -0700, Carl Wimmi wrote: There isn't a de-populated version. Would X4540 with 250 or 500 GB drives meet your needs? That might be our only choice. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-