Re: [zfs-discuss] partioned cache devices
On 03/19/13 20:27, Jim Klimov wrote:

I disagree; at least, I've always thought differently: the "d" device is the whole disk denomination, with a unique number for a particular controller link ("c+t"). The disk has some partitioning table, MBR or GPT/EFI. In these tables, partition "p0" stands for the table itself (i.e. to manage partitioning); p0 is the whole disk regardless of any partitioning. (Hence you can use p0 to access any type of partition table.) The rest kind of "depends". In case of MBR tables, one partition may be named as having a Solaris (or Solaris2) type, and there it holds an SMI table of Solaris slices, and these slices can hold legacy filesystems or components of ZFS pools. In case of GPT, the GPT partitions can be used directly by ZFS. However, they are also denominated as "slices" in ZFS and the format utility. The GPT partitioning spec requires the disk to be FDISK partitioned with just one single FDISK partition of type EFI, so that tools which predate GPT partitioning will still see such a GPT disk as fully assigned to FDISK partitions, and therefore less likely to be accidentally blown away.

I believe Solaris-based OSes accessing a "p"-named partition and an "s"-named slice of the same number on a GPT disk should lead to the same range of bytes on disk, but I am not really certain about this.

No, you'll see just p0 (whole disk), and p1 (whole disk less space for the backwards compatible FDISK partitioning).

Also, if a "whole disk" is given to ZFS (and for OSes other than the latest Solaris 11 this means non-rpool disks), then ZFS labels the disk as GPT and defines a partition for itself plus a small trailing partition (likely to level out discrepancies with replacement disks that might happen to be a few sectors too small). In this case ZFS reports that it uses "cXtYdZ" as a pool component,

For an EFI disk, the device name without a final p* or s* component is the whole EFI partition. (It's actually the s7 slice minor device node, but the s7 is dropped from the device name to avoid the confusion we had with s2 on SMI labeled disks being the whole SMI partition.)

since it considers itself in charge of the partitioning table and its inner contents, and doesn't intend to share the disk with other usages (dual-booting and other OSes' partitions, or SLOG and L2ARC parts, etc). This also "allows" ZFS to influence hardware-related choices, like caching and throttling, and likely auto-expansion with changed LUN sizes by fixing up the partition table along the way, since it assumes being 100% in charge of the disk.

I don't think there is a "crime" in trying to use the partitions (of either kind) as ZFS leaf vdevs; even the zpool(1M) manpage states that:

... The following virtual devices are supported: disk A block device, typically located under /dev/dsk. ZFS can use individual slices or partitions, though the recommended mode of operation is to use whole disks. ...

Right.

This is orthogonal to the fact that there can only be one Solaris slice table, inside one partition, on MBR. AFAIK this is irrelevant on GPT/EFI - no SMI slices there.

There's a simpler way to think of it on x86. You always have FDISK partitioning (p1, p2, p3, p4). You can then have SMI or GPT/EFI slices (both called s0, s1, ...) in an FDISK partition of the appropriate type. With SMI labeling, s2 is by convention the whole Solaris FDISK partition (although this is not enforced).
With EFI labeling, s7 is enforced as the whole EFI FDISK partition, and so the trailing s7 is dropped off the device name for clarity. This simplicity is brought about because the GPT spec requires that backwards compatible FDISK partitioning is included, but with just 1 partition assigned. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
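For anyone who wants to see how these names map onto device nodes on their own system, a couple of commands are usually enough (the disk name c5t0d0 here is only an example):

   # ls /dev/dsk/c5t0d0*           list the p* (FDISK) and s* (slice) minor nodes for the disk
   # prtvtoc /dev/rdsk/c5t0d0s0    print the label; an EFI/GPT label shows sector-addressed slices

An SMI-labeled disk will show the familiar s0-s7 cylinder-based slices, whereas an EFI-labeled one shows the GPT partitions as slices plus the small reserved slice at the end.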
Re: [zfs-discuss] partioned cache devices
Andrew Werchowiecki wrote:

             Total disk size is 9345 cylinders
             Cylinder size is 12544 (512 byte) blocks

                                               Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1                 EFI               0  9345    9346    100

You only have a p1 (and for a GPT/EFI labeled disk, you can only have p1 - no other FDISK partitions are allowed).

partition> print
Current partition table (original):
Total disk sectors available: 117214957 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm                64      2.00GB          4194367
  1        usr    wm           4194368     53.89GB          117214990
  2 unassigned    wm                 0          0            0
  3 unassigned    wm                 0          0            0
  4 unassigned    wm                 0          0            0
  5 unassigned    wm                 0          0            0
  6 unassigned    wm                 0          0            0
  8   reserved    wm         117214991      8.00MB          117231374

You have an s0 and s1.

This isn't the output from when I did it but it is exactly the same steps that I followed. Thanks for the info about slices, I may give that a go later on. I'm not keen on that because I have clear evidence (as in zpools set up this way, right now, working, without issue) that GPT partitions of the style shown above work and I want to see why it doesn't work in my set up rather than simply ignoring and moving on.

You would have to blow away the partitioning you have, and create an FDISK partitioned disk (not EFI), and then create a p1 and p2 partition. (Don't use the 'partition' subcommand, which confusingly creates solaris slices.) Give the FDISK partitions a partition type which nothing will recognise, such as 'other', so that nothing will try and interpret them as OS partitions. Then you can use them as raw devices, and they should be portable between OS's which can handle FDISK partitioned devices.

-- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
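As a rough sketch of that procedure (device and pool names are only examples), the end result should look something like this:

   # fdisk /dev/rdsk/c5t0d0p0      interactively delete the existing EFI partition, then
                                   create two FDISK partitions of type "Other"
   # ls /dev/dsk/c5t0d0p*          p1 and p2 now address the two FDISK partitions
   # zpool add tank log c5t0d0p1
   # zpool add tank cache c5t0d0p2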
Re: [zfs-discuss] partioned cache devices
I did something like the following:

format -e /dev/rdsk/c5t0d0p0
 fdisk
  1 (create)
  F (EFI)
  6 (exit)
 partition
  label
  1
  y
  0 usr wm 64 4194367e
  1 usr wm 4194368 117214990
  label
  1
  y

             Total disk size is 9345 cylinders
             Cylinder size is 12544 (512 byte) blocks

                                               Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1                 EFI               0  9345    9346    100

partition> print
Current partition table (original):
Total disk sectors available: 117214957 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm                64      2.00GB          4194367
  1        usr    wm           4194368     53.89GB          117214990
  2 unassigned    wm                 0          0            0
  3 unassigned    wm                 0          0            0
  4 unassigned    wm                 0          0            0
  5 unassigned    wm                 0          0            0
  6 unassigned    wm                 0          0            0
  8   reserved    wm         117214991      8.00MB          117231374

This isn't the output from when I did it but it is exactly the same steps that I followed. Thanks for the info about slices, I may give that a go later on. I'm not keen on that because I have clear evidence (as in zpools set up this way, right now, working, without issue) that GPT partitions of the style shown above work and I want to see why it doesn't work in my set up rather than simply ignoring and moving on.

From: Fajar A. Nugraha [mailto:w...@fajar.net]
Sent: Sunday, 17 March 2013 3:04 PM
To: Andrew Werchowiecki
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] partioned cache devices

On Sun, Mar 17, 2013 at 1:01 PM, Andrew Werchowiecki <andrew.werchowie...@xpanse.com.au> wrote:

I understand that p0 refers to the whole disk... in the logs I pasted in I'm not attempting to mount p0. I'm trying to work out why I'm getting an error attempting to mount p2, after p1 has successfully mounted. Further, this has been done before on other systems in the same hardware configuration in the exact same fashion, and I've gone over the steps trying to make sure I haven't missed something but can't see a fault.

How did you create the partition? Are those marked as a solaris partition, or something else (e.g. fdisk on linux uses type "83" by default)?

I'm not keen on using Solaris slices because I don't have an understanding of what that does to the pool's OS interoperability.

Linux can read solaris slices and import solaris-made pools just fine, as long as you're using a compatible zpool version (e.g. zpool version 28).

-- Fajar ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] partioned cache devices
It's a home set up, the performance penalty from splitting the cache devices is non-existent, and that work around sounds like some pretty crazy amount of overhead where I could instead just have a mirrored slog. I'm less concerned about wasted space, more concerned about the number of SAS ports I have available.

I understand that p0 refers to the whole disk... in the logs I pasted in I'm not attempting to mount p0. I'm trying to work out why I'm getting an error attempting to mount p2, after p1 has successfully mounted. Further, this has been done before on other systems in the same hardware configuration in the exact same fashion, and I've gone over the steps trying to make sure I haven't missed something but can't see a fault.

I'm not keen on using Solaris slices because I don't have an understanding of what that does to the pool's OS interoperability.

From: Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) [opensolarisisdeadlongliveopensola...@nedharvey.com]
Sent: Friday, 15 March 2013 8:44 PM
To: Andrew Werchowiecki; zfs-discuss@opensolaris.org
Subject: RE: partioned cache devices

> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Andrew Werchowiecki
>
> muslimwookie@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
> Password:
> cannot open '/dev/dsk/c25t10d1p2': I/O error
> muslimwookie@Pyzee:~$
>
> I have two SSDs in the system, I've created an 8gb partition on each drive for
> use as a mirrored write cache. I also have the remainder of the drive
> partitioned for use as the read only cache. However, when attempting to add
> it I get the error above.

Sounds like you're probably running into confusion about how to partition the drive. If you create fdisk partitions, they will be accessible as p0, p1, p2, but I think p0 unconditionally refers to the whole drive, so the first partition is p1, and the second is p2. If you create one big solaris fdisk partition and then slice it via "partition", where s2 is typically the encompassing slice, and people usually use s1 and s2 and s6 for actual slices, then they will be accessible via s1, s2, s6.

Generally speaking, it's unadvisable to split the slog/cache devices anyway. Because: if you're splitting it, evidently you're focusing on the wasted space - buying an expensive 128G device where you couldn't possibly ever use more than 4G or 8G in the slog. But that's not what you should be focusing on. You should be focusing on the speed (that's why you bought it in the first place). The slog is write-only, and the cache is a mixture of read/write, where it should be hopefully doing more reads than writes. But regardless of your actual success with the cache device, your cache device will be busy most of the time, and competing against the slog.

You have a mirror, you say. You should probably drop both the cache & log. Use one whole device for the cache, use one whole device for the log.

The only risk you'll run is: since a slog is write-only (except during mount, typically at boot) it's possible to have a failure mode where you think you're writing to the log, but the first time you go back and read, you discover an error, and discover the device has gone bad. In other words, without ever doing any reads, you might not notice when/if the device goes bad. Fortunately, there's an easy workaround.
You could periodically (say, once a month) script the removal of your log device, create a junk pool, write a bunch of data to it, scrub it (thus verifying it was written correctly) and in the absence of any scrub errors, destroy the junk pool and re-add the device as a slog to the main pool. I've never heard of anyone actually being that paranoid, and I've never heard of anyone actually experiencing the aforementioned possible undetected device failure mode. So this is all mostly theoretical. Mirroring the slog device really isn't necessary in the modern age. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
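A sketch of that periodic check, using made-up pool and device names (aggr0 for the main pool, c25t9d1 for the log device), might look like:

   # zpool remove aggr0 c25t9d1                       detach the log device from the main pool
   # zpool create junk c25t9d1
   # dd if=/dev/urandom of=/junk/testfile bs=1024k count=2048
   # zpool scrub junk                                 then check "zpool status junk" for errors
   # zpool destroy junk
   # zpool add aggr0 log c25t9d1                      put it back as the slog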
[zfs-discuss] partioned cache devices
Hi all,

I'm having some trouble with adding cache drives to a zpool, anyone got any ideas?

muslimwookie@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
Password:
cannot open '/dev/dsk/c25t10d1p2': I/O error
muslimwookie@Pyzee:~$

I have two SSDs in the system, I've created an 8gb partition on each drive for use as a mirrored write cache. I also have the remainder of the drive partitioned for use as the read only cache. However, when attempting to add it I get the error above.

Here's a zpool status:

  pool: aggr0
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Feb 21 21:13:45 2013
        1.13T scanned out of 20.0T at 106M/s, 51h52m to go
        74.2G resilvered, 5.65% done
config:

        NAME                         STATE     READ WRITE CKSUM
        aggr0                        DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            c7t5000C50035CA68EDd0    ONLINE       0     0     0
            c7t5000C5003679D3E2d0    ONLINE       0     0     0
            c7t50014EE2B16BC08Bd0    ONLINE       0     0     0
            c7t50014EE2B174216Dd0    ONLINE       0     0     0
            c7t50014EE2B174366Bd0    ONLINE       0     0     0
            c7t50014EE25C1E7646d0    ONLINE       0     0     0
            c7t50014EE25C17A62Cd0    ONLINE       0     0     0
            c7t50014EE25C17720Ed0    ONLINE       0     0     0
            c7t50014EE206C2AFD1d0    ONLINE       0     0     0
            c7t50014EE206C8E09Fd0    ONLINE       0     0     0
            c7t50014EE602DFAACAd0    ONLINE       0     0     0
            c7t50014EE602DFE701d0    ONLINE       0     0     0
            c7t50014EE20677C1C1d0    ONLINE       0     0     0
            replacing-13             UNAVAIL      0     0     0
              c7t50014EE6031198C1d0  UNAVAIL      0     0     0  cannot open
              c7t50014EE0AE2AB006d0  ONLINE       0     0     0  (resilvering)
            c7t50014EE65835480Dd0    ONLINE       0     0     0
        logs
          mirror-1                   ONLINE       0     0     0
            c25t10d1p1               ONLINE       0     0     0
            c25t9d1p1                ONLINE       0     0     0

errors: No known data errors

As you can see, I've successfully added the 8gb partitions as write caches. Interestingly, when I do a zpool iostat -v it shows the total as 111gb:

                              capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
aggr0                      20.0T  7.27T  1.33K    139  81.7M  4.19M
  raidz2                   20.0T  7.27T  1.33K    115  81.7M  2.70M
    c7t5000C50035CA68EDd0      -      -    566      9  6.91M   241K
    c7t5000C5003679D3E2d0      -      -    493      8  6.97M   242K
    c7t50014EE2B16BC08Bd0      -      -    544      9  7.02M   239K
    c7t50014EE2B174216Dd0      -      -    525      9  6.94M   241K
    c7t50014EE2B174366Bd0      -      -    540      9  6.95M   241K
    c7t50014EE25C1E7646d0      -      -    549      9  7.02M   239K
    c7t50014EE25C17A62Cd0      -      -    534      9  6.93M   241K
    c7t50014EE25C17720Ed0      -      -    542      9  6.95M   241K
    c7t50014EE206C2AFD1d0      -      -    549      9  7.02M   239K
    c7t50014EE206C8E09Fd0      -      -    526     10  6.94M   241K
    c7t50014EE602DFAACAd0      -      -    576     10  6.91M   241K
    c7t50014EE602DFE701d0      -      -    591     10  7.00M   239K
    c7t50014EE20677C1C1d0      -      -    530     10  6.95M   241K
    replacing                  -      -      0    922      0  7.11M
      c7t50014EE6031198C1d0    -      -      0      0      0      0
      c7t50014EE0AE2AB006d0    -      -      0    622      2  7.10M
    c7t50014EE65835480Dd0      -      -    595     10  6.98M   239K
logs                           -      -      -      -      -      -
  mirror                     740K   111G      0     43      0  2.75M
    c25t10d1p1                 -      -      0     43      3  2.75M
    c25t9d1p1                  -      -      0     43      3  2.75M
-------------------------  -----  -----  -----  -----  -----  -----
rpool                      7.32G  12.6G      2      4  41.9K  43.2K
  c4t0d0s0                 7.32G  12.6G      2      4  41.9K  43.2K
-------------------------  -----  -----  -----  -----  -----  -----

Something funky is going on here...

Wooks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

From: Darren J Moffat [mailto:darr...@opensolaris.org]
Support for SCSI UNMAP - both issuing it and honoring it when it is the backing store of an iSCSI target.

When I search for scsi unmap, I come up with all sorts of documentation that ... is ... like reading a medical journal when all you want to know is the conversion from 98.6F to C. Would you mind momentarily describing what SCSI UNMAP is used for? If I were describing to a customer (CEO, CFO) I'm not going to tell them about SCSI UNMAP, I'm going to say the new system has a new feature that enables ... or solves the ___ problem... Customer doesn't *necessarily* have to be as clueless as CEO/CFO. Perhaps just another IT person, or whatever.

SCSI UNMAP (or SATA TRIM) is a means of telling a storage device that some blocks are no longer needed. (This might be because a file has been deleted in the filesystem on the device.)

In the case of a Flash device, it can optimise usage by knowing this, e.g. it can perhaps perform a background erase on the real blocks so they're ready for reuse sooner, and/or better optimise wear leveling by having more spare space to play with. There are some devices in which this enables the device to improve its lifetime by performing better wear leveling when having more spare space. It can also help by avoiding some read-modify-write operations, if the device knows the data that is in the rest of the 4k block is no longer needed.

In the case of an iSCSI LUN target, these blocks no longer need to be archived, and if sparse space allocation is in use, the space they occupied can be freed off. In the particular case of ZFS provisioning the iSCSI LUN (COMSTAR), you might get performance improvements by having more free space to play with during other write operations, to allow better storage layout optimisation.

So, bottom line is longer life of SSDs (maybe higher performance too if there's less waiting for erases during writes), and better space utilisation and performance for a ZFS COMSTAR target.

-- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
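For the COMSTAR case mentioned above, the space savings show up when the LUN is backed by a sparse (thin-provisioned) zvol - something along these lines, with made-up names:

   # zfs create -s -V 200g tank/lun0                create a sparse 200GB zvol (no space reserved up front)
   # stmfadm create-lu /dev/zvol/rdsk/tank/lun0     expose it as a COMSTAR logical unit

When the initiator's filesystem issues UNMAP for deleted blocks, the corresponding zvol space can be returned to the pool rather than staying allocated.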
Re: [zfs-discuss] any more efficient way to transfer snapshot between two hosts than ssh tunnel?
In my own experiments with my own equivalent of mbuffer, it's well worth giving the receiving side a buffer which is sized to hold the amount of data in a transaction commit, which allows ZFS to be banging out one tx group to disk whilst the network is bringing the next one across for it. This will be roughly the link speed in bytes/second x 5, plus a bit more for good measure - say 250-300 Mbytes for a gigabit link.

It seems to be most important when the disks and the network link have similar max theoretical bandwidths (100Mbytes/sec is what you might expect from both gigabit ethernet and reasonable disks), and it becomes less important as the difference in max performance between them increases. Without the buffer, you tend to see the network run flat out for 5 seconds, and then the receiving disks run flat out for 5 seconds, alternating back and forth, whereas with the buffer, both continue streaming at full gigabit speed without a break.

I have not seen any benefit of buffering on the sending side, although I'd still be inclined to include a small one. YMMV...

Palmer, Trey wrote:

We have found mbuffer to be the fastest solution. Our rates for large transfers on 10GbE are:

    280MB/s    mbuffer
    220MB/s    rsh
    180MB/s    HPN-ssh unencrypted
     60MB/s    standard ssh

The tradeoff: mbuffer is a little more complicated to script; rsh is, well, you know; and hpn-ssh requires rebuilding ssh and (probably) maintaining a second copy of it.

-- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
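A typical way to put that receive-side buffer in place with mbuffer (host names, port and dataset names are illustrative; the 256M figure follows the sizing rule above for a gigabit link):

   receiver# mbuffer -s 128k -m 256M -I 9090 | zfs receive -F backup/fs
   sender#   zfs send tank/fs@snap | mbuffer -s 128k -m 16M -O receiver:9090

Start the receiver first so it is listening when the sender connects.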
Re: [zfs-discuss] ZFS array on marvell88sx in Solaris 11.1
3112 and 3114 were very early SATA controllers before there were any SATA drivers, which pretend to be ATA controllers to the OS. No one should be using these today.

sol wrote:

Oh I can run the disks off a SiliconImage 3114 but it's the marvell controller that I'm trying to get working. I'm sure it's the controller which is used in the Thumpers so it should surely work in solaris 11.1

From: Bob Friesenhahn
If the SATA card you are using is a JBOD-style card (i.e. disks are portable to a different controller), are you able/willing to swap it for one that Solaris is known to support well?

-- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Remove disk
Bob Friesenhahn wrote:

On Sat, 1 Dec 2012, Jan Owoc wrote:

When I would like to change the disk, I also would like to change the disk enclosure; I don't want to use the old one.

You didn't give much detail about the enclosure (how it's connected, how many disk bays it has, how it's used etc.), but are you able to power off the system and transfer all the disks at once?

And what happens if I have 24, 36 disks to change? It would take months to do that.

Those are the current limitations of zfs. Yes, with 12x2TB of data to copy it could take about a month. You can create a brand new pool with the new chassis and use 'zfs send' to send a full snapshot of each filesystem to the new pool. After the bulk of the data has been transferred, take new snapshots and send the remainder. This expects that both pools can be available at once.

Or, if you don't care about existing snapshots, use Shadow Migration to move the data across.

-- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
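In outline, that send/receive migration could look roughly like the following (pool and snapshot names are only examples):

   # zfs snapshot -r oldpool@migrate1
   # zfs send -R oldpool@migrate1 | zfs receive -duF newpool        bulk copy; can take days
   ... later, when ready to cut over ...
   # zfs snapshot -r oldpool@migrate2
   # zfs send -R -i @migrate1 oldpool@migrate2 | zfs receive -duF newpool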
Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
Arne Jansen wrote: We have finished a beta version of the feature. What does FITS stand for? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Making ZIL faster
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Schweiss, Chip How can I determine for sure that my ZIL is my bottleneck? If it is the bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc. Temporarily set sync=disabled Or, depending on your application, leave it that way permanently. I know, for the work I do, most systems I support at most locations have sync=disabled. It all depends on the workload. Noting of course that this means that in the case of an unexpected system outage or loss of connectivity to the disks, synchronous writes since the last txg commit will be lost, even though the applications will believe they are secured to disk. (ZFS filesystem won't be corrupted, but it will look like it's been wound back by up to 30 seconds when you reboot.) This is fine for some workloads, such as those where you would start again with fresh data and those which can look closely at the data to see how far they got before being rudely interrupted, but not for those which rely on the Posix semantics of synchronous writes/syncs meaning data is secured on non-volatile storage when the function returns. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
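For reference, sync is a per-dataset property, so it can be limited to just the datasets whose workload can tolerate the behaviour described above (dataset name is an example):

   # zfs set sync=disabled tank/scratch
   # zfs get sync tank/scratch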
Re: [zfs-discuss] all in one server
Richard Elling wrote:

On Sep 18, 2012, at 7:31 AM, Eugen Leitl <eu...@leitl.org> wrote:

Can I actually have a year's worth of snapshots in zfs without too much performance degradation?

I've got 6 years of snapshots with no degradation :-)
$ zfs list -t snapshot -r export/home | wc -l
1951
$ echo 1951 / 365 | bc -l
5.34520547945205479452
$

So you're slightly ahead of my 5.3 years of daily snapshots :-)

-- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs
On 05/28/12 20:06, Iwan Aucamp wrote:

I'm getting sub-optimal performance with an mmap based database (mongodb) which is running on zfs on Solaris 10u9. System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 * 4GB) ram (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks.

- a few mongodb instances are running with moderate IO and total rss of 50 GB
- a service which logs quite excessively (5GB every 20 mins) is also running (max 2GB ram use) - log files are compressed after some time to bzip2.

Database performance is quite horrid though - it seems that zfs does not know how to manage allocation between page cache and arc cache - and it seems arc cache wins most of the time. I'm thinking of doing the following:

- relocating mmaped (mongo) data to a zfs filesystem with only metadata cache
- reducing zfs arc cache to 16 GB

Are there any other recommendations - and is the above likely to improve performance?

1. Upgrade to S10 Update 10 - this has various performance improvements, in particular related to database type loads (but I don't know anything about mongodb).

2. Reduce the ARC size so RSS + ARC + other memory users < RAM size. I assume the RSS includes whatever caching the database does. In theory, a database should be able to work out what's worth caching better than any filesystem can guess from underneath it, so you want to configure more memory in the DB's cache than in the ARC. (The default ARC tuning is unsuitable for a database server.)

3. If the database has some concept of blocksize or recordsize that it uses to perform i/o, make sure the filesystems it is using are configured with the same recordsize. The ZFS default recordsize (128kB) is usually much bigger than database blocksizes. This is probably going to have less impact with an mmaped database than a read(2)/write(2) database, where it may prove better to match the filesystem's record size to the system's page size (4kB, unless it's using some type of large pages). I haven't tried playing with recordsize for memory mapped i/o, so I'm speculating here. Blocksize or recordsize may apply to the log file writer too, and it may be that this needs a different recordsize and therefore has to be in a different filesystem. If it uses write(2) or some variant rather than mmap(2) and doesn't document this in detail, Dtrace is your friend.

4. Keep plenty of free space in the zpool if you want good database performance. If you're more than 60% full (S10U9) or 80% full (S10U10), that could be a factor.

Anyway, there are a few things to think about.

-- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
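A hedged example of point 3 - creating a separate filesystem for the mmapped data with a recordsize matching the 4kB page size (the dataset name is made up, and recordsize only affects files written after it is set):

   # zfs create -o recordsize=4k tank/mongodb
   # zfs get recordsize,primarycache tank/mongodb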
Re: [zfs-discuss] zfs_arc_max values
On 05/17/12 15:03, Bob Friesenhahn wrote:

On Thu, 17 May 2012, Paul Kraus wrote:

Why are you trying to tune the ARC as _low_ as possible? In my experience the ARC gives up memory readily for other uses. The only place I _had_ to tune the ARC in production was a couple systems running an app that checks for free memory _before_ trying to allocate it. If the ARC has all but 1 GB in use, the app (which is looking for

On my system I adjusted the ARC down due to running user-space applications with very bursty short-term large memory usage. Reducing the ARC assured that there would be no contention between the zfs ARC and the applications.

If the system is running one app which expects to do lots of application level caching (and in theory, the app should be able to work out what's worth caching and what isn't better than any filesystem underneath it can guess), then you should be planning your memory usage accordingly. For example, on a database server you probably want to allocate much of the system's memory to the database cache (in the case of Oracle, the SGA), leaving enough for a smaller ZFS ARC and the memory required by the OS and app. Depends on the system and database size, but something like 50% SGA, 25% ZFS ARC, 25% for everything else might be an example, with the SGA disproportionately bigger on larger systems with larger databases.

On my desktop system (supposed to be 8GB RAM, but currently 6GB due to a dead DIMM), I have knocked the ARC down to 1GB. I used to find the ARC wouldn't shrink in size until the system had got to the point of crawling along, showing anon page-ins, and some app (usually firefox or thunderbird) had already become too difficult to use. I must admit I did this a long time ago, and ZFS's shrinking of the ARC may be more proactive now than it was back then, but I don't notice any ZFS performance issues with the ARC restricted to 1GB on a desktop system. It may have increased scrub times, but that happens when I'm in bed, so I don't care.

-- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
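On Solaris 10/11, capping the ARC like this is done with a line in /etc/system followed by a reboot - e.g. to limit it to 1 GB (the value is in bytes):

   * /etc/system: limit the ZFS ARC to 1 GByte
   set zfs:zfs_arc_max=0x40000000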
Re: [zfs-discuss] test for holes in a file?
I just played and knocked this up (note the stunning lack of comments, missing optarg processing, etc)... Give it a list of files to check...

#define _FILE_OFFSET_BITS 64
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    int i;

    for (i = 1; i < argc; i++) {
        int fd;

        fd = open(argv[i], O_RDONLY);
        if (fd < 0) {
            perror(argv[i]);
        } else {
            off_t eof;
            off_t hole;

            if (((eof = lseek(fd, 0, SEEK_END)) < 0) ||
                lseek(fd, 0, SEEK_SET) < 0) {
                perror(argv[i]);
            } else if (eof == 0) {
                printf("%s: empty\n", argv[i]);
            } else {
                /* first hole before EOF means the file is sparse */
                hole = lseek(fd, 0, SEEK_HOLE);
                if (hole < 0) {
                    perror(argv[i]);
                } else if (hole < eof) {
                    printf("%s: sparse\n", argv[i]);
                } else {
                    printf("%s: not sparse\n", argv[i]);
                }
            }
            close(fd);
        }
    }
    return 0;
}

On 03/26/12 10:06 PM, ольга крыжановская wrote:

Mike, I was hoping that some one has a complete example for a bool has_file_one_or_more_holes(const char *path) function.

Olga

2012/3/26 Mike Gerdts:

2012/3/26 ольга крыжановская:

How can I test if a file on ZFS has holes, i.e. is a sparse file, using the C api?

See SEEK_HOLE in lseek(2).

-- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
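If anyone wants to try it, it should build and run with nothing more than (the file name is arbitrary):

   $ cc -o sparsecheck sparsecheck.c
   $ ./sparsecheck /var/tmp/*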
Re: [zfs-discuss] Any recommendations on Perc H700 controller on Dell Rx10 ?
On 03/10/12 09:29, Sriram Narayanan wrote:

Hi folks: At work, I have an R510, an R610 and an R710 - all with the H700 PERC controller. Based on experiments, it seems like there is no way to bypass the PERC controller - it seems like one can only access the individual disks if they are each set up as RAID0. This brings me to ask some questions:
a. Is it fine (in terms of an intelligent controller coming in the way of ZFS) to have the PERC controllers present each drive as RAID0 drives?
b. Would there be any errors in terms of PERC doing things that ZFS is not aware of and this causing any issues later?

I had to produce a ZFS hybrid storage pool performance demo, and was initially given a system with a RAID-only controller (different from yours, but same idea). I created the demo with it, but disabled the RAID's cache as that wasn't what I wanted in the picture. Meanwhile, I ordered the non-RAID version of the card, and when it came I swapped it in. A couple of issues...

ZFS doesn't recognise any of the disks, obviously, because they have proprietary RAID headers on them, so they have to be created again from scratch. (That was no big deal in this case, and if it had been, I could have done a zfs send and receive to somewhere else temporarily.)

The performance went up - a tiny bit for the spinning disks, and by 50% for the SSDs - so the RAID controller was seriously limiting the IOPS of the SSDs in particular. This was when SSDs were relatively new, and the controllers may not have been designed with SSDs in mind. That's likely to be somewhat different nowadays, but I don't have any data to show that either way.

-- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] send only difference between snapshots
skeletor wrote:

There is a task: make backups by sending snapshots to another server. But I don't want to send a complete snapshot of the system each time - I want to send only the difference between snapshots. For example: there are 2 servers, and I want to take the snapshot on the master, send only the difference between the current and most recent snapshots to the backup, and then deploy it on the backup. Any ideas how this can be done?

It's called an incremental - it's part of the zfs send command line options.

-- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
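For example, assuming snapshots called @monday and @tuesday and a backup host reachable over ssh (all names illustrative):

   master# zfs snapshot tank/data@tuesday
   master# zfs send -i @monday tank/data@tuesday | ssh backuphost zfs receive -F backup/data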
Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
Gary Mills wrote: On Sun, Jan 15, 2012 at 04:06:33PM +, Peter Tribble wrote: On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov wrote: "Does raidzN actually protect against bitrot?" That's a kind of radical, possibly offensive, question formula that I have lately. Yup, it does. That's why many of us use it. There's actually no such thing as bitrot on a disk. Each sector on the disk is accompanied by a CRC that's verified by the disk controller on each read. It will either return correct data or report an unreadable sector. There's nothing inbetween. Actually, there are a number of disk firmware and cache faults inbetween, which zfs has picked up over the years. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Stress test zfs
grant lowe wrote:

Ok. I blew it. I didn't add enough information. Here's some more detail: Disk array is a RAMSAN array, with RAID6 and 8K stripes. I'm measuring performance with the results of the bonnie++ output and comparing with the zpool iostat output. It's with the zpool iostat that I'm not seeing a lot of writes.

Since ZFS never writes data back where it was, it can coalesce multiple outstanding writes into fewer device writes. This may be what you're seeing. I have a ZFS IOPS demo where the (multi-threaded) application is performing over 10,000 synchronous write IOPS, but the underlying devices are only performing about 1/10th of that, due to ZFS coalescing multiple outstanding writes.

Sorry, I'm not familiar with what type of load bonnie generates.

-- Andrew Gabriel | Solaris Systems Architect Email: andrew.gabr...@oracle.com Mobile: +44 7720 598213 Oracle EMEA Server Pre-Sales ORACLE Corporation UK Ltd is a company incorporated in England & Wales | Company Reg. No. 1782505 | Reg. office: Oracle Parkway, Thames Valley Park, Reading RG6 1RA Hardware and Software, Engineered to Work Together ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can I create a mirror for a root rpool?
Does "current" include sol10u10 as well as sol11? If so, when did that go in? Was it in sol10u9? Thanks, Andrew From: Cindy Swearingen mailto:cindy.swearin...@oracle.com>> Subject: Re: [zfs-discuss] Can I create a mirror for a root rpool? Date: December 16, 2011 10:38:21 AM CST To: Tim Cook mailto:t...@cook.ms>> Cc: mailto:zfs-discuss@opensolaris.org>> Hi Tim, No, in current Solaris releases the boot blocks are installed automatically with a zpool attach operation on a root pool. Thanks, Cindy On 12/15/11 17:13, Tim Cook wrote: Do you still need to do the grub install? On Dec 15, 2011 5:40 PM, "Cindy Swearingen" mailto:cindy.swearin...@oracle.com> <mailto:cindy.swearin...@oracle.com>> wrote: Hi Anon, The disk that you attach to the root pool will need an SMI label and a slice 0. The syntax to attach a disk to create a mirrored root pool is like this, for example: # zpool attach rpool c1t0d0s0 c1t1d0s0 Thanks, Cindy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can I create a mirror for a root rpool?
On 12/16/11 07:27 AM, Gregg Wonderly wrote: Cindy, will it ever be possible to just have attach mirror the surfaces, including the partition tables? I spent an hour today trying to get a new mirror on my root pool. There was a 250GB disk that failed. I only had a 1.5TB handy as a replacement. prtvtoc ... | fmthard does not work in this case Can you be more specific why it fails? I have seen a couple of cases, and I'm wondering if you're hitting the same thing. Can you post the prtvtoc output of your original disk please? and so you have to do the partitioning by hand, which is just silly to fight with anyway. Gregg -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] slow zfs send/recv speed
On 11/15/11 23:40, Tim Cook wrote:

On Tue, Nov 15, 2011 at 5:17 PM, Andrew Gabriel <andrew.gabr...@oracle.com> wrote:

On 11/15/11 23:05, Anatoly wrote:

Good day, The speed of send/recv is around 30-60 MBytes/s for initial send and 17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk to 100+ disks in pool. But the speed doesn't vary in any degree. As I understand 'zfs send' is a limiting factor. I did tests by sending to /dev/null. It worked out too slow and absolutely not scalable. None of cpu/memory/disk activity were in peak load, so there is room for improvement. Is there any bug report or article that addresses this problem? Any workaround or solution? I found these guys have the same result - around 7 Mbytes/s for 'send' and 70 Mbytes for 'recv'. http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html

Well, if I do a zfs send/recv over 1Gbit ethernet from a 2 disk mirror, the send runs at almost 100Mbytes/sec, so it's pretty much limited by the ethernet. Since you have provided none of the diagnostic data you collected, it's difficult to guess what the limiting factor is for you. -- Andrew Gabriel

So all the bugs have been fixed?

Probably not, but the OP's implication that zfs send has a specific rate limit in the range suggested is demonstrably untrue. So I don't know what's limiting the OP's send rate. (I could guess a few possibilities, but that's pointless without the data.)

I seem to recall people on this mailing list using mbuffer to speed it up because it was so bursty and slow at one point. IE: http://blogs.everycity.co.uk/alasdair/2010/07/using-mbuffer-to-speed-up-slow-zfs-send-zfs-receive/

Yes, this idea originally came from me, having analyzed the send/receive traffic behavior in combination with network connection behavior. However, it's the receive side that's bursty around the TXG commits, not the send side, so that doesn't match the issue the OP is seeing. (The buffer sizes in that blog are not optimal, although any buffer at the receive side will make a significant improvement if the network bandwidth is the same order of magnitude as what the send/recv are capable of.)

-- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] slow zfs send/recv speed
On 11/15/11 23:05, Anatoly wrote: Good day, The speed of send/recv is around 30-60 MBytes/s for initial send and 17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk to 100+ disks in pool. But the speed doesn't vary in any degree. As I understand 'zfs send' is a limiting factor. I did tests by sending to /dev/null. It worked out too slow and absolutely not scalable. None of cpu/memory/disk activity were in peak load, so there is of room for improvement. Is there any bug report or article that addresses this problem? Any workaround or solution? I found these guys have the same result - around 7 Mbytes/s for 'send' and 70 Mbytes for 'recv'. http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html Well, if I do a zfs send/recv over 1Gbit ethernet from a 2 disk mirror, the send runs at almost 100Mbytes/sec, so it's pretty much limited by the ethernet. Since you have provided none of the diagnostic data you collected, it's difficult to guess what the limiting factor is for you. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool scrub bad block list
ZFS detects far more errors - errors that traditional filesystems will simply miss. This means that many of the possible causes for those errors will be something other than a real bad block on the disk. As Edward said, the disk firmware should automatically remap real bad blocks, so if ZFS did that too, we'd not use the remapped block, which is probably fine. For other errors, there's nothing wrong with the real block on the disk - it's going to be firmware, driver, cache corruption, or something else, so blacklisting the block will not solve the issue. Also, with some types of disk (SSD), block numbers are moved around to achieve wear leveling, so blacklisting a block number won't stop you reusing that real block.

-- Andrew Gabriel (from mobile)

--- Original message ---
From: Edward Ned Harvey
To: didier.reb...@u-bourgogne.fr, zfs-discuss@opensolaris.org
Sent: 8.11.'11, 12:50

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Didier Rebeix

from ZFS documentation it appears unclear to me if a "zpool scrub" will black list any found bad blocks so they won't be used anymore.

If there are any physically bad blocks, such that the hardware (hard disk) will return an error every time that block is used, then the disk should be replaced. All disks have a certain amount of error detection/correction built in, and remap bad blocks internally and secretly behind the scenes, transparent to the OS. So if there are any blocks regularly reporting bad to the OS, then it means there is a growing problem inside the disk. Offline the disk and replace it.

It is ok to get an occasional cksum error. Say, once a year. Because the occasional cksum error will be re-read and as long as the data is correct the second time, no problem.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool replace not concluding + duplicate drive label
On 28/10/2011, at 3:06 PM, Daniel Carosone wrote: > On Thu, Oct 27, 2011 at 10:49:22AM +1100, afree...@mac.com wrote: >> Hi all, >> >> I'm seeing some puzzling behaviour with my RAID-Z. >> > > Indeed. Start with zdb -l on each of the disks to look at the labels in more > detail. > > -- > Dan. I'm reluctant to include a monstrous wall of text so I've placed the output at http://dl.dropbox.com/u/19420697/zdb.out. Immediately I'm struck by the sad dearth of information on da6, the similarity of the da0 + da0/old subtree to the zpool status information and my total lack of knowledge on how to use this data in any beneficial fashion. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thumper (X4500), and CF SSD for L2ARC = ?
Jim Klimov wrote: Thanks, but I believe currently that's out of budget, but a 90MB/s CF module may be acceptable for the small business customer. I wondered if that is known to work or not... I've had a compact flash IDE drive not work in a white-box system. In that case it was a ufs root disk, but any attempt to put a serious load on it, and it corrupted data all over the place. So if you're going to try one, make sure you hammer it very hard in a test environment before you commit anything important to it. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] solaris 10u8 hangs with message Disconnected command timeout for Target 0
Ding Honghui wrote:

Hi,

My solaris storage hangs. I log in to the console and there are messages[1] displayed on the console. I can't log in to the console and it seems the IO is totally blocked. The system is solaris 10u8 on a Dell R710 with disk array Dell MD3000. 2 HBA cables connect the server and the MD3000. The symptom is random. It would be very much appreciated if anyone can help me out.

The SCSI target you are talking to is being reset. "Unit Attention" means it's forgotten what operating parameters have been negotiated with the system and is a warning the device might have been changed without the system knowing, and it's telling you this happened because of "device internal reset". That sort of thing can happen if the firmware in the SCSI target crashes and restarts, or the power supply blips, or if the device was swapped. I don't know anything about a Dell MD3000, but given it's happened on lots of disks at the same moment following a timeout, it looks like the array power cycled or the array firmware (if any) rebooted. (Not sure if a SCSI bus reset can do this or not.)

[1]
Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /pci@0,0/pci8086,3410@9/pci8086,32c@0/pci1028,1f04@8 (mpt1):
Aug 16 13:14:16 nas-hz-02   Disconnected command timeout for Target 0
Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa1802a44b8f0ded (sd47):
Aug 16 13:14:16 nas-hz-02   Error for Command: write(10)   Error Level: Retryable
Aug 16 13:14:16 nas-hz-02 scsi:   Requested Block: 1380679073   Error Block: 1380679073
Aug 16 13:14:16 nas-hz-02 scsi:   Vendor: DELL   Serial Number:
Aug 16 13:14:16 nas-hz-02 scsi:   Sense Key: Unit Attention
Aug 16 13:14:16 nas-hz-02 scsi:   ASC: 0x29 (device internal reset), ASCQ: 0x4, FRU: 0x0
Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa18029e4b8f0d61 (sd41):
Aug 16 13:14:16 nas-hz-02   Error for Command: write(10)   Error Level: Retryable
Aug 16 13:14:16 nas-hz-02 scsi:   Requested Block: 1380679072   Error Block: 1380679072
Aug 16 13:14:16 nas-hz-02 scsi:   Vendor: DELL   Serial Number:
Aug 16 13:14:16 nas-hz-02 scsi:   Sense Key: Unit Attention
Aug 16 13:14:16 nas-hz-02 scsi:   ASC: 0x29 (device internal reset), ASCQ: 0x4, FRU: 0x0
Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa1802a24b8f0dc5 (sd45):
Aug 16 13:14:16 nas-hz-02   Error for Command: write(10)   Error Level: Retryable
Aug 16 13:14:16 nas-hz-02 scsi:   Requested Block: 1380679073   Error Block: 1380679073
Aug 16 13:14:16 nas-hz-02 scsi:   Vendor: DELL   Serial Number:
Aug 16 13:14:16 nas-hz-02 scsi:   Sense Key: Unit Attention
Aug 16 13:14:16 nas-hz-02 scsi:   ASC: 0x29 (device internal reset), ASCQ: 0x4, FRU: 0x0
Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa18029c4b8f0d35 (sd39):
Aug 16 13:14:16 nas-hz-02   Error for Command: write(10)   Error Level: Retryable
Aug 16 13:14:16 nas-hz-02 scsi:   Requested Block: 1380679072   Error Block: 1380679072
Aug 16 13:14:16 nas-hz-02 scsi:   Vendor: DELL   Serial Number:
Aug 16 13:14:16 nas-hz-02 scsi:   Sense Key: Unit Attention
Aug 16 13:14:16 nas-hz-02 scsi:   ASC: 0x29 (device internal reset), ASCQ: 0x4, FRU: 0x0
Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa1802984b8f0cd2 (sd35):
Aug 16 13:14:16 nas-hz-02   Error for Command: write(10)   Error Level: Retryable
Aug 16 13:14:16 nas-hz-02 scsi:   Requested Block: 1380679072   Error Block: 1380679072
Aug 16 13:14:16 nas-hz-02 scsi:   Vendor: DELL   Serial Number:
Aug 16 13:14:16 nas-hz-02 scsi:   Sense Key: Unit Attention
Aug 16 13:14:16 nas-hz-02 scsi:   ASC: 0x29 (device internal reset), ASCQ: 0x4, FRU: 0x0
--
Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sudden drop in disk performance - WD20EURS & 4k sectors to blame?
David Wragg wrote: I've not done anything different this time from when I created the original (512b) pool. How would I check ashift? For a zpool called "export"... # zdb export | grep ashift ashift: 12 ^C # As far as I know (although I don't have any WD's), all the current 4k sectorsize hard drives claim to be 512b sectorsize, so if you didn't do anything special, you'll probably have ashift=9. I would look at a zpool iostat -v to see what the IOPS rate is (you may have bottomed out on that), and I would also work out average transfer size (although that alone doesn't necessarily tell you much - a dtrace quantize aggregation would be better). Also check service times on the disks (iostat) to see if there's one which is significantly worse and might be going bad. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
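To look at the IOPS rate and per-disk service times suggested above, something like the following is usually enough (the 5-second interval is arbitrary):

   # zpool iostat -v export 5      per-vdev operations per second for the pool
   # iostat -xn 5                  the asvc_t column shows average service time per disk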
Re: [zfs-discuss] Disk IDs and DD
Lanky Doodle wrote: Oh no I am not bothered at all about the target ID numbering. I just wondered if there was a problem in the way it was enumerating the disks. Can you elaborate on the dd command LaoTsao? Is the 's' you refer to a parameter of the command or the slice of a disk - none of my 'data' disks have been 'configured' yet. I wanted to ID them before adding them to pools. Use p0 on x86 (whole disk, without regard to any partitioning). Any other s or p device node may or may not be there, depending on what partitions/slices are on the disk. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] matching zpool versions to development builds
John Martin wrote: Is there a list of zpool versions for development builds? I found: http://blogs.oracle.com/stw/entry/zfs_zpool_and_file_system where it says Solaris 11 Express is zpool version 31, but my system has BEs back to build 139 and I have not done a zpool upgrade since installing this system but it reports on the current development build: # zpool upgrade -v This system is currently running ZFS pool version 33. It's painfully laid out (each on a separate page), but have a look at http://hub.opensolaris.org/bin/view/Community+Group+zfs/31 (change the version on the end of the URL). It conks out at version 31 though. I have systems back to build 125, so I tend to always force zpool version 19 for that (and that automatically limits zfs version to 4). There's also some info about some builds on the zfs wikipedia page http://en.wikipedia.org/wiki/Zfs -- Andrew Gabriel* *** ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Large scale performance query
Alexander Lesle wrote: And what is your suggestion for scrubbing a mirror pool? Once per month, every 2 weeks, every week. There isn't just one answer. For a pool with redundancy, you need to do a scrub just before the redundancy is lost, so you can be reasonably sure the remaining data is correct and can rebuild the redundancy. The problem comes with knowing when this might happen. Of course, if you are doing some planned maintenance which will reduce the pool redundancy, then always do a scrub before that. However, in most cases, the redundancy is lost without prior warning, and you need to do periodic scrubs to cater for this case. I do a scrub via cron once a week on my home system. Having almost completely filled the pool, this was taking about 24 hours. However, now that I've replaced the disks and done a send/recv of the data across to a new larger pool which is only 1/3rd full, that's dropped down to 2 hours. For a pool with no redundancy, where you rely only on backups for recovery, the scrub needs to be integrated into the backup cycle, such that you will discover corrupt data before it has crept too far through your backup cycle to be able to find a non corrupt version of the data. When you have a new hardware setup, I would perform scrubs more frequently as a further check that the hardware doesn't have any systemic problems, until you have gained confidence in it. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
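The weekly scrub mentioned above is just a root crontab entry along these lines (pool name and timing are examples - this one runs at 02:00 every Sunday):

   0 2 * * 0 /usr/sbin/zpool scrub export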
Re: [zfs-discuss] Adding mirrors to an existing zfs-pool
Bernd W. Hennig wrote:

G'Day,

- zfs pool with 4 disks (from Clariion A)
- must migrate to Clariion B (so I created 4 disks with the same size, available for the zfs)

The zfs pool has no mirrors; my idea was to add the new 4 disks from the Clariion B to the 4 disks which are still in the pool - and later remove the original 4 disks. I only found examples of how to create a new pool with mirrors, but no example of how to add a mirror disk for each "disk" in a pool without mirrors.

- is it possible to add disks to each disk in the pool (they have different sizes, so I have to add exactly the correct disks from Clariion B to the original disks from Clariion A)
- can I later "remove" the disks from the Clariion A, pool is intact, user can work with the pool

Depends on a few things... What OS are you running, and what release/update or build? What's the RAID layout of your pool ("zpool status")?

-- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
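Assuming the layout turns out to be a plain (non-redundant) pool of four top-level disks, the general shape of the answer is zpool attach/detach per disk - sketched here with made-up device names:

   # zpool attach tank c1t0d0 c2t0d0     attach a Clariion B disk as a mirror of a Clariion A disk
   ... wait for the resilver to finish (watch zpool status) ...
   # zpool detach tank c1t0d0            then drop the old Clariion A disk
   ... repeat for each of the four disks ...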
[zfs-discuss] zfs send/receive and ashift
Does anyone know if it's OK to do zfs send/receive between zpools with different ashift values? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Mount Options
Tony MacDoodle wrote: I have a zfs pool called logs (about 200G). I would like to create 2 volumes using this chunk of storage. However, they would have different mount points. ie. 50G would be mounted as /oarcle/logs 100G would be mounted as /session/logs is this possible? Yes... zfs create -o mountpoint=/oracle/logs logs/oracle zfs create -o mountpoint=/session/logs logs/session If you don't otherwise specify, the two filesystem will share the pool without any constraints. If you wish to limit their max space... zfs set quota=50g logs/oracle zfs set quota=100g logs/session and/or if you wish to reserve a minimum space... zfs set reservation=50g logs/oracle zfs set reservation=100g logs/session Do I have to use the legacy mount options? You don't have to. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How create a FAT filesystem on a zvol?
Gary Mills wrote:

On Sun, Jul 10, 2011 at 11:16:02PM +0700, Fajar A. Nugraha wrote:

On Sun, Jul 10, 2011 at 10:10 PM, Gary Mills wrote:

The `lofiadm' man page describes how to export a file as a block device and then use `mkfs -F pcfs' to create a FAT filesystem on it. Can't I do the same thing by first creating a zvol and then creating a FAT filesystem on it?

Seems not. [...] Some solaris tools (like fdisk, or "mkfs -F pcfs") need disk geometry to function properly. zvols don't provide that. If you want to use zvols to work with such tools, the easiest way would be using lofi, or exporting the zvol as an iscsi share and importing it again. For example, if you have a 10MB zvol and use lofi, fdisk would show this geometry:

   Total disk size is 34 cylinders
   Cylinder size is 602 (512 byte) blocks

... which will then be used if you run "mkfs -F pcfs -o nofdisk,size=20480". Without lofi, the same command would fail with

   Drive geometry lookup (need tracks/cylinder and/or sectors/track: Operation not supported

So, why can I do it with UFS?

# zfs create -V 10m rpool/vol1
# newfs /dev/zvol/rdsk/rpool/vol1
newfs: construct a new file system /dev/zvol/rdsk/rpool/vol1: (y/n)? y
Warning: 4130 sector(s) in last cylinder unallocated
/dev/zvol/rdsk/rpool/vol1: 20446 sectors in 4 cylinders of 48 tracks, 128 sectors
        10.0MB in 1 cyl groups (14 c/g, 42.00MB/g, 20160 i/g)
super-block backups (for fsck -F ufs -o b=#) at: 32,

Why is this different from PCFS?

UFS has known for years that drive geometries are bogus, and just fakes something up to keep itself happy. What UFS thinks of as a cylinder bears no relation to actual disk cylinders. If you give mkfs_pcfs all the geometry data it needs, then it won't try asking the device...

andrew@opensolaris:~# zfs create -V 10m rpool/vol1
andrew@opensolaris:~# mkfs -F pcfs -o fat=16,nofdisk,nsect=255,ntrack=63,size=2 /dev/zvol/rdsk/rpool/vol1
Construct a new FAT file system on /dev/zvol/rdsk/rpool/vol1: (y/n)? y
andrew@opensolaris:~# fstyp /dev/zvol/rdsk/rpool/vol1
pcfs
andrew@opensolaris:~# fsck -F pcfs /dev/zvol/rdsk/rpool/vol1
** /dev/zvol/rdsk/rpool/vol1
** Scanning file system meta-data
** Correcting any meta-data discrepancies
10143232 bytes.
0 bytes in bad sectors.
0 bytes in 0 directories.
0 bytes in 0 files.
10143232 bytes free.
512 bytes per allocation unit.
19811 total allocation units.
19811 available allocation units.
andrew@opensolaris:~# mount -F pcfs /dev/zvol/dsk/rpool/vol1 /mnt
andrew@opensolaris:~#

-- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 512b vs 4K sectors
Richard Elling wrote:

On Jul 4, 2011, at 6:42 AM, Lanky Doodle wrote:

Hiya, I've been doing a lot of research surrounding this and ZFS, including some posts on here, though I am still left scratching my head. I am planning on using slow RPM drives for a home media server, and it's these that seem to 'suffer' from a few problems:

Seagate Barracuda LP - Looks to be the only true 512b sector hard disk. Serious firmware issues
Western Digital Caviar Green - 4K sectors = crap write performance
Hitachi 5K3000 - Variable sector sizing (according to tech. specs)
Samsung SpinPoint F4 - Just plain old problems with them

What is the best drive of the above 4, and are 4K drives really a no-no with ZFS? Are there any alternatives in the same price bracket?

4K drives are fine, especially if the workload is read-mostly. Depending on the OS, you can tell ZFS to ignore the incorrect physical sector size reported by some drives. Today, this is easiest in FreeBSD, a little bit more tricky in OpenIndiana (patches and source are available for a few different implementations). Or you can just trick them out by starting the pool with a 4K sector device that doesn't lie (eg, an iscsi target).

Who would have thought choosing a hard disk could be so 'hard'!

I recommend enterprise-grade disks, none of which made your short list ;-(. -- richard

I'm going through this at the moment. I've bought a pair of Seagate Barracuda XT 2TB disks (which are a bit more Enterprise than the list above), just plugged them in, and so far they're OK. Not had them long enough to report on longevity.

-- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 700GB gone?
On 06/30/11 08:50 PM, Orvar Korvar wrote: I have a 1.5TB disk that has several partitions. One of them is 900GB. Now I can only see 300GB. Where is the rest? Is there a command I can do to reach the rest of the data? Will scrub help? Not much to go on - no one can answer this. How did you go about partitioning the disk? What does the fdisk partitioning look like (if it's x86)? What does the VTOC slice layout look like? What are you using each partition and slice for? What tells you that you can only see 300GB? -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Encryption accelerator card recommendations.
On 06/27/11 11:32 PM, Bill Sommerfeld wrote: On 06/27/11 15:24, David Magda wrote: Given the amount of transistors that are available nowadays I think it'd be simpler to just create a series of SIMD instructions right in/on general CPUs, and skip the whole co-processor angle. see: http://en.wikipedia.org/wiki/AES_instruction_set Present in many current Intel CPUs; also expected to be present in AMD's "Bulldozer" based CPUs. I recall seeing a blog comparing the existing Solaris hand-tuned AES assembler performance with the (then) new AES instruction version, where the Intel AES instructions only got you about a 30% performance increase. I've seen reports of better performance improvements, but usually by comparing with the performance on older processors which are going to be slower for additional reasons than just missing the AES instructions. Also, you could claim better performance improvement if you compared against a less efficient original implementation of AES. What this means is that a faster CPU may buy you more crypto performance than the AES instructions alone will do. My understanding from reading the Intel AES instruction set (which I warn might not be completely correct) is that the AES encryption/decryption instruction is executed between 10 and 14 times (depending on key length) for each 128 bits (16 bytes) of data being encrypted/decrypted, so it's very much part of the regular instruction pipeline. The code will have to loop through this process multiple times to process a data block bigger than 16 bytes, i.e. a double nested loop, although I expect it's normally loop-unrolled a fair degree for optimisation purposes. Conversely, the crypto units in the T-series processors are separate from the CPU, and do the encryption/decryption whilst the CPU is getting on with something else, and they do it much faster than it could be done on the CPU. Small blocks are normally a problem for crypto offload engines because the overhead of farming off the work to the engine and getting the result back often means that you can do the crypto on the CPU faster than the time it takes to get the crypto engine started and stopped. However, T-series crypto is particularly good at handling small blocks efficiently, such as around 1kbyte which you are likely to find in a network packet, as it is much closer coupled to the CPU than a PCI crypto card can be, and performance with small packets was key for the crypto networking support T-series was designed for. Of course, it handles crypto of large blocks just fine too. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
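For a rough feel of what the AES instructions buy on a particular machine, a crude comparison (assuming an OpenSSL build whose EVP path can use the hardware instructions, while the plain cipher names stay in software) is:
$ openssl speed aes-128-cbc
$ openssl speed -evp aes-128-cbc
The difference between the two runs gives an idea of the speed-up on that CPU, subject to the caveats above about what is really being compared.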
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
Richard Elling wrote: On Jun 19, 2011, at 6:04 AM, Andrew Gabriel wrote: Richard Elling wrote: Actually, all of the data I've gathered recently shows that the number of IOPS does not significantly increase for HDDs running random workloads. However the response time does :-( My data is leading me to want to restrict the queue depth to 1 or 2 for HDDs. Thinking out loud here, but if you can queue up enough random I/Os, the embedded disk controller can probably do a good job reordering them into a less random elevator sweep pattern, and increase IOPS through reducing the total seek time, which may be why IOPS does not drop as much as one might imagine if you think of the heads doing random seeks (they aren't random anymore). However, this requires that there's a reasonable queue of I/Os for the controller to optimise, and processing that queue will necessarily increase the average response time. If you run with a queue depth of 1 or 2, the controller can't do this. I agree. And disksort is in the mix, too. Oh, I'd never looked at that. This is something I played with ~30 years ago, when the OS disk driver was responsible for queuing and reordering disc transfers to reduce total seek time, and disk controllers were dumb. ...and disksort still survives... maybe we should kill it? It looks like it's possibly slightly worse than the pathologically worst response time case I described below... There are lots of options and compromises, generally weighing reduction in total seek time against longest response time. Best reduction in total seek time comes from planning out your elevator sweep, and inserting newly queued requests into the right position in the sweep ahead. That also gives the potentially worst response time, as you may have one transfer queued for the far end of the disk, whilst you keep getting new transfers queued for the track just in front of you, and you might end up reading or writing the whole disk before you get to do that transfer which is queued for the far end. If you can get a big enough queue, you can modify the insertion algorithm to never insert into the current sweep, so you are effectively planning two sweeps ahead. Then the worst response time becomes the time to process one queue full, rather than the time to read or write the whole disk. Lots of other tricks too (e.g. insertion into sweeps taking into account priority, such as whether the I/O is synchronous or asynchronous, and the age of existing queue entries). I had much fun playing with this at the time. The other wrinkle for ZFS is that the priority scheduler can't re-order I/Os sent to the disk. Does that also go through disksort? Disksort doesn't seem to have any concept of priorities (but I haven't looked in detail where it plugs in to the whole framework). So it might make better sense for ZFS to keep the disk queue depth small for HDDs. -- richard -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
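For anyone who wants to experiment with the queue depth, on builds of roughly this vintage the per-vdev limit ZFS uses is the zfs_vdev_max_pending tunable (I'm assuming the tunable name hasn't changed on your particular build), which can be inspected and changed on a live system with mdb:
# echo zfs_vdev_max_pending/D | mdb -k
# echo zfs_vdev_max_pending/W0t2 | mdb -kw
The second command sets it to 2; use with care, and measure before and after.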
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
Richard Elling wrote: Actually, all of the data I've gathered recently shows that the number of IOPS does not significantly increase for HDDs running random workloads. However the response time does :-( My data is leading me to want to restrict the queue depth to 1 or 2 for HDDs. Thinking out loud here, but if you can queue up enough random I/Os, the embedded disk controller can probably do a good job reordering them into a less random elevator sweep pattern, and increase IOPS through reducing the total seek time, which may be why IOPS does not drop as much as one might imagine if you think of the heads doing random seeks (they aren't random anymore). However, this requires that there's a reasonable queue of I/Os for the controller to optimise, and processing that queue will necessarily increase the average response time. If you run with a queue depth of 1 or 2, the controller can't do this. This is something I played with ~30 years ago, when the OS disk driver was responsible for queuing and reordering disc transfers to reduce total seek time, and disk controllers were dumb. There are lots of options and compromises, generally weighing reduction in total seek time against longest response time. Best reduction in total seek time comes from planning out your elevator sweep, and inserting newly queued requests into the right position in the sweep ahead. That also gives the potentially worst response time, as you may have one transfer queued for the far end of the disk, whilst you keep getting new transfers queued for the track just in front of you, and you might end up reading or writing the whole disk before you get to do that transfer which is queued for the far end. If you can get a big enough queue, you can modify the insertion algorithm to never insert into the current sweep, so you are effectively planning two sweeps ahead. Then the worst response time becomes the time to process one queue full, rather than the time to read or write the whole disk. Lots of other tricks too (e.g. insertion into sweeps taking into account priority, such as whether the I/O is synchronous or asynchronous, and the age of existing queue entries). I had much fun playing with this at the time. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] compare snapshot to current zfs fs
Harry Putnam wrote: I have a sneaking feeling I'm missing something really obvious. If you have zfs filesystems that see little use and have lost track of whether changes may have occurred since the last snapshot, is there some handy way to determine if a snapshot matches its filesystem? Or put another way, some way to determine if the snapshot is different than its current filesystem. I know about the diff tools and of course I guess one could compare overall sizes in bytes for a good idea, but is there a way provided by zfs? If you have a recent enough OS release... zfs diff <snapshot> [<snapshot>|<filesystem>] -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
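As an illustration (dataset and snapshot names are made up), comparing a snapshot against the live filesystem, or against a later snapshot:
# zfs diff tank/home@monday tank/home
# zfs diff tank/home@monday tank/home@tuesday
No output means nothing has changed between the two.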
Re: [zfs-discuss] Extremely slow zpool scrub performance
On 05/14/11 01:08 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Donald Stahl Running a zpool scrub on our production pool is showing a scrub rate of about 400K/s. (When this pool was first set up we saw rates in the MB/s range during a scrub). Wait longer, and keep watching it. Or just wait till it's done and look at the total time required. It is normal to have periods of high and low during scrub. I don't know why. Check the IOPS per drive - you may be maxing out on one of them if it's in an area where there are lots of small blocks. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
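A simple way to watch the per-drive IOPS while the scrub is running (the 5-second interval is just an example) is:
# iostat -xn 5
and look at the r/s, w/s and %b columns for each device - a drive sitting at or near 100% busy while delivering only a few hundred small reads per second is the sort of bottleneck being described.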
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
Toby Thain wrote: On 08/05/11 10:31 AM, Edward Ned Harvey wrote: ... Incidentally, do fsync() and sync return instantly or wait? Cuz "time sync" might produce 0 sec every time even if there were something waiting to be flushed to disk. The semantics need to be synchronous. Anything else would be a horrible bug. sync(2) is not required to be synchronous. I believe that for ZFS it is synchronous, but for most other filesystems, it isn't (although a second sync will block until the actions resulting from a previous sync have completed). fsync(3C) is synchronous. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Faster copy from UFS to ZFS
Dan Shelton wrote: Is anyone aware of any freeware program that can speed up copying tons of data (2 TB) from UFS to ZFS on the same server? I use 'ufsdump | ufsrestore'*. I would also suggest setting 'sync=disabled' during the operation, and reverting it afterwards. Certainly, fastfs (a similar although more dangerous option for ufs) makes ufs to ufs copying significantly faster. *ufsrestore works fine on ZFS filesystems (although I haven't tried it with any POSIX ACLs on the original ufs filesystem, which would probably simply get lost). -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
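As a sketch of the copy itself, assuming the ufs filesystem is on c0t0d0s6 and the destination zfs filesystem is mounted at /tank/data (both names are made up), and assuming a build which has the sync property:
# zfs set sync=disabled tank/data
# ufsdump 0f - /dev/rdsk/c0t0d0s6 | (cd /tank/data && ufsrestore rf -)
# zfs inherit sync tank/data
The zfs inherit at the end puts the sync behaviour back to its default once the copy is complete.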
Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express
Matthew Anderson wrote: Hi All, I've run into a massive performance problem after upgrading to Solaris 11 Express from oSol 134. Previously the server was performing a batch write every 10-15 seconds and the client servers (connected via NFS and iSCSI) had very low wait times. Now I'm seeing constant writes to the array with a very low throughput and high wait times on the client servers. Zil is currently disabled. How/Why? There is currently one failed disk that is being replaced shortly. Is there any ZFS tunable to revert Solaris 11 back to the behaviour of oSol 134? What does "zfs get sync" report? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Cannot remove zil device
Do you have any details on that CR? Either my Google-fu is failing or Oracle has made the CR database private. I haven't encountered this problem but would like to know if there are certain behaviors to avoid so as not to risk this. Has it been fixed in Sol10 or OpenSolaris? Thanks, Andrew From: Cindy Swearingen [cindy.swearin...@oracle.com] Sent: Thursday, March 31, 2011 1:55 PM To: Roy Sigurd Karlsbakk Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Cannot remove zil device You can add and remove mirrored or non-mirrored log devices. Jordan is probably running into CR 7000154: cannot remove log device Thanks, Cindy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SIL3114 and sparc solaris 10
Krunal Desai wrote: On Wed, Feb 23, 2011 at 8:38 AM, Mauricio Tavares wrote: I see what you mean; in http://mail.opensolaris.org/pipermail/opensolaris-discuss/2008-September/043024.html they claim it is supported by the uata driver. What would you suggest instead? Also, since I have the card already, how about if I try it out? My experience with SPARC is limited, but perhaps the Option ROM/BIOS for that card is intended for x86, and not SPARC? I might be thinking of another controller, but this could be the case. You could always try to boot with the card; the worst that'll probably happen is boot hangs before the OS even comes into play. SPARC won't try to run the BIOS on the card anyway (it will only run OpenFirmware BIOS), but you will have to make sure the card has the non-RAID BIOS so that the PCI class doesn't claim it to be a RAID controller, which will prevent Solaris going anywhere near the card at all. These cards could be bought with either RAID or non-RAID BIOS, but RAID was more common. You can (or could some time back) download the RAID and non-RAID BIOS from Silicon Image and re-flash, which also updates the PCI class, and I think you'll need a Windows system to actually flash the BIOS. You might want to do a google search on "3114 data corruption" too, although it never hit me back when I used the cards. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SIL3114 and sparc solaris 10
Mauricio Tavares wrote: Perhaps a bit off-topic (I asked on the rescue list -- http://web.archiveorange.com/archive/v/OaDWVGdLhxWVWIEabz4F -- and was told to try here), but I am kinda shooting in the dark: I have been finding online scattered and vague info stating that this card can be made to work with a sparc solaris 10 box (http://old.nabble.com/eSATA-or-firewire-in-Solaris-Sparc-system-td27150246.html is the only link I can offer right now). Can anyone confirm or deny that? 3112/3114 was a very early (possibly the first?) SATA chipset, I think aimed for use before SATA drivers had been developed. I would suggest looking for something more modern. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)
Roy Sigurd Karlsbakk wrote: Nope. Most HDDs today have a single read channel, and they select which head uses that channel at any point in time. They cannot use multiple heads at the same time, because the heads do not travel the same path on their respective surfaces at the same time. There's no real vertical alignment of the tracks between surfaces, and every surface has its own embedded position information that is used when that surface's head is active. There were attempts at multi-actuator designs with separate servo arms and multiple channels, but mechanically they're too difficult to manufacture at high yields as I understood it. Perhaps a stupid question, but why don't they read from all platters in parallel? The answer is in the text you quoted above. There are drives now with two-level actuators. The primary actuator is the standard actuator you are familiar with which moves all the arms. The secondary actuator is a piezo crystal towards the head end of the arm which can move the head a few tracks very quickly without having to move the arm, and these are one per head. In theory, this might allow multiple heads to lock on to their respective tracks at the same time for parallel reads, but I haven't heard that they are used in this way. If you go back to the late 1970's before tracks had embedded servo data, on multi-platter disks you had one surface which contained the head positioning servo data, and the drive relied on accurate vertical alignment between heads/surfaces to keep on track (and drives could head-switch instantly). Around 1980, tracks got too close together for this to work anymore, and the servo positioning data was embedded into each track itself. The very first drives of this type scanned all the surfaces on startup to build up an internal table of the relative misalignment of tracks across the surfaces, but this rapidly became unviable as drive capacity increased and this scan would take an unreasonable length of time. It may be that modern drives learn this as they go - I don't know. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and TRIM
Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey My google-fu is coming up short on this one... I didn't see that it had been discussed in a while ... BTW, there were a bunch of places where people said "ZFS doesn't need trim." Which I hope, by now, has been commonly acknowledged as bunk. The only situation where you don't need TRIM on a SSD is when (a) you're going to fill it once and never write to it again, which is highly unlikely considering the fact that you're buying a device for its fast write performance... (b) you don't care about performance, which is highly unlikely considering the fact that you bought a performance device ... (c) you are using whole disk encryption. This is a valid point. You would probably never TRIM anything from a fully encrypted disk ... In places where people said TRIM was thought to be unnecessary, the justification they stated was that TRIM will only benefit people whose usage patterns are sporadic, rather than sustained. The downfall of that argument is the assumption that the device can't perform TRIM operations simultaneously while performing other operations. That may be true in some cases, or even globally, but without backing, it's just an assumption. One which I find highly questionable. TRIM could also be useful where ZFS uses a storage LUN which is sparsely provisioned, in order to deallocate blocks in the LUN which have previously been allocated, but whose contents have since been invalidated. In this case, both ZFS and whatever is providing the storage LUN would need to support TRIM. Out of interest, what other filesystems out there today can generate TRIM commands? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] show hdd life time ?
Richard Elling wrote: On Jan 21, 2011, at 7:36 PM, Tobias Lauridsen wrote: is it possible to see my hdd total time it has been in use, so I can switch to a new one before it gets too many hours old? In theory, yes. In practice, I've never seen a disk properly report this data on a consistent basis :-( Perhaps some of the more modern disks do a better job? Look for the power on hours (POH) attribute of SMART. http://en.wikipedia.org/wiki/S.M.A.R.T. If you're looking for stats to give an indication of likely wear, and thus increasing probability of failure, POH is probably not very useful by itself (or even at all). Things like Head Flying Hours and Load Cycle Count are probably more indicative, although not necessarily maintained by all drives. Of course, data which gives indication of actual (rather than likely) wear is even more important as an indicator of impending failure, such as the various error and retry counts. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
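If smartmontools happens to be installed (it isn't part of the base OS on the releases discussed here), the SMART attribute table - including Power_On_Hours and the various error/retry counters - can usually be dumped with something like this (device name is illustrative, and some controllers need an extra -d option):
# smartctl -A /dev/rdsk/c0t2d0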
Re: [zfs-discuss] incorrect vdev added to pool
On 01/15/11 11:32 PM, Gal Buki wrote: Hi I have a pool with a raidz2 vdev. Today I accidentally added a single drive to the pool. I now have a pool that partially has no redundancy as this vdev is a single drive. Is there a way to remove the vdev Not at the moment, as far as I know. and replace it with a new raidz2 vdev? If not what can I do to do damage control and add some redundancy to the single drive vdev? I think you should be able to attach another disk to it to make them into a mirror. (Make sure you attach, and not add.) -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
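For example, if the accidentally-added single drive is c1t5d0 and you have a spare disk c1t6d0 of at least the same size (device names are made up), attaching turns that vdev into a two-way mirror:
# zpool attach tank c1t5d0 c1t6d0
Note it's attach, not add - add would create yet another top-level vdev.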
Re: [zfs-discuss] how to quiesce and unquiesc zfs and zpool for array/hardware snapshots ?
Sridhar, You have switched to a new disruptive filesystem technology, and it has to be disruptive in order to break out of all the issues older filesystems have, and give you all the new and wonderful features. However, you are still trying to use old filesystem techniques with it, which is why things don't fit for you, and you are missing out on the more powerful way ZFS presents these features to you. On 11/16/10 06:59 AM, Ian Collins wrote: On 11/16/10 07:19 PM, sridhar surampudi wrote: Hi, How it would help for instant recovery or point in time recovery ?? i.e restore data at device/LUN level ? Why would you want to? If you are sending snapshots to another pool, you can do instant recovery at the pool level. Point in time recovery is a feature of ZFS snapshots. What's more, with ZFS you can see all your snapshots online all the time, read and/or recover just individual files or whole datasets, and the storage overhead is very efficient. If you want to recover a whole LUN, that's presumably because you lost the original, and in this case the system won't have the original filesystem mounted. Currently it is easy as I can unwind the primary device stack and restore data at device/ LUN level and recreate stack. It's probably easier with ZFS to restore data at the pool or filesystem level from snapshots. Trying to work at the device level is just adding an extra level of complexity to a problem already solved. I won't claim ZFS couldn't better support use of back-end Enterprise storage, but in this case, you haven't given any use cases where that's relevant. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Changing GUID
sridhar surampudi wrote: Hi I am looking along similar lines, my requirement is 1. create a zpool on one or many devices ( LUNs ) from an array ( array can be IBM or HP EVA or EMC etc.. not SS7000). 2. Create file systems on the zpool 3. Once file systems are in use (I/O is happening) I need to take a snapshot at array level a. Freeze the zfs file system ( not required due to zfs consistency : source : mailing groups) b. take array snapshot ( say .. IBM flash copy ) c. Got new snapshot device (having same data and metadata including same GUID of source pool) Now I need a way to change the GUID and pool of the snapshot device so that the snapshot device can be accessible on the same host or an alternate host (if the LUN is shared). Could you please post commands for the same. There is no way I know of currently. (There was an unofficial program floating around to do this on much earlier opensolaris versions, but it no longer works.) If you have a support contract, raise a call and ask to be added to RFE 6744320. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] how to quiesce and unquiesc zfs and zpool for array/hardware snapshots ?
sridhar surampudi wrote: Hi Darren, In short, I am looking for a way to freeze and thaw a zfs file system so that for a hardware snapshot I can do 1. run zfs freeze 2. run hardware snapshot on the devices belonging to the zpool where the given file system is residing. 3. run zfs thaw Unlike other filesystems, ZFS is always consistent on disk, so there's no need to freeze a zpool to take a hardware snapshot. The hardware snapshot will effectively contain all transactions up to the last transaction group commit, plus all synchronous transactions up to the hardware snapshot. If you want to be sure that all transactions up to a certain point in time are included (for the sake of an application's data), take a ZFS snapshot (which will force a TXG commit), and then take the hardware snapshot. You will not be able to access the hardware snapshot from the system which has the original zpool mounted, because the two zpools will have the same pool GUID (there's an RFE outstanding on fixing this). The one thing you do need to be careful of is that, with a multi-disk zpool, the hardware snapshot is taken at an identical point in time across all the disks in the zpool. This functionality is usually an extra-charge option in Enterprise storage systems. If the hardware snapshots are staggered across multiple disks, all bets are off, although if you take a zfs snapshot immediately beforehand and you test import/scrub the hardware snapshot (on a different system) immediately (so you can repeat the hardware snapshot again if it fails), maybe you will be lucky. The right way to do this with zfs is to send/recv the datasets to a fresh zpool, or (S10 Update 9) to create an extra zpool mirror and then split it off with zpool split. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
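As a sketch of the zpool split approach (Solaris 10 9/10 or later; pool and device names are made up): attach a mirror to each top-level vdev, wait for the resilver to complete, then split the second half off as a new, separately importable pool:
# zpool attach tank c0t0d0 c0t2d0
# zpool attach tank c0t1d0 c0t3d0
# zpool status tank      (wait until resilvering has finished)
# zpool split tank tanksnap
The new pool (tanksnap here) gets its own GUID, so it can be imported on the same or another host without the clash described above.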
Re: [zfs-discuss] How do you use >1 partition on x86?
Change the new partition type to something that none of the OS's on the system will know anything about, so they don't make any invalid assumptions about what might be in it. Then use the appropriate partition device node, /dev/dsk/c7t0d0p4 (assuming it's the 4th primary FDISK partition). Multiple zpools on one disk are not going to be good for performance if you use both together. There may be some way to grow the existing Solaris partition into the spare space without destroying the contents and then grow the zpool into the new space, but I haven't tried this with FDISK partitions, so I don't know if it works without damaging the existing contents. (I have done it with slices, and it does work in that case.) Bill Werner wrote: So when I built my new workstation last year, I partitioned the one and only disk in half, 50% for Windows, 50% for 2009.06. Now, I'm not using Windows, so I'd like to use the other half for another ZFS pool, but I can't figure out how to access it. I have used fdisk to create a second Solaris2 partition, did a re-con reboot, but format still only shows the 1 available partition. How do I use the second partition? selecting c7t0d0 Total disk size is 30401 cylinders Cylinder size is 16065 (512 byte) blocks Cylinders Partition Status Type Start End Length % 1 Other OS 0 4 5 0 2 IFS: NTFS 5 1917 1913 6 3 Active Solaris2 1917 14971 13055 43 4 Solaris2 14971 30170 15200 50 format Searching for disks...done AVAILABLE DISK SELECTIONS: 0. c7t0d0 /p...@0,0/pci1028,2...@1f,2/d...@0,0 Thanks for any idea. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] TLER and ZFS
casper@sun.com wrote: On Tue, Oct 5, 2010 at 11:49 PM, wrote: I'm not sure that that is correct; the drive works on naive clients but I believe it can reveal its true colors. The drive reports 512 byte sectors to all hosts. AFAIK there's no way to make it report 4k sectors. Too bad because it makes it less useful (specifically because the label mentions sectors and if you can use bigger sectors, you can address a larger drive). Having now read a number of forums about these, there's a strong feeling WD screwed up by not providing a switch to disable pseudo 512b access so you can use the 4k native. The industry as a whole will transition to 4k sectorsize over next few years, but these first 4k sectorsize HDs are rather less useful with 4k sectorsize-aware OS's. Let's hope other manufacturers get this right in their first 4k products. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] TLER and ZFS
Michael DeMan wrote: The WD 1TB 'enterprise' drives are still 512 sector size and safe to use, who knows though, maybe they just started shipping with 4K sector size as I write this e-mail? Another annoying thing with the whole 4K sector size, is what happens when you need to replace drives next year, or the year after? That part has me worried on this whole 4K sector migration thing more than what to buy today. Given the choice, I would prefer to buy 4K sector size now, but operating system support is still limited. Does anybody know if there any vendors that are shipping 4K sector drives that have a jumper option to make them 512 size? WD has a jumper, but is there explicitly to work with WindowsXP, and is not a real way to dumb down the drive to 512. I would presume that any vendor that is shipping 4K sector size drives now, with a jumper to make it 'real' 512, would be supporting that over the long run? Changing the sector size (if it's possible at all) would require a reformat of the drive. On SCSI disks which support it, you do it by changing the sector size on the relevant mode select page, and then sending a format-unit command to make the drive relayout all the sectors. I've no idea if these 4K sata drives have any such mechanism, but I would expect they would. BTW, I've been using a pair of 1TB Hitachi Ultrastar for something like 18 months without any problems at all. Of course, a 1 year old disk model is no longer available now. I'm going to have to swap out for bigger disks in the not too distant future. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] fs root inode number?
Richard L. Hamilton wrote: Typically on most filesystems, the inode number of the root directory of the filesystem is 2, 0 being unused and 1 historically once invisible and used for bad blocks (no longer done, but kept reserved so as not to invalidate assumptions implicit in ufsdump tapes). However, my observation seems to be (at least back at snv_97) that the inode number of ZFS filesystem root directories (including at the top level of a zpool) is 3, not 2. If there's any POSIX/SUS requirement for the traditional number 2, I haven't found it. So maybe there's no reason founded in official standards for keeping it the same. But there are bound to be programs that make what was with other filesystems a safe assumption. Perhaps a warning is in order, if there isn't already one. Is there some _reason_ why the inode number of filesystem root directories in ZFS is 3 rather than 2? If you look at zfs_create_fs(), you will see the first 3 items created are: create the zap object used for SA attribute registration; create a delete queue; create the root znode. Hence, inode 3. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Adding a higher level partition to ZFS pool
Axelle Apvrille wrote: Hi all, I would like to add a new partition to my ZFS pool but it looks like it's more tricky than expected. The layout of my disk is the following: - first partition for Windows. I want to keep it. (no formatting !) - second partition for OpenSolaris. This is where I have all the Solaris slices (c0d0s0 etc). I have a single ZFS pool. OpenSolaris boots on ZFS. - third partition: a FAT partition I want to keep (no formatting !) - fourth partition: I want to add this partition to my ZFS pool (or another pool ?). I don't care if information on that partition is lost, I can format it if necessary. zpool add c0d0p0:2 ? Hmm... You cannot add it to the root pool, as the root pool cannot be a RAID0. You can make another pool from it... Ideally, set the FDISK partition type to something that none of the OS's on the system will know anything about. (It doesn't matter what it is from the zfs point of view, but you don't want any of the OS's thinking it's something they believe they know how to use.) zpool create tank c0d0p4 (for the 4th FDISK primary partition). Note that two zpools on the same disk may give you poor performance if you are accessing both at the same time, as you are forcing head seeking between them. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Failed zfs send "invalid backup stream".............
Humberto Ramirez wrote: I'm trying to replicate a 300 GB pool with this command: zfs send al...@3 | zfs receive -F omega About 2 hours into the process it fails with this error "cannot receive new filesystem stream: invalid backup stream" I have tried setting the target read only (zfs set readonly=on omega) and also disabling Timeslider, thinking it might have something to do with it. What could be causing the error? Could be zfs filesystem version too old on the sending side (I have one such case). What are their versions, and what release/build of the OS are you using? If the target is a new hard drive can I use this: zfs send al...@3 > /dev/c10t0d0 ? That command doesn't make much sense for the purpose of doing anything useful. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
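To check for a version mismatch between the two sides (pool name is illustrative), compare on both machines:
# zpool get version tank
# zfs get -r version tank
# zpool upgrade -v
# zfs upgrade -v
The last two list the pool and filesystem versions the installed OS supports.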
Re: [zfs-discuss] SCSI write retry errors on ZIL SSD drives...
I'm not intimately familiar with the firmware versions, but if you're having problems, making sure you have the latest firmware is probably a good thing to do. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris startup script location
What you say is true only on the system itself. On an NFS client system, 30 seconds of lost data in the middle of a file (as per my earlier example) is a corrupt file. -original message- Subject: Re: [zfs-discuss] Solaris startup script location From: Edward Ned Harvey Date: 18/08/2010 17:17 > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > boun...@opensolaris.org] On Behalf Of Alxen4 > > Disabling ZIL converts all synchronous calls to asynchronous which > makes ZFS report data acknowledgment before it actually was written > to stable storage which in turn improves performance but might cause > data corruption in case of server crash. > > Is it correct ? It is partially correct. With the ZIL disabled, you could lose up to 30 sec of writes, but it won't cause an inconsistent filesystem, or "corrupt" data. If you make a distinction between "corrupt" and "lost" data, then this is valuable for you to know: Disabling the ZIL can result in up to 30sec of lost data, but not corrupt data. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris startup script location
Alxen4 wrote: Thanks...Now I think I understand... Let me summarize it and let me know if I'm wrong. Disabling ZIL converts all synchronous calls to asynchronous, which makes ZFS report data acknowledgment before it actually was written to stable storage, which in turn improves performance but might cause data corruption in case of a server crash. Is it correct? In my case I'm having serious performance issues with NFS over ZFS. You need a non-volatile slog, such as an SSD. My NFS client is ESXi, so the major question is: is there a risk of corruption for VMware images if I disable the ZIL? Yes. If your NFS server takes an unexpected outage and comes back up again, some writes will have been lost which ESXi thinks succeeded (typically 5 to 30 seconds worth of writes/updates immediately before the outage). So as an example, if you had an application writing a file sequentially, you will likely find an area of the file is corrupt because the data was lost. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
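If you do add an SSD log device, it's a one-line operation; device names here are made up, and a mirrored pair is the safer choice:
# zpool add tank log c3t0d0
or
# zpool add tank log mirror c3t0d0 c3t1d0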
Re: [zfs-discuss] Solaris startup script location
Andrew Gabriel wrote: Alxen4 wrote: Is there any way run start-up script before non-root pool is mounted ? For example I'm trying to use ramdisk as ZIL device (ramdiskadm ) So I need to create ramdisk before actual pool is mounted otherwise it complains that log device is missing :) For sure I can manually remove/and add it by script and put the script in regular rc2.d location...I'm just looking for more elegant way to it. Can you start by explaining what you're trying to do, because this may be completely misguided? A ramdisk is volatile, so you'll lose it when system goes down, causing failure to mount on reboot. Recreating a ramdisk on reboot won't recreate the slog device you lost when the system went down. I expect the zpool would fail to mount. Furthermore, using a ramdisk as a ZIL is effectively just a very inefficient way to disable the ZIL. A better way to do this is to "zfs set sync=disabled ..." on relevant filesystems. I can't recall which build introduced this, but prior to that, you can set zfs://zil_disable=1 in /etc/system but that applies to all pools/filesystems. The double-slash was brought to you by a bug in thunderbird. The original read: set zfs:zil_disable=1 -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris startup script location
Alxen4 wrote: Is there any way run start-up script before non-root pool is mounted ? For example I'm trying to use ramdisk as ZIL device (ramdiskadm ) So I need to create ramdisk before actual pool is mounted otherwise it complains that log device is missing :) For sure I can manually remove/and add it by script and put the script in regular rc2.d location...I'm just looking for more elegant way to it. Can you start by explaining what you're trying to do, because this may be completely misguided? A ramdisk is volatile, so you'll lose it when system goes down, causing failure to mount on reboot. Recreating a ramdisk on reboot won't recreate the slog device you lost when the system went down. I expect the zpool would fail to mount. Furthermore, using a ramdisk as a ZIL is effectively just a very inefficient way to disable the ZIL. A better way to do this is to "zfs set sync=disabled ..." on relevant filesystems. I can't recall which build introduced this, but prior to that, you can set zfs://zil_disable=1 in /etc/system but that applies to all pools/filesystems. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS automatic rollback and data rescue.
Constantine wrote: ZFS doesn't do this. I thought so too. ;) Situation brief: I've got OpenSolaris 2009.06 installed on a RAID-5 array on a controller with 512 MB cache (as I can remember) without a cache-saving battery. I hope the controller disabled the cache then. Probably a good idea to run "zpool scrub rpool" to find out if it's broken. It will probably take some time. zpool status will show the progress. On Friday a lightning bolt hit the power supply station of the colocating company, and it turned out that their UPSs are not much more than decoration. After reboot, filesystem and logs are on their last snapshot version. Would also be useful to see output of: zfs list -t all -r zpool/filesystem wi...@zeus:~/.zfs/snapshot# zfs list -t all -r rpool NAME USED AVAIL REFER MOUNTPOINT rpool 427G 1.37T 82.5K /rpool rpool/ROOT 366G 1.37T 19K legacy rpool/ROOT/opensolaris 20.6M 1.37T 3.21G / rpool/ROOT/xvm 8.10M 1.37T 8.24G / rpool/ROOT/xvm-1 690K 1.37T 8.24G / rpool/ROOT/xvm-2 35.1G 1.37T 232G / rpool/ROOT/xvm-3 851K 1.37T 221G / rpool/ROOT/xvm-4 331G 1.37T 221G / rpool/ROOT/xv...@install 144M - 2.82G - rpool/ROOT/xv...@xvm 38.3M - 3.21G - rpool/ROOT/xv...@2009-07-27-01:09:14 56K - 8.24G - rpool/ROOT/xv...@2009-07-27-01:09:57 56K - 8.24G - rpool/ROOT/xv...@2009-09-13-23:34:54 2.30M - 206G - rpool/ROOT/xv...@2009-09-13-23:35:17 1.14M - 206G - rpool/ROOT/xv...@2009-09-13-23:42:12 5.72M - 206G - rpool/ROOT/xv...@2009-09-13-23:42:45 5.69M - 206G - rpool/ROOT/xv...@2009-09-13-23:46:25 573K - 206G - rpool/ROOT/xv...@2009-09-13-23:46:34 525K - 206G - rpool/ROOT/xv...@2009-09-13-23:48:11 6.51M - 206G - rpool/ROOT/xv...@2010-04-22-03:50:25 24.6M - 221G - rpool/ROOT/xv...@2010-04-22-03:51:28 24.6M - 221G - Actually, there's 24.6Mbytes worth of changes to the filesystem since the last snapshot, which is coincidentally about the same as there was over the preceding minute between the last two snapshots. I can't tell if (or how much of) that happened before, versus after, the reboot though. rpool/dump 16.0G 1.37T 16.0G - rpool/export 28.6G 1.37T 21K /export rpool/export/home 28.6G 1.37T 21K /export/home rpool/export/home/wiron 28.6G 1.37T 28.6G /export/home/wiron rpool/swap 16.0G 1.38T 101M - = Normally in a power-out scenario, you will only lose asynchronous writes since the last transaction group commit, which will be up to 30 seconds worth (although normally much less), and you lose no synchronous writes. However, I've no idea what your potentially flaky RAID array will have done. If it was using its cache and thinking it was non-volatile, then it could easily have corrupted the zfs filesystem due to having got writes out of sequence with transaction commits, and this can render the filesystem no longer mountable because the back-end storage has lied to zfs about committing writes. Even though you were lucky and it still mounts, it might still be corrupted, hence the suggestion to run zpool scrub (and even more important, get the RAID array fixed). Since I presume ZFS doesn't have redundant storage for this zpool, any corrupted data can't be repaired by ZFS, although it will tell you about it. Running ZFS without redundancy on flaky storage is not a good place to be. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS automatic rollback and data rescue.
Constantine wrote: Hi. I've got a ZFS filesystem (opensolaris 2009.06) which, as I can see, was automatically rolled back by the OS to the latest snapshot after the power failure. ZFS doesn't do this. Can you give some more details of what you're seeing? Would also be useful to see output of: zfs list -t all -r zpool/filesystem There is a problem - the snapshot is too old - and, consequently, a question: can I browse the pre-rollback, corrupted branch of the FS? And, if I can, how? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Global Spare for 2 pools
Tony MacDoodle wrote: I have 2 ZFS pools all using the same drive type and size. The question is can I have 1 global hot spare for both of those pools? Yes. A hot spare disk can be added to more than one pool at the same time. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAID Z stripes
Phil Harman wrote: On 10 Aug 2010, at 08:49, Ian Collins wrote: On 08/10/10 06:21 PM, Terry Hull wrote: I am wanting to build a server with 16 x 1TB drives with 2 x 8-drive RAIDZ2 arrays striped together. However, I would like the capability of adding additional stripes of 2TB drives in the future. Will this be a problem? I thought I read it is best to keep the stripes the same width and was planning to do that, but I was wondering about using drives of different sizes. These drives would all be in a single pool. It would work, but you run the risk of the smaller drives becoming full and all new writes going to the bigger vdev. So while usable, performance would suffer. Almost by definition, the 1TB drives are likely to be getting full when the new drives are added (presumably because of running out of space). Performance can only be said to suffer relative to a new pool built entirely with drives of the same size. Even if he added 8x 2TB drives in a RAIDZ3 config it is hard to predict what the performance gap will be (on the one hand: RAIDZ3 vs RAIDZ2, on the other: an empty group vs an almost full, presumably fragmented, group). One option would be to add 2TB drives as 5 drive raidz3 vdevs. That way your vdevs would be approximately the same size and you would have the optimum redundancy for the 2TB drives. I think you meant 6, but I don't see a good reason for matching the group sizes. I'm for RAIDZ3, but I don't see much logic in mixing groups of 6+2 x 1TB and 3+3 x 2TB in the same pool (in one group I appear to care most about maximising space, in the other I'm maximising availability). Another option - use the new 2TB drives to swap out the existing 1TB drives. If you can find another use for the swapped out drives, this works well, and avoids ending up with sprawling lower capacity drives as your pool grows in size. This is what I do at home. The freed-up drives get used in other systems and for off-site backups. Over the last 4 years, I've upgraded from 1/4TB, to 1/2TB, and now on to 1TB drives. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
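A sketch of the swap-out approach, assuming a build with the autoexpand pool property (device names are made up). Replace one drive at a time and let each resilver finish before doing the next:
# zpool set autoexpand=on tank
# zpool replace tank c1t2d0 c1t8d0
# zpool status tank      (wait for the resilver to complete, then repeat for the next disk)
Once every drive in a vdev has been replaced with a larger one, the extra capacity becomes available to the pool.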
Re: [zfs-discuss] ZFS SCRUB
Mohammed Sadiq wrote: Hi, Is it recommended to do a scrub while the filesystem is mounted? How frequently do we have to scrub, and under what circumstances? You can scrub while the filesystems are mounted - most people do, there's no reason to unmount for a scrub. (Scrub is pool level, not filesystem level.) Scrub does noticeably slow the filesystem, so pick a time of low application load or a time when performance isn't critical. If it overruns into a busy period, you can cancel the scrub. Unfortunately, you can't pause and resume - there's an RFE for this - so if you cancel one you can't restart it from where it got to; it has to restart from the beginning. You should scrub occasionally anyway. That's your check that data you haven't accessed in your application isn't rotting on the disks. You should also do a scrub before you do a planned reduction of the pool redundancy (e.g. if you're going to detach a mirror side in order to attach a larger disk), most particularly if you are reducing the redundancy to nothing. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
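The relevant commands, with an illustrative pool name:
# zpool scrub tank       (start a scrub)
# zpool status tank      (check progress)
# zpool scrub -s tank    (stop a running scrub - as noted above, it cannot be resumed, only restarted)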
Re: [zfs-discuss] Maximum zfs send/receive throughput
Jim Barker wrote: Just an update, I had a ticket open with Sun regarding this and it looks like they have a CR for what I was seeing (6975124). That would seem to describe a zfs receive which has stopped for 12 hours. You described yours as slow, which is not the term I personally would use for one which is stopped. However, you haven't given anything like enough detail here of your situation and what's happening for me to make any worthwhile guesses. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zvol recordsize for backing a zpool over iSCSI
Just wondering if anyone has experimented with working out the best zvol recordsize for a zvol which is backing a zpool over iSCSI? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
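For anyone trying the same thing: the block size is set at zvol creation time with the volblocksize property (it can't be changed afterwards), e.g. to try an 8K block size on a hypothetical 100G zvol:
# zfs create -V 100G -o volblocksize=8k tank/iscsivol
The interesting question is how this should relate to the block sizes of the pool built on top of the exported LUN.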
Re: [zfs-discuss] NFS performance?
Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Phil Harman Milkowski and Neil Perrin's zil synchronicity [PSARC/2010/108] changes with sync=disabled, when the changes work their way into an available The fact that people run unsafe systems seemingly without complaint for years assumes that they know silent data corruption when they see^H^H^Hhear it ... which, of course, they didn't ... because it is silent ... or having encountered corrupted data, that they have the faintest idea where it came from. In my day to day work I still find many people that have been (apparently) very lucky. Running with sync disabled, or ZIL disabled, you could call "unsafe" if you want to use a generalization and a stereotype. Just like people say "writeback" is unsafe. If you apply a little more intelligence, you'll know, it's safe in some conditions, and not in other conditions. Like ... If you have a BBU, you can use your writeback safely. And if you're not sharing stuff across the network, you're guaranteed the disabled ZIL is safe. But even when you are sharing stuff across the network, the disabled ZIL can still be safe under the following conditions: If you are only doing file sharing (NFS, CIFS) and you are willing to reboot/remount from all your clients after an ungraceful shutdown of your server, then it's safe to run with ZIL disabled. No, that's not safe. The client can still lose up to 30 seconds of data, which could be, for example, an email message which is received and foldered on the server, and is then lost. It's probably /*safe enough*/ for most home users, but you should be fully aware of the potential implications before embarking on this route. (As I said before, the zpool itself is not at any additional risk of corruption, it's just that you might find the zfs filesystems with sync=disabled appear to have been rewound by up to 30 seconds.) If you're unsure, then adding SSD nonvolatile log device, as people have said, is the way to go. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance?
Thomas Burgess wrote: On Fri, Jul 23, 2010 at 3:11 AM, Sigbjorn Lie wrote: Hi, I've been searching around on the Internet to find some help with this, but have been unsuccessful so far. I have some performance issues with my file server. I have an OpenSolaris server with a Pentium D 3GHz CPU, 4GB of memory, and a RAIDZ1 over 4 x Seagate (ST31500341AS) 1.5TB SATA drives. If I compile or even just unpack a tar.gz archive with source code (or any archive with lots of small files) on my Linux client onto an NFS mounted disk on the OpenSolaris server, it's extremely slow compared to unpacking this archive locally on the server. A 22MB .tar.gz file containing 7360 files takes 9 minutes and 12 seconds to unpack over NFS. Unpacking the same file locally on the server is just under 2 seconds. Between the server and client I have a gigabit network, which at the time of testing had no other significant load. My NFS mount options are: "rw,hard,intr,nfsvers=3,tcp,sec=sys". Any suggestions to why this is? Regards, Sigbjorn as someone else said, adding an ssd log device can help hugely. I saw about a 500% nfs write increase by doing this. I've heard of people getting even more. Another option if you don't care quite so much about data security in the event of an unexpected system outage would be to use Robert Milkowski and Neil Perrin's zil synchronicity [PSARC/2010/108] changes with sync=disabled, when the changes work their way into an available build. The risk is that if the file server goes down unexpectedly, it might come back up having lost some seconds worth of changes which it told the client (lied) that it had committed to disk, when it hadn't, and this violates the NFS protocol. That might be OK if you are using it to hold source that's being built, where you can kick off a build again if the server did go down in the middle of it. Wouldn't be a good idea for some other applications though (although Linux ran this way for many years, seemingly without many complaints). Note that there's no increased risk of the zpool going bad - it's just that after the reboot, filesystems with sync=disabled will look like they were rewound by some seconds (possibly up to 30 seconds). -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send to remote any ideas for a faster way than ssh?
Richard Jahnel wrote: Any idea why? Does the zfs send or zfs receive bomb out part way through? I have no idea why mbuffer fails. Changing the -s from 128 to 1536 made it take longer to occur and slowed it down by about 20%, but didn't resolve the issue. It just meant I might get as far as 2.5gb before mbuffer bombed with broken pipe. Trying -r and -R with various values had no effect. I found that where the network bandwidth and the disks' throughput are similar (which requires a pool with many top level vdevs in the case of a 10Gb link), you ideally want a buffer on the receive side which will hold about 5 seconds worth of data. A large buffer on the transmit side didn't help. The aim is to be able to continue streaming data across the network whilst a transaction commit happens at the receive end and zfs receive isn't reading, but to have the data ready locally for zfs receive when it starts reading again. Then the network will stream, in spite of the bursty read nature of zfs receive. I recorded this in bugid http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6729347 However, I haven't verified the extent to which this still happens on more recent builds. Might be worth trying it over rsh if security isn't an issue, and then you lose the encryption overhead. Trouble is that then you've got almost no buffering, which can do bad things to the performance, which is why mbuffer would be ideal if it worked for you. I seem to remember reading that rsh was remapped to ssh in Solaris. No. On the system you're rsh'ing to, you will have to "svcadm enable svc:/network/shell:default", and set up appropriate authorisation in ~/.rhosts I heard of some folks using netcat. I haven't figured out where to get netcat nor the syntax for using it yet. I used a buffering program of my own, but I presume mbuffer would work too. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
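For completeness, the sort of mbuffer pipeline being discussed looks like this (mbuffer is a third-party tool; the host name, port, snapshot name and buffer sizes are only examples), with the large buffer on the receiving side:
recvhost# mbuffer -I 9090 -s 128k -m 2G | zfs receive -F tank/fs
sendhost# zfs send tank/fs@snap | mbuffer -s 128k -m 128M -O recvhost:9090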
Re: [zfs-discuss] zfs send to remote any ideas for a faster way than ssh?
Richard Jahnel wrote: I've tried ssh blowfish and scp arcfour. Both are CPU limited long before the 10g link is. I've also tried mbuffer, but I get broken pipe errors part way through the transfer. Any idea why? Does the zfs send or zfs receive bomb out part way through? Might be worth trying it over rsh if security isn't an issue, and then you lose the encryption overhead. Trouble is that then you've got almost no buffering, which can do bad things to the performance, which is why mbuffer would be ideal if it worked for you. I'm open to ideas for faster ways to either zfs send directly or through a compressed file of the zfs send output. For the moment I: zfs send into pigz; scp (arcfour) the gz file to the remote host; gunzip into zfs receive. This takes a very long time for 3 TB of data, and barely makes use of the 10g connection between the machines due to the CPU limiting on the scp and gunzip processes. Also, if you have multiple datasets to send, might be worth seeing if sending them in parallel helps. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 1tb SATA drives
Arne Jansen wrote: Jordan McQuown wrote: I'm curious to know what other people are running for HDs in white box systems? I'm currently looking at Seagate Barracudas and Hitachi Deskstars. I'm looking at the 1TB models. These will be attached to an LSI expander in a sc847e2 chassis driven by an LSI 9211-8i HBA. This system will be used as a large storage array for backups and archiving. I wouldn't recommend using desktop drives in a server RAID. They don't handle the vibrations present in a server chassis well. I'd recommend at least the Seagate Constellation or the Hitachi Ultrastar, though I haven't tested the Deskstar myself. I've been using a couple of 1TB Hitachi Ultrastars for about a year with no problem. I don't think mine are still available, but I expect they have something equivalent. The pool is scrubbed 3 times a week, which takes nearly 19 hours now and hammers the heads quite hard. I keep meaning to reduce the scrub frequency now that it's taking so long, but haven't got around to it. What I really want is pause/resume scrub, and the ability to trigger the pause/resume from the screensaver (or something similar). -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
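Reducing the scrub frequency is just a cron change; a sketch (pool name and schedule hypothetical):

  # crontab entry for root: scrub once a week, Sundays at 02:00
  0 2 * * 0 /usr/sbin/zpool scrub tank

There's no pause/resume today, so scheduling when a new scrub starts is the only knob available.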
Re: [zfs-discuss] Recommended RAM for ZFS on various platforms
Garrett D'Amore wrote: Btw, instead of RAIDZ2, I'd recommend simply using a stripe of mirrors. You'll have better performance, and good resilience against errors. And you can grow later as you need to by just adding additional drive pairs. -- Garrett Or in my case, I find my home data growth is slightly less than the rate of disk capacity increase, so every 18 months or so I simply swap out the disks for higher capacity ones. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
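A sketch of both growth paths (device names hypothetical):

  # zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0   (a stripe of two mirrors)
  # zpool add tank mirror c0t4d0 c0t5d0                           (grow by adding another pair)

For the swap-out approach, replace each half of a mirror with a larger disk and let it resilver:

  # zpool set autoexpand=on tank
  # zpool replace tank c0t0d0 c1t0d0

Once both sides of a mirror are the larger size, the vdev can grow into the new capacity (autoexpand is available on recent builds; on older ones an export/import picks up the new size).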
Re: [zfs-discuss] Legality and the future of zfs...
Linder, Doug wrote: Out of sheer curiosity - and I'm not disagreeing with you, just wondering - how does ZFS make money for Oracle when they don't charge for it? Do you think it's such an important feature that it's a big factor in customers picking Solaris over other platforms? Yes, it is one of many significant factors in customers choosing Solaris over other OSes. Having chosen Solaris, customers then tend to buy Sun/Oracle systems to run it on. Of course, there are the 7000 series products too, which are heavily based on the capabilities of ZFS, amongst other Solaris features. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Lost ZIL Device - FIXED
Greetings All, I can't believe I didn't figure this out sooner. First of all, a big thank you to everyone who gave me advice and suggestions, especially Richard. The problem was with the -d switch. When importing a pool, if you specify -d and a path, it ONLY looks there. So if I run: # zpool import -d /var/zfs-log/ tank it won't look for devices in /dev/dsk. Conversely, running without -d /var/zfs-log/ it won't find the log device. Here is the command that worked: # zpool import -d /var/zfs-log -d /dev/dsk tank And to make sure that this doesn't happen again (I have learned my lesson this time) I have ordered two small SSD drives to put in a mirrored config for the log device. Thanks again to everyone, and now I will get some worry-free sleep :) Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
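For anyone finding this later: adding that mirrored SSD pair as the log would look something like this (device names hypothetical):

  # zpool add tank log mirror c2t0d0 c2t1d0

The old file-backed log can be removed first with 'zpool remove', given log device removal support (pool version 19 and later).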
Re: [zfs-discuss] Lost ZIL Device
According to 'zpool upgrade' my pool versions are at 22. All pools were upgraded several months ago, including the one in question. Here is what I get when I try to import:

fileserver ~ # zpool import 9013303135438223804
cannot import 'tank': pool may be in use from other system, it was last accessed by fileserver (hostid: 0x406155) on Tue Jul 6 10:46:13 2010
use '-f' to import anyway

fileserver ~ # zpool import -f 9013303135438223804
cannot import 'tank': one or more devices is currently unavailable
Destroy and re-create the pool from a backup source.

On Jul 6, 2010, at 11:48 PM, Edward Ned Harvey wrote: >> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Andrew Kener >> >> the OS hard drive crashed [and log device] > > Here's what I know: In zpool >= 19, if you import this, it will prompt you to confirm the loss of the log device, and then it will import. > > Here's what I have heard: The ability to import with a failed log device as described above was created right around zpool 14 or 15, not quite sure which. > > Here's what I don't know: If the failed zpool was some version which was too low ... and you try to import on an OS which is capable of a much higher version of zpool ... Can the newer OS handle it just because the newer OS is able to handle a newer version of zpool? Or maybe the version of the failed pool is the one that matters, regardless of what the new OS is capable of doing now? > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
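(For reference: sufficiently recent builds also have a 'zpool import -m' option to force an import whose only missing device is the log - my recollection is this arrived with or shortly after log device removal, around pool version 19, so whether it's available depends on the build:

  fileserver ~ # zpool import -m -f 9013303135438223804

If supported, the pool imports and the lost log device can then be removed or replaced.)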
[zfs-discuss] Lost ZIL Device
Hello All, I've recently run into an issue I can't seem to resolve. I have been running a zpool populated with two RAID-Z1 VDEVs and a file on the (separate) OS drive for the ZIL:

  raidz1-0   ONLINE
    c12t0d0  ONLINE
    c12t1d0  ONLINE
    c12t2d0  ONLINE
    c12t3d0  ONLINE
  raidz1-2   ONLINE
    c12t4d0  ONLINE
    c12t5d0  ONLINE
    c13t0d0  ONLINE
    c13t1d0  ONLINE
  logs
    /ZIL-Log.img

This was running on Nexenta Community Edition v3. Everything was going smoothly until today, when the OS hard drive crashed and I was not able to boot from it any longer. I had migrated this setup from an OpenSolaris install some months back and I still had the old drive intact. I put it in the system, booted it up and tried to import the zpool. Unfortunately, I have not been successful. Previously, when migrating from OSOL to Nexenta, I was able to get the new system to recognize and import the ZIL device file. Since it has been lost in the drive crash I have not been able to duplicate that success. Here is the output from a 'zpool import' command:

  pool: tank
    id: 9013303135438223804
 state: UNAVAIL
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        tank          UNAVAIL  missing device
          raidz1-0    ONLINE
            c12t0d0   ONLINE
            c12t1d0   ONLINE
            c12t5d0   ONLINE
            c12t3d0   ONLINE
          raidz1-2    ONLINE
            c12t4d0   ONLINE
            c12t2d0   ONLINE
            c13t0d0   ONLINE
            c13t1d0   ONLINE

I created a new file for the ZIL (using mkfile) and tried to specify it for inclusion with -d but it doesn't get recognized, probably because it was never part of the original zpool. I also symlinked the new ZIL file into /dev/dsk but that didn't make any difference either. Any suggestions? Andrew Kener ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
> > Good. Run 'zpool scrub' to make sure there are no > other errors. > > regards > victor > Yes, scrubbed successfully with no errors. Thanks again for all of your generous assistance. /AJ -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
> > - Original Message -
> > Victor,
> >
> > The zpool import succeeded on the next attempt following the crash that I reported to you by private e-mail!
> >
> > For completeness, this is the final status of the pool:
> >
> > pool: tank
> > state: ONLINE
> > scan: resilvered 1.50K in 165h28m with 0 errors on Sat Jul 3 08:02:30
>
> Out of curiosity, what sort of drives are you using here? Resilvering in 165h28m is close to a week, which is rather bad imho.

I think the resilvering statistic is quite misleading, in this case. We're using very average 1TB retail Hitachi disks, which perform just fine when the pool is healthy. What happened here is that the zpool-tank process was performing a resilvering task in parallel with the processing of a very large inconsistent dataset, and the latter took the overwhelming majority of the time to complete. Why it actually took over a week to process the 2TB volume in an inconsistent state is my primary concern with the performance of ZFS in this case.

> Vennlige hilsener / Best regards
>
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 97542685
> r...@karlsbakk.net
> http://blogg.karlsbakk.net/
> --
> [In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.]

-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
Victor, The zpool import succeeded on the next attempt following the crash that I reported to you by private e-mail! For completeness, this is the final status of the pool:

  pool: tank
 state: ONLINE
  scan: resilvered 1.50K in 165h28m with 0 errors on Sat Jul 3 08:02:30 2010
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
        cache
          c2t0d0    ONLINE       0     0     0

errors: No known data errors

Thank you very much for your help. We did not need to add additional RAM to solve this, in the end. Instead, we needed to persist with the import through several panics to finally work our way through the large inconsistent dataset; it is unclear whether the resilvering caused additional processing delay. Unfortunately, the delay made much of the data quite stale, now that it's been recovered. It does seem that zfs would benefit tremendously from a better (quicker and more intuitive?) set of recovery tools that are available to a wider range of users. It's really a shame, because the features and functionality in zfs are otherwise absolutely second to none. /Andrew -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
> Andrew,
>
> Looks like the zpool is telling you the devices are still doing work of some kind, or that there are locks still held.

Agreed; it appears the CSV1 volume is in a fundamentally inconsistent state following the aborted zfs destroy attempt. See later in this thread where Victor has identified this to be the case. I am awaiting his analysis of the latest crash.

> From the man page for Intro(2), the errors are listed. Number 16 looks to be an EBUSY.
>
>   16  EBUSY   Device busy
>       An attempt was made to mount a device that was already mounted, or an attempt was made to unmount a device on which there is an active file (open file, current directory, mounted-on file, active text segment). It will also occur if an attempt is made to enable accounting when it is already enabled. The device or resource is currently unavailable. EBUSY is also used by mutexes, semaphores, condition variables, and r/w locks to indicate that a lock is held, and by the processor control function P_ONLINE.
>
> Andrew Jones wrote:
> > Just re-ran 'zdb -e tank' to confirm the CSV1 volume is still exhibiting error 16:
> >
> > Could not open tank/CSV1, error 16
> >
> > Considering my attempt to delete the CSV1 volume led to the failure in the first place, I have to think that if I can either 1) complete the deletion of this volume or 2) roll back to a transaction prior to this based on logging or 3) repair whatever corruption has been caused by this partial deletion, that I will then be able to import the pool.
> >
> > What does 'error 16' mean in the ZDB output, any suggestions?

-- Geoff Shipman | Senior Technical Support Engineer
Phone: +13034644710
Oracle Global Customer Services
500 Eldorado Blvd. UBRM-04 | Broomfield, CO 80021
Email: geoff.ship...@sun.com | Hours: 9am-5pm MT, Monday-Friday

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
Victor, A little more info on the crash, from the messages file, is attached here. I have also decompressed the dump with savecore to generate unix.0, vmcore.0, and vmdump.0.

Jun 30 19:39:10 HL-SAN unix: [ID 836849 kern.notice]
Jun 30 19:39:10 HL-SAN ^Mpanic[cpu3]/thread=ff0017909c60:
Jun 30 19:39:10 HL-SAN genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ff0017909790 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice]
Jun 30 19:39:10 HL-SAN unix: [ID 839527 kern.notice] sched:
Jun 30 19:39:10 HL-SAN unix: [ID 753105 kern.notice] #pf Page fault
Jun 30 19:39:10 HL-SAN unix: [ID 532287 kern.notice] Bad kernel fault at addr=0x0
Jun 30 19:39:10 HL-SAN unix: [ID 243837 kern.notice] pid=0, pc=0x0, sp=0xff0017909880, eflags=0x10002
Jun 30 19:39:10 HL-SAN unix: [ID 211416 kern.notice] cr0: 8005003b cr4: 6f8
Jun 30 19:39:10 HL-SAN unix: [ID 624947 kern.notice] cr2: 0
Jun 30 19:39:10 HL-SAN unix: [ID 625075 kern.notice] cr3: 336a71000
Jun 30 19:39:10 HL-SAN unix: [ID 625715 kern.notice] cr8: c
Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice]
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] rdi: 282 rsi: 15809 rdx: ff03edb1e538
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] rcx: 5 r8: 0 r9: ff03eb2d6a00
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] rax: 202 rbx: 0 rbp: ff0017909880
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] r10: f80d16d0 r11: 4 r12: 0
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] r13: ff03e21bca40 r14: ff03e1a0d7e8 r15: ff03e21bcb58
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] fsb: 0 gsb: ff03e25fa580 ds: 4b
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] es: 4b fs: 0 gs: 1c3
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] trp: e err: 10 rip: 0
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] cs: 30 rfl: 10002 rsp: ff0017909880
Jun 30 19:39:10 HL-SAN unix: [ID 266532 kern.notice] ss: 38
Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice]
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909670 unix:die+dd ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909780 unix:trap+177b ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909790 unix:cmntrap+e6 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 802836 kern.notice] ff0017909880 0 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179098a0 unix:debug_enter+38 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179098c0 unix:abort_sequence_enter+35 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909910 kbtrans:kbtrans_streams_key+102 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909940 conskbd:conskbdlrput+e7 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179099b0 unix:putnext+21e ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179099f0 kbtrans:kbtrans_queueevent+7c ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a20 kbtrans:kbtrans_queuepress+7c ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a60 kbtrans:kbtrans_untrans_keypressed_raw+46 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a90 kbtrans:kbtrans_processkey+32 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909ae0 kbtrans:kbtrans_streams_key+175 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b10 kb8042:kb8042_process_key+40 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b50 kb8042:kb8042_received_byte+109 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b80 kb8042:kb8042_intr+6a ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909bb0 i8042:i8042_intr+c5 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0017909c00 unix:av_dispatch_autovect+7c ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0017909c40 unix:dispatch_hardint+33 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff00183552f0 unix:switch_sp_and_call+13 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355340 unix:do_interrupt+b8 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355350 unix:_interrupt+b8 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff00183554a0 unix:htable_steal+198 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355510 unix:htable_alloc+248 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff00183555c0
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
Victor, I've reproduced the crash and have the vmdump.0 and dump device files. How do I query the stack on the crash dump for your analysis? What other analysis should I provide? Thanks -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
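In case it helps anyone else following this thread, pulling the panic stack out of a savecore'd dump goes roughly like this (the crash directory path is the usual default, with dump number 0 as in the posts above):

  # cd /var/crash/HL-SAN
  # mdb unix.0 vmcore.0
  > ::status      (summary of the panic)
  > ::stack       (stack of the panic thread)
  > $C            (stack trace with frame pointers)
  > ::quit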
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
> On Jun 29, 2010, at 8:30 PM, Andrew Jones wrote:
> > Victor,
> >
> > The 'zpool import -f -F tank' failed at some point last night. The box was completely hung this morning; no core dump, no ability to SSH into the box to diagnose the problem. I had no choice but to reset, as I had no diagnostic ability. I don't know if there would be anything in the logs?
>
> It sounds like it might have run out of memory. Is it an option for you to add more memory to the box temporarily?

I'll place the order for more memory or transfer some from another machine. Seems quite likely that we did run out of memory.

> Even if it is an option, it is good to prepare for such an outcome and have kmdb loaded, either at boot time by adding -k to the 'kernel$' line in the GRUB menu, or by loading it from the console with 'mdb -K' before attempting the import (type ':c' at the mdb prompt to continue). In case it hangs again, you can press 'F1-A' on the keyboard, drop into kmdb and then use '$
> If your hardware has a physical or virtual NMI button, you can use that too to drop into kmdb, but you'll need to set a kernel variable for that to work:
> http://blogs.sun.com/darren/entry/sending_a_break_to_opensolaris
>
> > Earlier I ran 'zdb -e -bcsvL tank' in write mode for 36 hours and gave up to try something different. Now the zpool import has hung the box.
>
> What do you mean by running zdb in write mode? zdb normally is a readonly tool. Did you change it in some way?

I had read elsewhere that set zfs:zfs_recover=1 and set aok=1 placed zdb into some kind of a write/recovery mode. I have set these in /etc/system. Is this a bad idea in this case?

> > Should I try zdb again? Any suggestions?
>
> It sounds like zdb is not going to be helpful, as inconsistent dataset processing happens only in read-write mode. So you need to try the above suggestions with more memory and kmdb/nmi.

Will do, thanks!

> victor

-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
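To collect the recovery setup being described in one place (this is my reading of it, so treat as a sketch; note that zfs_recover and aok relax kernel assertion handling, so use with care):

  /etc/system:
    set zfs:zfs_recover=1
    set aok=1

  GRUB menu.lst, adding -k so kmdb is loaded at boot:
    kernel$ /platform/i86pc/kernel/$ISADIR/unix -k

Alternatively, 'mdb -K' from the console loads kmdb on a running system, and ':c' continues execution.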
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
Victor, The 'zpool import -f -F tank' failed at some point last night. The box was completely hung this morning; no core dump, no ability to SSH into the box to diagnose the problem. I had no choice but to reset, as I had no diagnostic ability. I don't know if there would be anything in the logs? Earlier I ran 'zdb -e -bcsvL tank' in write mode for 36 hours and gave up to try something different. Now the zpool import has hung the box. Should I try zdb again? Any suggestions? Thanks, Andrew -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
Thanks Victor. I will give it another 24 hrs or so and will let you know how it goes... You are right, a large 2TB volume (CSV1) was in the process of being deleted, as described above. It is showing error 16 on 'zdb -e'. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
Just re-ran 'zdb -e tank' to confirm the CSV1 volume is still exhibiting error 16: Could not open tank/CSV1, error 16 Considering my attempt to delete the CSV1 volume led to the failure in the first place, I have to think that if I can either 1) complete the deletion of this volume, or 2) roll back to a transaction prior to this based on logging, or 3) repair whatever corruption has been caused by this partial deletion, then I will be able to import the pool. What does 'error 16' mean in the ZDB output, any suggestions? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
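(For the archive: the error numbers zdb prints are standard errnos, so they can be decoded against the system header, for example:

  $ grep -w 16 /usr/include/sys/errno.h
  #define EBUSY   16      /* Device busy */

i.e. the dataset is busy/held, which fits an interrupted destroy.)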
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
Malachi, Thanks for the reply. There were no snapshots for the CSV1 volume that I recall... very few snapshots on any volume in the tank. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
Dedup had been turned on in the past for some of the volumes, but I had turned it off altogether before entering production due to performance issues. GZIP compression was turned on for the volume I was trying to delete. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
Update: have given up on the zdb write mode repair effort, at least for now. Hoping for any guidance / direction anyone's willing to offer... Re-running 'zpool import -F -f tank' with some stack trace debug, as suggested in similar threads elsewhere. Note that this appears hung at near idle.

ff03e278c520 ff03e9c60038 ff03ef109490 1 60 ff0530db4680
  PC: _resume_from_idle+0xf1    CMD: zpool import -F -f tank
  stack pointer for thread ff03e278c520: ff00182bbff0
  [ ff00182bbff0 _resume_from_idle+0xf1() ]
    swtch+0x145()
    cv_wait+0x61()
    zio_wait+0x5d()
    dbuf_read+0x1e8()
    dnode_next_offset_level+0x129()
    dnode_next_offset+0xa2()
    get_next_chunk+0xa5()
    dmu_free_long_range_impl+0x9e()
    dmu_free_object+0xe6()
    dsl_dataset_destroy+0x122()
    dsl_destroy_inconsistent+0x5f()
    findfunc+0x23()
    dmu_objset_find_spa+0x38c()
    dmu_objset_find_spa+0x153()
    dmu_objset_find+0x40()
    spa_load_impl+0xb23()
    spa_load+0x117()
    spa_load_best+0x78()
    spa_import+0xee()
    zfs_ioc_pool_import+0xc0()
    zfsdev_ioctl+0x177()
    cdev_ioctl+0x45()
    spec_ioctl+0x5a()
    fop_ioctl+0x7b()
    ioctl+0x18e()
    dtrace_systrace_syscall32+0x11a()
    _sys_sysenter_post_swapgs+0x149()

-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)
Now at 36 hours since zdb process start and:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   827 root     4936M 4931M sleep   59    0   0:50:47 0.2% zdb/209

Idling at 0.2% processor for nearly the past 24 hours... feels very stuck. Thoughts on how to determine where and why? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
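(A couple of quick checks for where it's stuck, using the pid above - pstack and truss show user-level activity:

  # pstack 827        (user-level stacks for all 209 threads)
  # truss -p 827      (watch for syscalls; Ctrl-C detaches)

If those show nothing moving, the wait is probably in the kernel, and something like
  echo "0t827::pid2proc | ::walk thread | ::findstack -v" | mdb -k
prints the kernel stacks for the process.)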
Re: [zfs-discuss] ls says: /tank/ws/fubar: Operation not applicable
Gordon Ross wrote: Anyone know why my ZFS filesystem might suddenly start giving me an error when I try to "ls -d" the top of it? i.e.:

  ls -d /tank/ws/fubar
  /tank/ws/fubar: Operation not applicable

zpool status says all is well. I've tried snv_139 and snv_137 (my latest and previous installs). It's an amd64 box. Both OS versions show the same problem. Do I need to run a scrub? (will take days...) Other ideas? It might be interesting to run it under truss, to see which syscall is returning that error. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
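Something like this, for instance ("Operation not applicable" is the ENOSYS error string on Solaris, so that's the return value to look for in the trace):

  # truss ls -d /tank/ws/fubar 2>&1 | tail -20

and look for which call on /tank/ws/fubar comes back with Err#89 ENOSYS.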
Re: [zfs-discuss] Native ZFS for Linux
> On 6/10/2010 9:04 PM, Rodrigo E. De León Plicet wrote:
> > On Tue, Jun 8, 2010 at 7:14 PM, Anurag Agarwal wrote:
> >> We at KQInfotech initially started on an independent port of ZFS to Linux. When we posted our progress about the port last year, we came to know about the work on the LLNL port. Since then we started working on re-basing our changes on top of Brian's changes.
> >>
> >> We are working on porting ZPL on that code. Our current status is that mount/unmount is working. Most of the directory operations and read/write are also working. There is still a lot more development work and testing that needs to go into this. But we are committed to making this happen, so please stay tuned.
> >
> > Good times ahead!
>
> I don't mean to be a PITA, but I'm assuming that someone lawyerly has had the appropriate discussions with the porting team about how linking against the GPL'd Linux kernel means your kernel module has to be GPL-compatible. It doesn't matter if you distribute it outside the general kernel source tarball; what matters is that you're linking against a GPL program, and the old GPL v2 doesn't allow for a non-GPL-compatibly-licensed module to do that.

This is incorrect. The viral effects of the GPL only take effect at the point of distribution. If ZFS is distributed separately from the Linux kernel as a module, then the person doing the combining is the user. It would be different if a Linux distro wanted to include it on a live CD, for example. The GPL is not concerned with what code is linked with what. Cheers Andrew. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is it possible to disable MPxIO during OpenSolaris installation?
James C. McPherson wrote: On 2/06/10 03:11 PM, Fred Liu wrote: Fix some typos. # In fact, there is no problem with the MPxIO name technically. It only matters for storage admins who have to remember the name. You are correct. I think there is no way to give short aliases to these long tedious MPxIO names. You are correct that we don't have aliases. However, I do not agree that the naming is tedious. It gives you certainty about the actual device that you are dealing with, without having to worry about whether you've cabled it right. Might want to add a call record to CR 6901193 "Need a command to list current usage of disks, partitions, and slices", which includes a request for vanity naming for disks. (Actually, vanity naming for disks should probably be brought out into a separate RFE.) -- Andrew Gabriel | Solaris Systems Architect Email: andrew.gabr...@oracle.com Mobile: +44 7720 598213 Oracle Pre-Sales Guillemont Park | Minley Road | Camberley | GU17 9QG | United Kingdom ORACLE Corporation UK Ltd is a company incorporated in England & Wales | Company Reg. No. 1782505 | Reg. office: Oracle Parkway, Thames Valley Park, Reading RG6 1RA Oracle is committed to developing practices and products that help protect the environment ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss