Re: [zfs-discuss] partitioned cache devices
On 03/19/13 20:27, Jim Klimov wrote:
> I disagree; at least, I've always thought differently: the "d" device
> is the whole disk denomination, with a unique number for a particular
> controller link ("c+t"). The disk has some partitioning table, MBR or
> GPT/EFI. In these tables, partition "p0" stands for the table itself
> (i.e. to manage partitioning),

p0 is the whole disk regardless of any partitioning. (Hence you can use p0 to access any type of partition table.)

> and the rest kind of "depends". In case of MBR tables, one partition
> may be named as having a Solaris (or Solaris2) type, and there it
> holds an SMI table of Solaris slices, and these slices can hold legacy
> filesystems or components of ZFS pools. In case of GPT, the
> GPT-partitions can be used directly by ZFS. However, they are also
> denominated as "slices" in ZFS and the format utility. The GPT
> partitioning spec requires the disk to be FDISK-partitioned with just
> one single FDISK partition of type EFI, so that tools which predate
> GPT partitioning will still see such a GPT disk as fully assigned to
> FDISK partitions, and therefore it is less likely to be accidentally
> blown away. I believe Solaris-based OSes accessing a "p"-named
> partition and an "s"-named slice of the same number on a GPT disk
> should lead to the same range of bytes on disk, but I am not really
> certain about this.

No, you'll see just p0 (whole disk), and p1 (whole disk less space for the backwards-compatible FDISK partitioning).

> Also, if a "whole disk" is given to ZFS (and for OSes other than the
> latest Solaris 11 this means non-rpool disks), then ZFS labels the
> disk as GPT and defines a partition for itself plus a small trailing
> partition (likely to level out discrepancies with replacement disks
> that might happen to be a few sectors too small). In this case ZFS
> reports that it uses "cXtYdZ" as a pool component,

For an EFI disk, the device name without a final p* or s* component is the whole EFI partition. (It's actually the s7 slice minor device node, but the s7 is dropped from the device name to avoid the confusion we had with s2 on SMI-labeled disks being the whole SMI partition.)

> since it considers itself in charge of the partitioning table and its
> inner contents, and doesn't intend to share the disk with other usages
> (dual-booting and other OSes' partitions, or SLOG and L2ARC parts,
> etc). This also "allows" ZFS to influence hardware-related choices,
> like caching and throttling, and likely auto-expansion with changed
> LUN sizes by fixing up the partition table along the way, since it
> assumes being 100% in charge of the disk.
>
> I don't think there is a "crime" in trying to use the partitions (of
> either kind) as ZFS leaf vdevs; even the zpool(1M) manpage states
> that:
>
>   ... The following virtual devices are supported:
>   disk   A block device, typically located under /dev/dsk. ZFS can
>          use individual slices or partitions, though the recommended
>          mode of operation is to use whole disks. ...

Right.

> This is orthogonal to the fact that there can only be one Solaris
> slice table, inside one partition, on MBR. AFAIK this is irrelevant
> on GPT/EFI - no SMI slices there.

There's a simpler way to think of it on x86. You always have FDISK partitioning (p1, p2, p3, p4). You can then have SMI or GPT/EFI slices (both called s0, s1, ...) in an FDISK partition of the appropriate type. With SMI labeling, s2 is by convention the whole Solaris FDISK partition (although this is not enforced). With EFI labeling, s7 is enforced as the whole EFI FDISK partition, and so the trailing s7 is dropped off the device name for clarity. This simplicity is brought about because the GPT spec requires that backwards-compatible FDISK partitioning is included, but with just 1 partition assigned.

-- 
Andrew

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
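The naming scheme described above can be inspected directly; a sketch, assuming a GPT/EFI-labeled disk named c0t0d0 (the disk name is a placeholder - substitute your own):

```shell
# On an EFI/GPT disk you should see p0 (whole disk), p1 (the single
# protective FDISK partition) and the s0..s8 slice nodes:
ls -l /dev/dsk/c0t0d0*

# fdisk's view of the protective FDISK table goes through p0:
fdisk -W - /dev/rdsk/c0t0d0p0

# The slice table inside the EFI partition (note: no trailing s*
# component needed - the whole-EFI-partition node is used):
prtvtoc /dev/rdsk/c0t0d0
```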
Re: [zfs-discuss] partitioned cache devices
Andrew Werchowiecki wrote:
>  Total disk size is 9345 cylinders
>  Cylinder size is 12544 (512 byte) blocks
>
>                                      Cylinders
>    Partition   Status   Type         Start   End    Length    %
>    =========   ======   ====         =====   ===    ======   ===
>        1                EFI              0   9345     9346   100

You only have a p1 (and for a GPT/EFI labeled disk, you can only have p1 - no other FDISK partitions are allowed).

> partition> print
> Current partition table (original):
> Total disk sectors available: 117214957 + 16384 (reserved sectors)
>
> Part        Tag   Flag    First Sector      Size      Last Sector
>   0         usr   wm                64    2.00GB          4194367
>   1         usr   wm           4194368   53.89GB        117214990
>   2  unassigned   wm                 0         0                0
>   3  unassigned   wm                 0         0                0
>   4  unassigned   wm                 0         0                0
>   5  unassigned   wm                 0         0                0
>   6  unassigned   wm                 0         0                0
>   8    reserved   wm         117214991    8.00MB        117231374

You have an s0 and s1.

> This isn't the output from when I did it, but it is exactly the same
> steps that I followed. Thanks for the info about slices, I may give
> that a go later on. I'm not keen on that because I have clear evidence
> (as in zpools set up this way, right now, working, without issue) that
> GPT partitions of the style shown above work, and I want to see why it
> doesn't work in my set up rather than simply ignoring and moving on.

You would have to blow away the partitioning you have, and create an FDISK partitioned disk (not EFI), and then create a p1 and p2 partition. (Don't use the 'partition' subcommand, which confusingly creates Solaris slices.) Give the FDISK partitions a partition type which nothing will recognise, such as 'other', so that nothing will try and interpret them as OS partitions. Then you can use them as raw devices, and they should be portable between OSs which can handle FDISK partitioned devices.

-- 
Andrew
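A sketch of that recipe; the disk name c0t1d0 and pool name tank are placeholders, and the interactive fdisk menu steps are paraphrased rather than exact:

```shell
# Write a default FDISK table to the whole-disk p0 node, then adjust it
# interactively: delete the default partition and create two partitions
# of type 'Other' sized as required.
fdisk -B /dev/rdsk/c0t1d0p0
fdisk /dev/rdsk/c0t1d0p0

# The FDISK partitions then appear as p1/p2 and can be used as raw
# devices, e.g. as log and cache vdevs:
zpool add tank log /dev/dsk/c0t1d0p1
zpool add tank cache /dev/dsk/c0t1d0p2
```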
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> From: Darren J Moffat [mailto:darr...@opensolaris.org]
>> Support for SCSI UNMAP - both issuing it and honoring it when it is
>> the backing store of an iSCSI target.
>
> When I search for scsi unmap, I come up with all sorts of
> documentation that ... is ... like reading a medical journal when all
> you want to know is the conversion from 98.6F to C. Would you mind
> momentarily describing what SCSI UNMAP is used for? If I were
> describing it to a customer (CEO, CFO) I'm not going to tell them
> about SCSI UNMAP, I'm going to say the new system has a new feature
> that enables ... or solves the ___ problem... Customer doesn't
> *necessarily* have to be as clueless as CEO/CFO. Perhaps just another
> IT person, or whatever.

SCSI UNMAP (or SATA TRIM) is a means of telling a storage device that some blocks are no longer needed. (This might be because a file has been deleted in the filesystem on the device.)

In the case of a Flash device, it can optimise usage by knowing this: e.g. it can perhaps perform a background erase on the real blocks so they're ready for reuse sooner, and/or better optimise wear leveling by having more spare space to play with, which in some devices also improves the device's lifetime. It can also help by avoiding some read-modify-write operations, if the device knows the data in the rest of the 4k block is no longer needed.

In the case of an iSCSI LUN target, these blocks no longer need to be archived, and if sparse space allocation is in use, the space they occupied can be freed off. In the particular case of ZFS provisioning the iSCSI LUN (COMSTAR), you might get performance improvements by having more free space to play with during other write operations, allowing better storage layout optimisation.
So, bottom line is longer life of SSDs (maybe higher performance too, if there's less waiting for erases during writes), and better space utilisation and performance for a ZFS COMSTAR target.

-- 
Andrew Gabriel
Re: [zfs-discuss] any more efficient way to transfer snapshot between two hosts than ssh tunnel?
In my own experiments with my own equivalent of mbuffer, it's well worth giving the receiving side a buffer which is sized to hold the amount of data in a transaction commit, which allows ZFS to be banging out one tx group to disk whilst the network is bringing the next one across for it. This will be roughly the link speed in bytes/second x 5, plus a bit more for good measure - say 250-300Mbytes for a gigabit link. It seems to be most important when the disks and the network link have similar max theoretical bandwidths (100Mbytes/sec is what you might expect from both gigabit ethernet and reasonable disks), and it becomes less important as the difference in max performance between them increases. Without the buffer, you tend to see the network run flat out for 5 seconds and then the receiving disks run flat out for 5 seconds, alternating back and forth, whereas with the buffer, both continue streaming at full gigabit speed without a break. I have not seen any benefit of buffering on the sending side, although I'd still be inclined to include a small one. YMMV...

Palmer, Trey wrote:
> We have found mbuffer to be the fastest solution. Our rates for large
> transfers on 10GbE are:
>
>   280MB/s   mbuffer
>   220MB/s   rsh
>   180MB/s   HPN-ssh unencrypted
>    60MB/s   standard ssh
>
> The tradeoffs: mbuffer is a little more complicated to script; rsh is,
> well, you know; and hpn-ssh requires rebuilding ssh and (probably)
> maintaining a second copy of it.

-- 
Andrew Gabriel
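A sketch of the mbuffer pipeline being discussed; the host names, dataset, and port are placeholders, and the ~300MB receive buffer follows the sizing suggestion above (a full txg commit's worth for a gigabit link):

```shell
# On the receiving host (start this side first):
mbuffer -s 128k -m 300M -I 9090 | zfs receive -F tank/backup

# On the sending host - only a small buffer is needed on this side:
zfs send tank/data@snap | mbuffer -s 128k -m 16M -O recvhost:9090
```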
Re: [zfs-discuss] ZFS array on marvell88sx in Solaris 11.1
The 3112 and 3114 were very early SATA controllers from before there were any SATA drivers; they pretend to be ATA controllers to the OS. No one should be using these today.

sol wrote:
> Oh I can run the disks off a SiliconImage 3114 but it's the marvell
> controller that I'm trying to get working. I'm sure it's the
> controller which is used in the Thumpers so it should surely work in
> solaris 11.1
>
> *From:* Bob Friesenhahn
>> If the SATA card you are using is a JBOD-style card (i.e. disks are
>> portable to a different controller), are you able/willing to swap it
>> for one that Solaris is known to support well?

-- 
Andrew Gabriel
Re: [zfs-discuss] Remove disk
Bob Friesenhahn wrote:
> On Sat, 1 Dec 2012, Jan Owoc wrote:
>>> When I would like to change the disk, I also would like to change
>>> the disk enclosure; I don't want to use the old one.
>>
>> You didn't give much detail about the enclosure (how it's connected,
>> how many disk bays it has, how it's used etc.), but are you able to
>> power off the system and transfer all the disks at once?
>>
>>> And what happens if I have 24 or 36 disks to change? It would take
>>> months to do that.
>>
>> Those are the current limitations of zfs.
>
> Yes, with 12x2TB of data to copy it could take about a month. You can
> create a brand new pool with the new chassis and use 'zfs send' to
> send a full snapshot of each filesystem to the new pool. After the
> bulk of the data has been transferred, take new snapshots and send the
> remainder. This expects that both pools can be available at once.

Or, if you don't care about existing snapshots, use Shadow Migration to move the data across.

-- 
Andrew Gabriel
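The two-pass migration described above can be sketched as follows; the pool names are placeholders:

```shell
# Pass 1: bulk copy while the old pool stays in service.
zfs snapshot -r oldpool@migrate1
zfs send -R oldpool@migrate1 | zfs receive -F -d newpool

# Pass 2: later, after quiescing writers, send just the (much smaller)
# delta accumulated since the first snapshot:
zfs snapshot -r oldpool@migrate2
zfs send -R -i @migrate1 oldpool@migrate2 | zfs receive -F -d newpool
```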
Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
Arne Jansen wrote:
> We have finished a beta version of the feature.

What does FITS stand for?
Re: [zfs-discuss] Making ZIL faster
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Schweiss, Chip
>> How can I determine for sure that my ZIL is my bottleneck? If it is
>> the bottleneck, is it possible to keep adding mirrored pairs of SSDs
>> to the ZIL to make it faster? Or should I be looking for a DDR drive,
>> ZeusRAM, etc.
>
> Temporarily set sync=disabled
> Or, depending on your application, leave it that way permanently. I
> know, for the work I do, most systems I support at most locations
> have sync=disabled. It all depends on the workload.

Noting of course that this means that in the case of an unexpected system outage or loss of connectivity to the disks, synchronous writes since the last txg commit will be lost, even though the applications will believe they are secured to disk. (The ZFS filesystem won't be corrupted, but it will look like it's been wound back by up to 30 seconds when you reboot.)

This is fine for some workloads, such as those where you would start again with fresh data and those which can look closely at the data to see how far they got before being rudely interrupted, but not for those which rely on the POSIX semantics of synchronous writes/syncs meaning data is secured on non-volatile storage when the function returns.

-- 
Andrew
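The temporary test suggested above might look like this as commands; the dataset name is a placeholder:

```shell
zfs set sync=disabled tank/myfs   # bypass the ZIL temporarily
# ...re-run the synchronous-write workload; if it speeds up
# substantially, ZIL/slog latency is the bottleneck...
zfs set sync=standard tank/myfs   # restore synchronous semantics
zfs get sync tank/myfs            # confirm the current setting
```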
Re: [zfs-discuss] all in one server
Richard Elling wrote:
> On Sep 18, 2012, at 7:31 AM, Eugen Leitl <eu...@leitl.org> wrote:
>> Can I actually have a year's worth of snapshots in zfs without too
>> much performance degradation?
>
> I've got 6 years of snapshots with no degradation :-)
>
> $ zfs list -t snapshot -r export/home | wc -l
>     1951

$ echo 1951 / 365 | bc -l
5.34520547945205479452
$

So you're slightly ahead of my 5.3 years of daily snapshots :-)

-- 
Andrew Gabriel
Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs
On 05/28/12 20:06, Iwan Aucamp wrote:
> I'm getting sub-optimal performance with an mmap-based database
> (mongodb) which is running on zfs on Solaris 10u9. System is a
> Sun Fire X4270-M2 with 2x X5680 and 72GB (6 * 8GB + 6 * 4GB) RAM
> (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks.
> - a few mongodb instances are running with moderate IO and total rss
>   of 50 GB
> - a service which logs quite excessively (5GB every 20 mins) is also
>   running (max 2GB ram use) - log files are compressed after some
>   time to bzip2
> Database performance is quite horrid though - it seems that zfs does
> not know how to manage allocation between page cache and arc cache -
> and it seems arc cache wins most of the time. I'm thinking of doing
> the following:
> - relocating mmaped (mongo) data to a zfs filesystem with only
>   metadata cache
> - reducing zfs arc cache to 16 GB
> Are there any other recommendations - and is the above likely to
> improve performance?

1. Upgrade to S10 Update 10 - this has various performance improvements, in particular related to database type loads (but I don't know anything about mongodb).

2. Reduce the ARC size so RSS + ARC + other memory users < RAM size. I assume the RSS includes whatever caching the database does. In theory, a database should be able to work out what's worth caching better than any filesystem can guess from underneath it, so you want to configure more memory in the DB's cache than in the ARC. (The default ARC tuning is unsuitable for a database server.)

3. If the database has some concept of blocksize or recordsize that it uses to perform i/o, make sure the filesystems it uses are configured with the same recordsize. The ZFS default recordsize (128kB) is usually much bigger than database blocksizes. This is probably going to have less impact with an mmapped database than a read(2)/write(2) database, where it may prove better to match the filesystem's record size to the system's page size (4kB, unless it's using some type of large pages).
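Suggestions (2) and (3), together with the metadata-only caching idea from the original post, could be sketched as zfs commands; the pool/dataset names and the 8k recordsize are placeholder assumptions - check your database's actual i/o size:

```shell
# Filesystem for the mmapped DB files: recordsize matched to the DB's
# i/o size, and only metadata cached in the ARC (the page cache holds
# the data via mmap). recordsize only affects newly written files.
zfs create -o recordsize=8k -o primarycache=metadata tank/mongodb

# Separate filesystem for the chatty log writer, tuned independently:
zfs create -o recordsize=128k tank/applogs
```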
I haven't tried playing with recordsize for memory-mapped i/o, so I'm speculating here. Blocksize or recordsize may apply to the log file writer too, and it may be that this needs a different recordsize and therefore has to be in a different filesystem. If it uses write(2) or some variant rather than mmap(2) and doesn't document this in detail, DTrace is your friend.

4. Keep plenty of free space in the zpool if you want good database performance. If you're more than 60% full (S10U9) or 80% full (S10U10), that could be a factor.

Anyway, there are a few things to think about.

-- 
Andrew
Re: [zfs-discuss] zfs_arc_max values
On 05/17/12 15:03, Bob Friesenhahn wrote:
> On Thu, 17 May 2012, Paul Kraus wrote:
>> Why are you trying to tune the ARC as _low_ as possible? In my
>> experience the ARC gives up memory readily for other uses. The only
>> place I _had_ to tune the ARC in production was a couple systems
>> running an app that checks for free memory _before_ trying to
>> allocate it. If the ARC has all but 1 GB in use, the app (which is
>> looking for
>
> On my system I adjusted the ARC down due to running user-space
> applications with very bursty short-term large memory usage. Reducing
> the ARC assured that there would be no contention between zfs ARC and
> the applications.

If the system is running one app which expects to do lots of application-level caching (and in theory, the app should be able to work out what's worth caching and what isn't better than any filesystem underneath it can guess), then you should be planning your memory usage accordingly. For example, on a database server you probably want to allocate much of the system's memory to the database cache (in the case of Oracle, the SGA), leaving enough for a smaller ZFS ARC and the memory required by the OS and app. It depends on the system and database size, but something like 50% SGA, 25% ZFS ARC, 25% for everything else might be an example, with the SGA disproportionately bigger on larger systems with larger databases.

On my desktop system (supposed to be 8GB RAM, but currently 6GB due to a dead DIMM), I have knocked the ARC down to 1GB. I used to find the ARC wouldn't shrink in size until the system had got to the point of crawling along showing anon page-ins, and some app (usually firefox or thunderbird) had already become too difficult to use. I must admit I did this a long time ago, and ZFS's shrinking of the ARC may be more proactive now than it was back then, but I don't notice any ZFS performance issues with the ARC restricted to 1GB on a desktop system. It may have increased scrub times, but that happens when I'm in bed, so I don't care.
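For reference, capping the ARC on Solaris 10 is done in /etc/system and takes effect at the next reboot; a sketch for a 16GB cap (the value is in bytes, and the right cap for any given system depends on the sizing exercise above):

```
* Cap the ZFS ARC at 16 GB (0x400000000 bytes)
set zfs:zfs_arc_max = 0x400000000
```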
-- 
Andrew
Re: [zfs-discuss] test for holes in a file?
I just played and knocked this up (note the stunning lack of comments, missing optarg processing, etc)... Give it a list of files to check...

#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int i;

    for (i = 1; i < argc; i++) {
        int fd;

        fd = open(argv[i], O_RDONLY);
        if (fd < 0) {
            perror(argv[i]);
        } else {
            off_t eof;
            off_t hole;

            if (((eof = lseek(fd, 0, SEEK_END)) < 0) ||
                lseek(fd, 0, SEEK_SET) < 0) {
                perror(argv[i]);
            } else if (eof == 0) {
                printf("%s: empty\n", argv[i]);
            } else {
                /*
                 * Offset of the first hole; every file has a virtual
                 * hole at EOF, so hole == eof means no real holes.
                 */
                hole = lseek(fd, 0, SEEK_HOLE);
                if (hole < 0) {
                    perror(argv[i]);
                } else if (hole < eof) {
                    printf("%s: sparse\n", argv[i]);
                } else {
                    printf("%s: not sparse\n", argv[i]);
                }
            }
            close(fd);
        }
    }
    return 0;
}

On 03/26/12 10:06 PM, ольга крыжановская wrote:
> Mike, I was hoping that some one has a complete example for a
> bool has_file_one_or_more_holes(const char *path) function.
>
> Olga
>
> 2012/3/26 Mike Gerdts:
>> 2012/3/26 ольга крыжановская:
>>> How can I test if a file on ZFS has holes, i.e. is a sparse file,
>>> using the C api?
>>
>> See SEEK_HOLE in lseek(2).
>>
>> -- 
>> Mike Gerdts
>> http://mgerdts.blogspot.com/
Re: [zfs-discuss] Any recommendations on Perc H700 controller on Dell Rx10 ?
On 03/10/12 09:29, Sriram Narayanan wrote:
> Hi folks:
> At work, I have an R510, an R610 and an R710 - all with the H700 PERC
> controller. Based on experiments, it seems like there is no way to
> bypass the PERC controller - it seems like one can only access the
> individual disks if they are each set up as RAID0. This brings me to
> ask some questions:
> a. Is it fine (in terms of an intelligent controller coming in the
>    way of ZFS) to have the PERC controllers present each drive as
>    RAID0 drives?
> b. Would there be any errors in terms of PERC doing things that ZFS
>    is not aware of, and this causing any issues later?

I had to produce a ZFS hybrid storage pool performance demo, and was initially given a system with a RAID-only controller (different from yours, but same idea). I created the demo with it, but disabled the RAID's cache as that wasn't what I wanted in the picture. Meanwhile, I ordered the non-RAID version of the card, and when it came, I swapped it in. A couple of issues...

ZFS doesn't recognise any of the disks, obviously, because they have proprietary RAID headers on them, so they have to be created again from scratch. (That was no big deal in this case, and if it had been, I could have done a zfs send and receive to somewhere else temporarily.)

The performance went up, a tiny bit for the spinning disks, and by 50% for the SSDs, so the RAID controller was seriously limiting the IOPs of the SSDs in particular. This was when SSDs were relatively new, and the controllers may not have been designed with SSDs in mind. That's likely to be somewhat different nowadays, but I don't have any data to show that either way.

-- 
Andrew Gabriel
Re: [zfs-discuss] send only difference between snapshots
skeletor wrote:
> There is a task: make backups by sending snapshots to another server.
> But I don't want to send a complete snapshot of the system each time -
> I want to send only the difference between snapshots. For example:
> there are 2 servers, and I want to take a snapshot on the master, send
> only the difference between the current and most recent snapshots to
> the backup, and then deploy it on the backup. Any ideas how this can
> be done?

It's called an incremental - it's part of the zfs send command line options.

-- 
Andrew Gabriel
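As a minimal sketch of an incremental send; the dataset, snapshot, and host names are placeholders, and @yesterday is assumed to exist on both sides already:

```shell
zfs snapshot tank/data@today
# Send only the delta between yesterday's snapshot and today's:
zfs send -i tank/data@yesterday tank/data@today | \
    ssh backuphost zfs receive backup/data
```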
Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
Gary Mills wrote:
> On Sun, Jan 15, 2012 at 04:06:33PM +, Peter Tribble wrote:
>> On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov wrote:
>>> "Does raidzN actually protect against bitrot?"
>>> That's a kind of radical, possibly offensive, question formula that
>>> I have lately.
>>
>> Yup, it does. That's why many of us use it.
>
> There's actually no such thing as bitrot on a disk. Each sector on
> the disk is accompanied by a CRC that's verified by the disk
> controller on each read. It will either return correct data or report
> an unreadable sector. There's nothing in between.

Actually, there are a number of disk firmware and cache faults in between, which zfs has picked up over the years.

-- 
Andrew Gabriel
Re: [zfs-discuss] Stress test zfs
grant lowe wrote:
> Ok. I blew it. I didn't add enough information. Here's some more
> detail: Disk array is a RAMSAN array, with RAID6 and 8K stripes. I'm
> measuring performance with the results of the bonnie++ output and
> comparing with the zpool iostat output. It's with the zpool iostat
> that I'm not seeing a lot of writes.

Since ZFS never writes data back where it was, it can coalesce multiple outstanding writes into fewer device writes. This may be what you're seeing. I have a ZFS IOPs demo where the (multi-threaded) application is performing over 10,000 synchronous write IOPs, but the underlying devices are only performing about 1/10th of that, due to ZFS coalescing multiple outstanding writes.

Sorry, I'm not familiar with what type of load bonnie generates.

-- 
Andrew Gabriel | Solaris Systems Architect
Email: andrew.gabr...@oracle.com
Mobile: +44 7720 598213
Oracle EMEA Server Pre-Sales
ORACLE Corporation UK Ltd is a company incorporated in England & Wales | Company Reg. No. 1782505 | Reg. office: Oracle Parkway, Thames Valley Park, Reading RG6 1RA
Hardware and Software, Engineered to Work Together
Re: [zfs-discuss] Can I create a mirror for a root rpool?
On 12/16/11 07:27 AM, Gregg Wonderly wrote:
> Cindy, will it ever be possible to just have attach mirror the
> surfaces, including the partition tables? I spent an hour today
> trying to get a new mirror on my root pool. There was a 250GB disk
> that failed. I only had a 1.5TB handy as a replacement. prtvtoc ... |
> fmthard does not work in this case

Can you be more specific about why it fails? I have seen a couple of cases, and I'm wondering if you're hitting the same thing. Can you post the prtvtoc output of your original disk please?

> and so you have to do the partitioning by hand, which is just silly
> to fight with anyway.
>
> Gregg

-- 
Andrew Gabriel
Re: [zfs-discuss] slow zfs send/recv speed
On 11/15/11 23:40, Tim Cook wrote:
> On Tue, Nov 15, 2011 at 5:17 PM, Andrew Gabriel
> <andrew.gabr...@oracle.com> wrote:
>> On 11/15/11 23:05, Anatoly wrote:
>>> Good day,
>>> The speed of send/recv is around 30-60 MBytes/s for initial send
>>> and 17-25 MBytes/s for incremental. I have seen lots of setups with
>>> 1 disk to 100+ disks in a pool, but the speed doesn't vary to any
>>> degree. As I understand it, 'zfs send' is the limiting factor. I
>>> did tests by sending to /dev/null. It worked out too slow and
>>> absolutely not scalable. None of cpu/memory/disk activity were at
>>> peak load, so there is room for improvement. Is there any bug
>>> report or article that addresses this problem? Any workaround or
>>> solution? I found these guys have the same result - around
>>> 7 Mbytes/s for 'send' and 70 Mbytes for 'recv'.
>>> http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html
>>
>> Well, if I do a zfs send/recv over 1Gbit ethernet from a 2 disk
>> mirror, the send runs at almost 100Mbytes/sec, so it's pretty much
>> limited by the ethernet. Since you have provided none of the
>> diagnostic data you collected, it's difficult to guess what the
>> limiting factor is for you.
>>
>> -- 
>> Andrew Gabriel
>
> So all the bugs have been fixed?

Probably not, but the OP's implication that zfs send has a specific rate limit in the range suggested is demonstrably untrue. So I don't know what's limiting the OP's send rate. (I could guess a few possibilities, but that's pointless without the data.)

> I seem to recall people on this mailing list using mbuffer to speed it
> up because it was so bursty and slow at one point. IE:
> http://blogs.everycity.co.uk/alasdair/2010/07/using-mbuffer-to-speed-up-slow-zfs-send-zfs-receive/

Yes, this idea originally came from me, having analyzed the send/receive traffic behavior in combination with network connection behavior. However, it's the receive side that's bursty around the TXG commits, not the send side, so that doesn't match the issue the OP is seeing.
(The buffer sizes in that blog are not optimal, although any buffer at the receive side will make a significant improvement if the network bandwidth is of the same order of magnitude as the send/recv are capable of.)

-- 
Andrew Gabriel
Re: [zfs-discuss] slow zfs send/recv speed
On 11/15/11 23:05, Anatoly wrote:
> Good day,
> The speed of send/recv is around 30-60 MBytes/s for initial send and
> 17-25 MBytes/s for incremental. I have seen lots of setups with 1
> disk to 100+ disks in a pool, but the speed doesn't vary to any
> degree. As I understand it, 'zfs send' is the limiting factor. I did
> tests by sending to /dev/null. It worked out too slow and absolutely
> not scalable. None of cpu/memory/disk activity were at peak load, so
> there is room for improvement. Is there any bug report or article
> that addresses this problem? Any workaround or solution? I found
> these guys have the same result - around 7 Mbytes/s for 'send' and
> 70 Mbytes for 'recv'.
> http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html

Well, if I do a zfs send/recv over 1Gbit ethernet from a 2 disk mirror, the send runs at almost 100Mbytes/sec, so it's pretty much limited by the ethernet. Since you have provided none of the diagnostic data you collected, it's difficult to guess what the limiting factor is for you.

-- 
Andrew Gabriel
Re: [zfs-discuss] zpool scrub bad block list
ZFS detects far more errors that traditional filesystems will simply miss. This means that many of the possible causes of those errors will be something other than a real bad block on the disk. As Edward said, the disk firmware should automatically remap real bad blocks, so if ZFS did that too, we'd not use the remapped block, which is probably fine. For other errors, there's nothing wrong with the real block on the disk - it's going to be firmware, driver, cache corruption, or something else, so blacklisting the block will not solve the issue. Also, with some types of disk (SSD), block numbers are moved around to achieve wear leveling, so blacklisting a block number won't stop you reusing that real block.

-- 
Andrew Gabriel (from mobile)

--- Original message ---
From: Edward Ned Harvey
To: didier.reb...@u-bourgogne.fr, zfs-discuss@opensolaris.org
Sent: 8.11.'11, 12:50

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Didier Rebeix
>>> from ZFS documentation it appears unclear to me if a "zpool scrub"
>>> will blacklist any found bad blocks so they won't be used anymore.
>>
>> If there are any physically bad blocks, such that the hardware (hard
>> disk) will return an error every time that block is used, then the
>> disk should be replaced. All disks have a certain amount of error
>> detection/correction built in, and remap bad blocks internally and
>> secretly behind the scenes, transparent to the OS. So if there are
>> any blocks regularly reporting bad to the OS, then it means there is
>> a growing problem inside the disk. Offline the disk and replace it.
>> It is ok to get an occasional cksum error. Say, once a year. Because
>> the occasional cksum error will be re-read, and as long as the data
>> is correct the second time, no problem.
Re: [zfs-discuss] Thumper (X4500), and CF SSD for L2ARC = ?
Jim Klimov wrote:
> Thanks, but I believe currently that's out of budget, but a 90MB/s CF
> module may be acceptable for the small business customer. I wondered
> if that is known to work or not...

I've had a compact flash IDE drive not work in a white-box system. In that case it was a ufs root disk, but any attempt to put a serious load on it and it corrupted data all over the place. So if you're going to try one, make sure you hammer it very hard in a test environment before you commit anything important to it.

-- 
Andrew Gabriel
Re: [zfs-discuss] solaris 10u8 hangs with message Disconnected command timeout for Target 0
Ding Honghui wrote:
> Hi,
> My solaris storage hangs. I log in to the console and there are
> messages[1] displayed on the console. I can't log in to the console
> and it seems the IO is totally blocked. The system is solaris 10u8 on
> a Dell R710 with disk array Dell MD3000. 2 HBA cables connect the
> server and the MD3000. The symptom is random. It would be very much
> appreciated if anyone can help me out.

The SCSI target you are talking to is being reset. "Unit Attention" means it's forgotten what operating parameters have been negotiated with the system and is a warning the device might have been changed without the system knowing, and it's telling you this happened because of "device internal reset". That sort of thing can happen if the firmware in the SCSI target crashes and restarts, or the power supply blips, or if the device was swapped. I don't know anything about a Dell MD3000, but given it's happened on lots of disks at the same moment following a timeout, it looks like the array power cycled or the array firmware (if any) rebooted. (Not sure if a SCSI bus reset can do this or not.)
[1] Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /pci@0,0/pci8086,3410@9/pci8086,32c@0/pci1028,1f04@8 (mpt1): Aug 16 13:14:16 nas-hz-02 Disconnected command timeout for Target 0 Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa1802a44b8f0ded (sd47): Aug 16 13:14:16 nas-hz-02 Error for Command: write(10) Error Level: Retryable Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 1380679073Error Block: 1380679073 Aug 16 13:14:16 nas-hz-02 scsi: Vendor: DELL Serial Number: Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal reset), ASCQ: 0x4, FRU: 0x0 Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa18029e4b8f0d61 (sd41): Aug 16 13:14:16 nas-hz-02 Error for Command: write(10) Error Level: Retryable Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 1380679072Error Block: 1380679072 Aug 16 13:14:16 nas-hz-02 scsi: Vendor: DELL Serial Number: Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal reset), ASCQ: 0x4, FRU: 0x0 Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa1802a24b8f0dc5 (sd45): Aug 16 13:14:16 nas-hz-02 Error for Command: write(10) Error Level: Retryable Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 1380679073Error Block: 1380679073 Aug 16 13:14:16 nas-hz-02 scsi: Vendor: DELL Serial Number: Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal reset), ASCQ: 0x4, FRU: 0x0 Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa18029c4b8f0d35 (sd39): Aug 16 13:14:16 nas-hz-02 Error for Command: write(10) Error Level: Retryable Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 1380679072Error Block: 1380679072 Aug 16 13:14:16 nas-hz-02 scsi: Vendor: DELL Serial Number: Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal reset), 
ASCQ: 0x4, FRU: 0x0 Aug 16 13:14:16 nas-hz-02 scsi: WARNING: /scsi_vhci/disk@g60026b900053aa1802984b8f0cd2 (sd35): Aug 16 13:14:16 nas-hz-02 Error for Command: write(10) Error Level: Retryable Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 1380679072Error Block: 1380679072 Aug 16 13:14:16 nas-hz-02 scsi: Vendor: DELL Serial Number: Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal reset), ASCQ: 0x4, FRU: 0x0 -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sudden drop in disk performance - WD20EURS & 4k sectors to blame?
David Wragg wrote: I've not done anything different this time from when I created the original (512b) pool. How would I check ashift? For a zpool called "export"... # zdb export | grep ashift ashift: 12 ^C # As far as I know (although I don't have any WDs), all the current 4K-sector hard drives claim to have 512b sectors, so if you didn't do anything special, you'll probably have ashift=9. I would look at zpool iostat -v to see what the IOPS rate is (you may have bottomed out on that), and I would also work out the average transfer size (although that alone doesn't necessarily tell you much - a dtrace quantize aggregation would be better). Also check service times on the disks (iostat) to see if there's one which is significantly worse and might be going bad. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
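The zdb check above can be wrapped in a one-liner. A minimal sketch, using the pool name "export" from the post; the sample text stands in for real zdb output so the parsing step can be shown without a live pool:

```shell
# Parse the ashift value out of zdb output. On a live system you would
# feed this from `zdb export` (pool name from the post); here a sample
# line stands in so the parsing can be demonstrated anywhere.
zdb_output='            ashift: 12'
ashift=$(printf '%s\n' "$zdb_output" | awk -F': ' '/ashift/ {print $2; exit}')
echo "$ashift"
```

ashift=9 means 512-byte alignment (the likely result if nothing special was done at pool creation); ashift=12 means the pool was created 4K-aligned.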
Re: [zfs-discuss] Disk IDs and DD
Lanky Doodle wrote: Oh no I am not bothered at all about the target ID numbering. I just wondered if there was a problem in the way it was enumerating the disks. Can you elaborate on the dd command LaoTsao? Is the 's' you refer to a parameter of the command or the slice of a disk - none of my 'data' disks have been 'configured' yet. I wanted to ID them before adding them to pools. Use p0 on x86 (whole disk, without regard to any partitioning). Any other s or p device node may or may not be there, depending on what partitions/slices are on the disk. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
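The dd approach being asked about can be sketched like this; the device name is a placeholder (pick yours from `format`), and reading the p0 node works even on an unconfigured disk because it bypasses any partitioning:

```shell
# Placeholder device name: substitute the c#t#d# you want to identify.
# Reading from p0 (the whole disk on x86) generates enough I/O to light
# the drive's activity LED, so you can match logical names to bays.
dd if=/dev/rdsk/c0t0d0p0 of=/dev/null bs=1M count=1000
```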
Re: [zfs-discuss] matching zpool versions to development builds
John Martin wrote: Is there a list of zpool versions for development builds? I found: http://blogs.oracle.com/stw/entry/zfs_zpool_and_file_system where it says Solaris 11 Express is zpool version 31, but my system has BEs back to build 139 and I have not done a zpool upgrade since installing this system, yet it reports on the current development build: # zpool upgrade -v This system is currently running ZFS pool version 33. It's painfully laid out (each on a separate page), but have a look at http://hub.opensolaris.org/bin/view/Community+Group+zfs/31 (change the version on the end of the URL). It conks out at version 31 though. I have systems back to build 125, so I tend to always force zpool version 19 for that (and that automatically limits the zfs version to 4). There's also some info about some builds on the zfs wikipedia page http://en.wikipedia.org/wiki/Zfs -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Large scale performance query
Alexander Lesle wrote: And what is your suggestion for scrubbing a mirror pool? Once per month, every 2 weeks, every week. There isn't just one answer. For a pool with redundancy, you need to do a scrub just before the redundancy is lost, so you can be reasonably sure the remaining data is correct and can rebuild the redundancy. The problem comes with knowing when this might happen. Of course, if you are doing some planned maintenance which will reduce the pool redundancy, then always do a scrub before that. However, in most cases, the redundancy is lost without prior warning, and you need to do periodic scrubs to cater for this case. I do a scrub via cron once a week on my home system. Having almost completely filled the pool, this was taking about 24 hours. However, now that I've replaced the disks and done a send/recv of the data across to a new larger pool which is only 1/3rd full, that's dropped down to 2 hours. For a pool with no redundancy, where you rely only on backups for recovery, the scrub needs to be integrated into the backup cycle, such that you will discover corrupt data before it has crept too far through your backup cycle to be able to find a non corrupt version of the data. When you have a new hardware setup, I would perform scrubs more frequently as a further check that the hardware doesn't have any systemic problems, until you have gained confidence in it. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
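The weekly scrub mentioned above can be automated from cron; a sketch, with "tank" as a placeholder pool name:

```shell
# root crontab entry: scrub the pool "tank" (placeholder name) every
# Sunday at 02:00. Tune the interval to your redundancy and backup cycle,
# as discussed above.
0 2 * * 0 /usr/sbin/zpool scrub tank
```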
Re: [zfs-discuss] Adding mirrors to an existing zfs-pool
Bernd W. Hennig wrote: G'Day, - zfs pool with 4 disks (from Clariion A) - must migrate to Clariion B (so I created 4 disks with the same size, available for the zfs) The zfs pool has no mirrors, my idea was to add the new 4 disks from the Clariion B to the 4 disks which are still in the pool - and later remove the original 4 disks. I only found in all the examples how to create a new pool with mirrors, but no example of how to add a mirror disk for each "disk" in a pool without mirrors. - is it possible to add disks to each disk in the pool (they have different sizes, so I have to match exactly the correct disks from Clariion B to the original disks from Clariion A) - can I later "remove" the disks from the Clariion A, pool is intact, user can work with the pool Depends on a few things... What OS are you running, and what release/update or build? What's the RAID layout of your pool "zpool status"? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs send/receive and ashift
Does anyone know if it's OK to do zfs send/receive between zpools with different ashift values? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Mount Options
Tony MacDoodle wrote: I have a zfs pool called logs (about 200G). I would like to create 2 volumes using this chunk of storage. However, they would have different mount points. ie. 50G would be mounted as /oracle/logs 100G would be mounted as /session/logs is this possible? Yes... zfs create -o mountpoint=/oracle/logs logs/oracle zfs create -o mountpoint=/session/logs logs/session If you don't otherwise specify, the two filesystems will share the pool without any constraints. If you wish to limit their max space... zfs set quota=50g logs/oracle zfs set quota=100g logs/session and/or if you wish to reserve a minimum space... zfs set reservation=50g logs/oracle zfs set reservation=100g logs/session Do I have to use the legacy mount options? You don't have to. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How create a FAT filesystem on a zvol?
Gary Mills wrote: On Sun, Jul 10, 2011 at 11:16:02PM +0700, Fajar A. Nugraha wrote: On Sun, Jul 10, 2011 at 10:10 PM, Gary Mills wrote: The `lofiadm' man page describes how to export a file as a block device and then use `mkfs -F pcfs' to create a FAT filesystem on it. Can't I do the same thing by first creating a zvol and then creating a FAT filesystem on it? seems not. [...] Some solaris tools (like fdisk, or "mkfs -F pcfs") need disk geometry to function properly. zvols don't provide that. If you want to use zvols to work with such tools, the easiest way would be using lofi, or exporting a zvol as an iscsi share and importing it again. For example, if you have a 10MB zvol and use lofi, fdisk would show this geometry Total disk size is 34 cylinders Cylinder size is 602 (512 byte) blocks ... which will then be used if you run "mkfs -F pcfs -o nofdisk,size=20480". Without lofi, the same command would fail with Drive geometry lookup (need tracks/cylinder and/or sectors/track: Operation not supported So, why can I do it with UFS? # zfs create -V 10m rpool/vol1 # newfs /dev/zvol/rdsk/rpool/vol1 newfs: construct a new file system /dev/zvol/rdsk/rpool/vol1: (y/n)? y Warning: 4130 sector(s) in last cylinder unallocated /dev/zvol/rdsk/rpool/vol1: 20446 sectors in 4 cylinders of 48 tracks, 128 sectors 10.0MB in 1 cyl groups (14 c/g, 42.00MB/g, 20160 i/g) super-block backups (for fsck -F ufs -o b=#) at: 32, Why is this different from PCFS? UFS has known for years that drive geometries are bogus, and just fakes something up to keep itself happy. What UFS thinks of as a cylinder bears no relation to actual disk cylinders. If you give mkfs_pcfs all the geometry data it needs, then it won't try asking the device... andrew@opensolaris:~# zfs create -V 10m rpool/vol1 andrew@opensolaris:~# mkfs -F pcfs -o fat=16,nofdisk,nsect=255,ntrack=63,size=2 /dev/zvol/rdsk/rpool/vol1 Construct a new FAT file system on /dev/zvol/rdsk/rpool/vol1: (y/n)? 
y andrew@opensolaris:~# fstyp /dev/zvol/rdsk/rpool/vol1 pcfs andrew@opensolaris:~# fsck -F pcfs /dev/zvol/rdsk/rpool/vol1 ** /dev/zvol/rdsk/rpool/vol1 ** Scanning file system meta-data ** Correcting any meta-data discrepancies 10143232 bytes. 0 bytes in bad sectors. 0 bytes in 0 directories. 0 bytes in 0 files. 10143232 bytes free. 512 bytes per allocation unit. 19811 total allocation units. 19811 available allocation units. andrew@opensolaris:~# mount -F pcfs /dev/zvol/dsk/rpool/vol1 /mnt andrew@opensolaris:~# -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 512b vs 4K sectors
Richard Elling wrote: On Jul 4, 2011, at 6:42 AM, Lanky Doodle wrote: Hiya, I've been doing a lot of research surrounding this and ZFS, including some posts on here, though I am still left scratching my head. I am planning on using slow RPM drives for a home media server, and it's these that seem to 'suffer' from a few problems; Seagate Barracuda LP - Looks to be the only true 512b sector hard disk. Serious firmware issues Western Digital Caviar Green - 4K sectors = crap write performance Hitachi 5K3000 - Variable sector sizing (according to tech. specs) Samsung SpinPoint F4 - Just plain old problems with them What is the best drive of the above 4, and are 4K drives really a no-no with ZFS. Are there any alternatives in the same price bracket? 4K drives are fine, especially if the workload is read-mostly. Depending on the OS, you can tell ZFS to ignore the incorrect physical sector size reported by some drives. Today, this is easiest in FreeBSD, a little bit more tricky in OpenIndiana (patches and source are available for a few different implementations). Or you can just trick them out by starting the pool with a 4K sector device that doesn't lie (eg, an iscsi target). Who would have thought choosing a hard disk could be so 'hard'! I recommend enterprise-grade disks, none of which made your short list ;-(. -- richard I'm going through this at the moment. I've bought a pair of Seagate Barracuda XT 2Tb disks (which are a bit more Enterprise than the list above), just plugged them in, and so far they're OK. Not had them long enough to report on longevity. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 700GB gone?
On 06/30/11 08:50 PM, Orvar Korvar wrote: I have a 1.5TB disk that has several partitions. One of them is 900GB. Now I can only see 300GB. Where is the rest? Is there a command I can do to reach the rest of the data? Will scrub help? Not much to go on - no one can answer this. How did you go about partitioning the disk? What does the fdisk partitioning look like (if it's x86)? What does the VTOC slice layout look like? What are you using each partition and slice for? What tells you that you can only see 300GB? -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Encryption accelerator card recommendations.
On 06/27/11 11:32 PM, Bill Sommerfeld wrote: On 06/27/11 15:24, David Magda wrote: Given the amount of transistors that are available nowadays I think it'd be simpler to just create a series of SIMD instructions right in/on general CPUs, and skip the whole co-processor angle. see: http://en.wikipedia.org/wiki/AES_instruction_set Present in many current Intel CPUs; also expected to be present in AMD's "Bulldozer" based CPUs. I recall seeing a blog comparing the existing Solaris hand-tuned AES assembler performance with the (then) new AES instruction version, where the Intel AES instructions only got you about a 30% performance increase. I've seen reports of better performance improvements, but usually by comparing with the performance on older processors which are going to be slower for additional reasons than just missing the AES instructions. Also, you could claim a better performance improvement if you compared against a less efficient original implementation of AES. What this means is that a faster CPU may buy you more crypto performance than the AES instructions alone will do. My understanding from reading the Intel AES instruction set documentation (which I warn might not be completely correct) is that the AES encryption/decryption instruction is executed between 10 and 14 times (depending on key length) for each 128 bits (16 bytes) of data being encrypted/decrypted, so it's very much part of the regular instruction pipeline. The code will have to loop through this process multiple times to process a data block bigger than 16 bytes, i.e. a double nested loop, although I expect it's normally loop-unrolled a fair degree for optimisation purposes. Conversely, the crypto units in the T-series processors are separate from the CPU, and do the encryption/decryption whilst the CPU is getting on with something else, and they do it much faster than it could be done on the CPU. 
Small blocks are normally a problem for crypto offload engines because the overhead of farming off the work to the engine and getting the result back often means that you can do the crypto on the CPU faster than the time it takes to get the crypto engine started and stopped. However, T-series crypto is particularly good at handling small blocks efficiently, such as around 1kbyte which you are likely to find in a network packet, as it is much closer coupled to the CPU than a PCI crypto card can be, and performance with small packets was key for the crypto networking support T-series was designed for. Of course, it handles crypto of large blocks just fine too. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
Richard Elling wrote: On Jun 19, 2011, at 6:04 AM, Andrew Gabriel wrote: Richard Elling wrote: Actually, all of the data I've gathered recently shows that the number of IOPS does not significantly increase for HDDs running random workloads. However the response time does :-( My data is leading me to want to restrict the queue depth to 1 or 2 for HDDs. Thinking out loud here, but if you can queue up enough random I/Os, the embedded disk controller can probably do a good job reordering them into a less random elevator sweep pattern, and increase IOPS by reducing the total seek time, which may be why IOPS does not drop as much as one might imagine if you think of the heads doing random seeks (they aren't random anymore). However, this requires that there's a reasonable queue of I/Os for the controller to optimise, and processing that queue will necessarily increase the average response time. If you run with a queue depth of 1 or 2, the controller can't do this. I agree. And disksort is in the mix, too. Oh, I'd never looked at that. This is something I played with ~30 years ago, when the OS disk driver was responsible for queuing and reordering disc transfers to reduce total seek time, and disk controllers were dumb. ...and disksort still survives... maybe we should kill it? It looks like it's possibly slightly worse than the pathologically worst response time case I described below... There are lots of options and compromises, generally weighing reduction in total seek time against longest response time. The best reduction in total seek time comes from planning out your elevator sweep, and inserting newly queued requests into the right position in the sweep ahead. 
That also gives the potentially worst response time, as you may have one transfer queued for the far end of the disk, whilst you keep getting new transfers queued for the track just in front of you, and you might end up reading or writing the whole disk before you get to do that transfer which is queued for the far end. If you can get a big enough queue, you can modify the insertion algorithm to never insert into the current sweep, so you are effectively planning two sweeps ahead. Then the worst response time becomes the time to process one queue full, rather than the time to read or write the whole disk. Lots of other tricks too (e.g. insertion into sweeps taking into account priority, such as whether the I/O is synchronous or asynchronous, and the age of existing queue entries). I had much fun playing with this at the time. The other wrinkle for ZFS is that the priority scheduler can't re-order I/Os sent to the disk. Does that also go through disksort? Disksort doesn't seem to have any concept of priorities (but I haven't looked in detail at where it plugs into the whole framework). So it might make better sense for ZFS to keep the disk queue depth small for HDDs. -- richard -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
Richard Elling wrote: Actually, all of the data I've gathered recently shows that the number of IOPS does not significantly increase for HDDs running random workloads. However the response time does :-( My data is leading me to want to restrict the queue depth to 1 or 2 for HDDs. Thinking out loud here, but if you can queue up enough random I/Os, the embedded disk controller can probably do a good job reordering them into a less random elevator sweep pattern, and increase IOPS by reducing the total seek time, which may be why IOPS does not drop as much as one might imagine if you think of the heads doing random seeks (they aren't random anymore). However, this requires that there's a reasonable queue of I/Os for the controller to optimise, and processing that queue will necessarily increase the average response time. If you run with a queue depth of 1 or 2, the controller can't do this. This is something I played with ~30 years ago, when the OS disk driver was responsible for queuing and reordering disc transfers to reduce total seek time, and disk controllers were dumb. There are lots of options and compromises, generally weighing reduction in total seek time against longest response time. The best reduction in total seek time comes from planning out your elevator sweep, and inserting newly queued requests into the right position in the sweep ahead. That also gives the potentially worst response time, as you may have one transfer queued for the far end of the disk, whilst you keep getting new transfers queued for the track just in front of you, and you might end up reading or writing the whole disk before you get to do that transfer which is queued for the far end. If you can get a big enough queue, you can modify the insertion algorithm to never insert into the current sweep, so you are effectively planning two sweeps ahead. Then the worst response time becomes the time to process one queue full, rather than the time to read or write the whole disk. 
Lots of other tricks too (e.g. insertion into sweeps taking into account priority, such as whether the I/O is synchronous or asynchronous, and the age of existing queue entries). I had much fun playing with this at the time. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
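The elevator-sweep reordering described above can be modelled in a few lines of shell; this is only a toy illustration of the idea (not the actual disksort code): given the current head position, service the queued block numbers in one upward sweep, then wrap around to the ones behind the head.

```shell
# Toy elevator-sweep model: sort the queued block numbers, then service
# everything at or above the current head position first (the upward
# sweep), followed by the blocks behind the head (the wrap-around).
head=500
queue='900 100 600 300 750'
printf '%s\n' $queue | sort -n | awk -v h="$head" '
  $1 >= h { above = above " " $1; next }
          { below = below " " $1 }
  END     { print substr(above below, 2) }'
```

With the head at 500, the service order comes out as 600 750 900 100 300 -- no longer a random sequence of seeks, which is one way to see why IOPS holds up better than naive random-seek arithmetic would suggest.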
Re: [zfs-discuss] compare snapshot to current zfs fs
Harry Putnam wrote: I have a sneaking feeling I'm missing something really obvious. If you have a zfs fs that sees little use and you have lost track of whether changes may have occurred since the last snapshot, is there some handy way to determine if a snapshot matches its filesystem? Or put another way, some way to determine if the snapshot is different from its current filesystem? I know about the diff tools, and of course I guess one could compare overall sizes in bytes for a good idea, but is there a way provided by zfs? If you have a recent enough OS release... zfs diff <snapshot> [<snapshot>|<filesystem>] -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Extremely slow zpool scrub performance
On 05/14/11 01:08 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Donald Stahl Running a zpool scrub on our production pool is showing a scrub rate of about 400K/s. (When this pool was first set up we saw rates in the MB/s range during a scrub). Wait longer, and keep watching it. Or just wait till it's done and look at the total time required. It is normal to have periods of high and low during scrub. I don't know why. Check the IOPS per drive - you may be maxing out on one of them if it's in an area where there are lots of small blocks. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
Toby Thain wrote: On 08/05/11 10:31 AM, Edward Ned Harvey wrote: ... Incidentally, does fsync() and sync return instantly or wait? Cuz "time sync" might produce 0 sec every time even if there were something waiting to be flushed to disk. The semantics need to be synchronous. Anything else would be a horrible bug. sync(2) is not required to be synchronous. I believe that for ZFS it is synchronous, but for most other filesystems, it isn't (although a second sync will block until the actions resulting from a previous sync have completed). fsync(3C) is synchronous. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Faster copy from UFS to ZFS
Dan Shelton wrote: Is anyone aware of any freeware program that can speed up copying tons of data (2 TB) from UFS to ZFS on same server? I use 'ufsdump | ufsrestore'*. I would also suggest try setting 'sync=disabled' during the operation, and reverting it afterwards. Certainly, fastfs (a similar although more dangerous option for ufs) makes ufs to ufs copying significantly faster. *ufsrestore works fine on ZFS filesystems (although I haven't tried it with any POSIX ACLs on the original ufs filesystem, which would probably simply get lost). -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
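Spelled out, the pipeline and the sync toggle suggested above look something like this; dataset and slice names are placeholders, and this is a sketch rather than a tested procedure:

```shell
# Placeholder names throughout: tank/data is the destination zfs dataset,
# c0t0d0s0 the source ufs slice. sync=disabled speeds up the restore;
# revert it as soon as the copy finishes.
zfs set sync=disabled tank/data
cd /tank/data || exit 1
ufsdump 0f - /dev/rdsk/c0t0d0s0 | ufsrestore rf -
zfs inherit sync tank/data
```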
Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express
Matthew Anderson wrote: Hi All, I've run into a massive performance problem after upgrading to Solaris 11 Express from oSol 134. Previously the server was performing a batch write every 10-15 seconds and the client servers (connected via NFS and iSCSI) had very low wait times. Now I'm seeing constant writes to the array with a very low throughput and high wait times on the client servers. Zil is currently disabled. How/Why? There is currently one failed disk that is being replaced shortly. Is there any ZFS tunable to revert Solaris 11 back to the behaviour of oSol 134? What does "zfs get sync" report? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SIL3114 and sparc solaris 10
Krunal Desai wrote: On Wed, Feb 23, 2011 at 8:38 AM, Mauricio Tavares wrote: I see what you mean; in http://mail.opensolaris.org/pipermail/opensolaris-discuss/2008-September/043024.html they claim it is supported by the uata driver. What would you suggest instead? Also, since I have the card already, how about if I try it out? My experience with SPARC is limited, but perhaps the Option ROM/BIOS for that card is intended for x86, and not SPARC? I might be thinking of another controller, but this could be the case. You could always try to boot with the card; the worst that'll probably happen is the boot hangs before the OS even comes into play. SPARC won't try to run the BIOS on the card anyway (it will only run OpenFirmware BIOS), but you will have to make sure the card has the non-RAID BIOS so that the PCI class doesn't claim it to be a RAID controller, which would prevent Solaris going anywhere near the card at all. These cards could be bought with either RAID or non-RAID BIOS, but RAID was more common. You can (or could some time back) download the RAID and non-RAID BIOS from Silicon Image and re-flash, which also updates the PCI class, and I think you'll need a Windows system to actually flash the BIOS. You might want to do a google search on "3114 data corruption" too, although it never hit me back when I used the cards. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SIL3114 and sparc solaris 10
Mauricio Tavares wrote: Perhaps a bit off-topic (I asked on the rescue list -- http://web.archiveorange.com/archive/v/OaDWVGdLhxWVWIEabz4F -- and was told to try here), but I am kinda shooting in the dark: I have been finding online scattered and vague info stating that this card can be made to work with a sparc solaris 10 box (http://old.nabble.com/eSATA-or-firewire-in-Solaris-Sparc-system-td27150246.html is the only link I can offer right now). Can anyone confirm or deny that? 3112/3114 was a very early (possibly the first?) SATA chipset, I think aimed for use before SATA drivers had been developed. I would suggest looking for something more modern. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)
Roy Sigurd Karlsbakk wrote: Nope. Most HDDs today have a single read channel, and they select which head uses that channel at any point in time. They cannot use multiple heads at the same time, because the heads do not travel the same path on their respective surfaces at the same time. There's no real vertical alignment of the tracks between surfaces, and every surface has its own embedded position information that is used when that surface's head is active. There were attempts at multi-actuator designs with separate servo arms and multiple channels, but mechanically they're too difficult to manufacture at high yields, as I understood it. Perhaps a stupid question, but why don't they read from all platters in parallel? The answer is in the text you quoted above. There are drives now with two-level actuators. The primary actuator is the standard actuator you are familiar with which moves all the arms. The secondary actuator is a piezo crystal towards the head end of the arm which can move the head a few tracks very quickly without having to move the arm, and these are one per head. In theory, this might allow multiple heads to lock on to their respective tracks at the same time for parallel reads, but I haven't heard that they are used in this way. If you go back to the late 1970's before tracks had embedded servo data, on multi-platter disks you had one surface which contained the head positioning servo data, and the drive relied on accurate vertical alignment between heads/surfaces to keep on track (and drives could head-switch instantly). Around 1980, tracks got too close together for this to work anymore, and the servo positioning data was embedded into each track itself. The very first drives of this type scanned all the surfaces on startup to build up an internal table of the relative misalignment of tracks across the surfaces, but this rapidly became unviable as drive capacity increased and this scan would take an unreasonable length of time. 
It may be that modern drives learn this as they go - I don't know. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and TRIM
Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey My google-fu is coming up short on this one... I didn't see that it had been discussed in a while ... BTW, there were a bunch of places where people said "ZFS doesn't need trim." Which I hope, by now, has been commonly acknowledged as bunk. The only situation where you don't need TRIM on a SSD is when (a) you're going to fill it once and never write to it again, which is highly unlikely considering the fact that you're buying a device for its fast write performance... (b) you don't care about performance, which is highly unlikely considering the fact that you bought a performance device ... (c) you are using whole disk encryption. This is a valid point. You would probably never TRIM anything from a fully encrypted disk ... In places where people said TRIM was thought to be unnecessary, the justification they stated was that TRIM will only benefit people whose usage patterns are sporadic, rather than sustained. The downfall of that argument is the assumption that the device can't perform TRIM operations simultaneously while performing other operations. That may be true in some cases, or even globally, but without backing, it's just an assumption. One which I find highly questionable. TRIM could also be useful where ZFS uses a storage LUN which is sparsely provisioned, in order to deallocate blocks in the LUN which have previously been allocated, but whose contents have since been invalidated. In this case, both ZFS and whatever is providing the storage LUN would need to support TRIM. Out of interest, what other filesystems out there today can generate TRIM commands? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] show hdd life time ?
Richard Elling wrote: On Jan 21, 2011, at 7:36 PM, Tobias Lauridsen wrote: is it possible to see my hdd's total time it has been in use, so I can switch to a new one before it gets too many hours old? In theory, yes. In practice, I've never seen a disk properly report this data on a consistent basis :-( Perhaps some of the more modern disks do a better job? Look for the power on hours (POH) attribute of SMART. http://en.wikipedia.org/wiki/S.M.A.R.T. If you're looking for stats to give an indication of likely wear, and thus increasing probability of failure, POH is probably not very useful by itself (or even at all). Things like Head Flying Hours and Load Cycle Count are probably more indicative, although not necessarily maintained by all drives. Of course, data which gives indication of actual (rather than likely) wear is even more important as an indicator of impending failure, such as the various error and retry counts. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
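A hedged sketch of pulling those attributes out of smartctl-style output. The sample text below is invented for illustration; real `smartctl -A /dev/...` output varies by drive and firmware, and not all drives maintain all of these attributes.

```python
# Parse wear-related attributes from smartctl-style "-A" output.
# SAMPLE is fabricated for this example; the raw value is the last field.

SAMPLE = """\
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21377
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       612406
240 Head_Flying_Hours       0x0019   100   100   000    Old_age   Offline      -       18211
"""

def smart_attrs(text):
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric attribute ID.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs

a = smart_attrs(SAMPLE)
print(a["Power_On_Hours"], a["Load_Cycle_Count"])
```

The made-up Load_Cycle_Count here (612406, with a normalised value of 001) is the kind of reading that flags real wear long before POH alone would worry anyone.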
Re: [zfs-discuss] incorrect vdev added to pool
On 01/15/11 11:32 PM, Gal Buki wrote: Hi I have a pool with a raidz2 vdev. Today I accidentally added a single drive to the pool. I now have a pool that partially has no redundancy as this vdev is a single drive. Is there a way to remove the vdev Not at the moment, as far as I know. and replace it with a new raidz2 vdev? If not what can I do to do damage control and add some redundancy to the single drive vdev? I think you should be able to attach another disk to it to make them into a mirror. (Make sure you attach, and not add.) -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] how to quiesce and unquiesc zfs and zpool for array/hardware snapshots ?
Sridhar, You have switched to a new disruptive filesystem technology, and it has to be disruptive in order to break out of all the issues older filesystems have, and give you all the new and wonderful features. However, you are still trying to use old filesystem techniques with it, which is why things don't fit for you, and you are missing out on the more powerful way ZFS presents these features to you. On 11/16/10 06:59 AM, Ian Collins wrote: On 11/16/10 07:19 PM, sridhar surampudi wrote: Hi, How it would help for instant recovery or point in time recovery ?? i.e restore data at device/LUN level ? Why would you want to? If you are sending snapshots to another pool, you can do instant recovery at the pool level. Point in time recovery is a feature of ZFS snapshots. What's more, with ZFS you can see all your snapshots online all the time, read and/or recover just individual files or whole datasets, and the storage overhead is very efficient. If you want to recover a whole LUN, that's presumably because you lost the original, and in this case the system won't have the original filesystem mounted. Currently it is easy as I can unwind the primary device stack and restore data at device/ LUN level and recreate stack. It's probably easier with ZFS to restore data at the pool or filesystem level from snapshots. Trying to work at the device level is just adding an extra level of complexity to a problem already solved. I won't claim ZFS couldn't better support use of back-end Enterprise storage, but in this case, you haven't given any use cases where that's relevant. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Changing GUID
sridhar surampudi wrote: Hi I am looking in similar lines, my requirement is 1. create a zpool on one or many devices ( LUNs ) from an array ( array can be IBM or HPEVA or EMC etc.. not SS7000). 2. Create file systems on zpool 3. Once file systems are in use (I/0 is happening) I need to take snapshot at array level a. Freeze the zfs flle system ( not required due to zfs consistency : source : mailing groups) b. take array snapshot ( say .. IBM flash copy ) c. Got new snapshot device (having same data and metadata including same GUID of source pool) Now I need a way to change the GUID and pool of snapshot device so that the snapshot device can be accessible on same host or an alternate host (if the LUN is shared). Could you please post commands for the same. There is no way I know of currently. (There was an unofficial program floating around to do this on much earlier opensolaris versions, but it no longer works). If you have a support contract, raise a call and asked to be added to RFE 6744320. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] how to quiesce and unquiesc zfs and zpool for array/hardware snapshots ?
sridhar surampudi wrote: Hi Darren, In short, I am looking for a way to freeze and thaw a zfs file system so that for a hardware snapshot, I can do 1. run zfs freeze 2. run hardware snapshot on devices belonging to the zpool where the given file system is residing. 3. run zfs thaw Unlike other filesystems, ZFS is always consistent on disk, so there's no need to freeze a zpool to take a hardware snapshot. The hardware snapshot will effectively contain all transactions up to the last transaction group commit, plus all synchronous transactions up to the hardware snapshot. If you want to be sure that all transactions up to a certain point in time are included (for the sake of an application's data), take a ZFS snapshot (which will force a TXG commit), and then take the hardware snapshot. You will not be able to access the hardware snapshot from the system which has the original zpool mounted, because the two zpools will have the same pool GUID (there's an RFE outstanding on fixing this). The one thing you do need to be careful of is, that with a multi-disk zpool, the hardware snapshot is taken at an identical point in time across all the disks in the zpool. This functionality is usually an extra-charge option in Enterprise storage systems. If the hardware snapshots are staggered across multiple disks, all bets are off, although if you take a zfs snapshot immediately beforehand and you test import/scrub the hardware snapshot (on a different system) immediately (so you can repeat the hardware snapshot again if it fails), maybe you will be lucky. The right way to do this with zfs is to send/recv the datasets to a fresh zpool, or (S10 Update 9) to create an extra zpool mirror and then split it off with zpool split. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How do you use >1 partition on x86?
Change the new partition type to something that none of the OS's on the system will know anything about, so they don't make any invalid assumptions about what might be in it. Then use the appropriate partition device node, /dev/dsk/c7t0d0p4 (assuming it's the 4th primary FDISK partition). Multiple zpools on one disk is not going to be good for performance if you use both together. There may be some way to grow the existing Solaris partition into the spare space without destroying the contents and then growing the zpool into the new space, but I haven't tried this with FDISK partitions, so I don't know if it works without damaging the existing contents. (I have done it with slices, and it does work in that case.) Bill Werner wrote: So when I built my new workstation last year, I partitioned the one and only disk in half, 50% for Windows, 50% for 2009.06. Now, I'm not using Windows, so I'd like to use the other half for another ZFS pool, but I can't figure out how to access it. I have used fdisk to create a second Solaris2 partition, did a re-con reboot, but format still only shows the 1 available partition. How do I use the second partition?

selecting c7t0d0
Total disk size is 30401 cylinders
Cylinder size is 16065 (512 byte) blocks

                                      Cylinders
Partition  Status   Type        Start    End   Length    %
    1               Other OS        0      4        5    0
    2               IFS: NTFS       5   1917     1913    6
    3      Active   Solaris2     1917  14971    13055   43
    4               Solaris2    14971  30170    15200   50

format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c7t0d0 /p...@0,0/pci1028,2...@1f,2/d...@0,0

Thanks for any idea. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] TLER and ZFS
casper@sun.com wrote: On Tue, Oct 5, 2010 at 11:49 PM, wrote: I'm not sure that that is correct; the drive works on naive clients but I believe it can reveal its true colors. The drive reports 512 byte sectors to all hosts. AFAIK there's no way to make it report 4k sectors. Too bad because it makes it less useful (specifically because the label mentions sectors and if you can use bigger sectors, you can address a larger drive). Having now read a number of forums about these, there's a strong feeling WD screwed up by not providing a switch to disable pseudo 512b access so you can use the 4k native. The industry as a whole will transition to 4k sectorsize over next few years, but these first 4k sectorsize HDs are rather less useful with 4k sectorsize-aware OS's. Let's hope other manufacturers get this right in their first 4k products. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] TLER and ZFS
Michael DeMan wrote: The WD 1TB 'enterprise' drives are still 512 sector size and safe to use, who knows though, maybe they just started shipping with 4K sector size as I write this e-mail? Another annoying thing with the whole 4K sector size, is what happens when you need to replace drives next year, or the year after? That part has me worried on this whole 4K sector migration thing more than what to buy today. Given the choice, I would prefer to buy 4K sector size now, but operating system support is still limited. Does anybody know if there any vendors that are shipping 4K sector drives that have a jumper option to make them 512 size? WD has a jumper, but is there explicitly to work with WindowsXP, and is not a real way to dumb down the drive to 512. I would presume that any vendor that is shipping 4K sector size drives now, with a jumper to make it 'real' 512, would be supporting that over the long run? Changing the sector size (if it's possible at all) would require a reformat of the drive. On SCSI disks which support it, you do it by changing the sector size on the relevant mode select page, and then sending a format-unit command to make the drive relayout all the sectors. I've no idea if these 4K sata drives have any such mechanism, but I would expect they would. BTW, I've been using a pair of 1TB Hitachi Ultrastar for something like 18 months without any problems at all. Of course, a 1 year old disk model is no longer available now. I'm going to have to swap out for bigger disks in the not too distant future. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
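The cost of the 512-byte emulation discussed above can be put in numbers. A hedged sketch (simple arithmetic, not drive firmware): a drive with 4 KiB physical sectors pays a read-modify-write penalty whenever a logical write only partially covers a physical sector.

```python
# Illustrative arithmetic: physical-sector operations an emulated-512B
# drive performs for one logical write, aligned vs misaligned.

PHYS = 4096  # physical sector size in bytes

def phys_ops(offset, length, phys=PHYS):
    """Count physical sector writes plus the extra reads needed for
    read-modify-write of partially covered physical sectors."""
    first = offset // phys
    last = (offset + length - 1) // phys
    sectors = last - first + 1          # physical sector writes
    rmw = 0
    if offset % phys:                   # leading partial sector
        rmw += 1
    if (offset + length) % phys:        # trailing partial sector
        rmw += 1
    if sectors == 1 and rmw == 2:       # both partials hit the same sector
        rmw = 1
    return sectors + rmw                # writes + extra reads

print(phys_ops(0, 8192))    # aligned 8 KiB write: 2 physical ops
print(phys_ops(512, 8192))  # same write shifted 512B: 3 writes + 2 reads = 5
```

This is why partition/slice alignment matters so much on these first-generation 4K drives, and why the XP-compatibility jumper is not a real fix.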
Re: [zfs-discuss] fs root inode number?
Richard L. Hamilton wrote: Typically on most filesystems, the inode number of the root directory of the filesystem is 2, 0 being unused and 1 historically once invisible and used for bad blocks (no longer done, but kept reserved so as not to invalidate assumptions implicit in ufsdump tapes). However, my observation seems to be (at least back at snv_97), the inode number of ZFS filesystem root directories (including at the top level of a zpool) is 3, not 2. If there's any POSIX/SUS requirement for the traditional number 2, I haven't found it. So maybe there's no reason founded in official standards for keeping it the same. But there are bound to be programs that make what was with other filesystems a safe assumption. Perhaps a warning is in order, if there isn't already one. Is there some _reason_ why the inode number of filesystem root directories in ZFS is 3 rather than 2? If you look at zfs_create_fs(), you will see the first 3 objects created are: the zap object used for SA attribute registration, the delete queue, and the root znode. Hence, inode 3. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
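Checking this for yourself is a one-liner. A small sketch (the specific inode values are filesystem-dependent, as the thread explains):

```python
# Report the inode number of a filesystem root directory.
# Conventionally 2 on UFS/ext-style filesystems; 3 on a ZFS dataset
# root, because the SA-registration zap object and the delete queue
# are allocated before the root znode.
import os

def root_inode(mountpoint):
    return os.stat(mountpoint).st_ino

print(root_inode("/"))  # e.g. 2 on ext4/UFS, 3 on a ZFS root
```

Any program that hard-codes `st_ino == 2` as "is this the filesystem root?" will misbehave on ZFS; comparing `st_dev` across a directory and its parent is the portable test.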
Re: [zfs-discuss] Adding a higher level partition to ZFS pool
Axelle Apvrille wrote: Hi all, I would like to add a new partition to my ZFS pool but it looks like it's more tricky than expected. The layout of my disk is the following: - first partition for Windows. I want to keep it. (no formatting !) - second partition for OpenSolaris. This is where I have all the Solaris slices (c0d0s0 etc). I have a single ZFS pool. OpenSolaris boots on ZFS. - third partition: a FAT partition I want to keep (no formatting !) - fourth partition: I want to add this partition to my ZFS pool (or another pool ?). I don't care if information on that partition is lost, I can format it if necessary. zpool add c0d0p0:2 ? Hmm... You cannot add it to the root pool, as the root pool cannot be a RAID0. You can make another pool from it... Ideally, set the FDISK partition type to something that none of the OS's on the system will know anything about. (It doesn't matter what it is from the zfs point of view, but you don't want any of the OS's thinking it's something they believe they know how to use.) zpool create tank c0d0p4 (for the 4th FDISK primary partition). Note that two zpools on the same disk may give you poor performance if you are accessing both at the same time, as you are forcing head seeking between them. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Failed zfs send "invalid backup stream".............
Humberto Ramirez wrote: I'm trying to replicate a 300 GB pool with this command zfs send al...@3 | zfs receive -F omega about two hours into the process it fails with this error "cannot receive new filesystem stream: invalid backup stream" I have tried setting the target read only (zfs set readonly=on omega) and also disabling Timeslider, thinking it might have something to do with it. What could be causing the error ? Could be zfs filesystem version too old on the sending side (I have one such case). What are their versions, and what release/build of the OS are you using? if the target is a new hard drive can I use this zfs send al...@3 > /dev/c10t0d0 ? That command doesn't make much sense for the purpose of doing anything useful. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SCSI write retry errors on ZIL SSD drives...
I'm not intimately familiar with the firmware versions, but if you're having problems, making sure you have latest firmware is probably a good thing to do. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris startup script location
What you say is true only on the system itself. On an NFS client system, 30 seconds of lost data in the middle of a file (as per my earlier example) is a corrupt file. -original message- Subject: Re: [zfs-discuss] Solaris startup script location From: Edward Ned Harvey Date: 18/08/2010 17:17 > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > boun...@opensolaris.org] On Behalf Of Alxen4 > > Disabling ZIL converts all synchronous calls to asynchronous which > makes ZSF to report data acknowledgment before it actually was written > to stable storage which in turn improves performance but might cause > data corruption in case of server crash. > > Is it correct ? It is partially correct. With the ZIL disabled, you could lose up to 30 sec of writes, but it won't cause an inconsistent filesystem, or "corrupt" data. If you make a distinction between "corrupt" and "lost" data, then this is valuable for you to know: Disabling the ZIL can result in up to 30sec of lost data, but not corrupt data. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris startup script location
Alxen4 wrote: Thanks...Now I think I understand... Let me summarize it and let me know if I'm wrong. Disabling ZIL converts all synchronous calls to asynchronous which makes ZFS report data acknowledgment before it actually was written to stable storage which in turn improves performance but might cause data corruption in case of server crash. Is it correct ? In my case I'm having serious performance issues with NFS over ZFS. You need a non-volatile slog, such as an SSD. My NFS Client is ESXi so the major question is there risk of corruption for VMware images if I disable ZIL ? Yes. If your NFS server takes an unexpected outage and comes back up again, some writes will have been lost which ESXi thinks succeeded (typically 5 to 30 seconds worth of writes/updates immediately before the outage). So as an example, if you had an application writing a file sequentially, you will likely find an area of the file is corrupt because the data was lost. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
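The loss window described above can be sketched as a toy model (this is not ZFS code, just an illustration of the semantics): with no stable-storage log, acknowledged writes only survive a crash if they landed before the last transaction-group commit.

```python
# Toy model of the loss window with the ZIL disabled: synchronous
# writes are acknowledged immediately, but only reach stable storage
# at each transaction-group commit (here, every commit_interval secs).

def surviving_writes(writes, commit_interval, crash_time):
    """writes: list of (timestamp, data) already acknowledged to the
    client. Only writes before the last commit boundary survive."""
    last_commit = (crash_time // commit_interval) * commit_interval
    return [d for t, d in writes if t < last_commit]

# A client writes one block every 5 seconds; the server crashes at t=95.
writes = [(t, f"block{t}") for t in range(0, 100, 5)]
ok = surviving_writes(writes, commit_interval=30, crash_time=95)
lost = len(writes) - len(ok)
print(lost)  # the acknowledged writes at t=90 and t=95 are gone
```

This is exactly why a sequentially written VM image on the NFS client can end up with a silently corrupt region: the client believes blocks 90 and 95 were committed.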
Re: [zfs-discuss] Solaris startup script location
Andrew Gabriel wrote: Alxen4 wrote: Is there any way run start-up script before non-root pool is mounted ? For example I'm trying to use ramdisk as ZIL device (ramdiskadm ) So I need to create ramdisk before actual pool is mounted otherwise it complains that log device is missing :) For sure I can manually remove/and add it by script and put the script in regular rc2.d location...I'm just looking for more elegant way to it. Can you start by explaining what you're trying to do, because this may be completely misguided? A ramdisk is volatile, so you'll lose it when system goes down, causing failure to mount on reboot. Recreating a ramdisk on reboot won't recreate the slog device you lost when the system went down. I expect the zpool would fail to mount. Furthermore, using a ramdisk as a ZIL is effectively just a very inefficient way to disable the ZIL. A better way to do this is to "zfs set sync=disabled ..." on relevant filesystems. I can't recall which build introduced this, but prior to that, you can set zfs://zil_disable=1 in /etc/system but that applies to all pools/filesystems. The double-slash was brought to you by a bug in thunderbird. The original read: set zfs:zil_disable=1 -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris startup script location
Alxen4 wrote: Is there any way run start-up script before non-root pool is mounted ? For example I'm trying to use ramdisk as ZIL device (ramdiskadm ) So I need to create ramdisk before actual pool is mounted otherwise it complains that log device is missing :) For sure I can manually remove/and add it by script and put the script in regular rc2.d location...I'm just looking for more elegant way to it. Can you start by explaining what you're trying to do, because this may be completely misguided? A ramdisk is volatile, so you'll lose it when system goes down, causing failure to mount on reboot. Recreating a ramdisk on reboot won't recreate the slog device you lost when the system went down. I expect the zpool would fail to mount. Furthermore, using a ramdisk as a ZIL is effectively just a very inefficient way to disable the ZIL. A better way to do this is to "zfs set sync=disabled ..." on relevant filesystems. I can't recall which build introduced this, but prior to that, you can set zfs://zil_disable=1 in /etc/system but that applies to all pools/filesystems. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS automatic rollback and data rescue.
Constantine wrote: ZFS doesn't do this. I thought so too. ;) Situation brief: I've got OpenSolaris 2009.06 installed on the RAID-5 array on the controller with 512 Mb cache (as i can remember) without a cache-saving battery. I hope the controller disabled the cache then. Probably a good idea to run "zpool scrub rpool" to find out if it's broken. It will probably take some time. zpool status will show the progress. On Friday a lightning bolt hit the power supply station of the colocating company, and it turned out that their UPSs are not much more than decoration. After reboot filesystem and logs are on their last snapshot version. Would also be useful to see output of: zfs list -t all -r zpool/filesystem wi...@zeus:~/.zfs/snapshot# zfs list -t all -r rpool

NAME                                   USED  AVAIL  REFER  MOUNTPOINT
rpool                                  427G  1.37T  82.5K  /rpool
rpool/ROOT                             366G  1.37T    19K  legacy
rpool/ROOT/opensolaris                20.6M  1.37T  3.21G  /
rpool/ROOT/xvm                        8.10M  1.37T  8.24G  /
rpool/ROOT/xvm-1                       690K  1.37T  8.24G  /
rpool/ROOT/xvm-2                      35.1G  1.37T   232G  /
rpool/ROOT/xvm-3                       851K  1.37T   221G  /
rpool/ROOT/xvm-4                       331G  1.37T   221G  /
rpool/ROOT/xv...@install               144M      -  2.82G  -
rpool/ROOT/xv...@xvm                  38.3M      -  3.21G  -
rpool/ROOT/xv...@2009-07-27-01:09:14    56K      -  8.24G  -
rpool/ROOT/xv...@2009-07-27-01:09:57    56K      -  8.24G  -
rpool/ROOT/xv...@2009-09-13-23:34:54  2.30M      -   206G  -
rpool/ROOT/xv...@2009-09-13-23:35:17  1.14M      -   206G  -
rpool/ROOT/xv...@2009-09-13-23:42:12  5.72M      -   206G  -
rpool/ROOT/xv...@2009-09-13-23:42:45  5.69M      -   206G  -
rpool/ROOT/xv...@2009-09-13-23:46:25   573K      -   206G  -
rpool/ROOT/xv...@2009-09-13-23:46:34   525K      -   206G  -
rpool/ROOT/xv...@2009-09-13-23:48:11  6.51M      -   206G  -
rpool/ROOT/xv...@2010-04-22-03:50:25  24.6M      -   221G  -
rpool/ROOT/xv...@2010-04-22-03:51:28  24.6M      -   221G  -

Actually, there's 24.6Mbytes worth of changes to the filesystem since the last snapshot, which is coincidentally about the same as there was over the preceding minute between the last two snapshots. I can't tell if (or how much of) that happened before, versus after, the reboot though.

rpool/dump                            16.0G  1.37T  16.0G  -
rpool/export                          28.6G  1.37T    21K  /export
rpool/export/home                     28.6G  1.37T    21K  /export/home
rpool/export/home/wiron               28.6G  1.37T  28.6G  /export/home/wiron
rpool/swap                            16.0G  1.38T   101M  -

Normally in a power-out scenario, you will only lose asynchronous writes since the last transaction group commit, which will be up to 30 seconds worth (although normally much less), and you lose no synchronous writes. However, I've no idea what your potentially flaky RAID array will have done. If it was using its cache and thinking it was non-volatile, then it could easily have corrupted the zfs filesystem due to having got writes out of sequence with transaction commits, and this can render the filesystem no longer mountable because the back-end storage has lied to zfs about committing writes. Even though you were lucky and it still mounts, it might still be corrupted, hence the suggestion to run zpool scrub (and even more important, get the RAID array fixed). Since I presume ZFS doesn't have redundant storage for this zpool, any corrupted data can't be repaired by ZFS, although it will tell you about it. Running ZFS without redundancy on flaky storage is not a good place to be. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS automatic rollback and data rescue.
Constantine wrote: Hi. I've got the ZFS filesystem (opensolaris 2009.06), witch, as i can see, was automatically rollbacked by OS to the lastest snapshot after the power failure. ZFS doesn't do this. Can you give some more details of what you're seeing? Would also be useful to see output of: zfs list -t all -r zpool/filesystem There is a trouble - snapshot is too old, and ,consequently, there is a questions -- Can I browse pre-rollbacked corrupted branch of FS ? And, if I can, how ? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Global Spare for 2 pools
Tony MacDoodle wrote: I have 2 ZFS pools all using the same drive type and size. The question is can I have 1 global hot spare for both of those pools? Yes. A hot spare disk can be added to more than one pool at the same time. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAID Z stripes
Phil Harman wrote: On 10 Aug 2010, at 08:49, Ian Collins wrote: On 08/10/10 06:21 PM, Terry Hull wrote: I am wanting to build a server with 16 x 1TB drives with 2 x 8-drive RAID Z2 arrays striped together. However, I would like the capability of adding additional stripes of 2TB drives in the future. Will this be a problem? I thought I read it is best to keep the stripes the same width and was planning to do that, but I was wondering about using drives of different sizes. These drives would all be in a single pool. It would work, but you run the risk of the smaller drives becoming full and all new writes going to the bigger vdev. So while usable, performance would suffer. Almost by definition, the 1TB drives are likely to be getting full when the new drives are added (presumably because of running out of space). Performance can only be said to suffer relative to a new pool built entirely with drives of the same size. Even if he added 8x 2TB drives in a RAIDZ3 config it is hard to predict what the performance gap will be (on the one hand: RAIDZ3 vs RAIDZ2, on the other: an empty group vs an almost full, presumably fragmented, group). One option would be to add 2TB drives as 5 drive raidz3 vdevs. That way your vdevs would be approximately the same size and you would have the optimum redundancy for the 2TB drives. I think you meant 6, but I don't see a good reason for matching the group sizes. I'm for RAIDZ3, but I don't see much logic in mixing groups of 6+2 x 1TB and 3+3 x 2TB in the same pool (in one group I appear to care most about maximising space, in the other I'm maximising availability). Another option - use the new 2TB drives to swap out the existing 1TB drives. If you can find another use for the swapped out drives, this works well, and avoids ending up with sprawling lower capacity drives as your pool grows in size. This is what I do at home. The freed-up drives get used in other systems and for off-site backups. 
Over the last 4 years, I've upgraded from 1/4TB, to 1/2TB, and now on 1TB drives. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
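A rough usable-capacity comparison of the configurations discussed in this thread can be sketched with simple arithmetic (raw space only; this ignores ZFS metadata, slop space, and compression):

```python
# Back-of-envelope usable capacity for striped raidz groups:
# each group contributes (drives - parity) * drive_size.

def raidz_usable(n_drives, drive_tb, parity):
    return (n_drives - parity) * drive_tb

pool = [
    raidz_usable(8, 1, 2),  # existing 8 x 1TB raidz2 ->  6 TB
    raidz_usable(8, 1, 2),  # second   8 x 1TB raidz2 ->  6 TB
    raidz_usable(8, 2, 3),  # added    8 x 2TB raidz3 -> 10 TB
]
print(sum(pool))  # 22 TB usable across the three groups
```

The 6-drive raidz3 alternative (`raidz_usable(6, 2, 3)` = 6 TB) matches the existing group sizes but gives up 4 TB per group, which is the space-versus-availability trade-off Phil is pointing at.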
Re: [zfs-discuss] ZFS SCRUB
Mohammed Sadiq wrote: Hi Is it recommended to do scrub while the filesystem is mounted . How frequently do we have to do scrub and at what circumstances. You can scrub while the filesystems are mounted - most people do, there's no reason to unmount for a scrub. (Scrub is pool level, not filesystem level.) Scrub does noticeably slow the filesystem, so pick a time of low application load or a time when performance isn't critical. If it overruns into a busy period, you can cancel the scrub. Unfortunately, you can't pause and resume - there's an RFE for this, so if you cancel one you can't restart it from where it got to - it has to restart from the beginning. You should scrub occasionally anyway. That's your check that data you haven't accessed in your application isn't rotting on the disks. You should also do a scrub before you do a planned reduction of the pool redundancy (e.g. if you're going to detach a mirror side in order to attach a larger disk), most particularly if you are reducing the redundancy to nothing. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Maximum zfs send/receive throughput
Jim Barker wrote: Just an update, I had a ticket open with Sun regarding this and it looks like they have a CR for what I was seeing (6975124). That would seem to describe a zfs receive which has stopped for 12 hours. You described yours as slow, which is not the term I personally would use for one which is stopped. However, you haven't given anything like enough detail here of your situation and what's happening for me to make any worthwhile guesses. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zvol recordsize for backing a zpool over iSCSI
Just wondering if anyone has experimented with working out the best zvol recordsize for a zvol which is backing a zpool over iSCSI? -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance?
Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Phil Harman Milkowski and Neil Perrin's zil synchronicity [PSARC/2010/108] changes with sync=disabled, when the changes work their way into an available The fact that people run unsafe systems seemingly without complaint for years assumes that they know silent data corruption when they see^H^H^Hhear it ... which, of course, they didn't ... because it is silent ... or having encountered corrupted data, that they have the faintest idea where it came from. In my day to day work I still find many people that have been (apparently) very lucky. Running with sync disabled, or ZIL disabled, you could call "unsafe" if you want to use a generalization and a stereotype. Just like people say "writeback" is unsafe. If you apply a little more intelligence, you'll know, it's safe in some conditions, and not in other conditions. Like ... If you have a BBU, you can use your writeback safely. And if you're not sharing stuff across the network, you're guaranteed the disabled ZIL is safe. But even when you are sharing stuff across the network, the disabled ZIL can still be safe under the following conditions: If you are only doing file sharing (NFS, CIFS) and you are willing to reboot/remount from all your clients after an ungraceful shutdown of your server, then it's safe to run with ZIL disabled. No, that's not safe. The client can still lose up to 30 seconds of data, which could be, for example, an email message which is received and foldered on the server, and is then lost. It's probably /*safe enough*/ for most home users, but you should be fully aware of the potential implications before embarking on this route. (As I said before, the zpool itself is not at any additional risk of corruption, it's just that you might find the zfs filesystems with sync=disabled appear to have been rewound by up to 30 seconds.) 
If you're unsure, then adding SSD nonvolatile log device, as people have said, is the way to go. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance?
Thomas Burgess wrote: On Fri, Jul 23, 2010 at 3:11 AM, Sigbjorn Lie wrote: Hi, I've been searching around on the Internet to find some help with this, but have been unsuccessful so far. I have some performance issues with my file server. I have an OpenSolaris server with a Pentium D 3GHz CPU, 4GB of memory, and a RAIDZ1 over 4 x Seagate (ST31500341AS) 1,5TB SATA drives. If I compile or even just unpack a tar.gz archive with source code (or any archive with lots of small files), on my Linux client onto a NFS mounted disk to the OpenSolaris server, it's extremely slow compared to unpacking this archive locally on the server. A 22MB .tar.gz file containing 7360 files takes 9 minutes and 12 seconds to unpack over NFS. Unpacking the same file locally on the server takes just under 2 seconds. Between the server and client I have a gigabit network, which at the time of testing had no other significant load. My NFS mount options are: "rw,hard,intr,nfsvers=3,tcp,sec=sys". Any suggestions to why this is? Regards, Sigbjorn as someone else said, adding an ssd log device can help hugely. I saw about a 500% nfs write increase by doing this. I've heard of people getting even more. Another option if you don't care quite so much about data security in the event of an unexpected system outage would be to use Robert Milkowski and Neil Perrin's zil synchronicity [PSARC/2010/108] changes with sync=disabled, when the changes work their way into an available build. The risk is that if the file server goes down unexpectedly, it might come back up having lost some seconds worth of changes which it told the client (lied) that it had committed to disk, when it hadn't, and this violates the NFS protocol. That might be OK if you are using it to hold source that's being built, where you can kick off a build again if the server did go down in the middle of it. 
Wouldn't be a good idea for some other applications though (although Linux ran this way for many years, seemingly without many complaints). Note that there's no increased risk of the zpool going bad - it's just that after the reboot, filesystems with sync=disabled will look like they were rewound by some seconds (possibly up to 30 seconds). -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
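Once the zil synchronicity changes mentioned above reach an available build, the per-dataset switch would look something like the sketch below. The dataset name is hypothetical, and the script falls back to a plain echo where no ZFS stack is present, so treat it as illustrative rather than verified on a live pool.

```shell
# Hypothetical dataset name; requires a build with PSARC/2010/108 integrated.
DATASET=tank/export
if command -v zfs >/dev/null 2>&1; then
  zfs set sync=disabled "$DATASET"                 # trade crash-safety for NFS speed
  MODE=$(zfs get -H -o value sync "$DATASET")      # confirm the setting
else
  MODE=disabled                                    # what 'zfs get' would report
fi
echo "$MODE"
```

Setting it back to `sync=standard` restores the normal ZIL behaviour for that dataset only, which is the advantage over globally disabling the ZIL.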
Re: [zfs-discuss] zfs send to remote any ideas for a faster way than ssh?
Richard Jahnel wrote: Any idea why? Does the zfs send or zfs receive bomb out part way through? I have no idea why mbuffer fails. Changing the -s from 128 to 1536 made it take longer to occur and slowed it down by about 20%, but didn't resolve the issue. It just meant I might get as far as 2.5GB before mbuffer bombed with broken pipe. Trying -r and -R with various values had no effect. I found that where the network bandwidth and the disks' throughput are similar (which requires a pool with many top-level vdevs in the case of a 10Gb link), you ideally want a buffer on the receive side which will hold about 5 seconds' worth of data. A large buffer on the transmit side didn't help. The aim is to be able to continue streaming data across the network whilst a transaction commit happens at the receive end and zfs receive isn't reading, but to have the data ready locally for zfs receive when it starts reading again. Then the network will stream, in spite of the bursty read nature of zfs receive. I recorded this in bugid http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6729347 However, I haven't verified the extent to which this still happens on more recent builds. Might be worth trying it over rsh if security isn't an issue, and then you lose the encryption overhead. Trouble is that then you've got almost no buffering, which can do bad things to the performance, which is why mbuffer would be ideal if it worked for you. I seem to remember reading that rsh was remapped to ssh in Solaris. No. On the system you're rsh'ing to, you will have to "svcadm enable svc:/network/shell:default", and set up appropriate authorisation in ~/.rhosts I heard of some folks using netcat. I haven't figured out where to get netcat nor the syntax for using it yet. I used a buffering program of my own, but I presume mbuffer would work too. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
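As a rough sizing sketch for that receive-side buffer: the 10Gb link and 5-second window are the figures from the post, while the mbuffer pipeline in the trailing comment uses hypothetical host, pool, and snapshot names.

```shell
# Size a receive-side buffer to hold ~5 seconds of a 10Gb/s stream.
LINK_GBPS=10
BYTES_PER_SEC=$(( LINK_GBPS * 1000000000 / 8 ))   # 1.25 GB/s
BUF_BYTES=$(( BYTES_PER_SEC * 5 ))                # ~6.25e9 bytes, call it 6G
echo "receive-side buffer: $BUF_BYTES bytes"
# On a real pair of hosts (names hypothetical), something like:
#   zfs send tank/fs@snap | ssh recvhost 'mbuffer -s 128k -m 6G | zfs receive -d pool'
```

The point is that the buffer sits on the *receiving* side of the network hop, so the link keeps streaming while zfs receive pauses for a transaction commit.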
Re: [zfs-discuss] zfs send to remote any ideas for a faster way than ssh?
Richard Jahnel wrote: I've tried ssh blowfish and scp arcfour. Both are CPU limited long before the 10g link is. I've also tried mbuffer, but I get broken pipe errors part way through the transfer. Any idea why? Does the zfs send or zfs receive bomb out part way through? Might be worth trying it over rsh if security isn't an issue, and then you lose the encryption overhead. Trouble is that then you've got almost no buffering, which can do bad things to the performance, which is why mbuffer would be ideal if it worked for you. I'm open to ideas for faster ways to either zfs send directly or through a compressed file of the zfs send output. For the moment I: zfs send > pigz; scp (arcfour) the gz file to the remote host; gunzip < the file into zfs receive. This takes a very long time for 3 TB of data, and barely makes use of the 10g connection between the machines due to the CPU limiting on the scp and gunzip processes. Also, if you have multiple datasets to send, might be worth seeing if sending them in parallel helps. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
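The parallel-send suggestion can be sketched with a generic shell pattern. The dataset names are hypothetical placeholders, and each `echo` stands in for a full `zfs send | ssh | zfs receive` pipeline on a real system.

```shell
# One background job per dataset; wait collects them all before finishing.
send_all() {
  for ds in tank/ds1 tank/ds2 tank/ds3; do
    (
      # Placeholder for: zfs send "$ds@snap" | ssh recvhost "zfs receive -d pool"
      echo "sending $ds"
    ) &
  done
  wait
  echo "all sends complete"
}
OUT=$(send_all)
echo "$OUT"
```

With several streams in flight, a single CPU-bound ssh or gunzip process no longer caps the whole transfer, which is the bottleneck described above.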
Re: [zfs-discuss] 1tb SATA drives
Arne Jansen wrote: Jordan McQuown wrote: I’m curious to know what other people are running for HDs in white box systems? I’m currently looking at Seagate Barracudas and Hitachi Deskstars. I’m looking at the 1TB models. These will be attached to an LSI expander in an SC847E2 chassis driven by an LSI 9211-8i HBA. This system will be used as a large storage array for backups and archiving. I wouldn't recommend using desktop drives in a server RAID. They can't handle the vibrations well that are present in a server. I'd recommend at least the Seagate Constellation or the Hitachi Ultrastar, though I haven't tested the Deskstar myself. I've been using a couple of 1TB Hitachi Ultrastars for about a year with no problem. I don't think mine are still available, but I expect they have something equivalent. The pool is scrubbed 3 times a week, which takes nearly 19 hours now and hammers the heads quite hard. I keep meaning to reduce the scrub frequency now it's getting to take so long, but haven't got around to it. What I really want is pause/resume scrub, and the ability to trigger the pause/resume from the screensaver (or something similar). -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recommended RAM for ZFS on various platforms
Garrett D'Amore wrote: Btw, instead of RAIDZ2, I'd recommend simply using stripe of mirrors. You'll have better performance, and good resilience against errors. And you can grow later as you need to by just adding additional drive pairs. -- Garrett Or in my case, I find my home data growth is slightly less than the rate of disk capacity increase, so every 18 months or so, I simply swap out the disks for higher capacity ones. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Legality and the future of zfs...
Linder, Doug wrote: Out of sheer curiosity - and I'm not disagreeing with you, just wondering - how does ZFS make money for Oracle when they don't charge for it? Do you think it's such an important feature that it's a big factor in customers picking Solaris over other platforms? Yes, it is one of many significant factors in customers choosing Solaris over other OS's. Having chosen Solaris, customers then tend to buy Sun/Oracle systems to run it on. Of course, there are the 7000 series products too, which are heavily based on the capabilities of ZFS, amongst other Solaris features. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ls says: /tank/ws/fubar: Operation not applicable
Gordon Ross wrote: Anyone know why my ZFS filesystem might suddenly start giving me an error when I try to "ls -d" the top of it? i.e.: ls -d /tank/ws/fubar /tank/ws/fubar: Operation not applicable zpool status says all is well. I've tried snv_139 and snv_137 (my latest and previous installs). It's an amd64 box. Both OS versions show the same problem. Do I need to run a scrub? (will take days...) Other ideas? It might be interesting to run it under truss, to see which syscall is returning that error. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
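The truss suggestion could look something like the one-liner below; it's a sketch for a Solaris system, filtering truss's trace down to the syscalls that returned errors (I believe "Operation not applicable" is the Solaris message text for ENOSYS, but confirm from the Err# annotation truss prints).

```shell
# Solaris truss: trace the failing ls and show only syscalls that errored.
# Failing calls are annotated with "Err#<errno>" in truss output.
truss ls -d /tank/ws/fubar 2>&1 | grep 'Err#'
```

Seeing *which* syscall fails (e.g. a stat variant vs. a getattr on the mount point) narrows down whether the problem is in ZFS, the mount, or something above it.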
Re: [zfs-discuss] Is it possible to disable MPxIO during OpenSolaris installation?
James C. McPherson wrote: On 2/06/10 03:11 PM, Fred Liu wrote: Fix some typos. # In fact, there is no problem for MPxIO name in technology. It only matters for storage admins to remember the name. You are correct. I think there is no way to give short aliases to these long tedious MPxIO names. You are correct that we don't have aliases. However, I do not agree that the naming is tedious. It gives you certainty about the actual device that you are dealing with, without having to worry about whether you've cabled it right. Might want to add a call record to CR 6901193 "Need a command to list current usage of disks, partitions, and slices", which includes a request for vanity naming for disks. (Actually, vanity naming for disks should probably be brought out into a separate RFE.) -- Andrew Gabriel | Solaris Systems Architect Email: andrew.gabr...@oracle.com Mobile: +44 7720 598213 Oracle Pre-Sales Guillemont Park | Minley Road | Camberley | GU17 9QG | United Kingdom ORACLE Corporation UK Ltd is a company incorporated in England & Wales | Company Reg. No. 1782505 | Reg. office: Oracle Parkway, Thames Valley Park, Reading RG6 1RA Oracle is committed to developing practices and products that help protect the environment ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] unsetting the bootfs property possible? imported a FreeBSD pool
Reshekel Shedwitz wrote: r...@nexenta:~# zpool set bootfs= tank cannot set property for 'tank': property 'bootfs' not supported on EFI labeled devices r...@nexenta:~# zpool get bootfs tank
NAME  PROPERTY  VALUE  SOURCE
tank  bootfs    tank   local
Could this be related to the way FreeBSD's zfs partitioned my disk? I thought ZFS used EFI by default though (except for boot pools). Looks like this bit of code to me: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libzfs/common/libzfs_pool.c#473
473     /*
474      * bootfs property cannot be set on a disk which has
475      * been EFI labeled.
476      */
477     if (pool_uses_efi(nvroot)) {
478             zfs_error_aux(hdl, dgettext(TEXT_DOMAIN,
479                 "property '%s' not supported on "
480                 "EFI labeled devices"), propname);
481             (void) zfs_error(hdl, EZFS_POOL_NOTSUP, errbuf);
482             zpool_close(zhp);
483             goto error;
484     }
485     zpool_close(zhp);
486     break;
It's not checking whether you're clearing the property before bailing out with the error about setting it. A few lines above, another test (for a valid bootfs name) does get bypassed in the case of clearing the property. Don't know if that alone would fix it. -- Andrew Gabriel | Solaris Systems Architect ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [ZIL device brainstorm] intel x25-M G2 has ram cache?
Erik Trimble wrote: Frankly, I'm really surprised that there's no solution, given that the *amount* of NVRAM needed for ZIL (or similar usage) is really quite small. A dozen GB is more than sufficient, and really, most systems do fine with just a couple of GB (3-4 or so). Producing a small, DRAM-based device in a 3.5" HD form-factor with built-in battery shouldn't be hard, and I'm kinda flabbergasted nobody is doing it. Well, at least in the sub-$1000 category. I mean, it's 2 SODIMMs, an AAA-NiCad battery, a PCI-E->DDR2 memory controller, a PCI-E to SATA 6Gbps controller, and that's it. It's a bit of a wonky design. The DRAM could do something of the order of 1,000,000 IOPS, and is then throttled back to a tiny fraction of that by the SATA bottleneck. Disk interfaces like SATA/SAS really weren't designed for this type of use. What you probably want is a motherboard which has a small area of main memory protected by battery, and a ramdisk driver which knows how to use it. Then you'd get the 1,000,000 IOPS. No idea if anyone makes such a thing. You are correct that ZFS gets an enormous benefit from even tiny amounts of NV ZIL. Trouble is that no other operating systems or filesystems work this well with such relatively tiny amounts of NV storage, so such a hardware solution is very ZFS-specific. -- Andrew Gabriel | Solaris Systems Architect ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs question
Mihai wrote: hello all, I have the following scenario of using zfs. - I have an HDD image that has an NTFS partition, stored in a zfs dataset in a file called images.img - I have X physical machines that boot from my server via iSCSI from such an image - Every time a machine asks for a boot request from my server, a clone of the zfs dataset is created and the machine is given the clone to boot from I want to make an optimization to my framework that involves using a ramdisk pool to store the initial hdd images and the clones of the image being stored on a disk based pool. I tried to do this using zfs, but it wouldn't let me do cross pool clones. If someone has any idea on how to proceed in doing this, please let me know. It is not necessary to do this exactly as I proposed, but it has to be something in this direction, a ramdisk backed initial image and more disk backed clones. You haven't said what your requirement is - i.e. what are you hoping to improve by making this change? I can only guess. If you are reading blocks from your initial hdd images (golden images) frequently enough, and you have enough memory on your system, these blocks will end up in the ARC (memory) anyway. If you don't have enough RAM for this to help, then you could add more memory, and/or an SSD as an L2ARC device ("cache" device in zpool command line terms). -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
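Adding an L2ARC device, as suggested, is a single zpool operation. A sketch with hypothetical pool and device names, shown as configuration rather than run:

```shell
# Add an SSD as an L2ARC ("cache") device to the pool holding the golden
# images; c2t0d0 is a hypothetical device name.
zpool add tank cache c2t0d0
zpool status tank     # the SSD now appears under a "cache" section
```

Unlike a log device, a cache device holds no pool state that can be lost, so it can be added or removed freely while the pool is in use.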
Re: [zfs-discuss] Proposition of a new zpool property.
Robert Milkowski wrote: To add my 0.2 cents... I think starting/stopping scrub belongs to cron, smf, etc. and not to zfs itself. However, what would be nice to have is an ability to freeze/resume a scrub and also to limit its rate of scrubbing. One of the reasons is that when working in SAN environments one has to take into account more than just the server where the scrub will be running, as while it might not impact the server, it might cause an issue for others, etc. There's an RFE for this (pause/resume a scrub), or rather there was - unfortunately, it got subsumed into another RFE/BUG and the pause/resume requirement got lost. I'll see about reinstating it. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] When to Scrub..... ZFS That Is
Thomas Burgess wrote: I scrub once a week. I think the general rule is: once a week for consumer grade drives once a month for enterprise grade drives. and before any planned operation which will reduce your redundancy/resilience, such as swapping out a disk for a new larger one when growing a pool. The resulting resilver will read all the data in the datasets in order to reconstruct the new disk, some of which might not have been read for ages (or since the last scrub), and that's not the ideal time to discover your existing copy of some blocks went bad some time back. Better to discover this before you reduce the pool redundancy/resilience, whilst it's still fixable. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
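The advice above amounts to the following sequence; pool and device names are hypothetical, and it's shown as configuration rather than run:

```shell
# 1. Verify every block is readable while redundancy is still intact.
zpool scrub tank
zpool status tank          # wait for "scrub completed" with 0 errors
# 2. Only then reduce redundancy by swapping in the larger disk.
zpool replace tank c0t3d0 c0t9d0
zpool status tank          # the resilver re-reads all data onto c0t9d0
```

If the scrub does turn up bad blocks, they can be repaired from the still-present redundancy before the replace begins, which is exactly the window the post describes.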
Re: [zfs-discuss] Intel SASUC8I - worth every penny
Dedhi Sujatmiko wrote: As a user of the el-cheapo US$18 SIL3114, I managed to make the system freeze continuously when one of the SATA cables got disconnected. I am using 8 disks in RAIDZ2 driven by 2 x SIL3114. The system is still able to answer pings, but SSH and the console are no longer responsive, and obviously neither are the NFS and CIFS shares. The console keeps printing a "waiting for disk" loop. The only way to recover is to reset the system, and as expected, one of the disks went offline, but the service came back online with a degraded ZFS pool. The SIL3112/3114 were very early SATA controllers, indeed barely SATA controllers at all by today's standards, as I think they always pretend to be PATA to the host system. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshot recycle freezes system activity
Gary Mills wrote: We have an IMAP e-mail server running on a Solaris 10 10/09 system. It uses six ZFS filesystems built on a single zpool with 14 daily snapshots. Every day at 11:56, a cron command destroys the oldest snapshots and creates new ones, both recursively. For about four minutes thereafter, the load average drops and I/O to the disk devices drops to almost zero. Then, the load average shoots up to about ten times normal and then declines to normal over about four minutes, as disk activity resumes. The statistics return to their normal state about ten minutes after the cron command runs. Is it destroying old snapshots or creating new ones that causes this dead time? What does each of these procedures do that could affect the system? What can I do to make this less visible to users? Creating a snapshot shouldn't do anything much more than a regular transaction group commit, which should be happening at least every 30 seconds anyway. Deleting a snapshot potentially results in freeing up the space occupied by files/blocks which aren't in any other snapshots. One way to think of this is that when you're using regular snapshots, the freeing up of space which happens when you delete files is in effect all deferred until you destroy the snapshot(s) which also refer to that space, which has the effect of bunching up all of your space freeing. If this is the cause (a big _if_, as I'm just speculating), then it might be a good idea to: a) spread out the deleting of the snapshots, and b) create snapshots more often (and conversely delete more snapshots, more often), so each one contains less accumulated space to be freed off. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
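Suggestions (a) and (b) could be sketched as a small rotation script run more often from cron. All names and numbers are hypothetical, and the zfs command is echoed rather than executed so the sketch is a dry run:

```shell
# Dry-run sketch of a more frequent snapshot rotation (hypothetical names).
# Run e.g. every 4 hours from cron instead of once daily at 11:56, so each
# destroy frees a smaller batch of space.
POOL=pool/imap
KEEP=84                              # 14 days x 6 snapshots/day
NOW=$(date +%Y%m%d-%H%M)
echo "zfs snapshot -r $POOL@$NOW"
# In a real script, list snapshots oldest-first and destroy any beyond $KEEP:
#   zfs list -H -t snapshot -o name -s creation -r $POOL | ... | xargs -n1 zfs destroy -r
```

The retention window stays the same 14 days; only the granularity changes, so each destroy has roughly a sixth as much deferred space to free.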
Re: [zfs-discuss] Pool vdev imbalance
Ian Collins wrote: I was running zpool iostat on a pool comprising a stripe of raidz2 vdevs that appears to be writing slowly, and I noticed a considerable imbalance of both free space and write operations. The pool is currently feeding a tape backup while receiving a large filesystem. Is this imbalance normal? I would expect a more even distribution, as the pool configuration hasn't been changed since creation. The second and third ones are pretty much full, with the others having well over 10 times more free space, so I wouldn't expect many writes to the full ones. Have the others ever been in a degraded state? That might explain why the fill level has become unbalanced. The system is running Solaris 10 update 7.

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        15.9T  2.19T     87    119  2.34M  1.88M
  raidz2    2.90T   740G     24     27   762K  95.5K
    c0t1d0      -      -     14     13   273K  18.6K
    c1t1d0      -      -     15     13   263K  18.3K
    c4t2d0      -      -     17     14   288K  18.2K
    spare       -      -     17     20   104K  17.2K
      c5t2d0    -      -     16     13   277K  17.6K
      c7t5d0    -      -      0     14      0  17.6K
    c6t3d0      -      -     15     12   242K  18.7K
    c7t3d0      -      -     15     12   242K  17.6K
    c6t4d0      -      -     16     12   272K  18.1K
    c1t0d0      -      -     15     13   275K  16.8K
  raidz2    3.59T  37.8G     20      0   546K      0
    c0t2d0      -      -     11      0   184K    361
    c1t3d0      -      -     10      0   182K    361
    c4t5d0      -      -     14      0   237K    361
    c5t5d0      -      -     13      0   220K    361
    c6t6d0      -      -     12      0   155K    361
    c7t6d0      -      -     11      0   149K    361
    c7t4d0      -      -     14      0   219K    361
    c4t0d0      -      -     14      0   213K    361
  raidz2    3.58T  44.1G     27      0  1.01M      0
    c0t5d0      -      -     16      0   290K    361
    c1t6d0      -      -     15      0   301K    361
    c4t7d0      -      -     20      0   375K    361
    c5t1d0      -      -     19      0   374K    361
    c6t7d0      -      -     17      0   285K    361
    c7t7d0      -      -     15      0   253K    361
    c0t0d0      -      -     18      0   328K    361
    c6t0d0      -      -     18      0   348K    361
  raidz2    3.05T   587G      7     47  24.9K  1.07M
    c0t4d0      -      -      3     21   254K   187K
    c1t2d0      -      -      3     22   254K   187K
    c4t3d0      -      -      5     22   350K   187K
    c5t3d0      -      -      5     21   350K   186K
    c6t2d0      -      -      4     22   265K   187K
    c7t1d0      -      -      4     21   271K   187K
    c6t1d0      -      -      5     22   345K   186K
    c4t1d0      -      -      5     24   333K   184K
  raidz2    2.81T   835G      8     45  30.9K   733K
    c0t3d0      -      -      5     16   339K   126K
    c1t5d0      -      -      5     16   333K   126K
    c4t6d0      -      -      6     16   441K   127K
    c5t6d0      -      -      6     17   435K   126K
    c6t5d0      -      -      4     18   294K   126K
    c7t2d0      -      -      4     18   282K   124K
    c0t6d0      -      -      7     19   446K   124K
    c5t7d0      -      -      7     21   452K   122K
----------  -----  -----  -----  -----  -----  -----

-- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Help, my zone's dataset has disappeared!
Jesse Reynolds wrote: Does ZFS store a log file of all operations applied to it? It feels like someone has gained access and run 'zfs destroy mailtmp' to me, but then again it could just be my own ineptitude. Yes... zpool history rpool -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pool import with failed ZIL device now possible ?
Darren J Moffat wrote: You have done a risk analysis and if you are happy that your NTFS filesystems could be corrupt on those ZFS ZVOLs if you lose data then you could consider turning off the ZIL. Note though that it isn't just those ZVOLs you are serving to Windows that lose access to a ZIL but *ALL* datasets on *ALL* pools and that includes your root pool. For what it's worth I personally run with the ZIL disabled on my home NAS system which is serving over NFS and CIFS to various clients, but I wouldn't recommend it to anyone. The reason I say never to turn off the ZIL is because in most environments outside of home usage it just isn't worth the risk to do so (not even for a small business). People used fastfs for years in specific environments (hopefully understanding the risks), and disabling the ZIL is safer than fastfs. Seems like it would be a useful ZFS dataset parameter. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cannot receive new filesystem stream: invalid backup stream
Darren J Moffat wrote: On 12/02/2010 09:55, Andrew Gabriel wrote: Can anyone suggest how I can get around the above error when sending/receiving a ZFS filesystem? It seems to fail when about 2/3rds of the data have been passed from send to recv. Is it possible to get more diagnostics out? You could try using /usr/bin/zstreamdump on recent builds. Ah, thanks Darren, didn't know about that. As far as I can see, it runs without a problem right to the end of the stream. Not sure what I should be looking for in the way of errors, but there are no occurrences in the output of any of the error strings revealed by running strings(1) on zstreamdump. Without -v ...

zfs send export/h...@20100211 | zstreamdump
BEGIN record
        version = 1
        magic = 2f5bacbac
        creation_time = 4b7499c5
        type = 2
        flags = 0x0
        toguid = 435481b4a8c20fbd
        fromguid = 0
        toname = export/h...@20100211
END checksum = 6797d43d709150c1/e1ec581cbaf0cfa4/7f6d80fa0f23c741/2c2cb821b4a2e639
SUMMARY:
        Total DRR_BEGIN records = 1
        Total DRR_END records = 1
        Total DRR_OBJECT records = 90963
        Total DRR_FREEOBJECTS records = 23389
        Total DRR_WRITE records = 212381
        Total DRR_FREE records = 520856
        Total records = 847591
        Total write size = 17395359232 (0x40cd81e00)
        Total stream length = 17683854968 (0x41e0a3678)

This filesystem has failed in this way for a long time, and I've ignored it thinking something might get fixed in the future, but this hasn't happened yet. It's a home directory which has been in existence and used for about 3 years. One thing is that the pool version (3) and zfs version (1) are old - could that be the problem? The sending system is currently running build 125 and the receiving system something approximating to 133, but I've had the same problem with this filesystem for all builds I've used over the last 2 years. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] cannot receive new filesystem stream: invalid backup stream
Can anyone suggest how I can get around the above error when sending/receiving a ZFS filesystem? It seems to fail when about 2/3rds of the data have been passed from send to recv. Is it possible to get more diagnostics out? This filesystem has failed in this way for a long time, and I've ignored it thinking something might get fixed in the future, but this hasn't happened yet. It's a home directory which has been in existence and used for about 3 years. One thing is that the pool version (3) and zfs version (1) are old - could that be the problem? The sending system is currently running build 125 and receiving system something approximating to 133, but I've had the same problem with this filesystem for all builds I've used over the last 2 years. -- Cheers Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Impact of an enterprise class SSD on ZIL performance
Peter Radig wrote: I was interested in the impact the type of an SSD has on the performance of the ZIL. So I did some benchmarking and just want to share the results. My test case is simply untarring the latest ON source (528 MB, 53k files) on a Linux system that has a ZFS file system mounted via NFS over gigabit ethernet. I got the following results: - remotely with no dedicated ZIL device: 36 min 37 sec (factor 73 compared to local) - remotely with an Intel X25-E 32 GB as ZIL device: 3 min 11 sec (factor 6.4 compared to local) That's about the same ratio I get when I demonstrate this on the SSD/Flash/Turbocharge Discovery Days I run in the UK from time to time (the name changes over time ;-). -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
Brandon High wrote: On Wed, Feb 3, 2010 at 3:13 PM, David Dyer-Bennet wrote: Which is to say that 45 drives is really quite a lot for a HOME NAS. Particularly when you then think about backing up that data. The origin of this thread was how to buy a J4500 (48 drive chassis). One thing that I enjoy about this list (and I'm sure the Sun guys get annoyed about) is the discussion of how to build various "small" systems for home use. We don't get annoyed at all. What do you think we build to run at home? ;-) After sitting on the sidelines for a while, I assembled an 8TB server for home. Yeah, 8TB is more than I can back up over my home DSL connection. But it's only got about 2.5TB in use, and most of that is our DVD collection that I've ripped to play on the Popcorn Hour in our living room, or CDs that I've ripped. I'd hate to have to re-rip it all, but I can get it back. The rest is photos and important documents which are copied to a VM instance and backed up offsite via Mozy. I'm considering doing a send/receive of a few volumes to a friend's system (as he will do to mine) to have offsite backups of the pools. It's mostly dependent on him buying more disk. ;-) And for what it's worth, my toying with ZFS and discussing it with coworkers has raised interest in Sun's storage line to replace NetApp at the office. Absolutely. My homebrew system has come up in many conversations about ZFS, which has ended up with a customer buying Thumper or Amber Road systems from Sun. (But that's my job, I guess ;-) -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
LICON, RAY (ATTPB) wrote: Thanks for the reply. In many situations, the hardware design isn't up to me and budgets tend to dictate everything these days. True, nobody wants to swap, but the question is "if" you had to -- what design serves you best: independent swap slices, or putting it all under the control of zfs? It depends why you need to swap, i.e. why are you using more memory than you have, and is your working set size bigger than memory (thrashing), or is swapping likely to be just a once-off event or infrequently repeated? You probably need to forget most of what you learned about swapping 25 years ago, when systems routinely swapped, and technology was very different. Disks have got faster over that period, probably of the order of 100 times faster. However, CPUs have got 100,000 times faster, so in reality a disk looks to be 1000 times slower from the CPU's standpoint than it did 25 years ago. This means that CPU cycles lost due to swapping will appear to have a proportionally much more dire effect on performance than they did many years back. There are lots more options available today than there were when systems routinely swapped. A couple of examples that spring to mind... ZFS has been explicitly designed to swap its own cache data, only we don't call it swapping - we call it an L2ARC or ReadZilla. So if you have a system where the application is going to struggle with main memory, you might configure ZFS to significantly reduce its memory buffer (ARC), and instead give it an L2ARC on a fast solid state disk. This might result in less performance degradation in some systems where memory is short, depending heavily on the behaviour of the application. If you do have to go with brute force old style swapping, then you might want to invest in solid state disk swap devices, which will go some way towards reducing the factor of 1000 I mentioned above. (Take note of aligning swap to the 4k flash I/O boundaries.) 
Probably lots of other possibilities too, given more than a couple of minutes thought. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
RayLicon wrote: Has anyone done research into the performance of SWAP on the traditional partitioned based SWAP device as compared to a SWAP area set up on ZFS with a zvol? I can find no best practices for this issue. In the old days it was considered important to separate the swap devices onto individual disks (controllers) and select the outer cylinder groups for the partition (to gain some read speed). How does this compare to creating a single SWAP zvol within a rootpool and then mirroring the rootpool across two separate disks? Best practice nowadays is to design a system so it doesn't need to swap. Then it doesn't matter what the performance of the swap device is. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 2gig file limit on ZFS?
Michelle Knight wrote: Fair enough. So where do you think my problem lies? Do you think it could be a limitation of the driver I loaded to read the ext3 partition? Without knowing exactly what commands you typed and exactly what error messages they produced, and which directories/files are on which types of file systems, we're limited to guessing. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send/receive as backup - reliability?
Robert Milkowski wrote: I think one should actually compare whole solutions - including servers, FC infrastructure, tape drives, robots, software costs, rack space, ... Servers like the X4540 are ideal for a zfs+rsync backup solution - very compact, good $/GB ratio, enough CPU power for its capacity, they allow you to easily scale horizontally, and it is not too small and not too big. Then, thanks to their compactness, they are very easy to administer. Depending on the environment one could always deploy them in pairs - one in one datacenter and the 2nd one in another datacenter, with ZFS send based replication of all backups (snapshots). Or one may replicate (cross-replicate) only selected clients if needed. Something else that often sells the 4500/4540 relates to internal company politics. Often, inside a company, storage has to be provisioned from the company's storage group, using very expensive SAN based storage, indeed so expensive by the time the company's storage group have added their overhead onto the already expensive SAN, that whole projects become unviable. Instead, teams find they can order 4500/4540s, which slip under the radar as servers (or even PCs), and they now have affordable storage for their projects, which makes them viable once more. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] "NearLine SAS"?
Edward Ned Harvey wrote: A poster in another forum mentioned that Seagate (and Hitachi, amongst others) is now selling something labeled as "NearLine SAS" storage (e.g. Seagate's NL35 series). The industry has moved again. Better get used to it. Nearline SAS is a replacement for SATA. It's a lower cost drive than SAS, with higher reliability than SATA. I have begun seeing vendors that sell only SAS and NearLine SAS. (Some Dell servers.) SATA is not dead. But this will certainly change things up a bit. Actually, this sounds like really good news for ZFS. ZFS (or rather, Solaris) can make good use of the multi-pathing capability, previously only available on high speed drives. Of course, ZFS can make use of any additional IOPS to be had, again previously only available on high speed drives. However, ZFS generally doesn't need high speed drives and does much better with hybrid storage pools. So these drives sound to me to have been designed specifically for ZFS! It's hard to imagine any other filesystem which can exploit them so completely. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
Mark Grant wrote: Yeah, this is my main concern with moving from my cheap Linux server with no redundancy to ZFS RAID on OpenSolaris. I don't really want to have to pay twice as much for 'enterprise' disks which appear to be exactly the same drives with a flag set in the firmware to limit read retries, but I also don't want to lose all my data because a sector fails and the drive hangs for a minute trying to relocate it, causing the file system to fall over. I haven't found a definitive answer as to whether this will kill a ZFS RAID like it kills traditional hardware RAID, or whether ZFS will recover after the drive stops attempting to relocate the sector. At least with a single-drive setup the OS will eventually get an error response, and the other files on the disk will be readable when I copy them over to a new drive. I don't think ZFS does any timing out itself. It's up to the drivers underneath to time out and send an error back to ZFS - only they know what's reasonable for a given disk type and bus type. So I guess this may depend on which drivers you are using. I don't know what the timeouts are, but I have observed them to be long in some cases when things do go wrong and timeouts and retries are triggered. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
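(For what it's worth: on drives where the vendor does expose the knob, smartmontools can read the SCT Error Recovery Control timers with `smartctl -l scterc`, and on many drives set them with `smartctl -l scterc,READ,WRITE` in tenths of a second. A sketch of checking the setting; the sample text below is made up to match the usual smartctl output format, which may differ on your version.)

```shell
# Sample of what `smartctl -l scterc /dev/rdsk/c2d0` typically prints
# (the format and device path are assumptions - check your smartctl).
sample='SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write: Disabled'

# Pull out the read/write ERC timers; raw values are tenths of a second.
echo "$sample" | awk '
  $1 == "Read:" || $1 == "Write:" {
    if ($2 == "Disabled") print $1, "disabled"
    else                  print $1, $2/10, "seconds"
  }'
```

A drive reporting "Disabled" (or rejecting the command outright) will retry internally for as long as it likes, which is exactly the minute-long hang described above.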
Re: [zfs-discuss] zfs on ssd
Bob Friesenhahn wrote: The interesting thing for the future will be non-volatile main memory, with the primary concern being how to firewall damage due to a bug. You would be able to turn your computer off and back on and be working again almost instantaneously. Some of us are old enough (just) to have used computers back in the days when they all did this anyway... Funny how things go full circle... -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128
Kjetil Torgrim Homme wrote: Daniel Carosone writes: Would there be a way to avoid taking snapshots if they're going to be zero-sized? I don't think it is easy to do, the txg counter is on a pool level, AFAIK:

# zdb -u spool
Uberblock
        magic = 00bab10c
        version = 13
        txg = 1773324
        guid_sum = 16611641539891595281
        timestamp = 1258992244 UTC = Mon Nov 23 17:04:04 2009

it would help when the entire pool is idle, though. (posted here, rather than in response to the mailing list reference given, because I'm not subscribed [...]) ditto. I think I can see from this which filesystems have and have not changed since the last snapshot (@20091122 in this case)...

a20$ zfs list -t snapshot | grep 20091122
export/d...@20091122                  499K  -   227G  -
export/h...@20091122                  144K  -  15.6G  -
export/mu...@20091122                    0  -  66.5G  -
export/virtual...@20091122               0  -   484K  -
export/virtualbox/os0...@20091122        0  -  3.52G  -
export/virtualbox/x...@20091122          0  -  12.1G  -
export/zo...@20091122                    0  -  22.5K  -
export/zones/s...@20091122               0  -  5.21G  -
a20$

All the ones with USED = 0 haven't changed. Don't know if this info is available without spinning up disks though. -- Andrew Gabriel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
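(So even if the pool-level txg counter can't answer it, skipping per-filesystem looks scriptable: take the parsable `zfs list` output for the previous round of snapshots and only snapshot datasets whose last snapshot has non-zero USED. A sketch; the dataset names in the sample are made up for illustration.)

```shell
# Which filesystems changed since the @20091122 snapshots?
# Feed in `zfs list -H -t snapshot -o name,used` style output
# (dataset names below are made up for illustration).
sample='export/docs@20091122 499K
export/music@20091122 0
export/zones@20091122 0'

# Print the filesystems whose last snapshot has USED != 0, i.e. the
# only ones worth snapshotting again this time around.
echo "$sample" | awk '$2 != "0" { sub(/@.*/, "", $1); print $1 }'
```

The caveat from the post still applies: reading the USED property may itself spin up the disks.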
Re: [zfs-discuss] Resilver/scrub times?
Bill Sommerfeld wrote: Yesterday's integration of 6678033 "resilver code should prefetch" as part of changeset 74e8c05021f1 (which should be in build 129 when it comes out) may improve scrub times, particularly if you have a large number of small files and a large number of snapshots. I recently tested an early version of the fix, and saw one pool go from an elapsed time of 85 hours to 20 hours; another (with many fewer snapshots) went from 35 to 17. I've been wondering what difference that might make. I'm currently running snv_125. $ zfs list -t snapshot | wc -l 4407 $ Yep, quite a few snapshots. However, more important to me would be reducing the impact a scrub has on the rest of the system, even if it takes longer. Conversely, I can imagine you might want a resilver to run as fast as possible in some cases, even at the expense of other system activities. It would be really nice to have a speed knob on these operations, which you can vary while the activity progresses, depending on other uses of the system. -- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
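(There is a crude version of that knob in the newer scan code, in the form of kernel tunables, though they are undocumented, unstable interfaces whose names and defaults vary by build - treat the fragment below as an assumption to verify against your own source tree before use.)

```
* /etc/system fragment - slow scrubs down to cut their interactive impact.
* zfs_scrub_delay (ticks of delay inserted per scrub i/o) and
* zfs_top_maxinflight (max queued scrub i/os per top-level vdev) exist in
* the newer scan code, but they are undocumented and build-dependent -
* verify the names in your source tree before relying on them.
set zfs:zfs_scrub_delay = 4
set zfs:zfs_top_maxinflight = 8
```

It's not the live, per-operation dial asked for above (a reboot is needed for /etc/system changes to take effect), but it does trade scrub speed for interactive response.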
Re: [zfs-discuss] Resilver/scrub times?
Colin Raven wrote: Hi all! I've decided to take the "big jump" and build a ZFS home filer (although it might also do "other work" like caching DNS, mail, usenet, bittorrent and so forth). YAY! I wonder if anyone can shed some light on how long a pool scrub would take on a fairly decent rig. These are the specs as-ordered: Asus P5Q-EM mainboard, Core2 Quad 2.83 GHz, 8GB DDR2/80. On this system (dual-core Athlon64 4600+ 2.4GHz, 8GB RAM) with 7200RPM enterprise duty disks, it takes about 14 hours on the following zpool:

# zpool status
  pool: export
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 13h45m with 0 errors on Fri Nov 20 15:05:20 2009
config:

        NAME        STATE   READ WRITE CKSUM
        export      ONLINE     0     0     0
          mirror-0  ONLINE     0     0     0
            c3d0    ONLINE     0     0     0
            c2d0    ONLINE     0     0     0

errors: No known data errors

# zpool list
NAME     SIZE  USED  AVAIL  CAP  HEALTH  ALTROOT
export   930G  503G   427G  54%  ONLINE  -
#

OS: 2 x SSDs in RAID 0 (brand/size not decided on yet, but they will definitely be some flavor of SSD). Can only be RAID1 (mirrored) for the OS. Data: 4 x 1TB Samsung SpinPoint 7200 RPM 32MB cache SATA HDs (RAIDZ). Prefer mirrors myself, but it depends on how much data you have versus how much disk you can afford. Data payload initially will be around 550GB or so (before loading any stuff from another NAS and so on). Does scrub like memory, or CPU, or both? There's enough horsepower available, I would think. Same question applies to resilvering if I need to swap out drives at some point. [cough] I can't wait to get this thing built! :) I also use the system as a desktop now (after doing some system consolidation). Scrub doesn't use much CPU, but it does interfere with the interactive response of the desktop. I suspect this is due to the i/o's it queues up on the disks.
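(A rough back-of-envelope check supports that: 503G scrubbed in 13h45m is only about 10 MB/s, well below what a single 7200RPM spindle can stream sequentially, which fits a workload of seek-bound small i/os rather than anything CPU- or memory-limited.)

```shell
# Back-of-envelope scrub rate from the zpool output above:
# 503G of allocated data verified in 13h45m.
awk 'BEGIN {
  used_mb = 503 * 1024             # 503G in MB
  elapsed = 13*3600 + 45*60        # 13h45m in seconds
  printf "%.1f MB/s\n", used_mb / elapsed
}'
```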
-- Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss