Re: [zfs-discuss] Sun Flash Accelerator F20
Andrey Kuzmin wrote:
> On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling <richard.ell...@gmail.com> wrote:
>> On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:
>>> Andrey Kuzmin wrote:
>>>> Well, I'm more accustomed to sequential vs. random, but YMMV. As to 67000 512-byte writes (this sounds suspiciously close to 32 MB fitting into cache), did you have write-back enabled?
>>> It's a sustained number, so it shouldn't matter.
>> That is only 34 MB/sec. The disk can do better for sequential writes. Note: in ZFS, such writes will be coalesced into 128 KB chunks. So this is just 256 IOPS in the controller, not 64K.

No, it's 67k ops; it was a completely ZFS-free test setup. iostat also confirmed the numbers.

--Arne

> Regards, Andrey
>> -- richard
>> ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
>> http://nexenta-rotterdam.eventbrite.com/
Re: [zfs-discuss] Depth of Scrub
David Dyer-Bennet wrote:
>> But what about the parity? Obviously it has to be checked, but I can't find any indication of it in the literature. The man page only states that the data is checksummed and that the redundancy is used only if that check fails. Please tell me I'm wrong ;)
> I believe you're wrong. Scrub checks all the blocks used by ZFS, regardless of what's in them. (It doesn't check free blocks.)

Thanks for all the reassurances :) I don't know why I got weak; of course it would be pointless if it didn't check all the redundant data.
[zfs-discuss] Depth of Scrub
Hi,

I have a small question about the depth of scrub in a raidz/2/3 configuration. I'm quite sure scrub does not check spares or unused areas of the disks (though it could check whether the disk detects any errors there). But what about the parity? Obviously it has to be checked, but I can't find any indication of it in the literature. The man page only states that the data is checksummed and that the redundancy is used only if that check fails. Please tell me I'm wrong ;)

But what I'm really targeting with my question: how much coverage can be reached with a find | xargs wc in contrast to scrub? It misses the snapshots, but anything beyond that?

Thanks,
Arne
Re: [zfs-discuss] cannot destroy ... dataset already exists
Is the pool mounted? I ran into this problem frequently until I set the mountpoint to legacy. It may be that I had to destroy the filesystem afterwards, but since I stopped mounting the backup target everything runs smoothly. Nevertheless, I agree it would be nice to find the root cause for this.

-- Arne

Edward Ned Harvey wrote:
> This is the problem:
>
> [r...@nasbackup backup-scripts]# zfs destroy storagepool/nas-lyricp...@nasbackup-2010-05-14-15-56-30
> cannot destroy 'storagepool/nas-lyricp...@nasbackup-2010-05-14-15-56-30': dataset already exists
>
> This is apparently a common problem. It's happened to me twice already, and now a third time. Each time it happens, it's on the backup server, so fortunately I have total freedom to do whatever I want, including destroying the pool. The previous two times I googled around, basically only found "destroy the pool" as a solution, and I destroyed the pool. This time I would like to dedicate a little time and some resources to finding the cause of the problem, so hopefully it can be fixed for future users, including myself.
>
> This time I also found "apply updates and repeat your attempt to destroy the snapshot" ... So I applied updates and repeated, but no improvement. The OS was Sol 10u6, but now it’s fully updated. The problem persists. I’ve also tried exporting and importing the pool.
>
> Somebody on the Internet suspected the problem is somehow the aftermath of killing a zfs send or receive. This is distinctly possible, as I’m sure that’s happened on my systems. But there is currently no send or receive being killed ... Any such occurrence is long since past, and even beyond reboots and such.
>
> I do not use clones. There are no clones of this snapshot anywhere, and there never have been. I do have other snapshots which were incrementally received based on this one, but that shouldn't matter, right?
>
> I have not yet called support, although we do have a support contract. Any suggestions?
>
> FYI:
> [r...@nasbackup backup-scripts]# zfs list
> NAME                                                               USED  AVAIL  REFER  MOUNTPOINT
> rpool                                                             19.3G   126G    34K  /rpool
> rpool/ROOT                                                        16.3G   126G    21K  legacy
> rpool/ROOT/nasbackup_slash                                        16.3G   126G  16.3G  /
> rpool/dump                                                        1.00G   126G  1.00G  -
> rpool/swap                                                        2.00G   127G  1.08G  -
> storagepool                                                       1.28T  4.06T  34.4K  /storage
> storagepool/nas-lyricpool                                         1.27T  4.06T  1.13T  /storage/nas-lyricpool
> storagepool/nas-lyricp...@nasbackup-2010-05-14-15-56-30           94.1G      -  1.07T  -
> storagepool/nas-lyricp...@daily-2010-06-01-00-00-00                   0      -  1.13T  -
> storagepool/nas-rpool-ROOT-nas_slash                              8.65G  4.06T  8.65G  /storage/nas-rpool-ROOT-nas_slash
> storagepool/nas-rpool-root-nas_sl...@daily-2010-06-01-00-00-00        0      -  8.65G  -
> zfs-external1                                                     1.13T   670G    24K  /zfs-external1
> zfs-external1/nas-lyricpool                                       1.12T   670G  1.12T  /zfs-external1/nas-lyricpool
> zfs-external1/nas-lyricp...@daily-2010-06-01-00-00-00                 0      -  1.12T  -
> zfs-external1/nas-rpool-ROOT-nas_slash                            8.60G   670G  8.60G  /zfs-external1/nas-rpool-ROOT-nas_slash
> zfs-external1/nas-rpool-root-nas_sl...@daily-2010-06-01-00-00-00      0      -  8.60G  -
>
> And
> [r...@nasbackup ~]# zfs get origin
> NAME                                                      PROPERTY  VALUE  SOURCE
> rpool                                                     origin    -      -
> rpool/ROOT                                                origin    -      -
> rpool/ROOT/nasbackup_slash                                origin    -      -
> rpool/dump                                                origin    -      -
> rpool/swap                                                origin    -      -
> storagepool                                               origin    -      -
> storagepool/nas-lyricpool                                 origin    -      -
> storagepool/nas-lyricp...@nasbackup-2010-05-14-15-56-30   origin    -      -
> storagepool/nas-lyricp...@daily-2010-06-01-00-00-00       origin    -      -
> storagepool/nas-lyricp...@daily-2010-06-02-00-00-00       origin    -      -
> storagepool/nas-rpool-ROOT-nas_slash                      origin    -      -
Re: [zfs-discuss] creating a fast ZIL device for $200
Neil Perrin wrote:
> Yes, I agree this seems very appealing. I have investigated and observed similar results. Just allocating larger intent log blocks but only writing to, say, the first half of them has shown the same effect. Despite the impressive results, we have not pursued this further, mainly because of its maintainability. There is quite a variance between drives, so, as mentioned, feedback profiling of the device is needed in the working system. The layering of the Solaris I/O subsystem doesn't provide the necessary feedback, and the ZIL code is layered on the SPA/DMU. Still, it should be possible. Good luck!

Thanks :) Though I had hoped for a different answer. An integration into the ZFS code would be much more elegant, but of course in a few years the necessity for this optimization will be gone, when SSDs are cheap, fast and reliable.

There seems to be some interest in this idea here. Would it make sense to start a project for it? Currently I'm implementing a driver as a proof of concept, but I'm in need of a lot of discussion about algorithms and concepts, and maybe some code reviews. Can I count on some support from here?

--Arne
Re: [zfs-discuss] creating a fast ZIL device for $200
(resent because of received bounce)

Edward Ned Harvey wrote:
> From: sensille [mailto:sensi...@gmx.net]
>
> So this brings me back to the question I indirectly asked in the middle of a much longer previous email: is there some way, in software, to detect the current position of the head? If not, then I only see two possibilities: either you have some previous knowledge (or assumptions) about the drive geometry, rotation speed, and the wall-clock time passed since the last write completed, and use this (possibly vague or inaccurate) info to make your best guess which available blocks are accessible with minimum latency next ...

That is my approach currently, and it works quite well. I obtain the prior knowledge through a special measuring process run before first using the disk. To keep the driver in sync with the disk during idle times, it issues dummy ops at regular intervals, say 20 per second.

> ... or else some sort of new hardware behavior would be necessary. Possibly a special type of drive which always assumes a command to write to a magical block number actually means "write to the next available block", or something like that ... or reading from a magical block actually tells you the position of the head, or something like that ...

That would be nice. But what would be much nicer is a drive with an extremely small setup time. Current drives need the command 0.4-0.7 ms in advance, depending on manufacturer and drive type.
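A minimal user-space sketch of the prediction step described above: given a geometry obtained by a prior measuring run and the wall-clock time of the last completed write, guess which sector on the current track can still be reached in time. All constants are hypothetical placeholders, not values from the actual driver; the dummy ops mentioned above would simply refresh the last known position so this estimate never drifts.

/*
 * Predict the next reachable sector from measured geometry and elapsed
 * wall-clock time.  SECTORS_PER_TRACK, ROTATION_NS and SETUP_SECTORS are
 * illustrative placeholders for drive-specific measured values.
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define SECTORS_PER_TRACK  1764     /* e.g. outer zone of a 10k rpm SAS disk */
#define ROTATION_NS        6000000  /* 10,000 rpm -> 6 ms per revolution */
#define SETUP_SECTORS      140      /* lead the drive needs to accept a command */

static uint64_t now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/*
 * last_sector/last_ns describe the last write whose head position is known.
 * Returns the first sector on the same track that can still be hit.
 */
static uint32_t next_reachable_sector(uint32_t last_sector, uint64_t last_ns)
{
        uint64_t elapsed = now_ns() - last_ns;
        uint64_t sectors_passed = elapsed * SECTORS_PER_TRACK / ROTATION_NS;

        return (uint32_t)((last_sector + sectors_passed + SETUP_SECTORS)
            % SECTORS_PER_TRACK);
}

int main(void)
{
        uint64_t t0 = now_ns();

        /* Pretend the last write ended at sector 0 just now; an idle driver
         * would refresh this anchor with its periodic dummy writes. */
        printf("next reachable sector: %u\n", next_reachable_sector(0, t0));
        return 0;
}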
Re: [zfs-discuss] creating a fast ZIL device for $200
Edward Ned Harvey wrote:
> From: sensille [mailto:sensi...@gmx.net]
>
>> The only thing I'd like to point out is that ZFS doesn't do random writes on a slog, but nearly linear writes. This might even be hurting performance more than random writes, because you always hit the worst case of one full rotation.
>
> Um ... I certainly have a doubt about this. My understanding is that hard disks are already optimized for sustained sequential throughput. I have a really hard time believing Seagate, WD, etc. designed their drives such that you read/write one track, then pause and wait for a full rotation, then read/write one track, and wait again, and so forth. This would limit the drive to approx. 50% duty cycle, and the market is very competitive. Yes, I am really quite sure, without any knowledge at all, that the drive mfgrs are intelligent enough to map the logical blocks in such a way that sequential reads/writes which are larger than a single track will not suffer such a huge penalty. Just a small penalty to jump up one track and wait for a few degrees of rotation, not 360 degrees.

I'm afraid you got me wrong here. Of course the drives are optimized for sequential reads/writes. If you give the drive a single read or write that is larger than one track, the drive acts exactly as you described. The same holds if you give the drive multiple smaller consecutive reads/writes in advance (NCQ/TCQ), so that the drive can coalesce them into one big op.

But this is not what happens in the case of ZFS/ZIL with a single application. The application requests a synchronous op. This request goes down into ZFS, which in turn allocates a ZIL block, writes it to the disk and issues a cache flush. Only after the cache flush completes can ZFS acknowledge the op to the application. Now the application can issue the next op, for which ZFS will again allocate a ZIL block, probably immediately after the previous one. It writes the block and issues a flush. But in the meantime the head has traveled some sectors down the track. To physically write the block, the drive of course has to wait until the sector is under the head again, which means waiting nearly one full rotation. If ZFS had chosen a block appropriately further down the track, the probability would have been high that the head had not yet passed it, and the block could have been written without a big rotational delay.
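To put rough numbers on that rotational penalty: at an assumed 10,000 rpm, one revolution takes 6 ms, so strictly linear, flush-bounded slog writes top out around 166 per second, while rotation-aware placement leaves only the per-command setup time (about 0.5 ms elsewhere in this thread) and lands near 2000 per second. A small back-of-the-envelope illustration, with both latencies assumed rather than measured here:

/* Back-of-the-envelope sync-write rates for a 10k rpm drive. */
#include <stdio.h>

int main(void)
{
        double rpm = 10000.0;
        double rotation_ms = 60.0 * 1000.0 / rpm;   /* 6 ms per revolution */
        double setup_ms = 0.5;                      /* command setup / electronics */

        /* Strictly linear ZIL writes: each flush waits ~one full rotation. */
        printf("linear slog:   ~%.0f sync writes/s\n", 1000.0 / rotation_ms);

        /* Rotation-aware placement: only the setup time remains. */
        printf("head-tracking: ~%.0f sync writes/s\n", 1000.0 / setup_ms);
        return 0;
}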
[zfs-discuss] creating a fast ZIL device for $200
Recently I've been reading through the ZIL/slog discussion and have the impression that a lot of folks here are (like me) interested in a viable solution for a cheap, fast and reliable ZIL device. I think I can provide such a solution for about $200, but it involves a lot of development work.

The basic idea: the main problem when using a HDD as a ZIL device is the cache flushes in combination with the linear write pattern of the ZIL. This leads to a whole rotation of the platter after each write, because by the time the first write returns, the head is already past the sector that will be written next. My idea goes as follows: don't write linearly. Track the rotation and write to the position the head will hit next. This might be done by a re-mapping layer or integrated into ZFS. This works only because ZIL devices are basically write-only; reads from such a device will be horribly slow.

I have done some testing and am quite enthusiastic. If I take a decent SAS disk (like the Hitachi Ultrastar C10K300), I can raise the synchronous write performance from 166 writes/s to about 2000 writes/s (!). 2000 IOPS is more than sufficient for our production environment. Currently I'm implementing a re-mapping driver for this.

The reason I'm writing to this list is that I'd like to find support from the ZFS team, find sparring partners to discuss implementation details and algorithms and, most important, find testers! If there is interest, it would be great to build an official project around it. I'd be willing to contribute most of the code, but any help will be more than welcome.

So, anyone interested? :)

-- Arne Jansen
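A toy sketch of what such a re-mapping layer could look like: logically consecutive ZIL blocks are placed at whatever physical sector the predicted head position allows, and the choice is remembered so the rare log-replay reads can still find the data. Everything here is hypothetical and greatly simplified; it is not the driver being implemented.

/*
 * Toy re-mapping layer: logical slog block -> physical sector chosen at
 * write time.  A real layer would also stamp the logical block number into
 * each on-disk block so the map can be rebuilt during log replay after a
 * crash; here the map lives only in memory for illustration.
 */
#include <stdio.h>
#include <stdint.h>

#define SECTORS_PER_TRACK 1764
#define LOG_SLOTS         1024

static uint32_t log_to_phys[LOG_SLOTS];   /* logical slog block -> physical sector */
static uint32_t last_phys;

/* Stand-in for the head-position predictor: skip a fixed interleave ahead. */
static uint32_t head_predict(uint32_t prev)
{
        return (prev + 140) % SECTORS_PER_TRACK;
}

static void slog_write(uint32_t logical)
{
        uint32_t phys = head_predict(last_phys);

        log_to_phys[logical % LOG_SLOTS] = phys;
        last_phys = phys;
        /* ...the real code would write the block to 'phys' and flush the cache... */
        printf("logical %4u -> physical sector %4u\n", logical, phys);
}

int main(void)
{
        for (uint32_t i = 0; i < 8; i++)
                slog_write(i);
        return 0;
}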
Re: [zfs-discuss] creating a fast ZIL device for $200
Edward Ned Harvey wrote:
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of sensille
>
>> The basic idea: the main problem when using a HDD as a ZIL device is the cache flushes in combination with the linear write pattern of the ZIL. This leads to a whole rotation of the platter after each write, because by the time the first write returns, the head is already past the sector that will be written next. My idea goes as follows: don't write linearly. Track the rotation and write to the position the head will hit next. This might be done by a re-mapping layer or integrated into ZFS. This works only because ZIL devices are basically write-only. Reads from this device will be horribly slow.
>
> The reason why hard drives are less effective as dedicated log devices compared to such things as SSDs is the rotation of the hard drives; the physical time to seek a random block. There may be a possibility to use hard drives as dedicated log devices, cheaper than SSDs and with possibly comparable latency, if you can intelligently eliminate the random seek. If you have a way to tell the hard drive "write this data to whatever block happens to be available at minimum seek time" ...

Thanks for rephrasing my idea :) The only thing I'd like to point out is that ZFS doesn't do random writes on a slog, but nearly linear writes. This might even be hurting performance more than random writes, because you always hit the worst case of one full rotation.

> For rough estimates: assume the drive is using Zone Density Recording, like this: http://www.dewassoc.com/kbase/hard_drives/hard_disk_sector_structures.htm Suppose you're able to keep your hard drive head on the outer sectors. Suppose 1000 sectors per track (I have no idea if that's accurate, but at least according to the above article it was ballpark realistic in the year 2000). Suppose 10k rpm. Then the physical seek time could theoretically be brought down to as low as 10^-7 seconds. Of course that's not realistic: some sectors may already be used, and the electronics themselves could be a factor. But the point remains, the physical seek time can be effectively eliminated, at least in theory. And that was the year 2000.

The mentioned Hitachi disk (at least the one I have in my test machine) has 1764 sectors on head 1 and 1680 sectors on head 2 in the first zone, which has 50 tracks. I'm quite sure the limiting factor is the electronics. This disk needs the write about 140 sectors in advance. It may be that the servo information on the platters also has to be taken into account. Other disks don't behave that well: I tried with 1TB SATA disks, but they don't seem to have any predictable timing.

>> I have done some testing and am quite enthusiastic. If I take a decent SAS disk (like the Hitachi Ultrastar C10K300), I can raise the synchronous write performance from 166 writes/s to about 2000 writes/s (!). 2000 IOPS is more than sufficient for our production environment.
>
> Um ... Careful there. There are many apples, oranges, and bananas to be compared inaccurately against each other. When I measure IOPS of physical disks, with all the caches disabled, I get anywhere from 200 to 2400 for a single spindle disk (SAS 10k), and anywhere from 2000 to 6000 with an SSD (SATA), just depending on the benchmark configuration, because ZFS is doing all sorts of acceleration behind the scenes which makes the results vary *immensely* from some IOPS number that you look up online.

The measurement is simple: disable the write cache, write one sector, when that write returns calculate the next optimal sector to write to, write, calculate again ... This gives a quite stable result of about 2000 writes/s, or 0.5 ms average service time, single threaded. No ZFS involved, just pure disk performance.

> So you believe you can know the drive geometry, the instantaneous head position, and the next available physical block address in software? No need for special hardware? That's cool. I hope there aren't any gotchas as yet undiscovered.

Yes, I already did a mapping of several drives. I measured at least the track length, the interleave needed between two writes, and the interleave needed if a track-to-track seek is involved. Of course you can always learn more about a disk, but that's a good starting point.

-- Arne
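A minimal user-space version of that measurement loop, under the assumption that the drive's write cache has already been disabled and that a fixed interleave stands in for the real next-optimal-sector calculation. The device path and constants are hypothetical placeholders, and the program overwrites data, so it should only ever be pointed at a scratch disk:

/*
 * Synchronous single-sector writes to a raw device, each placed a fixed
 * interleave ahead of the previous one.  With the write cache disabled,
 * the achieved writes/s approximates the flush-bounded slog rate.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define SECTOR            512
#define SECTORS_PER_TRACK 1764      /* measured, drive-specific */
#define INTERLEAVE        140       /* sectors of lead the drive needs */
#define NWRITES           2000

int main(void)
{
        /* hypothetical raw device path; use a scratch disk only */
        int fd = open("/dev/rdsk/c0t1d0s0", O_WRONLY | O_DSYNC);
        if (fd < 0) { perror("open"); return 1; }

        char *buf;
        if (posix_memalign((void **)&buf, SECTOR, SECTOR) != 0)
                return 1;
        memset(buf, 0x5a, SECTOR);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        uint64_t sector = 0;
        for (int i = 0; i < NWRITES; i++) {
                if (pwrite(fd, buf, SECTOR, (off_t)(sector * SECTOR)) != SECTOR) {
                        perror("pwrite");
                        return 1;
                }
                /* jump far enough ahead that the head has not passed us yet */
                sector = (sector + INTERLEAVE) % SECTORS_PER_TRACK;
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d writes in %.2f s -> %.0f writes/s\n", NWRITES, secs, NWRITES / secs);

        close(fd);
        return 0;
}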
Re: [zfs-discuss] creating a fast ZIL device for $200
Bob Friesenhahn wrote:
> On Wed, 26 May 2010, sensille wrote:
>> The basic idea: the main problem when using a HDD as a ZIL device is the cache flushes in combination with the linear write pattern of the ZIL. This leads to a whole rotation of the platter after each write, because by the time the first write returns, the head is already past the sector that will be written next. My idea goes as follows: don't write linearly. Track the rotation and write to the position the head will hit next. This might be done by a re-mapping layer or integrated into ZFS. This works only because ZIL devices are basically write-only. Reads from this device will be horribly slow.
>
> I like your idea. It would require a profiling application to learn the physical geometry and timing of a given disk drive in order to save the configuration data for it. The timing could vary under heavy system load, so the data needs to be sent early enough that it will always be there when needed. The profiling application might need to drive a disk for several hours (or a day) in order to fully understand how it behaves.

A day is a good landmark. Currently the application runs several hours just to map the tracks. But there's lots of room for algorithms that measure and fine-tune on the fly: every write is also a measurement.

> Remapped failed sectors would cause this micro-timing to fail, but only for the remapped sectors.
>
> Bob

Of course you could detect those remapped sectors because of the failed timing and stop using them in the future :)

-- Arne
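A sketch of that outlier test: during profiling, any sector whose synchronous write takes far longer than its neighbours is treated as remapped (or otherwise unpredictable) and excluded from future use. The threshold and the sample data below are invented purely for illustration:

/* Blacklist sectors whose measured service time is a timing outlier. */
#include <stdio.h>
#include <stdbool.h>

#define NSECTORS 16

static bool blacklisted[NSECTORS];

static void mark_outliers(const double service_ms[], double expected_ms)
{
        for (int s = 0; s < NSECTORS; s++) {
                /* anything taking 3x the expected time is suspect */
                if (service_ms[s] > 3.0 * expected_ms) {
                        blacklisted[s] = true;
                        printf("sector %d: %.2f ms, skipping in future\n",
                            s, service_ms[s]);
                }
        }
}

int main(void)
{
        /* invented measurements: sector 9 behaves like a remapped sector */
        double ms[NSECTORS] = {
                0.5, 0.5, 0.6, 0.5, 0.5, 0.6, 0.5, 0.5,
                0.5, 6.2, 0.5, 0.6, 0.5, 0.5, 0.6, 0.5,
        };
        mark_outliers(ms, 0.5);
        return 0;
}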
Re: [zfs-discuss] New SSD options
Don wrote:
> With that in mind: is anyone using the new OCZ Vertex 2 SSDs as a ZIL? They're claiming 50k IOPS (4k write, aligned), 2 million hour MTBF, TRIM support, etc. That's more write IOPS than the ZEUS (40k IOPS, $) but at half the price of an Intel X25-E (3.3k IOPS, $400). Needless to say, I'd love to know if anyone has evaluated these drives to see if they make sense as a ZIL. For example, do they honor cache flush requests?

Are those sustained IOPS numbers? In my understanding, nearly the only relevant number is the number of cache flushes a drive can handle per second, as this determines my single-thread performance. Has anyone an idea what numbers I can expect from an Intel X25-E or an OCZ Vertex 2?

-Arne
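For a rough feel of why the flush rate dominates: with only one outstanding synchronous write at a time, the latency of a single write plus cache flush is the whole story, regardless of the headline IOPS figure. The latencies below are invented for illustration; they are not measurements of the drives named above:

/* Convert assumed per-flush latencies into single-thread sync IOPS. */
#include <stdio.h>

int main(void)
{
        double flush_latency_ms[] = { 0.25, 0.5, 1.0 };

        for (int i = 0; i < 3; i++)
                printf("%.2f ms per write+flush -> ~%.0f single-thread IOPS\n",
                    flush_latency_ms[i], 1000.0 / flush_latency_ms[i]);
        return 0;
}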