Re: [zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
2009/9/7 Ritesh Raj Sarraf r...@researchut.com:
> The Discard/Trim command is also available as part of the SCSI standard
> now. Now, if you look from a SAN perspective, you will need a little of
> both. Filesystems will need to be able to deallocate blocks and then the
> same should be triggered as a SCSI Trim to the Storage Controller. For a
> virtualized environment, the filesystem should be able to punch holes
> into virt image files. F_FREESP is only on XFS to my knowledge.

I found F_FREESP while looking through the OpenSolaris source, and it is
supported on all filesystems which implement VOP_SPACE. (I was initially
investigating what it would take to transform writes of zeroed blocks into
block frees on ZFS. Although it would not appear to be too difficult, I'm
not sure it would be worth complicating the code paths.)

> So how does ZFS tackle the above two problems?

At least for file-backed filesystems, ZFS already does its part. It is the
responsibility of the hypervisor to execute the mentioned fcntl(), whether
it is triggered by a TRIM or anything else. ZFS does not use TRIM itself,
though using it on top of files is not recommended anyway, nor is there a
need for virtualization purposes.

It does appear that the ATA TRIM command should be used with great care
though, or avoided altogether. Not only does it need to wait for the entire
queue to empty, it can cause a delay of ~100ms if you issue the commands
without enough elapsed time between them. (See the thread linked from the
article I mentioned.)

As far as I can tell, Solaris is missing the equivalent of a DKIOCDISCARD
ioctl(). Something like that should be implemented to allow recovery of
space on zvols and iSCSI backing stores. (Though the latter would also
require implementing the SCSI equivalent of TRIM, if I understand
correctly.)

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
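For the curious, the call itself is simple enough to sketch. This is my own
minimal example of the Solaris interface, not code from any hypervisor; the
fallback #define is there only so the sketch compiles where <fcntl.h> lacks
F_FREESP (11 is the value Solaris uses).

```c
/* Sketch: punch a hole in a file with fcntl(F_FREESP).  The struct
 * flock describes the byte range to free; l_len == 0 would mean "from
 * l_start through end of file".  Solaris interface, hedged for
 * portability below. */
#include <fcntl.h>
#include <unistd.h>

#ifndef F_FREESP
#define F_FREESP 11  /* Solaris value; defined here only so the sketch
                      * compiles on systems whose headers lack it */
#endif

/* Free (deallocate) len bytes starting at off; returns 0 or -1. */
int punch_hole(int fd, off_t off, off_t len)
{
    struct flock fl;

    fl.l_whence = SEEK_SET;  /* l_start is relative to start of file */
    fl.l_start  = off;
    fl.l_len    = len;
    return fcntl(fd, F_FREESP, &fl);
}
```

A hypervisor could issue something like this against the backing file when
a guest discards blocks; on filesystems implementing VOP_SPACE, the range
is returned to the pool.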
Re: [zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
2009/9/7 Richard Elling richard.ell...@gmail.com:
> On Sep 7, 2009, at 10:20 AM, Bob Friesenhahn wrote:
>> The purpose of the TRIM command is to allow the FLASH device to reclaim
>> and erase storage at its leisure so that the writer does not need to
>> wait for erasure once the device becomes full. Otherwise the FLASH
>> device does not know when an area stops being used.
>
> Yep, it is there to try and solve the problem of rewrites in a small
> area, smaller than the bulk erase size. While it would be trivial to
> traverse the spacemap and TRIM the free blocks, it might not improve
> performance for COW file systems. My crystal ball says smarter flash
> controllers or a form of managed flash will win and obviate the need for
> TRIM entirely.
> -- richard

I agree with this sentiment, although I still look forward to it being
obviated by a better memory technology instead, such as PRAM. In any case,
the ATA TRIM command may not be so useful after all, as it can't be queued:

http://lwn.net/Articles/347511/

As an aside, after a bit of digging I came across fcntl(F_FREESP). This
will at least allow you to put the sparse back into sparse files if you so
desire. Unfortunately, I don't see any way to do this for a zvol.

Chris
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
2009/9/2 Eric Sproul espr...@omniti.com:
> Adam, Is it known approximately when this bug was introduced? I have a
> system running snv_111 with a large raidz2 pool and I keep running into
> checksum errors though the drives are brand new. They are 2TB drives,
> but the pool is only about 14% used (~250G/drive across 13 drives). For
> a drive to develop hundreds of checksum errors at less than 20% capacity
> seems far above the expected error rate.

This may be 6826470, which was present for some time and fixed in b114. If
you have replaced a device on b111, you will see a lot of checksum errors,
even after the resilver completes. In fact, when I scrubbed my pool it
encountered so many that it transitioned the vdev to a faulted state. (I
had to run zpool clear periodically in a loop to allow it to finish.) See
the details at:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6826470

Chris
Re: [zfs-discuss] odd slog behavior on B70
On Nov 26, 2007 8:41 PM, Joe Little [EMAIL PROTECTED] wrote:
> I was playing with a Gigabyte i-RAM card and found out it works great to
> improve overall performance when there are a lot of writes of small
> files over NFS to such a ZFS pool. However, I noted a frequent situation
> in periods of long writes over NFS of small files. Here's a snippet of
> iostat during that period. sd15/sd16 are two iscsi targets, and sd17 is
> the iRAM card (2GB)
> [iostat output]
> During this time no operations can occur. I've attached the iRAM disk
> via a 3124 card. I've never seen a svc_t time of 0, and full wait and
> busy disk. Any clue what this might mean?

This sounds like 6566207: si3124 driver loses interrupts. I have observed
similar behavior as a result of this bug. Upgrading to build 71 or later
should fix things.

Chris
Re: [zfs-discuss] ZFS + DB + fragments
On Nov 19, 2007 10:08 PM, Richard Elling [EMAIL PROTECTED] wrote:
> James Cone wrote:
>> Hello All, Here's a possibly-silly proposal from a non-expert.
>> Summarising the problem:
>> - there's a conflict between small ZFS record size, for good random
>>   update performance, and large ZFS record size for good sequential
>>   read performance
>
> Poor sequential read performance has not been quantified.

I think this is a good point. A lot of solutions are being thrown around,
while the problems remain only theoretical at the moment. Conventional
solutions may not even be appropriate for something like ZFS.

The point that makes me skeptical is this: blocks do not need to be
logically contiguous to be (nearly) physically contiguous. As long as you
reallocate the blocks close to the originals, chances are that a scan of
the file will end up being mostly physically contiguous reads anyway.
ZFS's intelligent prefetching, along with the disk's track cache, should
allow for good performance even in this case. ZFS may or may not already
do this; I haven't checked.

Obviously, you won't want to keep a year's worth of snapshots, or run the
pool near capacity. With a few minor tweaks, though, it should work quite
well. Talking about fundamental ZFS design flaws at this point seems
unnecessary, to say the least.

Chris
Re: [zfs-discuss] Thoughts on CF/SSDs [was: ZFS - Use h/w raid or not?Thoughts.Considerations.]
On 6/2/07, Richard Elling [EMAIL PROTECTED] wrote:
> Chris Csanady wrote:
>> On 6/1/07, Frank Cusack [EMAIL PROTECTED] wrote:
>>> On June 1, 2007 9:44:23 AM -0700 Richard Elling [EMAIL PROTECTED] wrote:
>>> [...]
>>>> Semiconductor memories are accessed in parallel. Spinning disks are
>>>> accessed serially. Let's take a look at a few examples and see what
>>>> this looks like...
>>>>
>>>> Disk                         iops    bw  atime  MTBF       UER     endurance
>>>> SanDisk 32 GByte 2.5 SATA    7,450   67  0.11   2,000,000  10^-20  ?
>>>> SiliconSystems 8 GByte CF      500    8  2      4,000,000  10^-14  2,000,000
>>>> ...
>>> these are probably different technologies though? if cf cards aren't
>>> generally fast, then the sata device isn't a cf card just with a
>>> different form factor. or is the CF interface the limiting factor?
>>> also, isn't CF write very slow (relative to read)? if so, you should
>>> really show read vs write iops.
>> Most vendors don't list this, for obvious reasons. SanDisk is honest
>> enough to do so though, and the number is spectacularly bad: 15.
>
> For the SanDisk 32 GByte 2.5 SATA, write bandwidth is 47 MBytes/s --
> quite respectable.

I was quoting the random write IOPS number at 4kB. The theoretical
sequential write bandwidth is fine, but I don't think that 15 IOPS can be
considered respectable. They also list the number at 512kB, and it is
still only 16 IOPS.

This is probably an artifact of striping across a large number of flash
chips, each of which has a large page size. It is unknown how large a
transfer is required to actually reach that respectable sequential write
performance, though it probably won't happen often, if at all.

Chris
Re: [zfs-discuss] Thoughts on CF/SSDs [was: ZFS - Use h/w raid or not?Thoughts.Considerations.]
On 6/1/07, Frank Cusack [EMAIL PROTECTED] wrote:
> On June 1, 2007 9:44:23 AM -0700 Richard Elling [EMAIL PROTECTED] wrote:
> [...]
>> Semiconductor memories are accessed in parallel. Spinning disks are
>> accessed serially. Let's take a look at a few examples and see what
>> this looks like...
>>
>> Disk                         iops    bw  atime  MTBF       UER     endurance
>> SanDisk 32 GByte 2.5 SATA    7,450   67  0.11   2,000,000  10^-20  ?
>> SiliconSystems 8 GByte CF      500    8  2      4,000,000  10^-14  2,000,000
>> ...
> these are probably different technologies though? if cf cards aren't
> generally fast, then the sata device isn't a cf card just with a
> different form factor. or is the CF interface the limiting factor? also,
> isn't CF write very slow (relative to read)? if so, you should really
> show read vs write iops.

Most vendors don't list this, for obvious reasons. SanDisk is honest
enough to do so though, and the number is spectacularly bad: 15.

Chris
Re: [zfs-discuss] Zpool, RaidZ how it spreads its disk load?
On 5/7/07, Tony Galway [EMAIL PROTECTED] wrote:
> Greetings learned ZFS geeks & gurus, Yet another question comes from my
> continued ZFS performance testing. This has to do with zpool iostat, and
> the strangeness that I do see. I've created an eight (8) disk raidz pool
> from a Sun 3510 fibre array giving me a 465G volume.
>
> # zpool create tp raidz c4t600 ... 8 disks worth of zpool
> # zfs create tp/pool
> # zfs set recordsize=8k tp/pool
> # zfs set mountpoint=/pool tp/pool

This is a known problem, an interaction between the alignment requirements
imposed by RAID-Z and the small recordsize you have chosen. You can
effectively avoid it in most situations by choosing a RAID-Z stripe width
of 2^n+1 devices. For a fixed record size, this will work perfectly well.
Even so, there will still be cases where small files cause problems for
RAID-Z. While it does not affect many people right now, I think it will
become a more serious issue when disks move to 4k sectors.

I think the reason for the alignment constraint was to ensure that the
stranded space was accounted for; otherwise it would cause problems as the
pool fills up. (Consider a 3-device RAID-Z where only one data sector and
one parity sector are written; the third sector in that stripe is
essentially dead space.)

Would it be possible (or worthwhile) to make the allocator aware of this
dead space, rather than imposing the alignment requirements? Something
like a concept of tentatively allocated space in the allocator, which
would be managed based on the requirements of the vdev. Using such a
mechanism, it could coalesce the space if possible for allocations. Of
course, it would also have to convert the misaligned bits back into
tentatively allocated space when blocks are freed. While I expect this may
require changes which would not easily be backward compatible, the
alignment on RAID-Z has always felt a bit wrong.

While the more severe effects can be addressed by also writing out the
dead space, that will not address the uneven placement of data and parity
across the stripes. Any thoughts?

Chris
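To make the 2^n+1 suggestion concrete: an N-disk single-parity RAID-Z has
N-1 data columns per stripe row, so a record lands on whole rows exactly
when its sector count divides evenly by N-1. A quick sketch of the
arithmetic (my own back-of-the-envelope check, assuming 512-byte sectors;
this is not code from ZFS):

```c
/* Sketch: does a record of `recordsize` bytes end on a stripe-row
 * boundary of an `ndisks`-wide single-parity RAID-Z?  One column per
 * row holds parity, leaving ndisks-1 data columns.  Assumes 512-byte
 * sectors. */

#define SECTOR 512

int fills_whole_rows(int recordsize, int ndisks)
{
    int data_cols = ndisks - 1;           /* one column holds parity */
    int sectors   = recordsize / SECTOR;  /* data sectors per record */
    return sectors % data_cols == 0;      /* 1 = no partial final row */
}
```

With 8k records, fills_whole_rows(8192, 8) returns 0 — 16 sectors spread
over 7 data columns always leave a partial row — while
fills_whole_rows(8192, 5) returns 1, since a 2^2+1 width gives 4 data
columns and power-of-two recordsizes divide evenly.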
Re: [zfs-discuss] RAID-Z resilver broken
On 4/11/07, Marco van Lienen [EMAIL PROTECTED] wrote:
> A colleague at work and I have followed the same steps, including
> running a digest on the /test/file, on a SXCE b61 build today and can
> confirm the exact same, and disturbing, result. My colleague mentioned
> to me he has witnessed the same 'resilver' behavior on builds 57 and 60.

Thank you for taking the time to confirm this. As long as people are aware
of it, it shouldn't really cause much trouble. Still, it gave me quite a
scare after replacing a bad disk.

> I don't think these checksum errors are a good sign. The sha1 digest on
> the file *does* show to be the same, so the question arises: is the
> resilver process truly broken (even though in this test case the test
> file does appear to be unchanged, based on the sha1 digest)?

ZFS still has good data, so this is not unexpected. It is interesting,
though, that it managed to read all of the data without finding any bad
blocks. I just tried this with a more complex directory structure, and
other variations, with the same result. It is bizarre, but ZFS only
manages to use the good data in normal operation.

To see exactly what is damaged, try the following instead. After the
resilver completes, zpool offline a known good device of the RAID-Z. Then
do a scrub, or try to read the data. Afterward, zpool status -v will
display a list of the damaged files, which is very nice.

Chris
[zfs-discuss] RAID-Z resilver broken
In a recent message, I detailed the excessive checksum errors that occurred
after replacing a disk. It seems that after a resilver completes, it
leaves a large number of blocks in the pool which fail to checksum
properly. Afterward, it is necessary to scrub the pool in order to correct
these errors.

After some testing, it seems that this only occurs with RAID-Z. The same
behavior can be observed on both snv_59 and snv_60, though I do not have
any other installs to test at the moment. The following commands should
reproduce this result in a small test pool.

mkdir /tmp/test
mkfile 64m /tmp/test/0 /tmp/test/1
zpool create test raidz /tmp/test/0 /tmp/test/1
mkfile 16m /test/file
zpool export test
rm /tmp/test/0
zpool import -d /tmp/test test
mkfile 64m /tmp/test/0
zpool replace test /tmp/test/0

# wait for the resilver to complete, and observe that it completes
# successfully
zpool status test

# scrub the pool
zpool scrub test

# watch the checksum errors accumulate as the scrub progresses
zpool status test

Chris
[zfs-discuss] Re: Excessive checksum errors...
I have some further data now, and I don't think that it is a hardware
problem. Halfway through the scrub, I rebooted and exchanged the
controller and cable used with the bad disk. After restarting the scrub,
it proceeded error-free until about the point where it left off, and then
it resumed the exact same behavior.

Basically, almost exactly one fourth of the amount of data that is read
from the resilvered disk is written back to the same disk. This was
constant throughout the scrub. Meanwhile, fmd writes ereport.fs.zfs.io
events to errlog until the disk is full. At this point, it seems as if the
resilvering code in snv_60 is broken, and one fourth of the data was not
reconstructed properly.

I have an iosnoop trace of the disk in question, if anyone is interested.
I will try to make some sense of it, but that probably won't happen today.

Chris
[zfs-discuss] Excessive checksum errors...
After replacing a bad disk and waiting for the resilver to complete, I
started a scrub of the pool. Currently, I have the pool mounted read-only,
yet almost a quarter of the I/O is writes to the new disk. In fact, it
looks like there are so many checksum errors that zpool doesn't even list
them properly:

  pool: p
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are
        unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 18.71% done, 2h17m to go
config:

        NAME        STATE     READ WRITE CKSUM
        p           ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
            c3d0    ONLINE       0     0     0
            c5d0    ONLINE       0     0     0
            c4d0    ONLINE       0     0  231.5

errors: No known data errors

I assume that should be followed by a K. Is my brand new replacement disk
really returning gigabyte after gigabyte of silently corrupted data? I
find that quite amazing, and I thought that I would inquire here. This is
on snv_60.

Chris
Re: [zfs-discuss] ZFS and Firewire/USB enclosures
It looks like the following bug is still open:

6424510 usb ignores DKIOCFLUSHWRITECACHE

Until it is fixed, I wouldn't even consider using ZFS on USB storage. Even
so, not all bridge boards (Firewire included) implement this command.
Unless you can verify that it functions correctly, it is safer to avoid
USB and Firewire altogether, as you risk serious corruption in the event
of a power loss. This holds true for any filesystem.

Another good reason is that scrubs and rebuilds will take a long time.
Unfortunately, I don't think that port multipliers are yet supported in
the SATA framework, so probably the best bet is a large enclosure with
internal SATA disks.

Chris
Re: [zfs-discuss] Implementing fbarrier() on ZFS
2007/2/12, Frank Hofmann [EMAIL PROTECTED]:
> On Mon, 12 Feb 2007, Peter Schuller wrote:
>> Hello, Often fsync() is used not because one cares that some piece of
>> data is on stable storage, but because one wants to ensure the
>> subsequent I/O operations are performed after previous I/O operations
>> are on stable storage. In these cases the latency introduced by an
>> fsync() is completely unnecessary. An fbarrier() or similar would be
>> extremely useful to get the proper semantics while still allowing for
>> better performance than what you get with fsync(). My assumption has
>> been that this has not been traditionally implemented for reasons of
>> implementation complexity. Given ZFS's copy-on-write transactional
>> model, would it not be almost trivial to implement fbarrier()?
>> Basically just choose to wrap up the transaction at the point of
>> fbarrier() and that's it. Am I missing something?
>
> How do you guarantee that the disk driver and/or the disk firmware
> doesn't reorder writes? The only guarantee for in-order writes, on
> actual storage level, is to complete the outstanding ones before issuing
> new ones.

This is true for NCQ with SATA, but SCSI also supports ordered tags, so it
should not be necessary. At least, that is my understanding.

Chris
Re: [zfs-discuss] Implementing fbarrier() on ZFS
2007/2/12, Frank Hofmann [EMAIL PROTECTED]:
> On Mon, 12 Feb 2007, Chris Csanady wrote:
>> This is true for NCQ with SATA, but SCSI also supports ordered tags, so
>> it should not be necessary. At least, that is my understanding.
>
> Except that ZFS doesn't talk SCSI, it talks to a target driver. And that
> one may or may not treat async I/O requests dispatched via its
> strategy() entry point as strictly ordered / non-coalescible /
> non-cancellable. See e.g. disksort(9F).

Yes; however, this functionality could be exposed through the target
driver. While the implementation does not (yet) take full advantage of
ordered tags, Linux does provide an interface to do this:

http://www.mjmwired.net/kernel/Documentation/block/barrier.txt

From a correctness standpoint, the interface seems worthwhile, even if the
mechanisms are never implemented. It just feels wrong to execute a
synchronize-cache command from ZFS when often that is not the intention.
The changes to ZFS itself would be very minor.

That said, actually implementing the underlying mechanisms may not be
worth the trouble. It is only a matter of time before disks have fast
non-volatile memory like PRAM or MRAM, and then the need for explicit
cache management basically disappears.

Chris
Re: Re: [zfs-discuss] Cheap ZFS homeserver.
2007/1/19, [EMAIL PROTECTED] [EMAIL PROTECTED]:
>> ACHI SATA ... probably look at Intel boards instead.
>
> What's ACHI? I didn't see anything useful on google or wikipedia ... is
> it a chipset? The issue I take with Intel is their chips are either
> grossly power hungry/hot (anything pre-Pentium M) or ungodly expensive
> (Core, Core 2). They don't have anything that competes with a 65W AM2
> Athlon64.

Oops, I seem to have transposed some characters while typing that. It is
AHCI: Advanced Host Controller Interface. Many hardware vendors are
standardizing on this specification for SATA interfaces. The most common
implementations are found in the Intel ICH6R, ICH7R, and ICH8R south
bridges, but others from VIA, Nvidia, SiS and JMicron are planned or
available. See:

http://www.opensolaris.org/os/community/device_drivers/projects/AHCI

The driver is fairly new and does not support much at present, but it
should be a safe bet in the future. As for PCIe cards, I think the only
options are the two-port SiI 3132 and JMicron based cards.

Sorry about the typo.

Chris
Re: Re: [zfs-discuss] Re: Re: Re[2]: Re: Dead drives and ZFS
On 11/14/06, Robert Milkowski [EMAIL PROTECTED] wrote:
> Hello Rainer, Tuesday, November 14, 2006, 4:43:32 AM, you wrote:
> RH> Sorry for the delay... No, it doesn't. The format command shows the
> RH> drive, but zpool import does not find any pools. I've also used the
> RH> detached bad SATA drive for testing; no go. Once a drive is
> RH> detached, there seems to be no (not enough?) information about the
> RH> pool that allows import.
>
> Aha, you did zpool detach - sorry I missed it. Then zpool import won't
> show you any pools to import from such a disk. I agree with you it would
> be useful to do so.

After examining the source, it clearly wipes the vdev label during a
detach. I suppose it does this so that the machine can't get confused at a
later date. It would be nice if the detach simply renamed something,
rather than destroying the pool, though. At the very least, the manual
page ought to reflect the destructive nature of the detach command.

That said, it looks as if the code only zeros the first uberblock, so the
data may yet be recoverable. In order to reconstruct the pool, I think you
would need to replace the vdev labels with ones from another of your
mirrors, and possibly the EFI label so that the GUID matched. Then,
corrupt the first uberblock and pray that it imports. (It may be necessary
to modify the txg in the labels as well, though I have already speculated
enough...) Can anyone say for certain?

Chris
[zfs-discuss] snv_51 hangs
I have experienced two hangs so far with snv_51. I was running snv_46
until recently, and it was rock solid, as were earlier builds. Is there a
way for me to force a panic? It is an x86 machine, with only a serial
console.

Chris
Re: [zfs-discuss] snv_51 hangs
Thank you all for the very quick and informative responses. If it happens
again, I will try to get a core out of it.

Chris
Re: [zfs-discuss] Re: Dead drives and ZFS
On 11/11/06, Rainer Heilke [EMAIL PROTECTED] wrote:
> Nope. I get no pools available to import. I think that detaching the
> drive cleared any pool information/headers on the drive, which is why I
> can't figure out a way to get the data/pool back.

Did you also export the original pool before you tried this? I believe it
was said that you can't import a pool if one of the same name already
exists on the system. (Of course, you should pull the other disks as well,
or it may not import the right pool.)

In any case, I don't think this is expected behavior. It should be
possible to remove part of a mirror, or simply pull a disk, without
affecting the contents. (Assuming that the pool is a single N-way mirror
vdev.)

The manual page for zpool offline indicates that no further attempts are
made to read or write the device, so the data should still be there. While
it does not elaborate on the result of a zpool detach, I would expect it
to behave the same way, by leaving the data intact.

If it does not work that way, that seems like a serious bug. Removing a
disk should not destroy a complete replica, whether it is through zpool
detach and attach, or zpool offline and replace.

Chris
Re: Re[2]: [zfs-discuss] Re: Dead drives and ZFS
On 11/11/06, Robert Milkowski [EMAIL PROTECTED] wrote:
> CC> The manual page for zpool offline indicates that no further attempts
> CC> are made to read or write the device, so the data should still be
> CC> there. While it does not elaborate on the result of a zpool detach,
> CC> I would expect it to behave the same way, by leaving the data
> CC> intact.
>
> He did use detach, not offline. Also I'm not sure offline works the way
> you describe (but I guess it does). If it does, 'zpool import' should
> show a pool to import; however, I'm not sure if there won't be a problem
> with the pool id (not pool name).

Perhaps I have confused the issue of identical pool ids and identical pool
names. Still, I expect there will be issues trying to import an orphaned
part of an existing pool.

This seems like an area which could use a bit of work. While a single
mirror vdev pool is a corner case, it probably will be fairly common. If a
disk is intact when removed from the mirror, whether through detach,
offline, or simply being pulled, it should remain importable somehow.
(Perhaps it does, after addressing the identical pool id issue, though I
haven't tried.)

In a similar way, it may be nice to allow detach to work on multiple
devices atomically. For instance, if you have a set of mirror vdevs, you
could then split off an entire replica of the pool and move it to another
machine. I think you can do this today by simply exporting the pool,
though, so it is not a major inconvenience.

Chris
Re: [zfs-discuss] system hangs on POST after giving zfs a drive
On 10/11/06, John Sonnenschein [EMAIL PROTECTED] wrote:
> As it turns out now, something about the drive is causing the machine to
> hang on POST. It boots fine if the drive isn't connected, and if I hot
> plug the drive after the machine boots, it works fine, but the computer
> simply will not boot with the drive attached. Any thoughts on
> resolution?

Are you using an nForce4 based board? I have a Tyan K8E, and it hangs on
boot if there are EFI-labeled disks present. (Which is what ZFS uses when
you give it whole disks.)

If this is the problem, configure the BIOS settings so as to not probe
those disks, and then it should boot. Of course, it won't be possible to
boot off those disks, but they should work fine in Solaris.

Chris
Re: [zfs-discuss] Re: system hangs on POST after giving zfs a drive
On 10/12/06, John Sonnenschein [EMAIL PROTECTED] wrote:
> Well, it's an SiS 960 board, and it appears my only option to turn off
> probing of the drives is to enable RAID mode (which makes them
> inaccessible by the OS).

I think the option is in the standard CMOS setup section, and allows you
to set the disk geometry, translation, etc. There should be options for
each disk, something like: auto detect/manual/not present. Hopefully your
BIOS has a similar setting.

> What would be my next (cheapest) option, a proper SATA add-in card? I've
> heard good things about the Silicon Image 3132 based cards, but I'm not
> sure if they'll still leave my BIOS in the same position if I run the
> drives in ATA mode.

The best supported card is the Supermicro AOC-SAT2-MV8. Drivers are also
present for the SiI 3132/3124 based cards in the SATA framework, but they
haven't been updated in a while, and don't support NCQ yet. Either way,
unless you are using a recent Nevada build, any controller will only run
in compatibility mode.

Chris
[zfs-discuss] Metaslab alignment on RAID-Z
I believe I have tracked down the problem discussed in the "low disk
performance" thread. It seems that an alignment issue will cause small
file/block performance to be abysmal on a RAID-Z.

metaslab_ff_alloc() seems to naturally align all allocations, and so all
blocks will be aligned to asize on a RAID-Z. At certain block sizes which
do not produce full-width writes, contiguous writes will leave holes of
dead space in the RAID-Z. What I have observed with the iosnoop dtrace
script is that the first disks aggregate the single-block writes, while
the last disk(s) are forced to do numerous writes every other sector. If
you would like to reproduce this, simply copy a large file to a
recordsize=4k filesystem on a 4 disk RAID-Z.

It would probably fix the problem if this dead space were explicitly
zeroed to allow the writes to be aggregated, but that would be an
egregious hack. If the alignment constraints could be relaxed, though,
that should improve the parity distribution, as well as get rid of the
dead space and associated problem.

Chris
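The imbalance is easy to see on paper. The following is my own model of
the layout, not code from ZFS: a 4k record on a 4-disk RAID-Z1 is 8 data
sectors plus 3 parity sectors, and the 11-sector allocation is rounded up
to 12 so it stays aligned; parity leads each row of the block, and the
rounding sector is dead space.

```c
/* Sketch: tally where aligned 4k records land on a 4-disk RAID-Z1.
 * Each record: 8 data sectors + ceil(8/3) = 3 parity sectors = 11,
 * rounded up to an asize of 12 (a multiple of nparity + 1).  This is
 * my reading of the layout, not ZFS source code. */

#define NDISKS 4
#define ASIZE  12  /* 11 sectors used, plus 1 sector of padding */

/* Count parity sectors, data sectors, and padding holes per disk for
 * `records` back-to-back records.  Arrays must hold NDISKS zeroed ints. */
void simulate(int records, int parity[], int data[], int hole[])
{
    for (int r = 0; r < records; r++) {
        int start = r * ASIZE;               /* allocator aligns to asize */
        for (int s = 0; s < ASIZE; s++) {
            int col = (start + s) % NDISKS;  /* disk this sector lands on */
            if (s >= 11)
                hole[col]++;                 /* rounding: dead space */
            else if (s % NDISKS == 0)
                parity[col]++;               /* parity leads each row */
            else
                data[col]++;
        }
    }
}
```

Tallying 100 records this way puts every parity sector on the first disk,
spreads the data over the others, and leaves a hole on the last disk after
every record — which lines up with the busy, fragmented last disk seen in
the iosnoop trace. Take the exact numbers with a grain of salt; only the
shape of the imbalance matters here.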
Re: [zfs-discuss] Metaslab alignment on RAID-Z
On 9/26/06, Richard Elling - PAE [EMAIL PROTECTED] wrote:
> Chris Csanady wrote:
>> What I have observed with the iosnoop dtrace script is that the first
>> disks aggregate the single block writes, while the last disk(s) are
>> forced to do numerous writes every other sector. If you would like to
>> reproduce this, simply copy a large file to a recordsize=4k filesystem
>> on a 4 disk RAID-Z.
>
> Why would I want to set recordsize=4k if I'm using large files? For that
> matter, why would I ever want to use a recordsize=4k? Is there a
> database which needs 4k record sizes?

Sorry, I wasn't very clear about the reasoning for this. It is not
something that you would normally do, but it generates just the right
combination of block size and stripe width to make the problem very
apparent. It is also possible to encounter this on a filesystem with the
default recordsize, and I have observed the effect while extracting a
large archive of sources. Still, it was never bad enough for my uses to be
anything more than a curiosity.

However, while trying to rsync 100M ~1k files onto a 4 disk RAID-Z, Gino
Ruopolo seemingly stumbled upon this worst-case performance scenario.
(Though, unlike my example, it is also possible to end up with holes in
the second column.)

Also, while it may be a small error, could these stranded sectors throw
off the space accounting enough to cause problems when a pool is nearly
full?

Chris
Re: [zfs-discuss] Re: Re: low disk performance
On 9/22/06, Gino Ruopolo [EMAIL PROTECTED] wrote:
> Update ... iostat output during zpool scrub
>
>                     extended device statistics
> device    r/s    w/s   Mr/s   Mw/s  wait  actv  svc_t  %w  %b
> sd34      2.0  395.2    0.1    0.6   0.0  34.8   87.7   0 100
> sd35     21.0  312.2    1.2    2.9   0.0  26.0   78.0   0  79
> sd36     20.0    1.0    1.2    0.0   0.0   0.7   31.4   0  13
> sd37     20.0    1.0    1.0    0.0   0.0   0.7   35.1   0  21
>
> sd34 is always at 100% ...

What is strange is that this is almost all writes. Do you have the rsync
running at this time? A scrub alone should not look like this.

I have also observed some strange behavior on a 4 disk raidz, which may be
related. It is possible to saturate a single disk while all the others in
the same vdev are completely idle. It is very easy to reproduce, so try
the following: create a filesystem with a 4k recordsize on a 4 disk raidz.
Now, copy a large file to it while observing 'iostat -xnz 5'.

This is the worst case I have been able to produce, but the imbalance is
apparent even with an untar at the default recordsize. Interestingly, it
is always the last disk in the set which is busy. This behavior does not
occur with a 3 disk raidz, nor is it as bad with other record sizes.

Chris
[zfs-discuss] Bandwidth disparity between NFS and ZFS
While dd'ing to an NFS filesystem, half of the bandwidth is unaccounted
for. What dd reports amounts to almost exactly half of what zpool iostat
or iostat show, even after accounting for the overhead of the two mirrored
vdevs. Would anyone care to guess where it may be going?

(This is measured over 10 second intervals. For 1 second intervals, the
bandwidth to the disks jumps around from 40MB/s to 240MB/s.)

With a local dd, everything adds up. This is with a b41 server and a Mac
OS X 10.4 NFS client. I have verified that the bandwidth at the network
interface is approximately that reported by dd, so the issue would appear
to be within the server. Any suggestions would be welcome.

Chris
Re: [zfs-discuss] hard drive write cache
On 5/26/06, Bart Smaalders [EMAIL PROTECTED] wrote:
> There are two failure modes associated with disk write caches:

Failure modes aside, is there any benefit to a write cache when command
queueing is available? It seems that the primary advantage is in allowing
old ATA hardware to issue writes in an asynchronous manner. Beyond that,
it doesn't really make much sense if the queue is deep enough.

> ZFS enables the write cache and flushes it when committing transaction
> groups; this ensures that all of a transaction group appears or does not
> appear on disk.

How often is the write cache flushed, and is it synchronous? Unless I am
misunderstanding something, wouldn't it be better to use ordered tags and
avoid cache flushes altogether?

Also, does ZFS disable the disk read cache? It seems that this would be
counterproductive with ZFS.

Chris