[zfs-discuss] ZFS Load-balancing over vdevs vs. real disks?
Hi,

my ZFS pool for my home server is a bit unusual:

  pool: pelotillehue
 state: ONLINE
 scrub: scrub completed with 0 errors on Mon Aug 21 06:10:13 2006
config:

        NAME          STATE     READ WRITE CKSUM
        pelotillehue  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0d1s5    ONLINE       0     0     0
            c1d0s5    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0d0s3    ONLINE       0     0     0
            c0d1s3    ONLINE       0     0     0
            c1d0s3    ONLINE       0     0     0
            c1d1s3    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0d1s4    ONLINE       0     0     0
            c1d0s4    ONLINE       0     0     0
            c1d1s4    ONLINE       0     0     0

The reason is simple: I have 4 differently-sized disks (80, 80, 200 and 250 GB.
It's a home server, so I crammed whatever I could find elsewhere into that
box :) ) and my goal was to create the biggest pool possible while retaining
some level of redundancy.

The config above therefore groups the biggest slices that can be created on
all four disks into the 4-disk RAID-Z vdev, then the biggest slices that can
be created on 3 disks into the 3-disk RAID-Z, and the two large slices that
remain are mirrored. It's like playing Tetris with disk slices... But the pool
can tolerate 1 broken disk and it gave me maximum storage capacity, so be it.

This means that we have one pool with 3 vdevs that access up to 3 different
slices on the same physical disk.

Question: Does ZFS consider the underlying physical disks when load-balancing,
or does it only load-balance across vdevs, thereby potentially overloading
physical disks with up to 3 parallel requests per physical disk at once?

I'm pretty sure ZFS is very intelligent and will do the right thing, but a
confirmation would be nice here.

Best regards,
   Constantin

--
Constantin Gonzalez                        Sun Microsystems GmbH, Germany
Platform Technology Group, Client Solutions            http://www.sun.de/
Tel.: +49 89/4 60 08-25 91                http://blogs.sun.com/constantin/
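For reference, a layout like the one above could be built with a single zpool
create along the following lines. This is only a sketch: the slice names are
taken from the status output above, and -f would be needed because the
top-level vdevs use mismatched replication levels.

# Sketch: recreate the mixed mirror/raidz layout shown above.
# -f is required because the top-level vdevs do not share a replication level.
zpool create -f pelotillehue \
    mirror c0d1s5 c1d0s5 \
    raidz c0d0s3 c0d1s3 c1d0s3 c1d1s3 \
    raidz c0d1s4 c1d0s4 c1d1s4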
[zfs-discuss] destroyed pools signatures
Hello zfs-discuss,

I've got many disks in a JBOD (100) and while doing tests there are a lot of
destroyed pools. Then some disks are re-used to be part of new pools.

Now if I do 'zpool import -D' I can see a lot of destroyed pools in a state
that I can't import them anyway (like only two disks left from a previously
much larger raid-z group, etc.). It's getting messy.

It would be nice to have a command to 'clear' such disks - remove the ZFS
signatures so nothing will show up for those disks.

What do you think?

--
Best regards,
 Robert                          mailto:[EMAIL PROTECTED]
                                 http://milek.blogspot.com
[zfs-discuss] pool ID
Hello zfs-discuss,

It looks like I can't get the pool ID once a pool is imported. IMHO 'zpool
show' should display it as well.

--
Best regards,
 Robert                          mailto:[EMAIL PROTECTED]
                                 http://milek.blogspot.com
Re: [zfs-discuss] destroyed pools signatures
Hi Robert,

Maybe this RFE would help alleviate your problem:

6417135 need generic way to dissociate disk or slice from it's filesystem
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6417135

-r

Robert Milkowski writes:
 > Hello zfs-discuss,
 >
 > I've got many disks in a JBOD (100) and while doing tests there are a lot
 > of destroyed pools. Then some disks are re-used to be part of new pools.
 >
 > Now if I do 'zpool import -D' I can see a lot of destroyed pools in a
 > state that I can't import them anyway (like only two disks left from a
 > previously much larger raid-z group, etc.). It's getting messy.
 >
 > It would be nice to have a command to 'clear' such disks - remove the ZFS
 > signatures so nothing will show up for those disks.
 >
 > What do you think?
Re: [zfs-discuss] Proposal: user-defined properties
Eric Schrock writes:

 Following up on a string of related proposals, here is another draft
 proposal for user-defined properties. As usual, all feedback and comments
 are welcome. The prototype is finished, and I would expect the code to be
 integrated sometime within the next month.

 - Eric

 INTRODUCTION

 ZFS currently supports a well-defined set of properties for managing ZFS
 datasets. These properties represent either read-only statistics exported
 by the ZFS framework ('available', 'compressratio', etc.) or editable
 properties which affect the behavior of ZFS ('compression', 'readonly',
 etc.). While these properties provide a structured way to interact with
 ZFS, a common request is to allow unstructured properties to be attached to
 ZFS datasets. This is covered by the following RFE:

 6281585 user defined properties

 This would allow administrators to add annotations to datasets, as well as
 allowing ISVs to store application-specific settings that interact with
 individual datasets.

 DETAILS

 This proposal adds a new classification of ZFS properties known as 'user
 properties'. The existing native properties will remain, as they provide
 additional semantics (mainly validation) which are closely tied to the
 underlying implementation.

 Any property which contains a colon (':') is defined as a 'user property'.
 The name can contain alphanumeric characters, plus the following special
 characters: ':', '-', '.', '_'. User properties are always strings, and are
 always inherited. No additional validation is done on the contents.
 Properties are set and retrieved through the standard mechanisms: 'zfs
 set', 'zfs get', and 'zfs inherit'. Inheriting a property which is not set
 in any parent is equivalent to clearing the property, as there is no
 default value for user-defined properties.

 It is expected that the colon will serve two purposes: to distinguish user
 properties from native properties, and to provide an (unenforced) namespace
 for user properties. For example, it is hoped that properties are defined
 as 'module:property', to group properties together and to provide a larger
 namespace for logical separation of properties. No enforcement of this
 namespace is done by ZFS, however, and the empty string is valid on both
 sides of the colon.

 EXAMPLES

 # zfs set local:department=12345 test
 # zfs get -r local:department test
 NAME      PROPERTY          VALUE    SOURCE
 test      local:department  12345    local
 test/foo  local:department  12345    inherited from test
 # zfs list -o name,local:department
 NAME      LOCAL:DEPARTMENT
 test      12345
 test/foo  12345
 # zfs set local:department=67890 test/foo
 # zfs inherit local:department test
 # zfs get -s local -r all test
 NAME      PROPERTY          VALUE    SOURCE
 test/foo  local:department  67890    local
 # zfs list -o name,local:department
 NAME      LOCAL:DEPARTMENT
 test      -
 test/foo  67890

 MANPAGE CHANGES

 TBD

Great. We might need something to 'destroy' those properties, locally and
recursively? Is the empty string a valid VALUE, and does this need to be
spelled out?

-r
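On the question of destroying user properties: under the proposal as written,
clearing should already be expressible with 'zfs inherit', since inheriting a
property that no ancestor defines removes it. A sketch, assuming 'zfs inherit'
accepts -r the way other recursive zfs subcommands do:

# Sketch: removing a user property with the mechanisms in the proposal.
zfs inherit local:department test/foo    # drop the local value on one dataset
zfs inherit -r local:department test     # clear it on test and every descendant
zfs get -r local:department test         # shows '-' once nothing defines it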
[zfs-discuss] Re: ZFS write performance problem with compression set to ON
I've a few questions:

- Does 'zpool iostat' report numbers from the top of the ZFS stack or at the
bottom? I've correlated the zpool iostat numbers with the system iostat
numbers and they match up. This tells me the numbers are from the 'bottom' of
the ZFS stack, right? Having said that, it would be nice to have zpool iostat
return numbers at the top of the stack. This becomes relevant when we have
compression=on.

- Secondly, I did some more tests and I find the same read waves and the
consistent write throughput. I've been reading another thread on this forum
about Niagara and compression, where Matt Ahrens noted that compression is
single-threaded at this time. Further, he stated that there may be a bug fix
released to use multiple threads. I eagerly await the fix.

Thanks again for a great feature. Looking forward to more fun stuff out of
Sun and you, Mr. Bonwick.
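For anyone who wants to repeat the comparison, correlating the two views can
be done along these lines (the pool name here is hypothetical):

# In one terminal: per-pool numbers as reported by ZFS
zpool iostat tank 5
# In another terminal: per-device numbers from the disk drivers
iostat -xnz 5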
[zfs-discuss] ZFS questions with mirrors
IHAC that is asking the following. Any thoughts would be appreciated.

Take two drives, zpool to make a mirror. Remove a drive - and the server
HANGS. Power off and reboot the server, and everything comes up cleanly.

Take the same two drives (still Solaris 10). Install Veritas Volume Manager
(4.1). Mirror the two drives. Remove a drive - everything is still running.
Replace the drive, everything still working. No outage.

So the big questions to Tech Support:

1. Is this a known property of ZFS? That when a drive from a hot-swap system
   is removed, the server hangs? (We were attempting to simulate a drive
   failure.)

2. Or is this just because it was an E450? I.e., would removing a zfs mirror
   disk (unexpected hardware removal as opposed to using zfs to remove the
   disk) on a V240 or V480 cause the same problem?

3. What could we expect if a drive mysteriously failed during operation of a
   server with a zfs mirror? Would the server hang like it did during
   testing? How can we test this?

4. If it is a known property of zfs, is there a date when it is expected to
   be fixed (if ever)?

Peter

PS: I may not be on this alias so please respond to me directly

--
Peter Wilk - OS/Security Support
Sun Microsystems
1 Network Drive, P.O. Box 4004
Burlington, Massachusetts 01803-0904
1-800-USA-4SUN, opt 1, opt 1, case number#
Email: [EMAIL PROTECTED]
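For reference, the test described boils down to something like this (device
and pool names are hypothetical):

# Build the two-drive mirror under test, then physically pull one drive.
zpool create tank mirror c1t2d0 c1t3d0
zpool status tank    # both sides ONLINE before the drive is removed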
RE: [zfs-discuss] ZFS Filesystem Corruption
I agree with you, but only 50%. Mirroring will only mask the problem and will
delay the fs corruption. (Depending on how zfs responds to data corruption -
does it go back and recheck the blocks later, or just mark them bad?)

The problem lies somewhere in hardware, but certainly not in the disks. I
have over 20 machines exhibiting the same behavior. If I put a RAID card in
between, the problem disappears altogether.

...Sanjaya

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Friday, August 18, 2006 11:59 AM
To: Srivastava, Sanjaya
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] ZFS Filesystem Corruption

Srivastava, Sanjaya wrote:
 > I have been seeing data corruption on the ZFS filesystem. Here are some
 > details. The machine is running s10 on the x86 platform with a single
 > 160 GB SATA disk. (root on s0 and zfs on s7)

I'd wager that it is a hardware problem. Personally, I've had less than
satisfactory reliability experiences with 160 GByte disks from a variety of
vendors. Try mirroring.
 -- richard
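If the pool really is on a single slice (s7) as described, a mirror can be
added after the fact with 'zpool attach' - a sketch, with hypothetical pool
and device names:

# Attach a second device to the existing single-device pool; ZFS turns it
# into a two-way mirror and resilvers the data automatically.
zpool attach tank c0t0d0s7 c1t0d0s7
zpool status tank    # watch the resilver complete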
Re: [zfs-discuss] ZFS questions with mirrors
The current behavior depends on the implementation of the driver and support
for hotplug events. When a drive is yanked, one of two things can happen:

- I/Os will fail, and any attempt to re-open the device will result in
  failure.

- I/Os will fail, but the device can continue to be opened by its existing
  path.

ZFS currently handles case #1 and will mark the device faulted, generating an
FMA fault in the process. Future ZFS/FMA integration will address case #2,
and is on the short list of features to address. In the meantime, you can
'zpool offline' the bad device to prevent ZFS from trying to access it.

That being said, the server should never hang - only proceed arbitrarily
slowly. When you say 'hang', what does that mean?

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
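In practice, the offline/replace sequence for a pulled device looks roughly
like this (pool and device names are hypothetical):

# Stop ZFS from retrying the missing device, then bring in a replacement.
zpool offline tank c1t1d0
# ...physically swap the drive, then resilver onto the new disk in the same slot...
zpool replace tank c1t1d0
# If the original device simply comes back healthy, 'zpool online' it instead.
zpool online tank c1t1d0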
Re: [zfs-discuss] ZFS Load-balancing over vdevs vs. real disks?
Constantin Gonzalez wrote:
 > Hi, my ZFS pool for my home server is a bit unusual:
 > [...]
 > This means that we have one pool with 3 vdevs that access up to 3
 > different slices on the same physical disk.
 >
 > Question: Does ZFS consider the underlying physical disks when
 > load-balancing, or does it only load-balance across vdevs, thereby
 > potentially overloading physical disks with up to 3 parallel requests
 > per physical disk at once?

ZFS only does dynamic striping across the (top-level) vdevs.

I understand why you set up your pool that way, but ZFS really likes whole
disks instead of slices. Trying to account for the fact that the devices are
really slices that are part of other vdevs seems overly complicated for the
gain achieved.

eric
[zfs-discuss] SCSI synchronize cache cmd
Hi,

I work on a support team for the Sun StorEdge 6920 and have a question about
the use of the SCSI synchronize cache command in Solaris and ZFS.

We have a bug in our 6920 software that exposes us to a memory leak when we
receive the SCSI synchronize cache command:

6456312 - SCSI Synchronize Cache Command is flawed

It will take some time for this bug fix to roll out to the field, so we need
to understand our exposure here. I have been informed that ZFS may use this
in S10 through a new sd/ssd ioctl.

Can anyone confirm that, as well as whether there is a config option to
disable this command?

Thanks,
Steve
Re: [zfs-discuss] destroyed pools signatures
Robert Milkowski wrote:
 > Hello zfs-discuss,
 >
 > I've got many disks in a JBOD (100) and while doing tests there are a lot
 > of destroyed pools. Then some disks are re-used to be part of new pools.
 >
 > Now if I do 'zpool import -D' I can see a lot of destroyed pools in a
 > state that I can't import them anyway (like only two disks left from a
 > previously much larger raid-z group, etc.). It's getting messy.
 >
 > It would be nice to have a command to 'clear' such disks - remove the ZFS
 > signatures so nothing will show up for those disks.
 >
 > What do you think?

That could be nice, though a way of doing that now is overwriting the labels
using dd (assuming you can overwrite all the devices from the now-defunct
destroyed pool).

eric
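A sketch of the dd approach, assuming the current on-disk label layout (two
label copies at the front of each device and two at the back) and a
hypothetical device name - double-check the target before running anything
like this:

# DANGER: this destroys data on the device -- verify the device name first.
DISK=/dev/rdsk/c2t1d0s0   # hypothetical device from the defunct pool
SIZE_MB=...               # device size in MB, e.g. prtvtoc sector count * 512 / 1048576
dd if=/dev/zero of=$DISK bs=1024k count=1                            # front labels
dd if=/dev/zero of=$DISK bs=1024k oseek=`expr $SIZE_MB - 1` count=1  # trailing labels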
[zfs-discuss] Re: SCSI synchronize cache cmd
Yes, ZFS uses this command very frequently. However, it only does this if the
whole disk is under the control of ZFS, I believe; so a workaround could be
to use slices rather than whole disks when creating a ZFS pool on a buggy
device.
Re: [zfs-discuss] Niagara and ZFS compression?
On 8/21/06, Richard Elling - PAE [EMAIL PROTECTED] wrote:
 > I haven't done measurements of this in years, but... I'll wager that
 > compression is memory bound, not CPU bound, for today's servers. A system
 > with low latency and high bandwidth memory will perform well
 > (UltraSPARC-T1). Threading may not help much on systems with a single
 > memory interface, but should help some on systems with multiple memory
 > interfaces (UltraSPARC-*, Opteron, Athlon FX, etc.)
 > -- richard

A rather simple test using CSQamp.pkg from the cooltools download site.
There's nothing magical about this file - it just happens to be a largish
file that I had on hand.

$ time gzip -c CSQamp.pkg > /dev/null

V40z:
real    0m15.339s
user    0m14.534s
sys     0m0.485s

V240:
real    0m35.825s
user    0m35.335s
sys     0m0.284s

T2000:
real    1m33.669s
user    1m32.768s
sys     0m0.881s

If I do 8 gzips in parallel:

V40z:
$ time ~/scripts/pgzip
real    0m32.632s
user    1m53.382s
sys     0m1.653s

V240:
$ time ~/scripts/pgzip
real    2m24.704s
user    4m42.430s
sys     0m2.305s

T2000:
$ time ~/scripts/pgzip
real    1m40.165s
user    13m10.475s
sys     0m6.578s

In each of the tests, the file was in /tmp. As expected, the V40z running 8
gzip processes (using 4 cores) took twice as long as it did running 1 (using
1 core). The V240 took 4 times as long (8 processes, 2 threads) as the
original, and the T2000 ran 8 (8 processes, 8 cores) in just about the same
amount of time as it ran 1.

For giggles, I ran 32 processes on the T2000 and came up with 5m4.585s
(real), 158m33.380s (user) and 42.484s (sys). In other words, the T2000
running 32 gzip processes had an elapsed time 3 times greater than with 8
processes. Even though the elapsed time jumped by 3x, the %sys jumped by
nearly 7x.

Here's a summary:

Server  gzips  Seconds   KB/sec
V40z        8   32.632   49,445
T2000      32  304.585   21,189
T2000       8  100.165   16,108
V40z        1   15.339   13,149
V240        8  144.704   11,150
V240        1   35.825    5,630
T2000       1   99.669    2,024

Clearly more threads doing compression with gzip give better performance than
a single thread. How that translates into memory speed vs. CPU speed, I am
not sure. However, I can't help but think that if my file server is
compressing every data block that it writes, it would be able to write more
data if it used a thread (or more) per core - I would come out ahead.

I am a firm believer that the next generation of compression commands and
libraries needs to use parallel algorithms. The simplest way to do this would
be to divide the data into chunks and farm out each chunk to various worker
threads. This will likely come at the cost of compression efficiency, but in
initial tests I have done this amounts to a very small difference in size
relative to the speedup achieved. Initial tests were with a chunk of C code
and zlib.

Mike
--
Mike Gerdts
http://mgerdts.blogspot.com/
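The ~/scripts/pgzip script is not shown in the post; a minimal reconstruction
of what it presumably does - N concurrent gzips over the same input, with the
count and file path assumed from the description above - might look like:

#!/usr/bin/ksh
# Hypothetical reconstruction of the pgzip test script: start N gzip
# processes in parallel over the same input file and wait for all to finish.
N=${1:-8}                    # number of concurrent gzips (8 in the test above)
FILE=${2:-/tmp/CSQamp.pkg}   # input file; the path is assumed from the post
i=0
while [ $i -lt $N ]; do
    gzip -c "$FILE" > /dev/null &
    i=$((i + 1))
done
wait                         # block until every background gzip has exited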
Re: [zfs-discuss] Re: SCSI synchronize cache cmd
On Mon, Aug 21, 2006 at 02:40:40PM -0700, Anton B. Rang wrote:
 > Yes, ZFS uses this command very frequently. However, it only does this if
 > the whole disk is under the control of ZFS, I believe; so a workaround
 > could be to use slices rather than whole disks when creating a ZFS pool
 > on a buggy device.

Actually, we issue the command whether we are using a whole disk or just a
slice. Short of an mdb script, there is no way to disable it. We are trying
to figure out ways to allow users to specify workarounds for broken hardware
without getting the ZFS code all messy as a result.

--Bill