Re: [zfs-discuss] ZFS API (again!), need quotactl(7I)
Jeff A. Earickson wrote: Hi, I was looking for the zfs system calls to check zfs quotas from within C code, analogous to the quotactl(7I) interface for UFS, and realized that there was nothing similar. Is anything like this planned? Why no public API for ZFS? Do I start making calls to zfs_prop_get_int(), like in the df code, to find out what I want? Will this blow up later? What is it that you are trying to do here ? -- Darren J Moffat
[zfs-discuss] Re: zfs share=.foo-internal.bar.edu on multiple interfaces?
I have a Sun x4200 with 4x gigabit ethernet NICs. I have several of them configured with distinct IP addresses on an internal (10.0.0.0) network. [off topic] Why are you using distinct IP addresses instead of IPMP ? [/off]
Re: [zfs-discuss] Proposal: multiple copies of user data
Mike Gerdts wrote: On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote: B. DESCRIPTION A new property will be added, 'copies', which specifies how many copies of the given filesystem will be stored. Its value must be 1, 2, or 3. Like other properties (eg. checksum, compression), it only affects newly-written data. As such, it is recommended that the 'copies' property be set at filesystem-creation time (eg. 'zfs create -o copies=2 pool/fs'). Is there anything in the works to compress (or encrypt) existing data after the fact? For example, a special option to scrub that causes the data to be re-written with the new properties could potentially do this. If so, this feature should subscribe to any generic framework provided by such an effort. While encryption of existing data is not in scope for the first ZFS crypto phase, I am being careful in the design to ensure that it can be done later if such a ZFS framework becomes available. The biggest problem I see with this is one of observability: if not all of the data is encrypted yet, what should the encryption property say? If it says encryption is on then the admin might think the data is safe, but if it says it is off that isn't the truth either, because some of it may be encrypted. -- Darren J Moffat
Re: [zfs-discuss] Proposal: multiple copies of user data
On 12/09/06, Matthew Ahrens [EMAIL PROTECTED] wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Your comments are appreciated! Flexibility is always nice, but this seems to greatly complicate things, both technically and conceptually (sometimes, good design is about what is left out :) ). Seems to me this lets you say 'files in this directory are x times more valuable than files elsewhere'. Others have covered some of my concerns (guarantees, cleanup, etc.). In addition:
* if I move a file somewhere else, does it become less important?
* zpools let you do that already (admittedly with less granularity, but *much* *much* more simply - and disk is cheap in my world)
* I don't need to do that :)
The only real use I'd see would be for redundant copies on a single disk, but then why wouldn't I just add a disk?
* disks are cheap, and creating a mirror from a single disk is very easy (and conceptually simple)
* *removing* a disk from a mirror pair is simple too - I make mistakes sometimes
* in my experience, disks fail. When you get bad errors on part of a disk, the disk is about to die.
* you can already create a/several zpools using disk partitions as vdevs. That's not all that safe, and I don't see this being any safer.
Sorry to be negative, but to me ZFS' simplicity is one of its major features. I think this provides a cool feature, but I question its usefulness. Quite possibly I just don't have the particular itch this is intended to scratch - is this a much requested feature? -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/
[zfs-discuss] Re: Proposal: multiple copies of user data
Hi Matt, Interesting proposal. Has there been any consideration of whether the free space reported for a ZFS filesystem would take the copies setting into account? Example:
zfs create mypool/nonredundant_data
zfs create mypool/redundant_data
df -h /mypool/nonredundant_data /mypool/redundant_data   (shows the same amount of free space)
zfs set copies=3 mypool/redundant_data
Would a new df of /mypool/redundant_data now show a different amount of free space (presumably 1/3 if different) than /mypool/nonredundant_data? As I understand the proposal, there's nothing new to do here. The filesystem might be 25% full, and it would be 25% full no matter how many copies of the filesystem there are. Similarly with quotas, I'd argue that the extra copies should not count towards a user's quota, since a quota is set on the filesystem. If I'm using 500M on a filesystem, I only have 500M of data no matter how many copies of it the administrator has decided to keep (cf. RAID1). I also don't see why a copy can't just be dropped if the copies value is decreased. Having said this, I don't see any value in the proposal at all, to be honest.
[zfs-discuss] Re: Re: Re: ZFS + rsync, backup on steroids.
Thank you all for your advice. In the end I chose to write 2 scripts (client and server) using port forwarding via SSH, for security reasons.
Re: [zfs-discuss] Re: Re: ZFS + rsync, backup on steroids.
On 12/09/2006, at 1:28 AM, Nicolas Williams wrote: On Mon, Sep 11, 2006 at 06:39:28AM -0700, Bui Minh Truong wrote: Does ssh -v tell you any more? I don't think the problem is ZFS send/recv. I think it takes a lot of time to connect over SSH. I tried to access SSH by typing: ssh remote_machine. It also takes several seconds (one or half a second) to connect. Maybe because of Solaris SSH. If you have 1000 files, it may take 1000 x 0.5 = 500 seconds. You're not making an SSH connection for every file though -- you're making an SSH connection for every snapshot. Now, if you're taking snapshots every second, and each SSH connection takes on the order of .5 seconds, then you might have a problem. So I gave up that solution. I wrote 2 pieces of perl script: client and server. Their roles are similar to ssh and sshd, then I can connect faster. But is that secure? Do you have any suggestions? Yes. First, let's see if SSH connection establishment latency is a real problem. Second, you could adapt your Perl scripts to work over a persistent SSH connection, e.g., by using SSH port forwarding: % ssh -N -L 12345:localhost:56789 remote-host Now you have a persistent SSH connection to remote-host that forwards connections to localhost:12345 to port 56789 on remote-host. So now you can use your Perl scripts more securely. It would be *so* nice if we could get some of the OpenSSH behaviour in this area. Recent versions include the ability to open a persistent connection and then automatically re-use it for subsequent connections to the same host/user.
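For anyone wanting to try the persistent-connection approach Boyd mentions, a minimal sketch (assuming OpenSSH 4.0 or later on the client side; the host and dataset names are made up) might look like:

    # ~/.ssh/config on the sending host:
    Host backuphost
        ControlMaster auto
        ControlPath ~/.ssh/ctl-%r@%h:%p

    # The first connection becomes the master; later per-snapshot commands
    # reuse its already-authenticated session instead of renegotiating:
    zfs send pool/fs@snap1 | ssh backuphost zfs receive backup/fs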
Re: [zfs-discuss] ZFS API (again!), need quotactl(7I)
On Tue, 12 Sep 2006, Darren J Moffat wrote: Jeff A. Earickson wrote: Hi, I was looking for the zfs system calls to check zfs quotas from within C code, analogous to the quotactl(7I) interface for UFS, and realized that there was nothing similar. Is anything like this planned? Why no public API for ZFS? Do I start making calls to zfs_prop_get_int(), like in the df code, to find out what I want? Will this blow up later? What is it that you are trying to do here? Modify the dovecot IMAP server so that it can get zfs quota information to be able to implement the QUOTA feature of the IMAP protocol (RFC 2087). In this case, pull the zfs quota numbers for the quota'd home directory/zfs filesystem, just like what quotactl() would do with UFS. I am really surprised that there is no libzfs API to query/set zfs filesystem properties. Doing a fork/exec just to execute a zfs get or zfs set is expensive and inelegant. Jeff Earickson Colby College
Re[2]: [zfs-discuss] ZFS and free space
Hello Mark, Monday, September 11, 2006, 4:25:40 PM, you wrote: MM Jeremy Teo wrote: Hello, how are writes distributed as the free space within a pool reaches a very small percentage? I understand that when free space is available, ZFS will batch writes and then issue them in sequential order, maximising write bandwidth. When free space reaches a minimum, what happens? Thanks! :) MM Just what you would expect to happen: MM As contiguous write space becomes unavailable, writes will become MM scattered and performance will degrade. More importantly: at this MM point ZFS will begin to heavily write-throttle applications in order MM to ensure that there is sufficient space on disk for the writes to MM complete. This means that there will be fewer writes to batch up MM in each transaction group for contiguous IO anyway. MM As with any file system, performance will tend to degrade at the MM limits. ZFS keeps a small overhead reserve (much like other file MM systems) to help mitigate this, but you will definitely see an MM impact. I hope it won't be a problem if space is getting low in a file system with a quota set, while there's plenty of space in the pool the file system is in, right? -- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com
Re: [zfs-discuss] Proposal: multiple copies of user data
On 12/09/06, Darren J Moffat [EMAIL PROTECTED] wrote: Dick Davies wrote: The only real use I'd see would be for redundant copies on a single disk, but then why wouldn't I just add a disk? Some systems have physical space for only a single drive - think most laptops! True - I'm a laptop user myself. But as I said, I'd assume the whole disk would fail (it does in my experience). If your hardware craps differently to mine, you could do a similar thing with partitions (or even files) as vdevs. Wouldn't be any less reliable. I'm still not Feeling the Magic on this one :) -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/
[zfs-discuss] Bizarre problem with ZFS filesystem
I'm experiencing a bizarre write performance problem while using a ZFS filesystem. Here are the relevant facts:

# zpool list
NAME      SIZE    USED   AVAIL    CAP  HEALTH  ALTROOT
mtdc     3.27T    502G   2.78T    14%  ONLINE  -
zfspool  68.5G   30.8G   37.7G    44%  ONLINE  -

# zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
mtdc                       503G  2.73T  24.5K  /mtdc
mtdc/sasmeta               397M   627M   397M  /sasmeta
mtdc/u001                 30.5G   226G  30.5G  /u001
mtdc/u002                 29.5G   227G  29.5G  /u002
mtdc/u003                 29.5G   226G  29.5G  /u003
mtdc/u004                 28.4G   228G  28.4G  /u004
mtdc/u005                 28.3G   228G  28.3G  /u005
mtdc/u006                 29.8G   226G  29.8G  /u006
mtdc/u007                 30.1G   226G  30.1G  /u007
mtdc/u008                 30.6G   225G  30.6G  /u008
mtdc/u099                  266G   502G   266G  /u099
zfspool                   30.8G  36.6G  24.5K  /zfspool
zfspool/apps              30.8G  33.2G  28.5G  /apps
zfspool/[EMAIL PROTECTED]  2.28G      -  29.8G  -
zfspool/home              15.4M  2.98G  15.4M  /home

# zfs list mtdc/u099
NAME       PROPERTY       VALUE                  SOURCE
mtdc/u099  type           filesystem             -
mtdc/u099  creation       Thu Aug 17 10:21 2006  -
mtdc/u099  used           267G                   -
mtdc/u099  available      501G                   -
mtdc/u099  referenced     267G                   -
mtdc/u099  compressratio  3.10x                  -
mtdc/u099  mounted        yes                    -
mtdc/u099  quota          768G                   local
mtdc/u099  reservation    none                   default
mtdc/u099  recordsize     128K                   default
mtdc/u099  mountpoint     /u099                  local
mtdc/u099  sharenfs       off                    default
mtdc/u099  checksum       on                     default
mtdc/u099  compression    on                     local
mtdc/u099  atime          off                    local
mtdc/u099  devices        on                     default
mtdc/u099  exec           on                     default
mtdc/u099  setuid         on                     default
mtdc/u099  readonly       off                    default
mtdc/u099  zoned          off                    default
mtdc/u099  snapdir        hidden                 default
mtdc/u099  aclmode        groupmask              default
mtdc/u099  aclinherit     secure                 default

No error messages listed by zpool or /var/opt/messages.

When I try to save a file the operation takes an inordinate amount of time, in the 30+ second range!!! I truss'd the vi session to see the hangup and it waits at the write system call.

# truss -p pid
read(0, 0xFFBFD0AF, 1)          (sleeping...)
read(0, " w", 1)                = 1
write(1, " w", 1)               = 1
read(0, " q", 1)                = 1
write(1, " q", 1)               = 1
read(0, 0xFFBFD00F, 1)          (sleeping...)
read(0, "\r", 1)                = 1
ioctl(0, I_STR, 0x000579F8)     Err#22 EINVAL
write(1, "\r", 1)               = 1
write(1, " d e l e t e m e", 10) = 10
stat64("deleteme", 0xFFBFCFA0)  = 0
creat("deleteme", 0666)         = 4
ioctl(2, TCSETSW, 0x00060C10)   = 0
write(4, " l f f j d\n", 6)     = 6    <-- still waiting while I type this message!!

This problem manifests itself only on this filesystem and not on the other ZFS filesystems on the same server built from the same ZFS pool. While I was awaiting completion of the above write I was able to start a new vi session in another window and saved a file to the /u001 filesystem without any problem. System loads are very low. Can anybody comment on this bizarre behavior?
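One low-risk way to narrow down where such a stall occurs (a sketch only; the pool name is taken from the post above) is to watch pool- and device-level I/O while reproducing the slow write:

    # Per-vdev I/O statistics for the affected pool, sampled every 5 seconds:
    zpool iostat -v mtdc 5
    # Device-level service times; look for one LUN with unusually high asvc_t:
    iostat -xn 5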
Re: [zfs-discuss] Proposal: multiple copies of user data
This proposal would benefit greatly by a problem statement. As it stands, it feels like a solution looking for a problem. The Introduction mentions a different problem and solution, but then pretends that there is value to this solution. The Description section mentions some benefits of 'copies' relative to the existing situation, but requires that the reader piece together the whole picture. And IMO there aren't enough pieces :-), i.e. so far I haven't seen sufficient justification for the added administrative complexity and potential for confusion, both administrative and user.

Matthew Ahrens wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Your comments are appreciated! --matt

A. INTRODUCTION

ZFS stores multiple copies of all metadata. This is accomplished by storing up to three DVAs (Disk Virtual Addresses) in each block pointer. This feature is known as Ditto Blocks. When possible, the copies are stored on different disks. See bug 6410698 "ZFS metadata needs to be more highly replicated (ditto blocks)" for details on ditto blocks. This case will extend this feature to allow system administrators to store multiple copies of user data as well, on a per-filesystem basis. These copies are in addition to any redundancy provided at the pool level (mirroring, raid-z, etc).

B. DESCRIPTION

A new property will be added, 'copies', which specifies how many copies of the given filesystem will be stored. Its value must be 1, 2, or 3. Like other properties (eg. checksum, compression), it only affects newly-written data. As such, it is recommended that the 'copies' property be set at filesystem-creation time (eg. 'zfs create -o copies=2 pool/fs'). The pool must be at least on-disk version 2 to use this feature (see 'zfs upgrade').

By default (copies=1), only two copies of most filesystem metadata are stored. However, if we are storing multiple copies of user data, then 3 copies (the maximum) of filesystem metadata will be stored.

This feature is similar to using mirroring, but differs in several important ways:

* Different filesystems in the same pool can have different numbers of copies.
* The storage configuration is not constrained as it is with mirroring (eg. you can have multiple copies even on a single disk).
* Mirroring offers slightly better performance, because only one DVA needs to be allocated.
* Mirroring offers slightly better redundancy, because one disk from each mirror can fail without data loss.

It is important to note that the copies provided by this feature are in addition to any redundancy provided by the pool configuration or the underlying storage. For example:

* In a pool with 2-way mirrors, a filesystem with copies=1 (the default) will be stored with 2 * 1 = 2 copies. The filesystem can tolerate any 1 disk failing without data loss.
* In a pool with 2-way mirrors, a filesystem with copies=3 will be stored with 2 * 3 = 6 copies. The filesystem can tolerate any 5 disks failing without data loss (assuming that there are at least ncopies=3 mirror groups).
* In a pool with single-parity raid-z a filesystem with copies=2 will be stored with 2 copies, each copy protected by its own parity block. The filesystem can tolerate any 3 disks failing without data loss (assuming that there are at least ncopies=2 raid-z groups).

C. MANPAGE CHANGES

*** zfs.man4    Tue Jun 13 10:15:38 2006
--- zfs.man5    Mon Sep 11 16:34:37 2006
***
*** 708,714
--- 708,725
      they are inherited.
+     copies=1 | 2 | 3
+         Controls the number of copies of data stored for this dataset.
+         These copies are in addition to any redundancy provided by the
+         pool (eg. mirroring or raid-z). The copies will be stored on
+         different disks if possible.
+
+         Changing this property only affects newly-written data.
+         Therefore, it is recommended that this property be set at
+         filesystem creation time, using the '-o copies=' option.
+
      Temporary Mountpoint Properties
          When a file system is mounted, either through mount(1M) for
          legacy mounts or the zfs mount command for normal file

D. REFERENCES

--
Jeff VICTOR              Sun Microsystems            jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq
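For concreteness, day-to-day use of the proposed property would presumably look something like the sketch below (hypothetical until the feature actually integrates; syntax is taken from the proposal itself):

    zfs create -o copies=2 pool/fs   # recommended: set the value at creation time
    zfs get copies pool/fs           # inspect the current setting
    zfs set copies=3 pool/fs         # only affects data written after the change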
Re: [zfs-discuss] Re: Re: ZFS + rsync, backup on steroids.
On Tue, Sep 12, 2006 at 05:57:33PM +1000, Boyd Adamson wrote: On 12/09/2006, at 1:28 AM, Nicolas Williams wrote: Now you have a persistent SSH connection to remote-host that forwards connections to localhost:12345 to port 56789 on remote-host. So now you can use your Perl scripts more securely. It would be *so* nice if we could get some of the OpenSSH behaviour in this area. Recent versions include the ability to open a persistent connection and then automatically re-use it for subsequent connections to the same host/user. There's an RFE for this.
[zfs-discuss] Re: Proposal: multiple copies of user data
The biggest problem I see with this is one of observability: if not all of the data is encrypted yet, what should the encryption property say? If it says encryption is on then the admin might think the data is safe, but if it says it is off that isn't the truth either, because some of it may still be encrypted. From a user interface perspective, I'd expect something like "Encryption: Being enabled, 75% complete" or "Encryption: Being disabled, 25% complete, about 2h23m remaining". I'm not sure how you'd map this into a property (or several), but it seems like on/off ought to be paired with "transitioning to on"/"transitioning to off" for any changes which aren't instantaneous.
Re: [zfs-discuss] Re: Proposal: multiple copies of user data
Anton B. Rang wrote: The biggest problem I see with this is one of observability: if not all of the data is encrypted yet, what should the encryption property say? If it says encryption is on then the admin might think the data is safe, but if it says it is off that isn't the truth either, because some of it may still be encrypted. From a user interface perspective, I'd expect something like "Encryption: Being enabled, 75% complete" or "Encryption: Being disabled, 25% complete, about 2h23m remaining" ... and if we are still writing to the file systems at that time? Maybe this really does need to be done with the file system locked. I'm not sure how you'd map this into a property (or several), but it seems like on/off ought to be paired with "transitioning to on"/"transitioning to off" for any changes which aren't instantaneous. Agreed, and checksum and compression would have the same issue if there was a mechanism to rewrite with the new checksums or compression settings. -- Darren J Moffat
[zfs-discuss] Re: Proposal: multiple copies of user data
True - I'm a laptop user myself. But as I said, I'd assume the whole disk would fail (it does in my experience). That's usually the case, but single-block failures can occur as well. They're rare (check the uncorrectable bit error rate specifications) but if they happen to hit a critical file, they're painful. On the other hand, multiple copies seems (to me) like a really expensive way to deal with this. ZFS is already using relatively large blocks, so it could add an erasure code on top of them and have far less storage overhead. If the assumed problem is multi-block failures in one area of the disk, I'd wonder how common this failure mode is; in my experience, multi-block failures are generally due to the head having touched the platter, in which case the whole drive will shortly fail. (In any case, multi-block failures could be addressed by spreading the data from a large block and using an erasure code.)
[zfs-discuss] Re: Re: Proposal: multiple copies of user data
And if we are still writing to the file systems at that time? New writes should be done according to the new state (if encryption is being enabled, all new writes are encrypted), since the goal is that eventually the whole disk will be in the new state. The completion percentage should probably reflect the existing data at the time that the state change is initiated, since new writes won't affect how much data has to be replaced. Maybe this really does need to be done with the file system locked. I don't see any technical reason to require that, and users expect better from us these days. :-) As you point out, checksum and compression will have the same issue once we have on-line changes for those as well. The framework ought to take care of this.
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Your comments are appreciated! I've read the proposal, and followed the discussion so far. I have to say that I don't see any particular need for this feature. Possibly there is a need for a different feature, in which the entire control of redundancy is moved away from the pool level and to the file or filesystem level. I definitely see the attraction of being able to specify by file and directory different degrees of reliability needed. However, the details of the feature actually proposed don't seem to satisfy the need for extra reliability at the level that drives people to employ redundancy; it doesn't provide a guaranty. I see no need for additional non-guaranteed reliability on top of the levels of guaranty provided by use of redundancy at the pool level. Furthermore, as others have pointed out, this feature would add a high degree of user-visible complexity. From what I've seen here so far, I think this is a bad idea and should not be added. -- David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/ RKBA: http://www.dd-b.net/carry/ Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/ Dragaera/Steven Brust: http://dragaera.info/
[zfs-discuss] System hang caused by a bad snapshot
I had a strange ZFS problem this morning. The entire system would hang when mounting the ZFS filesystems. After trial and error I determined that the problem was with one of the 2500 ZFS filesystems. When mounting that user's home the system would hang and need to be rebooted. After I removed the snapshots (9 of them) for that filesystem everything was fine. I don't know how to reproduce this and didn't get a crash dump. I don't remember seeing anything about this before so I wanted to report it and see if anyone has any ideas. The system is a Sun Fire 280R with 3GB of RAM running SXCR b40. The pool looks like this (I'm running a scrub currently):

# zpool status pool1
  pool: pool1
 state: ONLINE
 scrub: scrub in progress, 78.61% done, 0h18m to go
config:

        NAME         STATE     READ WRITE CKSUM
        pool1        ONLINE       0     0     0
          raidz      ONLINE       0     0     0
            c1t8d0   ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c1t11d0  ONLINE       0     0     0

errors: No known data errors

Ben
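For anyone hitting something similar, the snapshot cleanup Ben describes can be done along these lines (a sketch; the dataset and snapshot names are hypothetical):

    # List the snapshots belonging to the suspect filesystem:
    zfs list -t snapshot | grep '^pool1/home/baduser@'
    # Destroy them one at a time before retrying the mount:
    zfs destroy pool1/home/baduser@2006-09-01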
Re: [zfs-discuss] Proposal: multiple copies of user data
Darren J Moffat wrote: While encryption of existing data is not in scope for the first ZFS crypto phase I am being careful in the design to ensure that it can be done later if such a ZFS framework becomes available. The biggest problem I see with this is one of observability: if not all of the data is encrypted yet, what should the encryption property say? If it says encryption is on then the admin might think the data is safe, but if it says it is off that isn't the truth either, because some of it may be encrypted. I would also think that there's a significant problem around what to do about the previously unencrypted data. I assume that when performing a scrub to encrypt the data, the encrypted data will not be written on the same blocks previously used to hold the unencrypted data. As such, there's a very good chance that the unencrypted data would still be there for quite some time. You may not be able to access it through the filesystem, but someone with access to the raw disks may be able to recover at least parts of it. In this case, the scrub would not only have to write the encrypted data but also overwrite the unencrypted data (multiple times?). Neil
Re: [zfs-discuss] Proposal: multiple copies of user data
Neil A. Wilson wrote: Darren J Moffat wrote: While encryption of existing data is not in scope for the first ZFS crypto phase I am being careful in the design to ensure that it can be done later if such a ZFS framework becomes available. The biggest problem I see with this is one of observability: if not all of the data is encrypted yet, what should the encryption property say? If it says encryption is on then the admin might think the data is safe, but if it says it is off that isn't the truth either, because some of it may be encrypted. I would also think that there's a significant problem around what to do about the previously unencrypted data. I assume that when performing a scrub to encrypt the data, the encrypted data will not be written on the same blocks previously used to hold the unencrypted data. As such, there's a very good chance that the unencrypted data would still be there for quite some time. You may not be able to access it through the filesystem, but someone with access to the raw disks may be able to recover at least parts of it. In this case, the scrub would not only have to write the encrypted data but also overwrite the unencrypted data (multiple times?). Right, that is a very important issue. Would a ZFS scrub framework do copy on write? As you point out, if it doesn't then we still need to do something about the old clear text blocks because strings(1) over the raw disk will show them. I see the desire to have a knob that says "make this encrypted now" but I personally believe that it is actually better if you can make this choice at the time you create the ZFS data set. -- Darren J Moffat
Re: [zfs-discuss] ZFS API (again!), need quotactl(7I)
On Tue, Sep 12, 2006 at 07:23:00AM -0400, Jeff A. Earickson wrote: Modify the dovecot IMAP server so that it can get zfs quota information to be able to implement the QUOTA feature of the IMAP protocol (RFC 2087). In this case, pull the zfs quota numbers for the quota'd home directory/zfs filesystem, just like what quotactl() would do with UFS. I am really surprised that there is no libzfs API to query/set zfs filesystem properties. Doing a fork/exec just to execute a zfs get or zfs set is expensive and inelegant. The libzfs API will be made public at some point. However, we need to finish implementing the bulk of our planned features before we can feel comfortable with the interfaces. It will take a non-trivial amount of work to clean up all the interfaces as well as document them. It will be done eventually, but I wouldn't expect it any time soon - there are simply too many important things to get done first. If you don't care about unstable interfaces, you're welcome to use them as-is. If you want a stable interface, you are correct that the only way is through invoking 'zfs get' and 'zfs set'. - Eric -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
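Until libzfs stabilises, the stable fork/exec route Eric describes could be wrapped in something like this sketch (the dataset name is hypothetical, and the awk parsing may need adjusting per build):

    #!/bin/sh
    FS=pool/home/jearick
    # 'zfs get <prop> <fs>' prints a header line, then: NAME PROPERTY VALUE SOURCE
    QUOTA=`zfs get quota "$FS" | awk 'NR==2 {print $3}'`
    USED=`zfs get used "$FS" | awk 'NR==2 {print $3}'`
    echo "$FS: used $USED of quota $QUOTA"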
Re: [zfs-discuss] Proposal: multiple copies of user data
On Tue, Sep 12, 2006 at 10:36:30AM +0100, Darren J Moffat wrote: Mike Gerdts wrote: Is there anything in the works to compress (or encrypt) existing data after the fact? For example, a special option to scrub that causes the data to be re-written with the new properties could potentially do this. If so, this feature should subscribe to any generic framework provided by such an effort. While encryption of existing data is not in scope for the first ZFS crypto phase I am being careful in the design to ensure that it can be done later if such a ZFS framework becomes available. The biggest problem I see with this is one of observability: if not all of the data is encrypted yet, what should the encryption property say? If it says encryption is on then the admin might think the data is safe, but if it says it is off that isn't the truth either, because some of it may be encrypted.

I agree -- there needs to be a filesystem re-write option, something like a scrub but at the filesystem level. Things that might be accomplished through it:

- record size changes
- compression toggling / compression algorithm changes
- encryption/re-keying/alg. changes
- checksum alg. changes
- ditto blocking

What else? To me it's important that such scrubs not happen simply as a result of setting/changing a filesystem property, but it's also important that the user/admin be told that changing the property requires scrubbing in order to take effect for data/meta-data written before the change. Nico --
[zfs-discuss] sys_mount problem
Hello, I'm trying to set ZFS to work with RBAC so that I could manage all ZFS stuff w/out root. However, in my setup the sys_mount privilege is needed:

- without sys_mount:

vk199839:tessier:~$ zpool list
NAME    SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
local   264G  71.4G   193G  27%  ONLINE  -
vk199839:tessier:~$ profiles
ZFS Storage Management
ZFS File system Management
Basic Solaris User
All
vk199839:tessier:~$ ppriv $$
317: bash
flags = none
        E: basic,dtrace_kernel,dtrace_proc,dtrace_user
        I: basic,dtrace_kernel,dtrace_proc,dtrace_user
        P: basic,dtrace_kernel,dtrace_proc,dtrace_user
        L: all
vk199839:tessier:~$ pfexec zfs create local/testfs
cannot create 'local/testfs': permission denied
vk199839:tessier:~$ pfexec truss zfs create local/testfs
<snip>
zone_lookup(0x)                              = 0
ioctl(4, ZFS_IOC_OBJSET_STATS, 0x0804679C)   Err#2 ENOENT
ioctl(4, ZFS_IOC_CREATE, 0x0804679C)         Err#1 EPERM [sys_mount]
brk(0x080CA000)                              = 0
fstat64(2, 0x080457C0)                       = 0
cannot create 'write(2, " c a n n o t   c r e a t".., 15)      = 15
local/testfswrite(2, " l o c a l / t e s t f s", 12)           = 12
': permission deniedwrite(2, " ' :   p e r m i s s i o".., 20) = 20

- however with sys_mount:

vk199839:tessier:~$ ppriv $$
434: /usr/bin/bash
flags = none
        E: basic,dtrace_kernel,dtrace_proc,dtrace_user,sys_mount
        I: basic,dtrace_kernel,dtrace_proc,dtrace_user,sys_mount
        P: basic,dtrace_kernel,dtrace_proc,dtrace_user,sys_mount
        L: all
vk199839:tessier:~$ profiles
ZFS Storage Management
ZFS File system Management
Basic Solaris User
All
vk199839:tessier:~$ pfexec zfs create local/testfs
vk199839:tessier:~$ echo $?
0
vk199839:tessier:~$ zfs list | grep testfs
local/testfs   9K   191G   9K   /local/testfs
vk199839:tessier:~$ ls -ald /local/testfs/
drwxr-xr-x   2 root     sys    2 Sep 12 19:15 /local/testfs/
vk199839:tessier:~$ ls -ald /local/
drwxrwxr-x  14 vk199839 sys   16 Sep 12 19:15 /local/

Any idea what is wrong? Also, I would like the fs to be created with vk199839:sys and not with root:sys ownership. v.
Re: [zfs-discuss] ZFS and free space
Robert Milkowski wrote: Hello Mark, Monday, September 11, 2006, 4:25:40 PM, you wrote: MM Jeremy Teo wrote: Hello, how are writes distributed as the free space within a pool reaches a very small percentage? I understand that when free space is available, ZFS will batch writes and then issue them in sequential order, maximising write bandwidth. When free space reaches a minimum, what happens? Thanks! :) MM Just what you would expect to happen: MM As contiguous write space becomes unavailable, writes will become MM scattered and performance will degrade. More importantly: at this MM point ZFS will begin to heavily write-throttle applications in order MM to ensure that there is sufficient space on disk for the writes to MM complete. This means that there will be fewer writes to batch up MM in each transaction group for contiguous IO anyway. MM As with any file system, performance will tend to degrade at the MM limits. ZFS keeps a small overhead reserve (much like other file MM systems) to help mitigate this, but you will definitely see an MM impact. I hope it won't be a problem if space is getting low in a file system with a quota set, while there's plenty of space in the pool the file system is in, right? If you are running close to your quota, there will be a little bit of performance degradation, but not to the same degree as when running low on free space in the pool. The reason performance degrades when you're near your quota is that we aren't exactly sure how much space will be used until we actually get around to writing it out (due to compression, snapshots, etc). So we have to write things out in smaller batches (ie. flush out transaction groups more frequently than is optimal). --matt
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
This is simply not true. ZFS would protect against the same type of errors seen on an individual drive as it would on a pool made of HW raid LUN(s). It might be overkill to layer ZFS on top of a LUN that is already protected in some way by the devices internal RAID code but it does not make your data susceptible to HW errors caused by the storage subsystem's RAID algorithm, and slow down the I/O. I disagree, and vehemently at that. I maintain that if the HW RAID is used, the chance of data corruption is much higher, and ZFS would have a lot more repairing to do than it would if it were used directly on disks. Problems with HW RAID algorithms have been plaguing us for at least 15 years or more. The venerable Sun StorEdge T3 comes to mind! Further, while it is perfectly logical to me that doing RAID calculations twice is slower than doing it once, you maintain that is not the case, perhaps because one calculation is implemented in FW/HW? Well, why don't you simply try it out? Once with both RAID HW and ZFS, and once with just ZFS directly on the disks? RAID HW is very likely to have a slower CPU or CPUs than any modern system that ZFS will be running on. Even if we assume that the HW RAID's CPU is the same speed or faster than the CPU in the server, you still have TWICE the amount of work that has to be performed for every write. Once by the hardware and once by the software (ZFS). Caches might help some, but I fail to see how double the amount of work (and hidden, abstracted complexity) would be as fast or faster than just using ZFS directly on the disks.
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
There are also the speed enhancement provided by a HW raid array, and usually RAS too, compared to a native disk drive but the numbers on that are still coming in and being analyzed. (See previous threads.) Speed enhancements? What is the baseline of comparison? Hardware RAIDs can be banalized to two features: cache which does data reordering for optimal disk writes, and parity calculation which is being offloaded off of the server's CPU. But HW calculations still take time, and the in-between, battery backed cache serves to replace the individual disk caches, because of the traditional file system approach which had to have some assurance that the data made it to disk in one way or another. With ZFS however the in-between cache is obsolete, as individual disk caches can be used directly. I also openly question whether even the dedicated RAID HW is faster than the newest CPUs in modern servers. Unless there is something that I'm missing, I fail to see the benefit of a HW RAID in tandem with ZFS. In my view, this holds especially true when one gets into SAN storage like SE6920, EMC and Hitachi products. Furthermore, need I remind you of the buggy SE6920 firmware? I don't trust it as far as I can throw it. Or, let's put it this way: I trust Mr. Bonwick a whole lot more than some firmware writers.
Re: [zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
On September 12, 2006 11:35:54 AM -0700 UNIX admin [EMAIL PROTECTED] wrote: There are also the speed enhancement provided by a HW raid array, and usually RAS too, compared to a native disk drive but the numbers on that are still coming in and being analyzed. (See previous threads.) It would be nice if you would attribute your quotes. Maybe this is a limitation of the web interface? Speed enhancements? What is the baseline of comparison? Hardware RAIDs can be banalized to two features: cache which does data reordering for optimal disk writes and parity calculation which is being offloaded off of the server's CPU. But HW calculations still take time, and the in-between, battery backed cache serves to replace the individual disk caches, because of the traditional file system approach which had to have some assurance that the data made it to disk in one way or another. With ZFS however the in-between cache is obsolete, as individual disk caches can be used directly. I also openly question whether even the dedicated RAID HW is faster than the newest CPUs in modern servers. Unless there is something that I'm missing, I fail to see the benefit of a HW RAID in tandem with ZFS. In my view, this holds especially true when one gets into SAN storage like SE6920, EMC and Hitachi products. I agree with your basic point, that the HW RAID cache is obsoleted by zfs (which seems to be substantiated here by benchmark results), but I think you slightly mischaracterize its use. The speed of the HW RAID CPU is irrelevant; the parity is XOR which is extremely fast with any CPU when compared to disk write speed. What is relevant is, as Anton points out, the CPU cache on the host system. Parity calculations kill the cache and will hurt memory-intensive apps. So in this case, offloading it may help in the ufs case. (Not for zfs, as I understand from reading here, since checksums still have to be done. I would argue that this is *absolutely essential* [and zfs obsoletes all other filesystems] and therefore the gain in the ufs on HW RAID-5 case is worthless due to the correctness tradeoff.) It would be interesting to have a zfs enabled HBA to offload the checksum and parity calculations. How much of zfs would such an HBA have to understand? -frank
Re: [zfs-discuss] sys_mount problem
Vladimir Kotal wrote: Hello, I'm trying to set ZFS to work with RBAC so that I could manage all ZFS stuff w/out root. However, in my setup the sys_mount privilege is needed: - without sys_mount: Currently, anything in zfs that changes dataset configurations, such as file systems and properties, requires the sys_mount privilege. This actually comes from the secpolicy_zfs() function if you're curious.

ioctl(4, ZFS_IOC_CREATE, 0x0804679C)         Err#1 EPERM [sys_mount]
brk(0x080CA000)                              = 0
fstat64(2, 0x080457C0)                       = 0
cannot create 'write(2, " c a n n o t   c r e a t".., 15)      = 15
local/testfswrite(2, " l o c a l / t e s t f s", 12)           = 12
': permission deniedwrite(2, " ' :   p e r m i s s i o".., 20) = 20

- however with sys_mount:

vk199839:tessier:~$ ppriv $$
434: /usr/bin/bash
flags = none
        E: basic,dtrace_kernel,dtrace_proc,dtrace_user,sys_mount
        I: basic,dtrace_kernel,dtrace_proc,dtrace_user,sys_mount
        P: basic,dtrace_kernel,dtrace_proc,dtrace_user,sys_mount
        L: all
vk199839:tessier:~$ profiles
ZFS Storage Management
ZFS File system Management
Basic Solaris User
All
vk199839:tessier:~$ pfexec zfs create local/testfs
vk199839:tessier:~$ echo $?
0
vk199839:tessier:~$ zfs list | grep testfs
local/testfs   9K   191G   9K   /local/testfs
vk199839:tessier:~$ ls -ald /local/testfs/
drwxr-xr-x   2 root     sys    2 Sep 12 19:15 /local/testfs/
vk199839:tessier:~$ ls -ald /local/
drwxrwxr-x  14 vk199839 sys   16 Sep 12 19:15 /local/

Any idea what is wrong? Also, I would like the fs to be created with vk199839:sys and not with root:sys ownership. That will be changed once the delegated administration model is integrated. Once it is integrated, a file system's root node will be created with the uid/gid of the user that creates the file system. For more information on this check out the following thread: http://www.opensolaris.org/jive/thread.jspa?threadID=11130&tstart=15 -Mark
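Until the delegated administration work integrates, one interim (and fairly blunt) workaround is to grant the user sys_mount in their default privilege set; a sketch, using the username from the post:

    # Adds sys_mount to every process the user starts -- use with care:
    usermod -K defaultpriv=basic,dtrace_kernel,dtrace_proc,dtrace_user,sys_mount vk199839
    # Log out and back in, then verify:
    ppriv $$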
[zfs-discuss] Re: Proposal: multiple copies of user data
Take this for what it is: the opinion of someone who knows less about zfs than probably anyone else on this thread, but... I would like to add my support for this proposal. As I understand it, the reason for using ditto blocks on metadata is that maintaining their integrity is vital for the health of the filesystem, even if the zpool isn't mirrored or redundant in any way, i.e. laptops, or people who just don't or can't add another drive. One of the great things about zfs is that it protects not just against mechanical failure, but against silent data corruption. Having this available to laptop owners seems to me to be important to making zfs even more attractive. Granted, if you are running an enterprise-based fileserver, this probably isn't going to be your first choice for data protection. You will probably be using the other features of zfs like mirroring, raidz, raidz2, etc. Am I correct in assuming that having, say, 2 copies of your documents filesystem means that should silent data corruption occur, your data can be reconstructed? So you can leave your os and base applications with 1 copy, but your important data can be protected. In a way, this reminds me of intel's matrix raid but much cooler (it doesn't rely on a specific motherboard for one thing). I would also agree that utilities like 'ls' and quotas should report both copies and count against people's quotas. It just doesn't seem too hard to me to understand that because you have 2 copies, you halve the amount of available space. Just to reiterate, I think this would be an awesome feature! Celso. PS. Please feel free to correct me on any technical inaccuracies. I am trying to learn about zfs and Solaris 10 in general.
Re: [zfs-discuss] Re: Proposal: multiple copies of user data
On 12/09/06, Celso [EMAIL PROTECTED] wrote: One of the great things about zfs, is that it protects not just against mechanical failure, but against silent data corruption. Having this available to laptop owners seems to me to be important to making zfs even more attractive. I'm not arguing against that. I was just saying that *if* this was useful to you (and you were happy with the dubious resilience/performance benefits) you can already create mirrors/raidz on a single disk by using partitions as building blocks. There's no need to implement the proposal to gain that. Am I correct in assuming that having say 2 copies of your documents filesystem means should silent data corruption occur, your data can be reconstructed. So that you can leave your os and base applications with 1 copy, but your important data can be protected. Yes. -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/
Re: [zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
UNIX admin wrote: This is simply not true. ZFS would protect against the same type of errors seen on an individual drive as it would on a pool made of HW raid LUN(s). It might be overkill to layer ZFS on top of a LUN that is already protected in some way by the devices internal RAID code but it does not make your data susceptible to HW errors caused by the storage subsystem's RAID algorithm, and slow down the I/O. I disagree, and vehemently at that. I maintain that if the HW RAID is used, the chance of data corruption is much higher, and ZFS would have a lot more repairing to do than it would if it were used directly on disks. Problems with HW RAID algorithms have been plaguing us for at least 15 years or more. The venerable Sun StorEdge T3 comes to mind! Please expand on your logic. Remember that ZFS works on top of LUNs. A disk drive by itself is a LUN when added to a ZFS pool. A LUN can also be comprised of multiple disk drives striped together and presented to a host as one logical unit. Or a LUN can be offered by a virtualization gateway that in turn imports raid array LUNs that are really made up of individual disk drives. Or ... insert a million different ways to get a host something called a LUN that allows the host to read and write blocks. They could be really slow LUNs because they're two hamsters shuffling zeros and ones back and forth on little wheels. (OK, that might be too slow.) Outside of the cache enabling when entire disk drives are presented to the pool, ZFS doesn't care what the LUN is made of. ZFS reliability features are available and work on top of the LUNs you give it and the configuration you use. The type of LUN is inconsequential at the ZFS level. If I had 12 LUNs that were single disk drives and created a RAIDZ pool, it would have the same reliability at the ZFS level as if I presented it 12 LUNs that were really quad-mirrors from 12 independent hw raid arrays. You can make the argument that the 12 disk drive config is easier to use or that the overall reliability of the 12 quad-mirror LUNs system is higher, but from ZFS's point of view it's the same. It's happily writing blocks, checking checksums, reading things from the LUNs, etc. etc. etc. On top of that, disk drives are not some simple beast that just coughs up i/o when you want it to. A modern disk drive does all sorts of stuff under the covers to speed up i/o and - surprise - increase the reliability of the drive as much as possible. If you think you're really writing straight to disk you're not. Cache, ZBR, bad block re-allocation, all come into play. As for problems with specific raid arrays, including the T3, you are preaching to the choir, but I'm definitely not going to get into a pissing contest over specific components having more or fewer bugs than another. Further, while it is perfectly logical to me that doing RAID calculations twice is slower than doing it once, you maintain that is not the case, perhaps because one calculation is implemented in FW/HW? As the man says, "It depends." A really fast raid array might be responding to i/o requests faster than a single disk drive. It might not, given the nature of the i/o coming in. Don't think of it in terms of RAID calculations taking a certain amount of time. Think of it in terms of having to meet a specific set of requirements to manage your data. I'll be the first to say that if you're going to be putting ZFS on a desktop then a simple JBOD is a box to look at.
If you're going to look at an enterprise data center the answer is going to be different. That is something a lot of people on this alias seem to be missing out on. Stating ZFS on JBODs is the answer to everything is the punchline of the When all you have is a hammer... routine.
[zfs-discuss] Re: Re: Proposal: multiple copies of user data
On 12/09/06, Celso [EMAIL PROTECTED] wrote: One of the great things about zfs, is that it protects not just against mechanical failure, but against silent data corruption. Having this available to laptop owners seems to me to be important to making zfs even more attractive. I'm not arguing against that. I was just saying that *if* this was useful to you (and you were happy with the dubious resilience/performance benefits) you can already create mirrors/raidz on a single disk by using partitions as building blocks. There's no need to implement the proposal to gain that. It's not as granular though, is it? In the situation you describe: ...you split one disk in two. You then have effectively two partitions which you can then create a new mirrored zpool with. Then everything is mirrored. Correct? With ditto blocks, you can selectively add copies (seeing as how filesystems are so easy to create on zfs). If you are only concerned with copies of your important documents and email, why should /usr/bin be mirrored? That's my opinion anyway. I always enjoy choice, and I really believe this is a useful and flexible one. Celso
[zfs-discuss] marvel cards.. as recommended
So, people here recommended the Marvell cards, and one even provided a link to acquire them for SATA jbod support. Well, this is what the latest bits (B47) say:

Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx0: Could not attach, unsupported chip stepping or unable to get the chip stepping
Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx1: Could not attach, unsupported chip stepping or unable to get the chip stepping
Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx0: Could not attach, unsupported chip stepping or unable to get the chip stepping
Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx1: Could not attach, unsupported chip stepping or unable to get the chip stepping

Any takers on how to get around this one?
Re: [zfs-discuss] Memory Usage
Thomas Burns wrote: Hi, We have been using zfs for a couple of months now, and, overall, really like it. However, we have run into a major problem -- zfs's memory requirements crowd out our primary application. Ultimately, we have to reboot the machine so there is enough free memory to start the application. What I would like is: 1) A way to limit the size of the cache (a gig or two would be fine for us) 2) A way to clear the caches -- hopefully, something faster than rebooting the machine. Is there any way I can do either of these things? Thanks, Tom Burns Tom, What version of solaris are you running? In theory, ZFS should not be hogging your system memory to the point that it crowds out your primary applications... but this is still an area that we are working out the kinks in. If you could provide a core dump of the machine when it gets to the point that you can't start your app, it would help us. As to your questions: I will give you some ways to do these things, but these are not considered best practice: 1) You should be able to limit your cache max size by setting arc.c_max. It's currently initialized to phys-mem-size - 1GB. 2) First try unmounting/remounting your file system to clear the cache. If that doesn't work, try exporting/importing your pool. -Mark
Re: [zfs-discuss] Memory Usage
On Tue, 12 Sep 2006, Mark Maybee wrote: Thomas Burns wrote: Hi, We have been using zfs for a couple of months now, and, overall, really like it. However, we have run into a major problem -- zfs's memory requirements crowd out our primary application. Ultimately, we have to reboot the machine so there is enough free memory to start the application. What I would like is: 1) A way to limit the size of the cache (a gig or two would be fine for us) 2) A way to clear the caches -- hopefully, something faster than rebooting the machine. Is there any way I can do either of these things? Thanks, Tom Burns Tom, What version of solaris are you running? In theory, ZFS should not be hogging your system memory to the point that it crowds out your primary applications... but this is still an area that we are working out the kinks in. If you could provide a core dump of the machine when it gets to the point that you can't start your app, it would help us. As to your questions: I will give you some ways to do these things, but these are not considered best practice: 1) You should be able to limit your cache max size by setting arc.c_max. It's currently initialized to phys-mem-size - 1GB. 2) First try unmounting/remounting your file system to clear the cache. If that doesn't work, try exporting/importing your pool. Another nasty and risky workaround is to start making copies of a large file in /tmp while watching your available swap space carefully. When you hit the low memory water mark, ZFS will free up a snitload (technical term (TM)) of memory. Then immediately rm all the files you created in /tmp. You don't want to completely exhaust memory or you'll probably lose the system. Remember my first line: nasty and risky. Regards, Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED] Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006
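Spelled out, the nasty-and-risky workaround above amounts to something like this sketch (file names are arbitrary; watch swap the whole time and clean up immediately):

    swap -s                           # note the 'available' figure first
    cp /some/large/file /tmp/fill.1   # repeat with fill.2, fill.3, ...
    swap -s                           # ...re-checking after each copy
    rm /tmp/fill.*                    # remove them as soon as memory frees up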
Re: [zfs-discuss] marvel cards.. as recommended
Joe Little wrote: So, people here recommended the Marvell cards, and one even provided a link to acquire them for SATA jbod support. Well, this is what the latest bits (B47) say: Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx0: Could not attach, unsupported chip stepping or unable to get the chip stepping Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx1: Could not attach, unsupported chip stepping or unable to get the chip stepping Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx0: Could not attach, unsupported chip stepping or unable to get the chip stepping Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx1: Could not attach, unsupported chip stepping or unable to get the chip stepping Any takers on how to get around this one? You could start by providing the output from prtpicl -v and prtconf -v as well as /usr/X11/bin/scanpci -v -V 1 so we know which device you're actually having a problem with. Is the pci vendor+deviceid for that card listed in your /etc/driver_aliases file against the marvell88sx driver? James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
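A sketch of the kind of checks James is asking about. The pci id in the last line is only an example of what a Marvell 88SX-family controller might report, not necessarily this card's id:

    grep marvell88sx /etc/driver_aliases             # which pci ids the driver will bind to
    prtconf -pv | grep -i pci11ab                    # 0x11ab is Marvell's vendor id; shows what the card reports
    update_drv -a -i '"pci11ab,6081"' marvell88sx    # add a missing alias, if that turns out to be the gap

The last step only helps if the id is simply absent from driver_aliases; the "unsupported chip stepping" warnings suggest the driver is already finding the device and rejecting its revision, in which case a newer driver build is the more likely fix.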
Re: [zfs-discuss] Memory Usage
On Sep 12, 2006, at 2:04 PM, Mark Maybee wrote: Thomas Burns wrote: Hi, We have been using zfs for a couple of months now, and, overall, really like it. However, we have run into a major problem -- zfs's memory requirements crowd out our primary application. Ultimately, we have to reboot the machine so there is enough free memory to start the application. What I would like is: 1) A way to limit the size of the cache (a gig or two would be fine for us) 2) A way to clear the caches -- hopefully, something faster than rebooting the machine. Is there any way I can do either of these things? Thanks, Tom Burns Tom, What version of solaris are you running? In theory, ZFS should not be hogging your system memory to the point that it crowds out your primary applications... but this is still an area that we are working out the kinks in. If you could provide a core dump of the machine when it gets to the point that you can't start your app, it would help us. We are running the jun 06 version of solaris (10/6?). I don't have a core dump now -- but can probably get one in the next week or so. Where should I send it? Also, where do I set arc.c_max? In etc/system? Out of curiosity, why isn't limiting arc.c_max considered best practice (I just want to make sure I am not missing something about the effect limiting it will have)? My guess is that in our case (lots of small groups -- 50 people or less -- sharing files over the web) that file system caches are not that useful. The small groups mean that no one file gets used that often and, since access is over the web, their response time will be largely limited by their internet connection. Thanks a lot for the response! As to your questions; I will give you some ways to do these things, but these are not considered best practice: 1) You should be able to limit your cache max size by setting arc.c_max. Its currently initialized to be phys-mem-size - 1GB. 2) First try unmount/remounting your file system to clear the cache. If that doesn't work, try exporting/importing your pool. -Mark Tom Burns ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] How to NOT mount a ZFS storage pool/ZFS file system?
I currently have a system which has two ZFS storage pools. One of the pools is coming from a faulty piece of hardware. I would like to bring up our server mounting the storage pool which is okay and NOT mounting the one from the hardware with problems. Is there a simple way to NOT mount one of my ZFS storage pools? The system is currently down due to the disk issues from one of the above pools. Thanks, David This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to NOT mount a ZFS storage pool/ZFS file system?
zpool export On September 12, 2006 2:41:27 PM -0700 David Smith [EMAIL PROTECTED] wrote: I currently have a system which has two ZFS storage pools. One of the pools is coming from a faulty piece of hardware. I would like to bring up our server mounting the storage pool which is okay and NOT mounting the one from the hardware with problems. Is there a simple way to NOT mount one of my ZFS storage pools? The system is currently down due to the disk issues from one of the above pools. Thanks, David This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
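For the archives, a minimal sketch of the sequence (pool names are hypothetical):

    zpool export badpool      # drop the suspect pool so nothing tries to import or mount it at boot
    zpool status goodpool     # the healthy pool stays imported and mounts as usual
    zpool import              # later, with no arguments, lists pools available for import
    zpool import badpool      # bring the repaired pool back

If the suspect pool has datasets that will not unmount cleanly, zpool export -f forces the unmounts.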
Re: [zfs-discuss] Memory Usage
Thomas Burns wrote: On Sep 12, 2006, at 2:04 PM, Mark Maybee wrote: Thomas Burns wrote: Hi, We have been using zfs for a couple of months now, and, overall, really like it. However, we have run into a major problem -- zfs's memory requirements crowd out our primary application. Ultimately, we have to reboot the machine so there is enough free memory to start the application. What I would like is: 1) A way to limit the size of the cache (a gig or two would be fine for us) 2) A way to clear the caches -- hopefully, something faster than rebooting the machine. Is there any way I can do either of these things? Thanks, Tom Burns Tom, What version of solaris are you running? In theory, ZFS should not be hogging your system memory to the point that it crowds out your primary applications... but this is still an area that we are working out the kinks in. If you could provide a core dump of the machine when it gets to the point that you can't start your app, it would help us. We are running the jun 06 version of solaris (10/6?). I don't have a core dump now -- but can probably get one in the next week or so. Where should I send it? You can drop cores via ftp to: sunsolve.sun.com login as anonymous or ftp deposit into /cores Also, where do I set arc.c_max? In etc/system? Out of curiosity, why isn't limiting arc.c_max considered best practice (I just want to make sure I am not missing something about the effect limiting it will have)? My guess is that in our case (lots of small groups -- 50 people or less -- sharing files over the web) that file system caches are not that useful. The small groups mean that no one file gets used that often and, since access is over the web, their response time will be largely limited by their internet connection. We don't want users to need to tune a bunch of knobs to get performance out of ZFS. We want it to work well out of the box. So we are trying to discourage using these tunables, and instead figure out what the root problem is and fix it. There is really no reason why zfs shouldn't be able to adapt itself appropriately to the available memory. Thanks a lot for the response! As to your questions; I will give you some ways to do these things, but these are not considered best practice: 1) You should be able to limit your cache max size by setting arc.c_max. Its currently initialized to be phys-mem-size - 1GB. 2) First try unmount/remounting your file system to clear the cache. If that doesn't work, try exporting/importing your pool. -Mark Tom Burns ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Memory Usage
1) You should be able to limit your cache max size by setting arc.c_max. Its currently initialized to be phys-mem-size - 1GB. Mark's assertion that this is not a best practice is something of an understatement. ZFS was designed so that users/administrators wouldn't have to configure tunables to achieve optimal system performance. ZFS performance is still a work in progress. The problem with adjusting arc.c_max is that its definition may change from one release to another. It's an internal kernel variable, its existence isn't guaranteed. There are also no guarantees about the semantics of what a future arc.c_max might mean. It's possible that future implementations may change the definition such that reducing c_max has other unintended consequences. Unfortunately, at the present time this is probably the only way to limit the cache size. Mark and I are working on strategies to make sure that ZFS is a better citizen when it comes to memory usage and performance. Mark has recently made a number of changes which should help ZFS reduce its memory footprint. However, until these changes and others make it into a production build we're going to have to live with this inadvisable approach for adjusting the cache size. -j This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Proposal: multiple copies of user data
On 12/09/06, Celso [EMAIL PROTECTED] wrote: ...you split one disk in two. you then have effectively two partitions which you can then create a new mirrored zpool with. Then everything is mirrored. Correct? Everything in the filesystems in the pool, yes. With ditto blocks, you can selectively add copies (seeing as how filesystems are so easy to create on zfs). If you are only concerned with copies of your important documents and email, why should /usr/bin be mirrored? So my machine will boot if a disk fails. Which happened the other day :) -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
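For reference, the split-one-disk layout being discussed is just a mirror of two slices on the same spindle, something like the line below (device and slice names are examples). It buys protection against bad sectors and checksum errors, not against losing the whole disk:

    zpool create tank mirror c0t0d0s4 c0t0d0s5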
Re: [zfs-discuss] Memory Usage
Also, where do I set arc.c_max? In /etc/system? Out of curiosity, why isn't limiting arc.c_max considered best practice (I just want to make sure I am not missing something about the effect limiting it will have)? My guess is that in our case (lots of small groups -- 50 people or less -- sharing files over the web) that file system caches are not that useful. The small groups mean that no one file gets used that often and, since access is over the web, their response time will be largely limited by their internet connection. We don't want users to need to tune a bunch of knobs to get performance out of ZFS. We want it to work well out of the box. So we are trying to discourage using these tunables, and instead figure out what the root problem is and fix it. There is really no reason why zfs shouldn't be able to adapt itself appropriately to the available memory. Ah, the ZFS philosophy that I love (not having to tune a bunch of knobs)! Seems like you need a way for the kernel to say I would like some memory back now. I don't have the slightest idea how practical that is though... BTW -- did I guess right wrt where I need to set arc.c_max (/etc/system)? Thanks, Tom ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
On 12/09/06, Celso [EMAIL PROTECTED] wrote: ...you split one disk in two. you then have effectively two partitions which you can then create a new mirrored zpool with. Then everything is mirrored. Correct? Everything in the filesystems in the pool, yes. With ditto blocks, you can selectively add copies (seeing as how filesystems are so easy to create on zfs). If you are only concerned with copies of your important documents and email, why should /usr/bin be mirrored? So my machine will boot if a disk fails. Which happened the other day :) -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ok cool. I think it has already been said that in many people's experience, when a disk fails, it completely fails. Especially on laptops. Of course ditto blocks wouldn't help you in this situation either! I still think that silent data corruption is a valid concern, one that ditto blocks would solve. Also, I am not thrilled about losing that much space for duplication of unnecessary data (caused by partitioning a disk in two). I also echo Darren's comments on zfs performing better when it has the whole disk. Hopefully we can agree that you lose nothing by adding this feature, even if you personally don't see a need for it. Celso This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
Celso wrote: Hopefully we can agree that you lose nothing by adding this feature, even if you personally don't see a need for it. If I read correctly user tools will show more space in use when adding copies, quotas are impacted, etc. One could argue the added confusion outweighs the addition of the feature. As others have asked I'd like to see the problem that this feature is designed to solve. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
Matthew Ahrens wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Thanks everyone for your input. The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Given the overwhelming criticism of this feature, I'm going to shelve it for now. Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on a mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Proposal: multiple copies of user data
Matthew Ahrens wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Thanks everyone for your input. The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Given the overwhelming criticism of this feature, I'm going to shelve it for now. Damn! That's a real shame! I was really starting to look forward to that. Please reconsider??! --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Celso This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/12/06, Matthew Ahrens [EMAIL PROTECTED] wrote: Matthew Ahrens wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Thanks everyone for your input. The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Given the overwhelming criticism of this feature, I'm going to shelve it for now. I think it's a valid problem. My understanding was that this didn't give a *guaranteed* solution, though. I think most people, when committing to the point of replication (spending actual money), need a guarantee at some level (not of course of total safety; but that the data actually does exist on separate disks, and will survive the destruction of one disk). A good solution to this problem would be valuable. (And I'd accept a non-guarantee on a single disk; or rather a guarantee that said if enough blocks to find the data exist, and a copy of each data block exists, we can retrieve the data; but that guarantee *does* exist I think). Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on an mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it. I was never concerned at the free space issues (though I was concerned by some of the proposed solutions to what I saw as a non-issue). I'd be happy if the free space described how many bytes of default files you could add to the pool, and the user would have to understand that results would differ if they used non-default parameters. You're probably right that fewer people would mind having *more* space than an unthinking reading would show than less. -- David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/ RKBA: http://www.dd-b.net/carry/ Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/ Dragaera/Steven Brust: http://dragaera.info/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
On 12/09/06, Celso [EMAIL PROTECTED] wrote: I think it has already been said that in many peoples experience, when a disk fails, it completely fails. Especially on laptops. Of course ditto blocks wouldn't help you in this situation either! Exactly. I still think that silent data corruption is a valid concern, one that ditto blocks would solve. Also, I am not thrilled about losing that much space for duplication of unneccessary data (caused by partitioning a disk in two). Well, you'd only be duplicating the data on the mirror. If you don't want to mirror the base OS, no one's saying you have to. For the sake of argument, let's assume: 1. disk is expensive 2. someone is keeping valuable files on a non-redundant zpool 3. they can't scrape enough vdevs to make a redundant zpool (remembering you can build vdevs out of *flat files*) Even then, to my mind: to the user, the *file* (screenplay, movie of childs birth, civ3 saved game, etc.) is the logical entity to have a 'duplication level' attached to it, and the only person who can score that is the author of the file. This proposal says the filesystem creator/admin scores the filesystem. Your argument against unneccessary data duplication applies to all 'non-special' files in the 'special' filesystem. They're wasting space too. If the user wants to make sure the file is 'safer' than others, he can just make multiple copies. Either to a USB disk/flashdrive, cdrw, dvd, ftp server, whatever. The redundancy you're talking about is what you'd get from 'cp /foo/bar.jpg /foo/bar.jpg.ok', except it's hidden from the user and causing headaches for anyone trying to comprehend, port or extend the codebase in the future. I also echo Darren's comments on zfs performing better when it has the whole disk. Me too, but a lot of laptop users dual-boot, which makes it a moot point. Hopefully we can agree that you lose nothing by adding this feature, even if you personally don't see a need for it. Sorry, I don't think we're going to agree on this one :) I've seen dozens of project proposals in the few months I've been lurking around opensolaris. Most of them have been of no use to me, but each to their own. I'm afraid I honestly think this greatly complicates the conceptual model (not to mention the technical implementation) of ZFS, and I haven't seen a convincing use case. All the best Dick. -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
Matthew Ahrens wrote: Matthew Ahrens wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Thanks everyone for your input. The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Given the overwhelming criticism of this feature, I'm going to shelve it for now. This is unfortunate. As a laptop user with only a single drive, I was looking forward to it since I've been bitten in the past by data loss caused by a bad area on the disk. I don't care about the space consumption because I generally don't come anywhere close to filling up the available space. It may not be the primary market for ZFS, but it could be a very useful side benefit. Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on an mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it. I don't see much need for this in any area that I would use ZFS (either my own personal use or for any case in which I would recommend it for production use). However, if you think that it's OK to under-report free space, then why not just do that for the data ditto blocks. If one or more of my filesystems are configured to keep two copies of the data, then simply report only half of the available space. If duplication isn't enabled for the entire pool but only for certain filesystems, then perhaps you could even take advantage of quotas for those filesystems to make a more accurate calculation. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
Dick Davies wrote: For the sake of argument, let's assume: 1. disk is expensive 2. someone is keeping valuable files on a non-redundant zpool 3. they can't scrape enough vdevs to make a redundant zpool (remembering you can build vdevs out of *flat files*) Given those assumptions, I think that the proposed feature is the perfect solution. Simply put those files in a filesystem that has copies > 1. Also note that using files to back vdevs is not a recommended solution. If the user wants to make sure the file is 'safer' than others, he can just make multiple copies. Either to a USB disk/flashdrive, cdrw, dvd, ftp server, whatever. It seems to me that asking the user to solve this problem by manually making copies of all his files puts all the burden on the user/administrator and is a poor solution. For one, they have to remember to do it pretty often. For two, when they do experience some data loss, they have to manually reconstruct the files! They could have one file which has part of it missing from copy A and part of it missing from copy B. I'd hate to have to reconstruct that manually from two different files, but the proposed solution would do this transparently. The redundancy you're talking about is what you'd get from 'cp /foo/bar.jpg /foo/bar.jpg.ok', except it's hidden from the user and causing headaches for anyone trying to comprehend, port or extend the codebase in the future. Whether it's hard to understand is debatable, but this feature integrates very smoothly with the existing infrastructure and wouldn't cause any trouble when extending or porting ZFS. I'm afraid I honestly think this greatly complicates the conceptual model (not to mention the technical implementation) of ZFS, and I haven't seen a convincing use case. Just for the record, these changes are pretty trivial to implement; less than 50 lines of code changed. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
On 12/09/06, Celso [EMAIL PROTECTED] wrote: I think it has already been said that in many peoples experience, when a disk fails, it completely fails. Especially on laptops. Of course ditto blocks wouldn't help you in this situation either! Exactly. I still think that silent data corruption is a valid concern, one that ditto blocks would solve. Also, I am not thrilled about losing that much space for duplication of unneccessary data (caused by partitioning a disk in two). Well, you'd only be duplicating the data on the mirror. If you don't want to mirror the base OS, no one's saying you have to. Yikes! that sounds like even more partitioning! For the sake of argument, let's assume: 1. disk is expensive 2. someone is keeping valuable files on a non-redundant zpool 3. they can't scrape enough vdevs to make a redundant zpool (remembering you can build vdevs out of *flat files*) Even then, to my mind: to the user, the *file* (screenplay, movie of childs birth, civ3 saved game, etc.) is the logical entity to have a 'duplication level' attached to it, and the only person who can score that is the author of the file. This proposal says the filesystem creator/admin scores the filesystem. Your argument against unneccessary data duplication applies to all 'non-special' files in the 'special' filesystem. They're wasting space too. If the user wants to make sure the file is 'safer' than others, he can just make multiple copies. Either to a USB disk/flashdrive, cdrw, dvd, ftp server, whatever. The redundancy you're talking about is what you'd get from 'cp /foo/bar.jpg /foo/bar.jpg.ok', except it's hidden from the user and causing headaches for anyone trying to comprehend, port or extend the codebase in the future. the proposed solution differs in one important aspect: it automatically detects data corruption. I also echo Darren's comments on zfs performing better when it has the whole disk. Me too, but a lot of laptop users dual-boot, which makes it a moot point. Hopefully we can agree that you lose nothing by adding this feature, even if you personally don't see a need for it. Sorry, I don't think we're going to agree on this one :) No worries, that's cool. All the best Dick. -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discu ss Celso This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
On Sep 12, 2006, at 4:39 PM, Celso wrote: On 12/09/06, Celso [EMAIL PROTECTED] wrote: I think it has already been said that in many peoples experience, when a disk fails, it completely fails. Especially on laptops. Of course ditto blocks wouldn't help you in this situation either! Exactly. I still think that silent data corruption is a valid concern, one that ditto blocks would solve. Also, I am not thrilled about losing that much space for duplication of unneccessary data (caused by partitioning a disk in two). Well, you'd only be duplicating the data on the mirror. If you don't want to mirror the base OS, no one's saying you have to. Yikes! that sounds like even more partitioning! The redundancy you're talking about is what you'd get from 'cp /foo/bar.jpg /foo/bar.jpg.ok', except it's hidden from the user and causing headaches for anyone trying to comprehend, port or extend the codebase in the future. the proposed solution differs in one important aspect: it automatically detects data corruption. Detecting data corruption is a function of the ZFS checksumming feature. The proposed solution has _nothing_ to do with detecting corruption. The difference is in what happens when/if such bad data is detected. Without a duplicate copy, via some RAID level or the proposed ditto block copies, the file is corrupted. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
Matthew Ahrens wrote: Matthew Ahrens wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Thanks everyone for your input. The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Given the overwhelming criticism of this feature, I'm going to shelve it for now. So it seems to me that having this feature per-file is really useful. Say i have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop - I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When i get back from the presentation i can turn off the extra copies. Doing it for the filesystem is just one step higher (and makes it administratively easier as i don't have to type the same command for each file thats important). Mirroring is just like another step above that - though its possibly replicating stuff you just don't care about. Now placing extra copies of the data doesn't guarantee that data will survive multiple diskf failures; but neither does having a mirrored pool guarantee the data will be there either (2 disk failures). Both methods are about increasing your chances of having your valuable data around. I for one would have loved to have multiple copy filesystems + ZFS on my powerbook when i was travelling in Australia for a month - think of all the digital pictures you take and how pissed you would be if the one with the wild wombat didn't survive. Its maybe not an enterprise solution, but it seems like a consumer solution. Ensuring that the space accounting tools make sense is definitely a valid point though. eric Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on an mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
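If the proposed property were to ship as described, eric's travel scenario would boil down to something like this (the filesystem name is hypothetical, and per the proposal the setting only governs data written after the change, so existing extra copies linger until those blocks are rewritten or freed):

    zfs set copies=2 tank/presentations    # before the trip
    zfs set copies=1 tank/presentations    # after getting back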
Re: [zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
Chad Lewis wrote: On Sep 12, 2006, at 4:39 PM, Celso wrote: the proposed solution differs in one important aspect: it automatically detects data corruption. Detecting data corruption is a function of the ZFS checksumming feature. The proposed solution has _nothing_ to do with detecting corruption. The difference is in what happens when/if such bad data is detected. Without a duplicate copy, via some RAID level or the proposed ditto block copies, the file is corrupted. With a mirrored ZFS pool, what are the odds of losing all copies of the [meta]data, for N disks (where N = 1, 2, etc)? I thought we understood this pretty well, and that the answer was extremely small. -- Jeff VICTOR Sun Microsystemsjeff.victor @ sun.com OS AmbassadorSr. Technical Specialist Solaris 10 Zones FAQ:http://www.opensolaris.org/os/community/zones/faq -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Bizarre problem with ZFS filesystem
Here's the information you requested. Script started on Tue Sep 12 16:46:46 2006 # uname -a SunOS umt1a-bio-srv2 5.10 Generic_118833-18 sun4u sparc SUNW,Netra-T12 # prtdiag System Configuration: Sun Microsystems sun4u Sun Fire E2900 System clock frequency: 150 MHZ Memory size: 96GB === CPUs === E$ CPU CPU CPU Freq SizeImplementation MaskStatus Location --- -- --- - -- 0,512 1500 MHz 32MBSUNW,UltraSPARC-IV+ 2.1on-line SB0/P0 1,513 1500 MHz 32MBSUNW,UltraSPARC-IV+ 2.1on-line SB0/P1 2,514 1500 MHz 32MBSUNW,UltraSPARC-IV+ 2.1on-line SB0/P2 3,515 1500 MHz 32MBSUNW,UltraSPARC-IV+ 2.1on-line SB0/P3 8,520 1500 MHz 32MBSUNW,UltraSPARC-IV+ 2.1on-line SB2/P0 9,521 1500 MHz 32MBSUNW,UltraSPARC-IV+ 2.1on-line SB2/P1 10,522 1500 MHz 32MBSUNW,UltraSPARC-IV+ 2.1on-line SB2/P2 11,523 1500 MHz 32MBSUNW,UltraSPARC-IV+ 2.1on-line SB2/P3 16,528 1500 MHz 32MBSUNW,UltraSPARC-IV+ 2.1on-line SB4/P0 17,529 1500 MHz 32MBSUNW,UltraSPARC-IV+ 2.1on-line SB4/P1 18,530 1500 MHz 32MBSUNW,UltraSPARC-IV+ 2.1on-line SB4/P2 19,531 1500 MHz 32MBSUNW,UltraSPARC-IV+ 2.1on-line SB4/P3 # md mdb -k (B)0Loading modules: [ unix krtld genunix dtrace specfs ufs sd sgsbbc md sgenv ip sctp usba fcp fctl qlc nca ssd lofs zfs random crypto ptm nfs ipc logindmux cpc sppp fcip wrsmd ] arc::stat print { anon = ARC_anon mru = ARC_mru mru_ghost = ARC_mru_ghost mfu = ARC_mfu mfu_ghost = ARC_mfu_ghost size = 0x11917e1200 p = 0x116e8a1a40 c = 0x11917cf428 c_min = 0xbf77c800 c_max = 0x17aef9 hits = 0x489737a8 misses = 0x8869917 deleted = 0xc832650 skipped = 0x15b29b2 hash_elements = 0x1273d0 hash_elements_max = 0x17576f hash_collisions = 0x4e0ceee hash_chains = 0x3a9b2 Segmentation Fault - core dumped # mdb -k (B)0Loading modules: [ unix krtld genunix dtrace specfs ufs sd sgsbbc md sgenv ip sctp usba fcp fctl qlc nca ssd lofs zfs random crypto ptm nfs ipc logindmux cpc sppp fcip wrsmd ] ::kmastat ::pgrep vi | ::walk thread 3086600f660 :[K 3086600f660::findstack stack pointer for thread 3086600f660: 2a104598d91 [ 02a104598d91 cv_wait_sig+0x114() ] 02a104598e41 str_cv_wait+0x28() 02a104598f01 strwaitq+0x238() 02a104598fc1 strread+0x174() 02a1045990a1 fop_read+0x20() 02a104599161 read+0x274() 02a1045992e1 syscall_trap32+0xcc() 3086600f660::findstack stack pointer for thread 3086600f660: 2a104598e61 3086600f660::findstack stack pointer for thread 3086600f660: 2a104598e61 02a104598f61 zil_lwb_commit+0x1ac() 02a104599011 zil_commit+0x1b0() 02a1045990c1 zfs_fsync+0xa8() 02a104599171 fop_fsync+0x14() 02a104599231 fdsync+0x20() 02a1045992e1 syscall_trap32+0xcc() 3086600f660::findstack stack pointer for thread 3086600f660: 2a104598c71 3086600f660::findstack stack pointer for thread 3086600f660: 2a104598e61 02a104598f61 zil_lwb_commit+0x1ac() 02a104599011 zil_commit+0x1b0() 02a1045990c1 zfs_fsync+0xa8() 02a104599171 fop_fsync+0x14() 02a104599231 fdsync+0x20() 02a1045992e1 syscall_trap32+0xcc() 3086600f660::findstack stack pointer for thread 3086600f660: 2a104598e61 02a104598f61 zil_lwb_commit+0x1ac() 02a104599011 zil_commit+0x1b0() 02a1045990c1 zfs_fsync+0xa8() 02a104599171 fop_fsync+0x14() 02a104599231 fdsync+0x20() 02a1045992e1 syscall_trap32+0xcc() 3086600f660::findstack stack pointer for thread 3086600f660: 2a104598e61 3086600f660::findstack stack pointer for thread 3086600f660: 2a104598bb1 3086600f660::findstack stack pointer for thread 3086600f660: 2a104598e61 02a104598f61 zil_lwb_commit+0x1ac() 02a104599011 zil_commit+0x1b0() 02a1045990c1 zfs_fsync+0xa8() 02a104599171 fop_fsync+0x14() 02a104599231 fdsync+0x20() 02a1045992e1 syscall_trap32+0xcc() 
3086600f660::findstack stack pointer for thread 3086600f660 (TS_FREE): 2a104598ba1 02a104598fe1 segvn_unmap+0x1b8() 02a1045990d1 as_free+0xf4() 02a104599181 proc_exit+0x46c() 02a104599231 exit+8() 02a1045992e1 syscall_trap32+0xcc() # df -h Filesystem size used avail capacity Mounted on /dev/md/dsk/d10 32G 6.7G 25G 22% / /devices 0K 0K 0K 0% /devices ctfs 0K 0K 0K 0% /system/contract proc 0K 0K 0K 0% /proc mnttab 0K 0K 0K 0% /etc/mnttab swap
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/12/06, eric kustarz [EMAIL PROTECTED] wrote: So it seems to me that having this feature per-file is really useful. Say i have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop - I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When i get back from the presentation i can turn off the extra copies. Yes, you could do that. *I* would make a copy on a CD, which I would carry in a separate case from the laptop. I think my presentation is a lot safer than your presentation. Similarly for your digital images example; I don't consider it safe until I have two or more *independent* copies. Two copies on a single hard drive doesn't come even close to passing the test for me; as many people have pointed out, those tend to fail all at once. And I will also point out that laptops get stolen a lot. And of course all the accidents involving fumble-fingers, OS bugs, and driver bugs won't be helped by the data duplication either. (Those will mostly be helped by sensible use of snapshots, though, which is another argument for ZFS on *any* disk you work on a lot.) The more I look at it the more I think that a second copy on the same disk doesn't protect against very much real-world risk. Am I wrong here? Are partial(small) disk corruptions more common than I think? I don't have a good statistical view of disk failures. -- David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/ RKBA: http://www.dd-b.net/carry/ Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/ Dragaera/Steven Brust: http://dragaera.info/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
On 9/12/06, Celso [EMAIL PROTECTED] wrote: Whether it's hard to understand is debatable, but this feature integrates very smoothly with the existing infrastructure and wouldn't cause any trouble when extending or porting ZFS. OK, given this statement... Just for the record, these changes are pretty trivial to implement; less than 50 lines of code changed. and this statement, I can't see any reasons not to include it. If the changes are easy to do, don't require anymore of the zfs team's valuable time, and don't hinder other things, I would plead with you to include them, as I think they are genuinely valuable and would make zfs not only the best enterprise level filesystem, but also the best filesystem for laptops/home computers. While I'm not a big fan of this feature, if the work is that well understood and that small, I have no objection to it. (Boy that sounds snotty; apologies, not what I intend here. Those of you reading this know how muich you care about my opinion, that's up to you.) I do pity the people who count on the ZFS redundancy to protect their presentation on an important sales trip -- and then have their laptop stolen. But those people might well be the same ones who would have *no* redundancy otherwise. And nothing about this feature prevents the paranoids like me from still making our backup CD and carrying it separately. I'm not prepared to go so far as to argue that it's bad to make them feel safer :-). At least, to make them feel safer *by making them actually safer*. -- David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/ RKBA: http://www.dd-b.net/carry/ Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/ Dragaera/Steven Brust: http://dragaera.info/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
David Dyer-Bennet wrote: While I'm not a big fan of this feature, if the work is that well understood and that small, I have no objection to it. (Boy that sounds snotty; apologies, not what I intend here. Those of you reading this know how muich you care about my opinion, that's up to you.) One could make the argument that the feature could cause enough confusion to not warrant its inclusion. If I'm a typical user and I write a file to the filesystem where the admin set three copies but didn't tell me it might throw me into a tizzy trying to figure out why my quota is 3X where I expect it to be. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss