Re: [zfs-discuss] Who is using ZFS ACL's in production?
"Paul B. Henson" writes: > Good :). I am certainly not wedded to my proposal, if some other > solution is proposed that would meet my requirements, great. However, > pretty much all of the advice has boiled down to either "ACL's are > broken, don't use them", or "why would you want to do *that*?", which > isn't particularly useful. you haven't demonstrated why the current capabilities are insufficient for your requirements. it's a bit hard to offer advice for perceived problems other than "reconsider your perception". -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
"Paul B. Henson" writes: > On Tue, 2 Mar 2010, Kjetil Torgrim Homme wrote: > >> no. what happens when an NFS client without ACL support mounts your >> filesystem? your security is blown wide open. the filemode should >> reflect the *least* level of access. if the filemode on its own allows >> more access, then you've lost. > > Say what? > > If you're using secure NFS, access control is handled on the server > side. If an NFS client that doesn't support ACL's mounts the > filesystem, it will have whatever access the user is supposed to have, > the lack of ACL support on the client is immaterial. this is true for AUTH_SYS, too, sorry about the bad example. but it doesn't really affect my point. if you just consider the filemode to be the lower bound for access rights, aclmode=passthrough will not give you any nasty surprises regardless of what clients do, *and* an ACL-ignorant client will get the behaviour it needs and wants. win-win! >> if your ACLs are completely specified and give proper access on their >> own, and you're using aclmode=passthrough, "chmod -R 000 /" will not >> harm your system. > > Actually, it will destroy the three special ACE's, user@, group@, and > every...@. On the other hand, with a hypothetical aclmode=ignore or > aclmode=deny, such a chmod would indeed not harm the system. you're not using those, are you? they are a direct mapping of the old style permissions, so it would be pretty weird if they were allowed to diverge. >> if you have rogue processes doing "chmod a+rwx" or other nonsense, you >> need to fix the rogue process, that's not an ACL problem or a problem >> with traditional Unix permissions. > > What I have are processes that don't know about ACL's. Are they > broken? Not in and of themselves, they are simply incompatible with a > security model they are unaware of. you made that model. 
> Why on earth would I want to go and try to make every single > application in the world ACL aware/compatible instead of simply having > a filesystem which I can configure to ignore any attempt to manipulate > legacy permissions? you don't have to. just subscribe to the principle of least security, and it just works. >> not at all. you just have to use them correctly. > > I think we're just not on the same page on this; while I am not saying > I'm on the right page, it does seem you need to do a little more > reading up on how ACL's work. nice insult. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Consolidating a huge stack of DVDs using ZFS dedup: automation?
Freddie Cash writes:

> Kjetil Torgrim Homme wrote:
>
>> it would be inconvenient to make a dedup copy on harddisk or tape,
>> you could only do it as a ZFS filesystem or ZFS send stream. it's
>> better to use a generic tool like hardlink(1), and just delete
>> files afterwards with
>
> Why would it be inconvenient? This is pretty much exactly what ZFS +
> dedupe is perfect for.

the duplication is not visible, so it's still a wilderness of
duplicates when you navigate the files.

> Since dedupe is pool-wide, you could create individual filesystems for
> each DVD. Or use just 1 filesystem with sub-directories. Or just one
> filesystem with snapshots after each DVD is copied over top.
>
> The data would be dedupe'd on write, so you would only have 1 copy of
> unique data.

for this application, I don't think the OP *wants* COW if he changes
one file. he'll want the duplicates to be kept in sync, not diverging
(in contrast to storage for VMs, for instance). with hardlinks, it is
easier to identify duplicates and handle them however you like. if
there is a reason for the duplicate access paths to your data, you can
keep them. I would want to straighten the mess out, though, rather
than keep it intact as closely as possible.

> To save it to tape, just "zfs send" it, and save the stream file.

the zfs stream format is not recommended for archiving.

> ZFS dedupe would also work better than hardlinking files, as it works
> at the block layer, and will be able to dedupe partial files.

yes, but for the most part this will be negligible. copies of growing
files, like log files, or perhaps your novel written as a stream of
consciousness, will benefit. unrelated partially identical files are
rare.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Consolidating a huge stack of DVDs using ZFS dedup: automation?
"valrh...@gmail.com" writes: > I have been using DVDs for small backups here and there for a decade > now, and have a huge pile of several hundred. They have a lot of > overlapping content, so I was thinking of feeding the entire stack > into some sort of DVD autoloader, which would just read each disk, and > write its contents to a ZFS filesystem with dedup enabled. [...] That > would allow me to consolidate a few hundred CDs and DVDs onto probably > a terabyte or so, which could then be kept conveniently on a hard > drive and archived to tape. it would be inconvenient to make a dedup copy on harddisk or tape, you could only do it as a ZFS filesystem or ZFS send stream. it's better to use a generic tool like hardlink(1), and just delete files afterwards with find . -type f -links +1 -exec rm {} \; (untested! notice that using xargs or -exec rm {} + will wipe out all copies of your duplicate files, so don't do that!) http://linux.die.net/man/1/hardlink perhaps this is more convenient: http://netdial.caribe.net/~adrian2/fdupes.html -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
"Paul B. Henson" writes: > On Sun, 28 Feb 2010, Kjetil Torgrim Homme wrote: > >> why are you doing this? it's inherently insecure to rely on ACL's to >> restrict access. do as David says and use ACL's to *grant* access. >> if needed, set permission on the file to 000 and use umask 777. > > Umm, it's inherently insecure to rely on Access Control Lists to, > well, control access? Doesn't that sound a bit off? no. what happens when an NFS client without ACL support mounts your filesystem? your security is blown wide open. the filemode should reflect the *least* level of access. if the filemode on its own allows more access, then you've lost. > The only reason it's insecure is because the ACL's don't stand alone, > they're propped up on a legacy chmod interoperability house of cards > which frequently falls down. not if you do it right. >> why is umask 022 when you want 077? *that's* your problem. > > What I want is for my inheritable ACL's not to be mixed in with legacy > concepts. ACL's don't have a umask. One of the benefits of inherited > ACL's is you don't need to globally pick "022, let people see what I'm > up to" vs "077, hide it all". You can just create files, with the > confidence that every one you create will have the appropriate > permissions as configured. if your ACLs are completely specified and give proper access on their own, and you're using aclmode=passthrough, "chmod -R 000 /" will not harm your system. if you have rogue processes doing "chmod a+rwx" or other nonsense, you need to fix the rogue process, that's not an ACL problem or a problem with traditional Unix permissions. > Except, of course, when they're comingled with incompatible security > models. Basically, it sounds like you're arguing I shouldn't try to > fix ACL/chmod issues because ACL's are insecure because they have > chmod issues 8-/. not at all. you just have to use them correctly. -- Kjetil T. 
Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
"Paul B. Henson" writes: > On Fri, 26 Feb 2010, David Dyer-Bennet wrote: >> I think of using ACLs to extend extra access beyond what the >> permission bits grant. Are you talking about using them to prevent >> things that the permission bits appear to grant? Because so long as >> they're only granting extended access, losing them can't expose >> anything. > > Consider the example of creating a file in a directory which has an > inheritable ACL for new files: why are you doing this? it's inherently insecure to rely on ACL's to restrict access. do as David says and use ACL's to *grant* access. if needed, set permission on the file to 000 and use umask 777. > drwx--s--x+ 2 henson csupomona 4 Feb 27 09:21 . > owner@:rwxpdDaARWcC--:-di---:allow > owner@:rwxpdDaARWcC--:--:allow > group@:--x---a-R-c---:-di---:allow > group@:--x---a-R-c---:--:allow > everyone@:--x---a-R-c---:-di---:allow > everyone@:--x---a-R-c---:--:allow > owner@:rwxpdDaARWcC--:f-i---:allow > group@:--:f-i---:allow > everyone@:--:f-i---:allow > > When the ACL is respected, then regardless of the requested creation > mode or the umask, new files will have the following ACL: > > -rw---+ 1 henson csupomona 0 Feb 27 09:26 foo > owner@:rw-pdDaARWcC--:--:allow > group@:--:--:allow > everyone@:--:--:allow > > Now, let's say a legacy application used a requested creation mode of > 0644, and the current umask was 022, and the application calculated > the resultant mode and explicitly set it with chmod(0644): why is umask 022 when you want 077? *that's* your problem. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Oops, ran zfs destroy after renaming a folder and deleted my file system.
tomwaters writes:

> I created a zfs file system, cloud/movies and shared it.
> I then filled it with movies and music.
> I then decided to rename it, so I used rename in the Gnome to change
> the folder name to media... ie cloud/media. <--- MISTAKE
> I then noticed the zfs share was pointing to /cloud/movies which no
> longer exists.

I think you should file a bug against Nautilus (the GNOME file
manager). When you rename a directory, it should check for it being a
mountpoint and warn appropriately. (adding ZFS specific code to DTRT
is perhaps asking for a bit too much.)

evidently it got an error for the rename(2) and instead started to
copy/delete the original. *inside* some filesystems, this is probably
correct behaviour, but when the object is a filesystem, I don't think
anyone wants this behaviour. if they want to move data off the
filesystem, they should go inside, mark all files, and drag (or ^X ^V)
the files wherever they should go.

> So, I removed cloud/movies with zfs destroy <--- BIGGER MISTAKE

I see the reasoning behind this, but as you've learnt the hard way:
always double-check before using zfs destroy.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
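for the record, the supported way to rename a ZFS filesystem (and move
its default mountpoint and share along with it) is zfs rename. a
hypothetical session using the dataset names from this thread, not
runnable without the pool in question:

```
# zfs rename cloud/movies cloud/media
# zfs list -r cloud
```

unlike a rename in a file manager, this operates on the dataset
itself, so nothing is copied and the share follows the new name.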
Re: [zfs-discuss] ZFS with hundreds of millions of files
"David Dyer-Bennet" writes: > Which is bad enough if you say "ls". And there's no option to say > "don't sort" that I know of, either. /bin/ls -f "/bin/ls" makes sure an alias for "ls" to "ls -F" or similar doesn't cause extra work. you can also write "\ls -f" to ignore a potential alias. without an argument, GNU ls and SunOS ls behave the same. if you write "ls -f *" you'll only get output for directories in SunOS, while GNU ls will list all files. (ls -f has been there since SunOS 4.0 at least) -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
Steve writes:

> I would like to ask a question regarding ZFS performance overhead when
> having hundreds of millions of files.
>
> We have a storage solution, where one of the datasets has a folder
> containing about 400 million files and folders (very small 1K files).
>
> What kind of overhead do we get from this kind of thing?

at least 50%. I don't think this is obvious, so I'll state it: RAID-Z
will not gain you any additional capacity over mirroring in this
scenario.

remember each individual file gets its own stripe. if the file is 512
bytes or less, you'll need another 512 byte block for the parity
(actually as a special case, it's not parity, but a copy. parity
would just be an inversion of all bits, so it's not useful to spend
time doing it.) what's more, even if the file is 1024 bytes or less,
ZFS will allocate an additional padding block to reduce the chance of
unusable single disk blocks. a 1536 byte file will also consume 2048
bytes of physical disk, however. the reasoning for RAID-Z2 is
similar, except it will add a padding block even for the 1536 byte
file. to summarise:

   net    raid-z1        raid-z2
  --------------------------------
   512    1024  (2x)     1536  (3x)
  1024    2048  (2x)     3072  (3x)
  1536    2048  (1½x)    3072  (2x)
  2048    3072  (1½x)    3072  (1½x)
  2560    3072  (1⅕x)    3584  (1⅖x)

the above assumes at least 8 (9) disks in the vdev, otherwise you'll
get a little more overhead for the "larger" filesizes.

> Our storage performance has degraded over time, and we have been
> looking in different places for cause of problems, but now I am
> wondering if its simply a file pointer issue?

adding new files will fragment directories, that might cause
performance degradation depending on access patterns. I don't think
many files in itself will cause problems, but since you get a lot more
ZFS records in your dataset (128x!), more of the disk space is
"wasted" on block pointers, and you may get more block pointer writes
since more levels are needed.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
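the table can be reproduced with a small model of the allocation rules
described above. this is a sketch under my own assumptions (512-byte
sectors, one parity/copy sector per stripe, and the total allocation
padded to a multiple of nparity+1 sectors); it agrees with the table
for every row except the last raid-z2 entry, where the padding rule
and the table differ:

```shell
# physical bytes consumed by a small file on raid-z<p> with 512 B sectors
alloc() {
    size=$1 p=$2
    data=$(( (size + 511) / 512 ))                 # data sectors, rounded up
    total=$(( data + p ))                          # add parity/copy sectors
    mult=$(( p + 1 ))
    total=$(( (total + mult - 1) / mult * mult ))  # pad to multiple of p+1
    echo $(( total * 512 ))
}
alloc 512 1    # 1024
alloc 512 2    # 1536
alloc 1536 1   # 2048
alloc 1536 2   # 3072
```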
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
Miles Nordin writes:

>>>>>> "kth" == Kjetil Torgrim Homme writes:
>
>    kth> the SCSI layer handles the replaying of operations after a
>    kth> reboot or connection failure.
>
> how?
>
> I do not think it is handled by SCSI layers, not for SAS nor iSCSI.

sorry, I was inaccurate. error reporting is done by the SCSI layer,
and the filesystem handles it by retrying whatever outstanding
operations it has.

> Also, remember a write command that goes into the write cache is a
> SCSI command that's succeeded, even though it's not actually on disk
> for sure unless you can complete a sync cache command successfully and
> do so with no errors nor ``protocol events'' in the gap between the
> successful write and the successful sync. A facility to replay failed
> commands won't help because when a drive with write cache on reboots,
> successful writes are rolled back.

this is true, sorry about my lack of precision. the SCSI layer can't
do this on its own.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
Miles Nordin writes:

> There will probably be clients that might seem to implicitly make this
> assumption by mishandling the case where an iSCSI target goes away and
> then comes back (but comes back less whatever writes were in its write
> cache). Handling that case for NFS was complicated, and I bet such
> complexity is just missing without any equivalent from the iSCSI spec,
> but I could be wrong. I'd love to be educated.
>
> Even if there is some magical thing in iSCSI to handle it, the magic
> will be rarely used and often wrong until people learn how to test it,
> which they haven't yet the way they have with NFS.

I decided I needed to read up on this and found RFC 3783 which is very
readable, highly recommended:

  http://tools.ietf.org/html/rfc3783

basically iSCSI just defines a reliable channel for SCSI. the SCSI
layer handles the replaying of operations after a reboot or connection
failure. as far as I understand it, anyway.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] improve meta data performance
Chris Banal writes:

> We have a SunFire X4500 running Solaris 10U5 which does about 5-8k nfs
> ops of which about 90% are meta data. In hindsight it would have been
> significantly better to use a mirrored configuration but we opted for
> 4 x (9+2) raidz2 at the time. We can not take the downtime necessary
> to change the zpool configuration.
>
> We need to improve the meta data performance with little to no
> money. Does anyone have any suggestions?

I believe the latest Solaris update will improve metadata caching.
always good to be up-to-date on patches, no?

> Is there such a thing as a Sun supported NVRAM PCI-X card compatible
> with the X4500 which can be used as an L2ARC?

I think they only have PCIe, and it hardly qualifies as "little to no
money":

  http://www.sun.com/storage/disk_systems/sss/f20/specs.xml

I'll second the recommendations for Intel X25-M for L2ARC if you can
spare a SATA slot for it.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] ZFS slowness under domU high load
Bogdan Ćulibrk writes:

> What are my options from here? To move onto zvol with greater
> blocksize? 64k? 128k? Or will I get into another trouble going that
> way when I have small reads coming from domU (ext3 with default
> blocksize of 4k)?

yes, definitely. have you considered using NFS rather than zvols for
the data filesystems? (keep the zvol for the domU software.)

it's strange that you see so much write activity during backup -- I'd
expect that to do just reads... what's going on at the domU?

generally, the best way to improve performance is to add RAM for ARC
(512 MiB is *very* little IMHO) and an SSD for your ZIL, but that does
seem to be a poor match for your concept of many small low-cost
dom0's.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
[please don't top-post, please remove CC's, please trim quotes. it's
really tedious to clean up your post to make it readable.]

Marc Nicholas writes:

> Brent Jones wrote:
>> Marc Nicholas wrote:
>>> Kjetil Torgrim Homme wrote:
>>>> his problem is "lazy" ZFS, notice how it gathers up data for 15
>>>> seconds before flushing the data to disk. tweaking the flush
>>>> interval down might help.
>>>
>>> How does lowering the flush interval help? If he can't ingress data
>>> fast enough, faster flushing is a Bad Thing(tm).

if network traffic is blocked during the flush, you can experience
back-off on both the TCP and iSCSI level.

>>>> what are the other values? ie., number of ops and actual amount of
>>>> data read/written.

this remained unanswered.

>> ZIL performance issues? Is writecache enabled on the LUNs?
>
> This is a Windows box, not a DB that flushes every write.

have you checked if the iSCSI traffic is synchronous or not? I don't
use Windows, but other reports on the list have indicated that at
least the NTFS format operation *is* synchronous. use zilstat to see.

> The drives are capable of over 2000 IOPS (albeit with high latency as
> its NCQ that gets you there) which would mean, even with sync flushes,
> 8-9MB/sec.

2000 IOPS is the aggregate, but the disks are set up as *one* RAID-Z2!
NCQ doesn't help much, since the write operations issued by ZFS are
already ordered correctly.

the OP may also want to try tweaking metaslab_df_free_pct, this helped
linear write performance on our Linux clients a lot:

  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6869229

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
Bob Friesenhahn writes:

> On Wed, 10 Feb 2010, Frank Cusack wrote:
>
> The other three commonly mentioned issues are:
>
>  - Disable the Nagle algorithm on the windows clients.

for iSCSI? shouldn't be necessary.

>  - Set the volume block size so that it matches the client filesystem
>    block size (default is 128K!).

default for a zvol is 8 KiB.

>  - Check for an abnormally slow disk drive using 'iostat -xe'.

his problem is "lazy" ZFS, notice how it gathers up data for 15
seconds before flushing the data to disk. tweaking the flush interval
down might help.

>> An "iostat -xndz 1" readout of the "%b" column during a file copy to
>> the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2
>> seconds of 100, and repeats.

what are the other values? ie., number of ops and actual amount of
data read/written.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
"Eric D. Mudama" writes: > On Tue, Feb 9 at 2:36, Kjetil Torgrim Homme wrote: >> no one is selling disk brackets without disks. not Dell, not EMC, >> not NetApp, not IBM, not HP, not Fujitsu, ... > > http://discountechnology.com/Products/SCSI-Hard-Drive-Caddies-Trays very nice, thanks. unfortunately it probably won't last: [http://lists.us.dell.com/pipermail/linux-poweredge/2010-February/041335.html]: | | In the case of Dell's PERC RAID controllers, we began informing | customers when a non-Dell drive was detected with the introduction of | PERC5 RAID controllers in early 2006. With the introduction of the | PERC H700/H800 controllers, we began enabling only the use of Dell | qualified drives. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
Neil Perrin writes:

> On 02/09/10 08:18, Kjetil Torgrim Homme wrote:
>
>> I think the above is easily misunderstood. I assume the OP means
>> append, not rewrites, and in that case (with recordsize=128k):
>>
>> * after the first write, the file will consist of a single 1 KiB record.
>> * after the first append, the file will consist of a single 5 KiB
>>   record.
>
> Good so far.
>
>> * after the second append, one 128 KiB record and one 7 KiB record.
>
> A long time ago we used to write short tail blocks, but not any more.
> So after the 2nd append we actually have 2 128KB blocks.

thanks a lot for the correction!

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Dedup Questions.
Richard Elling writes:

> On Feb 8, 2010, at 6:04 PM, Kjetil Torgrim Homme wrote:
>
>> the size of [a DDT] entry is much larger:
>>
>> | From: Mertol Ozyoney
>> |
>> | Approximately it's 150 bytes per individual block.
>
> "zdb -D poolname" will provide details on the DDT size. FWIW, I have a
> pool with 52M DDT entries and the DDT is around 26GB.

wow, that's much larger than Mertol's estimate: 500 bytes per block.

> $ pfexec zdb -D tank
> DDT-sha256-zap-duplicate: 19725 entries, size 270 on disk, 153 in core
> DDT-sha256-zap-unique: 52284055 entries, size 284 on disk, 159 in core
>
> dedup = 1.00, compress = 1.00, copies = 1.00,
> dedup * compress / copies = 1.00

how do you calculate the 26 GB size from this?

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
Richard Elling writes:

> On Feb 8, 2010, at 9:10 PM, Damon Atkins wrote:
>
>> I would have thought that if I write 1k then ZFS txg times out in
>> 30secs, then the 1k will be written to disk in a 1k record block, and
>> then if I write 4k then 30secs later txg happens another 4k record
>> size block will be written, and then if I write 130k a 128k and 2k
>> record block will be written.
>>
>> Making the file have record sizes of
>> 1k+4k+128k+2k
>
> Close. Once the max record size is achieved, it is not reduced. So
> the allocation is: 1KB + 4KB + 128KB + 128KB

I think the above is easily misunderstood. I assume the OP means
append, not rewrites, and in that case (with recordsize=128k):

* after the first write, the file will consist of a single 1 KiB record.
* after the first append, the file will consist of a single 5 KiB
  record.
* after the second append, one 128 KiB record and one 7 KiB record.

in each of these operations, the *whole* file will be rewritten to a
new location, but after a third append, only the tail record will be
rewritten.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] [OT] excess zfs-discuss mailman digests
grarpamp writes:

> PS: Is there any way to get a copy of the list since inception for
> local client perusal, not via some online web interface?

I prefer to read mailing lists using a newsreader and the NNTP
interface at Gmane. a newsreader tends to be better at threading etc.
than a mail client which is fed an mbox...

see http://gmane.org/about.php for more information.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
Damon Atkins writes:

> One problem could be block sizes, if a file is re-written and is the
> same size it may have different ZFS record sizes within, if it was
> written over a long period of time (txg's) (ignoring compression), and
> therefore you could not use ZFS checksum to compare two files.

the record size used for a file is chosen when that file is created.
it can't change. when the default record size for the dataset changes,
only new files will be affected.

ZFS *must* write a complete record even if you change just one byte
(unless it's the tail record of course), since there isn't any better
granularity for the block pointers.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Dedup Questions.
Tom Hall writes:

> If you enable it after data is on the filesystem, it will find the
> dupes on read as well as write? Would a scrub therefore make sure the
> DDT is fully populated.

no. only written data is added to the DDT, so you need to copy the
data somehow. zfs send/recv is the most convenient, but you could even
do a loop of commands like

  cp -p "$file" "$file.tmp" && mv "$file.tmp" "$file"

> Re the DDT, can someone outline its structure please? Some sort of
> hash table? The blogs I have read so far don't specify.

I can't help here.

> Re DDT size, is (data in use)/(av blocksize) * 256bit right as a worst
> case (ie all blocks non identical)

the size of an entry is much larger:

| From: Mertol Ozyoney
| Subject: Re: Dedup memory overhead
| Message-ID: <00cb01caa580$a3d6f110$eb84d330$%ozyo...@sun.com>
| Date: Thu, 04 Feb 2010 11:58:44 +0200
|
| Approximately it's 150 bytes per individual block.

> What are average block sizes?

as a start, look at your own data. divide the used size in "df" with
used inodes in "df -i". example from my home directory:

  $ /usr/gnu/bin/df -i ~
  Filesystem            Inodes    IUsed      IFree IUse% Mounted on
  tank/home          223349423  3412777  219936646    2% /volumes/home
  $ df -k ~
  Filesystem            kbytes      used      avail capacity  Mounted on
  tank/home          573898752 257644703  109968254      71%  /volumes/home

so the average file size is 75 KiB, smaller than the recordsize of 128
KiB. extrapolating to a full filesystem, we'd get 4.9M files.

unfortunately, it's more complicated than that, since a file can
consist of many records even if the *average* is smaller than a single
record. a pessimistic estimate, then, is one record for each of those
4.9M files, plus one record for each 128 KiB of diskspace (2.8M), for
a total of 7.7M records. the size of the DDT for this (quite small!)
filesystem would be something like 1.2 GB.

perhaps a reasonable rule of thumb is 1 GB DDT per TB of storage.

(disclaimer: I'm not a kernel hacker, I just read this list :-)

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
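the back-of-the-envelope estimate above can be reproduced with shell
arithmetic. assumptions: the df figures quoted above (reading avail as
109968254 KiB), 150 bytes per DDT entry, and 128 KiB records:

```shell
used_kib=257644703; inodes_used=3412777; avail_kib=109968254
cap_kib=$(( used_kib + avail_kib ))        # filesystem capacity
avg_kib=$(( used_kib / inodes_used ))      # ~75 KiB average file size
files_full=$(( cap_kib / avg_kib ))        # ~4.9M files at capacity
records=$(( files_full + cap_kib / 128 ))  # plus one record per 128 KiB
ddt_mib=$(( records * 150 / 1024 / 1024 )) # ~150 bytes per DDT entry
echo "avg ${avg_kib} KiB, ${records} records, DDT ~${ddt_mib} MiB"
```

this yields roughly 7.8M records and a DDT of about 1.1 GiB, in line
with the "something like 1.2 GB" estimate above.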
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
Daniel Carosone writes:

> In that context, I haven't seen an answer, just a conclusion:
>
>  - All else is not equal, so I give my money to some other hardware
>    manufacturer, and get frustrated that Sun "won't let me" buy the
>    parts I could use effectively and comfortably.

no one is selling disk brackets without disks. not Dell, not EMC, not
NetApp, not IBM, not HP, not Fujitsu, ...

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Pool import with failed ZIL device now possible ?
Christo Kutrovsky writes:

> Has anyone seen soft corruption in NTFS iSCSI ZVOLs after a power
> loss?

this is not from experience, but I'll answer anyway.

> I mean, there is no guarantee writes will be executed in order, so in
> theory, one could corrupt its NTFS file system.

I think you have that guarantee, actually. the problem is that the
Windows client will think that block N has been updated, since the
iSCSI server told it it was committed to stable storage. however, when
the ZIL is disabled, that update may get lost during power loss.

if block N contains, say, directory information, this could cause
weird behaviour. it may look fine at first -- the problem won't appear
until NTFS has thrown block N out of its cache and it needs to re-read
it from the server. when the re-read stale data is combined with fresh
data from RAM, it's panic time...

> Would best practice be to rollback the last snapshot before making
> those iSCSI available again?

I think you need to reboot the client so that its RAM cache is cleared
before any other writes are made. a rollback shouldn't be necessary.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
Tim Cook writes: > Kjetil Torgrim Homme wrote: >>I don't know what the J4500 drive sled contains, but for the J4200 >>and J4400 they need to include quite a bit of circuitry to handle >>SAS protocol all the way, for multipathing and to be able to >>accept a mix of SAS and SATA drives. it's not just a piece of >>sheet metal, some plastic and a LED. >> >>the pricing does look strange, and I think it would be better to >>raise the price of the enclosure (which is silly cheap when empty >>IMHO) and reduce the drive prices somewhat. but that's just >>psychology, and doesn't really matter for total cost. > > Why exactly would that be better? people are looking at the price list and seeing that the J4200 costs 22550 NOK [1], while one sixpack of 2TB SATA disks to go with it costs 82500 NOK. on the other hand you could get six 2TB SATA disks (Ultrastars) from your friendly neighbourhood shop for 14370 NOK (7700 NOK for six Deskstars). and to add insult to injury, the neighbourhood shop offers five years warranty (required by Norwegian consumer laws), compared to Sun's three years... everyone knows the price of a harddisk since they buy them for their home computers. do they know the price of a disk storage array? not so well. yes, it's a matter of perception for the buyer, but perception can matter. > Then it's a high cost of entry. What if an SMB only needs 6 drives > day one? Why charge them an arm and a leg for the enclosure, and > nothing for the drives? Again, the idea is that you're charging based > on capacity. see my numbers above. the chassis itself is just 12% of the cost (22% when half full). some middle ground could be found. anyway, we're buying these systems and are very happy with them. when disks fail, Sun replace them very expediently and with a minimum of fuss. [1] all prices include VAT to simplify comparison. prices are current from shop.sun.com and komplett.no. Sun list prices are subject to haggling. -- Kjetil T. 
Homme Redpill Linpro AS - Changing the game
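The percentages quoted above can be re-derived from the NOK prices in the post (a quick sketch, assuming "full" for the twelve-slot J4200 means two sixpacks):

```python
# Prices in NOK as quoted in the post (shop.sun.com / komplett.no, incl. VAT).
enclosure = 22550          # empty J4200
sun_sixpack = 82500        # six 2TB SATA drives from Sun
retail_sixpack = 14370     # six 2TB Ultrastars at retail

# Sun drives cost roughly 5-6x the retail price.
markup = sun_sixpack / retail_sixpack
print(f"Sun drive markup: {markup:.1f}x")            # 5.7x

# Share of total system cost that is the enclosure itself.
full = enclosure / (enclosure + 2 * sun_sixpack)     # twelve Sun drives
half = enclosure / (enclosure + sun_sixpack)         # six Sun drives
print(f"enclosure share: {full:.0%} full, {half:.0%} half full")
```

This reproduces the post's 12% figure for a full box, and about 21-22% when half full.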
Re: [zfs-discuss] Pool import with failed ZIL device now possible ?
Darren J Moffat writes: > That disables the ZIL for *all* datasets on *all* pools on the system. > Doing this means that for NFS client or other applications (maybe > local) that rely on the POSIX synchronous requirements of fsync they > may see data loss on a crash. Note that the ZFS pool is still > consistent on disk but the application that is flushing writes > synchronously may have missing data on recovery from the crash. > > NEVER turn off the ZIL other than for testing on dummy data whether or > not the ZIL is your bottleneck. NEVER turn off the ZIL on live data > pools. I think that is a bit too strong, I'd say "NEVER turn off the ZIL when external clients depend on stable storage promises". that is, don't turn ZIL off when your server is used for CIFS, NFS, iSCSI or even services on a higher level, such as a web server running a storefront, where customers expect a purchase to be followed through... I've disabled ZIL on my server which is running our backup software since there are no promises made to external clients. the backup jobs will be aborted, and the PostgreSQL database may lose a few transactions after the reboot, but I won't lose more than 30 seconds of stored data. this is quite unproblematic, especially compared to the disruption of the crash itself. (in my setup I potentially have to manually invalidate any backup jobs which finished within the last 30 seconds of the crash. this is due to the database running on a different storage pool than the backup data, so the point in time for the commitment of backup data and backup metadata to stable storage may/will diverge. "luckily" the database transactions usually take 30+ seconds, so this is not a problem in practice...) of course, doing this analysis of your software requires in-depth knowledge of both the software stack and the workings of ZFS, so I can understand that Sun^H^H^HOracle employees stick to the simple advice "NEVER disable ZIL". -- Kjetil T.
Homme Redpill Linpro AS - Changing the game
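For reference, the only ZIL switch available at the time was the system-wide tunable, which is why the warning above is phrased in terms of all datasets on all pools; a sketch of the /etc/system entry from that era's tuning guides (check your release before relying on it):

```
* /etc/system -- disables the ZIL for *all* datasets in *all* pools.
* Takes effect at the next reboot; exactly the blunt instrument the
* warning above is about, with no per-dataset granularity.
set zfs:zil_disable = 1
```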
Re: [zfs-discuss] list new files/activity monitor
"Nilsen, Vidar" writes: > And once an hour I run a script that checks for new dirs last 60 > minutes matching some criteria, and outputs the path to an > IRC-channel. Where we can see if someone else has added new stuff. > > Method used is "find -mmin -60", which gets horrible slow when more > data is added. > > My question is if there exists some method I can get the same results > but based on events rather than seeking through everything once an > hour. yes, File Events Notification (FEN) http://blogs.sun.com/praks/entry/file_events_notification you access this through the event port API. http://developers.sun.com/solaris/articles/event_completion.html gnome-vfs uses FEN, but unfortunately gnomevfs-monitor will only watch a specific directory. I think you'll need to write your own code to watch all directories in a tree. -- Kjetil T. Homme Redpill Linpro AS - Changing the game
Re: [zfs-discuss] How to get a list of changed files between two snapshots?
Frank Cusack writes: > On 2/4/10 8:00 AM +0100 Tomas Ögren wrote: >> The "find -newer blah" suggested in other posts won't catch newer >> files with an old timestamp (which could happen for various reasons, >> like being copied with kept timestamps from somewhere else). > > good point. that is definitely a restriction with find -newer. but > if you meet that restriction, and don't need to find added or deleted > files, it will be faster since only 1 directory tree has to be walked. FWIW, GNU find has -cnewer -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
matthew patton writes: > true. but I buy a Ferrari for the engine and bodywork and chassis > engineering. It is totally criminal what Sun/EMC/Dell/Netapp do > charging customers 10x the open-market rate for standard drives. A > RE3/4 or NS drive is the same damn thing no matter if I buy it from > ebay or my local distributor. Dell/Sun/Netapp buy drives by the > container load. Oh sure, I don't mind paying an extra couple > pennies/GB for all the strenuous efforts the vendors spend on firmware > verification (HA!). I don't know what the J4500 drive sled contains, but for the J4200 and J4400 they need to include quite a bit of circuitry to handle SAS protocol all the way, for multipathing and to be able to accept a mix of SAS and SATA drives. it's not just a piece of sheet metal, some plastic and a LED. the pricing does look strange, and I think it would be better to raise the price of the enclosure (which is silly cheap when empty IMHO) and reduce the drive prices somewhat. but that's just psychology, and doesn't really matter for total cost. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 3ware 9650 SE
Alexandre MOREL writes: > It's a few days now that I try to use a 9650SE 3ware controller to work > on opensolaris and I found the following problem: the tw driver seems > to work, I can see my controller with the tw_cli of 3ware. I can see > that 2 drives are created with the controller, but when I try to use > "pfexec format", it doesn't detect the drive. did you create logical devices using tw_cli? a pity none of these cards seem to support proper JBOD mode. in 9650's case it's especially bad, since pulling a drive will *renumber* the logical devices without notifying the OS. quite scary! if you've created the devices, try running devfsadm once more. -- Kjetil T. Homme Redpill Linpro AS - Changing the game
Re: [zfs-discuss] ZFS compressed ration inconsistency
antst writes: > I'm more than happy by the fact that data consumes even less physical > space on storage. But I want to understand why and how. And want to > know to what numbers I can trust. my guess is sparse files. BTW, I think you should compare the size returned from "du -bx" with "refer", not "used". in this case it's not snapshots which makes the difference, though. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 3ware 9650 SE
Tiernan O'Toole writes: > looking at the 3ware 9650 SE raid controller for a new build... anyone > have any luck with this card? their site says they support > OpenSolaris... anyone used one? didn't work too well for me. it's fast and nice for a couple of days, then the driver gets slower and slower, and eventually it gets stuck and all I/O freezes. preventive reboots were needed. I used the newest driver from 3ware/AMCC with 2008.11 and 2009.05. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is LSI SAS3081E-R suitable for a ZFS NAS ?
Mark Bennett writes: > Update: > > For the WD10EARS, the blocks appear to be aligned on the 4k boundary > when zfs uses the whole disk (whole disk as EFI partition). >
>   Part  Tag  Flag  First Sector  Size      Last Sector
>   0     usr  wm    256           931.51Gb  1953508750
>
> calc: 256*512/4096 = 32
I'm afraid this isn't enough. if you enable compression, any ZFS write can be unaligned. also, if you're using raid-z with an odd number of data disks, some of (most of?) your stripes will be misaligned. ZFS needs to use 4096 octets as the basic block to fully exploit the performance of these disks. -- Kjetil T. Homme Redpill Linpro AS - Changing the game
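The two objections can be made concrete with a little arithmetic (a simplified sketch: 512-octet sectors, 4096-octet physical blocks as on the WD10EARS, and a naive model of how raidz splits a record across data disks):

```python
SECTOR = 512
PHYS = 4096

# 1) The EFI partition itself is aligned: it starts at sector 256,
#    which is the "256*512/4096 = 32" calculation quoted above.
start = 256 * SECTOR
assert start % PHYS == 0

# 2) But with compression, ZFS writes variable-sized records, so a
#    write can be any multiple of 512 octets -- not necessarily 4096.
compressed_sizes = [512 * n for n in range(1, 9)]
aligned = [s for s in compressed_sizes if s % PHYS == 0]
print(aligned)                    # -> [4096]: only one size in eight is safe

# 3) With raidz over an odd number of data disks, a 128 KiB record is
#    split into columns whose sizes need not be multiples of 4096.
data_disks = 3
col = (128 * 1024) // data_disks  # 43690 octets per column
print(col % PHYS)                 # -> 2730, i.e. not 4k-aligned
```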
Re: [zfs-discuss] Building big cheap storage system. What hardware to use?
Freddie Cash writes: > We use the following for our storage servers: > [...] > 3Ware 9650SE PCIe RAID controller (12-port, multi-lane) > [...] > Fully supported by FreeBSD, so everything should work with > OpenSolaris. FWIW, I've used the 9650SE with 16 ports in OpenSolaris 2008.11 and 2009.06, and had problems with the driver just hanging after 4-5 days of use. iostat would report 100% busy on all drives connected to the card, and even "uadmin 1 1" (low-level reboot command) was ineffective. I had to break into the debugger and do the reboot from there. I was using the newest driver from AMCC. -- Kjetil T. Homme Redpill Linpro AS - Changing the game
Re: [zfs-discuss] zero out block / sectors
Mike Gerdts writes: > Kjetil Torgrim Homme wrote: >> Mike Gerdts writes: >> >>> John Hoogerdijk wrote: >>>> Is there a way to zero out unused blocks in a pool? I'm looking for >>>> ways to shrink the size of an opensolaris virtualbox VM and using the >>>> compact subcommand will remove zero'd sectors. >>> >>> I've long suspected that you should be able to just use mkfile or "dd >>> if=/dev/zero ..." to create a file that consumes most of the free >>> space then delete that file. Certainly it is not an ideal solution, >>> but seems quite likely to be effective. >> >> you'll need to (temporarily) enable compression for this to have an >> effect, AFAIK. >> >> (dedup will obviously work, too, if you dare try it.) > > You are missing the point. Compression and dedup will make it so that > the blocks in the devices are not overwritten with zeroes. The goal > is to overwrite the blocks so that a back-end storage device or > back-end virtualization platform can recognize that the blocks are not > in use and as such can reclaim the space. aha, I was assuming the OP's VirtualBox image was stored on ZFS, but I realise now that it's the other way around -- he's running ZFS inside a VirtualBox image hosted on a traditional filesystem. in that case you're right, and I'm wrong :-) -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zero out block / sectors
Mike Gerdts writes: > John Hoogerdijk wrote: >> Is there a way to zero out unused blocks in a pool? I'm looking for >> ways to shrink the size of an opensolaris virtualbox VM and using the >> compact subcommand will remove zero'd sectors. > > I've long suspected that you should be able to just use mkfile or "dd > if=/dev/zero ..." to create a file that consumes most of the free > space then delete that file. Certainly it is not an ideal solution, > but seems quite likely to be effective. you'll need to (temporarily) enable compression for this to have an effect, AFAIK. (dedup will obviously work, too, if you dare try it.) -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
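A hedged sketch of the zero-fill trick (shown here against a scratch temp directory with a size cap so it is safe to run anywhere; in real use you would point it at the guest filesystem and let dd run until the filesystem is full):

```shell
# Sketch: fill free space with zeroes, then delete the file.  The freed
# blocks now contain zeroes, which the host-side compaction tool
# (e.g. VBoxManage's compact operation) can reclaim.
FS=$(mktemp -d)      # stand-in for the guest filesystem's mountpoint
dd if=/dev/zero of="$FS/zerofile" bs=1M count=8 2>/dev/null
# real use: omit count= and let dd run until it hits ENOSPC
sync
rm "$FS/zerofile"
rmdir "$FS"
```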
Re: [zfs-discuss] optimise away COW when rewriting the same data?
David Magda writes: > On Jan 24, 2010, at 10:26, Kjetil Torgrim Homme wrote: > >> But it occured to me that this is a special case which could be >> beneficial in many cases -- if the filesystem uses secure checksums, >> it could check the existing block pointer and see if the replaced >> data matches. [...] >> >> Are there any ZFS hackers who can comment on the feasibility of this >> idea? > > There is a bug that requests an API in ZFS' DMU library to get > checksum data: > > 6856024 - DMU checksum API > http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6856024 That would work, but it would require rsync to do checksum calculations itself to do the comparison. Then ZFS would recalculate the checksum if the data was actually written, so it's wasting work for local copies. It would be interesting to extend the rsync protocol to take advantage of this, though, so that the checksum can be calculated on the remote host. H... It would need very ZFS specific support, e.g., the recordsize is potentially different for each file, likewise for the checksum algorithm. Fixing a library seems easier than patching the kernel, so your approach is probably better anyhow. > It specifically mentions Lustre, and not anything like the ZFS POSIX > interface to files (ZPL). There's also: > >> Here's another: file comparison based on values derived from files' >> checksum or dnode block pointer. This would allow for very efficient >> file comparison between filesystems related by cloning. Such values >> might be made available through an extended attribute, say. > > http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6366224 > > It's been brought up before on zfs-discuss: the two options would be > linking against some kind of ZFS-specific library, or using an ioctl() > of some kind. As it stands, ZFS is really the only mainstream(-ish) > file system that does checksums, and so there's no standard POSIX call > for such things. 
Perhaps as more file systems add this functionality > something will come of it. The checksum algorithms need to be very strictly specified. Not a problem for sha256, I guess, but fletcher4 probably doesn't have independent implementations which are 100% compatible with ZFS -- and GPL-licensed (needed for rsync and many other applications). -- Kjetil T. Homme Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Best 1.5TB drives for consumer RAID?
Tim Cook writes: > On Sat, Jan 23, 2010 at 7:57 PM, Frank Cusack wrote: > > I mean, just do a triple mirror of the 1.5TB drives rather than > say (6) .5TB drives in a raidz3. > > I bet you'll get the same performance out of 3x1.5TB drives you get > out of 6x500GB drives too. no, it will be much better. you get 3 independent disks available for reading, so 3x the IOPS. in a 6x500 GB setup all disks will need to operate in tandem, for both reading and writing. even if the larger disks are slower than small disks, the difference is not even close to such a factor. perhaps 20% fewer IOPS? > Are you really trying to argue people should never buy anything but > the largest drives available? I don't think that's Frank's point. the key here is the advantages of a (triple) mirroring over RAID-Z. it just so happens that it makes economic sense. (think about savings in power, too.) -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
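The IOPS argument can be put in a toy model (the numbers are illustrative, not measurements; the assumption that a raidz vdev serves roughly one random read at a time because every data disk participates in each record is a simplification):

```python
def random_read_iops(disks, per_disk_iops=100, layout="mirror"):
    # Crude model: each side of a mirror can serve an independent
    # random read, so throughput scales with the number of copies.
    # A raidz vdev reads a whole record across all data disks, so it
    # behaves roughly like a single disk for small random reads.
    return disks * per_disk_iops if layout == "mirror" else per_disk_iops

print(random_read_iops(3))                   # 3-way mirror: 300
print(random_read_iops(6, layout="raidz"))   # 6-disk raidz: 100
```

Even if the larger disks do 20% fewer IOPS, a 3x factor from mirroring dominates, which is the point made above.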
[zfs-discuss] optimise away COW when rewriting the same data?
I was looking at the performance of using rsync to copy some large files which change only a little between each run (database files). I take a snapshot after every successful run of rsync, so when using rsync --inplace, only changed portions of the file will occupy new disk space. Unfortunately, performance wasn't too good, the source server in question simply didn't have much CPU to perform the rsync delta algorithm, and in addition it creates read I/O load on the destination server. So I had to switch it off and transfer the whole file instead. In this particular case, that means I need 120 GB to store each run rather than 10, but that's the way it goes. If I had enabled deduplication, this would be a moot point, dedup would take care of it for me. Judging from early reports my server will probably not have the required oomph to handle it, so I'm holding off until I get to replace it with a server with more RAM and CPU. But it occured to me that this is a special case which could be beneficial in many cases -- if the filesystem uses secure checksums, it could check the existing block pointer and see if the replaced data matches. (Due to the (infinitesimal) potential for hash collisions this should be configurable the same way it is for dedup.) In essence, rsync's writes would become no-ops, and very little CPU would be wasted on either side of the pipe. Even in the absence of snapshots, this would leave the filesystem less fragmented, since the COW is avoided. This would be a win-win if the ZFS pipeline can communicate the correct information between layers. Are there any ZFS hackers who can comment on the feasibility of this idea? -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
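The proposed optimisation is easy to sketch in user space (a toy block store, not ZFS code; in the real proposal the comparison would happen inside the write pipeline against the block pointer's existing checksum):

```python
import hashlib

class BlockStore:
    """Toy COW store: write() skips blocks whose sha256 already matches."""
    def __init__(self):
        self.blocks = {}     # block number -> data
        self.sums = {}       # block number -> sha256 of data
        self.cow_writes = 0  # blocks that actually consumed new space

    def write(self, blkno, data):
        h = hashlib.sha256(data).digest()
        if self.sums.get(blkno) == h:
            return           # same content: no COW, no new space, no-op
        self.blocks[blkno] = data
        self.sums[blkno] = h
        self.cow_writes += 1

store = BlockStore()
store.write(0, b"unchanged")
store.write(1, b"old")
store.write(0, b"unchanged")   # rsync --inplace rewriting identical data
store.write(1, b"new")
print(store.cow_writes)        # 3: only the genuinely changed block costs space twice
```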
Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
Lutz Schumann writes: > Actually the performance decrease when disableing the write cache on > the SSD is aprox 3x (aka 66%). for this reason, you want a controller with battery backed write cache. in practice this means a RAID controller, even if you don't use the RAID functionality. of course you can buy SSDs with capacitors, too, but I think that will be more expensive, and it will restrict your choice of model severely. (BTW, thank you for testing forceful removal of power. the result is as expected, but it's good to see that theory and practice match.) -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz stripe size (not stripe width)
Brad writes: > Hi Adam, I'm not Adam, but I'll take a stab at it anyway. BTW, your crossposting is a bit confusing to follow, at least when using gmane.org. I think it is better to stick to one mailing list anyway? > From your the picture, it looks like the data is distributed evenly > (with the exception of parity) across each spindle then wrapping > around again (final 4K) - is this one single write operation or two? it is a single write operation per device. actually, it may be "less than" one write operation since the transaction group, which probably contains many more updates, is written as a whole. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedup existing data
Anil writes: > If you have another partition with enough space, you could technically > just do: > > mv src /some/other/place > mv /some/other/place src > > Anyone see a problem with that? Might be the best way to get it > de-duped. I get uneasy whenever I see mv(1) used to move directory trees between filesystems, that is, whenever mv(1) can't do a simple rename(2), but has to do a recursive copy of files. it is essentially not restartable, if mv(1) is interrupted, you must clean up the mess with rsync or similar tools. so why not use rsync from the get go? (or zfs send/recv of course.) -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
Darren J Moffat writes: > Kjetil Torgrim Homme wrote: > >> I don't know how tightly interwoven the dedup hash tree and the block >> pointer hash tree are, or if it is all possible to disentangle them. > > At the moment I'd say very interwoven by design. > >> conceptually it doesn't seem impossible, but that's easy for me to >> say, with no knowledge of the zio pipeline... > > Correct it isn't impossible but instead there would probably need to > be two checksums held, one of the untransformed data (ie uncompressed > and unencrypted) and one of the transformed data (compressed and > encrypted). That has different tradeoffs and SHA256 can be expensive > too see: > > http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via great work! SHA256 is more expensive than I thought, even with misc/sha2 it takes 1 ms per 128 KiB? that's roughly the same CPU usage as lzjb! in that case hashing the (smaller) compressed data is more efficient than doing an additional hash of the full uncompressed block. it's interesting to note that 64 KiB looks faster (a bit hard to read the chart accurately), L1 cache size coming into play, perhaps? > Note also that the compress/encrypt/checksum and the dedup are > separate pipeline stages so while dedup is happening for block N block > N+1 can be getting transformed - so this is designed to take advantage > of multiple scheduling units (threads,cpus,cores etc). nice. are all of them separate stages, or are compress/encrypt/checksum done as one stage? >> oh, how does encryption play into this? just don't? knowing that >> someone else has the same block as you is leaking information, but that >> may be acceptable -- just make different pools for people you don't >> trust. > > compress, encrypt, checksum, dedup. 
> > You are correct that it is an information leak but only within a > dataset and its clones and only if you can observe the deduplication > stats (and you need to use zdb to get enough info to see the leak - > and that means you have access to the raw devices), the dedupratio > isn't really enough unless the pool is really idle or has only one > user writing at a time. > > For the encryption case deduplication of the same plaintext block will > only work within a dataset or a clone of it - because only in those > cases do you have the same key (and the way I have implemented the IV > generation for AES CCM/GCM mode ensures that the same plaintext will > have the same IV so the ciphertexts will match). makes sense. > Also if you place a block in an unencrypted dataset that happens to > match the ciphertext in an encrypted dataset they won't dedup either > (you need to understand what I've done with the AES CCM/GCM MAC and > the zio_chksum_t field in the blkptr_t and how that is used by dedup > to see why). wow, I didn't think of that problem. did you get bitten by wrongful dedup during testing with image files? :-) > If that small information leak isn't acceptable even within the > dataset then don't enable both encryption and deduplication on those > datasets - and don't delegate that property to your users either. Or > you can frequently rekey your per dataset data encryption keys 'zfs > key -K' but then you might as well turn dedup off - though there are > some very good usecases in multi level security where doing > dedup/encryption and rekey provides a nice effect. indeed. ZFS is extremely flexible. thank you for your response, it was very enlightening. -- Kjetil T. Homme Redpill Linpro AS - Changing the game
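The roughly 1 ms per 128 KiB figure from the linked post is easy to re-measure on any machine (results vary by CPU; this just times Python's hashlib over recordsize-like buffers, including the 64 KiB size where the L1 cache effect was suspected):

```python
import hashlib, os, time

def mb_per_s(block_size, total=64 * 1024 * 1024):
    """Hash `total` octets in `block_size` chunks and report throughput."""
    buf = os.urandom(block_size)
    n = total // block_size
    t0 = time.perf_counter()
    for _ in range(n):
        hashlib.sha256(buf).digest()
    dt = time.perf_counter() - t0
    return total / dt / 1e6

for bs in (64 * 1024, 128 * 1024):
    print(f"{bs // 1024} KiB blocks: {mb_per_s(bs):.0f} MB/s")
```

1 ms per 128 KiB corresponds to about 131 MB/s, so this gives a direct feel for whether the figure, and the comparison to lzjb's cost, holds on your hardware.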
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
Darren J Moffat writes: > Kjetil Torgrim Homme wrote: >> Andrey Kuzmin writes: >> >>> Downside you have described happens only when the same checksum is >>> used for data protection and duplicate detection. This implies sha256, >>> BTW, since fletcher-based dedupe has been dropped in recent builds. >> >> if the hash used for dedup is completely separate from the hash used >> for data protection, I don't see any downsides to computing the dedup >> hash from uncompressed data. why isn't it? > > It isn't separate because that isn't how Jeff and Bill designed it. thanks for confirming that, Darren. > I think the design they have is great. I don't disagree. > Instead of trying to pick holes in the theory can you demonstrate a > real performance problem with compression=on and dedup=on and show > that it is because of the compression step ? compression requires CPU, actually quite a lot of it. even with the lean and mean lzjb, you will get not much more than 150 MB/s per core or something like that. so, if you're copying a 10 GB image file, it will take a minute or two, just to compress the data so that the hash can be computed so that the duplicate block can be identified. if the dedup hash was based on uncompressed data, the copy would be limited by hashing efficiency (and dedup tree lookup). I don't know how tightly interwoven the dedup hash tree and the block pointer hash tree are, or if it is at all possible to disentangle them. conceptually it doesn't seem impossible, but that's easy for me to say, with no knowledge of the zio pipeline... oh, how does encryption play into this? just don't? knowing that someone else has the same block as you is leaking information, but that may be acceptable -- just make different pools for people you don't trust. > Otherwise if you want it changed code it up and show how what you have > done is better in all cases. I wish I could :-) -- Kjetil T.
Homme Redpill Linpro AS - Changing the game
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
Andrey Kuzmin writes: > Downside you have described happens only when the same checksum is > used for data protection and duplicate detection. This implies sha256, > BTW, since fletcher-based dedupe has been dropped in recent builds. if the hash used for dedup is completely separate from the hash used for data protection, I don't see any downsides to computing the dedup hash from uncompressed data. why isn't it? -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
Andrey Kuzmin writes: > Darren J Moffat wrote: >> Andrey Kuzmin wrote: >>> Resilvering has nothing to do with sha256: one could resilver long >>> before dedupe was introduced in zfs. >> >> SHA256 isn't just used for dedup it is available as one of the >> checksum algorithms right back to pool version 1 that integrated in >> build 27. > > 'One of' is the key word. And thanks for code pointers, I'll take a > look. I didn't mention sha256 at all :-). the reasoning is the same no matter what hash algorithm you're using (fletcher2, fletcher4 or sha256). dedup doesn't require sha256 either, you can use fletcher4. the question was: why does data have to be compressed before it can be recognised as a duplicate? it does seem like a waste of CPU, no? I attempted to show the downsides to identifying blocks by their uncompressed hash. (BTW, it doesn't affect storage efficiency, the same duplicate blocks will be discovered either way.) -- Kjetil T. Homme Redpill Linpro AS - Changing the game
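The storage-efficiency point in the last parenthesis can be demonstrated directly: hashing before or after compression yields the same set of duplicate blocks, only the CPU cost differs (a sketch with zlib standing in for lzjb and a set standing in for the dedup table):

```python
import hashlib, zlib

blocks = [b"A" * 8192, b"B" * 8192, b"A" * 8192]   # two logical duplicates

def unique_blocks(blocks, hash_compressed):
    """Count distinct dedup-table entries under either hashing scheme."""
    table = set()
    for b in blocks:
        key_src = zlib.compress(b) if hash_compressed else b
        table.add(hashlib.sha256(key_src).digest())
    return len(table)

# Identical plaintext compresses to identical bytes, so both schemes
# find the same duplicates.  The difference: hashing uncompressed data
# lets you detect the duplicate *before* spending CPU on compression.
print(unique_blocks(blocks, hash_compressed=True))    # 2
print(unique_blocks(blocks, hash_compressed=False))   # 2
```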
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
Andrey Kuzmin writes: > Yet again, I don't see how RAID-Z reconstruction is related to the > subject discussed (what data should be sha256'ed when both dedupe and > compression are enabled, raw or compressed). sha256 has nothing to do > with bad block detection (maybe it will when encryption is > implemented, but for now sha256 is used for duplicate candidates > look-up only). how do you think RAID-Z resilvering works? please correct me where I'm wrong. -- Kjetil T. Homme Redpill Linpro AS - Changing the game
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
Andrey Kuzmin writes: > Kjetil Torgrim Homme wrote: >> for some reason I, like Steve, thought the checksum was calculated on >> the uncompressed data, but a look in the source confirms you're right, >> of course. >> >> thinking about the consequences of changing it, RAID-Z recovery would be >> much more CPU intensive if hashing was done on uncompressed data -- > > I don't quite see how dedupe (based on sha256) and parity (based on > crc32) are related. I tried to hint at an explanation: >> every possible combination of the N-1 disks would have to be >> decompressed (and most combinations would fail), and *then* the >> remaining candidates would be hashed to see if the data is correct. the key is that you don't know which block is corrupt. if everything is hunky-dory, the parity will match the data. parity in RAID-Z1 is not a checksum like CRC32, it is simply XOR (like in RAID 5). here's an example with four data disks and one parity disk:

  D1  D2  D3  D4  PP
  00  01  10  10  01

this is a single stripe with 2-bit disk blocks for simplicity. if you XOR together all the blocks, you get 00. that's the simple premise for reconstruction -- D1 = XOR(D2, D3, D4, PP), D2 = XOR(D1, D3, D4, PP) and so on. so what happens if a bit flips in D4 and it becomes 00? the total XOR isn't 00 anymore, it is 10 -- something is wrong. but unless you get a hardware signal from D4, you don't know which block is corrupt. this is a major problem with RAID 5, the data is irrevocably corrupt. the parity discovers the error, and can alert the user, but that's the best it can do. in RAID-Z the hash saves the day: first *assume* D1 is bad and reconstruct it from parity. if the hash for the block is OK, D1 *was* bad. otherwise, assume D2 is bad. and so on. so, the parity calculation will indicate which stripes contain bad blocks. but the hashing, the sanity check that tells you which disk blocks are actually bad, must be calculated over all the stripes a ZFS block (record) consists of.
>> this would be done on a per recordsize basis, not per stripe, which >> means reconstruction would fail if two disk blocks (512 octets) on >> different disks and in different stripes go bad. (doing an exhaustive >> search for all possible permutations to handle that case doesn't seem >> realistic.) actually this is the same for compression before/after hashing. it's just that each permutation is more expensive to check. >> in addition, hashing becomes slightly more expensive since more data >> needs to be hashed. >> >> overall, my guess is that this choice (made before dedup!) will give >> worse performance in normal situations in the future, when dedup+lzjb >> will be very common, at a cost of faster and more reliable resilver. in >> any case, there is not much to be done about it now. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
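The reconstruction walk-through above can be written out as code, using the same example stripe (the 2-bit values as whole bytes) and sha256 standing in for the block checksum -- a toy sketch with one stripe per record:

```python
import hashlib
from functools import reduce

def xor(blocks):
    """XOR a list of equal-length byte strings together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"\x00", b"\x01", b"\x02", b"\x02"]         # D1..D4 (00 01 10 10)
parity = xor(data)                                  # PP = 01
good_hash = hashlib.sha256(b"".join(data)).digest() # checksum stored in blkptr

def reconstruct(disks, parity, good_hash):
    """Assume each disk bad in turn; keep the candidate whose hash checks out."""
    for i in range(len(disks)):
        fixed_i = xor(disks[:i] + disks[i+1:] + [parity])
        candidate = disks[:i] + [fixed_i] + disks[i+1:]
        if hashlib.sha256(b"".join(candidate)).digest() == good_hash:
            return i, candidate
    raise ValueError("more than one bad block: unrecoverable")

corrupt = list(data)
corrupt[3] = b"\x00"                                # the bit flip on D4
bad_disk, fixed = reconstruct(corrupt, parity, good_hash)
print(bad_disk, fixed == data)                      # 3 True
```

With compression in the picture, each candidate would additionally have to be decompressed before hashing, which is the extra cost discussed above.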
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
Robert Milkowski writes: > On 13/12/2009 20:51, Steve Radich, BitShop, Inc. wrote: >> Because if you can de-dup anyway why bother to compress THEN check? >> This SEEMS to be the behaviour - i.e. I would suspect many of the >> files I'm writing are dups - however I see high cpu use even though >> on some of the copies I see almost no disk writes. > > First, the checksum is calculated after compression happens. for some reason I, like Steve, thought the checksum was calculated on the uncompressed data, but a look in the source confirms you're right, of course. thinking about the consequences of changing it, RAID-Z recovery would be much more CPU intensive if hashing was done on uncompressed data -- every possible combination of the N-1 disks would have to be decompressed (and most combinations would fail), and *then* the remaining candidates would be hashed to see if the data is correct. this would be done on a per recordsize basis, not per stripe, which means reconstruction would fail if two disk blocks (512 octets) on different disks and in different stripes go bad. (doing an exhaustive search for all possible permutations to handle that case doesn't seem realistic.) in addition, hashing becomes slightly more expensive since more data needs to be hashed. overall, my guess is that this choice (made before dedup!) will give worse performance in normal situations in the future, when dedup+lzjb will be very common, at a cost of faster and more reliable resilver. in any case, there is not much to be done about it now. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Confusion regarding 'zfs send'
Brandon High writes:

> Matthew Ahrens wrote:
>> Well, changing the "compression" property doesn't really interrupt
>> service, but I can understand not wanting to have even a few blocks
>> with the "wrong"
>
> I was thinking of sharesmb or sharenfs settings when I wrote that.
> Toggling them for the send would interrupt any clients trying to use
> the resource. Obviously disabling compression or dedup on the source
> doesn't interrupt service

you can avoid the problem by making sure the target filesystem isn't
mounted, i.e., by setting canmount=noauto on the source. it's a bit
ugly, since you'll get an outage if the source server reboots before
you set it back.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] will deduplication know about old blocks?
Adam Leventhal writes:

> Unfortunately, dedup will only apply to data written after the setting
> is enabled. That also means that new blocks cannot dedup against old
> block regardless of how they were written. There is therefore no way
> to "prepare" your pool for dedup -- you just have to enable it when
> you have the new bits.

thank you for the clarification!

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
[zfs-discuss] will deduplication know about old blocks?
I'm planning to try out deduplication in the near future, but started
wondering if I can prepare for it on my servers. one thing which struck
me was that I should change the checksum algorithm to sha256 as soon as
possible. but I wonder -- is that sufficient? will the dedup code know
about old blocks when I store new data?

let's say I have an existing file img0.jpg. I turn on dedup, and copy
it twice, to img0a.jpg and img0b.jpg. will all three files refer to the
same block(s), or will only img0a and img0b share blocks?

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
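as Adam Leventhal's reply earlier in this digest confirms, the dedup
table only learns about blocks written while dedup is on. a toy sketch
of that bookkeeping (a plain dict standing in for the DDT, nothing like
the real on-disk structure) shows why img0 stays separate:

```python
import hashlib

ddt = {}             # dedup table: block checksum -> block address
disk = []            # stands in for allocated blocks
dedup_enabled = False

def write_block(data):
    key = hashlib.sha256(data).digest()
    if dedup_enabled and key in ddt:
        return ddt[key]        # new write matches an existing DDT entry
    addr = len(disk)
    disk.append(data)
    if dedup_enabled:
        ddt[key] = addr        # only dedup-era writes enter the table
    return addr

img0 = write_block(b"jpeg bytes")    # written before dedup: never in DDT
dedup_enabled = True
img0a = write_block(b"jpeg bytes")   # no DDT hit, allocates a fresh block
img0b = write_block(b"jpeg bytes")   # hits img0a's entry: blocks shared
```

so only img0a and img0b end up sharing blocks; img0's pre-dedup block
is invisible to the table even though its contents are identical.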
Re: [zfs-discuss] Accidentally added disk instead of attaching
Daniel Carosone writes:

>>> Not if you're trying to make a single disk pool redundant by adding
>>> .. er, attaching .. a mirror; then there won't be such a warning,
>>> however effective that warning might or might not be otherwise.
>>
>> Not a problem because you can then detach the vdev and add it.
>
> It's a problem if you're trying to do that, but end up adding instead
> of attaching, which you can't (yet) undo.

at least in that case the amount of data shuffling you have to do is
limited to one disk (it's unlikely you'd make this mistake with a
multi-device vdev). in any case, the block rewrite implementation isn't
*that* far away, is it?

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
[zfs-discuss] nodiratime support in ZFS?
I was catching up on old e-mail on this list, and came across a blog
posting from Henrik Johansson:

http://sparcv9.blogspot.com/2009/10/curious-case-of-strange-arc.html

it tells of his woes with a fragmented /var/pkg/downloads combined with
atime updates. I see the same problem on my servers, e.g.

  $ time du -s /var/pkg/download
  1614308 /var/pkg/download

  real    11m50.682s

  $ time du -s /var/pkg/download
  1614308 /var/pkg/download

  real    12m03.395s

on this server, increasing arc_meta_limit wouldn't help, but I think a
newer kernel would be more aggressive (this is 2008.11).

  arc_meta_used  =  262 MB
  arc_meta_limit = 2812 MB
  arc_meta_max   =  335 MB

turning off atime helps:

  real    8m06.563s

in this test case, running du(1), turning off atime altogether isn't
really needed, it would suffice to turn off atime updates on
directories. in Linux, this can be achieved with the mount option
"nodiratime". if ZFS had it, I guess it would be a new value for the
atime property, "nodir" or somesuch.

I quite often find it useful to have access to atime information to see
if files have been read, for forensic purposes, for debugging, etc., so
I am loath to turn it off. however, atime on directories can hardly
ever be used for anything -- you have to take really good care not to
trigger an update just checking the atime, and even if you do get a
proper reading, there are so many tree traversing utilities that the
information value is low.

it is quite unlikely that any applications break in a nodiratime mode,
and few people should have any qualms enabling it. Santa, are you
listening? :-)

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128
Daniel Carosone writes:

>> you can fetch the "cr_txg" (cr for creation) for a
>> snapshot using zdb,
>
> yes, but this is hardly an appropriate interface.

agreed.

> zdb is also likely to cause disk activity because it looks at many
> things other than the specific item in question.

I'd expect meta-information like this to fit comfortably in RAM over
extended amounts of time. haven't tried, though.

>> but the very creation of a snapshot requires a new
>> txg to note that fact in the pool.
>
> yes, which is exactly what we're trying to avoid, because it requires
> disk activity to write.

you missed my point: you can't compare the current txg to an old cr_txg
directly, since the current txg value will be at least 1 higher, even
if no changes have been made.

>> if the snapshot is taken recursively, all snapshots will have the
>> same cr_txg, but that requires the same configuration for all
>> filesets.
>
> again, yes, but that's irrelevant - the important knowledge at this
> moment is that the txg has not changed since last time, and that thus
> there will be no benefit in taking further snapshots, regardless of
> configuration.

yes, that's what we're trying to establish, and it's easier when all
snapshots are committed in the same txg.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128
Daniel Carosone writes:

>> I don't think it is easy to do, the txg counter is on
>> a pool level,
>> [..]
>> it would help when the entire pool is idle, though.
>
> .. which is exactly the scenario in question: when the disks are
> likely to be spun down already (or to spin down soon without further
> activity), and you want to avoid waking them up (or keeping them
> awake) with useless snapshot activity.

good point!

> However, this highlights that a (pool? fs?) property that exposes the
> current txg id (frozen in snapshots, as normal, if an fs property)
> might be enough for the userspace daemon to make its own decision to
> avoid requesting snapshots, without needing a whole discretionary
> mechanism in zfs itself.

you can fetch the "cr_txg" (cr for creation) for a snapshot using zdb,
but the very creation of a snapshot requires a new txg to note that
fact in the pool. if there are several filesystems to snapshot, you'll
get a sequence of cr_txg, and they won't be adjacent.

  # zdb tank/te...@snap1
  Dataset tank/te...@snap1 [ZVOL], ID 78, cr_txg 872401, 4.03G, 3 objects
  # zdb -u tank
  txg = 872402
  timestamp = 1259064201 UTC = Tue Nov 24 13:03:21 2009
  # sync
  # zdb -u tank
  txg = 872402
  # zfs snapshot tank/te...@snap1
  # zdb tank/te...@snap1
  Dataset tank/te...@snap1 [ZVOL], ID 80, cr_txg 872419, 4.03G, 3 objects
  # zdb -u tank
  txg = 872420
  timestamp = 1259064641 UTC = Tue Nov 24 13:10:41 2009

if the snapshot is taken recursively, all snapshots will have the same
cr_txg, but that requires the same configuration for all filesets.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
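a sketch of the decision such a userspace snapshot daemon could make
(hypothetical helper, but it captures the off-by-one visible in the
transcript above, where the snapshot's own cr_txg of 872419 is followed
by an uberblock txg of 872420 even though nothing else was written):

```python
def snapshot_would_be_empty(current_txg, last_snap_cr_txg):
    # Taking the previous snapshot itself consumed a txg, so the pool
    # has been idle if at most one txg elapsed since that snapshot.
    return current_txg <= last_snap_cr_txg + 1

idle = snapshot_would_be_empty(872420, 872419)   # idle pool: skip
busy = snapshot_would_be_empty(872500, 872419)   # writes happened: take it
```

comparing the txgs for equality, as one might naively do, would never
skip a snapshot, which is the point the message makes.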
Re: [zfs-discuss] zfs-raidz - simulate disk failure
sundeep dhall writes:

> Q) How do I simulate a sudden 1-disk failure to validate that zfs /
> raidz handles things well without data errors
>
> Options considered
> 1. suddenly pulling a disk out
> 2. using zpool offline
>
> I think both these have issues in simulating a sudden failure

why not take a look at what HP's test department is doing and fire a
round through the disk with a rifle?

oh, I guess that won't be a *simulation*.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128
Daniel Carosone writes:

> Would there be a way to avoid taking snapshots if they're going to be
> zero-sized?

I don't think it is easy to do, the txg counter is on a pool level,
AFAIK:

  # zdb -u spool
  Uberblock
          magic = 00bab10c
          version = 13
          txg = 1773324
          guid_sum = 16611641539891595281
          timestamp = 1258992244 UTC = Mon Nov 23 17:04:04 2009

it would help when the entire pool is idle, though.

> (posted here, rather than in response to the mailing list reference
> given, because I'm not subscribed [...]

ditto.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Basic question about striping and ZFS
Kjetil Torgrim Homme writes:

> Cindy Swearingen writes:
>> You might check the slides on this page:
>>
>> http://hub.opensolaris.org/bin/view/Community+Group+zfs/docs
>>
>> Particularly, slides 14-18.
>>
>> In this case, graphic illustrations are probably the best way
>> to answer your questions.
>
> thanks, Cindy. can you explain the meaning of the blocks marked X in
> the illustration on page 18?

I found the explanation in an older (2009-09-03) message to this list
from Adam Leventhal:

| RAID-Z writes full stripes every time; note that without careful
| accounting it would be possible to effectively fragment the vdev
| such that single sectors were free but useless since single-parity
| RAID-Z requires two adjacent sectors to store data (one for data,
| one for parity). To address this, RAID-Z rounds up its allocation to
| the next (nparity + 1). This ensures that all space is accounted
| for. RAID-Z will thus skip sectors that are unused based on this
| rounding. For example, under raidz1 a write of 1024 bytes would
| result in 512 bytes of parity, 512 bytes of data on two devices and
| 512 bytes skipped.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
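Adam's rounding rule can be worked through in a few lines. this is a
sketch assuming a small write that fits in a single stripe row, with
512-byte sectors, following the rule as quoted (round the total up to a
multiple of nparity + 1):

```python
SECTOR = 512

def raidz_alloc(write_bytes, nparity=1):
    """Sectors allocated for a small single-row RAID-Z write, per the
    rounding rule quoted above. Returns (allocated, skipped)."""
    data = -(-write_bytes // SECTOR)           # ceil(bytes / 512)
    total = data + nparity                     # data plus parity sectors
    multiple = nparity + 1
    rounded = -(-total // multiple) * multiple # round up to a multiple
    return rounded, rounded - total

# Adam's example: 1024 bytes under raidz1 ->
# 2 data sectors + 1 parity = 3, rounded up to 4, so 1 sector skipped
alloc, skipped = raidz_alloc(1024)
```

the skipped sectors are presumably the blocks marked X in the slide.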
Re: [zfs-discuss] Quick drive slicing madness question
Darren J Moffat writes:

> Mauricio Tavares wrote:
>> If I have a machine with two drives, could I create equal size slices
>> on the two disks, set them up as boot pool (mirror) and then use the
>> remaining space as a striped pool for other more wasteful
>> applications?
>
> You could but why bother ? Why not just create one mirrored pool.

you get half the space available... even if you don't forego redundancy
and use mirroring on both slices, you can't extend the data pool later.

> Having two pools on the same disk (or mirroring to the same disk) is
> asking for performance pain if both are being written to heavily.

not too common with heavy writing to rpool, is it? the main source of
writing is syslog, I guess.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] how to destroy a pool by id?
Cindy Swearingen writes:

> I wish we had a zpool destroy option like this:
>
> # zpool destroy -really_dead tank2

I think it would be clearer to call it

  zpool export --clear-name tank    # or 3280066346390919920

or alternatively,

  zpool destroy --exported 3280066346390919920

I guess the reason you don't want to allow operations on exported pools
is that they can be residing on shared storage.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] compression at zfs filesystem creation
"Monish Shah" writes:

>> I'd be interested to see benchmarks on MySQL/PostgreSQL performance
>> with compression enabled. my *guess* would be it isn't beneficial
>> since they usually do small reads and writes, and there is little
>> gain in reading 4 KiB instead of 8 KiB.
>
> OK, now you have switched from compressibility of data to
> performance advantage. As I said above, this kind of data usually
> compresses pretty well.

the thread has been about I/O performance since the first response, as
far as I can tell.

> I agree that for random reads, there wouldn't be any gain from
> compression. For random writes, in a copy-on-write file system,
> there might be gains, because the blocks may be arranged in
> sequential fashion anyway. We are in the process of doing some
> performance tests to prove or disprove this.
>
> Now, if you are using SSDs for this type of workload, I'm pretty
> sure that compression will help writes. The reason is that the
> flash translation layer in the SSD has to re-arrange the data and
> write it page by page. If there is less data to write, there will
> be fewer program operations.
>
> Given that write IOPS rating in an SSD is often much less than read
> IOPS, using compression to improve that will surely be of great
> value.

not necessarily, since a partial SSD write is much more expensive than
a full block write (128 KiB?). in a write intensive application, that
won't be an issue since the data is flowing steadily, but for the right
mix of random reads and writes, this may exacerbate the bottleneck.

> At this point, this is educated guesswork. I'm going to see if I
> can get my hands on an SSD to prove this.

that'd be great!

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
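the write saving Monish predicts is easy to put rough numbers on. this
is back-of-the-envelope only: hypothetical 4 KiB flash pages and an
idealized FTL that packs compressed data densely, ignoring the
partial-write and read-modify-write costs raised above.

```python
import math

def flash_program_ops(bytes_written, compress_ratio=1.0, page=4096):
    """Rough count of flash page-program operations for a write,
    assuming compressed data packs densely into pages (idealized)."""
    return math.ceil(bytes_written / compress_ratio / page)

# 1 MiB of 2x-compressible data: half the page programs
uncompressed = flash_program_ops(1 << 20)                     # 256 pages
compressed = flash_program_ops(1 << 20, compress_ratio=2.0)   # 128 pages
```

real SSDs won't hit this ideal, which is exactly why the proposed
measurement would be interesting.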
Re: [zfs-discuss] compression at zfs filesystem creation
"Fajar A. Nugraha" writes:

> Kjetil Torgrim Homme wrote:
>> indeed. I think only programmers will see any substantial benefit
>> from compression, since both the code itself and the object files
>> generated are easily compressible.
>
>>> Perhaps compressing /usr could be handy, but why bother enabling
>>> compression if the majority (by volume) of user data won't do
>>> anything but burn CPU?
>
> How do you define "substantial"? My opensolaris snv_111b installation
> has 1.47x compression ratio for "/", with the default compression.
> It's well worthed for me.

I don't really care if my "/" is 5 GB or 3 GB. how much faster is your
system operating? what's the compression rate on your data areas?

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] compression at zfs filesystem creation
"David Magda" writes:

> On Tue, June 16, 2009 15:32, Kyle McDonald wrote:
>
>> So the cache saves not only the time to access the disk but also
>> the CPU time to decompress. Given this, I think it could be a big
>> win.
>
> Unless you're in GIMP working on JPEGs, or doing some kind of MPEG
> video editing--or ripping audio (MP3 / AAC / FLAC) stuff. All of
> which are probably some of the largest files in most people's
> homedirs nowadays.

indeed. I think only programmers will see any substantial benefit from
compression, since both the code itself and the object files generated
are easily compressible.

> 1 GB of e-mail is a lot (probably my entire personal mail collection
> for a decade) and will compress well; 1 GB of audio files is
> nothing, and won't compress at all.
>
> Perhaps compressing /usr could be handy, but why bother enabling
> compression if the majority (by volume) of user data won't do
> anything but burn CPU?
>
> So the correct answer on whether compression should be enabled by
> default is "it depends". (IMHO :) )

I'd be interested to see benchmarks on MySQL/PostgreSQL performance
with compression enabled. my *guess* would be it isn't beneficial since
they usually do small reads and writes, and there is little gain in
reading 4 KiB instead of 8 KiB.

what other use cases can benefit from compression?

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
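the compressibility gap David describes is easy to demonstrate, with
zlib standing in for lzjb (which isn't in the Python standard library)
and random bytes standing in for already-compressed media files:

```python
import os
import zlib

text = b"From: user@example.com\nSubject: meeting notes\n" * 1000
random_like = os.urandom(len(text))   # stands in for JPEG/MP3 payloads

def ratio(data):
    """Compression ratio: original size over compressed size."""
    return len(data) / len(zlib.compress(data))

text_ratio = ratio(text)          # repetitive text: far above 1
media_ratio = ratio(random_like)  # high-entropy data: roughly 1
```

high-entropy input can even expand slightly once the container overhead
is added, so compressing media directories really is pure CPU burn.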
Re: [zfs-discuss] disabling showmount -e behaviour
Roman V Shaposhnik writes:

> I must admit that this question originates in the context of Sun's
> Storage 7210 product, which impose additional restrictions on the
> kind of knobs I can turn.
>
> But here's the question: suppose I have an installation where ZFS is
> the storage for user home directories. Since I need quotas, each
> directory gets to be its own filesystem. Since I also need these
> homes to be accessible remotely each FS is exported via NFS. Here's
> the question though: how do I prevent showmount -e (or a manually
> constructed EXPORT/EXPORTALL RPC request) to disclose a list of
> users that are hosted on a particular server?

I think the best you can do is to reject mount protocol requests coming
from "high" ports (1024+) in your firewall. this means you need root
privileges (or a more specific capability) on the client to fetch the
list.

another option is to make the usernames opaque and anonymous, e.g.,
"u4233".

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] Errors on mirrored drive
Frank Middleton writes:

> Exactly. My whole point. And without ECC there's no way of knowing.
> But if the data is damaged /after/ checksum but /before/ write, then
> you have a real problem...

we can't do much to protect ourselves from damage to the data itself
(an extra copy in RAM would help little and ruin performance). damage
to the bits holding the computed checksum before it is written can be
alleviated by doing the calculation independently for each written
copy. in particular, this will help if the bit error is transient.

since the number of octets in RAM holding the checksum is dwarfed by
the number of octets occupied by the data (256 bits vs. one mebibit for
a full default sized record), such a paranoia mode will most likely
tell you that the *data* is corrupt, not the checksum. but today you
don't know, so it's an improvement in my book.

> Quoting the ZFS admin guide: "The failmode property ... provides the
> failmode property for determining the behavior of a catastrophic
> pool failure due to a loss of device connectivity or the failure of
> all devices in the pool." Has this changed since the ZFS admin
> guide was last updated? If not, it doesn't seem relevant.

I guess checksum error handling is orthogonal to this and should have
its own property. it sure would be nice if the admin could ask the OS
to deliver the bits contained in a file, no matter what, and just log
the problem.

> Cheers -- Frank

thank you for pointing out this potential weakness in ZFS' consistency
checking, I didn't realise it was there. also thank you, all ZFS
developers, for your great job :-)

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
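the ratio argument in numbers: for a full 128 KiB record with a 256-bit
checksum, a random single-bit flip in RAM is about 4096 times more
likely to land in the data than in the checksum, so independent
per-copy checksumming would usually (and correctly) implicate the data.

```python
record_bits = 128 * 1024 * 8   # one full default-sized 128 KiB record
checksum_bits = 256            # e.g. SHA-256

# Probability that a uniformly random single-bit flip over the record
# plus its checksum hits the checksum rather than the data: ~0.024%
p_checksum_hit = checksum_bits / (record_bits + checksum_bits)

data_vs_checksum = record_bits // checksum_bits   # 4096:1 odds
```

this is the simplest possible model (one flip, uniform over the bits in
question), but it supports the claim that a paranoia mode would mostly
flag data corruption rather than checksum corruption.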