Re: [zfs-discuss] RFE: Un-dedup for unique blocks
IIRC dump is special. As for swap... really, you don't want to swap. If you're swapping you have problems. Any swap space you have is to help you detect those problems and correct them before apps start getting ENOMEM.

There *are* exceptions to this, such as Varnish. For Varnish and any other apps like it I'd dedicate an entire flash drive to it, no ZFS, no nothing.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
Bloom filters are very small; that's the difference. You might only need a few bits per block for a Bloom filter -- compare that to the size of a DDT entry. A Bloom filter could be cached entirely in main memory.
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
I've wanted a system where dedup applies only to blocks being written that have a good chance of being dups of others.

I think one way to do this would be to keep a scalable Bloom filter (on disk) into which one inserts block hashes. To decide if a block needs dedup, one would first check the Bloom filter; if the block is in it, use the dedup code path, else use the non-dedup code path and insert the block's hash into the Bloom filter. This means that the filesystem would store *two* copies of any deduplicatious block, with one of those not being in the DDT. This would allow most writes of non-duplicate blocks to be faster than normal dedup writes, but still slower than normal non-dedup writes: the Bloom filter will add some cost. The nice thing about this is that Bloom filters can be sized to fit in main memory, and will be much smaller than the DDT.

It's very likely that this is a bit too obvious to just work. Of course, it is easier to just use flash. It's also easier to just not dedup: the most highly deduplicatious data (VM images) is relatively easy to manage using clones and snapshots, to a point anyway.

Nico
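The write-path decision sketched above could look something like this. The `BloomFilter` class and `write_block` helper are illustrative assumptions, not ZFS code -- a real implementation would live in the write path and persist the filter on disk:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions set in an m-bit array.
    No false negatives; false positives just send a write down the
    (slower but correct) dedup path."""

    def __init__(self, m_bits=1 << 20, k=4):
        self.m = m_bits
        self.k = k
        self.bits = 0  # a big Python int used as the bit array

    def _positions(self, key: bytes):
        # Derive k positions by salting the key with the function index.
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(4, "big") + key).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits |= 1 << p

    def __contains__(self, key: bytes):
        return all(self.bits >> p & 1 for p in self._positions(key))

def write_block(bf: BloomFilter, block_hash: bytes) -> str:
    """First sighting: fast non-dedup path, remember the hash.
    Seen before (or a false positive): dedup path, so the DDT can
    find or insert the entry."""
    if block_hash in bf:
        return "dedup"
    bf.add(block_hash)
    return "non-dedup"
```

At a few bits per block this stays resident in RAM where a DDT entry (hundreds of bytes) cannot.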
Re: [zfs-discuss] Solaris 11 System Reboots Continuously Because of a ZFS-Related Panic (7191375)
On Mon, Jan 14, 2013 at 1:48 PM, Tomas Forsman wrote:
>> https://bug.oraclecorp.com/pls/bug/webbug_print.show?c_rptno=15852599
>
> Host oraclecorp.com not found: 3(NXDOMAIN)
>
> Would oracle.internal be a better domain name?

Things like that cannot be changed easily. They (Oracle) are stuck with that domain name for the foreseeable future. Also, whoever thought it up probably didn't consider leakage of internal URIs to the outside. *shrug*
Re: [zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
The copies thing is really only for laptops, where the likelihood of redundancy is very low (there are some high-end laptops with multiple drives, but those are relatively rare) and where this idea is better than nothing. It's also nice that copies can be set in a per-dataset manner (whereas RAID-Zn and mirroring provide pool-wide redundancy, not per-dataset), so you could set it > 1 on home directories but not on /.

Nico
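For instance (dataset names hypothetical; note that copies, like other ZFS properties, affects only data written after it is set):

```
# Keep two copies of every block of home-directory data, while the
# rest of the pool stays at the default of one copy:
zfs set copies=2 tank/home
zfs get copies tank/home
```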
Re: [zfs-discuss] New fast hash algorithm - is it needed?
You can treat whatever hash function as an idealized one, but actual hash functions aren't ideal. There may well be as-yet-undiscovered input bit pattern ranges where there's a large density of collisions in some hash function -- and indeed, since our hash functions aren't ideal, there must be. We just don't know where these potential collisions are. For cryptographically secure hash functions that's enough (plus 2nd pre-image and 1st pre-image resistance, but allow me to handwave), but for dedup? *shudder*

Now, for some content types collisions may not be a problem at all. Think of security camera recordings: collisions will show up as bad frames in a video stream that no one is ever going to look at, and if they should need it, well, too bad. And for other content types collisions can be horrible.

Us ZFS lovers love to talk about how silent bit rot means you may never know about serious corruption in other filesystems until it's too late. Now, if you disable verification in dedup, what do you get? The same situation other filesystems are in relative to bit rot, only with different likelihoods. Disabling verification is something to do after careful deliberation, not something to do by default.

Nico
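For a sense of those "different likelihoods": assuming an ideal 256-bit hash (which, as argued above, real functions only approximate), the birthday bound gives the chance of any *random* collision. A rough sketch:

```python
def collision_bound(n_blocks, hash_bits=256):
    """Upper bound on P(any two of n_blocks random hashes collide):
    n*(n-1)/2 pairs, each colliding with probability 2**-hash_bits."""
    return n_blocks * (n_blocks - 1) / 2 / 2.0 ** hash_bits

# 2**38 unique 128K blocks is 32 PiB of data, and the bound is still
# astronomically small. The worry is not random collisions but the
# unknown dense-collision regions (and plain bugs) -- which is what
# verification guards against.
```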
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Wed, Jul 11, 2012 at 3:45 AM, Sašo Kiselkov wrote:
> It's also possible to set "dedup=verify" with "checksum=sha256",
> however, that makes little sense (as the chances of getting a random
> hash collision are essentially nil).

IMO dedup should always verify.

Nico
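In zfs(1M) terms, always-verify looks like this (pool and dataset names hypothetical):

```
# sha256 checksums plus a byte-for-byte comparison whenever the
# checksums of a would-be duplicate match:
zfs set checksum=sha256 tank/vm
zfs set dedup=verify tank/vm

# equivalently, both requested in one property value:
zfs set dedup=sha256,verify tank/vm
```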
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Wed, Jul 11, 2012 at 9:48 AM, wrote:
>> Huge space, but still finite…
>
> Dan Brown seems to think so in "Digital Fortress" but it just means he
> has no grasp on "big numbers".

I couldn't get past that. I had to put the book down. I'm guessing it was as awful as it threatened to be.

IMO, FWIW, yes, do add SHA-512 (truncated to 256 bits, of course).

Nico
Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Wed, Jul 4, 2012 at 11:14 AM, Bob Friesenhahn wrote:
> On Tue, 3 Jul 2012, James Litchfield wrote:
>> Agreed - msync/munmap is the only guarantee.
>
> I don't see that the munmap definition assures that anything is written
> to "disk". The system is free to buffer the data in RAM as long as it
> likes without writing anything at all.

Oddly enough the manpages at the Open Group don't make this clear. So I think it may well be advisable to use msync(3C) before munmap() on MAP_SHARED mappings. However, I think all implementors should have -- and probably all do have (Linux even documents that it does) -- an implied msync() when doing a munmap(2). It really makes no sense at all to have munmap(2) not imply msync().

(That's another thing: I don't see where the standard requires that munmap(2) be synchronous. I think it'd be nice to have an mmap(2) option for requesting whether munmap(2) of the same mapping be synchronous or asynchronous. An async munmap(2) means no need for cross-calls; instead the mapping can be torn down over time. Doing a synchronous msync(3C), then a munmap(2), is a recipe for going real slow, but if munmap(2) does not portably guarantee an implied msync(), then would it be safe to do an async msync(3C) and then munmap(2)?)

Nico
Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Tue, Jul 3, 2012 at 9:48 AM, James Litchfield wrote:
> On 07/02/12 15:00, Nico Williams wrote:
>> You can't count on any writes to mmap(2)ed files hitting disk until
>> you msync(2) with MS_SYNC. The system should want to wait as long as
>> possible before committing any mmap(2)ed file writes to disk.
>> Conversely you can't expect that no writes will hit disk until you
>> msync(2) or munmap(2).
>
> Driven by fsflush which will scan memory (in chunks) looking for dirty,
> unlocked, non-kernel pages to flush to disk.

Right, but one just cannot count on that -- it's not part of the API specification.

Nico
Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Mon, Jul 2, 2012 at 3:32 PM, Bob Friesenhahn wrote:
> On Mon, 2 Jul 2012, Iwan Aucamp wrote:
>> I'm interested in some more detail on how the ZFS intent log behaves
>> for updates done via a memory mapped file - i.e. will the ZIL log
>> updates done to an mmap'd file or not?
>
> I would expect these writes to go into the intent log unless msync(2)
> is used on the mapping with the MS_SYNC option.

You can't count on any writes to mmap(2)ed files hitting disk until you msync(2) with MS_SYNC. The system should want to wait as long as possible before committing any mmap(2)ed file writes to disk. Conversely, you can't expect that no writes will hit disk until you msync(2) or munmap(2).

Nico
Re: [zfs-discuss] History of EPERM for unlink() of directories on ZFS?
On Tue, Jun 26, 2012 at 8:12 AM, Lionel Cons wrote:
> On 26 June 2012 14:51, wrote:
> We've already asked our Netapp representative. She said it's not hard
> to add that.

Did NetApp tell you that they'll add support for using the NFSv4 LINK operation on source objects that are directories?! I'd be extremely surprised! Or did they only tell you that they'll add support for using the NFSv4 REMOVE operation on non-empty directories? The latter is definitely feasible (although it could fail due to share-deny OPENs of files below, say, but hey). The former is... not sane.

>> I'd suggest whether you can restructure your code and work without this.
>
> It would require touching code for which we don't have sources anymore
> (people gone, too). It would also require to create hard links to the
> results files directly, which means linking 15000+ files per directory
> with a minimum of 3 directories. Each day (this is CERN after all).

Oh, I see. But you still don't want hardlinks to directories! Instead you might be able to use LD_PRELOAD to emulate the behavior that the application wants. The app is probably implementing rename(), so just detect the sequence and map it to an actual rename(2).

> The other way around would be to throw the SPARC machines away and go
> with Netapp.

So Solaris is just a fileserver here?

Nico
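The LD_PRELOAD idea might look something like the sketch below. The exact link()/unlink() sequence the legacy app performs is an assumption -- you'd truss(1) it first -- and a production shim would need to worry about threads and multiple in-flight renames:

```c
/* Hypothetical shim: turn the app's link()-a-directory-then-unlink()
 * sequence into a single rename(2).  Build as a shared object and run
 * the app with LD_PRELOAD pointing at it. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>      /* rename() */
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static char pending_old[4096];  /* source path of an already-done rename */

int link(const char *oldpath, const char *newpath) {
    static int (*real_link)(const char *, const char *);
    if (!real_link) real_link = dlsym(RTLD_NEXT, "link");

    struct stat st;
    if (stat(oldpath, &st) == 0 && S_ISDIR(st.st_mode)) {
        /* The app wants a directory "hardlink": do the rename now and
         * remember oldpath so the follow-up unlink() can be ignored. */
        strncpy(pending_old, oldpath, sizeof pending_old - 1);
        return rename(oldpath, newpath);
    }
    return real_link(oldpath, newpath);
}

int unlink(const char *path) {
    static int (*real_unlink)(const char *);
    if (!real_unlink) real_unlink = dlsym(RTLD_NEXT, "unlink");

    if (pending_old[0] && strcmp(path, pending_old) == 0) {
        pending_old[0] = '\0';  /* already renamed away; pretend success */
        return 0;
    }
    return real_unlink(path);
}
```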
Re: [zfs-discuss] [developer] Re: History of EPERM for unlink() of directories on ZFS?
On Tue, Jun 26, 2012 at 9:44 AM, Alan Coopersmith wrote:
> On 06/26/12 05:46 AM, Lionel Cons wrote:
>> On 25 June 2012 11:33, wrote:
>>> To be honest, I think we should also remove this from all other
>>> filesystems and I think ZFS was created this way because all modern
>>> filesystems do it that way.
>>
>> This may be wrong way to go if it breaks existing applications which
>> rely on this feature. It does break applications in our case.
>
> Existing applications rely on the ability to corrupt UFS filesystems?
> Sounds horrible.

My guess is that the OP just wants unlink() of an empty directory to be the same as rmdir() of the same. Or perhaps they want unlink() of a non-empty directory to result in a recursive rm... But if they really want hardlinks to directories, then yeah, that's horrible.

Nico
Re: [zfs-discuss] Is there an actual newsgroup for zfs-discuss?
On Mon, Jun 11, 2012 at 5:05 PM, Tomas Forsman wrote:
> .. or use a mail reader that doesn't suck.

Or the mailman thread view.
Re: [zfs-discuss] Terminology question on ZFS COW
COW goes back at least to the early days of virtual memory and fork(). On fork() the kernel would arrange for writable pages in the parent process to be made read-only so that writes to them could be caught; the page fault handler would then copy the page (and restore write access) so that the parent and child each have their own private copy.

COW as used in ZFS is not the same, but the concept was introduced very early also, IIRC in the mid-80s -- certainly no later than 4.4BSD's log-structured filesystem (which ZFS resembles in many ways).

So, is COW a misnomer? Yes and no, and anyway, it's irrelevant. The important thing is that when you say COW people understand that you're not saving a copy of the old thing but rather writing the new thing to a new location. (The old version of whatever was copied-on-write is stranded, unless -- of course -- you have references left to it from things like snapshots.)

Nico
Re: [zfs-discuss] current status of SAM-QFS?
On Wed, May 2, 2012 at 7:59 AM, Paul Kraus wrote:
> On Wed, May 2, 2012 at 7:46 AM, Darren J Moffat wrote:
> If Oracle is only willing to share (public) information about the
> roadmap for products via official sales channels then there will be
> lots of FUD in the market. Now, as to sharing futures and NDA
> material, that _should_ only be available via direct Oracle channels
> (as it was under Sun as well).

Sun was tight-lipped too, yes, but information leaked through the open or semi-open software development practices in Solaris. If you saw some feature pushed to some gate you had no guarantee that it would remain there or be supported, but you had a pretty good inkling as to whether the engineers working on it intended it to remain there.

If you can't get something out of your rep, you might try reading the tea leaves (sketchy business). But ultimately you need to be prepared for any product's EOL. You can expect some amount of warning time about EOLs, but legacy has a way of sticking around, so write a plan for how to migrate data and where to, then put the plan in a drawer somewhere (and update it as necessary).

Nico
Re: [zfs-discuss] cluster vs nfs
On Thu, Apr 26, 2012 at 12:37 PM, Richard Elling wrote:
> [...]

NFSv4 had migration in the protocol (excluding protocols between servers) from the get-go, but it was missing a lot (FedFS) and was not implemented until recently. I've no idea what clients and servers support it adequately besides Solaris 11, though that's just my fault (not being informed). It's taken over a decade to get to where we have any implementations of NFSv4 migration.

>> For me one of the exciting things about Lustre was/is the idea that
>> you could just have a single volume where all new data (and metadata)
>> is distributed evenly as you go. Need more storage? Plug it in,
>> either to an existing head or via a new head, then flip a switch and
>> there it is. No need to manage allocation. Migration may still be
>> needed, both within a cluster and between clusters, but that's much
>> more manageable when you have a protocol where data locations can be
>> all over the place in a completely transparent manner.
>
> Many distributed file systems do this, at the cost of being not quite
> POSIX-ish.

Well, Lustre does POSIX semantics just fine, including cache coherency (as opposed to NFS' close-to-open coherency, which is decidedly non-POSIX).

> In the brave new world of storage vmotion, nosql, and distributed
> object stores, it is not clear to me that coding to a POSIX file
> system is a strong requirement.

Well, I don't quite agree. I'm very suspicious of eventually-consistent. I'm not saying that the enormous DBs that eBay and such run should sport SQL and ACID semantics -- I'm saying that I think we can do much better than eventually-consistent (and no-language) while not paying the steep price that ACID requires. I'm not alone in this either. The trick is to find the right compromise. Close-to-open semantics works out fine for NFS, but O_APPEND is too wonderful not to have (ditto O_EXCL, which NFSv2 did not have; v4 has O_EXCL, but not O_APPEND).
Whoever first delivers the right compromise in distributed DB semantics stands to make a fortune.

> Perhaps people are so tainted by experiences with v2 and v3 that we
> can explain the non-migration to v4 as being due to poor marketing?
> As a leader of NFS, Sun had unimpressive marketing.

Sun did not do too much to improve NFS in the 90s, not compared to the v4 work that only really started paying off recently. And since Sun had lost the client space by then, it doesn't mean all that much to have the best server if the clients aren't able to take advantage of the server's best features for lack of client implementation.

Basically, Sun's ZFS, DTrace, SMF, NFSv4, Zones, and other amazing innovations came a few years too late to make up for the awful management that Sun was saddled with. But for all the decidedly awful things Sun management did (or didn't do), the worst was terminating Sun PS (yes, worse than all the non-marketing, poor marketing, poor acquisitions, poor strategy, and all the rest, including truly epic mistakes like icing Solaris on x86 a decade ago).

One of the worst outcomes of the Sun debacle is that now there's a bevy of senior execs who think the worst thing Sun did was to open source Solaris and Java -- which isn't to say that Sun should have open sourced as much as it did, or that open source is an end in itself, but that open sourcing these things was a legitimate business tool with very specific goals in mind in each case, and had nothing to do with the sinking of the company.

Or maybe that's one of the best outcomes, because the good news is that those who learn the right lessons (in this case: that open source is a legitimate business tool that is sometimes, often even, a great mind-share-building tool) will be in the minority, and thus will have a huge advantage over their competition. That's another thing Sun did not learn until it was too late: mind-share matters enormously to a software company.
Nico
Re: [zfs-discuss] cluster vs nfs
On Thu, Apr 26, 2012 at 5:45 PM, Carson Gaspar wrote:
> On 4/26/12 2:17 PM, J.P. King wrote:
>> I don't know SnapMirror, so I may be mistaken, but I don't see how you
>> can have non-synchronous replication which can allow for seamless
>> client failover (in the general case). Technically this doesn't have
>> to be block based, but I've not seen anything which wasn't.
>> Synchronous replication pretty much precludes DR (again, I can think
>> of theoretical ways around this, but have never come across anything
>> in practice).
>
> "seamless" is an over-statement, I agree. NetApp has synchronous
> SnapMirror (which is only mostly synchronous...). Worst case, clients
> may see a filesystem go backwards in time, but to a point-in-time
> consistent state.

Sure, if we assume apps make proper use of O_EXCL, O_APPEND, link(2)/unlink(2)/rename(2), sync(2), fsync(2), and fdatasync(3C), and can roll their own state back on their own. Databases typically know how to do that (e.g., SQLite3). Most apps? Doubtful.

Nico
Re: [zfs-discuss] cluster vs nfs
On Thu, Apr 26, 2012 at 12:10 AM, Richard Elling wrote:
> On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote:
> Reboot requirement is a lame client implementation.

And lame protocol design. You could possibly migrate read-write NFSv3 on the fly by preserving FHs and somehow updating the clients to go to the new server (with a hiccup in between, no doubt), but only entire shares at a time -- you could not migrate only part of a volume with NFSv3. Of course, having migration support in the protocol does not equate to getting it in the implementation, but it's certainly a good step in that direction.

> You are correct, a ZFS send/receive will result in different file
> handles on the receiver, just like rsync, tar, ufsdump+ufsrestore, etc.

That's understandable for NFSv2 and v3, but for v4 there's no reason that an NFSv4 server stack and ZFS could not arrange to preserve FHs (if, perhaps, at the price of making the v4 FHs rather large). Although even for v3 it should be possible for servers in a cluster to arrange to preserve devids...

Bottom line: live migration needs to be built right into the protocol.

For me one of the exciting things about Lustre was/is the idea that you could just have a single volume where all new data (and metadata) is distributed evenly as you go. Need more storage? Plug it in, either to an existing head or via a new head, then flip a switch and there it is. No need to manage allocation. Migration may still be needed, both within a cluster and between clusters, but that's much more manageable when you have a protocol where data locations can be all over the place in a completely transparent manner.

Nico
Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, Apr 25, 2012 at 8:57 PM, Paul Kraus wrote:
> On Wed, Apr 25, 2012 at 9:07 PM, Nico Williams wrote:
>> Nothing's changed. Automounter + data migration -> rebooting clients
>> (or close enough to rebooting). I.e., outage.
>
> Uhhh, not if you design your automounter architecture correctly
> and (as Richard said) have NFS clients that are not lame, to which I'll
> add, automounters that actually work as advertised. I was designing
> automount architectures that permitted dynamic changes with minimal to
> no outages in the late 1990's. I only had a little over 100 clients
> (most of which were also servers) and NIS+ (NIS ver. 3) to distribute
> the indirect automount maps.

Further below you admit that you're talking about read-only data, effectively. But the world is not static. Sure, *code* is by and large static, and indeed, we segregated data by whether it was read-only (code, historical data) or not (application data, home directories). We were able to migrate *read-only* data with no outages. But for the rest? Yeah, there were always outages. Of course, we had a periodic maintenance window, with all systems rebooting within a short period, and this meant that some data migration outages were not noticeable, but they were real.

> I also had to _redesign_ a number of automount strategies that
> were built by people who thought that using direct maps for everything
> was a good idea. That _was_ a pain in the a** due to the changes
> needed at the applications to point at a different hierarchy.

We used indirect maps almost exclusively. Moreover, we used hierarchical automount entries, and even -autofs mounts. We also used environment variables to control various things, such as which servers to mount what from (this was particularly useful for spreading the load on read-only static data). We used practically every feature of the automounter except for executable maps (and direct maps, when we eventually stopped using those).
> It all depends on _what_ the application is doing. Something that
> opens and locks a file and never releases the lock or closes the file
> until the application exits will require a restart of the application
> with an automounter / NFS approach.

No kidding! In the real world such applications exist and get used.

Nico
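The indirect-map-plus-environment-variables style described above might look something like this (map, key, and server names invented for illustration):

```
# /etc/auto_master excerpt -- everything lives under one indirect map:
/data        auto_data

# /etc/auto_data -- $DATASRV is set per client in the automounter's
# environment, which is one way to spread read-only load across servers:
code     -ro            $DATASRV:/export/code
hist     -ro            $DATASRV:/export/hist
home     -rw            nfs-home:/export/home/&
```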
Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, Apr 25, 2012 at 7:37 PM, Richard Elling wrote:
> On Apr 25, 2012, at 3:36 PM, Nico Williams wrote:
>> I disagree vehemently. automount is a disaster because you need to
>> synchronize changes with all those clients. That's not realistic.
>
> Really? I did it with NIS automount maps and 600+ clients back in 1991.
> Other than the obvious problems with open files, has it gotten worse
> since then?

Nothing's changed. Automounter + data migration -> rebooting clients (or close enough to rebooting). I.e., outage.

> Storage migration is much more difficult with NFSv2, NFSv3, NetWare, etc.

But not with AFS. And spec-wise not with NFSv4 (though I don't know if/when all NFSv4 clients will properly support migration, just that the protocol and some servers do).

> With server-side, referral-based namespace construction that problem
> goes away, and the whole thing can be transparent w.r.t. migrations.

Yes.

> Agree, but we didn't have NFSv4 back in 1991 :-) Today, of course, this
> is how one would design it if you had to design a new DFS today.

Indeed, that's why I built an automounter solution in 1996 (that's still in use, I'm told). Although to be fair AFS existed back then and had global namespace and data migration back then, and was mature. It's taken NFS that long to catch up...

> [...]
> Almost any of the popular nosql databases offer this and more.
> The movement away from POSIX-ish DFS and storing data in
> traditional "files" is inevitable. Even ZFS is an object store at its core.

I agree. Except that there are applications where large octet streams are needed. HPC, media come to mind.

Nico
Re: [zfs-discuss] cluster vs nfs
On Wed, Apr 25, 2012 at 5:42 PM, Ian Collins wrote:
> Aren't those general considerations when specifying a file server?

There are Lustre clusters with thousands of nodes, hundreds of them being servers, and high utilization rates. Whatever specs you might have for one server head will not meet the demand that hundreds of the same can serve.

Nico
Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, Apr 25, 2012 at 5:22 PM, Richard Elling wrote:
> Unified namespace doesn't relieve you of 240 cross-mounts (or
> equivalents). FWIW, automounters were invented 20+ years ago to handle
> this in a nearly seamless manner. Today, we have DFS from Microsoft
> and NFS referrals that almost eliminate the need for automounter-like
> solutions.

I disagree vehemently. automount is a disaster because you need to synchronize changes with all those clients. That's not realistic. I've built a large automount-based namespace, replete with a distributed configuration system for setting the environment variables available to the automounter. I can tell you this: the automounter does not scale, and it certainly does not avoid the need for outages when storage migrates. With server-side, referral-based namespace construction that problem goes away, and the whole thing can be transparent w.r.t. migrations.

For my money the key features a DFS must have are:

 - server-driven namespace construction
 - data migration without having to restart clients, reconfigure them, or do anything at all to them
 - aggressive caching
 - striping of file data for HPC and media environments
 - semantics that ultimately allow multiple processes on disparate clients to cooperate (i.e., byte range locking), but I don't think full POSIX semantics are needed (that said, I think O_EXCL is necessary, and it'd be very nice to have O_APPEND, though the latter is particularly difficult to implement and painful when there's contention if you stripe file data across multiple servers)

Nico
Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, Apr 25, 2012 at 4:26 PM, Paul Archer wrote:
> 2:20pm, Richard Elling wrote:
>> Ignoring lame NFS clients, how is that architecture different than
>> what you would have with any other distributed file system? If all
>> nodes share data to all other nodes, then...?
>
> Simple. With a distributed FS, all nodes mount from a single DFS. With
> NFS, each node would have to mount from each other node. With 16 nodes,
> that's what, 240 mounts? Not to mention your data is in 16 different
> mounts/directory structures, instead of being in a unified filespace.

To be fair, NFSv4 now has a distributed namespace scheme, so you could still have a single mount on the client. That said, some DFSes have better properties, such as striping of data across sets of servers, aggressive caching, and various choices of semantics (e.g., Lustre tries hard to give you POSIX cache coherency semantics).

Nico
Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
I agree, you need something like AFS, Lustre, or pNFS. And/or an NFS proxy to those.

Nico
Re: [zfs-discuss] ZFS on Linux vs FreeBSD
As I understand it LLNL has very large datasets on ZFS on Linux. You could inquire with them, as well as http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/topics?pli=1 . My guess is that it's quite stable for at least some use cases (most likely: LLNL's!), but that may not be yours. You could always... test it, but if you do then please tell us how it went :)

Nico
Re: [zfs-discuss] ZFS + Dell MD1200's - MD3200 necessary?
Hi. Going with the Dell PV MD1200 and Dell PE R510 or R710 is no problem at all. But you should be aware, if you go with the R510 or R710, that the internal storage PCIe slot will not work with any HBAs other than the Dell H200 or H700. Also remember to order extra Intel NICs and disable the Broadcom NICs in the BIOS.

So if you plan to flash the H200 with IT firmware, beware that you have to move the controller to another PCIe slot, and also that your support is not valid anymore. Maybe you should keep the IR firmware on the H200 controller, use hardware RAID for the syspool, and order a Dell 6Gbps SAS HBA with IT firmware. This way you'll keep the hardware support for your Dell equipment.

/Nico
Re: [zfs-discuss] Data loss by memory corruption?
On Wed, Jan 18, 2012 at 4:53 AM, Jim Klimov wrote:
> 2012-01-18 1:20, Stefan Ring wrote:
>> I don’t care too much if a single document gets corrupted – there’ll
>> always be a good copy in a snapshot. I do care however if a whole
>> directory branch or old snapshots were to disappear.
>
> Well, as far as this problem "relies" on random memory corruptions,
> you don't get to choose whether your document gets broken or some
> low-level part of metadata tree ;)

Other filesystems tend to be much more tolerant of bit rot of all types precisely because they have no block checksums. But I'd rather have ZFS -- *with* redundancy, of course, and with ECC.

It might be useful to have a way to recover from checksum mismatches by involving a human. I'm imagining a tool that tests whether accepting a block's actual contents results in making data available that the human thinks checks out, and if so, then rewriting that block. Some bit errors might simply result in meaningless metadata, but in some cases this can be corrected (e.g., ridiculous block addresses). But if ECC takes care of the problem then why waste the effort? (Partial answer: because it'd be a very neat GSoC-type project!)

> Besides, what if that document you don't care about is your account's
> entry in a banking system (as if they had no other redundancy and
> double-checks)? And suddenly you "don't exist" because of some EIOIO,
> or your balance is zeroed (or worse, highly negative)? ;)

This is why we have paper trails, logs, backups, redundancy at various levels, ...

Nico
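As a toy illustration of the "correctable in some cases" point: before asking a human to approve anything, a recovery tool could first search for a single flipped bit that restores the recorded checksum. A sketch (not a real ZFS tool -- just the core idea):

```python
import hashlib

def repair_single_bitflip(block: bytes, want_digest: bytes):
    """If `block` fails its recorded checksum, try every single-bit
    flip and return the variant whose sha256 matches `want_digest`,
    or None.  O(bits) hash computations -- fine for one suspect block;
    a human would still approve the rewrite."""
    if hashlib.sha256(block).digest() == want_digest:
        return block                        # nothing wrong
    buf = bytearray(block)
    for i in range(len(buf) * 8):
        buf[i // 8] ^= 1 << (i % 8)         # flip bit i
        if hashlib.sha256(bytes(buf)).digest() == want_digest:
            return bytes(buf)               # candidate repair
        buf[i // 8] ^= 1 << (i % 8)         # restore bit i
    return None                             # not a single-bit error
```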
Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks
On Wed, Jan 11, 2012 at 9:16 AM, Jim Klimov wrote: > I've recently had a sort of an opposite thought: yes, > ZFS redundancy is good - but also expensive in terms > of raw disk space. This is especially bad for hardware > space-constrained systems like laptops and home-NASes, > where doubling the number of HDDs (for mirrors) or > adding tens of percent of storage for raidZ is often > not practical for whatever reason. Redundancy through RAID-Z and mirroring is expensive for home systems and laptops, but mostly due to the cost of SATA/SAS ports, not the cost of the drives. The drives are cheap, but getting an extra disk in a laptop is either impossible or expensive. But that doesn't mean you can't mirror slices or use ditto blocks. For laptops, just use ditto blocks plus either zfs send backups or an external mirror that you attach/detach. > Current ZFS checksums allow us to detect errors, but > in order for recovery to actually work, there should be > a redundant copy and/or parity block available and valid. > > Hence the question: why not put ECC info into ZFS blocks? RAID-Zn *is* an error correction system. But what you are asking for is a same-device error correction method that costs less than ditto blocks, with error correction data baked into the blkptr_t. Are there enough free bits left in the block pointer for error correction codes for large blocks? (128KB blocks, but eventually ZFS needs to support even larger blocks, so keep that in mind.) My guess is: no. Error correction data might have to get stored elsewhere. I don't find this terribly attractive, but maybe I'm just not looking at it the right way. Perhaps there is a killer enterprise feature for ECC here: stretching MTTDL in the face of a device failure in a mirror or raid-z configuration (but if failures are typically of whole drives rather than individual blocks, then this wouldn't help). But without a good answer for where to store the ECC for the largest blocks, I don't see this happening. 
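As a back-of-the-envelope check on whether per-block ECC could ever fit in the block pointer, here is a sizing sketch. It assumes Reed-Solomon coding over GF(2^8) purely for illustration -- that choice is my assumption, not anything ZFS does:

```python
# Hypothetical sizing exercise: how much Reed-Solomon parity would it
# take to correct t byte errors per 255-byte codeword across one
# 128KB block?  (Illustrative assumption only; not a ZFS structure.)
import math

def rs_parity_bytes(block_size, t):
    """Parity bytes for RS(255, 255 - 2t) codewords covering block_size bytes."""
    parity_per_cw = 2 * t                  # RS needs 2t parity symbols to fix t
    data_per_cw = 255 - parity_per_cw      # data symbols left per codeword
    codewords = math.ceil(block_size / data_per_cw)
    return codewords * parity_per_cw

block = 128 * 1024
for t in (1, 4, 16):
    print(f"t={t}: {rs_parity_bytes(block, t)} parity bytes")
```

Even correcting a single byte error per codeword needs on the order of a kilobyte of parity for a 128KB block -- far more than the 128 bytes of an entire blkptr_t -- which supports the "my guess is: no" above.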
Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] S11 vs illumos zfs compatiblity
On Thu, Jan 5, 2012 at 8:53 AM, sol wrote: >> if a bug fixed in Illumos is never reported to Oracle by a customer, >> it would likely never get fixed in Solaris either > > :-( > > I would have liked to think that there was some good-will between the ex- and > current-members of the zfs team, in the sense that the people who created zfs > but then left Oracle still care about it enough to want the Oracle version to > be as bug-free as possible. My intention was to encourage users to report bugs to both Oracle and Illumos. It's possible that Oracle engineers pay attention to the Illumos bug database, but I expect that for legal reasons they will not look at Illumos code that has any new copyright notices relative to Oracle code. The simplest way for Oracle engineers to avoid all possible legal problems is to simply ignore at least the Illumos source repositories, possibly more. I'm speculating, sure; I might be wrong. As for good will, I'm certain that there is, at least at the engineer level, and probably beyond. But that doesn't mean that there will be bug parity, much less feature parity. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On Thu, Dec 29, 2011 at 6:44 PM, Matthew Ahrens wrote: > On Mon, Dec 12, 2011 at 11:04 PM, Erik Trimble wrote: >> (1) when constructing the stream, every time a block is read from a fileset >> (or volume), its checksum is sent to the receiving machine. The receiving >> machine then looks up that checksum in its DDT, and sends back a "needed" or >> "not-needed" reply to the sender. While this lookup is being done, the >> sender must hold the original block in RAM, and cannot write it out to the >> to-be-sent-stream. > ... >> you produce a huge amount of small network packet >> traffic, which trashes network throughput > > This seems like a valid approach to me. When constructing the stream, > the sender need not read the actual data, just the checksum in the > indirect block. So there is nothing that the sender "must hold in > RAM". There is no need to create small (or synchronous) network > packets, because sender need not wait for the receiver to determine if > it needs the block or not. There can be multiple asynchronous > communication streams: one where the sender sends all the checksums > to the receiver; another where the receiver requests blocks that it > does not have from the sender; and another where the sender sends > requested blocks back to the receiver. Implementing this may not be > trivial, and in some cases it will not improve on the current > implementation. But in others it would be a considerable improvement. Right, you'd want to let the socket/transport buffer/flow control writes of "I have this new block checksum" messages from the zfs sender and "I need the block with this checksum" messages from the zfs receiver. I like this. A separate channel for bulk data definitely comes recommended for flow control reasons, but if you do that then securing the transport gets complicated: you couldn't just zfs send .. | ssh ... zfs receive. 
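The three asynchronous streams described above can be sketched in miniature. This is a toy model (names and framing invented), with in-process queues standing in for the network channels:

```python
# Toy model of dedup-aware send/receive with three async streams:
#   1) sender -> receiver: block checksums
#   2) receiver -> sender: requests for unknown blocks
#   3) sender -> receiver: only the requested block data
import hashlib, queue, threading

checksums, requests, blocks = queue.Queue(), queue.Queue(), queue.Queue()

def sender(data_blocks):
    by_sum = {}
    for blk in data_blocks:
        s = hashlib.sha256(blk).hexdigest()
        by_sum[s] = blk
        checksums.put(s)               # stream 1: never waits on the receiver
    checksums.put(None)                # end of checksum stream
    while (req := requests.get()) is not None:
        blocks.put(by_sum[req])        # stream 3: only requested blocks move
    blocks.put(None)

def receiver(ddt):
    seen = set(ddt)
    while (s := checksums.get()) is not None:
        if s not in seen:              # stream 2: ask only for unknown blocks
            seen.add(s)
            requests.put(s)
    requests.put(None)
    got = []
    while (blk := blocks.get()) is not None:
        ddt[hashlib.sha256(blk).hexdigest()] = True
        got.append(blk)
    return got

data = [b"A" * 8, b"B" * 8, b"A" * 8]               # one duplicate block
ddt = {hashlib.sha256(b"B" * 8).hexdigest(): True}  # receiver already has "B"
out = []
t = threading.Thread(target=lambda: out.extend(receiver(ddt)))
t.start(); sender(data); t.join()
print(len(out))  # only the unique "A" payload crossed the bulk channel
```

In a real implementation the checksum, request, and bulk streams would all overlap rather than run in phases, but the flow-control point stands: the sender never stalls on a per-block round trip.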
You could use SSH channel multiplexing, but that will net you lousy performance (well, no lousier than one already gets with SSH anyways)[*]. (And SunSSH lacks this feature anyways.) It'd then begin to pay to have a bona fide zfs send network protocol, and now we're talking about significantly more work. Another option would be to have send/receive options to create the two separate channels, so one would do something like:

% zfs send --dedup-control-channel ... | ssh-or-netcat-or... zfs receive --dedup-control-channel ... &
% zfs send --dedup-bulk-channel ... | ssh-or-netcat-or... zfs receive --dedup-bulk-channel
% wait

The second zfs receive would rendezvous with the first and go from there.

[*] The problem with SSHv2 is that it has flow-controlled channels layered over a flow-controlled congestion channel (TCP), and there's not enough information flowing from TCP to SSHv2 to make this work well. Also, the SSHv2 channels cannot have their window shrink except by the sender consuming it, which makes it impossible to mix high-bandwidth bulk and small control data over a congested link. This means that in practice SSHv2 channels have to have relatively small windows, which then forces the protocol to work very synchronously (i.e., with effectively synchronous ACKs of bulk data). I now believe the idea of mixing bulk and non-bulk data over a single TCP connection in SSHv2 is a failure. SSHv2 over SCTP, or over multiple TCP connections, would be much better. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] S11 vs illumos zfs compatiblity
On Thu, Dec 29, 2011 at 2:06 PM, sol wrote: > Richard Elling wrote: >> many of the former Sun ZFS team >> regularly contribute to ZFS through the illumos developer community. > > Does this mean that if they provide a bug fix via illumos then the fix won't > make it into the Oracle code? If you're an Oracle customer you should report any ZFS bugs you find to Oracle if you want fixes in Solaris. You may want to (and I encourage you to) report such bugs to Illumos if at all possible (i.e., unless your agreement with Oracle or your employer's policies somehow prevent you from doing so). The following is complete speculation. Take it with salt. With reference to your question, it may mean that Oracle's ZFS team would have to come up with their own fixes to the same bugs. Oracle's legal department would almost certainly have to clear the copying of any non-trivial/obvious fix from Illumos into Oracle's ON tree. And if taking a fix from Illumos were to require opening the affected files (because they are under CDDL in Illumos) then executive management approval would also be required. But the most likely case is that the issue simply wouldn't come up in the first place because Oracle's ZFS team would almost certainly ignore the Illumos repository (perhaps not the Illumos bug tracker, but probably that too) as that's simply the easiest way for them to avoid legal messes. Think about it. Besides, I suspect that from Oracle's point of view what matters are bug reports by Oracle customers to Oracle, so if a bug fixed in Illumos is never reported to Oracle by a customer, it would likely never get fixed in Solaris either except by accident, as a result of another change. Also, the Oracle ZFS team is not exactly devoid of clue, even with the departures from it to date. I suspect they will be able to fix bugs in Oracle's ZFS and completely independently of the open ZFS community, even if it means duplicating effort. 
That said, Illumos is a fork of OpenSolaris, and as such it and Solaris will necessarily diverge as at least one of the two (and probably both, for a while) gets plenty of bug fixes and enhancements. This is a good thing, not a bad thing, at least for now. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On Thu, Dec 29, 2011 at 9:53 AM, Brad Diggs wrote: > Jim, > > You are spot on. I was hoping that the writes would be close enough to > identical that > there would be a high ratio of duplicate data since I use the same record > size, page size, > compression algorithm, … etc. However, that was not the case. The main > thing that I > wanted to prove though was that if the data was the same the L1 ARC only > caches the > data that was actually written to storage. That is a really cool thing! I > am sure there will > be future study on this topic as it applies to other scenarios. > > With regards to directory engineering investing any energy into optimizing > ODSEE DS > to more effectively leverage this caching potential, that won't happen. OUD > far out > performs ODSEE. That said OUD may get some focus in this area. However, > time will > tell on that one. Databases are not as likely to benefit from dedup as virtual machines; indeed, DBs are likely not to benefit from dedup at all. The VM use case benefits from dedup for the obvious reason that many VMs will have the same exact software installed most of the time, using the same filesystems, and the same patch/update installation order, so if you keep data out of their root filesystems then you can expect enormous deduplicatiousness. But databases, not so much. The unit of deduplicable data in a VM use case is the guest's preferred block size, while in a DB the unit of deduplicable data might be a variable-sized table row, or even smaller: a single row/column value -- and you have no way to ensure alignment of individual deduplicable units nor ordering of sets of deduplicable units into larger ones. When it comes to databases your best bets will be: a) database-level compression or dedup features (e.g., Oracle's column-level compression feature) or b) ZFS compression. 
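The alignment point can be made concrete with a toy block-hashing experiment (sizes invented for illustration): identical, block-aligned data dedups perfectly, while the same bytes shifted by one -- as a database insert might cause -- share nothing at block granularity:

```python
# Toy dedup experiment: hash fixed-size "blocks" and count unique ones.
# Aligned identical data (the VM-image case) dedups perfectly; a
# one-byte shift (the database case) defeats block-level dedup.
import hashlib

def unique_blocks(data, bs=8):
    return {hashlib.sha256(data[i:i + bs]).digest()
            for i in range(0, len(data), bs)}

payload = bytes(range(64))                   # 8 distinct 8-byte blocks
vm_a, vm_b = payload, payload                # two identical guest images
shifted = b"\x00" + payload                  # same bytes, off by one

aligned_total = unique_blocks(vm_a) | unique_blocks(vm_b)
shifted_total = unique_blocks(payload) | unique_blocks(shifted)
print(len(aligned_total), len(shifted_total))  # 8 unique vs. 17 unique
```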
(Dedup makes VM management easier, because the alternative is to patch one master guest VM [per-guest type] then re-clone and re-configure all instances of that guest type, in the process possibly losing any customizations in those guests. But even before dedup, the ability to snapshot and clone datasets was an impressive dedup-like tool for the VM use-case, just not as convenient as dedup.) Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On Wed, Dec 28, 2011 at 3:14 PM, Brad Diggs wrote: > > The two key takeaways from this exercise were as follows. There is > tremendous caching potential > through the use of ZFS deduplication. However, the current block level > deduplication does not > benefit directory as much as it perhaps could if deduplication occurred at > the byte level rather than > the block level. It could very well be that even byte-level deduplication doesn't > work well either. > Until that option is available, we won't know for sure. How would byte-level dedup even work? My best idea would be to apply the rsync algorithm and then start searching for little chunks of data with matching rsync CRCs, rolling the rsync CRC over the data until a match is found for some chunk (which then has to be read and compared), and so on. The result would be incredibly slow on write and would have huge storage overhead. On the read side you could have many more I/Os too, so read would get much slower as well. I suspect any other byte-level dedup solutions would be similarly lousy. There'd be no real savings to be had, making the idea not worthwhile. Dedup is for very specific use cases. If your use case doesn't benefit from block-level dedup, then don't bother with dedup. (The same applies to compression, but compression is much more likely to be useful in general, which is why it should generally be on.) Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
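For the curious, the weak rolling checksum that the rsync algorithm uses looks roughly like this (a sketch of the classic two-part sum, not anyone's production code). The point is that sliding the window one byte costs O(1), which is the only reason scanning at every offset is even thinkable:

```python
# rsync-style weak rolling checksum: a = sum of bytes, b = weighted sum.
# roll() updates both in O(1) as the window slides one byte.
def weak_sum(block):
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
    return (b << 16) | a

def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window right by one: drop out_byte, take in_byte."""
    a = (a - out_byte + in_byte) & 0xFFFF
    b = (b - n * out_byte + a) & 0xFFFF
    return a, b

data = bytes(range(40))
n = 8
a, b = sum(data[:n]) & 0xFFFF, weak_sum(data[:n]) >> 16
for i in range(len(data) - n):
    a, b = roll(a, b, data[i], data[i + n], n)
    assert ((b << 16) | a) == weak_sum(data[i + 1:i + 1 + n])
print("rolled checksums match full recomputation at every offset")
```

Even with the O(1) roll, every weak-sum hit still forces a strong-hash comparison and an unaligned read, which is where the "incredibly slow on write" above comes from.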
Re: [zfs-discuss] S11 vs illumos zfs compatiblity
On Tue, Dec 27, 2011 at 8:44 PM, Frank Cusack wrote: > So with a de facto fork (illumos) now in place, is it possible that two > zpools will report the same version yet be incompatible across > implementations? Not likely: the Illumos community has developed a method for managing ZFS extensions in a way other than linear chronology. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] S11 vs illumos zfs compatiblity
On Tue, Dec 27, 2011 at 2:20 PM, Frank Cusack wrote: > <http://sparcv9.blogspot.com/2011/12/solaris-11-illumos-and-source.html> > >> If I "upgrade" ZFS to use the new features in Solaris 11 I will be unable >> to import my pool using the free ZFS implementation that is available in >> illumos based distributions > > > Is that accurate? I understand if the S11 version is ahead of illumos, of > course I can't use the same pools in both places, but that is the same > problem as using an S11 pool on S10. The author is implying a much worse > situation, that there are zfs "tracks" in addition to versions and that S11 > is now on a different track and an S11 pool will not be usable elsewhere, > "ever". I hope it's just a misrepresentation. Hard to say. Suppose Oracle releases no details on any additions to the on-disk ZFS format since build 147... then either the rest of the ZFS developer community forks for good, or they have to reverse engineer Oracle's additions. Even if Oracle does release details on their additions, what if the external ZFS developer community disagrees vehemently with any of those? And what if the open source community adds extensions that Oracle never adopts? A fork is not yet a reality, but IMO it sure looks likely. Of course, you can still manage to have pools that will work on all implementations -- until the day that implementations start removing older formats anyways, which not only could happen, but I think will happen, though probably not until S10 is EOLed, and in any case probably not for a few years yet, likely not even within the next half decade. It's hard to predict such things though, so take the above with some (or lots!) of salt. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On Dec 11, 2011 5:12 AM, "Nathan Kroenert" wrote: > > On 12/11/11 01:05 AM, Pawel Jakub Dawidek wrote: >> >> On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote: >>> >>> Unfortunately the answer is no. Neither l1 nor l2 cache is dedup aware. >>> >>> The only vendor I know that can do this is Netapp >> >> And you really work at Oracle?:) >> >> The answer is definitely yes. ARC caches on-disk blocks and dedup just >> references those blocks. When you read, dedup code is not involved at all. >> Let me show it to you with a simple test: >> >> Create a file (dedup is on): >> >># dd if=/dev/random of=/foo/a bs=1m count=1024 >> >> Copy this file so that it is deduped: >> >># dd if=/foo/a of=/foo/b bs=1m >> >> Export the pool so all cache is removed and reimport it: >> >># zpool export foo >># zpool import foo >> >> Now let's read one file: >> >># dd if=/foo/a of=/dev/null bs=1m >>1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec) >> >> We read file 'a' and all its blocks are in cache now. The 'b' file >> shares all the same blocks, so if ARC caches blocks only once, reading >> 'b' should be much faster: >> >># dd if=/foo/b of=/dev/null bs=1m >>1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec) >> >> Now look at it, 'b' was read 12.5 times faster than 'a' with no disk >> activity. Magic?:) >> > > Hey all, > > That reminds me of something I have been wondering about... Why only 12x faster? If we are effectively reading from memory - as compared to a disk reading at approximately 100MB/s (which is about an average PC HDD reading sequentially), I'd have thought it should be a lot faster than 12x. > > Can we really only pull stuff from cache at only a little over one gigabyte per second if it's dedup data? The second file may have the same data, but not the same metadata (the inode number at least must be different), so its znode must get read in, and that will slow reading the copy down a bit. 
Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] bug moving files between two zfs filesystems (too many open files)
On Tue, Nov 29, 2011 at 12:17 PM, Cindy Swearingen wrote: > I think the "too many open files" is a generic error message about running > out of file descriptors. You should check your shell ulimit > information. Also, see how many open files you have: echo /proc/self/fd/* It'd be quite weird though to have a very low fd limit or a very large number of file descriptors open in the shell. That said, as Casper says, utilities like mv(1) should be able to cope with reasonably small fd limits (i.e., not as small as 3, but perhaps as small as 10). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
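For what it's worth, one can check both the limit and the currently open descriptors from a small script as well (assuming a Linux/Solaris-style /proc; purely a diagnostic sketch):

```python
# Diagnostic sketch: report the fd limit and the fds open right now.
import os
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
open_fds = os.listdir("/proc/self/fd")
print(f"fd limit: soft={soft} hard={hard}; currently open: {len(open_fds)}")
```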
Re: [zfs-discuss] grrr, How to get rid of mis-touched file named `-c'
On Mon, Nov 28, 2011 at 11:28 AM, Smith, David W. wrote: > You could list by inode, then use find with rm. > > # ls -i > 7223 -O > > # find . -inum 7223 -exec rm {} \; This is the one solution I'd recommend against, since it would remove hardlinks that you might care about. Also, this thread is getting long, repetitive, tiring. Please stop. This is a standard issue Unix beginner question, just like "my test program does nothing". Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] virtualbox rawdisk discrepancy
Moving boot disks from one machine to another used to work as long as the machines were of the same architecture. I don't recall if it was *supported* (and wouldn't want to pretend to speak for Oracle now), but it was meant to work (unless you minimized the install and removed drivers not needed on the first system that are needed on the other system). You did have to do a reconfigure boot though! Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] aclmode=mask
On Mon, Nov 14, 2011 at 6:20 PM, Nico Williams wrote: > I see, with great pleasure, that ZFS in Solaris 11 has a new > aclmode=mask property. Also, congratulations on shipping. And thank you for implementing aclmode=mask. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] aclmode=mask
I see, with great pleasure, that ZFS in Solaris 11 has a new aclmode=mask property. http://download.oracle.com/docs/cd/E23824_01/html/821-1448/gbscy.html#gkkkp http://download.oracle.com/docs/cd/E23824_01/html/821-1448/gbchf.html#gljyz http://download.oracle.com/docs/cd/E23824_01/html/821-1462/zfs-1m.html#scrolltoc (search for aclmode) May this be the last word in ACL/chmod interactions (knocks on wood, crosses fingers, ...). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] about btrfs and zfs
On Mon, Nov 14, 2011 at 8:33 AM, Edward Ned Harvey wrote: >> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- >> boun...@opensolaris.org] On Behalf Of Paul Kraus >> >> Is it really B-Tree based? Apple's HFS+ is B-Tree based and falls >> apart (in terms of performance) when you get too many objects in one >> FS, which is specifically what drove us to ZFS. We had 4.5 TB of data > > According to wikipedia, btrfs is a b-tree. > I know in ZFS, the DDT is an AVL tree, but what about the rest of the > filesystem? ZFS directories are hashed. Aside from this, the filesystem (and volume) have a tree structure, but that's not what's interesting here -- what's interesting is how directories are indexed. > B-trees should be logarithmic time, which is the best O() you can possibly > achieve. So if HFS+ is dog slow, it's an implementation detail and not a > general fault of b-trees. Hash tables can do much better than O(log N) for searching: O(1) for best case, and O(n) for the worst case. Also, b-trees are O(log_b N), where b is the number of entries per-node. 6e7 entries/directory probably works out to 2-5 reads (assuming 0% cache hit rate) depending on the size of each directory entry and the size of the b-tree blocks. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
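To put rough numbers on the O(log_b N) point, here is a quick sketch (the fanouts are invented round figures, not HFS+ internals):

```python
# Depth of a b-tree holding 6e7 directory entries, for a few plausible
# per-node fanouts -- consistent with the 2-5 reads estimate above.
import math

entries = 6e7
for fanout in (64, 256, 1024):
    depth = math.ceil(math.log(entries) / math.log(fanout))
    print(f"fanout {fanout:4d}: ~{depth} node reads")
```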
Re: [zfs-discuss] about btrfs and zfs
On Fri, Nov 11, 2011 at 4:27 PM, Paul Kraus wrote: > The command syntax paradigm of zfs (command sub-command object > parameters) is not unique to zfs, but seems to have been the "way of > doing things" in Solaris 10. The _new_ functions of Solaris 10 were > all this way (to the best of my knowledge)... > > zonecfg > zoneadm > svcadm > svccfg > ... and many others are written this way. To boot the zone named foo > you use the command "zoneadm -z foo boot", to disable the service > named sendmail, "svcadm disable sendmail", etc. Someone at Sun was > thinking :-) I'd have preferred "zoneadm boot foo". The -z zone command thing is a bit of a sore point, IMO. But yes, all these new *adm(1M) and *cfg(1M) commands in S10 are wonderful, especially when compared to past and present alternatives in the Unix/Linux world. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
To some people "active-active" means all cluster members serve the same filesystems. To others "active-active" means all cluster members serve some filesystems and can serve all filesystems ultimately by taking over failed cluster members. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] about btrfs and zfs
On Wed, Oct 19, 2011 at 7:24 AM, Garrett D'Amore wrote: > I'd argue that from a *developer* point of view, an fsck tool for ZFS might > well be useful. Isn't that what zdb is for? :-) > > But ordinary administrative users should never need something like this, > unless they have encountered a bug in ZFS itself. (And bugs are as likely to > exist in the checker tool as in the filesystem. ;-) zdb can be useful for admins -- say, to gather stats not reported by the system, to explore the fs/vol layout, for educational purposes, and so on. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] about btrfs and zfs
On Tue, Oct 18, 2011 at 9:35 AM, Brian Wilson wrote: > I just wanted to add something on fsck on ZFS - because for me that used to > make ZFS 'not ready for prime-time' in 24x7 5+ 9s uptime environments. > Where ZFS doesn't have an fsck command - and that really used to bug me - it > does now have a -F option on zpool import. To me it's the same > functionality for my environment - the ability to try to roll back to a > 'hopefully' good state and get the filesystem mounted up, leaving the > corrupted data objects corrupted. [...] Yes, that's exactly what it is. There's no point calling it fsck because fsck fixes individual filesystems, while ZFS fixups need to happen at the volume level (at volume import time). It's true that this should have been in ZFS from the word go. But it's there now, and that's what matters, IMO. It's also true that this was never necessary with hardware that doesn't lie, but it's good to have it anyways, and is critical for personal systems such as laptops. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
Also, it's not worth doing a clustered ZFS thing that is too application-specific. You really want to nail down your choices of semantics, explore what design options those yield (or approach from the other direction, or both), and so on. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Thu, Oct 13, 2011 at 9:13 PM, Jim Klimov wrote: > Thanks to Nico for concerns about POSIX locking. However, > hopefully, in the usecase I described - serving images of > VMs in a manner where storage, access and migration are > efficient - whole datasets (be it volumes or FS datasets) > can be dedicated to one VM host server at a time, just like > whole pools are dedicated to one host nowadays. In this > case POSIX compliance can be disregarded - access > is locked by one host, not avaialble to others, period. > Of course, there is a problem of capturing storage from > hosts which died, and avoiding corruptions - but this is > hopefully solved in the past decades of clustering tech's. It sounds to me like you need horizontal scaling more than anything else. In that case, why not use pNFS or Lustre? Even if you want snapshots, a VM should be able to handle that on its own, and though probably not as nicely as ZFS in some respects, having the application be in control of the exact snapshot boundaries does mean that you don't have to quiesce your VMs just to snapshot safely. > Nico also confirmed that "one node has to be a master of > all TXGs" - which is conveyed in both ideas of my original > post. Well, at any one time one node would have to be the master of the next TXG, but it doesn't mean that you couldn't have some cooperation. There are lots of other much more interesting questions. I think the biggest problem lies in requiring full connectivity from every server to every LUN. I'd much rather take the Lustre / pNFS model (which, incidentally, don't preclude having snapshots). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Sun, Oct 9, 2011 at 12:28 PM, Jim Klimov wrote: > So, one version of the solution would be to have a single host > which imports the pool in read-write mode (i.e. the first one > which boots), and other hosts would write thru it (like iSCSI > or whatever; maybe using SAS or FC to connect between > "reader" and "writer" hosts). However they would read directly > from the ZFS pool using the full SAN bandwidth. You need to do more than simply assign a node for writes. You need to send write and lock requests to one node. And then you need to figure out what to do about POSIX write visibility rules (i.e., when a write should be visible to other readers). I think you'd basically end up not meeting POSIX in this regard, just like NFS, though perhaps not with close-to-open semantics. I don't think ZFS is the beast you're looking for. You want something more like Lustre, GPFS, and so on. I suppose someone might surprise us one day with properly clustered ZFS, but I think it'd be more likely that the filesystem would be ZFS-like, not ZFS proper. > Second version of the solution is more or less the same, except > that all nodes can write to the pool hardware directly using some > dedicated block ranges "owned" by one node at a time. This > would work much like a ZIL containing both data and metadata. > Perhaps these ranges would be whole metaslabs or some other > ranges as "agreed" between the master node and other nodes. This is much hairier. You need consistency. If two processes on different nodes are writing to the same file, then you need to *internally* lock around all those writes so that the on-disk structure ends up being sane. There's a number of things you could do here, such as, for example, having a per-node log and one node coalescing them (possibly one node per-file, but even then one node has to be the master of every txg). And still you need to be careful about POSIX semantics. 
That does not come for free in any design -- you will need something like the Lustre DLM (distributed lock manager). Or else you'll have to give up on POSIX. There's a hefty price to be paid for POSIX semantics in a clustered environment. You'd do well to read up on Lustre's experience in detail. And not just Lustre -- that would be just to start. I caution you that this is not a simple project. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Tue, Oct 11, 2011 at 11:15 PM, Richard Elling wrote: > On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote: >> ZFS developers have for a long time stated that ZFS is not intended, >> at least not in near term, for clustered environments (that is, having >> a pool safely imported by several nodes simultaneously). However, >> many people on forums have wished having ZFS features in clusters. > > ...and UFS before ZFS… I'd wager that every file system has this RFE in its > wish list :-) Except the ones that already have it! :) Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] "zfs diff" performance disappointing
Ah yes, of course. I'd misread your original post. Yes, disabling atime updates will reduce the number of superfluous transactions. It's *all* transactions that count, not just the ones the app explicitly caused, and atime implies lots of transactions. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] "zfs diff" performance disappointing
On Mon, Sep 26, 2011 at 1:55 PM, Jesus Cea wrote: > I just upgraded to Solaris 10 Update 10, and one of the improvements > is "zfs diff". > > Using the "birthtime" of the sectors, I would expect very high > performance. The actual performance doesn't seem better than a > standard "rdiff", though. Quite disappointing... > > Should I disable "atime" to improve "zfs diff" performance? (most data > doesn't change, but "atime" of most files would change). atime has nothing to do with it. How much work zfs diff has to do depends on how much has changed between snapshots. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs scripts
On Fri, Sep 9, 2011 at 5:33 AM, Sriram Narayanan wrote: > Plus, you'll need an & character at the end of each command. And a wait command, if you want the script to wait for the sends to finish (which you should). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On Wed, Jul 27, 2011 at 9:22 PM, Daniel Carosone wrote: > Absent TRIM support, there's another way to do this, too. It's pretty > easy to dd /dev/zero to a file now and then. Just make sure zfs > doesn't prevent these being written to the SSD (compress and dedup are > off). I have a separate "fill" dataset for this purpose, to avoid > keeping these zeros in auto-snapshots too. Nice. Seems to me that it'd be nicer to have an interface to raw flash (no wear leveling, direct access to erasure, read, write, read-modify-write [as an optimization]). Then the filesystem could do a much better job of using flash efficiently. But a raw interface wouldn't be a disk-compatible interface. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
On Jul 9, 2011 1:56 PM, "Edward Ned Harvey" < opensolarisisdeadlongliveopensola...@nedharvey.com> wrote: > > Given the abysmal performance, I have to assume there is a significant > number of "overhead" reads or writes in order to maintain the DDT for each > "actual" block write operation. Something I didn't mention in the other > email is that I also tracked iostat throughout the whole operation. It's > all writes (or at least 99.9% writes.) So I am forced to conclude it's a > bunch of small DDT maintenance writes taking place and incurring access time > penalties in addition to each intended single block access time penalty. > > The nature of the DDT is that it's a bunch of small blocks, that tend to be > scattered randomly, and require maintenance in order to do anything else. > This sounds like precisely the usage pattern that benefits from low latency > devices such as SSD's. The DDT should be written to in COW fashion, and asynchronously, so there should be no access time penalty. Or so ISTM it should be. Dedup is necessarily slower for writing because of the deduplication table lookups. Those are synchronous lookups, but for async writes you'd think that total write throughput would only be affected by a) the additional read load (which is zero in your case) and b) any inability to put together large transactions due to the high latency of each logical write, but (b) shouldn't happen, particularly if the DDT fits in RAM or L2ARC, as it does in your case. So, at first glance my guess is ZFS is leaving dedup write performance on the table most likely due to implementation reasons, not design reasons. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
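The per-write DDT lookup cost discussed above can be made concrete with a toy model (illustrative only -- this is not the ZFS DDT, and `ToyDedupStore` is a hypothetical name): every logical write costs one table lookup, but only unique blocks cost a data write, and the table itself must be maintained on every write.

```python
import hashlib

class ToyDedupStore:
    """Toy model of a dedup write path: one checksum-table lookup per
    logical write; only new (unique) blocks cost a data write."""

    def __init__(self):
        self.ddt = {}          # checksum -> reference count
        self.data_writes = 0   # writes of actual block data

    def write_block(self, block):
        key = hashlib.sha256(block).digest()
        if key in self.ddt:
            self.ddt[key] += 1         # duplicate: refcount bump only
        else:
            self.ddt[key] = 1          # unique: table insert + data write
            self.data_writes += 1
```

In memory this dictionary update is cheap; the point of the thread is that on disk the same refcount bumps and inserts become scattered small COW writes to DDT blocks.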
Re: [zfs-discuss] Encryption accelerator card recommendations.
On Jun 27, 2011 4:15 PM, "David Magda" wrote: > The (Ultra)SPARC T-series processors do, but to a certain extent it goes > against a CPU manufacturers best (financial) interest to provide this: > crypto is very CPU intensive using 'regular' instructions, so if you need > to do a lot of it, it would force you to purchase a manufacturer's > top-of-the-line CPUs, and to have as many sockets as you can to handle a > load (and presumably you need to do "useful" work besides just > en/decrypting traffic). I hope no CPU vendor thinks about the economics of chip making that way. I actually doubt any do. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Encryption accelerator card recommendations.
On Jun 27, 2011 9:24 PM, "David Magda" wrote: > AESNI is certain better than nothing, but RSA, SHA, and the RNG would be nice as well. It'd also be handy for ZFS crypto in addition to all the network IO stuff. The most important reason for AES-NI might be not performance but defeating side-channel attacks. Also, really fast AES HW makes AES-based hash functions quite tempting. No, AES-NI is nothing to sneeze at. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Encryption accelerator card recommendations.
IMO a faster processor with built-in AES and other crypto support is most likely to give you the most bang for your buck, particularly if you're using closed Solaris 11, as Solaris engineering is likely to add support for new crypto instructions faster than Illumos (but I don't really know enough about Illumos to say for sure). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] question about COW and snapshots
That said, losing committed transactions when you needed and thought you had ACID semantics... is bad. But that's implied in any restore-from-backups situation. So you replicate/distribute transactions so that restore from backups (or snapshots) is an absolutely last resort matter, and if you ever have to restore from backups you also spend time manually tracking down (from counterparties, "paper" trails kept elsewhere, ...) any missing transactions. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] question about COW and snapshots
On Thu, Jun 16, 2011 at 8:51 AM, wrote: > If a database engine or another application keeps both the data and the > log in the same filesystem, a snapshot wouldn't create inconsistent data > (I think this would be true with vim and a large number of database > engines; vim will detect the swap file and database should be able to > detect the inconsistency and rollback and re-apply the log file.) Correct. SQLite3 will be able to recover automatically from restores of mid-transaction snapshots. VIM does not recover automatically, but it does notice the swap file and warns the user and gives them a way to handle the problem. (When you save a file, VIM renames the old one out of the way, creates a new file with the original name, writes the new contents to it, closes it, then unlinks the swap file. On recovery VIM notices the swap file and gives the user a menu of choices.) I believe this is the best solution: write applications so they can recover from being restarted with data restored from a mid-transaction snapshot. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
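The VIM save sequence described above generalizes to any application: write the new contents to a scratch file, make them durable, then atomically rename over the old name. A minimal sketch (the `atomic_save` name is illustrative, not from any quoted code; a fully paranoid version would also fsync the containing directory):

```python
import os
import tempfile

def atomic_save(path, data):
    """Replace path's contents so that a crash -- or a restore from a
    snapshot taken mid-save -- yields either the old file or the new
    file, never a torn mixture."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".swp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # new contents durable before the switch
        os.replace(tmp, path)      # atomic rename within one filesystem
    except BaseException:
        os.unlink(tmp)             # crash-cleanup: drop the scratch file
        raise
```

Because the rename is atomic, a mid-transaction snapshot restore sees either the complete old file or the complete new one, which is exactly the recoverability property argued for above.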
Re: [zfs-discuss] Versioning FS was: question about COW and snapshots
As Casper pointed out, the right thing to do is to build applications such that they can detect mid-transaction state and roll it back (or forward, if there's enough data). Then mid-transaction snapshots are fine, and the lack of APIs by which to inform the filesystem of application transaction boundaries becomes much less of an issue (adding such APIs is not a good solution, since it'd take many years for apps to take advantage of them and more years still for legacy apps to be upgraded or decommissioned). The existing FS interfaces provide enough that one can build applications this way. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Hard link space savings
And, without a sub-shell: find . -type f \! -links 1 | xargs stat -c " %b %B *+p" /dev/null | dc 2>/dev/null | tail -1 (The stderr redirection is because otherwise dc whines once that the stack is empty, and the tail is because we print interim totals as we go.) Also, this doesn't quite work, since it counts every link, when we want to count all but one of the links. This, then, is what will tell you how much space you saved due to hardlinks: find . -type f \! -links 1 | xargs stat -c " 8k %b %B * %h 1 - * %h /+p" /dev/null 2>/dev/null | dc Excuse my earlier brainfarts :) Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
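For readers allergic to dc, the corrected accounting -- allocated size times (links - 1)/links, summed over every name -- can equivalently be done once per inode. A sketch (hypothetical function name; it assumes the stat(1) convention that st_blocks is in 512-byte units, and like the pipeline it over-counts if some links to an inode live outside the tree being walked):

```python
import os

def hardlink_savings(root):
    """Bytes saved by hard links under root: each extra name for an
    inode avoids one full copy of that inode's allocated space.
    Summing %b*%B*(%h-1)/%h per name, as the dc pipeline does, gives
    the same total as %b*512*(%h-1) once per inode."""
    saved, seen = 0, set()
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)   # identify the inode, not the name
            if st.st_nlink > 1 and key not in seen:
                seen.add(key)
                saved += st.st_blocks * 512 * (st.st_nlink - 1)
    return saved
```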
Re: [zfs-discuss] ZFS Hard link space savings
On Mon, Jun 13, 2011 at 12:59 PM, Nico Williams wrote: > Try this instead: > > (echo 0; find . -type f \! -links 1 | xargs stat -c " %b %B *+" $p; echo p) | > dc s/\$p// ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Hard link space savings
On Mon, Jun 13, 2011 at 5:50 AM, Roy Sigurd Karlsbakk wrote: >> If anyone has any ideas be it ZFS based or any useful scripts that >> could help here, I am all ears. > > Something like this one-liner will show what would be allocated by everything > if hardlinks weren't used: > > # size=0; for i in `find . -type f -exec du {} \; | awk '{ print $1 }'`; do > size=$(( $size + $i )); done; echo $size Oh, you don't want to do that: you'll run into max argument list size issues. Try this instead: (echo 0; find . -type f \! -links 1 | xargs stat -c " %b %B *+" $p; echo p) | dc ;) xargs is your friend (and so is dc... RPN FTW!). Note that I'm not printing the number of links because find will print a name for every link (well, if you do the find from the root of the relevant filesystem), so we'd be counting too much space. You'll need the GNU stat(1). Or you could do something like this using the ksh stat builtin: ( echo 0 find . -type f \! -links 1 | while read p; do xargs stat -c " %b %B *+" $p done echo p ) | dc Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Hard link space savings
On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson wrote: > I have an interesting question that may or may not be answerable from some > internal > ZFS semantics. This is really standard Unix filesystem semantics. > [...] > > So total storage used is around ~7.5MB due to the hard linking taking place > on each store. > > If hard linking capability had been turned off, this same message would have > used 1500 x 2MB =3GB > worth of storage. > > My question is there any simple ways of determining the space savings on > each of the stores from the usage of hard links? [...] But... you just did! :) It's: number of hard links * (file size + sum(size of link names and/or directory slot size)). For sufficiently large files (say, larger than one disk block) you could approximate that as: number of hard links * file size. The key is the number of hard links, which will typically vary, but for e-mails that go to all users, well, you know the number of links then is the number of users. You could write a script to do this -- just look at the size and hard-link count of every file in the store, apply the above formula, add up the inflated sizes, and you're done. Nico PS: Is it really the case that Exchange still doesn't deduplicate e-mails? Really? It's much simpler to implement dedup in a mail store than in a filesystem... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS, Oracle and Nexenta
On May 25, 2011 7:15 AM, "Garrett D'Amore" wrote: > > You are welcome to your beliefs. There are many groups that do standards that do not meet in public. [...] True. > [...] In fact, I can't think of any standards bodies that *do* hold open meetings. I can: the IETF, for example. All business of the IETF is transacted or confirmed on open participation mailing lists, and IETF meetings are known as NOTE WELL meetings because of the notice given at their opening regarding the fact that the meeting is public and resulting considerations regarding, e.g., trade secrets. Mind you, there are many more standards setting organizations that don't have open participation, such as OASIS, ISO, and so on. I don't begrudge you starting closed, or even staying closed, though I would prefer that at least the output of any ZFS standards org be open. Also, I would recommend that you eventually consider creating a new open participation list for non-members (separate from any members-only list). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
On Sun, May 22, 2011 at 1:52 PM, Nico Williams wrote: > [...] Or perhaps you'll argue that no one should ever need bi-di > replication, that if one finds oneself wanting that then one has taken > a wrong turn somewhere. You could also grant the premise and argue instead that nothing the filesystem can do to speed up remote bi-di sync is worth the cost -- an argument that requires a lot more analysis. For example, if the filesystem were to compute and store rsync rolling CRC signatures, well, that would require significant compute and storage resources, and it might not speed up synchronization enough to ever be worthwhile. Similarly, a Merkle hash tree based on rolling hash functions (and excluding physical block pointer details) might require each hash output to grow linearly with block size in order to retain the rolling hash property (I'm not sure about this; I know very little about rolling hash functions), in which case the added complexity would be a complete non-starter. Whereas a Merkle hash tree built with regular hash functions would not be able to resolve insertions/deletions of data chunks of size that is not a whole multiple of block size. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
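To make "rolling" concrete: the rsync weak checksum can slide a fixed-size window one byte at a time in O(1), instead of re-summing the whole window at every offset. A simplified sketch of that property (modelled on, but not identical to, the actual rsync checksum; the names here are illustrative):

```python
MOD = 1 << 16  # rsync packs two 16-bit halves into a 32-bit weak sum

def weak_sum(block):
    """Weak checksum of a whole window, computed from scratch: O(n)."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * c for i, c in enumerate(block)) % MOD
    return a, b

def roll(a, b, old, new, n):
    """Slide an n-byte window one byte: drop `old`, take in `new`: O(1)."""
    a = (a - old + new) % MOD
    b = (b - n * old + a) % MOD
    return a, b
```

It is exactly this O(1) slide that ordinary cryptographic hashes lack, which is why a Merkle tree built on them can only compare aligned blocks, not detect insertions or deletions.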
Re: [zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
On Sun, May 22, 2011 at 10:20 AM, Richard Elling wrote: > ZFS already tracks the blocks that have been written, and the time that > they were written. So we already know when something was written, though > that does not answer the question of whether the data was changed. I think > it is a pretty good bet that newly written data is different :-) Not really. There's bp rewrite (assuming that ever ships, or gets implemented elsewhere), for example. >> Then, the filesystem should make this Merkle Tree available to >> applications through a simple query. > > Something like "zfs diff" ? That works within a filesystem. And zfs send/recv works when you have one filesystem faithfully tracking another. When you have two filesystems with similar contents, and the history of each is useless in deciding how to do a bi-directional synchronization, then you need a way to diff files that is not based on intra-filesystem history. The rsync algorithm is the best high-performance algorithm that we have for determining differences between files separated by a network. My proposal (back then, and Zooko's now) is to leverage work that the filesystem does anyways to implement a high-performance remote diff that is faster than rsync for the simple reason that some of the rsync algorithm essentially gets pre-computed. >> This would enable applications—without needing any further >> in-filesystem code—to perform a Merkle Tree sync, which would range >> from "noticeably more efficient" to "dramatically more efficient" than >> rsync or zfs send. :-) > > Since ZFS send already has an option to only send the changed blocks, > I disagree with your assertion that your solution will be "dramatically > more efficient" than zfs send. We already know zfs send is much more > efficient than rsync for large file systems. You missed Zooko's point completely. It might help to know that Zooko works on a project called Tahoe Least-Authority Filesystem, which is by nature distributed. 
Once you lose the constraints of not having a network or having uni-directional replication only, I think you'll get it. Or perhaps you'll argue that no one should ever need bi-di replication, that if one finds oneself wanting that then one has taken a wrong turn somewhere. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
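A sketch of why a shared Merkle tree makes history-free sync cheap: comparing two trees from the root down descends only into subtrees whose hashes disagree, so each differing block costs O(log n) hash comparisons rather than a full scan (toy code, assuming both sides hold equally many blocks; names are illustrative):

```python
import hashlib

def _h(data):
    return hashlib.sha256(data).digest()

def merkle(blocks):
    """Levels bottom-up: levels[0] is the leaf hashes, levels[-1] the root."""
    level = [_h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + (level[i + 1] if i + 1 < len(level) else b""))
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff(t1, t2):
    """Leaf indices whose blocks differ, for trees over equal block counts.
    Subtrees with matching hashes are skipped entirely."""
    def walk(depth, idx):
        if t1[depth][idx] == t2[depth][idx]:
            return []
        if depth == 0:
            return [idx]
        out = []
        for child in (2 * idx, 2 * idx + 1):
            if child < len(t1[depth - 1]):
                out += walk(depth - 1, child)
        return out
    return walk(len(t1) - 1, 0)
```

Neither side needs the other's snapshot history: exchanging hashes level by level over the network is the whole protocol, which is the distributed-filesystem setting Zooko has in mind.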
Re: [zfs-discuss] ls reports incorrect file size
Then again, Windows apps may be doing seek+write to pre-allocate storage. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ls reports incorrect file size
On Mon, May 2, 2011 at 3:56 PM, Eric D. Mudama wrote: > Yea, kept googling and it makes sense. I guess I am simply surprised > that the application would have done the seek+write combination, since > on NTFS (which doesn't support sparse) these would have been real > 1.5GB files, and there would be hundreds or thousands of them in > normal usage. It could have been smbd compressing long runs of zeros. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ls reports incorrect file size
Also, sparseness need not be apparent to applications. Until recent improvements to lseek(2) to expose hole/non-hole offsets, the only way to know about sparseness was to notice that a file's reported size is more than the file's reported filesystem blocks times the block size. Sparse files in Unix go back at least to the early 80s. If a filesystem protocol, such as CIFS (I've no idea if it supports sparse files), were to not support sparse files, all that would mean is that the server must report a number of blocks that matches a file's size (assuming the protocol in question even supports any notion of reporting a file's size in blocks). There are really two ways in which a filesystem protocol could support sparse files: a) by reporting file size in bytes and blocks, b) by reporting lists of file offsets demarcating holes from non-holes. (b) is a very new idea; Lustre may be the only filesystem I know of that supports this (see the Linux FIEMAP APIs), though work is in progress to add this to NFSv4. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
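Both detection methods mentioned above fit in a few lines (illustrative sketch; st_blocks being in 512-byte units is the traditional convention, and SEEK_HOLE needs a platform and filesystem that support it):

```python
import os

def is_probably_sparse(path):
    """Old heuristic: allocated bytes fall well short of reported size."""
    st = os.stat(path)
    return st.st_blocks * 512 < st.st_size

def first_hole(path):
    """Newer lseek(2) interface: offset of the first hole.  Filesystems
    without hole tracking report one implicit hole at end-of-file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.lseek(fd, 0, os.SEEK_HOLE)
    finally:
        os.close(fd)
```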
Re: [zfs-discuss] disable zfs/zpool destroy for root user
On Thu, Feb 17, 2011 at 3:07 PM, Richard Elling wrote: > On Feb 17, 2011, at 12:44 PM, Stefan Dormayer wrote: > >> Hi all, >> >> is there a way to disable the subcommand destroy of zpool/zfs for the root >> user? > > Which OS? Heheh. Great answer. The real answer depends also on what the OP meant by "root". "root" in Solaris isn't the all-powerful thing it used to be, or, rather, it is, but its power can be limited. And not just on Solaris either. The OP's question is difficult to answer because the question isn't the one the OP really wants to ask -- we must tease out that real question, or guess. I'd start with: just what is it that you want to accomplish? Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On Feb 14, 2011 6:56 AM, "Paul Kraus" wrote: > P.S. I am measuring number of objects via `zdb -d` as that is faster > than trying to count files and directories and I expect is a much > better measure of what the underlying zfs code is dealing with (a > particular dataset may have lots of snapshot data that does not > (easily) show up). It's faster because: a) no atime updates, b) no ZPL overhead. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 1:17 PM, Yi Zhang wrote: > On Mon, Feb 7, 2011 at 1:51 PM, Brandon High wrote: > Maybe I didn't make my intention clear. UFS with directio is > reasonably close to a raw disk from my application's perspective: when > the app writes to a file location, no buffering happens. My goal is to > find a way to duplicate this on ZFS. You're still mixing directio and O_DSYNC. O_DSYNC is like calling fsync(2) after every write(2). fsync(2) is useful to obtain some limited transactional semantics, as well as for durability semantics. In ZFS you don't need to call fsync(2) to get those transactional semantics, but you do need to call fsync(2) to get those durability semantics. Now, in ZFS fsync(2) implies a synchronous I/O operation involving significantly more than just the data blocks you wrote to. Which means that O_DSYNC on ZFS is significantly slower than on UFS. You can address this in one of two ways: a) you might realize that you don't need every write(2) to be durable, then stop using O_DSYNC, b) you might get a fast ZIL device. I'm betting that if you look carefully at your application's requirements you'll probably conclude that you don't need O_DSYNC at all. Perhaps you can tell us more about your application. > Setting primarycache didn't eliminate the buffering, using O_DSYNC > (whose side effects include elimination of buffering) made it > ridiculously slow: none of the things I tried eliminated buffering, > and just buffering, on ZFS. > > From the discussion so far my feeling is that ZFS is too different > from UFS that there's simply no way to achieve this goal... You've not really stated your application's requirements. You may be convinced that you need O_DSYNC, but chances are that you don't. And yes, it's possible that you'd need O_DSYNC on UFS but not on ZFS. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
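The distinction drawn above can be seen at the syscall level. A sketch (hypothetical function names; it assumes a platform that exposes O_DSYNC, as Solaris and Linux do) contrasting per-write synchronous I/O with buffered writes plus one explicit durability point:

```python
import os

def write_odsync(path, data):
    """Every write(2) is synchronous: the call returns only once the
    file data is durable -- the expensive pattern on ZFS."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC,
                 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)

def write_then_fsync(path, data):
    """Buffered writes, with durability requested only at the point the
    application actually needs it -- usually far cheaper."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)       # single durability barrier instead of one per write
    finally:
        os.close(fd)
```

If the application only needs durability at transaction commit, the second pattern lets ZFS batch everything before the barrier into large transactions.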
Re: [zfs-discuss] zpool export+import doesn't maintain snapshot
On Wednesday 14 January 2009 16:49:48 cindy.swearin...@sun.com wrote: > Nico, > > If you want to enable snapshot display as in previous releases, > then set this parameter on the pool: > > # zpool set listsnapshots=on pool-name > > Cindy > thanks, it works as I need. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool export+import doesn't maintain snapshot
On Wednesday 14 January 2009 11:44:56 Peter Tribble wrote: > On Wed, Jan 14, 2009 at 10:11 AM, Nico Sabbi wrote: > > Hi, > > I wanted to migrate a virtual disk from a S10U6 to OpenSolaris > > 2008.11. > > In the first machine I rebooted to single-user and ran > > $ zpool export disco > > > > then copied the disk files to the target VM, rebooted as > > single-user and ran > > $ zpool import disco > > > > The disc was mounted, but none of the hundreds of snapshots was > > there. > > > > Did I miss something? > > How do you know the snapshots are gone? > > Note that the zfs list command no longer shows snapshots by > default. You need 'zfs list -t all' for that. now I see them, but why this change? what do I have to do to list them by default as on the old server? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zpool export+import doesn't maintain snapshot
Hi, I wanted to migrate a virtual disk from a S10U6 to OpenSolaris 2008.11. In the first machine I rebooted to single-user and ran $ zpool export disco then copied the disk files to the target VM, rebooted as single-user and ran $ zpool import disco The disc was mounted, but none of the hundreds of snapshots was there. Did I miss something? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS boot and data on same disk - is this supported?
On Friday 19 December 2008 03:32:01 Ian Collins wrote: > On Fri 19/12/08 14:52 , Shawn Joy shawn@sun.com sent: > > I have read the ZFS best practice guide located at > > http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide > > However I have questions whether we support using slices > > for data on the same disk as we use for ZFS boot. > > Why would you want to do this instead of giving ZFS the whole disk? > Do you have compelling reasons to use UFS rather than ZFS > filesystems for data? I find ZFS's eagerness to monopolize the disk quite irritating: sometimes there are other OSs on the same disk. BTW, how much does ZFS slow down (on average) when using slices instead of the whole disk? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Some basic questions about getting the best performance for database usage
On Monday 30 June 2008 11:14:10 James C. McPherson wrote: > Christiaan Willemsen wrote: > ... > > > And that is exactly where ZFS comes in, at least as far as I > > read. > > > > The question is: how can we maximize IO by using the best > > possible combination of hardware and ZFS RAID? > > ... > > > For what I read, mirroring and striping should get me better > > performance than raidz or RAID5. But I guess you might give me > > some pointer on how to distribute the disk. My biggest question > > is what I should leave to the HW raid, and what to ZFS? > > Hi Christiaan, > If you haven't found it already, I highly recommend going > through the information at these three urls: > > > http://www.solarisinternals.com/wiki/index.php/ZFS_Configuration_Guide > http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide > http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide > > > I'll defer to Richard Elling and Roch Bourbonnais for specific > suggestions based on your email - as far as I'm concerned they're > the experts when it comes to ZFS tuning and database performance. > > > James C. McPherson > -- I want to save you some time and suffering: I had to add set zfs:zfs_nocacheflush = 1 to /etc/system and reboot to cure the horrible slowness I experienced with all of MySQL's storage engines on ZFS, especially InnoDB. I had never seen a DB go so slow until that moment ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] memory hog
On Monday 23 June 2008 09:39:13 Kaiwai Gardiner wrote: > Erik Trimble wrote: > > Edward wrote: > >> So does that mean ZFS is not for consumer computer? > >> If ZFS require 4GB of Ram for operation, that means i will need > >> 8GB+ Ram if i were to use Photoshop or any other memory > >> intensive application? > > > > No. It works fine on desktops - I'm writing this on an older > > Athlon64 with 1GB. Memory pressure does seem to become a bit > > more of an issue when I'm doing more I/O on the box (which, I'm > > assuming, is due to the various caches), so for things like > > compiling, I feel a little cramped. > > > > Personally, (in my experience only), I'd say that ZFS works well > > for use on the desktop, ASSUMING you dedicate 1GB of RAM to > > solely the OS (and ZFS). For very heavy I/O work, I think at > > least 2GB is a better idea. > > > > So, size your total memory accordingly. > > I've got a Dell Dimension 8400 w/ 2.5gb ram and p4 3.2Ghz > processor; I haven't noticed any slow downs either. Memory is so > cheap, adding an extra 2gb is only around NZ$100 these days anyway. > > Matthew this is the kind of reasoning that hides problems rather than correcting them. Sooner or later problems will show up in other - maybe worse - forms ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss