Re: [zfs-discuss] ZFS: unreliable for professional usage?
Bob Friesenhahn wrote:
> On Fri, 13 Feb 2009, Ross wrote:
>>
>> Something like that will have people praising ZFS' ability to
>> safeguard their data, and the way it recovers even after system
>> crashes or when hardware has gone wrong. You could even have a
>> "common causes of this are..." message, or a link to an online help
>> article if you wanted people to be really impressed.
>
> I see a career in politics for you. Barring an operating system
> implementation bug, the type of problem you are talking about is due to
> improperly working hardware. Irreversibly reverting to a previous
> checkpoint may or may not obtain the correct data. Perhaps it will
> produce a bunch of checksum errors.

Actually that's a lot like how FMA replies when it sees a problem, telling
the person what happened and pointing them to a web page which can be
updated with the newest information on the problem. That's a good spot for
something like: "This pool was not unmounted cleanly due to a hardware
fault and data has been lost. The <date> line contains the date which can
be recovered to. Use the command '# zfs reframbulocate -t <date>' to
revert to it."

--dave
--
David Collier-Brown         | Always do right. This will gratify
Sun Microsystems, Toronto   | some people and astonish the rest
dav...@sun.com              |                      -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#

Re: [zfs-discuss] Does your device honor write barriers?
Peter Schuller wrote:
> It would actually be nice in general I think, not just for ZFS, to
> have some standard "run this tool" that will give you a check list of
> successes/failures that specifically target storage
> correctness. Though correctness cannot be proven, you can at least
> test for common cases of systematic incorrect behavior.

A tiny niggle: for an operation set of moderate size, you can generate an
exhaustive set of tests. I've done so for APIs, but unless you have
infinite spare time, you want to generate the test set with a tool (;-))

--dave (who hasn't even Copious Spare Time, much less Infinite) c-b

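A minimal sketch of the "generate the test set with a tool" idea -- the
operation names are made up, and a real harness would replay each
generated sequence against the device and check the on-disk result:

    #!/bin/sh
    # Enumerate every ordered pair drawn from a small operation set;
    # each line becomes one test case for a storage-correctness harness.
    OPS="write fsync rename unlink truncate"
    for first in $OPS; do
        for second in $OPS; do
            echo "testcase: $first then $second"
        done
    done
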
Re: [zfs-discuss] ZFS core contributor nominations
+1 utterly!

Mark Shellenbaum wrote:
> Neelakanth Nadgir wrote:
>> +1.
>>
>> I would like to nominate roch.bourbonn...@sun.com for his work on
>> improving the performance of ZFS over the last few years.
>>
>> thanks,
>> -neel
>>
> +1 on Roch being a core contributor.
>
>> On Feb 2, 2009, at 4:02 PM, Neil Perrin wrote:
>>
>>> Looks reasonable
>>> +1
>>>
>>> Neil.
>>>
>>> On 02/02/09 08:55, Mark Shellenbaum wrote:
>>>> The time has come to review the current Contributor and Core
>>>> contributor grants for ZFS. Since all of the ZFS core contributor
>>>> grants are set to expire on 02-24-2009 we need to renew the members
>>>> that are still contributing at core contributor levels. We should
>>>> also add some new members to both Contributor and Core contributor
>>>> levels.
>>>>
>>>> First the current list of Core contributors:
>>>>
>>>> Bill Moore (billm)
>>>> Cindy Swearingen (cindys)
>>>> Lori M. Alt (lalt)
>>>> Mark Shellenbaum (marks)
>>>> Mark Maybee (maybee)
>>>> Matthew A. Ahrens (ahrens)
>>>> Neil V. Perrin (perrin)
>>>> Jeff Bonwick (bonwick)
>>>> Eric Schrock (eschrock)
>>>> Noel Dellofano (ndellofa)
>>>> Eric Kustarz (goo)*
>>>> Georgina A. Chua (chua)*
>>>> Tabriz Holtz (tabriz)*
>>>> Krister Johansen (johansen)*
>>>>
>>>> All of these should be renewed at Core contributor level, except for
>>>> those with a "*". Those with a "*" are no longer involved with ZFS
>>>> and we should let their grants expire.
>>>>
>>>> I am nominating the following to be new Core Contributors of ZFS:
>>>>
>>>> Jonathan W. Adams (jwadams)
>>>> Chris Kirby
>>>> Lin Ling
>>>> Eric C. Taylor (taylor)
>>>> Mark Musante
>>>> Rich Morris
>>>> George Wilson
>>>> Tim Haley
>>>> Brendan Gregg
>>>> Adam Leventhal
>>>> Pawel Jakub Dawidek
>>>> Ricardo Correia
>>>>
>>>> For Contributor I am nominating the following:
>>>> Darren Moffat
>>>> Richard Elling
>>>>
>>>> I am voting +1 for all of these (including myself)
>>>>
>>>> Feel free to nominate others for Contributor or Core Contributor.
>>>>
>>>>    -Mark

--dave

Re: [zfs-discuss] Why is st_size of a zfs directory equal to the
"Richard L. Hamilton" wrote: >> I did find the earlier discussion on the subject (someone e-mailed me that >> there had been >> such). It seemed to conclude that some apps are statically linked with old >> scandir() code >> that (incorrectly) assumed that the number of directory entries could be >> estimated as >> st_size/24; and worse, that some such apps might be seeing the small st_size >> that zfs >> offers via NFS, so they might not even be something that could be fixed on >> Solaris at all. >> But I didn't see anything in the discussion that suggested that this was >> going to be changed. >> Nor did I see a compelling argument for leaving it the way it is, either. >> In the face of >> "undefined", all arguments end up as pragmatism rather than principle, IMO. > Joerg Schilling wrote: > This is a problem I had to fix for some customers in 1992 when people started > to use NFS > servers based on the Novell OS. > Jörg > Oh bother, I should have noticed this back in 1999/2001 (;-)) Joking aside, we were looking at the Solaris ABI (application Binary interface) and working on ensuring binary stability. The size of a directory entry was supposed to be undefined and in principle *variable*, but Novell et all seem to have assumed that the size they used was guaranteed to be the same for all time. And no machine needs more than 640 KB of memory, either... Ah well, at least the ZFS folks found it for us, so I can add it to my database of porting problems. What OSs did you folks find it on? --dave (an external consultant, these days) c-b -- David Collier-Brown| Always do right. This will gratify Sun Microsystems, Toronto | some people and astonish the rest dav...@sun.com | -- Mark Twain cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191# ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Tuning for a file server, disabling data cache (almost)
Marcelo Leal <[EMAIL PROTECTED]> wrote:
> Hello all,
> I think he got some point here... maybe that would be an interesting
> feature for that kind of workload. Caching all the metadata would make
> the rsync task faster (for many files). Trying to cache the data is
> really a waste of time, because the data will not be read again, and
> will just push out the "good" metadata cached. That is what I
> understand when he said the 96k would be discarded soon. He wants to
> "configure" an area to "copy the data", and that's it. Leave my
> metadata cache alone. ;-)

That's a common enough behavior pattern that Per Brinch Hansen defined a
distinct filetype for it in, if memory serves, the RC 4000: as soon as
it's read, it's gone. We saw this behavior on NFS servers in the Markham
ACE lab, and absolutely with Samba almost everywhere. My Smarter
Colleagues[tm] explained it as a normal pattern whenever you have
front-end caching, as back-end caching is then rendered far less
effective, and sometimes directly disadvantageous.

It sounded like, from the previous discussion, one could tune for it with
the level 1 and 2 caches, although if I understood it properly, the
particular machine also had to narrow a stripe for the particular load
being discussed...

--dave

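For the "leave my metadata cache alone" case, a sketch of the level 1 and
2 cache tuning alluded to above, assuming a build recent enough to have
the ARC cache-policy properties (the dataset name is made up):

    # zfs set primarycache=metadata tank/rsync-target
    # zfs set secondarycache=metadata tank/rsync-target

With these set, the read-once file data passes through without evicting
the metadata that actually gets reused.
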
[zfs-discuss] Sidebar re ABI stability (was Segmentation fault / core dump)
[EMAIL PROTECTED] wrote:
> Linux does not implement stable kernel interfaces. It may be that there
> is an intention to do so but I've seen problems on Linux resulting from
> self-incompatibility on a regular basis.

To be precise, Linus tries hard to prevent ABI changes in the system call
interfaces exported from the kernel, but the glibc team has defeated him
in the past. For example, they accidentally started returning ENOTSUP
from getgid when one had a library version mismatch (!).

Sun stabilizes both library and system call interfaces: I used to work on
that with David J. Brown's team, back when I was an employee.

--dave (who's a contractor) c-b

Re: [zfs-discuss] Sidebar to ZFS Availability discussion
> But I think the bigger problem is that unless you can solve for the
> general case, you *will* get nailed. I might even argue that we need a
> way for storage devices to notify hosts of their characteristics, which
> would require protocol adoption and would take years to implement.

Fortunately, the critical metric, latency, is easy to measure. Noisy!
Indeed, very noisy, but easy for specific cases, as noted above. The
general case you describe below is indeed harder. I suspect we may need
to statically annotate certain devices with critical behavior
information...

> Consider two scenarios:
>
> Case 1. Fully redundant storage array with active/active controllers.
>     A failed controller should cause the system to recover on the
>     surviving controller. I have some lab test data for this sort of
>     thing and some popular arrays can take on the order of a minute to
>     complete the failure detection and reconfiguration. You don't
>     want to degrade the vdev when this happens, you just want to
>     wait until the array is again ready for use (this works ok today.)
>     I would further argue that no "disk failure prediction" code would
>     be useful for this case.
>
> Case 2. Power on test. I had a bruise (no scar :-) once from an
>     integrated product we were designing
>     http://docs.sun.com/app/docs/coll/cluster280-3
>     which had a server (or two) and raid array (or two). If you build
>     such a system from scratch, then it will fail a power-on test.
>     If you power on the rack containing these systems, then the time
>     required for the RAID array to boot was longer than the time
>     required for the server to boot *and* timeout probes of the array.
>     The result was that the volume manager will declare the disks bad
>     and system administration intervention is required to regain access
>     to the data in the array. Since this was an integrated product, we
>     solved it by inducing a delay loop in the server boot cycle to
>     slow down the server. Was it the best possible solution? No, but
>     it was the only solution which met our other design constraints.
>
> In both of these cases, the solutions imply multi-minute timeouts are
> required to maintain a stable system. For 101-level insight to this
> sort of problem see the Sun BluePrint article (an oldie, but goodie):
> http://www.sun.com/blueprints/1101/clstrcomplex.pdf

--dave

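On the "latency is easy to measure" point, the per-device service times
that Solaris iostat already reports are usually enough to spot a slow or
failing path; this just samples every five seconds (watch the asvc_t
column):

    $ iostat -xn 5
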
[zfs-discuss] Sidebar to ZFS Availability discussion
Re Availability: ZFS needs to handle disk removal / driver failure better

>> A better option would be to not use this to perform FMA diagnosis, but
>> instead work into the mirror child selection code. This has already
>> been alluded to before, but it would be cool to keep track of latency
>> over time, and use this to both a) prefer one drive over another when
>> selecting the child and b) proactively timeout/ignore results from one
>> child and select the other if it's taking longer than some historical
>> standard deviation. This keeps away from diagnosing drives as faulty,
>> but does allow ZFS to make better choices and maintain response times.
>> It shouldn't be hard to keep track of the average and/or standard
>> deviation and use it for selection; proactively timing out the slow
>> I/Os is much trickier.

Interestingly, tracking latency has come under discussion in the Linux
world, too, as they start to deal with developing resource management for
disks as well as CPU.

In fact, there are two cases where you can use a feedback loop to adjust
disk behavior, and a third to detect problems. The first loop is the one
you identified, for dealing with near/far and fast/slow mirrors. The
second is for resource management, where one throttles disk-hog projects
when one discovers latency growing without bound on disk saturation, and
the third is in case of a fault other than the above.

For the latter to work well, I'd like to see the resource management and
fast/slow mirror adaptation be something one turns on explicitly, because
then when FMA discovered that you in fact have a fast/slow mirror or a
Dr. Evil program saturating the array, the "fix" could be to notify the
sysadmin that they had a problem and suggest built-in tools to ameliorate
it.

Ian Collins writes:
> One solution (again, to be used with a remote mirror) is the three-way
> mirror. If two devices are local and one remote, data is safe once the
> two local writes return. I guess the issue then changes from "is my
> data safe" to "how safe is my data". I would be reluctant to deploy a
> remote mirror device without local redundancy, so this probably won't
> be an uncommon setup. There would have to be an acceptable window of
> risk when local data isn't replicated.

And in this case too, I'd prefer the sysadmin provide the information to
ZFS about what she wants, and have the system adapt to it, and report how
big the risk window is. This would effectively change the FMA behavior,
you understand, so as to have it report failures to complete the local
writes in time t0 and remote in time t1, much as the resource management
or fast/slow cases would need to be visible to FMA.

--dave (at home) c-b

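A crude sketch of the "historical standard deviation" bookkeeping the
quoted proposal asks for -- assume latency.log holds one per-I/O latency
in milliseconds per line, however you choose to collect them:

    $ awk '{ n++; s += $1; ss += $1 * $1 }
           END { if (n) { m = s / n;
                 printf("mean %.2f ms  stddev %.2f ms\n", m, sqrt(ss / n - m * m)) } }' latency.log

A selection policy could then deprecate a mirror child whose recent I/Os
run more than a few deviations above its own mean.
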
Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
>> As others have said, "zfs status" should not hang. ZFS has to know
>> the state of all the drives and pools it's currently using; "zfs
>> status" should simply report the current known status from ZFS'
>> internal state. It shouldn't need to scan anything. ZFS' internal
>> state should also be checking with cfgadm so that it knows if a disk
>> isn't there. It should also be updated if the cache can't be flushed
>> to disk, and "zfs list / zpool list" needs to borrow state information
>> from the status commands so that they don't say 'online' when the pool
>> has problems.
>>
>> ZFS needs to deal more intelligently with mount points when a pool has
>> problems. Leaving the folder lying around in a way that prevents the
>> pool mounting properly when the drives are recovered is not good.
>> When the pool appears to come back online without errors, it would be
>> very easy for somebody to assume the data was lost from the pool
>> without realising that it simply hasn't mounted and they're actually
>> looking at an empty folder. Firstly ZFS should be removing the mount
>> point when problems occur, and secondly, ZFS list or ZFS status should
>> include information to inform you that the pool could not be mounted
>> properly.
>>
>> ZFS status really should be warning of any ZFS errors that occur,
>> including things like being unable to mount the pool, CIFS mounts
>> failing, etc...
>>
>> And finally, if ZFS does find problems writing from the cache, it
>> really needs to log somewhere the names of all the files affected, and
>> the action that could not be carried out. ZFS knows the files it was
>> meant to delete here; it also knows the files that were written. I
>> can accept that with delayed writes files may occasionally be lost
>> when a failure happens, but I don't accept that we need to lose all
>> knowledge of the affected files when the filesystem has complete
>> knowledge of what is affected. If there are any working filesystems
>> on the server, ZFS should make an attempt to store a log of the
>> problem; failing that it should e-mail the data out. The admin really
>> needs to know what files have been affected so that they can notify
>> users of the data loss. I don't know where you would store this
>> information, but wherever that is, "zpool status" should be reporting
>> the error and directing the admin to the log file.
>>
>> I would probably say this could be safely stored on the system drive.
>> Would it be possible to have a number of possible places to store this
>> log? What I'm thinking is that if the system drive is unavailable,
>> ZFS could try each pool in turn and attempt to store the log there.
>>
>> In fact e-mail alerts or external error logging would be a great
>> addition to ZFS. Surely it makes sense that filesystem errors would
>> be better off being stored and handled externally?
>>
>> Ross
>>
>>> Date: Mon, 28 Jul 2008 12:28:34 -0700
>>> From: [EMAIL PROTECTED]
>>> Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
>>> To: [EMAIL PROTECTED]
>>>
>>> I'm trying to reproduce and will let you know what I find.
>>> -- richard

--dave

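Until the reporting improves, the pieces that do exist can at least be
combined to catch the "empty mountpoint" trap Ross describes -- the pool
name here is hypothetical:

    $ zpool status -x                           # "all pools are healthy", or only the exceptions
    $ zfs list -o name,mounted,mountpoint tank  # the mounted column says yes/no per dataset
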
Re: [zfs-discuss] [zfs-code] Peak every 4-5 second
Brandon High wrote:
> On Fri, Jul 25, 2008 at 9:17 AM, David Collier-Brown
> <[EMAIL PROTECTED]> wrote:
>
>> And do you really have 4-sided raid 1 mirrors, not 4-wide raid-0
>> stripes???
>
> Or perhaps 4 RAID1 mirrors concatenated?

I wondered that too, but he insists he doesn't have 0+1 or 1+0...
Tharindu, could you clarify this for us? It significantly affects what
advice we give!

--dave (former tech lead, performance engineering at ACE) c-b

Re: [zfs-discuss] [zfs-code] Peak every 4-5 second
And do you really have 4-sided raid 1 mirrors, not 4-wide raid-0 stripes???

--dave

Robert Milkowski wrote:
> Hello Tharindu,
>
> Thursday, July 24, 2008, 6:02:31 AM, you wrote:
>
>> We do not use raidz*. Virtually, no raid or stripe through OS.
>> We have 4 disk RAID1 volumes. RAID1 was created from CAM on 2540.
>> 2540 does not have RAID 1+0 or 0+1.
>
> Of course it does 1+0. Just add more drives to RAID-1
>
> --
> Best regards,
> Robert Milkowski  mailto:[EMAIL PROTECTED]
> http://milek.blogspot.com

Re: [zfs-discuss] [zfs-code] Peak every 4-5 second
Hmmn, that *sounds* as if you are saying you've a very-high-redundancy
RAID1 mirror, 4 disks deep, on an 'enterprise-class tier 2 storage' array
that doesn't support RAID 1+0 or 0+1.

That sounds weird: the 2540 supports RAID levels 0, 1, (1+0), 3 and 5, and
deep mirrors are normally only used on really fast equipment in
mission-critical tier 1 storage... Are you sure you don't mean you have
raid 0 (stripes) 4 disks wide, each stripe presented as a LUN?

If you really have 4-deep RAID 1, you have a configuration that will
perform somewhat slower than any single disk, as the array launches 4
writes to 4 drives in parallel, and returns success when they all
complete. If you had 4-wide RAID 0, with mirroring done at the host, you
would have a configuration that would (probabilistically) perform better
than a single drive when writing to each side of the mirror, and the
write would return success when the slowest side of the mirror completed.

--dave (puzzled!) c-b

Tharindu Rukshan Bamunuarachchi wrote:
> We do not use raidz*. Virtually, no raid or stripe through OS.
>
> We have 4 disk RAID1 volumes. RAID1 was created from CAM on 2540.
>
> 2540 does not have RAID 1+0 or 0+1.
>
> cheers
> tharindu
>
> Brandon High wrote:
>> On Tue, Jul 22, 2008 at 10:35 PM, Tharindu Rukshan Bamunuarachchi
>> <[EMAIL PROTECTED]> wrote:
>>> Dear Mark/All,
>>>
>>> Our trading system is writing to local and/or array volume at 10k
>>> messages per second. Each message is about 700 bytes in size.
>>>
>>> Before ZFS, we used UFS. Even with UFS, there was a peak every 5
>>> seconds due to fsflush invocation.
>>>
>>> However each peak is about ~5ms. Our application can not recover
>>> from such higher latency.
>>
>> Is the pool using raidz, raidz2, or mirroring? How many drives are
>> you using?
>>
>> -B

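A sketch of that second layout, mirroring at the host -- each LUN below
stands in for one 4-wide RAID-0 stripe exported by the 2540, and the
device names are invented:

    # zpool create fastpool mirror c2t0d0 c2t1d0

ZFS then returns the write when the slower of the two stripes completes,
rather than waiting on four copies of every block.
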
Re: [zfs-discuss] OT: Formatting Problem of ZFS Adm Guide (pdf)
One can carve furniture with an axe, especially if it's razor-sharp, but
that doesn't make it a spokeshave, plane and saw. I love StarOffice, and
use it every day, but my publisher uses Frame, so that's what I use for
books.

--dave

W. Wayne Liauh wrote:
>> I doubt so. Star/OpenOffice are word processors... and like Word they
>> are not suitable for typesetting documents.
>>
>> SGML, FrameMaker & TeX/LaTeX are the only ones capable of doing that.
>
> This was pretty much true about a year ago. However, after version 2.3,
> which adds the kerning feature, OpenOffice.org can produce very
> professional-looking documents.
>
> All of the OOo User Guides, which are every bit as complex as if not
> more so than our own user guides, are now "self-generated". Solveig
> Haugland, a highly respected OpenOffice.org consultant, published her
> book "OpenOffice.org 2 Guidebook" (a 527-page book complete with
> drawings, table of contents, multi-column index, etc.) entirely on OOo.
>
> Another key consideration, in addition to perhaps a desire to support
> our sister product, is that the documents so generated are guaranteed
> to be displayable on the OS they are intended to serve. This is a
> pretty important consideration IMO. :-)

Re: [zfs-discuss] ZFS deduplication
Hmmn, you might want to look at Andrew Tridgell's thesis (yes, Andrew of
Samba fame), as he had to solve this very question to be able to select
an algorithm to use inside rsync.

--dave

Darren J Moffat wrote:
> [EMAIL PROTECTED] wrote:
>> [EMAIL PROTECTED] wrote on 07/08/2008 03:08:26 AM:
>>
>>>> Does anyone know a tool that can look over a dataset and give
>>>> duplication statistics? I'm not looking for something incredibly
>>>> efficient but I'd like to know how much it would actually benefit our
>>>
>>> Check out the following blog..:
>>>
>>> http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool
>>
>> Just want to add, while this is ok to give you a ballpark dedup number
>> -- fletcher2 is notoriously collision prone on real data sets. It is
>> meant to be fast at the expense of collisions. This issue can show
>> much more dedup possible than really exists on large datasets.
>
> Doing this using sha256 as the checksum algorithm would be much more
> interesting. I'm going to try that now and see how it compares with
> fletcher2 for a small contrived test.

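For a rough ballpark without touching the pool internals, a file-level
(not block-level) duplicate count with a collision-resistant hash can be
improvised in the shell -- this assumes GNU findutils/coreutils are
installed and the path is made up:

    $ find /tank/data -type f -print0 | xargs -0 sha256sum | \
          awk '{ print $1 }' | sort | uniq -c | awk '$1 > 1' | sort -rn | head

Each surviving line is a hash seen more than once, with its count; it
only spots whole-file matches, so it understates what block-level dedup
could find.
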
Re: [zfs-discuss] Some basic questions about getting the best performance for database usage
David Collier-Brown wrote:
>> ZFS copy-on-write results in tables' contents being spread across
>> the full width of their stripe, which is arguably a good thing
>> for transaction processing performance (or at least can be), but
>> makes sequential table-scan speed degrade.
>>
>> If you're doing sequential scans over large amounts of data
>> which isn't changing very rapidly, such as older segments, you
>> may want to re-sequentialize that data.

Richard Elling <[EMAIL PROTECTED]> wrote:
> There is a general feeling that COW, as used by ZFS, will cause
> all sorts of badness for database scans. Alas, there is a dearth of
> real-world data on any impacts (I'm anxiously awaiting...)
> There are cases where this won't be a problem at all, but it will
> depend on how you use the data.

I quite agree: at some point, the experts on Oracle, MySQL and PostgreSQL
will get a clear understanding of how to get the best performance for
random database I/O and ZFS. I'll be interested to see what the behavior
is for large, high-performance systems. In the meantime...

> In this particular case, it would be cost effective to just buy a
> bunch of RAM and not worry too much about disk I/O during
> scans. In the future, if you significantly outgrow the RAM, then
> there might be a case for a ZFS (L2ARC) cache LUN to smooth
> out the bumps. You can probably defer that call until later.

... it's a Really Nice Thing that large memories only cost small dollars
(;-))

--dave

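If the "cache LUN" route is ever needed, it is a one-liner on releases
that have L2ARC support; the pool and device names below are
hypothetical:

    # zpool add dbpool cache c4t0d0
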
[zfs-discuss] Some basic questions about getting the best performance for database usage
This is a bit of a sidebar to the discussion about getting the best
performance for PostgreSQL from ZFS, but may affect you if you're doing
sequential scans through the 70GB table or its segments.

ZFS copy-on-write results in tables' contents being spread across the
full width of their stripe, which is arguably a good thing for
transaction processing performance (or at least can be), but makes
sequential table-scan speed degrade.

If you're doing sequential scans over large amounts of data which isn't
changing very rapidly, such as older segments, you may want to
re-sequentialize that data.

I was talking to one of the Slony developers back when this first came
up, and he suggested a process to do this in PostgreSQL. He suggested
doing a "cluster" operation, relative to a specific index, then dropping
and recreating the index. This results in the relation being rewritten
in the order the index is sorted by, which should defragment/linearize
it. Dropping and recreating the index rewrites it sequentially too.

Neither he nor I know the cost if the relation has more than one index:
we speculate they should be dropped before the clustering and recreated
last.

--dave

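A hedged sketch of that recipe for one older segment -- PostgreSQL 8.3
syntax, and the database, table and index names are all invented:

    $ psql -d tradedb -c 'CLUSTER trades_2007q1 USING trades_2007q1_ts_idx'
    $ psql -d tradedb -c 'DROP INDEX trades_2007q1_ts_idx'
    $ psql -d tradedb -c 'CREATE INDEX trades_2007q1_ts_idx ON trades_2007q1 (ts)'

The CLUSTER rewrites the table in index order; dropping and recreating
the index then rewrites the index itself sequentially.
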
Re: [zfs-discuss] Filesystem for each home dir - 10,000 users?
Chris Siebenmann wrote:
| Speaking as one of those pesky university people (although we don't use
| quotas): one of the reasons this happens is that servers are a lot less
| expensive than disk space. With disk space you have to factor in the
| cost of backups and ongoing maintenance, whereas another server is just
| N thousand dollars in one-time costs and some rack space.

This is also common in organizations where IT is a cost center, including
some *very* large ones I've encountered in the past and several which are
just, well, conservative.

--dave

Re: [zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Chris Siebenmann <[EMAIL PROTECTED]> wrote:
| Speaking as a sysadmin (and a Sun customer), why on earth would I have
| to provision 8 GB+ of RAM on my NFS fileservers? I would much rather
| have that memory in the NFS client machines, where it can actually be
| put to work by user programs.
|
| (If I have decently provisioned NFS client machines, I don't expect
| much from the NFS fileserver's cache. Given that the clients have
| caches too, I believe that the server's cache will mostly be hit for
| things that the clients cannot cache because of NFS semantics, like NFS
| GETATTR requests for revalidation and the like.)

That's certainly true for the NFS part of the NFS fileserver, but to get
the ZFS feature-set, you trade off cycles and memory. If we investigate
this a bit, we should be able to figure out a rule of thumb for how
little memory we need for an NFS->home-directories workload without
cutting into performance.

--dave

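One blunt lever while that rule of thumb gets worked out is to make the
fileserver's ZFS memory budget explicit by capping the ARC with the usual
/etc/system tunable -- the value shown is only an example (1 GB) and
takes effect at the next boot:

    set zfs:zfs_arc_max = 0x40000000
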
Re: [zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Darren J Moffat <[EMAIL PROTECTED]> wrote:
> Chris Siebenmann wrote:
>> | Still, I'm curious -- why lots of pools? Administration would be
>> | simpler with a single pool containing many filesystems.
>>
>> The short answer is that it is politically and administratively easier
>> to use (at least) one pool per storage-buying group in our
>> environment.
>
> I think the root cause of the issue is that multiple groups are buying
> physical rather than virtual storage, yet it is all being attached to a
> single system. It will likely be a huge uphill battle, but: if all the
> physical storage could be purchased by one group, and a combination of
> ZFS reservations and quotas used on "top level" (eg one level down from
> the pool) datasets to allocate the virtual storage, and appropriate
> amounts charged to the groups, you could technically use ZFS how it was
> intended, with much fewer (hopefully 1 or 2) pools.

The scenario Chris describes is one I see repeatedly at customers buying
SAN storage (as late as last month!) and is considered a best practice on
the business side. We may want to make this issue and its management
visible, as people moving from SAN to ZFS are likely to trip over it.

In particular, I'd like to see a blueprint or at least a wiki discussion
by someone from the SAN world on how to map those kinds of purchases to
ZFS pools, how few one wants to have, what happens when it goes wrong,
and how to mitigate it (;-))

--dave

ps: as always, having asked for something, I'm also volunteering to help
provide it: I'm not a storage or ZFS guy, but I am an author, and will
happily help my Smarter Colleagues[tm] to write it up.

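The one-pool, dataset-per-group model Darren sketches boils down to
something like this -- the pool, group names and sizes are all invented:

    # zfs create -o quota=2T -o reservation=2T tank/astro
    # zfs create -o quota=500G -o reservation=500G tank/genomics

Each group is charged for its reservation, and a grant can later be grown
in place with "zfs set quota=..." and "zfs set reservation=..." rather
than by attaching another physical tray to the host.
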
Re: [zfs-discuss] How many ZFS pools is it sensible to use on a single server?
We've discussed this in considerable detail, but the original question
remains unanswered: if an organization *must* use multiple pools, is
there an upper bound to avoid or a rate of degradation to be considered?

--dave

Re: [zfs-discuss] How many ZFS pools is it sensible to use on a single server?
Chris Siebenmann wrote:
> Every university department has to face the issue of how to allocate
> disk space to people. Here, we handle storage allocation decisions
> through the relatively simple method of selling fixed-size chunks of
> storage to faculty (either single professors or groups of them) for a
> small one-time fee.

I would expect all sorts of organizations would want to allocate space to
their customers as a concatenation of "reasonable sized" chunks, where
the definition of reasonable would vary in size depending on the
business. I was recently in a similar discussion about how best to do
this allocation on a 9990v, so I expect it's not peculiar to the UofT
(:-))

--dave (about 6 miles north of Chris) c-b