Re: [zfs-discuss] Long resilver time
On Sep 26, 2010, at 1:16 PM, Roy Sigurd Karlsbakk wrote:

>>> Upgrading is definitely an option. What is the current snv favorite
>>> for ZFS stability? I apologize, with all the Oracle/Sun changes I
>>> haven't been paying as close attention to bug reports on zfs-discuss
>>> as I used to.
>>
>> OpenIndiana b147 is the latest binary release, and it also includes the fix for
>> CR 6494473, "ZFS needs a way to slow down resilvering"
>> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473
>> http://www.openindiana.org
>
> Are you sure upgrading to OI is safe at this point? 134 is stable unless you
> start fiddling with dedup, and OI is hardly tested. For a production setup,
> I'd recommend 134

For a production setup? For production I'd recommend something that is
supported, preferably NexentaStor 3 (which is b134 + important ZFS fixes :-)
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
Richard Elling
rich...@nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com
Re: [zfs-discuss] Long resilver time
>> Upgrading is definitely an option. What is the current snv favorite
>> for ZFS stability? I apologize, with all the Oracle/Sun changes I
>> haven't been paying as close attention to bug reports on zfs-discuss
>> as I used to.
>
> OpenIndiana b147 is the latest binary release, and it also includes the fix for
> CR 6494473, "ZFS needs a way to slow down resilvering"
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473
> http://www.openindiana.org

Are you sure upgrading to OI is safe at this point? 134 is stable unless you
start fiddling with dedup, and OI is hardly tested. For a production setup,
I'd recommend 134.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
Re: [zfs-discuss] Long resilver time
On Sep 26, 2010, at 11:03 AM, Jason J. W. Williams wrote:

> Upgrading is definitely an option. What is the current snv favorite for ZFS
> stability? I apologize, with all the Oracle/Sun changes I haven't been paying
> as close attention to bug reports on zfs-discuss as I used to.

OpenIndiana b147 is the latest binary release, and it also includes the fix for
CR 6494473, "ZFS needs a way to slow down resilvering"
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473
http://www.openindiana.org
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
Richard Elling
rich...@nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com
Re: [zfs-discuss] Long resilver time
Upgrading is definitely an option. What is the current snv favorite for ZFS
stability? I apologize, with all the Oracle/Sun changes I haven't been paying
as close attention to bug reports on zfs-discuss as I used to.

-J

Sent via iPhone

On Sep 26, 2010, at 10:22, Roy Sigurd Karlsbakk wrote:

> I just witnessed a resilver that took 4h for 27 GB of data. Setup is 3x
> raid-z2 stripes with 6 disks per raid-z2. Disks are 500 GB in size. No
> checksum errors.
>
> It seems like an exorbitantly long time. The other 5 disks in the stripe with
> the replaced disk were at 90% busy and ~150 IO/s each during the resilver.
> Does this seem unusual to anyone else? Could it be due to heavy fragmentation,
> or do I have a disk in the stripe going bad? Post-resilver no disk is above
> 30% util or noticeably higher than any other disk.
>
> Thank you in advance. (kernel is snv123)
>
> It surely seems a long time for 27 gigs. Scrub takes its time, but for this
> 50TB setup with currently ~29TB used, on WD Green drives (yeah, I know
> they're bad, but I didn't know that at the time I installed the box, and they
> have worked flawlessly for a year or so), scrub takes a bit of time, but
> nothing comparable to what you're reporting:
>
>   scrub: scrub completed after 47h57m with 0 errors on Fri Sep 3 16:57:26 2010
>
> Also, snv123 is quite old; is upgrading to 134 an option?
>
> Vennlige hilsener / Best regards
>
> roy
Re: [zfs-discuss] non-ECC Systems and ZFS for home users (was: Please warn a home user against OpenSolaris under VirtualBox under WinXP ; ))
On 9/26/2010 8:06 AM, devsk wrote:
> On 9/23/2010 at 12:38 PM Erik Trimble wrote:
> | [snip]
> | If you don't really care about ultra-low-power, then there's absolutely
> | no excuse not to buy a USED server-class machine which is 1- or 2-
> | generations back. They're dirt cheap, readily available,
> | [snip]
>
> Anyone have a link or two to a place where I can buy some dirt-cheap,
> readily available last-gen servers?
>
> I would love some links as well. I have heard a lot about "dirt cheap
> last-gen servers" but nobody ever provides a link.

http://www.serversupply.com/products/part_search/pid_lookup.asp?pid=105676
http://www.canvassystems.com/products/c-16-ibm-servers.aspx?_vsrefdom=PPCIBM&gclid=CLTF7NHNpaQCFRpbiAodoSxK5Q&;
http://www.glcomp.com/Products/IBM-SystemX-x3500-Server.aspx

Lots and lots of stuff from eBay - use it to see which companies are in the
recycling business, then deal with them directly, rather than through eBay:
http://computers.shop.ebay.com/i.html?_nkw=ibm+x3500&_sacat=58058&_sop=12&_dmd=1&_odkw=ibm+x3500&_osacat=0&_trksid=p3286.c0.m270.l1313

Companies specializing in used (often off-lease) business computers:
http://compucycle.net/
http://www.andovercg.com/
http://www.recurrent.com/
http://www.lapkosoft.com/
http://www.weirdstuff.com/
http://synergy.ships2day.com/
http://www.useddell.net/
http://www.vibrant.com/

There are hordes more. I've dealt with all of the above, and have no problem
recommending them.

The thing here is that you need to educate yourself *before* going out and
looking. You need to spend a non-trivial amount of time reading the support
pages for Sun, IBM, HP, and Dell, and be able to either *ask* for specific
part/model numbers, or be able to interpret what is advertised.

The key thing here is that many places will advertise/sell you some server,
and all the info they have is the model number off the front. If you can
understand what this means in terms of hardware, then you can get a bang-up
deal. I've bought computers from recycling places for 25% or less of the value
I could get by *immediately* turning around and selling the system somewhere
else - all because I could understand the part numbers well enough to know
what I was getting, and the original seller couldn't (or, in most cases,
didn't have the time to bother).

In particular, the best way to get a deal is usually to look for a machine
which has (a) very little info about it in the advertisement other than the
model number, and (b) a price noticeably higher than what you've seen a
"stripped" version of that model go for. Some of those will be stripped
systems where the seller doesn't understand the going rate, but many are
actually better-equipped versions where the significantly better system more
than makes up for the additional money.

Here's an example using the IBM x3500. Model 7977-72y seems to be the most
commonly available one right now, and the config it's generally sold in is
(2) dual-core E5140 2.33GHz Xeons plus 2GB of RAM, no disk, one power supply;
it tends to go for $400-500. I just got a model 7977-F2x, which was advertised
as a 2.0GHz model with nothing else in the ad except the word "loaded". I paid
$750 for it, and got a system with (2) quad-core E5335 Xeons, 8GB of RAM, and
8x73GB SAS drives, plus the redundant power supply. The extra $300 more than
covers what I would pay for the additional RAM, power supply, and drives, and
I get twice the CPU core count for "free".

Be an educated buyer, and the recycled marketplace can be your oyster.
I've actually made enough doing "arbitrage" to cover the cost of buying a nice
SOHO machine each year. That is, I can buy and sell 10-12 systems per year and
make $2000-3000 in profit for not much effort. I'd estimate you can sustain a
20-30% profit margin by being a smart buyer/seller, at least on a small scale.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Re: [zfs-discuss] Long resilver time
On Sun, 26 Sep 2010, Edward Ned Harvey wrote:

> 27G on a 6-disk raidz2 means approx 6.75G per disk. Ideally, the disk could
> write 7G = 56 Gbit in a couple of minutes if it were all sequential and there
> were no other activity in the system. So you're right to suspect something is
> suboptimal, but the root cause is inefficient resilvering code in zfs,
> specifically for raidzN. The resilver code spends a *lot* of time seeking,
> because it's not optimized by disk layout. This may change some day, but not
> in the near future.

Part of the problem is that the zfs designers decided that the filesystems
should remain up and usable during a resilver. Without this requirement,
things would be a lot easier. For example, we could just run some utility and
wait many hours (perhaps fewer hours than a zfs resilver) before the
filesystems are allowed to be used again. Few of us want to return to that
scenario.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
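A quick way to see whether a running resilver is seek-bound rather than
bandwidth-bound is to compare per-disk utilization against throughput while it
runs. A minimal sketch (the pool name "tank" is only a placeholder):

  # overall resilver progress and estimated completion time
  zpool status -v tank

  # per-device service times and %busy, refreshed every 5 seconds;
  # disks near 100% busy while moving only a few MB/s are seek-limited
  iostat -xn 5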
Re: [zfs-discuss] Long resilver time
- Original Message -
> I just witnessed a resilver that took 4h for 27 GB of data. Setup is 3x
> raid-z2 stripes with 6 disks per raid-z2. Disks are 500 GB in size. No
> checksum errors.
>
> It seems like an exorbitantly long time. The other 5 disks in the stripe with
> the replaced disk were at 90% busy and ~150 IO/s each during the resilver.
> Does this seem unusual to anyone else? Could it be due to heavy fragmentation,
> or do I have a disk in the stripe going bad? Post-resilver no disk is above
> 30% util or noticeably higher than any other disk.
>
> Thank you in advance. (kernel is snv123)

It surely seems a long time for 27 gigs. Scrub takes its time, but for this
50TB setup with currently ~29TB used, on WD Green drives (yeah, I know they're
bad, but I didn't know that at the time I installed the box, and they have
worked flawlessly for a year or so), scrub takes a bit of time, but nothing
comparable to what you're reporting:

  scrub: scrub completed after 47h57m with 0 errors on Fri Sep 3 16:57:26 2010

Also, snv123 is quite old; is upgrading to 134 an option?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
Re: [zfs-discuss] Long resilver time
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jason J. W. Williams
>
> I just witnessed a resilver that took 4h for 27 GB of data. Setup is 3x
> raid-z2 stripes with 6 disks per raid-z2. Disks are 500 GB in size. No
> checksum errors.

27G on a 6-disk raidz2 means approx 6.75G per disk. Ideally, the disk could
write 7G = 56 Gbit in a couple of minutes if it were all sequential and there
were no other activity in the system. So you're right to suspect something is
suboptimal, but the root cause is inefficient resilvering code in zfs,
specifically for raidzN. The resilver code spends a *lot* of time seeking,
because it's not optimized by disk layout. This may change some day, but not
in the near future.

Mirrors don't suffer the same effect. At least, if they do, it's far less
dramatic. For now, all you can do is: (a) factor this into your decision to
use mirrors versus raidz, (b) ensure there are no snapshots and minimal IO
during the resilver, and (c) if you opt for raidz, keep the number of disks in
each raidz to a minimum. It is preferable to use 3 vdevs of 7-disk raidz each
instead of a single 21-disk raidz3 (see the sketch below). Your setup of 3x
raidz2 is pretty reasonable, and a 4h resilver, although slow, is successful -
which is more than you could say for a 21-disk raidz3.
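To make the narrow-vdev recommendation concrete, here is a minimal sketch of a
pool built from three 7-disk raidz vdevs rather than one wide raidz3; the pool
name "tank" and the device names (c0t0d0 and so on) are placeholders only:

  # three 7-disk raidz vdevs; a resilver only has to read the disks
  # in the vdev that lost a member, not the whole pool
  zpool create tank \
      raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 \
      raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
      raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0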
Re: [zfs-discuss] [osol-discuss] zfs send/receive?
On Sep 26, 2010, at 4:41 AM, "Edward Ned Harvey" wrote:

>> From: Richard Elling [mailto:richard.ell...@gmail.com]
>>
>> It is relatively easy to find the latest, common snapshot on two file systems.
>> Once you know the latest, common snapshot, you can send the incrementals
>> up to the latest.
>
> I've always relied on the snapshot names matching. Is there a way to find
> the latest common snapshot if the names don't match?

If the snapshot names don't match, then the snapshots are not of the same
data, by definition. The actual comparison is easy: given two (time-sorted)
lists, find the latest, common entry.
 -- richard
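That comparison is easy to script. A minimal sketch, assuming a source dataset
tank/fs and a backup dataset backup/fs (both names, and the temp files, are
placeholders only):

  # snapshot names on each side, oldest first, stripped of the dataset prefix
  zfs list -H -t snapshot -o name -s creation | grep '^tank/fs@'   | sed 's/.*@//' > /tmp/src.snaps
  zfs list -H -t snapshot -o name -s creation | grep '^backup/fs@' | sed 's/.*@//' > /tmp/dst.snaps

  # the last entry present in both lists is the common base for the next incremental send
  grep -F -x -f /tmp/dst.snaps /tmp/src.snaps | tail -1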
Re: [zfs-discuss] non-ECC Systems and ZFS for home users (was: Please warn a home user against OpenSolaris under VirtualBox under WinXP ; ))
> On 9/23/2010 at 12:38 PM Erik Trimble wrote:
> | [snip]
> | If you don't really care about ultra-low-power, then there's absolutely
> | no excuse not to buy a USED server-class machine which is 1- or 2-
> | generations back. They're dirt cheap, readily available,
> | [snip]
>
> Anyone have a link or two to a place where I can buy some dirt-cheap,
> readily available last-gen servers?

I would love some links as well. I have heard a lot about "dirt cheap last-gen
servers" but nobody ever provides a link.
--
This message posted from opensolaris.org
[zfs-discuss] Long resilver time
I just witnessed a resilver that took 4h for 27 GB of data. Setup is 3x
raid-z2 stripes with 6 disks per raid-z2. Disks are 500 GB in size. No
checksum errors.

It seems like an exorbitantly long time. The other 5 disks in the stripe with
the replaced disk were at 90% busy and ~150 IO/s each during the resilver.
Does this seem unusual to anyone else? Could it be due to heavy fragmentation,
or do I have a disk in the stripe going bad? Post-resilver no disk is above
30% util or noticeably higher than any other disk.

Thank you in advance. (kernel is snv123)

-J

Sent via iPhone
Re: [zfs-discuss] fs root inode number?
Richard L. Hamilton wrote:
> Typically on most filesystems, the inode number of the root directory of the
> filesystem is 2, 0 being unused and 1 historically once invisible and used
> for bad blocks (no longer done, but kept reserved so as not to invalidate
> assumptions implicit in ufsdump tapes).
>
> However, my observation seems to be (at least back at snv_97) that the inode
> number of ZFS filesystem root directories (including at the top level of a
> pool) is 3, not 2.
>
> If there's any POSIX/SUS requirement for the traditional number 2, I haven't
> found it. So maybe there's no reason founded in official standards for
> keeping it the same. But there are bound to be programs that make what was,
> with other filesystems, a safe assumption. Perhaps a warning is in order, if
> there isn't already one.
>
> Is there some _reason_ why the inode number of filesystem root directories in
> ZFS is 3 rather than 2?

If you look at zfs_create_fs(), you will see that the first 3 objects created
are:

  1. the ZAP object used for SA attribute registration
  2. the delete queue
  3. the root znode

Hence, inode 3.

--
Andrew Gabriel
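This is easy to confirm from a shell; a minimal sketch (the mountpoints /tank,
/tank/somefs, and /ufs-mount are placeholders only):

  # the root directory of a ZFS filesystem typically reports inode 3
  ls -di /tank /tank/somefs      # prints something like: 3 /tank   3 /tank/somefs

  # compare with a UFS mount, whose root directory is inode 2
  ls -di /ufs-mount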
Re: [zfs-discuss] fs root inode number?
"Richard L. Hamilton" wrote: > Typically on most filesystems, the inode number of the root > directory of the filesystem is 2, 0 being unused and 1 historically > once invisible and used for bad blocks (no longer done, but kept > reserved so as not to invalidate assumptions implicit in ufsdump tapes). > > However, my observation seems to be (at least back at snv_97), the > inode number of ZFS filesystem root directories (including at the > top level of a spool) is 3, not 2. This was traditionally the lost+found inode number. > If there's any POSIX/SUS requirement for the traditional number 2, > I haven't found it. So maybe there's no reason founded in official > standards for keeping it the same. But there are bound to be programs > that make what was with other filesystems a safe assumption. POSIX only requires that ino(1) == ino(..) if you have a root directory. Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] fs root inode number?
> Typically on most filesystems, the inode number of the root directory of the
> filesystem is 2, 0 being unused and 1 historically once invisible and used
> for bad blocks (no longer done, but kept reserved so as not to invalidate
> assumptions implicit in ufsdump tapes).
>
> However, my observation seems to be (at least back at snv_97) that the inode
> number of ZFS filesystem root directories (including at the top level of a
> pool) is 3, not 2.

Buggy programs may have all kinds of bad assumptions built in; this problem
isn't new: the root filesystem of a zone is typically just a plain directory
in a ufs filesystem. I seem to remember that flexlm wanted the root to be an
actual root directory (so you can run only one copy). They didn't realize that
faking the hostid is just too simple.

Casper
[zfs-discuss] fs root inode number?
Typically on most filesystems, the inode number of the root directory of the
filesystem is 2, 0 being unused and 1 historically once invisible and used for
bad blocks (no longer done, but kept reserved so as not to invalidate
assumptions implicit in ufsdump tapes).

However, my observation seems to be (at least back at snv_97) that the inode
number of ZFS filesystem root directories (including at the top level of a
pool) is 3, not 2.

If there's any POSIX/SUS requirement for the traditional number 2, I haven't
found it. So maybe there's no reason founded in official standards for keeping
it the same. But there are bound to be programs that make what was, with other
filesystems, a safe assumption. Perhaps a warning is in order, if there isn't
already one.

Is there some _reason_ why the inode number of filesystem root directories in
ZFS is 3 rather than 2?
--
This message posted from opensolaris.org
Re: [zfs-discuss] [osol-discuss] zfs send/receive?
> From: Richard Elling [mailto:richard.ell...@gmail.com]
>
> It is relatively easy to find the latest, common snapshot on two file systems.
> Once you know the latest, common snapshot, you can send the incrementals
> up to the latest.

I've always relied on the snapshot names matching. Is there a way to find the
latest common snapshot if the names don't match?
Re: [zfs-discuss] non-ECC Systems and ZFS for home users
On 25 Sep 2010, at 19:56, Giovanni Tirloni wrote:

> We have correctable memory errors on ECC systems on a monthly basis. It's not
> if they'll happen but how often.

"DRAM Errors in the Wild: A Large-Scale Field Study" is worth a read if you
have time:
http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

Alex (@alblue on Twitter)
Re: [zfs-discuss] [osol-discuss] zfs send/receive?
> hi all
>
> I'm using a custom snapshot scheme which snapshots every hour, day, week and
> month, rotating 24h, 7d, 4w and so on. What would be the best way to zfs
> send/receive these things? I'm a little confused about how this works for
> delta updates...
>
> Vennlige hilsener / Best regards

The initial backup should look like this:

  zfs snapshot -r export@backup-2010-07-12
  zfs send -R export@backup-2010-07-12 | zfs receive -F -u -d portable/export

(portable is a "portable" pool; the export filesystem needs to exist; I use
one zpool to receive different zpools, each in their own directory)

An incremental backup:

  zfs snapshot -r export@backup-2010-07-13
  zfs send -R -I export@backup-2010-07-12 export@backup-2010-07-13 | zfs receive -v -u -d -F portable/export

You need to make sure you keep the last backup snapshot; when receiving the
incremental backup, destroyed filesystems and snapshots are also destroyed in
the backup. Typically, I remove some of the snapshots *after* the backup; they
are only destroyed during the next backup.

I did notice that send/receive gets confused when older snapshots are
destroyed by time-slider during the backup.

Casper
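This procedure lends itself to a small wrapper script that remembers the name
of the last snapshot it sent, so each run only transfers the delta. A minimal
sketch, assuming the same export and portable/export datasets as above and a
hypothetical state file under /var/tmp:

  #!/bin/ksh
  # incremental replication of 'export' into 'portable/export' (sketch only)
  STATE=/var/tmp/last-backup-snap          # holds the name of the last snapshot sent
  NEW=backup-$(date +%Y-%m-%d)
  PREV=$(cat $STATE 2>/dev/null)

  zfs snapshot -r export@$NEW
  if [ -n "$PREV" ]; then
      # send everything between the previous backup snapshot and the new one
      zfs send -R -I export@$PREV export@$NEW | zfs receive -v -u -d -F portable/export
  else
      # no state file yet: full initial replication
      zfs send -R export@$NEW | zfs receive -F -u -d portable/export
  fi && echo $NEW > $STATE

The state file is only updated when the receive succeeds, so a failed run
retries from the same base snapshot; snapshot pruning should still happen only
after the backup, as noted above.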