Re: [zfs-discuss] Performance drop during scrub?
> On Sun, 2 May 2010, Dave Pooser wrote:
>> If my system is going to fail under the stress of a scrub, it's going
>> to fail under the stress of a resilver. From my perspective, I'm not
>> as scared
>
> I don't disagree with any of the opinions you stated except to point
> out that resilver will usually hit the (old) hardware less severely
> than scrub. Resilver does not have to access any of the redundant
> copies of data or metadata, unless they are the only remaining good
> copy.
>
> Bob

Adding the perspective that scrub could consume my hard disks' service life may sound like a really good reason to avoid scrub on my system as far as possible, and thus avoid experiencing the performance issues in the first place. I just don't buy it. Sorry. It's too far-fetched. I'd still prefer if the original issue could be fixed.

Regards,

Tonmaus
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance drop during scrub?
Hi Bob,

> It is necessary to look at all the factors which might result in data
> loss before deciding what the most effective steps are to minimize
> the probability of loss.
>
> Bob

I am under the impression that exactly those considerations led both the ZFS designers to implement a scrub function and the author of the Best Practices guide to recommend performing it frequently. I hear you coming to a different conclusion, and I would be interested in learning what could possibly be so open to interpretation in this.

Regards,

Tonmaus
Re: [zfs-discuss] Performance drop during scrub?
> In my opinion periodic scrubs are most useful for pools based on
> mirrors, or raidz1, and much less useful for pools based on raidz2 or
> raidz3. It is useful to run a scrub at least once on a well-populated
> new pool in order to validate the hardware and OS, but otherwise, the
> scrub is most useful for discovering bit-rot in singly-redundant
> pools.
>
> Bob

Hi,

for one, well-populated pools are rarely new. Second, the Best Practices recommendations on scrubbing intervals are based on disk product line (enterprise monthly vs. consumer weekly), not on redundancy level or pool configuration. Obviously, the issue under discussion affects all imaginable configurations, though it may vary in degree. Recommending not to use scrub doesn't even qualify as a workaround, in my regard.

Regards,

Tonmaus
Re: [zfs-discuss] Performance drop during scrub?
Hi Eric,

> While there may be some possible optimizations, i'm sure everyone
> would love the random performance of mirror vdevs, combined with the
> redundancy of raidz3 and the space of a raidz1. However, as in all
> systems, there are tradeoffs.

I think we all agree that the topic here is scrub trade-offs, specifically. My question is whether manageability of the pool, and that includes periodic scrubs, is a trade-off as well. It would be very bad news if it were. Maintenance functions should be practicable on any supported configuration, if possible.

> You can choose to bias your workloads so that foreground IO takes
> priority over scrub, but then you've got the cases where people
> complain that their scrub takes too long. There may be knobs for
> individuals to use, but I don't think overall there's a magic answer.

The priority balance only works as long as the I/O is within ZFS. As soon as the request is in the pipe of the controller/disk, no further bias will occur, as that subsystem is agnostic to ZFS rules. This is where Richard's answer (just above, if you read this from Jive) kicks in. This leads to the pool being basically not operational, from a production point of view, during a scrub pass. From that perspective, any scrub pass exceeding a periodically acceptable service window is "too long". In such a situation, a "pause" option for resuming scrub passes in the next service window might help. The advantage: such an option would be usable on any hardware.

Regards,

Tonmaus
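For what such a "pause" option could look like: later OpenZFS releases did add `zpool scrub -p` (pause, with resume on the next `zpool scrub`); 2010-era builds only offer `zpool scrub -s`, which cancels outright. A hypothetical crontab confining scrub to a weekend service window (the pool name "tank" is an assumption) might read:

```shell
# Hypothetical service-window schedule; requires a scrub implementation
# with a pause option (OpenZFS "zpool scrub -p"; NOT available on
# 2010-era OpenSolaris builds, where "-s" cancels instead of pausing).
# m  h  dom mon dow  command
0 22  *   *   5     /usr/sbin/zpool scrub tank      # Fri 22:00: start or resume
0  6  *   *   1     /usr/sbin/zpool scrub -p tank   # Mon 06:00: pause
```

On OpenZFS, issuing `zpool scrub` against a paused scrub resumes it where it left off, so the same Friday entry serves both purposes.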
Re: [zfs-discuss] Performance drop during scrub?
> Zfs scrub needs to access all written data on all disks and is usually
> disk-seek or disk I/O bound so it is difficult to keep it from hogging
> the disk resources. A pool based on mirror devices will behave much
> more nicely while being scrubbed than one based on RAIDz2.

Experience seconded entirely. I'd like to repeat that I think we need more efficient load-balancing functions in order to keep the housekeeping payload manageable. Detrimental side effects of scrub should not be a decision point for choosing certain hardware or redundancy concepts, in my opinion.

Regards,

Tonmaus
Re: [zfs-discuss] Help:Is zfs-fuse's performance is not good
I wonder if this is the right place to ask, as the Filesystem in Userspace (FUSE) implementation is a separate project. In Solaris, ZFS runs in the kernel. FUSE implementations are slow, no doubt; the same goes for other FUSE filesystems, such as the one for NTFS.

Regards,

Tonmaus
Re: [zfs-discuss] Oracle to no longer support ZFS on OpenSolaris?
I don't see the netiquette issue you are seeing, as I am talking about nothing but an issue in a post on this forum. Why should I contact the OP off the record about this? There is no need to read intentions into it either. I just made clear once more what is obvious from the board metadata anyhow. Besides, if we are having a dispute about netiquette, that highlights the potential substance of the topic more than anything else.

Regards,

Tonmaus
Re: [zfs-discuss] Oracle to no longer support ZFS on OpenSolaris?
> [...] you talking about and to whom were you responding?

My intention was to respond to the OP which, I guess from what I am seeing in the Jive forum, happened as well. Indeed, my concern was the broken link in the first post, which would be simple to fix if intended. That not being the case increases the smell of FUD.

-Tonmaus
Re: [zfs-discuss] Oracle to no longer support ZFS on OpenSolaris?
Why don't you just fix the apparently broken link to your source, then?

Regards,

Tonmaus
Re: [zfs-discuss] Setting up ZFS on AHCI disks
Your adapter read-outs look quite different from mine. I am on ICH-9, snv_133; maybe that's why. But I thought I should ask on this occasion:

- which build?
- do the drives currently run in SATA-2 mode (by model, by jumper settings)?
- could it be that the Areca controller has done something to them partition-wise?

Regards,

Tonmaus
Re: [zfs-discuss] Setting up ZFS on AHCI disks
Hi,

are the drives properly configured in cfgadm?

Cheers,

Tonmaus
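For reference, a sketch of what to look for in the `cfgadm -al` listing. The sample output and port names below are made up for illustration, not taken from the thread:

```shell
# Flag any SATA port whose disk is connected but not configured.
# (Sample output is hypothetical; on a live system pipe the output of
# `cfgadm -al` into the awk instead of using this variable.)
cfgadm_sample='sata0/0::dsk/c7t0d0  disk  connected  configured    ok
sata0/1                             disk  connected  unconfigured  unknown'

echo "$cfgadm_sample" | awk '$4 == "unconfigured" {
  print "needs: cfgadm -c configure " $1
}'
```

A drive stuck in the unconfigured state will not show up to ZFS at all, so this is worth checking before suspecting the pool.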
Re: [zfs-discuss] Why would zfs have too many errors when underlying raid array is fine?
My understanding of "pass-through disk" from the Areca documentation is that single drives are exempted from the RAID controller regime and the port behaves just like a plain HBA port. Now, on my Areca controller (r.i.p.) that mode always created the biggest havoc with ZFS/OpenSolaris, including zpool states just like yours. That was on an older firmware, though. 12x RAID-0 was only marginally better than pass-through. What I maybe did not mention is that we tried Ubuntu/dmraid on the same hardware for an afternoon, but there the initialisation of the RAID crashed with a reproducible kernel panic. I think I mentioned it before: the only thing that worked decently was putting the whole controller in JBOD mode. Yes, it is an expensive way of providing a bunch of SATA ports. In my case it wasn't that bad, as I got an 1170 for approx. 400 euros, but that was still too expensive given the performance under ZFS, so I traded it back for a full refund and bought a pair of LSIs instead.

Regards,

Tonmaus
Re: [zfs-discuss] Which build is the most stable, mainly for NAS (zfs)?
> > I would be really interested how you got past this
> >
> > http://defect.opensolaris.org/bz/show_bug.cgi?id=11371
> >
> > which I was so badly bitten by that I considered giving up on
> > OpenSolaris.
>
> I don't get random hangs in normal use; so I haven't done anything to
> "get past" this.
>
> I DO get hangs when funny stuff goes on, which may well be related to
> that problem (at least they require a reboot). Hmmm; I get hangs
> sometimes when trying to send a full replication stream to an external
> backup drive, and I have to reboot to recover from them. I can live
> with this, in the short term. But now I'm feeling hopeful that they're
> fixed in what I'm likely to be upgrading to next.

That suggests the only difference was probably the amount of data transferred on your system and mine. We are working with media files here, each multiple gigabytes, hence the varying mileage, I assume. As far as 2010.x is concerned, my expectations come from past experience with the last release. I will test 2010 maybe even more rigorously before I jump to it. "Technical" stability, as you put it before, is basically the same for dev and release builds, from both a phenomenon and a consequence perspective, in an OpenSolaris environment.

Regards,

Tonmaus
Re: [zfs-discuss] Which build is the most stable, mainly for NAS (zfs)?
Eight hot-swap bays is not too much. The rest looks like a cakewalk for OpenSolaris. But with this hardware you can't go for 2009.06 anyhow, as ICH-10 won't be recognized (I tried this on X58). I have a 2U enclosure as well (12-bay), but I'd opt for at least 3U next time, as there are too many restrictions for low-profile add-in cards, let alone bays, bays, bays...

Tonmaus
Re: [zfs-discuss] Which build is the most stable, mainly for NAS (zfs)?
> > On Wed, April 14, 2010 08:52, Tonmaus wrote:
> > safe to say: 2009.06 (b111) is unusable for the purpose, and CIFS is
> > dead in this build.
>
> That's strange; I run it every day (my home Windows "My Documents"
> folder and all my photos are on 2009.06).
>
> -bash-3.2$ cat /etc/release
>                    OpenSolaris 2009.06 snv_111b X86
>        Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
>                     Use is subject to license terms.
>                         Assembled 07 May 2009

I would be really interested how you got past this

http://defect.opensolaris.org/bz/show_bug.cgi?id=11371

which I was so badly bitten by that I considered giving up on OpenSolaris.

> > not sure if this is best choice. I'd like to hear from others as
> > well.
>
> Well, it's technically not a stable build.
>
> I'm holding off to see what 2010.$Spring ends up being; I'll convert
> to that unless it turns into a disaster.
>
> Is it possible to switch to b132 now, for example? I don't think the
> old builds are available after the next one comes out; I haven't been
> able to find them.

There are methods to upgrade to any dev build by pkg. I can't tell you off the top of my head, but I have done it with success. I wouldn't know why to go to b132 instead of b133, though. b129 seems to be an option.

Regards,

Tonmaus
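As far as I remember the pkg recipe (the repository URL and publisher name are as they stood in 2010; verify them before use):

```shell
# Sketch: move a 2009.06 image to the dev repository and update to the
# newest available dev build. URL and publisher name as of 2010 and may
# have changed; check them first.
pfexec pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
pfexec pkg image-update -v
# then reboot into the newly created boot environment
```

The update lands in a fresh boot environment, so the old build stays available from the GRUB menu if the new one misbehaves.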
Re: [zfs-discuss] casesensitivity mixed and CIFS
Was b130 also the version that created the dataset?

-Tonmaus
Re: [zfs-discuss] Which build is the most stable, mainly for NAS (zfs)?
Safe to say: 2009.06 (b111) is unusable for the purpose, and CIFS is dead in this build. I am using b133, but I am not sure if it is the best choice. I'd like to hear from others as well.

-Tonmaus
Re: [zfs-discuss] Why would zfs have too many errors when underlying raid array is fine?
Upgrading the firmware is a good idea, as there are other issues with Areca controllers that have only been solved recently. E.g., 1.42 is probably still affected by a problem with SCSI labels that may cause trouble importing a pool.

-Tonmaus
Re: [zfs-discuss] Why would zfs have too many errors when underlying raid array is fine?
Hi,

> I started off by setting up all the disks to be pass-through disks,
> and tried to make a raidz2 array using all the disks. It would work
> for a while, then suddenly every disk in the array would have too many
> errors and the system would fail.

I had exactly the same experience with my Areca controller. Actually, I couldn't get it to work unless I put the whole controller in JBOD mode. Neither 12 single-disk RAID-0 arrays nor pass-through was workable. I had kernel panics and pool corruption all over the place, sometimes with, sometimes without additional corruption messages from the Areca panel. I am not sure if this relates to the rest of your problem, though.

Regards,

Tonmaus
Re: [zfs-discuss] Areca ARC-1680 on OpenSolaris 2009.06?
That was a while back, when I was shopping for my own HBAs. There were compatibility warnings all over the place with some Adaptec controllers and LSI SAS expanders. AFAIK, even the LSI 106x need to be operated in IT mode to work properly with SAS expanders. IT mode disables all RAID functions of the 106x.

Regards,

Tonmaus
Re: [zfs-discuss] Areca ARC-1680 on OpenSolaris 2009.06?
As far as I have read, that problem has been reported to be a compatibility problem between the Adaptec controller and the expander chipset, e.g. the LSI SASx family, which is also on the mentioned Chenbro expander. There is no problem with the 106x chipset and SAS expanders that I know of. For people sceptical about expanders: quite a couple of the Areca cards actually have expander chips on board. I don't know about the 1680 specifically, though.

Cheers,

Tonmaus
Re: [zfs-discuss] Areca ARC-1680 on OpenSolaris 2009.06?
Hi David,

why not just use a couple of SAS expanders?

Regards,

Tonmaus
Re: [zfs-discuss] How to destroy iscsi dataset?
Hi,

even though you didn't say so explicitly below (both the Comstar and the legacy target services are inactive), I assume that you have been using Comstar, right? In that case, the questions are:

- is there still a view on the targets? (check stmfadm)
- is there still a LU mapped? (check sbdadm)

Cheers,

Tonmaus
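A sketch of those checks for a Comstar setup, not runnable as-is: `<GUID>` is a placeholder for the logical unit GUID printed by sbdadm, and `tank/myvolume` is a made-up dataset name. While a LU still backs the zvol, `zfs destroy` fails with "dataset is busy".

```shell
# Non-executable sketch; substitute the GUID and dataset from your system.
sbdadm list-lu                    # is a logical unit still backed by the zvol?
stmfadm list-view -l <GUID>       # are views still mapping that LU to hosts?
stmfadm remove-view -l <GUID> -a  # drop all view entries for the LU
stmfadm delete-lu <GUID>          # remove the LU itself
zfs destroy tank/myvolume         # should now succeed
```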
Re: [zfs-discuss] What about this status report
Both are driver modules for storage adapters. Their properties can be reviewed in the documentation:

ahci: http://docs.sun.com/app/docs/doc/816-5177/ahci-7d?a=view
mpt: http://docs.sun.com/app/docs/doc/816-5177/mpt-7d?a=view

ahci has a man page on b133 as well.

Cheers,

Tonmaus
Re: [zfs-discuss] What about this status report
Yes, basically working here. All fine under ahci; some problems under mpt (smartctl reports that the WD1002FBYS doesn't allow storing SMART events, which I think is probably nonsense).

Regards,

Tonmaus
Re: [zfs-discuss] Usage of hot spares and hardware allocation capabilities.
[...] save. Bottom line: you will have to find out.

As far as the "warning" is concerned: migrating a whole pool is not the same thing as swapping slots within a pool. I.e., if you pull more disks than the redundancy allows from your pool at the same time while the pool is hot, you will simply destroy the pool.

Regards,

Tonmaus
Re: [zfs-discuss] Usage of hot spares and hardware allocation capabilities.
> So, is there a sleep/hibernation/standby mode that the hot spares
> operate in or are they on all the time regardless of whether they are
> in use or not?

This depends on the power-save options of your hardware, not on ZFS. Arguably, there is less wear on the heads of a hot spare. I guess that many modern disks will park the heads after a certain time, or even spin down, unless the controller prevents that. The question is whether the disk comes back fast enough when required; your bets are on the controller supporting that properly. As it seems, there is little focus on this matter at Sun and among community members. At least my own investigation into how to best use power-save options, like most SoHo NAS boxes offer, returned only dire results.

> Usually the hot spare is on a not so well-performing SAS/SATA
> controller,

There is no room for "not so well-performing" controllers in my servers. I would not waste PCIe slots or backplanes on anything that doesn't live up to spec (my requirements). That being said, plain JBOD HBAs are the controllers that perform best with ZFS, and those happen to be not very expensive. Additionally, I avoid a checkerboard of components, striving to keep things as simple as possible.

> To be more general; are the hard drives in the pool "hard coded" to
> their SAS/SATA channels or can I swap their connections arbitrarily if
> I would want to do that? Will zfs automatically identify the
> association of each drive of a given pool or tank and automatically
> reallocate them to put the pool/tank/filesystem back in place?

This works very well, given that your controller properly supports it. I tried it on an Areca 1170 a couple of weeks ago, with interesting results that turned out to be an Areca firmware flaw. You may find the thread on this list. I would recommend that you do such tests when implementing your array, before going into production with it. Analogous aspects apply when you:

- hot-swap drives
- use S.M.A.R.T.
- replace failing components or change the configuration
- transfer a whole array to another host

(the list is not comprehensive). I think at this moment you have two choices to be sure that all "advertised" ZFS features will be available in your system:

- learn it the hard way, by trial and error
- use Sun hardware, or another turnkey solution that offers ZFS, such as NexentaStor

A popular approach is following along the rails of what is being used by Sun, a prominent example being the LSI 106x SAS HBAs in "IT" mode.

Regards,

Tonmaus
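The safe sequence for the reshuffling question, as a sketch (pool name "tank" is an assumption; not runnable outside a live system). ZFS identifies pool members by the labels written on the disks, not by the controller port, so an exported pool can be recabled freely:

```shell
# Non-executable sketch: moving member disks to different controller ports.
zpool export tank
# power down, recable/reorder the drives as desired, power back up
zpool import tank     # members are re-discovered under their new device names
zpool status tank     # all vdevs should report ONLINE
```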
Re: [zfs-discuss] How to manage scrub priority or defer scrub?
> > > > sata disks don't understand the prioritisation, so
>
> Er, the point was exactly that there is no discrimination, once the
> request is handed to the disk.

So, are you saying that SCSI drives do understand prioritisation (i.e. TCQ supports the schedule from ZFS) while SATA/NCQ drives don't, or does it just boil down to what Richard told us, SATA disks being too slow?

> If the internal-to-disk queue is enough to keep the heads saturated /
> seek bound, then a new high-priority-in-the-kernel request will get to
> the disk sooner, but may languish once there.

Thanks. That makes sense to me.

> You can shorten the number of outstanding IO's per vdev for the pool
> overall, or preferably the number scrub will generate (to avoid
> penalising all IO).

That sounds like a meaningful approach to addressing bottlenecks caused by zpool scrub.

> The tunables for each of these should be found readily, probably in
> the Evil Tuning Guide.

I think I should try to digest the Evil Tuning Guide with respect to this topic. Thanks for pointing me in a direction. Maybe what you have suggested above (shortening the number of I/Os issued by scrub) is already possible? If not, I think it would be a meaningful improvement to request.

> Disks with write cache effectively do this [command queueing] for
> writes, by pretending they complete immediately, but reads would block
> the channel until satisfied. (This is all for ATA which lacked this,
> before NCQ. SCSI has had these capabilities for a long time).

As scrub is about reads, are you saying that this is still a problem with SATA/NCQ drives, or not? I am unsure what you mean at this point.

> > > limiting the number of concurrent IO's handed to the disk to try
> > > and avoid saturating the heads.
> >
> > Indeed, that was what I had in mind. With the addition that I think
> > it is as well necessary to avoid saturating other components, such
> > as CPU.
>
> Less important, since prioritisation can be applied there too, but
> potentially also an issue. Perhaps you want to keep the cpu fan
> speed/noise down for a home server, even if the scrub runs longer.

Well, the only thing that was really remarkable while scrubbing was CPU load constantly near 100%. I still think that is at least contributing to the collapse of concurrent payload. I.e., it's all about services that run in the kernel: CIFS, ZFS, iSCSI. Mostly, it is about concurrent load within ZFS itself. That means an implicit trade-off while a file is being served over CIFS, for example.

> AHCI should be fine. In practice if you see actv > 1 (with a small
> margin for sampling error) then ncq is working.

OK, and how is that with respect to mpt? My assertion that mpt will support NCQ is mainly based on the marketing information provided by LSI, that these controllers offer NCQ support with SATA drives. How (by which tool) do I get to this "actv" parameter?

Regards,

Tonmaus
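For reference, actv is the sixth column of the extended device statistics printed by `iostat -xn`. A sketch of pulling it out; the sample line below (device name and all numbers) is made up:

```shell
# actv = average number of commands active on the device. With NCQ
# working you should see actv > 1 under load (beyond sampling error).
# The sample line is hypothetical; on a live system use: iostat -xn 1
iostat_line='  212.0    0.0 13568.0    0.0  0.0  4.7    0.0   22.1   0  98 c7t2d0'

actv=$(echo "$iostat_line" | awk '{ print $6 }')
dev=$(echo "$iostat_line" | awk '{ print $11 }')
echo "actv on $dev: $actv"
```

The column order (r/s, w/s, kr/s, kw/s, wait, actv, wsvc_t, asvc_t, %w, %b, device) is the standard `-xn` layout; an actv stuck at or below 1.0 under heavy load suggests the drive is handling one command at a time.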
Re: [zfs-discuss] How to manage scrub priority or defer scrub?
Hello Dan,

thank you very much for this interesting reply.

> roughly speaking, reading through the filesystem does the least work
> possible to return the data. A scrub does the most work possible to
> check the disks (and returns none of the data).

Thanks for the clarification. That's what I had thought.

> For the OP: scrub issues low-priority IO (and the details of how much
> and how low have changed a few times along the version trail).

Is there any documentation about this, besides the source code?

> However, that prioritisation applies only within the kernel; sata
> disks don't understand the prioritisation, so once the requests are
> with the disk they can still saturate out other IOs that made it to
> the front of the kernel's queue faster.

I am not sure what you are hinting at. I initially thought about TCQ vs. NCQ when I read this, but I am not sure which detail of TCQ would allow for I/O discrimination that NCQ doesn't have. All I know about command queueing is that it is about optimising DMA strategies and the handling of the currently issued I/O requests, with respect to what to serve first in order to return all data in the least possible time. (??)

> If you're looking for something to tune, you may want to look at
> limiting the number of concurrent IO's handed to the disk to try and
> avoid saturating the heads.

Indeed, that was what I had in mind. With the addition that I think it is also necessary to avoid saturating other components, such as the CPU.

> You also want to confirm that your disks are on an NCQ-capable
> controller (eg sata rather than cmdk) otherwise they will be severely
> limited to processing one request at a time, at least for reads if you
> have write-cache on (they will be saturated at the stop-and-wait
> channel, long before the heads).

I have two systems here: a production system that is on LSI SAS (mpt) controllers, and another one that is on ICH-9 (ahci). The disks are SATA-2. The plan was that this combo would have NCQ support. On the other hand, do you know of a method to verify that it is functioning?

Best regards,

Tonmaus
Re: [zfs-discuss] How to manage scrub priority or defer scrub?
> > On that occasion: does anybody know if ZFS reads all parities during
> > a scrub?
>
> Yes
>
> > Wouldn't it be sufficient for stale corruption detection to read
> > only one parity set unless an error occurs there?
>
> No, because the parity itself is not verified.

Aha. Well, my understanding was that a scrub basically means reading all data and comparing it with the parities, which means that these have to be re-computed. Is that correct?

Regards,

Tonmaus
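To illustrate the point in shell arithmetic: a toy single-parity stripe, where scrub has to re-read everything and recompute, because rot in the parity block itself would go unnoticed if only the data were read back. (Real raidz parity and block checksumming are considerably more involved; this is only a sketch of the principle.)

```shell
# Toy single-parity illustration: three "data blocks" and their XOR parity.
d0=42; d1=17; d2=99
p=$(( d0 ^ d1 ^ d2 ))          # parity as written at allocation time

# Scrub: read data AND parity, recompute, compare.
recomputed=$(( d0 ^ d1 ^ d2 ))
if [ "$recomputed" -eq "$p" ]; then
  echo "stripe verified"
else
  echo "checksum error in stripe"
fi
```

Reading only one copy of the data would confirm the data, but a silently rotten `p` would then surface only during a reconstruction, which is exactly when it is most unwelcome.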
Re: [zfs-discuss] How to manage scrub priority or defer scrub?
Hi,

I got a message from you off-list that doesn't show up in the thread even after hours. As you mentioned the aspect here as well, I'll respond from here:

> Third, as for ZFS scrub prioritization, Richard answered your question
> about that. He said it is low priority and can be tuned lower.
> However, he was answering within the context of an 11 disk RAIDZ2 with
> slow disks

His exact words were:

> This could be tuned lower, but your storage is slow and *any* I/O
> activity will be noticed.

Richard told us two times that scrub already is as low in priority as can be. From another message: "Scrub is already the lowest priority. Would you like it to be lower?"

As much for the comparison between "slow" and "fast" storage: I have understood Richard's message to be that with storage providing better random I/O, ZFS priority scheduling will perform significantly better, causing less degradation of concurrent load. While I am even inclined to buy that, nobody will be able to tell me how a certain system will behave until it is tested, nor to what degree concurrent scrubbing will still be possible.

Another thing: people are talking a lot about narrow vdevs and mirrors. However, when you need to build a 200 TB pool, you end up with a lot of disks in the first place. You will need at least double failure resilience for such a pool. If one were to do that with mirrors, ending up with approx. 600 TB gross to provide 200 TB net capacity is definitely NOT an option.

Regards,

Tonmaus
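The capacity argument above, made concrete in shell arithmetic (the 200 TB net figure is from the post; the 10-wide raidz2 stripe is an assumed example layout):

```shell
# Gross capacity needed for 200 TB net with double-failure tolerance:
# 3-way mirrors vs. a pool of 10-wide raidz2 vdevs (8 data + 2 parity).
net=200
mirror_gross=$(( net * 3 ))        # every block stored three times
raidz2_gross=$(( net * 10 / 8 ))   # 10 disks carry 8 disks' worth of data
echo "3-way mirrors:  ${mirror_gross} TB gross"
echo "raidz2 10-wide: ${raidz2_gross} TB gross"
```

That is 600 TB versus 250 TB of raw disk for the same usable space, which is the core of the objection to mirror-only advice at this scale.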
Re: [zfs-discuss] How to manage scrub priority or defer scrub?
> Are you sure that you didn't also enable something which does consume
> lots of CPU such as enabling some sort of compression, sha256
> checksums, or deduplication?

None of them is active on that pool or in any existing file system. Maybe the issue is particular to RAIDZ2, which is comparatively recent. On that occasion: does anybody know if ZFS reads all parities during a scrub? Wouldn't it be sufficient for stale corruption detection to read only one parity set unless an error occurs there?

> The main concern that one should have is I/O bandwidth rather than CPU
> consumption since "software" based RAID must handle the work using the
> system's CPU rather than expecting it to be done by some other CPU.
> There are more I/Os and (in the case of mirroring) more data
> transferred.

What I am trying to say is that the CPU may become the bottleneck for I/O in the case of parity-secured stripe sets. Mirrors and simple stripe sets have almost zero impact on the CPU; such are at least my observations so far. Moreover, x86 processors are not optimized for that kind of work as much as, say, an Areca controller with a dedicated XOR chip is, in its targeted field.

Regards,

Tonmaus
Re: [zfs-discuss] Posible newbie question about space between zpool and zfs file systems
The reason why there is not more uproar is that cost per data unit is dwindling while the gap resulting from this marketing trick is increasing. I remember a case a German broadcaster filed against a system integrator in the age of the 4 GB SCSI drive. This was in the mid-90s.

Regards,

Tonmaus
Re: [zfs-discuss] How to manage scrub priority or defer scrub?
> If CPU is maxed out then that usually indicates some severe problem
> with choice of hardware or a misbehaving device driver. Modern systems
> have an abundance of CPU.

AFAICS the CPU load is only high while scrubbing a double-parity pool. I have no indication of technical misbehaviour, with the exception of the dismal concurrent performance. What I can't get past is the notion that even if I *had* a storage configuration with 20 times more I/O capacity, it would still max out any CPU I could buy, even more so than the single L5410 I am currently running. I have seen CPU performance being a pain point on every "software"-based array I have used so far. From SoHo NAS boxes (the usual Thecus stuff) to NetApp 3200 filers, all showed a notable performance drop once parity configurations were employed. The performance of the L5410 is abundant for the typical operation of my system, by the way: it can easily saturate the dual 1000 Mbit NICs for iSCSI and CIFS services. I am slightly reluctant to buy a second L5410 just to provide more headroom during maintenance operations, as the device would be idle otherwise, consuming power.

Regards,

Tonmaus
Re: [zfs-discuss] How to manage scrub priority or defer scrub?
Hello, > In following this discussion, I get the feeling that > you and Richard are somewhat talking past each > other. Talking past each other is a problem I have noted and remarked on earlier. I have to admit I got frustrated about the discussion narrowing down to a certain perspective that was quite the opposite of my own observations and of what I had initially described. It may be that I have been harsher than I should have been. Please accept my apology. I was trying from the outset to obtain a perspective on the matter that is independent of an actual configuration. I firmly believe that the scrub function is more meaningful if it can be applied in a variety of implementations. I think, however, that the insight that there seem to be no specific scrub management functions is transferable from a commodity implementation to an enterprise configuration. Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Posible newbie question about space between zpool and zfs file systems
> Has there been a consideration by anyone to do a > class-action lawsuit > for false advertising on this? I know they now have > to include the "1GB > = 1,000,000,000 bytes" thing in their specs and > somewhere on the box, > but just because I say "1 L = 0.9 metric liters" > somewhere on the box, > it shouldn't mean that I should be able to avertise > in huge letters "2 L > bottle of Coke" on the outside of the package... If I am not completely mistaken, 1,000^n/1,024^n converges toward 0 as n goes to infinity. That is certainly an unwarranted windfall from Kryder's law for very large storage devices. Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
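[Editorial aside: the convergence claim is easy to check numerically. The fraction of a marketed base-10 unit that the OS reports in binary units is (1000/1024)^n, which shrinks with every SI prefix step:]

```shell
# (1000/1024)^n: fraction of a marketed "unit" the OS actually reports.
# n=1 -> kB vs KiB, n=2 -> MB vs MiB, n=3 -> GB vs GiB, n=4 -> TB vs TiB.
for n in 1 2 3 4; do
  awk -v n="$n" 'BEGIN { printf "n=%d  ratio=%.4f\n", n, (1000/1024)^n }'
done
# The gap grows from roughly 2.3% at the kB level to roughly 9% at the
# TB level - hence more "missing" space on every new drive generation.
```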
Re: [zfs-discuss] How to manage scrub priority or defer scrub?
Hi Richard, > > - scrubbing the same pool, configured as raidz1 > didn't max out CPU which is no surprise (haha, slow > storage...) the notable part is that it didn't slow > down payload that much either. > > raidz creates more, smaller writes than a mirror or > simple stripe. If the disks are slow, > then the IOPS will be lower and the scrub takes > longer, but the I/O scheduler can > manage the queue better (disks are slower). This wasn't mirror vs. raidz but raidz1 vs. raidz2, where the latter maxes out the CPU and the former maxes out physical disk I/O. Concurrent payload degradation isn't that extreme on raidz1 pools, as it seems. Hence the CPU theory, which you still seem reluctant to follow. > There are several > bugs/RFEs along these lines, something like: > http://bugs.opensolaris.org/bugdatabase/view_bug.do?bu > g_id=6743992 Thanks for pointing this out. As it seems, it has been a problem for a couple of years already. Obviously the opinion is shared that this is a management problem, not a hardware issue. As a Project Manager I will soon have to take a purchase decision for an archival storage system (A/V media), and one of the options we are looking into is a SAMFS/QFS solution including tiers on disk with ZFS. I will have to make up my mind whether the pool sizes we are looking into (typically we will need 150-200 TB) are really manageable under the current circumstances once we include zfs scrub in the picture. From what I have learned here, it rather looks as if there will be an extra challenge, if not a problem, for the system integrator. That's unfortunate. Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Posible newbie question about space between zpool and zfs file systems
> My guess is unit conversion and rounding. Your pool > has 11 base 10 TB, > which is 10.2445 base 2 TiB. > > Likewise your fs has 9 base 10 TB, which is 8.3819 > base 2 TiB. > Not quite. > > 11 x 10^12 =~ 10.004 x (1024^4). > > So, the 'zpool list' is right on, at "10T" available. Duh! I completely forgot about this. Thanks for the heads-up. Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] corruption of ZFS on iScsi storage
> Being an iscsi > target, this volume was mounted as a single iscsi > disk from the solaris host, and prepared as a zfs > pool consisting of this single iscsi target. ZFS best > practices, tell me that to be safe in case of > corruption, pools should always be mirrors or raidz > on 2 or more disks. In this case, I considered all > safe, because the mirror and raid was managed by the > storage machine. As far as I understand the Best Practices guide, redundancy needs to be within ZFS in order to provide full protection. So Best Practices actually says that your scenario is rather one to be avoided. Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Posible newbie question about space between zpool and zfs file systems
Hi Cindy, trying to reproduce this > For a RAIDZ pool, the zpool list command identifies > the "inflated" space > for the storage pool, which is the physical available > space without an > accounting for redundancy overhead. > > The zfs list command identifies how much actual pool > space is available > to the file systems. I am lacking 1 TB on my pool:

u...@filemeister:~$ zpool list daten
NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
daten  10T   3,71T  6,29T  37%  1.00x  ONLINE  -

u...@filemeister:~$ zpool status daten
  pool: daten
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        daten         ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            c10t2d0   ONLINE       0     0     0
            c10t3d0   ONLINE       0     0     0
            c10t4d0   ONLINE       0     0     0
            c10t5d0   ONLINE       0     0     0
            c10t6d0   ONLINE       0     0     0
            c10t7d0   ONLINE       0     0     0
            c10t8d0   ONLINE       0     0     0
            c10t9d0   ONLINE       0     0     0
            c11t18d0  ONLINE       0     0     0
            c11t19d0  ONLINE       0     0     0
            c11t20d0  ONLINE       0     0     0
        spares
          c11t21d0    AVAIL

errors: No known data errors

u...@filemeister:~$ zfs list daten
NAME   USED  AVAIL  REFER  MOUNTPOINT
daten  3,01T  4,98T  110M  /daten

I am counting 11 disks of 1 TB each in a raidz2 pool. This is 11 TB gross capacity, and 9 TB net. zpool is however stating 10 TB and zfs is stating 8 TB. The difference between net and gross is correct, but where is the capacity from the 11th disk going? Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
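[Editorial aside: the "missing" TB checks out as a pure base-10 to base-2 conversion of the marketed capacities quoted in the post above (11 and 9 drives' worth of 10^12 bytes each):]

```shell
# 11 drives x 1 "TB" (10^12 bytes each), expressed in the binary
# TiB-style units that zpool/zfs report.
gross_bytes=$((11 * 1000000000000))
awk -v b="$gross_bytes" 'BEGIN { printf "gross: %.3f TiB\n", b / (1024^4) }'
# -> gross: 10.004 TiB, which 'zpool list' rounds to 10T.

# Net after raidz2 (2 drives' worth of parity): 9 x 10^12 bytes.
awk 'BEGIN { printf "net: %.3f TiB\n", 9e12 / (1024^4) }'
# -> net: 8.185 TiB, matching the ~8T sum of USED+AVAIL from 'zfs list'.
```

So no capacity is lost; the 11th disk's bytes are simply eaten by the unit conversion.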
Re: [zfs-discuss] How to manage scrub priority or defer scrub?
Hello again, I am still not sure my points are being well taken. > If you are concerned that a > single 200TB pool would take a long > time to scrub, then use more pools and scrub in > parallel. The main concern is not scrub time. Scrub time could be weeks if scrub would just behave. You may imagine that there are applications where segmentation is a pain point, too. > The scrub will queue no more > than 10 I/Os at one time to a device, so devices which > can handle concurrent I/O > are not consumed entirely by scrub I/O. This could be > tuned lower, but your storage > is slow and *any* I/O activity will be noticed. There are a couple of things I maybe don't understand, then.
- zpool iostat is reporting more than 1k operations while scrubbing
- throughput is as high as it can be until the CPU maxes out
- the nominal I/O capacity of a single device is still around 90 IOPS; how can 10 queued I/Os already bring down the payload?
- scrubbing the same pool, configured as raidz1, didn't max out the CPU, which is no surprise (haha, slow storage...); the notable part is that it didn't slow down payload that much either.
- scrub is obviously fine with data being added or deleted during a pass. So, it should be possible to pause and resume a pass, shouldn't it?
My conclusion from these observations is that not only disk speed counts here; other bottlenecks may strike as well. Solving the issue by the wallet is one way, solving it by configuration of parameters is another. So, is there a lever for scrub I/O priority, or not? Is there a possibility to pause a scrub pass and resume it later? Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
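[Editorial aside: to my knowledge, the "10 I/Os per device" figure quoted above corresponds to a kernel tunable called zfs_scrub_limit in OpenSolaris builds of this era. The name, default, and syntax below are assumptions to verify against your own build; treat this as an untested sketch, not a recommendation:]

```shell
# CAUTION: untested sketch; tunable name (zfs_scrub_limit, assumed
# default 10) should be verified against your build before use.

# Lower the per-device scrub queue depth on the running kernel to 5:
echo "zfs_scrub_limit/W0t5" | mdb -kw

# Or make the change persistent across reboots via /etc/system:
#   set zfs:zfs_scrub_limit = 5
```

Note this only throttles scrub I/O; it does not provide the pause/resume capability asked about above.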
Re: [zfs-discuss] How to manage scrub priority or defer scrub?
Hi Richard, thanks for the answer. I think I am aware of the properties of my configuration and how it will scale. Let me please stress that this is not the point of the discussion. The target of this discussion should rather be whether scrubbing can co-exist with payload, or whether we are thrown back to scrubbing in the after-hours. So, do I have to conclude that zfs is not able to make good decisions about load prioritisation on commodity hardware, and that there are no further options available to tweak scrub load impact, or are there other options? I am thinking about managing pools with a hundred times the capacity of mine (currently there are 3,7 TB on disk, and it takes 2,5 h to scrub them on the double-parity pool) that practically would be un-scrub-able. (Yes, Enterprise HW is faster, but Enterprise service windows are much narrower as well... you can't move around or offline 200 TB of live data for days only because you need to scrub the disks, can you?) The only idea I could think of myself is to exchange individual drives in a round-robin fashion all the time and use re-silvering instead of full scrubs. But actually I don't like the idea anymore on second glance. Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to manage scrub priority or defer scrub?
Hi Richard, these are:
- 11x WD1002FBYS (7200 rpm SATA drives) in 1 raidz2 group
- 4 GB RAM
- 1 CPU (Xeon L5410)
- snv_133 (where the current array was created as well)
Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] How to manage scrub priority or defer scrub?
Dear zfs fellows, during a specific test I have got the impression that scrub may have quite an impact on other I/O. CIFS throughput is down to 7 MB/s from 100 MB/s while scrubbing on my main NAS. That is no surprise, as a scrub of my raidz2 pool maxes out the CPU on that machine (1 Xeon L5410). I am running scrubs during week-ends, so this is not a problem. I am asking myself, however, what will happen on larger pools where a scrub pass will take days to weeks. Obviously, zfs file systems are much more scalable than CPU power ever will be. Hence, I am seeing a requirement to manage scrub activity so that trade-offs can be made to maintain availability and performance of the pool. Does anybody know how this is done? Thanks in advance for any hints, Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
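[Editorial aside: the week-end scheduling mentioned above is plain cron. A minimal sketch, with the pool name "tank" as a placeholder:]

```shell
# crontab fragment (pool name "tank" is a placeholder):
# start the scrub early Saturday morning...
0 1 * * 6 /usr/sbin/zpool scrub tank
# ...and check progress/results Monday before business hours.
0 6 * * 1 /usr/sbin/zpool status tank
```

This schedules the scrub but, as the post notes, does nothing to limit its impact once it is running.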
Re: [zfs-discuss] Intel SASUC8I - worth every penny
> On Mar 11, 2010, at 10:02 PM, Tonmaus wrote: > All of the other potential disk controllers line up > ahead of it. For example, > you will see controller numbers assigned for your CD, > floppy, USB, SD, CF etc. > -- richard Hi Richard, thanks for the explanation. Actually, I started to worry about controller numbers when I installed LSI cards that were replacing an Areca 1170. The Areca took number 9, and the LSI cards started from 10. Could it be that the BIOS caches configuration data that leads to this? And what, btw, is the proper method of configuring white-box hardware to achieve more convenient readouts? Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Intel SASUC8I - worth every penny
Hi, thanks for sharing. Is your LSI card running in IT or IR mode? I had some issues getting all drives connected in IR mode which is the factory default of the LSI branded cards. I am also curious why your controller shows up as "c11". Does anybody know more about the way this is enumerated? I am having two LSI controllers, one is "c10" the other "c11". Why can't controllers count from 1? Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to verify ecc for ram is active and enabled?
> I'd really like to understand what OS does with > respect to ECC. In information technology, ECC (Error Correction Code; the Wikipedia article is worth reading) normally protects point-to-point "channels". Hence, this is entirely a "hardware" thing here. Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to verify ecc for ram is active and enabled?
> Is the nature of the scrub that it walks through > memory doing read/write/read and looking at the ECC > reply in hardware? I think ZFS has no specific mechanisms in respect to RAM integrity. It simply counts on a healthy and robust foundation in every component of the machine. As far as I understand, it's just a good idea to have ECC RAM once a certain amount of data inevitably goes through a certain path. Servers controlling PB of data are certainly a case for ECC memory, in my regard. -Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fishworks 2010Q1 and dedup bug?
Hi, so, what would be a critical test size in your opinion? Are there any other side conditions? For example, I am not using any snapshots and have also turned off automatic snapshots, because I was bitten by system hangs while destroying datasets with live snapshots. I am also aware that Fishworks probably isn't on the same code level as the current dev build. Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
Hi, > > In your case, there are two other aspects: > > - if you pool small devices as JBODS below a vdev > > member, no superordinate parity will help you when > > you loose a member of the underlying JBOD. The > whole > > pool will just be broken, and you will loose a > good > > part of your data. > > No, that's not correct. The first option of pooling > smaller disks into larger, logical devices via SVM > would allow me to theoretically lose up to > [b]eight[/b] disks while still having a live zpool > (in the case where I lose 2 logical devices comprised > of four 500GB drives each; this would only kill two > actual RAIDZ2 members). You are right. I was wrong with the JBOD observation. In the worst case the array still can't tolerate more than 2 disk failures, if all disk failures are across different 2 TB building blocks. > Using slices, I'd be able to lose up to [b]five[/b] > disks (in the case where I'd lose one 2TB disk > (affecting all four vdevs) and four 500GB disks, one > from each vdev). As a single 2 TB disk causes a failure in each group in scenario 2, the worst case there is likewise "3 disks and you are out". This circumstance reduces the options to play with grouping to not less than 4 groups with that setup. The payload for redundancy in both scenarios is consequently 4 TB (with no hot spare). Doesn't that all point to option 1 as the better choice? Its performance should be much better, since slicing the 2 TB drives would leave those members with essentially un-cached I/O, and they would dominate the rest of the array. One more thing about SVM is unclear to me: if one of the smaller disks goes, from the ZFS perspective the whole concatenated device has to be resilvered. But what will be the interactions between fixing the concat in SVM and re-silvering in ZFS? Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fishworks 2010Q1 and dedup bug?
Hi, I have tried out what dedup does on a test dataset that I filled with 372 GB of partly redundant data. I used snv_133. All in all, it was successful. The net data volume was only 120 GB. Destruction of the dataset took a while in the end, but without compromising anything else. After this successful test I am planning to use dedup productively soon. Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
Hi, the recommendations I am basing my previous idea on can be found here: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#RAIDZ_Configuration_Requirements_and_Recommendations I can confirm some of the recommendations from personal practice already. First and foremost this sentence: "The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups." One example: I am running 11+1 disks in a single group now. I have recently changed the configuration from raidz to raidz2, and scrub performance dropped from 500 MB/s to approx. 200 MB/s through the imposition of the second parity. I am sure that if I had chosen two raidz groups, performance would have been even better than with the original config, while I could still lose two drives in the pool, provided the losses did not occur within a single group. The bottom line is that as you increase the number of stripes in a group, performance, especially random I/O, will converge toward the performance of a single group member. The only reason why I am sticking with the single-group configuration myself is that performance is "good enough" for what I am doing for now, and that "11 is not so far from 9". In your case, there are two other aspects: - if you pool small devices as JBODs below a vdev member, no parity will help you when you lose a member of the underlying JBOD. - if you use slices as vdev members, performance will drop dramatically. Regards, tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
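[Editorial aside: the convergence argument can be put in rough numbers. Assuming, as a rule of thumb rather than a measurement, that one raidz group delivers about one member disk's worth of random IOPS (~90 for a 7200 rpm drive), splitting the same disks into more groups multiplies random throughput:]

```shell
# Rule of thumb (assumed, not measured): one raidz group delivers
# roughly one disk's worth of random IOPS, so random performance
# scales with the number of groups, not the number of disks.
disk_iops=90
for groups in 1 2 3 4; do
  awk -v g="$groups" -v i="$disk_iops" \
    'BEGIN { printf "%d group(s): ~%d random IOPS\n", g, g * i }'
done
```

Sequential throughput behaves differently (it does scale with disk count), which is why single wide groups can still look fast in streaming benchmarks.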
Re: [zfs-discuss] Question about multiple RAIDZ vdevs using slices on the same disk
Hi, following the ZFS Best Practices Guide, my understanding is that neither choice is very good. There is maybe a third choice, that is:

pool
  vdev 1
    disk
    disk
    ...
  vdev n
    disk
    disk
    ...

where the vdevs add up in capacity. As far as I understand, the option to use a parity-protected stripe set (i.e. raidz) sits on the vdev layer. As far as I understand, the smallest disk will limit the capacity of the vdev, not of the pool, so disk size should be constant within a vdev. Potential hot spares would be universally usable for any vdev if they match the size of the largest member of any vdev (i.e. 2 TB). The benefits of that solution are that a physical disk device failure will not affect more than one vdev, and that I/O will scale across vdevs as much as capacity does. The drawback is that the per-vdev redundancy has a price in capacity. I hope I am correct - I am a newbie like you. Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
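[Editorial aside: expressed as zpool commands, a layout along these lines with two raidz vdevs striped into one pool might look like this. Pool and device names are placeholders:]

```shell
# Two raidz1 vdevs striped into one pool ("tank"); capacity and random
# I/O scale across the vdevs, and parity cost is paid per vdev.
zpool create tank \
  raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
  raidz c1t4d0 c1t5d0 c1t6d0 c1t7d0

# An optional hot spare, shared by both vdevs; it must be at least as
# large as the biggest member disk it might replace.
zpool add tank spare c1t8d0
```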
Re: [zfs-discuss] zpool status output confusing
Hello Cindy, I have got my LSI controllers and exchanged them for the Areca. The result is stunning: 1. exported the pool (in the strange state I reported here) 2. changed the controller and re-ordered the drives as they were before posting this matter (c-b-a back to a-b-c) 3. booted OSOL 4. imported the pool. Result: everything but the previously inactive spare drive was immediately discovered and imported. I am really impressed. The problem was clearly related to the Areca controller. (I should say that the whole procedure wasn't just 1, 2, 3, 4, as I had to solve quite a lot of hw-related issues, such as flashing IT firmware over the IR type in order to get all drives hooked up correctly, but that's another greenhorn story.) Best, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disk Issues
Hi, If I may - you mentioned that you use ICH10 over ahci. As far as I know ICH10 is not officially supported by the ahci module. I have also tried myself on various ICH10 systems without success. OSOL wouldn't even install on pre-130 builds, and I haven't tried since. Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best 1.5TB drives for consumer RAID?
Hi again, thanks for the answer. Another thing that came to my mind: you mentioned that you mixed the disks among the controllers. Does that mean you mixed them among pools as well? Unsurprisingly, the WD20EADS is slower than the Hitachi, which is a fixed 7200 rpm drive. I wonder what impact that would have if you used them as vdevs of the same pool. Cheers, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best 1.5TB drives for consumer RAID?
Hi Arnaud, which type of controller is this? Regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Workaround for mpt timeouts in snv_127
Hi Simon > I.e. you'll have to manually intervene > if a consumer drive causes the system to hang, and > replace it, whereas the RAID edition drives will > probably report the error quickly and then ZFS will > rewrite the data elsewhere, and thus maybe not kick > the drive. IMHO the relevant aspects are whether ZFS is able to give an accurate account of cache flush status, and whether it even realizes that a drive is not responsive. That being said, I have not seen a specific report that ZFS would kick green drives at random or in a pattern, like the poor SoHo storage enclosure users see all the time. > > So it sounds preferable to have TLER in operation, if > one can find a consumer-priced drive that allows it, > or just take the hit and go with whatever non-TLER > drive you choose and expect to have to manually > intervene if a drive plays up. OK for home user where > he is not too affected, but not good for businesses > which need to have something recovered quickly. One point about TLER is that two error-correction schemes compete when you run a consumer drive on an active RAID controller that has its own mechanisms. When you run ZFS on a RAID controller, contrary to the best-practice recommendations, an analogous question arises. On the other hand, if you run a green consumer drive on a dumb HBA, I wouldn't know what was wrong with that in the first place. As for manual interventions, the only one I am aware of would be to re-attach a single drive. Not an option if you are really affected, like those miserable Thecus N7000 users who see the entire array of only a handful of drives drop out within hours - over and over again - or don't even get to finish formatting the stripe set. The dire consequences of the gossiped TLER problems lead me to believe that there would be many more, and quite specific, reports in this place if this were a systematic issue with ZFS. Other than that, we are operating outside supported specs when running consumer-level drives in large arrays. 
That is at least the perspective of Seagate and WD. > > > That all rather points to singular issues with > > firmware bugs or similar than to a systematic > issue, > > doesn't it? > > I'm not sure. Some people in the WDC threads seem to > report problems with pauses during media streaming > etc. This was again for SoHo storage enclosures - not for ZFS, right? > when the > 32MB+ cache is empty, then it loads another 32MB into > cache etc and so on? I am not sure that any current disk has such simplistic cache management that it completely cycles the buffer content, let alone for reads that belong to a single file (a disk is basically agnostic of files). Moreover, such buffer management would be completely useless for a striped array. I don't know much better what a disk cache does either, but I am afraid that direction is probably not helpful for understanding certain phenomena people have reported. I think that at this time we are seeing quite a large number of evolutions going on in disk storage, whereby many established assumptions are being abandoned while backwards compatibility is not always taken care of. SAS 6G (will my controller really work in a PCIe 1.1 slot?) and 4k sectors are certainly only prominent examples. It is probably truer than ever that in such times one should fall back on established technologies, including biting the bullet of a cost premium on occasion. Best regards Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Workaround for mpt timeouts in snv_127
Hi Simon, they are the new revision. I also got the impression that the complaints you reported were mainly related to embedded Linux systems, probably running LVM / md (Thecus, QNAP, ...). Other reports I had seen related to typical HW RAIDs. I don't think the situation is comparable to ZFS. I have also followed some TLER-related threads here. I am not sure there was ever a clear assertion whether consumer-drive error correction affects a ZFS pool or not. Statistically, we should have a lot of "restrictive TLER settings helped me solve my ZFS pool issues" success reports here if it did. That all rather points to singular issues with firmware bugs or similar than to a systematic issue, doesn't it? Cheers, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool status output confusing
Hi James, am I right to understand that, in a nutshell, the problem is that if page 80/83 information is present but corrupt/inaccurate/forged (name it as you want), zfs will not fall back to the GUID? regards, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool status output confusing
Thanks. That fixed it. Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Workaround for mpt timeouts in snv_127
Hi Simon, I am running 5 WD20EADS in a raidz-1+spare on ahci controller without any problems I could relate to TLER or head parking. Cheers, Tonmaus -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool status output confusing
Good morning Cindy, > Hi, > > Testing how ZFS reacts to a failed disk can be > difficult to anticipate > because some systems don't react well when you remove > a disk. I am in the process of finding that out for my systems. That's why I am doing these tests. > On an > x4500, for example, you have to unconfigure a disk > before you can remove > it. I have made similar experiences already with disks attached over ahci. Still, zpool status won't recognize immediately, or sometimes at all, that they have been removed. But that's stuff for another thread. > > Before removing a disk, I would consult your h/w docs > to see what the > recommended process is for removing components. Spec-wise, all the drives, backplanes, controllers and their drivers I am using support hotplug. Still, ZFS seems to have difficulties. > > Swapping disks between the main pool and the spare > pool isn't an > accurate test of a disk failure and a spare kicking > in. That's correct. You may want to note that it wasn't the subject of my test procedure. I just intentionally mixed up some disks. > > If you want to test a spare in a ZFS storage pool > kicking in, then yank > a disk from the main pool (after reviewing your h/w > docs) and observe > the spare behavior. I am aware of that procedure. Thanks. > If a disk fails in real time, I > doubt it will be > when the pool is exported and the system is shutdown. Agreed. Once again: the export, reboot, import sequence was specifically followed to eliminate any side effects of hotplug behaviour. > > In general, ZFS pools don't need to be exported to > replace failed disks. > I've seen unpredictable behavior when > devices/controllers change on live > pools. I would review the doc pointer I provided for > recommended disk > replacement practices. > > I can't comment on the autoreplace behavior with a > pool exported and > a swap of disks. Maybe someone else can. 
The point of > the autoreplace > feature is to allow you to take a new replacement > disk and automatically > replace a failed disk without having to use the zpool > replace command. > It's not a way to swap existing disks in the same > pool. The interesting point is finding out whether one will be able to, for example, replace a controller with a different type in case of a hardware failure, or simply move the physical disks to a different enclosure for any imaginable reason. Once again, the naive assumption was that ZFS would automatically find the members of a previously exported pool by the information (metadata) present on each of the pool members (disks, vdevs, files, whatever). The situation now, after the scrub has finished, is that the pool reports no "known data errors", but still with the dubious report of the same device c7t11d0 being both an available spare and an online pool member at the same time. The status persists through another export/import cycle (this time without an intermediate reboot). My next steps will be to swap the controller for an mpt-driven type and rebuild the pool from scratch; then I may repeat the test. Thanks so far for your support. I have learned a lot. Regards, Sebastian
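The migration being tested above relies on ZFS rediscovering pool members from the labels written on each disk. A minimal sketch of that controlled sequence (pool name "pool" taken from this thread; run as root, on a system with ZFS):

```shell
# Cleanly export so all state is flushed and the pool is marked exported.
zpool export pool

# ...power down, move the disks to the new controller/enclosure, boot...

# 'zpool import' with no argument scans the default device directory and
# lists importable pools found via the on-disk metadata.
zpool import

# Import, optionally pointing at an explicit device directory to scan.
zpool import -d /dev/dsk pool

# Verify every member came back ONLINE under its (possibly new) device name.
zpool status pool
```

Whether the receiving configuration can handle the move depends on the driver presenting usable device IDs, which is exactly the open question in this thread.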
Re: [zfs-discuss] zpool status output confusing
If I run # zdb -l /dev/dsk/c#t#d# the result is "failed to unpack label" for any disk attached to controllers running the ahci or arcmsr drivers. Cheers, Tonmaus
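A hedged note on the "failed to unpack label" result: it is often just the wrong device node rather than a driver problem. For whole-disk pool members ZFS writes its labels inside slice 0, so zdb usually needs the s0 node (device name c7t1d0 below is a placeholder from this thread):

```shell
# Whole-disk node: commonly fails with "failed to unpack label".
zdb -l /dev/dsk/c7t1d0

# Slice-0 node: should dump the four ZFS labels, including the pool GUID
# and the vdev configuration stored on that disk.
zdb -l /dev/dsk/c7t1d0s0
```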
Re: [zfs-discuss] zpool status output confusing
Hi again, > Follow recommended practices for replacing devices in > a live pool. Fair enough. On the other hand, I guess it has become clear that the pool went offline as part of the procedure. That was partly because I am not sure about the hotplug capabilities of the controller, and partly because I wanted to simulate an incident that forces me to shut down the machine. I also assumed that a controlled procedure of atomic, legitimate steps (export, reboot, import) should avoid unexpected gotchas. > > In general, ZFS can handle controller/device changes > if the driver > generates or fabricates device IDs. You can view > device IDs with this > command: > > # zdb -l /dev/dsk/cvtxdysz > > If you are unsure what impact device changes will > have on your pool, then > export the pool first. If you see the device ID has > changed when the > pool is exported (use prtconf -v to view device IDs > while the pool is > exported) with the hardware change, then the > resulting pool behavior is > unknown. That's interesting. I understand I should do this to get a better idea of what may happen before pulling the drives from their slots. Now: in case of an enclosure transfer or controller change, how do I find out whether the receiving configuration will be able to handle it? The test obviously only tells me about the IDs the sending configuration has produced. Which layer interprets the IDs, the driver or ZFS? Are the IDs written to disk? The reason I am doing this is to find out what I need to observe with respect to failover strategies for controllers, mainboards, etc. for the hardware that I am using, which is naturally non-Sun. Regards, Sebastian
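The pre-migration check Cindy describes can be sketched as a before/after comparison of device IDs. This is only a sketch under the assumptions of this thread (pool name "pool", grep pattern and temp file names are hypothetical; exact prtconf output format varies by release):

```shell
# Export first, so the comparison reflects the state ZFS will import from.
zpool export pool

# Record device IDs while the pool is exported.
prtconf -v | grep -A1 devid > /tmp/devids.before

# ...swap the controller / move the enclosure, reboot...

prtconf -v | grep -A1 devid > /tmp/devids.after

# If the IDs are unchanged, the import should be unremarkable; if they
# differ, Cindy's point is that the resulting pool behavior is unknown.
diff /tmp/devids.before /tmp/devids.after && zpool import pool
</rest of procedure is site-specific>
```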
Re: [zfs-discuss] zpool status output confusing
Hi Cindy, > I'm still > not sure if you physically swapped c7t11d0 for c7t9d0 or if c7t9d0 is > still connected and part of your pool. The latter is not the case according to status; the former is definitely the case. format reports the drive as present and correctly labelled. > ZFS has recommended ways for swapping disks so if the pool is exported, the > system > shutdown and then disks are swapped, then the behavior is unpredictable and > ZFS is > understandably confused about what happened. > It might work for some hardware, but in general, ZFS should be notified of > the device changes. For the record, ZFS seems to be only marginally confused: the pool showed no errors after the import; the rest remains to be seen once the scrub is done. I can't see what would be wrong with a clean export/import. And the results of the drive swap are part of the plan to find out what impact the hardware has on the transfer of this pool. > > You might experiment with the autoreplace pool > property. Enabling this > property will allow you to replace disks without > using the zpool replace > command. If autoreplace is enabled, then physically > swapping out an > active disk in the pool with a spare disk that is > also connected to > the pool without using zpool replace is a good > approach. Does this still apply if I did a clean export before the swap? Regards, Tonmaus
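The autoreplace experiment Cindy suggests boils down to one pool property. A minimal sketch (pool name "pool" from this thread):

```shell
# autoreplace defaults to off; check the current setting.
zpool get autoreplace pool

# With autoreplace=on, a new disk inserted into the same physical slot
# as a failed member is formatted and resilvered automatically, with no
# explicit 'zpool replace' command needed.
zpool set autoreplace=on pool
```

Whether this behavior also covers disks swapped while the pool was exported is exactly the open question in the post above; the property is documented for live pools.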
Re: [zfs-discuss] zpool status output confusing
> Hi-- > > Were you trying to swap out a drive in your pool's > raidz1 VDEV > with a spare device? Was that your original > intention? Not really. I just wanted to see what happens if the physical controller port changes, i.e. what practical relevance it would have whether I put the disks back in the same order after moving them from enclosure to enclosure. It was a simulation of that principle, done by swapping 3 of 10 drives from position ABC to CAB. The naive assumption was that the pool would just import normally. I have checked: all resources are available as before. c7t0d0 through c7t11d0 are attached to the system. The odd thing still is: c7t9d0 was a member of the pool - where is it? And I thought a spare could only be 'online' in a pool or 'available', not both at the same time. Does it make more sense now? Regards, Tonmaus
[zfs-discuss] zpool status output confusing
Hi all, this is what I get from 'zpool status pool' after swapping 3 of 10 members of a zpool for testing purposes.

u...@zfs2:~$ zpool status pool
  pool: pool
 state: ONLINE
 scrub: scrub in progress for 0h8m, 4.70% done, 2h51m to go
config:

        NAME         STATE     READ WRITE CKSUM
        pool         ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            c7t1d0   ONLINE       0     0     0
            c7t2d0   ONLINE       0     0     0
            c7t3d0   ONLINE       0     0     0
            c7t4d0   ONLINE       0     0     0
            c7t5d0   ONLINE       0     0     0
            c7t6d0   ONLINE       0     0     0
            c7t7d0   ONLINE       0     0     0
            c7t11d0  ONLINE       0     0     0
            c7t8d0   ONLINE       0     0     0
            c7t10d0  ONLINE       0     0     0
        spares
          c7t11d0    AVAIL

errors: No known data errors

Observe that disk c7t11d0 is listed both as a member of the pool and as an available spare. The procedure was 'zpool export pool' > shutdown > swap drives > boot > 'zpool import pool', without a hitch. As you see, a scrub is running for peace of mind... Ideas? TIA. Cheers, Tonmaus
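If the stale spare entry survives the scrub, one possible cleanup (untested against this exact state; pool and device names are from this thread, so verify with 'zpool status' before running) is to drop the spare entry and re-add a genuinely unused disk. Note that hot spares are removed with 'zpool remove', not 'zpool detach':

```shell
# Remove the AVAIL spare entry that duplicates an active pool member.
zpool remove pool c7t11d0

# c7t11d0 should now appear only inside raidz1-0.
zpool status pool

# Optionally re-add the disk that is no longer a pool member as the spare.
zpool add pool spare c7t9d0
```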
Re: [zfs-discuss] Is LSI SAS3081E-R suitable for a ZFS NAS ?
Hi James, > I do not think that you are reading the data > correctly. > > The issues that we have seen via this list and > storage-discuss > have implicated downrev firmware on cards, and the > various different > disk drives that people choose to attach to those > cards. Thanks for pointing that out. I have indeed noticed such reports, but I didn't see any specific plans or acknowledgements to address these issues in mpt. Thus the question is whether these reports justify the assumption that there is anything wrong with mpt in general. > > The use of SAS expanders with mpt-based cards is > *not* an issue. > The use of MPxIO with mpt-based cards is > *not* an > issue. I didn't want to deny any of mpt's core features. I just saw a couple of reports that involved SAS expanders, specifically those based on LSI silicon, reported under the "mpt problem" umbrella, and I understood that these issues were quite stubborn. > Personally, I'm quite happy with the LSISAS3081E that > I have > installed in my system, with the attached 320Gb > consumer-grade > SATA2 disks. > Excellent, that's encouraging. I am planning a similar configuration, though with WD RE3 1 TB disks. Regards, Tonmaus
Re: [zfs-discuss] Is LSI SAS3081E-R suitable for a ZFS NAS ?
> Thanks for your answer. > > I asked primarily because of the mpt timeout issues I > saw on the list. Hi Arnaud, I am looking into the LSI SAS 3081 as well. My current understanding of the mpt issues is that the "sticky" part of these problems is related to multipath features, that is, using port multipliers or SAS expanders. Avoiding these, one should be fine. I am quite a newbie though, just judging from what I read here. Regards, Tonmaus
Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
> This sounds like yet another instance of > > 6910767 deleting large holey objects hangs other I/Os > > I have a module based on 130 that includes this fix > if you would like to try it. > > -tim Hi Tim, 6910767 seems to be about ZVOLs, and the dataset here was not a ZVOL. I did have a 1.4 TB ZVOL on the same pool that also wasn't easy to kill; it hung the machine as well, but only once: it was gone after a forced reboot. Regards, Tonmaus