With VxFS, fortunately, the time to fsck is short. Even under 
massive corruption and an fsck -o full, you're still looking at a 
fraction of the time it would take on a similarly sized UFS or other FFS 
cylinder-group based filesystem. We had one such corruption event, 
attributable either to the HBA causing silent corruption or to being 
slightly downrev on the VxFS disk layout (I think we were on layout 
version 3 and 4 was out). We fixed both potential problems by replacing 
the HBA and upgrading the filesystem layout online, and have had no 
further events.
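
For reference, this is roughly what the recovery and the online layout 
upgrade look like. Treat it as a sketch from memory: the disk group, 
volume, and mount point names are made up, and the exact syntax varies 
by platform and Storage Foundation version (check the vxupgrade and 
fsck_vxfs man pages):

    # report the current disk layout version of a mounted VxFS filesystem
    vxupgrade /data

    # upgrade the layout online, one version at a time (e.g. 3 -> 4)
    vxupgrade -n 4 /data

    # worst case after corruption: unmount and run a full structural fsck
    umount /data
    fsck -F vxfs -o full -y /dev/vx/rdsk/datadg/datavol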

If you absolutely cannot afford to lose everything, backups are still a 
good idea, but there are other ways to do it too. One technique is 
rolling snapshots. In this scenario you have two mirrors, each staggered 
for mounting. You snapshot (at the volume level) one mirror and break it 
off for an hour, then snapshot the second mirror and resync the first. 
For best effect, combine this with the FlashSnap license option, or the 
resynchronization will take far too long; a sketch of one rotation is below.
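
This is the traditional vxassist snapshot interface (the disk group and 
volume names are invented, and the FastResync/FlashSnap setup and the 
newer vxsnap interface differ by version, so treat it as an outline 
rather than a recipe):

    # FastResync (the FlashSnap piece) makes snapback a delta resync
    # instead of a full plex copy
    vxvol -g datadg set fastresync=on datavol

    # attach and synchronize a snapshot mirror (once per mirror)
    vxassist -g datadg snapstart datavol

    # break the mirror off as an independent snapshot volume and mount it
    vxassist -g datadg snapshot datavol snapvol1
    mount -F vxfs -o ro /dev/vx/dsk/datadg/snapvol1 /snap1

    # an hour later: unmount, reattach, and resynchronize this mirror
    # while the other mirror takes its turn as the broken-off copy
    umount /snap1
    vxassist -g datadg snapback snapvol1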

This way you don't have to use tape. Another technique would be to take 
a block/image backup of your filesystem, but you waste a very large 
amount of space unless you have a special tool that avoids backing up 
the unused space of the block image. (Streaming back onto the volume 
from tape should be quite fast in this situation.) If your filesystem is 
mostly full anyway, that optimization probably won't save a lot.
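
The simplest (and most wasteful) version of that is just imaging the raw 
volume, something like the following; the device and tape names are 
invented, and a real setup would want a tool that understands which 
blocks are actually allocated:

    # naive block/image backup of the whole volume straight to tape;
    # note this copies free space too, which is the waste described above
    dd if=/dev/vx/rdsk/datadg/datavol of=/dev/rmt/0n bs=1024k

    # restore is the reverse, and streams sequentially, so it is fast
    dd if=/dev/rmt/0n of=/dev/vx/rdsk/datadg/datavol bs=1024k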

The main point is that corruption sometimes happens. Even in our worst 
case, where silent corruption of the volume caused it to go offline, 
followed by panics and a full fsck, we didn't lose everything on the 
filesystem, though we did lose some chunks.

One other thing that can happen is bit flips on the disk device during 
read. It is possible for two reads of the same block to return different 
results due to effects at the head level. Particularly if you are using 
less expensive disk technology like SATA, the chance of an undetected 
bit flip is non-zero. The error rate on SATA is higher than on 
equivalent enterprise FC drives, and they don't have as good protection 
against this sort of thing. These errors *do* happen, occasionally; the 
larger the filesystem, the more likely. Unless you are using a vendor 
like DataDirect, which has a RAID-6-style Hamming code that checks 
integrity on every read, how do you notice them? (I really like SATA 
disks for their low cost, but they do have some issues one must be 
aware of.)
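
Short of integrity-checking hardware, about the only way to notice is to 
keep your own checksums and re-verify them. A minimal sketch, assuming 
GNU md5sum and a path that is made up for the example:

    # build a checksum manifest once the data is known good
    find /data -type f -print0 | xargs -0 md5sum > /var/tmp/data.md5

    # later, from cron: any mismatch on a file that hasn't legitimately
    # changed points at silent corruption somewhere in the stack
    md5sum -c /var/tmp/data.md5 2>&1 | grep -v ': OK$'

With millions of small files this is obviously expensive, so you would 
sample or spread it out, but it is the only way I know of to catch a 
flipped bit before you read the file in anger.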

* Yes, if you have silent filesystem corruption you risk checkpoint 
corruption too, but when we ran fsck -o full it recovered most things 
sufficiently. We did lose a few tables, but most of the data was 
recoverable, either from the main FS or from a checkpoint.

SANs are still good from a scalability and virtualization/transparency 
perspective. A SAN allows you to change your layout transparently to the 
user, without any downtime (with the right software). This is a good 
thing. They don't provide you with any sort of integrity guarantee, 
except that if the SAN detects an error (very different from any other 
layer/level in the data transport), it will retransmit. They also allow 
you to partition disks and hosts, do backups directly to tape, and other 
useful things.

One more comment below:

Aknin wrote:
> I've cross-posted this question on several places, and practically all 
> answers switched immediately to backup/restore issues. It seems that 
> no-one puts any kind of trust in filesystems, in the sense that even 
> if you have an expensive mirrored SAN, the system (the software 
> managing the data) is too stupid to cause corruption (more about that 
> below) and small amounts of data /may/ be lost without too much pain, 
> people here (and on comp.storage, and on ZFS-discuss) recommend to 
> backup the filesystem (i.e., copy all its data to something which has 
> a different data structure than the filesystem itself, implicitly 
> because the FS /will/ get corrupt at some point) or split it into 
> smaller FSs (implicitly because then if one of them gets corrupt, we 
> can contain the damage and restore backups).
>
> I'll tell a bit more about my platform, maybe it would change some of 
> the answers:
> The system in question is made of millions (sometimes more) of small 
> files. Corruption in any particular file isn't troublesome, nor even 
> in hundreds of files. The block device is mirrored and is stored on 
> expensive SAN arrays that are trusted not to choke and die, and 
> snapshots can be taken at regular intervals. As you can probably 
> understand, the amount of files times the capacity (tens of TBs and 
> growing...) makes backups quite irrelevant, and what we're counting on 
> (maybe unjustly) is the mirroring and the snapshotting. We trust the 
> system in the sense that it's too stupid to do something wrong, it 
> works at the file level and is exceedingly unlikely to corrupt more 
> than a file (or two, or a hundred - but no more) at a time.
>
> What /is/ worrying to me is silent filesystem corruption that will at 
> some point jump and bite my arse, not in the sense that any particular 
> file will be lost, but in the sense that the filesystem will need 
> impossibly lengthy fsck (at millions+ files, it's damn lengthy) or 
> will become total-loss. If filesystem corruption is detected, it will 
> cause prompt snapshot rollback and incremental recovery*, but I'm 
> worried about rolling back only to discover the filesystem was already 
> corrupted at the time of the snap. I don't have room for much more 
> than one or two snaps. So you see that since the application I'm 
> running is as dumb as a brick (just writing its fiendishly small 
> files, not caring much if they'll be there later as long as most of 
> them will be there) - the most complex part of my scenario is the 
> filesystem, rather than the application, and tape backup is totally 
> impractical even for sizes much smaller than 4TB, when your average 
> filesize is ~45K.
>
The thing about a checkpoint: if you get corruption in your filesystem, 
you have to recover/fsck that *before* doing your rollback. (The same 
is not true of a volume snapshot.) Once you have done your fsck -o full 
(a rare event indeed, and I've been using VxFS for more than 10 years), 
the checkpoint is clean at the same time the main filesystem is clean. 
Now, some files may be missing from the checkpoint as a result of the 
fsck, but it won't require any further recovery.
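
For what it's worth, the checkpoints here are just VxFS Storage 
Checkpoints. A sketch from memory, with invented names and Solaris-style 
mount syntax (check fsckptadm(1M) for your version):

    # create a space-optimized Storage Checkpoint of a mounted filesystem
    fsckptadm create ckpt_tue /data

    # list the checkpoints that exist on that filesystem
    fsckptadm list /data

    # mount a checkpoint read-only to pull individual files back out
    mount -F vxfs -o ckpt=ckpt_tue /dev/vx/dsk/datadg/datavol:ckpt_tue /data_ckpt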

> Does that change your advice? Would anyone trust the filesystem (any 
> filesystem, but especially VxFS) enough to make a few (say 3 or 4) 
> 32TB monsters holding this kind of data and being backed solely by 
> snaps? If you feel that it's not safe - what good are those 
> interconnected/grid multi-TB super-expensive SANs, if you can't mkfs 
> more than a few TBs without fear because of filesystem limitation? 
> Joerg, out of curiosity, which FS did you use there (on the 30TB 
> solution), and how did you replicate so much data to tape online?
>
> On 4/30/07, *Hallbauer, Joerg* <[EMAIL PROTECTED]> wrote:
>
>     This is an interesting question. I think the answer depends a lot
>     on the type of data you are storing, and what you backup/recovery
>     mechanisms are. For example, we have a file system here that is
>     nearly 30TB (non Veritas), however, that file system is replicated
>     to tape, so if something happens to the disk copy of a file the
>     user just has to wait for a tape mount. So, in this case, having a
>     single 30TB file system is fine.
>
>      
>
>     So, I think that the real question to ask isn't "how long will
>     this take to back up", it's "how long will this take to recover
>     should I need to"?  all of this goes back to the two questions we
>     should always ask the business "how much data can you afford to
>     lose", and "how long can you afford to be down"? Once you have the
>     answer to those questions, you can design a solution that will
>     meet the needs of the business.
>
>      
>
>     In my experience, this tends to be an iterative process since the
>     first answer is inevitably "we can't lose ANY data, and we can
>     NEVER be down!" Of course, once they see the price tag associated
>     with that answer they almost always decide that they really CAN
>     afford to lose some data, and if they are down for a week, well,
>     that's fine… So then we do it again with the more reasonable numbers.
>
>      
>
>     --joerg
>
>     Joerg Hallbauer
>     Sr. Technical Engineer
>     MIS - Technical Support
>     Warner Bros. Entertainment
>     (818) 954-4798
>     [EMAIL PROTECTED]
>
>     To achieve success, whatever the job we have, we must pay a price.
>     --Vince Lombardi
>
>     ------------------------------------------------------------------------
>
>     *From:* [EMAIL PROTECTED] *On Behalf Of* Aknin
>     *Sent:* Saturday, April 28, 2007 2:19 AM
>     *To:* veritas-vx@mailman.eng.auburn.edu
>     *Subject:* Re: [Veritas-vx] Maximum Filesystem and File sizes?
>
>      
>
>     Following Asim's question answered by Scott, I'd be glad if people
>     could share real life large filesystems and their experience with
>     them. I'm slowly coming to a realization that regardless of
>     theoretical filesystem capabilities (1TB, 32TB, 256TB or more),
>     more or less across the enterprise filesystem arena people are
>     recommending to keep practical filesystems up to 1TB in size, for
>     manageability and recoverability.
>
>     What's the maximum filesystem size you've used in a production
>     environment? How did the experience turn out?
>
>     Thanks,
>      -Yaniv
>
>

