Hello, c4ts,

There seems to be some mixup of unclear English and mismatched terminology, so 
I am not sure I understood your questions correctly. Still, I'll try to respond ;)

I've recently posted about ZFS terminology here: 
http://opensolaris.org/jive/click.jspa?searchID=4607806&messageID=515894

In the ZFS encapsulation model, a "pool" is a collection of "top-level vdevs", 
each of which is in turn a collection of "lower-level vdevs".

Lower-level vdevs are usually individual disks, partitions or slices; 
top-level vdevs are redundant groups of such devices (raidzN, mirror) or 
non-redundant single devices; and the pool is a sort of striping across such 
groups. This is a simplified explanation, because it is not quite like usual 
RAID striping, and different types of data may be stored with different 
allocation algorithms (e.g. AFAIK in a raidzN vdev, metadata may still be 
mirrored for performance).
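
For illustration, a pool with two raidz1 top-level vdevs could be created like 
this (the disk names are made up, just to show the hierarchy):

  # a pool striped over two top-level vdevs, each a raidz1 group of three disks
  zpool create tank raidz1 c1t0d0 c1t1d0 c1t2d0 raidz1 c2t0d0 c2t1d0 c2t2d0
  # show the resulting pool / top-level vdev / disk hierarchy
  zpool status tank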

Generally, when your data is written, it is striped into relatively small 
blocks and those are distributed across all top-level vdevs to provide 
parallel performance.
Within a top-level vdev the incoming writes are striped into yet smaller 
blocks (about 512 KB, according to one source) so as to use the hardware 
prefetches efficiently; parities are calculated, and the resulting data and 
parity blocks are written across different lower-level vdevs to provide 
redundancy.
Each written block has a checksum so it can be easily tested for validity, and 
in redundant top-level vdevs an invalid block can be automatically repaired 
(using other copies in a mirror, or parity data in raidzN) during a read 
operation.
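
You can also ask ZFS to walk and verify every allocated block on demand, 
repairing what it can from redundancy (pool name made up):

  # read and checksum-verify all data, fixing bad copies from redundancy
  zpool scrub tank
  # watch progress and see any checksum errors that were found/repaired
  zpool status -v tank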

If a lower-level vdev breaks, its top-level vdev becomes "degraded" - it can 
still be used, but provides less redundancy.
If any top-level vdev fails completely, the pool is likely to become corrupted 
and may require lengthy repair, or destruction and restoration from backup, 
because part of the striped data would be inaccessible.
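
A broken disk in a redundant top-level vdev is typically swapped out roughly 
like this (device names made up):

  # replace the failed disk with a new one and wait for resilvering to finish
  zpool replace tank c1t2d0 c3t0d0
  zpool status tank    # shows resilver progress and the vdev state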

You address your user data as "datasets" - namely, filesystem datasets in the 
case of a POSIX filesystem interface, or volume datasets for "raw storage" 
such as swap. Datasets are kept as a hierarchy inside the pool. All 
unallocated space in the pool is available to all its datasets.
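
For example (pool and dataset names made up):

  # a filesystem dataset (POSIX files and directories), creating parents too
  zfs create -p tank/export/home
  # a 4 GB volume dataset ("raw" block device, e.g. for swap)
  zfs create -V 4g tank/swapvol
  zfs list -r tank    # show the dataset hierarchy inside the pool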

Now that we know how things are called, regarding your questions:

> Is it possible to ZFS to keep data that belongs
> together on the same Pool
> that way if THERE is a failure only the Data on the
> Pool that Failed Needs to be replaced 
> ( or if one Pool failed Does that Mean All the other
> Pools still fail as well ? with out a way to recover
> data ? ) 

If you have many storage disks, you can create several small pools (e.g. many 
independent mirrors) which would give you less aggregate performance and no 
free space shared between them, but which would break independently. So you'll 
have a smaller chance of losing, and then repairing/restoring, ALL of your 
data. But you may have to spend some effort balancing performance and free 
space when using many pools.
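
For example, instead of one big pool (disk names made up):

  # two independent mirror pools; each one fails and is repaired on its own
  zpool create data1 mirror c1t0d0 c1t1d0
  zpool create data2 mirror c2t0d0 c2t1d0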

> I am Wanting to be able to Expand My Array over Time
> by adding either 4 or 8 HDD pools 

Yes, you can do that - either by expanding an existing pool with an additional 
top-level vdev to stripe across, or by creating a new independent pool. 
You can also replace all drives in an existing top-level vdev one by one 
(waiting each time for the redundant data to be migrated to the new disk - 
"resilvering"), and when all drives have become larger, you can (auto-)expand 
the vdev in place.
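
Roughly like this (device names made up):

  # grow the pool by striping over one more top-level vdev
  zpool add tank raidz1 c4t0d0 c4t1d0 c4t2d0
  # or grow an existing vdev in place: swap its disks for bigger ones, one at
  # a time, waiting for each resilver to complete before doing the next disk
  zpool set autoexpand=on tank
  zpool replace tank c1t0d0 c5t0d0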

> Most of the Data will probably never be Deleted 

However, keep in mind that if you add a new top-level vdev or expand an 
existing one (keeping the other old vdevs as they were), you will end up with 
unbalanced free space. Existing data is not re-balanced automatically (though 
you can do that manually by copying it around several times), so most of the 
new writes would go to the new top-level vdev. As recently discussed in many 
threads on this forum, this is likely to reduce your overall pool performance 
by a lot (because attempted writes to stripe across nearly-full devices would 
lag, and because "fast" writes would effectively go to one top-level vdev and 
not across many top-level vdevs in parallel).
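
One sketch of such a manual re-balance, by rewriting a dataset so its blocks 
get reallocated across all current vdevs (dataset names made up; verify the 
copy before destroying anything):

  zfs snapshot tank/olddata@move
  zfs send tank/olddata@move | zfs receive tank/olddata.new
  # after verifying the copy, drop the original and rename the copy back
  zfs destroy -r tank/olddata
  zfs rename tank/olddata.new tank/olddata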

> but say I have 1 gig remaining on the First Pool and
> adding an 8 gig file 
> does this mean the data will be then put onto pool 1
> and pool 2 ? 
> ( 1 gig pool 1 7 gig pool 2 )
> 
> or would ZFS be able to put it onto the 2nd pool
> instead of Splitting it ?

Neither. Pools are independent.
If your question actually meant "1 gig free" on a top-level vdev or on a disk, 
while there are other "free" spots on different devices in the same pool, then 
overall free space in this pool is aggregated and you may be able to write a 
larger file than would fit on any single disk.
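
You can see this aggregation in the space accounting (pool name made up):

  zpool list                            # SIZE/ALLOC/FREE per pool
  zfs list -r -o name,used,avail tank   # space each dataset sees as available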

> 
> the other scenario would be folder structure would
> ZFS be able to understand data contained in a Folder
> Tree Belongs together and be able to store it on a
> dedicated pool ?

That would probably mean a tree of ZFS filesystem datasets.
These are kept in a hierarchy like a directory tree, and each FS dataset can 
store a tree of sub-directories.
An FS dataset has a unique mountpoint, so as with other Unix filesystems, 
your OS's view of the tree of directories and files is aggregated from 
different individual filesystems mounted on the branches of a global tree. 
These "different individual filesystems" may come from a hierarchy of FS 
datasets in one pool (often with a mountpoint inherited from a parent dataset, 
such as "pool/export/home/username" being mounted by default as a subdirectory 
"username" within the mountpoint of "pool/export/home"), and they may come 
from different sources - such as other pools or network filesystems. For the 
different sources you may need to specify mountpoints manually, though (or use 
tricks like inheriting from a parent dataset which has a specified common 
"mountpoint=/export/home" but is not mounted by itself).

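A quick sketch of that trick (all names made up):

  # parent provides a common mountpoint prefix but is not mounted itself
  zfs create -o mountpoint=/export/home -o canmount=off pool/home
  zfs create pool/home/username    # inherits and mounts at /export/home/username
  # a dataset from another pool can be grafted in with an explicit mountpoint
  zfs set mountpoint=/export/home/otheruser otherpool/otheruser
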
Thus files in one dataset would be kept together; files in different datasets 
in one pool would also be kept somewhat together (see below); and files in 
datasets from different pools would be kept separately. But one way or 
another, as long as the datasets are mounted, the files are addressable in the 
common Unix filesystem tree of your OS.

Like other filesystems, FS datasets have unique IDs and private inode numbers 
within each FS; so, for example, you can't hard-link or fast-move (hardlink 
the new name, unlink the old name) files between different FS datasets, even 
in the same pool. But you can often use soft-links and/or mountpoints to 
achieve a specific needed result.
Also, free pool space is shared between the different FS datasets in the same 
pool (and can be further tuned for specific datasets by using quotas and 
reservations).
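
For example (dataset names made up):

  # cap how much of the shared pool space one dataset may consume
  zfs set quota=100G pool/home/username
  # guarantee a minimum amount of space to another dataset
  zfs set reservation=20G pool/home/otheruser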

> if so it would be great or else you would be spending
> for ever replacing data from backup if something does
> go wrong 

If you browse this forum in greater detail, you will see that, for one reason 
or another, even (or especially) with ZFS things still do go wrong, especially 
on cheaper under-provisioned hardware, and repairs are often slow or 
complicated, so for large pools they can indeed take "forever" or close to 
that.

As always, redundant storage is not a replacement for a backup system (many 
people suggest that a backup onto another similar server box is superior to 
tape backups, though it probably uses more electricity around the clock).
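
For example, snapshots can be replicated to such a box roughly like this (host, 
pool and snapshot names made up):

  zfs snapshot -r tank/home@backup1
  zfs send -R tank/home@backup1 | ssh backuphost zfs receive -d backuptank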

> sorry if this goes in the wrong spot i could no find
Seems to have landed in the right spot ;)

HTH,
//Jim Klimov