http://step.polymtl.ca/~ldd/ext2fs/ext2fs_toc.html

will give you great insight into the ext2 filesystem.

Basically the disk is broken into blocks (logical and physical)
and files are broken into blocks and fragments.  When enough
fragments are available at the tail of a file to fill a (logical)
block, the fragments are converted into a block.  When the
filesystem is created, some blocks are reserved for superuser
use.  This prevents the formation of an unbootable filesystem
where there aren't enough free blocks to write mount information.
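
The reserved-block policy amounts to a simple check at allocation
time.  Here is a minimal sketch in C (the struct and function names
are illustrative, not the actual kernel code; the real ext2
superblock tracks these counts in fields like s_r_blocks_count):

    #include <stdbool.h>

    /* Illustrative only -- not the real kernel structures. */
    struct fs_counts {
        unsigned long free_blocks;      /* blocks currently free      */
        unsigned long reserved_blocks;  /* reserved for the superuser */
    };

    /* Ordinary users may not dip into the reserve; root may. */
    bool may_allocate(const struct fs_counts *fs, bool is_superuser)
    {
        if (is_superuser)
            return fs->free_blocks > 0;
        return fs->free_blocks > fs->reserved_blocks;
    }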

In a sense, the filesystem is fragmented by design, since one
file doesn't follow another in strict contiguous sequence
supported ONLY by a (duplicated) file allocation table.
Superblocks and group descriptors containing identical
information are stored at regular intervals across the disk.
This allows using an alternate superblock to recover the
filesystem if the primary one, or even SEVERAL of them, are
corrupted.
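
To make the redundancy concrete: in the classic ext2 layout (before
the sparse-superblock option), a copy of the superblock sits at the
start of every block group, so its location is easy to compute.  A
sketch with hypothetical numbers (real values depend on block size
and mkfs parameters):

    #include <stdio.h>

    int main(void)
    {
        /* Typical values for a 1 KiB block size; hypothetical disk. */
        unsigned long first_data_block = 1;
        unsigned long blocks_per_group = 8192;
        unsigned long groups           = 4;

        for (unsigned long g = 0; g < groups; g++)
            printf("superblock copy in group %lu at block %lu\n",
                   g, first_data_block + g * blocks_per_group);
        return 0;
    }

This prints blocks 1, 8193, 16385, and 24577; a recovery tool can
read any one of those copies if the primary one is damaged.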

OK so the disk is broken into block groups and each group
contains:

1. Superblock

2. Group descriptor

3. Bitmaps (one for the logical blocks in the group and one for
the inodes there)

4. Inode table (each inode describes a file--including the
permissions, the number of links pointing to it, and the blocks
used by the file: direct pointers to blocks, then if needed a
pointer to a block of pointers to blocks (indirection), with
space for double and triple indirection, plus the location of
fragments, etc.)

5. Blocks of actual file data.
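
A simplified C sketch of what some of this per-group metadata looks
like (field names paraphrase the real on-disk structures, struct
ext2_super_block and struct ext2_group_desc, which have many more
fields than shown here):

    #include <stdint.h>

    struct superblock {
        uint32_t inodes_count;      /* total inodes             */
        uint32_t blocks_count;      /* total blocks             */
        uint32_t r_blocks_count;    /* blocks reserved for root */
        uint32_t free_blocks_count;
        uint32_t free_inodes_count;
        uint32_t blocks_per_group;
        uint32_t inodes_per_group;
        uint16_t magic;             /* 0xEF53 identifies ext2   */
    };

    struct group_desc {
        uint32_t block_bitmap;      /* block no. of the block bitmap */
        uint32_t inode_bitmap;      /* block no. of the inode bitmap */
        uint32_t inode_table;       /* first block of inode table    */
        uint16_t free_blocks_count;
        uint16_t free_inodes_count;
        uint16_t used_dirs_count;
    };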

Information in the superblock allows rebuilding a corrupted
filesystem (within limits).  Now, this description is by no means
complete--there is a set of rules (a storage algorithm) designed
to keep fragmentation almost constant and rather small.

There are rules used to allocate inodes:

))The inode of a new file is allocated in the same group as the
inode of its parent directory.

))Inodes are allocated equally between groups.

There are rules used to allocate blocks:

))A new block is allocated in the same group as its inode.

))Allocate consecutive sequences of blocks.

It is not always possible to satisfy all of these rules at the
same time, in which case the file manager written for the ext2fs
filesystem may allocate the "problems" anywhere.  Usually,
though, it is possible to satisfy most of them.
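
A much-simplified sketch of how these heuristics combine, in C (the
helper functions are assumed, and the real ext2 allocator is
considerably more involved):

    #include <stdbool.h>

    #define NGROUPS 32  /* hypothetical number of block groups */

    /* Assumed helpers: query the per-group bitmaps. */
    extern bool group_has_free_inode(int group);
    extern bool group_has_free_block(int group);

    /* New file: prefer the parent directory's group, else fall back
     * to any group with room (keeping groups roughly balanced). */
    int choose_inode_group(int parent_dir_group)
    {
        if (group_has_free_inode(parent_dir_group))
            return parent_dir_group;
        for (int g = 0; g < NGROUPS; g++)
            if (group_has_free_inode(g))
                return g;
        return -1;  /* filesystem out of inodes */
    }

    /* New data block: prefer the group holding the file's inode,
     * else fall back to anywhere -- the "problems" case above. */
    int choose_block_group(int inode_group)
    {
        if (group_has_free_block(inode_group))
            return inode_group;
        for (int g = 0; g < NGROUPS; g++)
            if (group_has_free_block(g))
                return g;
        return -1;  /* filesystem out of blocks */
    }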

Now the result is a fair amount of scatter across the disk with
most files pretty much contiguous and the descriptor information
allocated reasonably equally on the disk.  This is efficient,
especially with larger files.  "Defragmentation" doesn't really
stack files end-to-end because descriptor information is spread
across the whole filesystem (disk or disk partition)--it just
reorganizes enough to satisfy the allocation rules.  

Anyway, the link has a much fuller explanation.


The defragmenter for ext2fs (not supplied with most systems),
which others have cited links to, produces some marginal
increase in performance.  It was really designed for the older
ext fs, which benefited significantly.  It would be either
inapplicable or dangerous to attempt to use it on a Reiserfs,
which is organized in a radically different fashion to support
journaling and high efficiency on large quantities of small
files.  Reiserfs does not remotely resemble the model of packing
files one after another on the disk with no space between.  Once
again, it is very efficient for what it does, though not yet so
stable that I would choose it for mission-critical work.

The Reiserfs is discussed in not unduly academic language at
http://devlinux.com/projects/reiserfs/

I say "not unduly" because the filesystem is based on at least one
more intervening level of abstraction than ext2fs.  It uses a
grand compromise developed long ago for databases of various
sorts.

It is relatively simple to prove that the most efficient method
of storing information for lookup is to have it sorted (or where
the individual information objects are large, to have a sorted
index) and to search the index on a binary basis--that is, look
in the middle and go whichever direction the mismatch indicates,
up or down and look in the middle of what's left.  Using a binary
search algorithm, while you think of a number between 1 and 1000,
I will start with 500 and say 250 next if you report "high" or
750 next if you say "low", and I WILL nail your number in 10 or
fewer guesses if you just report whether I am high, low, or
matched.  The problem with a binary search is that the data HAVE to be
sorted.  So updating your database with a time-eating sort had to
be done when the tail log of additions grew to the point that the
lookup became inefficient.  A balanced tree duplicated the quick
lookup capabilities of the binary lookup (nearly) by making nodes
of sorted information with each node about half-full of data, two
new nodes being created from a current one when the node filled
to a certain point.  Now this wasn't done in a single level but
in a tree structure.  This entity is called a Balanced tree,
B-Tree, B+Tree, etc., and a short book could be written on the
code necessary to implement it.  It is about six times as
complicated as competing storage algorithms.
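
The guessing game is just binary search over a sorted range.  A
minimal, standard version in C (nothing filesystem-specific here):

    #include <stdio.h>

    /* Return the index of key in sorted array a[0..n-1], or -1. */
    int binary_search(const int *a, int n, int key)
    {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;   /* look in the middle */
            if (a[mid] == key)
                return mid;                 /* matched            */
            else if (a[mid] < key)
                lo = mid + 1;               /* guess was low      */
            else
                hi = mid - 1;               /* guess was high     */
        }
        return -1;
    }

    int main(void)
    {
        int sorted[] = { 2, 3, 5, 7, 11, 13, 17, 19 };
        printf("13 is at index %d\n", binary_search(sorted, 8, 13));
        return 0;
    }

Each guess halves the remaining range, so 1..1000 falls in at most
10 guesses (2^10 = 1024).  And the node-split step that keeps a
balanced tree balanced can be sketched just as briefly (this is only
the split, not a full B-tree implementation):

    #include <string.h>

    #define MAX_KEYS 8

    struct node {
        int nkeys;
        int keys[MAX_KEYS];
    };

    /* When a node fills, split it into two roughly half-full nodes.
     * In a real B-tree the middle key is also promoted into the
     * parent, which is where most of the complexity lives. */
    void split(const struct node *full, struct node *left,
               struct node *right)
    {
        int half = full->nkeys / 2;
        left->nkeys  = half;
        memcpy(left->keys, full->keys, half * sizeof(int));
        right->nkeys = full->nkeys - half;
        memcpy(right->keys, full->keys + half,
               right->nkeys * sizeof(int));
    }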

Reiserfs uses a modification of this idea to store files.  Root
nodes and (logically formatted) nodes are used for locator
information in 4 levels; then an unformatted node level contains
file data packed pretty tightly.  How tightly depends on whether
file tails (the portion of a file that is left over after storing
the rest of it in filled blocks) are stored in the leaf, like the
file data (notails), or stored in the node level above it
(tails).  The notails option is faster at the expense of space,
and tails makes more efficient use of space at the expense of
speed.
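
For concreteness, the "tail" is just the remainder after filling
whole blocks.  A trivial sketch, assuming a 4 KiB block size (block
size is actually chosen when the filesystem is created):

    #include <stdio.h>

    int main(void)
    {
        const unsigned long block_size = 4096;  /* assumed           */
        unsigned long file_size  = 10000;       /* hypothetical file */
        unsigned long full_blks  = file_size / block_size;  /* 2    */
        unsigned long tail_bytes = file_size % block_size;  /* 1808 */

        printf("%lu full blocks, %lu-byte tail\n",
               full_blks, tail_bytes);
        return 0;
    }

Whether that 1808-byte tail is stored with the rest of the file data
(notails, faster) or packed into the node level above (tails, more
space-efficient) is exactly the trade-off described above.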

I hope this overview was useful.  If it was too complex, please
tell me in private email.  If it was understandable and you think
of a better way to explain it, please tell me also.  Maybe we can
make something for people who want an answer to the filesystem
question without slogging through all the detail.

Civileme
