Re: Filesystem overhead

2003-08-04 Thread Andrew W. Gaunt
Ben, et al.,

 Your explanation has corrected some misconceptions I had regarding
journaling filesystems. Thanks. I think I've gleaned that the journal is an
add-before-you-subtract kind of system, meaning you never put at risk
information you don't have a copy of squirreled away somewhere else (just
in case). Somehow, this reminds me of my workshop; only I do more adding
than subtracting ;-)

I did read up on the ext3 implementation a bit. It's basically ext2 with a
journal file (/.journal). That is kind of neat and helps minimize the
potential (as you point out below) for new bugs, as much of the code is
reused. An ext3 filesystem can even be mounted as ext2 if it is unmounted
cleanly. Also, there are tools to add a journal to an ext2 filesystem,
essentially converting it to ext3. Not to denigrate other journaling
filesystems, but it would seem ext3 is a nice way to go if you're already
comfortable with ext2 and you want journalling.

--
__
| 0|___||.   Andrew Gaunt *nix Sys. Admin, etc. Lucent Technologies
_| _| : : }   [EMAIL PROTECTED] - http://www-cde.mv.lucent.com/~quantum
-(O)-==-o\   [EMAIL PROTECTED] - http://www.gaunt.org


[EMAIL PROTECTED] wrote:

[Ben's explanation of journaling snipped -- it appears in full in the
next message below]

Re: Filesystem overhead

2003-08-02 Thread bscott
On Wed, 30 Jul 2003, at 8:35am, [EMAIL PROTECTED] wrote:
 Very cool, that was revealing. Perhaps this discussion can evolve into how
 journalling (e.g. ext3, etc.) works and why it is good/bad. Anybody?

  If a system crashes (software, hardware, power, whatever) in the middle of
a write transaction, then it is likely that the filesystem will be left in an
inconsistent state.  For that reason, many OSes will run a consistency check
on a filesystem that was not unmounted cleanly before mounting it again.  
Most everyone here has probably seen fsck run after a crash for this
reason.

  That consistency check can take quite a long time, especially on a large
filesystem.  If the filesystem is sufficiently large, the check time can be
hours.  Worse still, if the crash happened at just the right (or wrong) time,
it can cause logical filesystem damage (e.g., a corrupt directory), causing
additional data loss.

  To solve this problem, one can use a journaling filesystem.  A
journaling filesystem does not simply write changes to the disk.  First, it
writes the changes to a journal (sometimes called a transaction log or
just log).  Then it writes the actual changes to the disk (sometimes
called committing).  Finally, it updates the journal to note that the
changes were successfully written (sometimes called checkpointing).

  Now, if the system crashes in the middle of a transaction, upon re-mount,
the system just has to look at the journal.  If a complete transaction is
present in the journal, but has not been checkpointed, the journal is
played back to ensure the filesystem is made consistent.  If an incomplete
transaction is present in the journal, it was never committed, and thus can
be discarded.
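
  In pseudo-code, the whole cycle looks something like this.  (A toy
sketch of the journal/commit/checkpoint idea only -- NOT ext3's actual
on-disk format; every name here is made up for illustration.)

    # Toy write-ahead journal: record intent, apply, then checkpoint.
    disk = {}      # stands in for the real filesystem blocks
    journal = []   # stands in for the on-disk journal area

    def write_transaction(writes):
        # 1. Journal: record the intended changes as one complete entry.
        journal.append({"writes": writes, "checkpointed": False})
        # 2. Commit: apply the changes to the real filesystem.
        disk.update(writes)
        # 3. Checkpoint: note that the changes are safely on disk.
        journal[-1]["checkpointed"] = True

    def recover():
        # After a crash: replay any complete entry that was never
        # checkpointed.  (Incomplete, torn entries were never committed,
        # so a real implementation simply discards them.)
        for txn in journal:
            if not txn["checkpointed"]:
                disk.update(txn["writes"])
                txn["checkpointed"] = True

    write_transaction({"block 7": "new contents"})
    recover()   # harmless if nothing crashed: replay is idempotent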

  Of course, none of this guarantees you won't lose data.  If a program was
in the middle of writing data to a file when the system crashed, chances
are, that file is now scrambled.  Journaling protects the filesystem itself
from damage, and avoids the need for a consistency check after a crash.

  It is also important to understand the difference between journaling
*all* writes to a filesystem, and journaling just *metadata* writes.  The
term metadata means data about data.  Things such as a file's name,
size, time it was last modified, the specific blocks on disk used to store
it, that sort of thing, is metadata.  The metadata is critical, because
corruption of a small amount of metadata can lead to the loss of large
amounts of file data.
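
  (You can see a file's metadata, as distinct from its data, with a
stat call.  A quick illustration in Python; the path is an arbitrary
example:)

    # Everything printed here is metadata: it lives in the inode, not
    # in the file's contents.  (The name lives in the directory entry.)
    import os, time

    st = os.stat("/etc/hostname")              # any existing file will do
    print("size:  ", st.st_size)               # logical size in bytes
    print("mtime: ", time.ctime(st.st_mtime))  # last modification time
    print("blocks:", st.st_blocks)             # 512-byte units allocated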

  Some journaling filesystems journal just metadata.  This keeps the
filesystem itself from becoming inconsistent in a crash, but may leave the
file data itself corrupted.  ReiserFS does this.  Why journal just metadata?  
Because journaling everything can cause a big performance hit, and, as
noted above, if the system crashed in the middle of a write, there is a good
chance you've already lost data anyway.

  Other filesystems journal all writes, or at least give you the option to.  
EXT3 is one such filesystem.  This can prevent file corruption in the case
where an atomic write of the file data was buffered in memory and being
written to disk when the crash occurred.
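
  (On ext3 this is selected at mount time with the data= option:
data=journal journals file data as well as metadata, while the ordered
and writeback modes journal metadata only.  For example:)

    mount -o data=journal /dev/hda1 /mnt     # journal data + metadata
    mount -o data=writeback /dev/hda1 /mnt   # journal metadata only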

  About the only real drawback to a journaling filesystem is the
performance hit.  You have to write everything to disk *twice*: Once to the
journal, and once to the actual filesystem.

  There are other drawbacks:  Journaling filesystems are more complex, so
are statistically more likely to have bugs in the implementation.  But a
non-journaling filesystem can have bugs, too, so I think the best answer is
just more thorough code review and more testing.  The journal also uses some
space on the disk.  But as the space used by the journal is typically
megabytes on a multi-gigabyte filesystem, the overhead is insignificant.

  Finally, a journaling filesystem does not eliminate the need for fsck  
and similar programs.  Inconsistencies can be introduced into a filesystem
in other ways (such as bugs in the filesystem code or hardware problems).  
Since, with a journaling filesystem, fsck will normally *never* be run
automatically by the system, it becomes a good idea to run an fsck on a
periodic basis, just in case.  EXT2/3 even has a feature that will cause
the filesystem to be automatically checked every X days or every Y mounts.
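
  (The knobs for that live in tune2fs(8): -c sets the maximum mount
count and -i the interval between checks.  The values below are just
examples:)

    tune2fs -c 30 /dev/hda1      # force an fsck every 30 mounts
    tune2fs -i 180d /dev/hda1    # ...or every 180 days
    tune2fs -l /dev/hda1         # list the current settings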

  Hope this helps,

-- 
Ben Scott [EMAIL PROTECTED]
| The opinions expressed in this message are those of the author and do  |
| not represent the views or policy of any other person or organization. |
| All information is provided without warranty of any kind.  |




Re: Filesystem overhead

2003-07-31 Thread Bob Bell
On Wed, Jul 30, 2003 at 05:48:06PM -0400, Bill Freeman [EMAIL PROTECTED] wrote:
Also, thinking about it later, I'm likely wrong that an inode
was a whole block, even with 512 byte blocks.  More likely is that
original UFS inodes were 64 bytes, with 8 fitting in a 512 byte block.
I'm sorry, I don't have any early sources to check on this.  If I
scrounge in the basement I might find my bound copies of the Unix
Programmer's Manual (UPM), which might have a white paper on the
filesystem, but this seems to be too academic to bother.
   Close.  UFS inodes are 128 bytes, fitting 4 per block.

--
Bob Bell [EMAIL PROTECTED]


Re: Filesystem overhead

2003-07-30 Thread Andrew W. Gaunt
Very cool, that was revealing. Perhaps this discussion
can evolve into how journalling (e.g. ext3, etc.) works
and why it is good/bad. Anybody?
[EMAIL PROTECTED] wrote:

Hello world!

 Okay, I have satisfied my curiosity in this matter.

 Bill Freeman [EMAIL PROTECTED], who replied to me off-list quickly after my
original post, was correct in that the overhead I was seeing is due to
indirect blocks.  Credit also to Derek Martin [EMAIL PROTECTED], for
providing very nice empirical evidence (and saving me the trouble of
producing it).
[stuff deleted]

 

--
__
| 0|___||.   Andrew Gaunt *nix Sys. Admin, etc. Lucent Technologies
_| _| : : }   [EMAIL PROTECTED] - http://www-cde.mv.lucent.com/~quantum
-(O)-==-o\   [EMAIL PROTECTED] - http://www.gaunt.org


Re: Filesystem overhead

2003-07-30 Thread Jerry Feldman
On Wed, 30 Jul 2003 08:35:16 -0400
Andrew W. Gaunt [EMAIL PROTECTED] wrote:

 Very cool, that was revealing. Perhaps this discussion
 can evolve into  how  journalling (e.g. ext3, etc.) works
 and why it is good/bad. Anybody?
I would like to see some real metrics on:
ext2
ext3
JFS
XFS
ReiserFS

In the case of all the journaling file systems, the big gain is when
rebooting after an improper shutdown. 

I personally use ReiserFS and have had no problems with it except when I
had a bad memory chip in one of the laptops. I had to do the reiserfsck
from a rescue disk, but it worked fine. 

The main tradeoff is run-time performance, but for the most part that is
a good tradeoff unless your system needs all the I/O performance it can
get. In my specific case I have a couple of filesystems I keep unmounted
except when I do nightly backups. 
-- 
Jerry Feldman [EMAIL PROTECTED]
Boston Linux and Unix user group
http://www.blu.org PGP key id:C5061EA9
PGP Key fingerprint:053C 73EC 3AC1 5C44 3E14 9245 FB00 3ED5 C506 1EA9




Re: Filesystem overhead

2003-07-30 Thread bscott
On Wed, 30 Jul 2003, at 7:12pm, [EMAIL PROTECTED] wrote:
 Probably, the best thing that ext3 has going for it is its compatibility
 with ext2.

  Yah.  I would also go so far as to say that EXT3 is the most robust (in
terms of implementation) journaling filesystem available on Linux.  Not
because XFS or Reiser suck (they don't), but simply because EXT2/3 has been
around on Linux longest, and the people maintaining it are in a position to
be the most familiar with Linux.  The maintainers are also *very*
conservative, which can be a Good Thing when you're talking about code
stability.

  ReiserFS doesn't suck.  It also has some things EXT3 doesn't have, like
better handling of large directories.  We've got a customer with a 550 GB
ReiserFS filesystem they've had for well over two years, and it has never
given us any trouble.  (We picked ReiserFS because 2.4 wasn't stable at the
time and XFS (which also handles large directories well) wasn't available on
2.2.)

 ReiserFS has some advantages, in that it optimizes small files so less
 space is wasted.

  Yah, ReiserFS calls them tails.  Mounting with tails turned off is a
standard practice for many, though, since they tend to really drag down
performance.  At least in the current implementation.

-- 
Ben Scott [EMAIL PROTECTED]
| The opinions expressed in this message are those of the author and do  |
| not represent the views or policy of any other person or organization. |
| All information is provided without warranty of any kind.  |



Re: Filesystem overhead

2003-07-29 Thread bscott
Hello world!

  Okay, I have satisfied my curiosity in this matter.

  Bill Freeman [EMAIL PROTECTED], who replied to me off-list quickly after my
original post, was correct in that the overhead I was seeing is due to
indirect blocks.  Credit also to Derek Martin [EMAIL PROTECTED], for
providing very nice empirical evidence (and saving me the trouble of
producing it).

  Once I knew what I was looking for, finding it with Google was easy enough
to do.  ;-)

  A good brief explanation:

An inode stores up to 12 direct block numbers, summing up
to a file size of 48 kByte. Number 13 points to a block with up to
1024 block numbers of 32 Bit size each (indirect blocks), summing
up to a file size of 4 MByte. Number 14 points to a block with
numbers of blocks containing numbers of data blocks (double
indirect, up to 4 GByte); and number 15 points to triple indirect
blocks (up to 4 TByte).

-- from http://e2undel.sourceforge.net/how.html
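
  (Those limits are easy to reproduce: with 4096-byte blocks and 4-byte
block numbers, each indirect block holds 1024 pointers.  A quick sanity
check in Python:)

    # ext2 file-size limits per pointer level: 4 KB blocks, 1024
    # pointers per indirect block, 12 direct pointers in the inode.
    BLOCK = 4096
    PTRS = BLOCK // 4                  # 1024

    print(12 * BLOCK)                  # 49152         (~48 kByte, direct)
    print(PTRS * BLOCK)                # 4194304       (4 MByte, single)
    print(PTRS**2 * BLOCK)             # 4294967296    (4 GByte, double)
    print(PTRS**3 * BLOCK)             # 4398046511104 (4 TByte, triple)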

  A good diagram:

http://e2fsprogs.sourceforge.net/ext2-inode.gif

  Additional references:

http://www.nongnu.org/ext2-doc/
http://e2fsprogs.sourceforge.net/ext2intro.html

  Finally, Mr. Freeman's reply to my original post was sufficiently
informative and well-written that I asked for and received his permission to
repost it here (thanks Bill!).  His post is about UFS (the original(?) Unix
File System), but most of the concepts (if not the numbers) apply to EXT2/3
as well.

-- Begin forwarded message --
Date: Mon, 28 Jul 2003 13:24:16 -0400
From: Bill Freeman [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Filesystem overhead

Ben,

I'd guess that du is counting the indirect blocks, except then
the overhead that you see is too small, unless things have gotten a
lot better than early Unix days.  Actually, they probably have gotten
better, having some scheme to allocate most of a large file from
contiguous sets of blocks that only needs a single pointer in an inode
or indirect block.  But whatever the allocation unit is, you need at
least an index into the allocation space plus an indication of which
space, and more likely a block offset within the filesystem, for each
unit of data.  If 32 bit offsets are enough (maybe not for the new
extra large filesystems), then seeing the approximate 0.1% overhead
you're describing would require 4k allocation units, which seems
reasonable to me.

Actually, I assume that you know the stuff below, but I'm
going to say it anyway.  This is all from UFS: I've never studied extN
stuff internally.  In old Unix systems, blocks were 512 bytes.  An
inode was a block, and after things like permissions, size, fragment
allocation in the last block, owner, group, etc., there was room for
13 disk pointers (index of block within the partition).  The first 10
of these were used to point to the first 10 data blocks of the file.
If a file was bigger than 5k (needed more than 10 blocks), then the
11th pointer pointed to a block that was used for nothing but
pointers.  With 32 bit (4 byte) pointers, 128 pointers would fit in a
block, so this single indirect block could handle the next 64k of
the file.  If the file was larger than 69k (more than fills the single
indirect block), then the 12th pointer in the inode points to a
double indirect block, a block of pointers to blocks of pointers to
blocks of data.  In the 4 byte pointer 512 byte block world, this
handles the next 8Mb of the file.  Finally, if the file was too big
for that, the last inode pointer pointed to a triple indirect block, a
pointer to a block of pointers to blocks of pointers to blocks of
pointers to data blocks.  That handled the next 1Gb of the file.

This size comfortably exceeds the wildest dreams of a PDP-11,
the original implementation platform for Unix.  The washing-machine-sized
drives of the day only held between 2.5Mb and 10Mb.  Even when we started
being able to get 40Mb drives, these limits weren't a concern.
By the mid 1980's, however, big system vendors (I was at Alliant, who
made so called mini-super-computers) were scrambling to find creative
ways to expand the limits on both filesystems and individual files
without breaking too many things.

Linux has clearly been using 1k blocks, and I wouldn't be
surprised by the allocation of 4 block clusters for all but the last
(fragmented) blocks.  One to one thousand overhead to data sounds
pretty reasonable to me.

Bill
-- End forwarded message --
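
  (Bill's UFS thresholds check out the same way: 512-byte blocks and
4-byte pointers give 128 pointers per indirect block.  A sketch with
his numbers:)

    # UFS per-level capacities from Bill's description: 10 direct
    # pointers, 128 pointers per indirect block, 512-byte blocks.
    BLOCK = 512
    PTRS = BLOCK // 4              # 128

    print(10 * BLOCK)              # 5120       -> 5k direct
    print(PTRS * BLOCK)            # 65536      -> +64k single (69k total)
    print(PTRS**2 * BLOCK)         # 8388608    -> +8Mb double
    print(PTRS**3 * BLOCK)         # 1073741824 -> +1Gb triple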

  Thanks to everyone who responded.  I hope other people have found this
thread as informative and useful as I have.

  Clear skies!

-- 
Ben Scott [EMAIL PROTECTED]
| The opinions expressed in this message are those of the author and do  |
| not represent the views or policy of any other person or organization. |
| All information is provided without warranty of any kind.  |

Re: Filesystem overhead

2003-07-28 Thread Bruce Dawson
That is most likely the metadata kept by most logging file systems. Or
it's the duplicate superblocks. (You see these when mkfs is run.)




Re: Filesystem overhead

2003-07-28 Thread bscott
On Mon, 28 Jul 2003, at 2:11pm, [EMAIL PROTECTED] wrote:
 That is most likely the meta data kept by most logging file systems. Or
 its the duplicate superblocks. (You see these when mkfs is run.)

  I would buy that if I was comparing the size of the raw device (partition)  
to the available space in an empty filesystem.  But I'm just looking at a
single file here.  And a 1-byte file uses only 4096 bytes, which is what I
would expect on this filesystem (4096 bytes is the EXT3 block size for this
one).

-- 
Ben Scott [EMAIL PROTECTED]
| The opinions expressed in this message are those of the author and do  |
| not represent the views or policy of any other person or organization. |
| All information is provided without warranty of any kind.  |



Re: Filesystem overhead

2003-07-28 Thread Ben Boulanger
On Mon, 28 Jul 2003 [EMAIL PROTECTED] wrote:
   For example, I have an image of a data CD in a single file.  The actual
 size of the logical file (as reported by stat, ls, and other tools) is
 526,397,440 bytes.  However, the du utility says it uses 526,917,632
 bytes.  That is a difference of 520,192 bytes, or almost half a megabyte.

If I remember correctly, it's that du gets the size in K and then, if you
ask for bytes, converts the K to bytes... but it's been a while.  Silly, I
know... but that's what I recall learning in college.

-- 

An inch of time is an inch of gold but you can't buy that inch of time with
an inch of gold. 



Re: Filesystem overhead

2003-07-28 Thread bscott
On Mon, 28 Jul 2003, at 7:50pm, [EMAIL PROTECTED] wrote:
 if I remember correctly, it's that du gets the size in K and then if you
 ask for bytes, converts the K to bytes... but it's been awhile.

  What du actually does is use the stat family of system calls to find
information about the file's inode.  The stat data includes a field that
reports the number of blocks used to store the inode's data.  A block, in
the context of stat, is always 512 bytes in size.  So the answer (from
stat or anything that uses it) will always be a multiple of 512.  But I am
seeing differences in the hundreds of kilobytes, not just 512 or so bytes.
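
  (In Python terms, for instance: os.stat exposes the same fields, and
st_blocks is in 512-byte units regardless of the filesystem block size.
The path is a hypothetical example:)

    # Compare a file's logical size with the space stat says it uses.
    import os

    st = os.stat("/tmp/cdimage.iso")               # hypothetical file
    print("st_size:        ", st.st_size)          # logical length
    print("st_blocks * 512:", st.st_blocks * 512)  # allocated space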

  Another list reader suggested, off-list, that the numbers I am seeing
might include filesystem blocks used as indirect blocks, which I had
forgotten all about.  That explanation does seem possible, even likely, at
least for this particular example file (a very large, single file).  I'm
going to have to do a little more exploring before I'm satisfied that is the
right answer everywhere (everywhere being defined as the systems where it
has become a concern for me).  :-)
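
  (For what it's worth, the indirect-block theory accounts for the CD
image exactly, assuming 4096-byte blocks and 1024 pointers per indirect
block.  A back-of-the-envelope check:)

    # How many indirect blocks does a 526,397,440-byte file need on
    # ext2/3 with 4 KB blocks?  Compare the result against du's figure.
    import math

    BLOCK, PTRS, SIZE = 4096, 1024, 526397440
    data_blocks = math.ceil(SIZE / BLOCK)    # 128515 data blocks

    rest = data_blocks - 12                  # after the 12 direct pointers
    singles = 1                              # one single-indirect block...
    rest -= PTRS                             # ...which maps 1024 blocks
    doubles = 1 + math.ceil(rest / PTRS)     # double-indirect tree: 126

    print((singles + doubles) * BLOCK)       # 520192 -- matches du exactly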

-- 
Ben Scott [EMAIL PROTECTED]
| The opinions expressed in this message are those of the author and do  |
| not represent the views or policy of any other person or organization. |
| All information is provided without warranty of any kind.  |
