Re: Filesystem overhead
Ben, et al.,

Your explanation has corrected some misconceptions I had regarding journaling filesystems. Thanks. I think I've gleaned that the journal is an "add before you subtract" kind of system, meaning you never put at risk information you don't have a copy of squirreled away somewhere else (just in case). Somehow, this reminds me of my workshop; only I do more adding than subtracting ;-)

I did read up on the ext3 implementation a bit. It's basically ext2 with a journal file (/.journal). That is kind of neat, and helps minimize the potential (as you point out below) for new bugs, as much of the code is reused. An ext3 filesystem can even be mounted as ext2 if it is unmounted cleanly. Also, there are tools to add a journal to an ext2 filesystem, essentially converting it to ext3. Not to denigrate other journaling filesystems, but it would seem ext3 is a nice way to go if you're already comfortable with ext2 and you want journaling.

-- __ | 0|___||. Andrew Gaunt *nix Sys. Admin, etc. Lucent Technologies _| _| : : } [EMAIL PROTECTED] - http://www-cde.mv.lucent.com/~quantum -(O)-==-o\ [EMAIL PROTECTED] - http://www.gaunt.org

[EMAIL PROTECTED] wrote: [stuff deleted]
Re: Filesystem overhead
On Wed, 30 Jul 2003, at 8:35am, [EMAIL PROTECTED] wrote:

 Very cool, that was revealing. Perhaps this discussion can evolve into how journalling (e.g. ext3, etc.) works and why it is good/bad. Anybody?

If a system crashes (software, hardware, power, whatever) in the middle of a write transaction, then it is likely that the filesystem will be left in an inconsistent state. For that reason, many OSes will run a consistency check on a filesystem that was not unmounted cleanly before mounting it again. Most everyone here has probably seen fsck run after a crash for this reason. That consistency check can take quite a long time, especially on a large filesystem. If the filesystem is sufficiently large, the check time can be hours. Worse still, if the crash happened at just the right (or wrong) time, it can cause logical filesystem damage (e.g., a corrupt directory), causing additional data loss.

To solve this problem, one can use a journaling filesystem. A journaling filesystem does not simply write changes to the disk. First, it writes the changes to a journal (sometimes called a transaction log, or just log). Then it writes the actual changes to the disk (sometimes called committing). Finally, it updates the journal to note that the changes were successfully written (sometimes called checkpointing). Now, if the system crashes in the middle of a transaction, upon re-mount, the system just has to look at the journal. If a complete transaction is present in the journal but has not been checkpointed, the journal is played back to ensure the filesystem is made consistent. If an incomplete transaction is present in the journal, it was never committed, and thus can be discarded.

Of course, none of this guarantees you won't lose data. If a program was in the middle of writing data to a file when the system crashed, chances are that file is now scrambled. Journaling protects the filesystem itself from damage, and avoids the need for a consistency check after a crash.
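The journal/commit/checkpoint cycle described above can be sketched as a toy in-memory model. All the names here (JournalingFS, begin, seal, and so on) are purely illustrative; this is a sketch of the idea, not how ext3's actual journaling layer is built.

```python
# Toy model of write-ahead journaling: journal first, commit second,
# checkpoint last. Recovery replays sealed-but-uncheckpointed entries.

class JournalingFS:
    def __init__(self):
        self.journal = []         # journal entries, in write order
        self.disk = {}            # the "real" filesystem: block -> contents
        self.checkpointed = set()

    def begin(self, txid, changes):
        # Step 1: record the transaction in the journal first.
        entry = {"txid": txid, "changes": dict(changes), "sealed": False}
        self.journal.append(entry)
        return entry

    def seal(self, entry):
        # The complete transaction is now safely in the journal.
        entry["sealed"] = True

    def commit(self, entry):
        # Step 2: write the actual changes to the main filesystem.
        self.disk.update(entry["changes"])

    def checkpoint(self, entry):
        # Step 3: note in the journal that the changes were written.
        self.checkpointed.add(entry["txid"])

    def recover(self):
        # After a crash: replay complete (sealed) transactions that were
        # never checkpointed; discard incomplete ones, since they were
        # never committed to the main filesystem anyway.
        for entry in self.journal:
            if entry["sealed"] and entry["txid"] not in self.checkpointed:
                self.disk.update(entry["changes"])
                self.checkpointed.add(entry["txid"])
```

Simulating a crash between seal and commit, recover() replays the transaction from the journal; a crash in the middle of writing the journal entry itself leaves an unsealed entry that recovery simply ignores.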
It is also important to understand the difference between journaling *all* writes to a filesystem and journaling just *metadata* writes. The term metadata means data about data. Things such as a file's name, its size, the time it was last modified, and the specific blocks on disk used to store it are metadata. The metadata is critical, because corruption of a small amount of metadata can lead to the loss of large amounts of file data.

Some journaling filesystems journal just metadata. This keeps the filesystem itself from becoming inconsistent in a crash, but may leave the file data itself corrupted. ReiserFS does this. Why journal just metadata? Because journaling everything can cause a big performance hit, and, as noted above, if the system crashed in the middle of a write, there is a good chance you've already lost data anyway. Other filesystems journal all writes, or at least give you the option to. EXT3 is one such filesystem. This can prevent file corruption in the case where an atomic write of the file data was buffered in memory and being written to disk when the crash occurred.

About the only real drawback to a journaling filesystem is the performance hit. You have to write everything to disk *twice*: once to the journal, and once to the actual filesystem. There are other drawbacks: Journaling filesystems are more complex, so are statistically more likely to have bugs in the implementation. But a non-journaling filesystem can have bugs, too, so I think the best answer is just more thorough code review and more testing. The journal also uses some space on the disk. But as the space used by the journal is typically megabytes on a multi-gigabyte filesystem, the overhead is insignificant.

Finally, a journaling filesystem does not eliminate the need for fsck and similar programs. Inconsistencies can be introduced into a filesystem in other ways (such as bugs in the filesystem code or hardware problems).
Since, with a journaling filesystem, fsck will normally *never* be run automatically by the system, it becomes a good idea to run an fsck on a periodic basis, just in case. EXT2/3 even has a feature that will cause the filesystem to be automatically checked every X days or every Y mounts.

Hope this helps,

-- Ben Scott [EMAIL PROTECTED] | The opinions expressed in this message are those of the author and do | | not represent the views or policy of any other person or organization. | | All information is provided without warranty of any kind. |

___ gnhlug-discuss mailing list [EMAIL PROTECTED] http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss
Re: Filesystem overhead
On Wed, Jul 30, 2003 at 05:48:06PM -0400, Bill Freeman [EMAIL PROTECTED] wrote:

 Also, thinking about it later, I'm likely wrong that an inode was a whole block, even with 512 byte blocks. More likely is that original UFS inodes were 64 bytes, with 8 fitting in a 512 byte block. I'm sorry, I don't have any early sources to check on this. If I scrounge in the basement I might find my bound copies of the Unix Programmer's Manual (UPM), which might have a white paper on the filesystem, but this seems to be too academic to bother.

Close. UFS inodes are 128 bytes, fitting 4 per block.

-- Bob Bell [EMAIL PROTECTED]
Re: Filesystem overhead
Very cool, that was revealing. Perhaps this discussion can evolve into how journalling (e.g. ext3, etc.) works and why it is good/bad. Anybody?

[EMAIL PROTECTED] wrote:

 Hello world! Okay, I have satisfied my curiosity in this matter. Bill Freeman [EMAIL PROTECTED], who replied to me off-list quickly after my original post, was correct in that the overhead I was seeing is due to indirect blocks. Credit also to Derek Martin [EMAIL PROTECTED], for providing very nice empirical evidence (and saving me the trouble of producing it). [stuff deleted]

-- __ | 0|___||. Andrew Gaunt *nix Sys. Admin, etc. Lucent Technologies _| _| : : } [EMAIL PROTECTED] - http://www-cde.mv.lucent.com/~quantum -(O)-==-o\ [EMAIL PROTECTED] - http://www.gaunt.org
Re: Filesystem overhead
On Wed, 30 Jul 2003 08:35:16 -0400 Andrew W. Gaunt [EMAIL PROTECTED] wrote:

 Very cool, that was revealing. Perhaps this discussion can evolve into how journalling (e.g. ext3, etc.) works and why it is good/bad. Anybody?

I would like to see some real metrics on: ext2, ext3, JFS, XFS, and ReiserFS. For all of the journaling filesystems, the big gain is when rebooting after an improper shutdown. I personally use ReiserFS and have had no problems with it, except when I had a bad memory chip in one of the laptops. I had to run reiserfsck from a rescue disk, but it worked fine. The main tradeoff is run-time performance, but for the most part that is a good tradeoff, unless your system needs all the I/O performance it can get. In my specific case I have a couple of filesystems I keep unmounted except when I do nightly backups.

-- Jerry Feldman [EMAIL PROTECTED] Boston Linux and Unix user group http://www.blu.org PGP key id:C5061EA9 PGP Key fingerprint:053C 73EC 3AC1 5C44 3E14 9245 FB00 3ED5 C506 1EA9
Re: Filesystem overhead
On Wed, 30 Jul 2003, at 7:12pm, [EMAIL PROTECTED] wrote:

 Probably, the best thing that ext3 has going for it is its compatibility with ext2.

Yah. I would also go so far as to say that EXT3 is the most robust (in terms of implementation) journaling filesystem available on Linux. Not because XFS or Reiser suck (they don't), but simply because EXT2/3 has been around on Linux longest, and the people maintaining it are in a position to be the most familiar with Linux. The maintainers are also *very* conservative, which can be a Good Thing when you're talking about code stability.

ReiserFS doesn't suck. It also has some things EXT3 doesn't have, like better handling of large directories. We've got a customer with a 550 GB ReiserFS filesystem they've had for well over two years, and it has never given us any trouble. (We picked ReiserFS because 2.4 wasn't stable at the time and XFS (which also handles large directories well) wasn't available on 2.2.)

 One of ReiserFS's advantages is that it optimizes small files so less space is wasted.

Yah, ReiserFS calls them tails. Mounting with tails turned off is a standard practice for many, though, since they tend to really drag down performance. At least in the current implementation.

-- Ben Scott [EMAIL PROTECTED]
Re: Filesystem overhead
Hello world!

Okay, I have satisfied my curiosity in this matter. Bill Freeman [EMAIL PROTECTED], who replied to me off-list quickly after my original post, was correct in that the overhead I was seeing is due to indirect blocks. Credit also to Derek Martin [EMAIL PROTECTED], for providing very nice empirical evidence (and saving me the trouble of producing it). Once I knew what I was looking for, finding it with Google was easy enough to do. ;-)

A good brief explanation: "An inode stores up to 12 direct block numbers, summing up to a file size of 48 kByte. Number 13 points to a block with up to 1024 block numbers of 32 Bit size each (indirect blocks), summing up to a file size of 4 MByte. Number 14 points to a block with numbers of blocks containing numbers of data blocks (double indirect, up to 4 GByte); and number 15 points to triple indirect blocks (up to 4 TByte)." -- from http://e2undel.sourceforge.net/how.html

A good diagram: http://e2fsprogs.sourceforge.net/ext2-inode.gif

Additional references: http://www.nongnu.org/ext2-doc/ http://e2fsprogs.sourceforge.net/ext2intro.html

Finally, Mr. Freeman's reply to my original post was sufficiently informative and well-written that I asked for and received his permission to repost it here (thanks Bill!). His post is about UFS (the original(?) Unix File System), but most of the concepts (if not the numbers) apply to EXT2/3 as well.

-- Begin forwarded message -- Date: Mon, 28 Jul 2003 13:24:16 -0400 From: Bill Freeman [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: Filesystem overhead

Ben, I'd guess that du is counting the indirect blocks, except then the overhead that you see is too small, unless things have gotten a lot better than early Unix days. Actually, they probably have gotten better, having some scheme to allocate most of a large file from contiguous sets of blocks that only needs a single pointer in an inode or indirect block.
But whatever the allocation unit is, you need at least an index into the allocation space plus an indication of which space, and more likely a block offset within the filesystem, for each unit of data. If 32 bit offsets are enough (maybe not for the new extra large filesystems), then to see the approximate 0.1% overhead you're describing would need 4k allocation units, which seems reasonable to me.

Actually, I assume that you know the stuff below, but I'm going to say it anyway. This is all from UFS: I've never studied extN stuff internally.

In old Unix systems, blocks were 512 bytes. An inode was a block, and after things like permissions, size, fragment allocation in the last block, owner, group, etc., there was room for 13 disk pointers (index of block within the partition). The first 10 of these were used to point to the first 10 data blocks of the file. If a file was bigger than 5k (needed more than 10 blocks), then the 11th pointer pointed to a block that was used for nothing but pointers. With 32 bit (4 byte) pointers, 128 pointers would fit in a block, so this single indirect block could handle the next 64k of the file.

If the file was larger than 69k (more than fills the single indirect block), then the 12th pointer in the inode points to a double indirect block, a block of pointers to blocks of pointers to blocks of data. In the 4 byte pointer, 512 byte block world, this handles the next 8Mb of the file. Finally, if the file was too big for that, the last inode pointer pointed to a triple indirect block, a pointer to a block of pointers to blocks of pointers to blocks of pointers to data blocks. That handled the next 1Gb of the file.

This size comfortably exceeds the wildest dreams of a PDP-11, the original implementation platform for Unix. The washing-machine-sized drives of the day only held between 2.5Mb and 10Mb. Even when we started being able to get 40Mb drives the limit wasn't a concern.
By the mid 1980's, however, big system vendors (I was at Alliant, who made so-called mini-super-computers) were scrambling to find creative ways to expand the limits on both filesystems and individual files without breaking too many things.

Linux has clearly been using 1k blocks, and I wouldn't be surprised by the allocation of 4 block clusters for all but the last (fragmented) blocks. One to one thousand overhead to data sounds pretty reasonable to me.

Bill

-- End forwarded message --

Thanks to everyone who responded. I hope other people have found this thread as informative and useful as I have. Clear skies!

-- Ben Scott [EMAIL PROTECTED]
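The pointer arithmetic in both derivations above (Bill's old-UFS numbers and the e2undel ext2 numbers) can be checked mechanically. A small sketch; the `coverage` helper is just an illustration, assuming 4-byte block pointers throughout, as both posts do:

```python
# How much file data each level of an inode's pointers can reach:
# some direct block pointers, then single, double, and triple
# indirect blocks full of 4-byte block numbers.

def coverage(block_size, direct_ptrs, ptr_size=4):
    ptrs = block_size // ptr_size    # pointers per indirect block
    return {
        "direct": direct_ptrs * block_size,
        "single": ptrs * block_size,
        "double": ptrs ** 2 * block_size,
        "triple": ptrs ** 3 * block_size,
    }

# Old UFS: 512-byte blocks, 10 direct pointers (Bill's numbers).
old_ufs = coverage(512, 10)
# -> direct 5k, single "the next 64k", double "the next 8Mb",
#    triple "the next 1Gb"

# ext2 with 4096-byte blocks, 12 direct pointers (the e2undel numbers).
ext2 = coverage(4096, 12)
# -> direct 48 kByte, single 4 MByte, double 4 GByte, triple 4 TByte
```

Both sets of quoted limits fall straight out of block size and pointer width, which is why the concepts carry over from UFS to EXT2/3 even though the numbers differ.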
Re: Filesystem overhead
That is most likely the metadata kept by most logging filesystems. Or it's the duplicate superblocks. (You see these when mkfs is run.)
Re: Filesystem overhead
On Mon, 28 Jul 2003, at 2:11pm, [EMAIL PROTECTED] wrote:

 That is most likely the meta data kept by most logging file systems. Or its the duplicate superblocks. (You see these when mkfs is run.)

I would buy that if I were comparing the size of the raw device (partition) to the available space in an empty filesystem. But I'm just looking at a single file here. And a 1-byte file uses only 4096 bytes, which is what I would expect on this filesystem (4096 bytes is the EXT3 block size for this one).

-- Ben Scott [EMAIL PROTECTED]
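The 1-byte-file observation above is easy to reproduce. A hedged sketch: the exact allocation reported depends on the filesystem and its block size (4096 bytes in Ben's example), so only the logical size is a fixed expectation here.

```python
import os
import tempfile

# Write a 1-byte file and compare its logical size (st_size) with the
# space actually allocated for it (st_blocks, always in 512-byte units).
fd, path = tempfile.mkstemp()
os.write(fd, b"x")
os.fsync(fd)          # make sure the block is really allocated on disk
os.close(fd)

st = os.stat(path)
logical = st.st_size              # 1 byte
allocated = st.st_blocks * 512    # e.g. 4096 on a 4096-byte-block EXT3

print(logical, allocated)
os.unlink(path)
```

On a filesystem with 4096-byte blocks this typically prints `1 4096`: the whole block is consumed no matter how little of it the file uses.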
Re: Filesystem overhead
On Mon, 28 Jul 2003 [EMAIL PROTECTED] wrote:

 For example, I have an image of a data CD in a single file. The actual size of the logical file (as reported by stat, ls, and other tools) is 526,397,440 bytes. However, the du utility says it uses 526,917,632 bytes. That is a difference of 520,192 bytes, or almost half a megabyte.

If I remember correctly, it's that du gets the size in K and then, if you ask for bytes, converts the K to bytes... but it's been a while. Silly, I know... but that's what I recall learning in college.

-- An inch of time is an inch of gold but you can't buy that inch of time with an inch of gold.
Re: Filesystem overhead
On Mon, 28 Jul 2003, at 7:50pm, [EMAIL PROTECTED] wrote:

 if I remember correctly, it's that du gets the size in K and then if you ask for bytes, converts the K to bytes... but it's been awhile.

What du actually does is use the stat family of system calls to find information about the file's inode. The stat data includes a field giving the number of blocks used to store the inode's data. A block, in the context of stat, is always 512 bytes in size. So the answer (from stat or anything that uses it) will always be a multiple of 512.

But I am seeing differences in the hundreds of kilobytes, not just 512 or so bytes. Another list reader suggested, off-list, that the numbers I am seeing might include filesystem blocks used as indirect blocks, which I had forgotten all about. That explanation does seem possible, even likely, at least for this particular example file (a very large, single file). I'm going to have to do a little more exploring before I'm satisfied that is the right answer everywhere (everywhere being defined as on the systems it has become a concern for me). :-)

-- Ben Scott [EMAIL PROTECTED]
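For the CD image mentioned earlier in the thread (526,397,440 bytes logical, 526,917,632 bytes per du), the indirect-block explanation can be checked arithmetically. A sketch, assuming a 4096-byte block size and 4-byte block pointers as discussed above:

```python
block = 4096
ptrs_per_block = block // 4        # 1024 block pointers per indirect block

size = 526_397_440                 # logical file size (stat/ls)
reported = 526_917_632             # size reported by du

data_blocks = -(-size // block)    # ceiling division: 128515 data blocks

# The first 12 data blocks are reached via direct pointers in the inode;
# every further block needs an entry in a single-indirect block. One
# single-indirect block hangs directly off the inode; the rest hang off
# a single double-indirect block, which is itself one more block.
beyond_direct = data_blocks - 12
single_indirect = -(-beyond_direct // ptrs_per_block)   # 126 blocks
double_indirect = 1                # one is enough for a file this size
overhead = (single_indirect + double_indirect) * block

print(overhead, reported - size)   # prints: 520192 520192
```

The 127 indirect blocks account for exactly the 520,192-byte difference Ben observed, which supports the indirect-block explanation he settles on later in the thread.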