Re: ext2 feature request

2000-05-01 Thread tytso

   Date: Mon, 1 May 2000 21:29:42 +0200 (MEST)
   From: Lennert Buytenhek [EMAIL PROTECTED]

   Eeeeh. not quite.

   Because we reserve the first 1024 bytes of an ext2 fs for putting the boot
   loader, we put the superblock at offset 1024. For 1k blocks, this means
   the superblock ends up in block #1. For 2k/4k block sizes, this means the
   superblock ends up in block #0 (albeit at offset 1024). So for 1k block
   fs'es, s_first_data_block is 1. For 2k/4k blocks fs'es, s_first_data_block
   is 0. So this field is actually being used.

Yup, that's right.  In order to make this work, we'd need to make a
certain number of blocks as "in use" so that the kernel doesn't allocate
into them, and then somehow signal to e2fsck to not to release those
blocks when it is doing the check.  

The answer to that is pretty simple --- you create a compatible
filesystem feature, and indicate the number of "extra" saved blocks in
the superblock.  Old e2fsck's who don't know about the compatible
filesystem features will refuse to touch the filesystem, so they won't
clear the extra bits in the block bitmap.  However, since it's a
compatible filesystem feature, the kernel will happily mount such a
filesystem --- since it trusts the accuracy of the block allocation
bitmaps, it won't try to allocate any of the reserved blocks.

So this can be done using a pure e2fsprogs enhancement. maybe I'll
just code it up, and see what happens.  :-)

- Ted

P.S.  The e2fsprogs 1.19 will have support for off-line resizing, even
without this feature.  It just simply moves the inode table if necessary
(with very paranoid, very carefully written code).  (Yes, that's right,
the GPL timeout on resize2fs finally timed out.)  If you want an early
peek at it, the 1.19 pre-release snapshot at e2fsprogs.sourceforge.net
already has it.  I just need to get a couple of final bug
fixes/enhancements before I release 1.19 for real.  

(I'm currently debapting whether or not to code this above idea before
or after the 1.19 release.  :-)



Re: optimal number of files in directory on ext2

2000-03-19 Thread tytso

   Date:   Wed, 15 Mar 2000 23:08:38 +0100 (CET)
   From: Michal Pise [EMAIL PROTECTED]

   I was asked to create the application which will handle about 100 000 000
   relatively small (about 1-2k) files. I decided to make a
   multi-level directory tree, but I don't know which is the optimal number
   of files in one directory. Is it 10, 100, 1000 or even more? 

   Someone told me that directory should be so small so all the entries fits
   in directly adressed blocks - on 4kB block its 12*4kB = 48kB of space for
   filenames, for short filenames its at least 1000 files in one directory.

   Another guy told me thah directory should be so small so all the entries
   fits in the first block. Its something about 200 short filenames in one
   directory.

   But if I create 14 directories each with 14 files, the average lookup time
   will be 2 * 7 * time_to_explore_directory_entry + time_to_descend_into_dir
   instead of 100 * time_to_explore_directory_entry. And I think that
   time_to_descend_into_dir  80 * time_to_explore_directory_entry. So its
   better to create tree with many levels and only a few files in each
   directory.

   Am I right? Did I miss something?

The important point to consider is that actually searching a directory
block is cheap (mere CPU time).  What is expensive is disk I/O.  This
comes from (a) reading the directory block, and (b) reading the inode
block to find out where the directory entries are.

Given that you will need to read the directory blocks to get at the data
regardless, the only cost to consider between the two approaches is
reading the inode data.  By using a flatter directory hierarchy, Linux
will need to read in fewer inodes, and those inodes, once read into
memory, will more likely be cached.

One futher trick you can play is to pre-allocate space in the
directories so that their blocks are contiguously allocated.  See the
sources to mklost+found (included below, since it's so small),  for an
example of how to do this.   This will further speed up the search, and
make the fewer directory levels approach even more attractive, since you
avoid having to force the disk to seek all over the place.

Finally, note that by using multiple directories, what you are
effectively doing is a Radix search, which for certain datasets (and
from what I can tell, your file name space may very well qualify), a
carefully designed Radix Search tree can be more efficient than a
B-tree.  So even if in the future, if you decide to move to a different
filesystem that has B-tree directories, such as xfs, reseirfs, or a
future version of ext3, you may still want to consider keeping this
structure in your program.

- Ted

/*
 * mklost+found.c   - Creates a directory lost+found on a mounted second
 *extended file system
 *
 * Copyright (C) 1992, 1993  Remy Card [EMAIL PROTECTED]
 *
 * This file can be redistributed under the terms of the GNU General
 * Public License
 */

#include errno.h
#include fcntl.h
#include stdio.h
#include string.h
#include unistd.h
#include sys/param.h
#include sys/stat.h

#include linux/ext2_fs.h

#define LPF "lost+found"

int main (int argc, char ** argv)
{
char name [EXT2_NAME_LEN];
char path [sizeof (LPF) + 1 + 256];
struct stat st;
int i, j;
int d;

if (mkdir (LPF, 0755) == -1) {
perror ("mkdir");
exit(1);
}

i = 0;
memset (name, 'x', 252);
do {
sprintf (name + 252, "%02d", i);
strcpy (path, LPF);
strcat (path, "/");
strcat (path, name);
if ((d = creat (path, 0644)) == -1) {
perror ("creat");
exit (1);
}
i++;
close (d);
if (stat (LPF, st) == -1) {
perror ("stat");
exit (1);
}
} while (st.st_size = (EXT2_NDIR_BLOCKS - 1) * st.st_blksize);
for (j = 0; j  i; j++) {
sprintf (name + 252, "%02d", j);
strcpy (path, LPF);
strcat (path, "/");
strcat (path, name);
if (unlink (path) == -1) {
perror ("unlink");
exit (1);
}
}
exit (0);
}






Re: lost partition table

2000-01-25 Thread tytso

   Date: Sat, 22 Jan 2000 12:13:08 +0100
   From: Guest section DW [EMAIL PROTECTED]

   On Fri, Jan 21, 2000 at 09:27:59AM -0500, [EMAIL PROTECTED] wrote:

Actually, there are two such tools.

   (i) fixdisktable (http://bmrc.berkeley.edu/people/chaffee/fat32.html)
   (ii) gpart (http://home.pages.de/~michab/gpart/)

   Ha, Ted, you quote my text, but remove the rescuept part. Why?
   Have you decided the other two are better? In that case I disagree.
   I have lots of reports where people could not achieve anything
   with fixdisktable and had success with rescuept.
   Maybe fixdisktable is buggy on large disks.

   I think that if you quote only two out of the four then
   it should be rescuept and gpart.

I removed rescuept and findsuper because they are both manual tools, and
in my experience a lot of the people asking for help don't have the
expertise to use the manual tools.  Since what I sent is my "stock
answer" for when people ask me for help, I optimized the list for tools
where people would be less likely to send me e-mail again asking for
help.  (I don't have the time to be a generic Linux help desk, but I
still want to help people get their work done.)

If there are cases where rescuept will recover information that gpart or
fixdisktable will not, then I should probably update my stock answer
file to include it.  Better yet, someone with time should get this into
one of the HOWTO documents for the LDP.  Any volunteers?  :-)

- Ted



Re: [patch] [possible race in ext2] Re: how to write get_block?

1999-10-14 Thread tytso

   From: "Stephen C. Tweedie" [EMAIL PROTECTED]
   Date:   Mon, 11 Oct 1999 17:34:36 +0100 (BST)

   The _fast_ quick fix is to maintain a per-inode list of dirty buffers
   and to invalidate that list when we do a delete.  This works for
   directories if we only support truncate back to zero --- it obviously
   gets things wrong if we allow partial truncates of directories (but why
   would anyone want to allow that?!)

   This would have minimal performance implication and would also allow
   fast fsync() of indirect block metadata for regular files.

I've actually had patches to do fast fsync for some time, but I
thought you said you had some changes in queue which would make the
need for this obsolete, so didn't bother to submit this, since it is a
bit of a hack.  Here it is though, for folks to comment upon.  Code to
invalidate the list of dirty buffers is missing from this code, but it
wouldn't be hard to add.

- Ted

Patch generated: on Wed Aug 18 15:01:49 EDT 1999 by [EMAIL PROTECTED]
against Linux version 2.2.10
 
===
RCS file: fs/ext2/RCS/fsync.c,v
retrieving revision 1.1
diff -u -r1.1 fs/ext2/fsync.c
--- fs/ext2/fsync.c 1999/07/21 10:45:22 1.1
+++ fs/ext2/fsync.c 1999/07/21 10:45:26
@@ -250,36 +250,83 @@
return err;
 }
 
-/*
- * File may be NULL when we are called. Perhaps we shouldn't
- * even pass file to fsync ?
- */
-
-int ext2_sync_file(struct file * file, struct dentry *dentry)
+static int sync_inode(struct inode *inode)
 {
-   int wait, err = 0;
-   struct inode *inode = dentry-d_inode;
-
+   int i, wait, err = 0;
+   __u32   *blklist;
+   
if (S_ISLNK(inode-i_mode)  !(inode-i_blocks))
-   /*
-* Don't sync fast links!
-*/
-   goto skip;
+   return 0;
 
-   for (wait=0; wait=1; wait++)
-   {
-   err |= sync_direct (inode, wait);
-   err |= sync_indirect (inode,
+   if (!S_ISDIR(inode-i_mode)  inode-u.ext2_i.i_ffsync_flag) {
+   blklist = inode-u.ext2_i.i_ffsync_blklist;
+   for (wait = 0; wait =1; wait++) {
+   for (i=0; i  inode-u.ext2_i.i_ffsync_ptr; i++) {
+#if 0  /* Debugging */
+   if (!wait)
+   printk("Fasy sync: %d\n", blklist[i]);
+#endif 
+   err |= sync_block(inode, blklist[i], wait);
+   }
+   }
+   } else {
+   for (wait=0; wait=1; wait++) {
+   err |= sync_direct (inode, wait);
+   err |= sync_indirect (inode,
  inode-u.ext2_i.i_data+EXT2_IND_BLOCK,
- wait);
-   err |= sync_dindirect (inode,
+ wait);
+   err |= sync_dindirect (inode,
   inode-u.ext2_i.i_data+EXT2_DIND_BLOCK, 
-  wait);
-   err |= sync_tindirect (inode, 
+  wait);
+   err |= sync_tindirect (inode, 
   inode-u.ext2_i.i_data+EXT2_TIND_BLOCK, 
-  wait);
+  wait);
+   }
}
-skip:
+   inode-u.ext2_i.i_ffsync_flag = 1;
+   inode-u.ext2_i.i_ffsync_ptr = 0;
err |= ext2_sync_inode (inode);
+   return err;
+}
+
+/*
+ * File may be NULL when we are called by msync on a vma.  In the
+ * future, the VFS layer should be changed to not pass the struct file
+ * parameter to the fsync function, since it's not used by any of the
+ * implementations (and the dentry parameter is all that we need).
+ */
+int ext2_sync_file(struct file * file, struct dentry *dentry)
+{
+   int err = 0;
+
+   err = sync_inode(dentry-d_inode);
+   if (dentry-d_parent  dentry-d_parent-d_inode)
+   err |= sync_inode(dentry-d_parent-d_inode);
+   
return err ? -EIO : 0;
+}
+
+/*
+ * This function adds a list of blocks to be written out by fsync.  If
+ * it exceeds NUM_EXT2_FFSYNC_BLKS, then we turn off the fast fsync flag.
+ */
+void ext2_ffsync_add_blk(struct inode *inode, __u32 blk)
+{
+   int i;
+   __u32   *blklist;
+
+   if (inode-u.ext2_i.i_ffsync_flag == 0)
+   return;
+#if 0  /* Debugging */
+   printk("Add fast sync: %d\n", blk);
+#endif
+   blklist = inode-u.ext2_i.i_ffsync_blklist;
+   for (i=0; i  inode-u.ext2_i.i_ffsync_ptr; i++)
+   if (blklist[i] == blk)
+   return;
+   if (inode-u.ext2_i.i_ffsync_ptr  NUM_EXT2_FFSYNC_BLKS) {