Re: Ext2 / VFS projects

2000-02-11 Thread Manfred Spraul

Erez Zadok wrote:
> [...]
> (2) Inline functions moved from linux/{fs,mm}/*.c to header files so they
> can be included both in the original source files and in stackable file
> systems:
> 
> check_parent macro (fs/namei.c -> include/linux/dcache_func.h)
> lock_parent (fs/namei.c -> include/linux/dcache_func.h)
> get_parent (fs/namei.c -> include/linux/dcache_func.h)
> unlock_dir (fs/namei.c -> include/linux/dcache_func.h)
> double_lock (fs/namei.c -> include/linux/dcache_func.h)
> double_unlock (fs/namei.c -> include/linux/dcache_func.h)
> 
That sounds like a good idea: fs/nfsd/vfs.c currently contains copies of
most of these functions...

--
Manfred



Re: Ext2 / VFS projects

2000-02-11 Thread Erez Zadok

In message <[EMAIL PROTECTED]>, Manfred Spraul writes:
> Erez Zadok wrote:
> > [...]
> > (2) Inline functions moved from linux/{fs,mm}/*.c to header files so they
> > can be included both in the original source files and in stackable file
> > systems:
> > 
> > check_parent macro (fs/namei.c -> include/linux/dcache_func.h)
> > lock_parent (fs/namei.c -> include/linux/dcache_func.h)
> > get_parent (fs/namei.c -> include/linux/dcache_func.h)
> > unlock_dir (fs/namei.c -> include/linux/dcache_func.h)
> > double_lock (fs/namei.c -> include/linux/dcache_func.h)
> > double_unlock (fs/namei.c -> include/linux/dcache_func.h)
> > 
> That sounds like a good idea: fs/nfsd/vfs.c currently contains copies of
> most of these functions...

I agree.  I didn't want to make copies of those b/c I got burnt in the past
when they changed subtly and I didn't notice the change.

> --
>   Manfred

Erez.



Re: Ext2 / VFS projects

2000-02-11 Thread Erez Zadok

In message <[EMAIL PROTECTED]>, Tigran Aivazian writes:
> I noticed the stackable fs item on Alan's list ages ago but there was no
> pointer to the patch (I noticed FIST stuff but surely that is not a "small
> passive patch" you are referring to?)

Yes the patches are small and passive.  No new vfs/mm code is added or
changed!  The most important part of my patches had already been included
since 2.3.17; that was an addition/renaming of a private field in struct
vm_area_struct.  What's left are things that are necessary to support
stacking for the first time in linux: exposing some functions/symbols from
{mm,fs}/*.c, adding externs to headers, additions to ksyms.c, and moving
some macros and inline functions from private .c files to a header, so they
can be included in any file system.

I've used these patches on dozens of linux machines for the past 2+ years,
and have had no problems.  I constantly get people asking me when my patches
will become part of the main kernel.  I have about 9 active developers who
write file systems using my templates.  I've had more than 21,000 downloads
of my templates in the past two years.

> So, my point is - if you point everyone to those patches, someone might
> help Alan out if one feels like it (and has time).

http://www.cs.columbia.edu/~ezk/research/software/fist-patches/

The latest 2.3 patches in that URL include two things: my small main kernel
patches, and a fully working lofs.  The lofs of course is several thousand
lines of code, but it is not strictly necessary to include it with the
main kernel; it can be distributed and built separately, just as my other f/s
modules are.  However, I do think that lofs is a useful enough f/s that it
should be part of the main kernel.

If you go to the 2.3 directory under the above URL, there's a README
describing the latest 2.3 patches.  I've included it below, so everyone can
read it and see what my patches do, and how harmless they are.

BTW, I've got a prototype unionfs for linux if anyone is interested.

> Regards,
> Tigran.

As always, I'll be delighted to help *anyone* use my work, and would love to
help the linux maintainers incorporate my patches, answer any concerns they
might have, etc.

Cheers,
Erez.

==
Summary of changes for 2.3.25 to support stackable file systems and lofs.

(Note: some of my previous patches had been incorporated in 2.3.17.)

(1) Created a new header file include/linux/dcache_func.h.  This header file
contains dcache-related definitions (mostly static inlines) used by my
stacking code and by fs/namei.c.  Ion and I tried to put these
definitions in fs.h and dcache.h to no avail.  We would have to make
lots of changes to fs.h or dcache.h and other .c files just to get these
few definitions in.  In the interest of simplicity and minimizing kernel
changes, we opted for a new, small header file.  This header file is
included in fs/namei.c because everything in dcache_func.h was taken
from fs/namei.c.  And of course, these static inlines are useful for my
stacking code.

If you don't like the name dcache_func.h, maybe you can suggest a better
name.  Maybe namei.h?

If you don't like having a new header file, let me know what you'd
prefer instead and I'll work on it, even if it means making more changes
to fs.h, namei.c, and dcache.h...

(2) Inline functions moved from linux/{fs,mm}/*.c to header files so they
can be included both in the original source files and in stackable file
systems:

check_parent macro (fs/namei.c -> include/linux/dcache_func.h)
lock_parent (fs/namei.c -> include/linux/dcache_func.h)
get_parent (fs/namei.c -> include/linux/dcache_func.h)
unlock_dir (fs/namei.c -> include/linux/dcache_func.h)
double_lock (fs/namei.c -> include/linux/dcache_func.h)
double_unlock (fs/namei.c -> include/linux/dcache_func.h)
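
For reference, here is roughly what ends up in the new header.  The bodies
below are reconstructed from memory of the 2.3-era fs/namei.c rather than
pasted from the patch, so treat them as a sketch:

/* include/linux/dcache_func.h (sketch) */
#ifndef _LINUX_DCACHE_FUNC_H
#define _LINUX_DCACHE_FUNC_H

#include <linux/fs.h>   /* struct dentry, struct inode, i_sem */

#define check_parent(dir, dentry)  ((dentry)->d_parent == (dir))

static inline struct dentry *get_parent(struct dentry *dentry)
{
	return dget(dentry->d_parent);
}

/* Grab and lock a dentry's parent directory. */
static inline struct dentry *lock_parent(struct dentry *dentry)
{
	struct dentry *dir = dget(dentry->d_parent);

	down(&dir->d_inode->i_sem);
	return dir;
}

static inline void unlock_dir(struct dentry *dir)
{
	up(&dir->d_inode->i_sem);
	dput(dir);
}

/* Lock two directories in a consistent (address) order so that two
 * concurrent renames cannot deadlock against each other. */
static inline void double_lock(struct dentry *d1, struct dentry *d2)
{
	struct semaphore *s1 = &d1->d_inode->i_sem;
	struct semaphore *s2 = &d2->d_inode->i_sem;

	if (s1 != s2) {
		if ((unsigned long) s1 < (unsigned long) s2) {
			struct semaphore *tmp = s2;
			s2 = s1;
			s1 = tmp;
		}
		down(s1);
	}
	down(s2);
}

static inline void double_unlock(struct dentry *d1, struct dentry *d2)
{
	struct semaphore *s1 = &d1->d_inode->i_sem;
	struct semaphore *s2 = &d2->d_inode->i_sem;

	up(s1);
	if (s1 != s2)
		up(s2);
	dput(d1);
	dput(d2);
}

#endif /* _LINUX_DCACHE_FUNC_H */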

(3) Added to include/linux/fs.h an extern declaration for default_llseek.

(4) include/linux/mm.h: also added extern declarations for

filemap_swapout
filemap_swapin
filemap_sync
filemap_nopage

so they can be included in other code (esp. stackable f/s modules).
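
In other words, the additions are just prototypes along these lines
(signatures as I remember them from the 2.3 sources; filemap_swapin is
omitted here, and everything should be checked against the actual tree):

/* include/linux/fs.h */
extern loff_t default_llseek(struct file *file, loff_t offset, int origin);

/* include/linux/mm.h */
extern struct page * filemap_nopage(struct vm_area_struct *area,
                                    unsigned long address, int no_share);
extern int filemap_swapout(struct vm_area_struct *vma, struct page *page);
extern int filemap_sync(struct vm_area_struct *vma, unsigned long address,
                        size_t size, unsigned int flags);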

(5) added EXPORT_SYMBOL declarations in kernel/ksyms.c for the functions
now exposed to (stackable f/s) modules:

EXPORT_SYMBOL(___wait_on_page);
EXPORT_SYMBOL(add_to_page_cache);
EXPORT_SYMBOL(default_llseek);
EXPORT_SYMBOL(filemap_nopage);
EXPORT_SYMBOL(filemap_swapout);
EXPORT_SYMBOL(filemap_sync);
EXPORT_SYMBOL(remove_inode_page);
EXPORT_SYMBOL(swap_free);
EXPORT_SYMBOL(nr_lru_pages);
EXPORT_SYMBOL(console_loglevel);

(6) mm/filemap.c: make the function filemap_nopage non-static, so it can be
called from other places.  This was not an inline function so there's no
performance impact.
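
The payoff of items (3)-(6): a stackable f/s that passes mmap straight
through to the page cache can reuse the generic handlers instead of
keeping private copies of them.  A hypothetical module (the names below
are made up) would simply do:

/* Reuse the now-visible generic handlers in a module's vma ops
 * (gcc labeled-initializer syntax, as used elsewhere in 2.3). */
static struct vm_operations_struct wrapfs_shared_vm_ops = {
	nopage:		filemap_nopage,
	swapout:	filemap_swapout,
	sync:		filemap_sync,
};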


Re: Ext2 / VFS projects

2000-02-11 Thread Stephen C. Tweedie

Hi,

On Thu, 10 Feb 2000 10:27:29 -0500 (EST), Alexander Viro
<[EMAIL PROTECTED]> said:

> Correct, but that's going to make design much more complex - you really
> don't want to do it for anything other than sub-page stuff (probably even
> sub-sector). Which leads to 3 levels - allocation block/IO block/sub-sector
> fragment. Not to mention the fact that for cases when you have 1K
> fragments and really large blocks you don't want all this mess around...
> It's doable, indeed, but...

Sure, but to me the main question is this --- can we do this sort of
fragment support in ext3 without having to add complexity to the rest of
the VM/VFS?  I think the answer is yes.

--Stephen



Re: Ext2 / VFS projects

2000-02-11 Thread Tigran Aivazian

I noticed the stackable fs item on Alan's list ages ago but there was no
pointer to the patch (I noticed FIST stuff but surely that is not a "small
passive patch" you are referring to?)

So, my point is - if you point everyone to those patches, someone might
help Alan out if one feels like it (and has time).

Regards,
Tigran.

On Thu, 10 Feb 2000, Erez Zadok wrote:
> Also, I really hope that my remaining (small, passive) patches to the VFS to
> support stackable file systems will be incorporated soon.



Re: Ext2 / VFS projects

2000-02-10 Thread Erez Zadok

In message <[EMAIL PROTECTED]>, Matthew Wilcox writes:
> 
> Greetings.  Ted Ts'o recently hosted an ext2 puffinfest where we
> discussed the future of the VFS and ext2.  Ben LaHaise, Phil Schwan,
[...]

Also, I really hope that my remaining (small, passive) patches to the VFS to
support stackable file systems will be incorporated soon.

Cheers,
Erez.



Re: Ext2 / VFS projects

2000-02-10 Thread Stephen C. Tweedie

Hi,

On Wed, 09 Feb 2000 11:31:03 -0500, Matthew Wilcox <[EMAIL PROTECTED]>
said:

> fine-grained locking
>   [remove test_and_set_bit()]

The critical one here is the superblock lock.

--Stephen



Re: Ext2 / VFS projects

2000-02-10 Thread Alexander Viro



On Thu, 10 Feb 2000, Stephen C. Tweedie wrote:

> That shouldn't matter.  In the new VM it would be pretty trivial for the
> filesystem to reserve a separate address_space against which to cache
> fragment blocks.  Populating that address_space when we want to read a
> fragment block doesn't have to be any more complex than populating the
> page cache already is.  IO itself shouldn't be hard.

Correct, but that's going to make design much more complex - you really
don't want to do it for anything other than sub-page stuff (probably even
sub-sector). Which leads to 3 levels - allocation block/IO block/sub-sector
fragment. Not to mention the fact that for cases when you have 1K
fragments and really large blocks you don't want all this mess around...
It's doable, indeed, but...



Re: Ext2 / VFS projects

2000-02-10 Thread Stephen C. Tweedie

Hi,

On Wed, 9 Feb 2000 14:30:13 -0500 (EST), Alexander Viro
<[EMAIL PROTECTED]> said:

> On Wed, 9 Feb 2000 [EMAIL PROTECTED] wrote:

>> with 2k blocks and 128 byte fragments, we get to really reduce wasted
>> space below any other system i've ever experienced.

> Erm... I'm afraid that you are missing the point. You will get the
> hardware sectors shared between the files. And you can't pass requess
> smaller than that. _And_ you have to lock the bh when you do IO. Now,
> estimate the fun with deadlocks...

That shouldn't matter.  In the new VM it would be pretty trivial for the
filesystem to reserve a separate address_space against which to cache
fragment blocks.  Populating that address_space when we want to read a
fragment block doesn't have to be any more complex than populating the
page cache already is.  IO itself shouldn't be hard.

Yes, this will end up double-caching fragmented files to some extent,
since we'll have to reserve a separate, non-physically-mapped page for
the tail of a fragmented file.

Allocation/deallocation of fragments themselves obviously has to be done
very carefully, but we already have to deal with that sort of race in
the filesystem for normal allocations --- this isn't really any
different in principle.

--Stephen



Re: Ext2 / VFS projects

2000-02-09 Thread Andi Kleen

On Thu, Feb 10, 2000 at 03:04:53AM +0100, Jeremy Fitzhardinge wrote:
> 
> On 09-Feb-00 Andi Kleen wrote:
> > On Wed, Feb 09, 2000 at 05:35:00PM +0100, Matthew Wilcox wrote:
> > 
> > [...]
> > 
> > How about secure deletion? 
> > 
> > 1.3 used to have some simple-minded overwriting of deleted data when the 
> > 's' attribute was set.  That got lost with 2.0+. 
> > 
> > Secure disk overwriting that is immune to 
> > manual surface probing seems to take a lot more effort  (Colin Plumb's 
> > sterilize does 25 overwrites with varying bit patterns). Such a complicated
> > procedure is probably better kept in user space. What I would like is some
> > way to have a sterilize daemon running, and when a file with the 's'
> > attribute gets deleted the VFS would open a new file descriptor for it,
> > pass it to sterilized (sterild?) using a unix control message and let it
> > do its job.
> > 
> > What does the audience think? Should such a facility have kernel support
> > or not?  I think secure deletion is an interesting topic and it would be
> > nice if Linux supported it better.
> 
> You have to be careful that you don't leak the file you're trying to eradicate
> into the swap via the sterilize daemon.  I guess simply never reading the file
> is a good start.

sterilize does that. You of course have to be careful that you didn't leak
its content to swap before (one way around that is encrypted swapping).

> 
> The other question is whether you're talking about an ext2-specific thing, or
> whether its a general service all filesystems provide.  Many filesystem

I was actually only thinking about ext2 (because only it has an 's' bit
and the thread is about ext2's future).

> designs, including ext3 w/ journalling, reiserfs(?) and the NetApp Wafl
> filesystem, don't let a process overwrite an existing block on disk.  Well,
> ext3 does, but only via the journal; wafl never does.  There's also the
> question of what happens when you have a RAID device under the filesystem,
> especially with hot-swappable disks.

reiserfs lets you overwrite when you don't change the file size (if you do,
it is possible that the file is migrated from a formatted node to an
unformatted node). sterilize does not change file sizes.

ext3 only prevents it when you do data journaling (good point, I forgot
that).

RAID0/RAID1 are no problem I think, because you always have well-defined
block(s) to write to. The wipe data does not depend on the old data on
the disk, so e.g. on a simple mirrored configuration both blocks would
be sterilized in parallel.

RAID5 devices could be a problem, especially when they do data journaling
(I think most only journal some metadata). It is not clear how the
sterilize algorithms interact with the XORed parity blocks.

If you swap your disks in between, you lose.

> 
> Perhaps a better approach, since we're talking about a privileged process, is
> to get a list of raw blocks and go directly to the disk.  You'd have to be very
> careful to synchronize with the filesystem...

Not too much. The file still exists, but there are no references to it
outside sterild. No other process can access it. Assuming the file system
does not have fragments, the raw io has block granularity, and the file was
fdatasync'ed beforehand, you could directly access it without worrying about
any file system interference. If the fs has fragments you need the
infrastructure needed for O_DIRECT (I think that is planned anyway).
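
For the "list of raw blocks" part the FIBMAP ioctl already exists (root
only); a minimal userspace sketch:

#include <sys/ioctl.h>
#include <linux/fs.h>   /* FIBMAP */

/* Map logical block 'lblock' of an open file to the physical f/s
 * block it lives in (0 means a hole / nothing mapped). */
static long physical_block(int fd, int lblock)
{
	int blk = lblock;

	if (ioctl(fd, FIBMAP, &blk) < 0)
		return -1;
	return blk;
}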

With a "invalidate all dirty buffers for file X" call you could optimize
part of the fdatasync writes away, but a good sterilize needs so many 
writes anyways (25+) that it probably does not make much difference. 

The data would only really be gone when the system is turned off, because
parts of it could still exist in some not-yet-reused buffers.


-Andi



Re: Ext2 / VFS projects

2000-02-09 Thread Jeremy Fitzhardinge


On 09-Feb-00 Andi Kleen wrote:
> On Wed, Feb 09, 2000 at 05:35:00PM +0100, Matthew Wilcox wrote:
> 
> [...]
> 
> How about secure deletion? 
> 
> 1.3 used to have some simple-minded overwriting of deleted data when the 
> 's' attribute was set.  That got lost with 2.0+. 
> 
> Secure disk overwriting that is immune to 
> manual surface probing seems to take a lot more effort  (Colin Plumb's 
> sterilize does 25 overwrites with varying bit patterns). Such a complicated
> procedure is probably better kept in user space. What I would like is some
> way to have a sterilize daemon running, and when a file with the 's'
> attribute gets deleted the VFS would open a new file descriptor for it,
> pass it to sterilized (sterild?) using a unix control message and let it
> do its job.
> 
> What does the audience think? Should such a facility have kernel support
> or not?  I think secure deletion is an interesting topic and it would be
> nice if Linux supported it better.

You have to be careful that you don't leak the file you're trying to eradicate
into the swap via the sterilize daemon.  I guess simply never reading the file
is a good start.

The other question is whether you're talking about an ext2-specific thing, or
whether it's a general service all filesystems provide.  Many filesystem
designs, including ext3 w/ journalling, reiserfs(?) and the NetApp Wafl
filesystem, don't let a process overwrite an existing block on disk.  Well,
ext3 does, but only via the journal; wafl never does.  There's also the
question of what happens when you have a RAID device under the filesystem,
especially with hot-swappable disks.

Perhaps a better approach, since we're talking about a privileged process, is
to get a list of raw blocks and go directly to the disk.  You'd have to be very
careful to synchronize with the filesystem...

J



Re: Ext2 / VFS projects

2000-02-09 Thread Benjamin C.R. LaHaise

On Wed, 9 Feb 2000 [EMAIL PROTECTED] wrote:

> This requires Ben's work to decouple the ext2 allocation size from the
> hardware page size.  We would _always_ want to write out a fragment block
> as one to ensure that the fragment descriptor wasn't at odds with the
> contents of the block.  Imagine the descriptor not being written out after
> the block was compacted.

My initial plan is to decouple the allocation size from the hardware page
size only for the cases where the allocation size is larger than the
physical block size of the disk.  Going beyond that is non-trivial, but
doable.  It may only be interesting for e2compr, since large blocks with
512 byte fragments will rock.

-ben



Re: Ext2 / VFS projects

2000-02-09 Thread Andi Kleen

On Wed, Feb 09, 2000 at 05:35:00PM +0100, Matthew Wilcox wrote:

[...]

How about secure deletion? 

1.3 used to have some simple-minded overwriting of deleted data when the 
's' attribute was set.  That got lost with 2.0+. 

Secure disk overwriting that is immune to 
manual surface probing seems to take a lot more effort  (Colin Plumb's 
sterilize does 25 overwrites with varying bit patterns). Such a complicated
procedure is probably better kept in user space. What I would like is some
way to have a sterilize daemon running, and when a file with the 's'
attribute gets deleted the VFS would open a new file descriptor for it,
pass it to sterilized (sterild?) using a unix control message and let it
do its job.
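
The userspace half of that is all standard stuff.  Here is a minimal
sketch of a sterild receiving the descriptor (plain SCM_RIGHTS passing)
and doing one overwrite pass; the kernel-side sender is the only piece
that would be new:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>
#include <unistd.h>

/* Receive one file descriptor over a unix domain socket. */
static int recv_fd(int sock)
{
	struct msghdr msg;
	struct iovec iov;
	struct cmsghdr *cmsg;
	char cbuf[CMSG_SPACE(sizeof(int))];
	char dummy;
	int fd = -1;

	memset(&msg, 0, sizeof(msg));
	iov.iov_base = &dummy;
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);
	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg && cmsg->cmsg_level == SOL_SOCKET
	    && cmsg->cmsg_type == SCM_RIGHTS)
		memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* One pass: overwrite the whole file with a pattern, then flush. */
static int wipe_pass(int fd, off_t size, unsigned char pattern)
{
	unsigned char buf[4096];
	off_t done = 0;

	memset(buf, pattern, sizeof(buf));
	if (lseek(fd, 0, SEEK_SET) == (off_t) -1)
		return -1;
	while (done < size) {
		size_t n = sizeof(buf);

		if (size - done < (off_t) n)
			n = (size_t) (size - done);
		if (write(fd, buf, n) != (ssize_t) n)
			return -1;
		done += n;
	}
	return fdatasync(fd);
}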

What does the audience think? Should such a facility have kernel support
or not?  I think secure deletion is an interesting topic and it would be
nice if Linux supported it better.

sterilize also does some tricks to overwrite entries in directories, but
I see no easy way to make that fit into the kernel. 

Comments? 

-Andi
-- 
This is like TV. I don't like TV.



Re: Ext2 / VFS projects

2000-02-09 Thread willy

On Wed, Feb 09, 2000 at 02:30:13PM -0500, Alexander Viro wrote:
> 
> 
> On Wed, 9 Feb 2000 [EMAIL PROTECTED] wrote:
> 
> > with 2k blocks and 128 byte fragments, we get to really reduce wasted
> > space below any other system i've ever experienced.
> 
> Erm... I'm afraid that you are missing the point. You will get the
> hardware sectors shared between the files. And you can't pass requests
> smaller than that. _And_ you have to lock the bh when you do IO. Now,
> estimate the fun with deadlocks...

This requires Ben's work to decouple the ext2 allocation size from the
hardware page size.  We would _always_ want to write out a fragment block
as one to ensure that the fragment descriptor wasn't at odds with the
contents of the block.  Imagine the descriptor not being written out after
the block was compacted.



Re: Ext2 / VFS projects

2000-02-09 Thread Theodore Y. Ts'o

   Date:   Wed, 9 Feb 2000 13:25:23 -0500
   From: [EMAIL PROTECTED]

   Yes.  Based on my misunderstanding of how BSD did fragments, Ted and I
   came up with an interestingly different way of doing fragments.  Here's
   the basic idea:

   Keep the bitmaps of _blocks_ instead of fragments and use a second
   bitmap to identify which blocks are non-full fragment blocks.  Use the
   last fragment in a fragment block as a fragment block descriptor to
   indicate which inode each fragment belongs to (exact format still to
   be determined).  Fragment blocks are kept compact as otherwise we would
   have to deal with internal fragmentation.

The further refinement of this plan is to always keep the fragments
compacted, and then use an indirection table.  So instead of storing the
fragment address in the inode, we store an index into a fragment
location table to find the fragment.  This makes it trivial to pack the
fragments in the block to avoid internal fragmentation.

Then you don't need a bitmap to keep track of the fragment allocation;
you just need a single entry in the administrative fragment block to
point at the next free fragment.
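
Purely to make the idea concrete (the exact format is still to be
determined, so every name and field width below is invented), the
descriptor fragment could look like:

#include <linux/types.h>

#define FRAGS_PER_BLOCK	16	/* e.g. 2k blocks, 128-byte fragments */

/* Lives in the last fragment of a fragment block.  Fragments stay
 * packed at the front, so fd_free is both the next free slot and the
 * count; inodes store an index into fd_slot[], so compacting the block
 * only updates this table, never the inodes. */
struct frag_block_desc {
	__u16	fd_free;			/* next free fragment slot */
	__u16	fd_slot[FRAGS_PER_BLOCK - 1];	/* index -> current slot */
	__u32	fd_inode[FRAGS_PER_BLOCK - 1];	/* owning inode per index */
};					/* 92 bytes: fits in one fragment */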

Note that the idea here is to set the block size to the maximum ideal
transfer size for disks.  For modern disks, that's probably something
like 64k or 128k.  (i.e., it doesn't take much more time to read 64k
compared to 1k).

The one downside of this plan is that when you delete a file with a
tail, you have to do an extra block read/write to update the allocation
information in the fragment block.  In the BSD scheme, you just have to
update the allocation bitmap.  This does slow deletions by a small
amount, but that might not be that big of an issue.

The reason why we were considering this sort of thing is that as the
difference between the fragment and block size grows, the potential
problem of internal fragmentation becomes a real issue.

- Ted



Re: Ext2 / VFS projects

2000-02-09 Thread Matti Aarnio

   Lately I have been engaged in other activities, and haven't
   had time to check up on VFS layer happenings.

On Wed, Feb 09, 2000 at 11:31:03AM -0500, Matthew Wilcox wrote:
> Greetings.  Ted Ts'o recently hosted an ext2 puffinfest where we
> discussed the future of the VFS and ext2.  Ben LaHaise, Phil Schwan,
...
   Add  pathconf()  to the VFS.  Right now the peeks I have had at the
   2.3 series do show that people do WRONG things with the O_LARGEFILE
   flag bit per what the LFS semantics are telling.

   The filesystem must be able to pass to the VFS what capabilities a
   given file/directory has -- like can file sizes exceeding 2G be
   used at all...  (EXT2, UFS, NFSv3 can, MINIX et al. can't..)
   (And the filename sizes supported in directories, and...)

   These don't look right even at egrep terseness:  (2.3.42)

[root@mea linux]# egrep O_LARGEFILE $cc
./fs/open.c:flags |= O_LARGEFILE;
./fs/ext2/file.c: * the caller didn't specify O_LARGEFILE.  On 64bit systems we force
./fs/ext2/file.c:   if (inode->u.ext2_i.i_high_size && !(filp->f_flags & O_LARGEFILE))
./fs/udf/file.c: *  On 64 bit systems we force on O_LARGEFILE in sys_open.
./fs/udf/file.c:if ((inode->i_size & 0xUL) && !(filp->f_flags & O_LARGEFILE))
./arch/sparc64/kernel/sys_sparc32.c: * not force O_LARGEFILE on.
./arch/sparc64/solaris/fs.c:if (flags & 0x2000) fl |= O_LARGEFILE;

   The limit on 32-bit systems is 2G, not 4G, and NO kernel-space system
   shall (aside from the sys_open64() syscall) set that flag.  (Which I
   think the sparc64/solaris thing does.)

   The tests of file open at EXT2 and UDF (?!) should, I think, be
   conditionalized under a wrapper of:
 #if BITS_PER_LONG == 32
  ...
 #endif
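
   I.e., taking the ext2 line from the grep above, something like this
   (a sketch; I have not checked the surrounding function):

static int ext2_open_file(struct inode *inode, struct file *filp)
{
#if BITS_PER_LONG == 32
	/* refuse >2G files unless the opener asked for O_LARGEFILE */
	if (inode->u.ext2_i.i_high_size && !(filp->f_flags & O_LARGEFILE))
		return -EFBIG;
#endif
	return 0;
}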

   Sigh, so much to do, so little time for kernel hacking...

/Matti Aarnio <[EMAIL PROTECTED]>



Re: Ext2 / VFS projects

2000-02-09 Thread Alexander Viro



On Wed, 9 Feb 2000 [EMAIL PROTECTED] wrote:

> with 2k blocks and 128 byte fragments, we get to really reduce wasted
> space below any other system i've ever experienced.

Erm... I'm afraid that you are missing the point. You will get the
hardware sectors shared between the files. And you can't pass requests
smaller than that. _And_ you have to lock the bh when you do IO. Now,
estimate the fun with deadlocks...



Re: Ext2 / VFS projects

2000-02-09 Thread willy

On Wed, Feb 09, 2000 at 12:02:46PM -0500, Alexander Viro wrote:
> > 256k blocks with 1k fragments (viro)
> > [also 8k blocks with 256-byte fragments!]
>^^^
> HUH??? You want to deal with sharing sectors between the different files?

Yes.  Based on my misunderstanding of how BSD did fragments, Ted and I
came up with an interestingly different way of doing fragments.  Here's
the basic idea:

Keep the bitmaps of _blocks_ instead of fragments and use a second
bitmap to identify which blocks are non-full fragment blocks.  Use the
last fragment in a fragment block as a fragment block descriptor to
indicate which inode each fragment belongs to (exact format still to
be determined).  Fragment blocks are kept compact as otherwise we would
have to deal with internal fragmentation.

with 256k blocks, we get to grow a block group to 16GB (probably excessive
with today's discs) and then we don't have any further problems with
not enough free space.

with 2k blocks and 128 byte fragments, we get to really reduce wasted
space below any other system i've ever experienced.




Re: Ext2 / VFS projects

2000-02-09 Thread Alexander Viro



On Wed, 9 Feb 2000, Jeff Garzik wrote:

> al viro was kind enough to e-mail me some of his thoughts on banishing
> the big kernel lock from the VFS.  Though my time with the VFS has been
> nil in the past few months, I'd still like to work on this if no one
> beats me to it.
> 
> IIRC the two big items are dcache/dentry and inode threading.

s/inode/friggin' POSIX locks shit/

And it's a bit of a different story - ext2fs needs some serialization of
its own; after all it's got some internal structures ;-)

> Any and all ideas for online defrag, please post.  I'm very interested.

See below.
 
> > delayed allocation
> 
> this needs to be in the VFS desperately.  every new & advanced
> filesystem is winding up implementing its own logic for this...
> 
> > Address spaces (viro)
> 
> can someone elaborate?

Urgh. It's a long(ish) story. Basically, we are getting address_space
methods. It removes ->readpage/->writepage/->get_block out of
inode_operations, BTW. What it means: we are getting rid of a lot of code
duplication (data semantics; as in normal block fs vs. no-holes fs vs.
extent-based with holes vs. fragments-handling a-la FFS vs. fs with small
files embeddable into inodes vs. ...). Address_space is an MMU. This way
they become separated from filesystems proper (i.e. layout, etc.). If you
want a more coherent description - ask and I'll write it. The latest version
of my patch sits on ftp.math.psu.edu/pub/viro/as-patch-26z2 (warning:
needs testing).
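
Roughly, the new vector looks like this (from memory, so the exact field
list in the current version of the patch may differ):

struct address_space_operations {
	int (*writepage)(struct file *, struct page *);
	int (*readpage)(struct file *, struct page *);
	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
	int (*bmap)(struct address_space *, long);
};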



Re: Ext2 / VFS projects

2000-02-09 Thread Alexander Viro



On Wed, 9 Feb 2000, Tigran Aivazian wrote:

> On Wed, 9 Feb 2000, Matthew Wilcox wrote:
> > fix posix fcntl locks (willy)
> 
> Very interesting. I haven't checked recently but which part of POSIX fcntl
> locks is broken?

Take a _large_[1] barf-bag and read posix_locks_deadlock(), for one. Could
you spell "totally inadequate data structures"?

[1] You'll need it. Don't complain about the ruined keyboard - you've been
warned.



Re: Ext2 / VFS projects

2000-02-09 Thread Tigran Aivazian

On Wed, 9 Feb 2000, Matthew Wilcox wrote:
> fix posix fcntl locks (willy)

Very interesting. I haven't checked recently but which part of POSIX fcntl
locks is broken?

Tigran.



Re: Ext2 / VFS projects

2000-02-09 Thread Alexander Viro



On Wed, 9 Feb 2000, Matthew Wilcox wrote:

> 2.4:
> Collapsed indirect blocks [readonly] (willy)
> 
> 2.5:
> Journalling (sct)
> Access to low-numbered inodes
> Dynamic inode tables (phil)
> Btree directories (phil)
> ext2 allocation page size greater than cpu page size (bcrl)
> 256k blocks with 1k fragments (viro)
>   [also 8k blocks with 256-byte fragments!]
 ^^^
HUH??? You want to deal with sharing sectors between the different files?

> fine-grained locking
>   [remove test_and_set_bit()]
> e2compr

requires VM modifications (and basically nothing else).

> nounlink attribute flag

may go into 2.4.early, FWIC.

> Some interesting mm/fs projects:
> 
> 2.4:
> Move directories to page cache (bcrl)
> Address spaces (viro)

Variant submitted to Linus.

> Move fhandle <-> dentry conversion functions to VFS (viro)
> move silly_rename to VFS (viro)
> Investigate buddy allocator algorithms with more interesting properties (bcrl)
> bdflush may need tuning (bcrl)
> 
> 2.5:
> Removal of buffer heads (bcrl)
> fix posix fcntl locks (willy)
> sort out interface to block devices (viro)

I still hope to get at least the interface parts into 2.4...

Other 2.4.early stuff:
caching ext2_find_entry() results in dentry (patch exists,
obviously correct and well-tested).
caching the position of last lookup and doing cyclic lookups from
that place (literal copying from VFAT, where it worked since the early
Summer; it's an old BSD optimization).
pre-alloc for directories (yup, right now it's _off_).

Mandatory 2.3.late stuff:
serialization between truncate and write. There are races...



Re: Ext2 / VFS projects

2000-02-09 Thread Jeff Garzik

Caveat reader:  With the exception of procfs stuff in 2.3.x, most of my
VFS participation thus far has been of the "I want to work on this when
I get time" sort.  ;-)


First, my request:

Add an fcntl flag, O_NOCACHE (or O_DIRECT, unimplemented) which allows
an app to hint that it does not want the OS to cache this data.  This
will be a _big_ win for servers and desktops where large multimedia or
database files are slung around.



Matthew Wilcox wrote:
> Btree directories (phil)

I hope these are _not_ pure binary search trees but rather a smarter
ADT...


> Backup inode table

interesting idea

> fine-grained locking

al viro was kind enough to e-mail me some of his thoughts on banishing
the big kernel lock from the VFS.  Though my time with the VFS has been
nil in the past few months, I'd still like to work on this if no one
beats me to it.

IIRC the two big items are dcache/dentry and inode threading.


> Online defragmentation & size

Has there been any substantive discussion about online defragmentation?

I think it is a wholly separate, and more interesting issue than resize
(which will be solved in the future with LVMs, IMHO...)

For online defrag, there are tons of different scenarios and heuristics
which can be employed to optimize for various situations:
* move frequently-accessed files to the middle of the disk (requires
knowledge of physical disk organization, below the partition layer)
* group files together on disk in directories, with gaps of free space
in between for placement of files "near" other files in the same
directory
* options to pack files into inodes (if possible and supported by fs) or
to fragment small files, to conserve space
* dozens of heuristics.  if online defrag is in userspace, admins can
even craft their own disk optimization rules.

Kernel changes?  Short term, the easiest implementation will be
in-kernel.  Long term, I would like to see (if possible) a set of
generalized ioctls which allow a userspace program to contain the bulk
of the defragmenting/disk optimization logic.

Any and all ideas for online defrag, please post.  I'm very interested.



> delayed allocation

this needs to be in the VFS desperately.  every new & advanced
filesystem is winding up implementing its own logic for this...


> Address spaces (viro)

can someone elaborate?


> sort out interface to block devices (viro)

mostly done?


-- 
Jeff Garzik | Only so many songs can be sung
Building 1024   | with two lips, two lungs, and
MandrakeSoft, Inc.  | one tongue.