Re: Ext2 / VFS projects

2000-02-11 Thread Tigran Aivazian

I noticed the stackable fs item on Alan's list ages ago, but there was no
pointer to the patch (I noticed the FIST stuff, but surely that is not the
"small, passive patch" you are referring to?)

So, my point is - if you point everyone to those patches, someone might
help Alan out if they feel like it (and have time).

Regards,
Tigran.

On Thu, 10 Feb 2000, Erez Zadok wrote:
 Also, I really hope that my remaining (small, passive) patches to the VFS to
 support stackable file systems will be incorporated soon.



Re: Ext2 / VFS projects

2000-02-11 Thread Stephen C. Tweedie

Hi,

On Thu, 10 Feb 2000 10:27:29 -0500 (EST), Alexander Viro
[EMAIL PROTECTED] said:

 Correct, but that's going to make design much more complex - you really
 don't want to do it for anything other than sub-page stuff (probably even
 sub-sector). Which leads to 3 levels - allocation block/IO block/sub-sector
 fragment. Not to mention the fact that for cases when you have 1K
 fragments and really large blocks you don't want all this mess around...
 It's doable, indeed, but...

Sure, but to me the main question is this --- can we do this sort of
fragment support in ext3 without having to add complexity to the rest of
the VM/VFS?  I think the answer is yes.

--Stephen



Re: Ext2 / VFS projects

2000-02-11 Thread Erez Zadok

In message [EMAIL PROTECTED], Tigran 
Aivazian writes:
 I noticed the stackable fs item on Alan's list ages ago but there was no
 pointer to the patch (I noticed FIST stuff but surely that is not a "small
 passive patch" you are referring to?)

Yes the patches are small and passive.  No new vfs/mm code is added or
changed!  The most important part of my patches had already been included
since 2.3.17; that was an addition/renaming of a private field in struct
vm_area_struct.  What's left are things that are necessary to support
stacking for the first time in linux: exposing some functions/symbols from
{mm,fs}/*.c, adding externs to headers, additions to ksyms.c, and moving
some macros and inline functions from private .c files to a header, so they
can be included in any file system.

I've used these patches on dozens of linux machines for the past 2+ years,
and have had no problems.  I constantly get people asking me when my patches
will become part of the main kernel.  I have about 9 active developers who
write file systems using my templates.  I've had more than 21,000 downloads
of my templates in the past two years.

 So, my point is - if you point everyone to those patches, someone might
 help Alan out if one feels like it (and has time).

http://www.cs.columbia.edu/~ezk/research/software/fist-patches/

The latest 2.3 patches at that URL include two things: my small main kernel
patches, and a fully working lofs.  The lofs is of course several thousand
lines of code, but it is not strictly necessary to include it with the main
kernel; it can be distributed and built separately, just as my other f/s
modules are.  However, I do think that lofs is a useful enough f/s that it
should be part of the main kernel.

If you go to the 2.3 directory under the above URL, there's a README
describing the latest 2.3 patches.  I've included it below, so everyone can
read it and see what my patches do, and how harmless they are.

BTW, I've got a prototype unionfs for linux if anyone is interested.

 Regards,
 Tigran.

As always, I'll be delighted to help *anyone* use my work, and would love to
help the linux maintainers incorporate my patches, answer any concerns they
might have, etc.

Cheers,
Erez.

==
Summary of changes for 2.3.25 to support stackable file systems and lofs.

(Note: some of my previous patches had been incorporated in 2.3.17.)

(1) Created a new header file include/linux/dcache_func.h.  This header file
contains dcache-related definitions (mostly static inlines) used by my
stacking code and by fs/namei.c.  Ion and I tried to put these
definitions in fs.h and dcache.h to no avail.  We would have to make
lots of changes to fs.h or dcache.h and other .c files just to get these
few definitions in.  In the interest of simplicity and minimizing kernel
changes, we opted for a new, small header file.  This header file is
included in fs/namei.c because everything in dcache_func.h was taken
from fs/namei.c.  And of course, these static inlines are useful for my
stacking code.

If you don't like the name dcache_func.h, maybe you can suggest a better
name.  Maybe namei.h?

If you don't like having a new header file, let me know what you'd
prefer instead and I'll work on it, even if it means making more changes
to fs.h, namei.c, and dcache.h...

(2) Inline functions moved from linux/{fs,mm}/*.c to header files, so they
can be shared by the original source files as well as by stackable file
systems (a rough reconstruction of two of these helpers follows the list):

check_parent macro (fs/namei.c -> include/linux/dcache_func.h)
lock_parent (fs/namei.c -> include/linux/dcache_func.h)
get_parent (fs/namei.c -> include/linux/dcache_func.h)
unlock_dir (fs/namei.c -> include/linux/dcache_func.h)
double_lock (fs/namei.c -> include/linux/dcache_func.h)
double_unlock (fs/namei.c -> include/linux/dcache_func.h)
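
For anyone who doesn't have fs/namei.c handy, here is a rough
reconstruction of two of these helpers as they looked around that time;
treat it as a sketch, the real header may differ in detail:

#include <linux/dcache.h>
#include <linux/fs.h>

/* Lock a dentry's parent directory for modification; i_sem is a
 * semaphore in these kernels.  Caller must later call unlock_dir(). */
static inline struct dentry *lock_parent(struct dentry *dentry)
{
	struct dentry *dir = dget(dentry->d_parent);

	down(&dir->d_inode->i_sem);
	return dir;
}

/* Undo lock_parent(): release the semaphore and drop the reference. */
static inline void unlock_dir(struct dentry *dir)
{
	up(&dir->d_inode->i_sem);
	dput(dir);
}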

(3) Added to include/linux/fs.h an extern declaration for default_llseek.

(4) include/linux/mm.h: also added extern declarations for

filemap_swapout
filemap_swapin
filemap_sync
filemap_nopage

so they can be referenced from other code (esp. stackable f/s modules).

(5) added EXPORT_SYMBOL declarations in kernel/ksyms.c for functions which
are now exposed to (stackable f/s) modules:

EXPORT_SYMBOL(___wait_on_page);
EXPORT_SYMBOL(add_to_page_cache);
EXPORT_SYMBOL(default_llseek);
EXPORT_SYMBOL(filemap_nopage);
EXPORT_SYMBOL(filemap_swapout);
EXPORT_SYMBOL(filemap_sync);
EXPORT_SYMBOL(remove_inode_page);
EXPORT_SYMBOL(swap_free);
EXPORT_SYMBOL(nr_lru_pages);
EXPORT_SYMBOL(console_loglevel);
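
The overall pattern behind items (3) through (6) is simple; illustrated
here with a hypothetical stand-in (my_helper is not one of the functions
listed above):

/* mm/filemap.c: drop `static` so the symbol is visible kernel-wide */
int my_helper(int arg)
{
	return arg;
}

/* include/linux/mm.h: declare it for other compilation units */
extern int my_helper(int arg);

/* kernel/ksyms.c: export it so loadable (stackable f/s) modules can
 * resolve it at insmod time */
EXPORT_SYMBOL(my_helper);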

(6) mm/filemap.c: make the function filemap_nopage non-static, so it can be
called from other places.  This was not an inline function so there's no
performance impact.

ditto for 

Re: Ext2 / VFS projects

2000-02-11 Thread Erez Zadok

In message [EMAIL PROTECTED], Manfred Spraul writes:
 Erez Zadok wrote:
  [...]
  (2) Inline functions moved from linux/{fs,mm}/*.c to header files, so they
  can be shared by the original source files as well as by stackable file
  systems:
  
  check_parent macro (fs/namei.c -> include/linux/dcache_func.h)
  lock_parent (fs/namei.c -> include/linux/dcache_func.h)
  get_parent (fs/namei.c -> include/linux/dcache_func.h)
  unlock_dir (fs/namei.c -> include/linux/dcache_func.h)
  double_lock (fs/namei.c -> include/linux/dcache_func.h)
  double_unlock (fs/namei.c -> include/linux/dcache_func.h)
  
 That sounds like a good idea: fs/nfsd/vfs.c currently contains copies of
 most of these functions...

I agree.  I didn't want to make copies of those b/c I got burnt in the past
when they changed subtly and I didn't notice the change.

 --
   Manfred

Erez.



Re: Ext2 / VFS projects

2000-02-10 Thread Stephen C. Tweedie

Hi,

On Wed, 9 Feb 2000 14:30:13 -0500 (EST), Alexander Viro
[EMAIL PROTECTED] said:

 On Wed, 9 Feb 2000 [EMAIL PROTECTED] wrote:

 with 2k blocks and 128 byte fragments, we get to really reduce wasted
 space below any other system i've ever experienced.

 Erm... I'm afraid that you are missing the point. You will get the
 hardware sectors shared between the files. And you can't pass requests
 smaller than that. _And_ you have to lock the bh when you do IO. Now,
 estimate the fun with deadlocks...

That shouldn't matter.  In the new VM it would be pretty trivial for the
filesystem to reserve a separate address_space against which to cache
fragment blocks.  Populating that address_space when we want to read a
fragment block doesn't have to be any more complex than populating the
page cache already is.  IO itself shouldn't be hard.
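
For concreteness, one shape this could take (purely a sketch with
invented names, not code from any actual patch):

/* A per-superblock "fragment cache": an anonymous inode whose
 * address_space caches fragment blocks, indexed by physical block
 * number, separately from every file's own page cache. */
struct frag_cache {
	struct inode *inode;	/* its i_mapping holds fragment blocks */
};

/* Reading fragment block `blocknr` is then ordinary page-cache
 * population against cache->inode->i_mapping (e.g. read_cache_page()
 * with a filler that issues the block I/O) -- no new VM machinery,
 * which is exactly the point. */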

Yes, this will end up double-caching fragmented files to some extent,
since we'll have to reserve a separate, non-physically-mapped page for
the tail of a fragmented file.

Allocation/deallocation of fragments themselves obviously has to be done
very carefully, but we already have to deal with that sort of race in
the filesystem for normal allocations --- this isn't really any
different in principle.

--Stephen



Re: Ext2 / VFS projects

2000-02-09 Thread Jeff Garzik

Caveat reader:  With the exception of procfs stuff in 2.3.x, most of my
VFS participation thus far has been of the "I want to work on this when
I get time" sort of participation.  ;-)


First, my request:

Add an fcntl flag, O_NOCACHE (or O_DIRECT, currently unimplemented), which
lets an app hint that it does not want the OS to cache this data.  This
will be a _big_ win for servers and desktops where large multimedia or
database files are slung around.
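
From the app's side it would look something like this (hypothetical, of
course, since the flag doesn't exist yet):

#include <fcntl.h>

/* Hint that fd's data should bypass the page cache. */
int set_nocache(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NOCACHE);
}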



Matthew Wilcox wrote:
 Btree directories (phil)

I hope these are _not_ pure binary search trees but rather a smarter
ADT...


 Backup inode table

interesting idea

 fine-grained locking

al viro was kind enough to e-mail me some of his thoughts on banishing
the big kernel lock from the VFS.  Though my time with the VFS has been
nil in the past few months, I'd still like to work on this if no one
beats me to it.

IIRC the two big items are dcache/dentry and inode threading.


 Online defragmentation & resize

Has there been any substantive discussion about online defragmentation?

I think it is a wholly separate, and more interesting issue than resize
(which will be solved in the future with LVMs, IMHO...)

For online defrag, there are tons of different scenarios and heuristics
which can be employed to optimize for various situations:
* move frequently-accessed files to the middle of the disk (requires
knowledge of physical disk organization, below the partition layer)
* group files together on disk in directories, with gaps of free space
in between for placement of files "near" other files in the same
directory
* options to pack files into inodes (if possible and supported by fs) or
to fragment small files, to conserve space
* dozens of heuristics.  if online defrag is in userspace, admins can
even craft their own disk optimization rules.

Kernel changes:  Short term, the easiest implementation will be
in-kernel.  Long term, I would like to see (if possible) a set of
generalized ioctls which allow a userspace program to contain the bulk
of the defragmenting/disk optimization logic.
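
Something like the following pair, say (a hypothetical sketch -- neither
ioctl exists; it only illustrates the kernel/userspace split):

#include <linux/ioctl.h>

struct fs_extent {
	unsigned long logical;	/* file-relative block number */
	unsigned long physical;	/* on-disk block number */
	unsigned long count;	/* length in blocks */
};

/* ask the fs where a file's blocks live */
#define FS_IOC_GET_EXTENT	_IOWR('f', 0x42, struct fs_extent)
/* ask the fs to move an extent to a new physical home, atomically */
#define FS_IOC_MOVE_EXTENT	_IOW('f', 0x43, struct fs_extent)

A userspace defragmenter would walk files with the first ioctl, compute
a better layout using whatever heuristics the admin picked, and submit
moves with the second; all the consistency-critical work stays in the
kernel.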

Any and all ideas for online defrag, please post.  I'm very interested.



 delayed allocation

this needs to be in the VFS desperately.  every new & advanced
filesystem is winding up implementing its own logic for this...


 Address spaces (viro)

can someone elaborate?


 sort out interface to block devices (viro)

mostly done?


-- 
Jeff Garzik | Only so many songs can be sung
Building 1024   | with two lips, two lungs, and
MandrakeSoft, Inc.  | one tongue.



Re: Ext2 / VFS projects

2000-02-09 Thread Tigran Aivazian

On Wed, 9 Feb 2000, Matthew Wilcox wrote:
 fix posix fcntl locks (willy)

Very interesting. I haven't checked recently but which part of POSIX fcntl
locks is broken?

Tigran.



Re: Ext2 / VFS projects

2000-02-09 Thread willy

On Wed, Feb 09, 2000 at 12:02:46PM -0500, Alexander Viro wrote:
  256k blocks with 1k fragments (viro)
  [also 8k blocks with 256-byte fragments!]
^^^
 HUH??? You want to deal with sharing sectors between the different files?

Yes.  Based on my misunderstanding of how BSD did fragments, Ted and I
came up with an interestingly different way of doing fragments.  Here's
the basic idea:

Keep the bitmaps of _blocks_ instead of fragments and use a second
bitmap to identify which blocks are non-full fragment blocks.  Use the
last fragment in a fragment block as a fragment block descriptor to
indicate which inode each fragment belongs to (exact format still to
be determined).  Fragment blocks are kept compact as otherwise we would
have to deal with internal fragmentation.
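
One possible descriptor layout, purely as a strawman since the exact
format is explicitly still to be determined (numbers assume 2k blocks
and 128-byte fragments):

#include <linux/types.h>

#define FRAG_BLOCK_SIZE	2048
#define FRAG_SIZE	128
#define FRAGS_PER_BLOCK	(FRAG_BLOCK_SIZE / FRAG_SIZE)	/* 16 */

/* occupies the last fragment of every fragment block */
struct frag_block_desc {
	/* owning inode of each of the 15 data fragments; 0 = free */
	__u32 owner[FRAGS_PER_BLOCK - 1];	/* 60 bytes */
	/* 68 bytes left over for per-fragment lengths, flags, etc. */
};

With that geometry the bookkeeping costs one fragment per fragment
block, and the average tail waste drops from 1k (half a 2k block) to
64 bytes (half a fragment).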

with 256k blocks, we get to grow a block group to 16GB (probably excessive
with today's discs) and then we don't have any further problems with
not enough free space.

with 2k blocks and 128 byte fragments, we get to really reduce wasted
space below any other system i've ever experienced.




Re: Ext2 / VFS projects

2000-02-09 Thread Benjamin C.R. LaHaise

On Wed, 9 Feb 2000 [EMAIL PROTECTED] wrote:

 This requires Ben's work to decouple the ext2 allocation size from the
 hardware page size.  We would _always_ want to write out a fragment block
 as one to ensure that the fragment descriptor wasn't at odds with the
 contents of the block.  Imagine the descriptor not being written out after
 the block was compacted.

My initial plan is to decouple the allocation size from the hardware page
size only for the cases where the allocation size is larger than the
physical block size of the disk.  Going beyond that is non-trivial, but
doable.  It may only be interesting for e2compr, since large blocks with
512 byte fragments will rock.

-ben



Re: Ext2 / VFS projects

2000-02-09 Thread Andi Kleen

On Thu, Feb 10, 2000 at 03:04:53AM +0100, Jeremy Fitzhardinge wrote:
 
 On 09-Feb-00 Andi Kleen wrote:
  On Wed, Feb 09, 2000 at 05:35:00PM +0100, Matthew Wilcox wrote:
  
  [...]
  
  How about secure deletion? 
  
  1.3 used to have some simple-minded overwriting of deleted data when the 
  's' attribute was set.  That got lost with 2.0+. 
  
  Secure disk overwriting that is immune to 
  manual surface probing seems to take a lot more effort  (Colin Plumb's 
  sterilize does 25 overwrites with varying bit patterns). Such a complicated
  procedure is probably better kept in user space. What I would like is some
  way to have a sterilize daemon running, and when a file with the 's'
  attribute set gets deleted the VFS would open a new file descriptor for
  it, pass it to sterilized (sterild?) using a unix control message and let
  it do its job.
  
  What does the audience think? Should such a facility have kernel support
  or not?  I think secure deletion is an interesting topic and it would be
  nice if Linux supported it better.
 
 You have to be careful that you don't leak the file you're trying to eradicate
 into the swap via the sterilize daemon.  I guess simply never reading the file
 is a good start.

sterilize does that. You of course have to be careful that you didn't leak
its contents to swap before (one way around that is encrypted swapping)
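
The fd-passing step mentioned above is standard SCM_RIGHTS plumbing;
roughly like this (error handling trimmed for brevity):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Receive one file descriptor over a connected unix-domain socket. */
static int recv_fd(int sock)
{
	char dummy, cbuf[CMSG_SPACE(sizeof(int))];
	struct iovec iov = { &dummy, 1 };
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);

	if (recvmsg(sock, &msg, 0) < 0)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
	return fd;
}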

 
 The other question is whether you're talking about an ext2-specific thing, or
 whether it's a general service all filesystems provide.  Many filesystem

I was actually only thinking about ext2 (because only it has an 's' bit 
and the thread is about ext2's future) 

 designs, including ext3 w/ journalling, reiserfs(?) and the NetApp Wafl
 filesystem, don't let a process overwrite an existing block on disk.  Well,
 ext3 does, but only via the journal; wafl never does.  There's also the
 question of what happens when you have a RAID device under the filesystem,
 especially with hot-swappable disks.

reiserfs lets you overwrite in place as long as you don't change the file
size (if you do, it is possible that the file is migrated from a formatted
node to an unformatted node).  sterilize does not change file sizes.

ext3 only prevents it when you do data journaling (good point, I forgot
that) 

RAID0/RAID1 are no problem I think, because you always have well-defined
block(s) to write to. The wipe data does not depend on the old data on
the disk, so e.g. on a simple mirrored configuration both blocks would
be sterilized in parallel.

RAID5 devices could be a problem, especially when they do data journaling
(I think most only journal some metadata). It is not clear how the sterilize
algorithms interact with the XORed blocks.

If you swap your disks in between, you lose.

 
 Perhaps a better approach, since we're talking about a privileged process, is
 to get a list of raw blocks and go directly to the disk.  You'd have to be very
 careful to synchronize with the filesystem...

Not too much. The file still exists, but there are no references to it
outside sterild. No other process can access it. Assuming the file system
does not have fragments, the raw I/O has block granularity, and the file
was fdatasync'ed beforehand, you could access it directly without worrying
about any file system interference. If the fs has fragments you need the
infrastructure needed for O_DIRECT (I think that is planned anyway).

With an "invalidate all dirty buffers for file X" call you could optimize
part of the fdatasync writes away, but a good sterilize needs so many 
writes anyway (25+) that it probably does not make much difference. 

The data would only really be deleted when the system is turned off,
because it could partly still exist in some not-yet-reused buffers.
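
For reference, the write side of such a daemon is just a pattern loop
(sketch only; real sterilize-style tools use ~25 carefully chosen bit
patterns, not the four placeholders here):

#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Overwrite the first len bytes of fd once per pattern, syncing each
 * pass so the pattern actually reaches the disk. */
static int wipe_fd(int fd, off_t len)
{
	static const unsigned char patterns[] = { 0x55, 0xaa, 0xff, 0x00 };
	unsigned char buf[4096];
	size_t i;

	for (i = 0; i < sizeof(patterns); i++) {
		off_t off;

		memset(buf, patterns[i], sizeof(buf));
		for (off = 0; off < len; off += (off_t)sizeof(buf)) {
			size_t n = sizeof(buf);

			if ((off_t)n > len - off)
				n = (size_t)(len - off);
			if (pwrite(fd, buf, n, off) != (ssize_t)n)
				return -1;
		}
		if (fdatasync(fd) < 0)
			return -1;
	}
	return 0;
}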


-Andi