Re: module counts in kern_mount()

2000-07-30 Thread Andi Kleen

On Sun, Jul 30, 2000 at 06:28:11PM -0400, Alexander Viro wrote:
> 
> 
> On Mon, 31 Jul 2000, Andi Kleen wrote:
> 
> > On Sun, Jul 30, 2000 at 06:04:16PM -0400, Alexander Viro wrote:
> > > 
> > > 
> > > On Sun, 30 Jul 2000, Andi Kleen wrote:
> > > 
> > > > 
> > > > kern_mount currently forgets to increase the module count of the file system,
> > > > leading to a negative module count after umount.
> > > 
> > > That's because it is not supposed to be used more than once or to be
> > > undone by umount(). If it _would_ increment the counter you would be
> > > unable to get rid of the module once it's loaded. What are you actually
> > > trying to do?
> > 
> > It is not even done once. I was just writing a small module that registers
> > a private file system similar to sockfs.
> 
> Great, so why locking it in-core? It should be done when you mount it, not
> when you register.

It is mounted from inside the module too (it is a fileless file system like
sockfs and has no mount point in the namespace, so it can do that).

Anyway, the problem is that kern_mount() does not increase the module
count, but kern_umount() decreases it.

> 
> > IMHO kern_mount() should increase the count so that it is symmetric with
> > kern_umount().
> 
>  How TF did kern_umount() come to decrement it? Oh, I see -
> side effect of kill_super()... OK, _that_ must be fixed. By adding
> get_filesystem(sb->s_type); before the call of kill_super(sb, 0); in
> kern_umount().

I'm not sure I follow, but shouldn't mounting increase the fs module count?
How else would you do module count management for file systems?
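
For reference, the change Al suggests would look roughly like this in
kern_umount() (untested sketch, against the same fs/super.c as the patch
in my first mail):

        struct super_block *sb = mnt->mnt_sb;

        /* take an extra reference so the put_filesystem() done inside
           kill_super() does not drive the module count negative */
        get_filesystem(sb->s_type);
        kill_super(sb, 0);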


-Andi



Re: module counts in kern_mount()

2000-07-30 Thread Andi Kleen

On Sun, Jul 30, 2000 at 06:04:16PM -0400, Alexander Viro wrote:
> 
> 
> On Sun, 30 Jul 2000, Andi Kleen wrote:
> 
> > 
> > kern_mount currently forgets to increase the module count of the file system,
> > leading to a negative module count after umount.
> 
> That's because it is not supposed to be used more than once or to be
> undone by umount(). If it _would_ increment the counter you would be
> unable to get rid of the module once it's loaded. What are you actually
> trying to do?

It is not even done once. I was just writing a small module that registers
a private file system similar to sockfs.

IMHO kern_mount() should increase the count so that it is symmetric with
kern_umount().

-Andi



module counts in kern_mount()

2000-07-30 Thread Andi Kleen


kern_mount currently forgets to increase the module count of the file system,
leading to a negative module count after umount.

-Andi

--- linux-work/fs/super.c-MOUNT Tue Jul 25 02:04:13 2000
+++ linux-work/fs/super.c   Sun Jul 30 16:15:08 2000
@@ -942,6 +942,7 @@
kill_super(sb, 0);
return ERR_PTR(-ENOMEM);
}
+   get_filesystem(type); 
type->kern_mnt = mnt;
return mnt;
 }



Re: Questions about the buffer+page cache in 2.4.0

2000-07-29 Thread Andi Kleen

On Sat, Jul 29, 2000 at 10:50:28PM +0200, Gary Funck wrote:
> On Jul 29, 10:24pm, Andi Kleen wrote:
> > 
> > > 
> > > In 2.4, does the 'read actor' (ie, for ext2) optimize the case where
> > > the part of the I/O request being handled has a user-level address that
> > > is page aligned, and the requested bytes to transfer are at least one
> > > full page?  Ie, does it handle the 'copy' by simply remapping the I/O
> > > page directly into the user's address space (avoiding a copy)?
> > 
> > No, you cannot avoid the copy anyway, because you need to maintain
> > cache coherency in the page cache. If you want zero copy IO use mmap().
> 
> Couldn't that be taken care of, by implementing copy-on-write semantics
> on the user-level page that has been (hypothetically) remapped by
> the "copy agent"?  It would seem that the common case would be not to
> modify the data buffer that was just read, and the copy-on-write would
> be typically unneeded.

Write-protecting pages is very expensive, especially on SMP machines, where
the TLB updates may have to be broadcast to the other CPUs when you have
multiple threads.

> 
> Does glibc implement fread() and friends using mmap(), when reading
> from a file?  Would there be something to be gained from that?
> Alternatively, if the file system implemented zero copy semantics
> in read(), there'd be no need to push the optimization to a higer
> level (ie, into glibc).

Why ask when you can easily check yourself?

-Andi



Re: Questions about the buffer+page cache in 2.4.0

2000-07-29 Thread Andi Kleen

On Sat, Jul 29, 2000 at 06:58:34PM +0200, Gary Funck wrote:
> What entity is responsible for tearing down the file<->page mapping, when
> the storage is needed?  Is that bdflush's job?

kswapd's and the allocators themselves.

> 
> In 2.4, does the 'read actor' (ie, for ext2) optimize the case where
> the part of the I/O request being handled has a user-level address that
> is page aligned, and the requested bytes to transfer are at least one
> full page?  Ie, does it handle the 'copy' by simply remapping the I/O
> page directly into the user's address space (avoiding a copy)?

No, you cannot avoid the copy anyway, because you need to maintain
cache coherency in the page cache. If you want zero copy IO use mmap().
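
E.g. something like this reads a file without an extra copy into a user
buffer -- the pages are mapped straight out of the page cache (sketch,
error handling mostly omitted):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        struct stat st;
        long sum = 0;
        off_t i;
        char *p;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0)
                return 1;
        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        for (i = 0; i < st.st_size; i++)        /* touch every byte */
                sum += p[i];
        printf("%ld\n", sum);
        munmap(p, st.st_size);
        close(fd);
        return 0;
}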


-Andi



Re: Questions about the buffer+page cache in 2.4.0

2000-07-27 Thread Andi Kleen

On Thu, Jul 27, 2000 at 08:02:50PM +0200, Daniel Phillips wrote:
> So now it's time to start asking questions.  Just jumping in at a place I felt I
> knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
> it's changed somewhat.  Finding and removing a block from the free list is now
> bracketed by a spinlock pair.  First question: why do we use atomic_set to set
> the initial buffer use count if this is already protected by a spinlock?

The buffer use count needs to be atomic in other places because interrupts
may change it even on UP. An atomic_t can only be modified through the
atomic_* functions, and atomic.h lacks an "atomic_set_nonatomic". So even
if you need the atomic property in only one place, you have to change all
uses of the field.
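
The piece of getblk() in question looks roughly like this (quoting from
memory, so take the details with a grain of salt):

        spin_lock(&free_list[isize].lock);
        bh = free_list[isize].list;
        if (bh) {
                __remove_from_free_list(bh, isize);
                /* no plain "bh->b_count = 1" possible: b_count is an
                   atomic_t, because interrupt-time code elsewhere does
                   atomic_dec() on it */
                atomic_set(&bh->b_count, 1);
        }
        spin_unlock(&free_list[isize].lock);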


-Andi

-- 
This is like TV. I don't like TV.



Re: Reiserfs and NFS

2000-05-04 Thread Andi Kleen

On Wed, May 03, 2000 at 11:57:52PM +0200, Steve Dodd wrote:
> On Wed, May 03, 2000 at 11:27:34AM +0200, Andi Kleen wrote:
> > On Tue, May 02, 2000 at 11:20:10PM +0200, Steve Dodd wrote:
> 
> > > First off, could we call them "inode labels" or something less confusing?
> > > "file" outside of NFS has a different meaning (semantic namespace collision
> > > ) Also, I don't see how a "fh_to_dentry" (or ilbl_to_dentry) is going to
> > > work - (think hardlinks, etc.). You do need an iget_by_label or something
> > 
> > NFS file handles are always a kind of hard link; in traditional Unix
> > they refer directly to inodes. Linux knfsd uses dentries only because the
> > VFS semantics require it, not because of any NFS requirements.
> 
> What I meant was, you can't have a "ilabel_to_dentry" (or fh_to_dentry for
> that matter ) function because there may well be more than one dentry
> pointing to the inode.

It does not matter, as long as you get some dentry pointing to it.
The new NFS file handle code even uses anonymous dentries in some
cases (dentries not connected to the dentry tree).
Conceptually the nfsfh just acts like another hard link.
The standard 2.2 code is really broken because it cannot handle renaming
of directories (the file handles are path dependent). The 2.3 and
2.2+sourceforge code tries to fix this, mostly with some evil tricks.

> As for NFS's use of dentries, I'm still not sure I understand all the
> details. Without having reading the specs, I would expect it to be operating
> mostly on inodes, but I'm sure there are good reasons why it doesn't.

NFS wants to operate on inodes, but the Linux 2.2+ VFS does not allow it
(that is why the nfsfh.c code is so complex -- in early 2.1 knfsd, before
the dcache architecture was introduced, nfsfh.c was *much* simpler).

> 
> [..]
> > iget_by_label() is already implemented in 2.3 -- see iget4(). Unfortunately
> > it is a bit inefficient to search inodes this way [because you cannot index
> > on the additional information], but not too bad.
> 
> iget4 isn't quite the same -- you need to supply a "find actor" to compare
> the other parts of the inode identifier, which are fs-specific. knfsd wouldn't
> be able to supply a find actor for the underlying filesystem it was serving.

``with some trivial extensions''
The 2.3 nfsfh code supports arbitrary file handle types, indexed with
an identifier. You would associate the find_actor with a specific
identifier. Some fs-specific code is needed anyway to write the
private parts into the fh (e.g. in the reiserfs case for writing
the true packing locality).
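
Sketching from memory of the 2.3 iget4() interface (the helper names here
are made up, not real reiserfs code):

struct rfh_cookie {
        __u32 packing_locality;         /* fs-private part of the fh */
};

static int reiserfs_fh_find_actor(struct inode *inode, unsigned long ino,
                                  void *opaque)
{
        struct rfh_cookie *c = opaque;

        /* the inode number already matched during the hash chain walk;
           only the fs-private part of the identifier is checked here.
           reiserfs_packing_locality() is a hypothetical accessor for
           whatever reiserfs keeps in its in-core inode. */
        return reiserfs_packing_locality(inode) == c->packing_locality;
}

        ...
        inode = iget4(sb, objectid, reiserfs_fh_find_actor, &cookie);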


> 
> > > Also, what are the size constraints imposed by NFS? What about other network
> > > filesystems?
> > 
> > NFSv2 has a 2GB limit for files. 
> 
> Sorry, I was thinking more of limits imposed on the size of the "file handle" /
> inode identifier..

NFSv2 has 32-byte file handles. NFSv3 has longer ones.
For all current versions of reiserfs the necessary information can
be squeezed into those 32 bytes with some tricks.
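
Purely for illustration, a 32-byte handle could be laid out like this
(this is not the actual knfsd layout, just to show that it fits):

struct rfh {
        __u32 objectid;         /* inode number */
        __u32 dirid;            /* packing locality */
        __u32 generation;       /* guards against objectid reuse */
        __u32 parent_objectid;  /* for reconnecting the dentry */
        __u32 parent_dirid;
        __u8  type;             /* fh flavour, cf. the 2.3 nfsfh code */
        __u8  pad[11];
};                              /* 5*4 + 1 + 11 = 32 bytes */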


-Andi

-- 
This is like TV. I don't like TV.



Re: Reiserfs and NFS

2000-05-03 Thread Andi Kleen

On Tue, May 02, 2000 at 11:20:10PM +0200, Steve Dodd wrote:
> On Tue, May 02, 2000 at 01:50:16PM -0700, Chris Mason wrote:
> > 
> > ReiserFS has unique inode numbers, but they aren't enough to actually find
> > the inode on disk.  That requires the inode number, and another 32 bits of
> > information we call the packing locality.  The packing locality starts as
> > the parent directory inode number, but does not change across renames.
> > 
> > So, we need to add a fh_to_dentry lookup operation for knfsd to use, and
> > perhaps a dentry_to_fh operation as well (but _fh_update in pre6 looks ok
> > for us).
> 
> First off, could we call them "inode labels" or something less confusing?
> "file" outside of NFS has a different meaning (semantic namespace collision
> ) Also, I don't see how a "fh_to_dentry" (or ilbl_to_dentry) is going to
> work - (think hardlinks, etc.). You do need an iget_by_label or something

NFS file handles are always a kind of hard link; in traditional Unix
they refer directly to inodes. Linux knfsd uses dentries only because the
VFS semantics require it, not because of any NFS requirements.

> similar though. Details that need to be worked out would be max label size
> and how they're passed around (void * ptr and a length?)

iget_by_label() is already implemented in 2.3 -- see iget4(). Unfortunately
it is a bit inefficient to search inodes this way [because you cannot index
on the additional information], but not too bad.

> 
> Also, what are the size constraints imposed by NFS? What about other network
> filesystems?

NFSv2 has a 2GB limit for files. 
 
-Andi



Re: fs changes in 2.3

2000-05-03 Thread Andi Kleen

On Wed, May 03, 2000 at 08:54:54AM +0200, Alexander Viro wrote:

[flame snipped -- hopefully everybody can go back to normal work now]

> 
> ObNFS: weird as it may sound to you, I actually write stuff - not
> "subcontract" to somebody else. So I'm afraid that I have slightly less
> free time than you do. FWIC, in Reiserfs context nfsd is a non-issue.
> Current kludge is not too lovely, but it's well-isolated and can be
> replaced fast. So ->read_inode2() is ugly, but in my opinion it's not an
> obstacle. If other problems will be resolved and by that time 
> ->fh_to_dentry() interface will not be in place - count on my vote for
> temporary adding ->read_inode2().

In the long run some generic support for 64-bit inodes will be needed
anyway -- other file systems depend on that too (e.g. XFS). So
fh_to_dentry only saves you temporarily. I think adding read_inode2
early is best.
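
Just to make the discussion concrete, the hook could be as simple as
(sketch -- the exact prototype is of course open):

        /* in struct super_operations, next to read_inode: */
        void (*read_inode2)(struct inode *inode, void *opaque);

with iget4() passing its opaque cookie through, so a file system whose
inodes cannot be located by the inode number alone (reiserfs, or a 64-bit
inode file system squeezed into a 32-bit ino) can hand the rest of the
key down to its read_inode path.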


-Andi



Re: [Fwd: [linux-audio-dev] info point on linux hdr]

2000-04-14 Thread Andi Kleen

On Sat, Apr 15, 2000 at 12:24:16AM +0200, Andrew Clausen wrote:
> i mentioned in some remarks to benno how important i thought it was to
> preallocate the files used for hard disk recording under linux.

[...]

Unfortunately efficient preallocation is rather hard with the current
ext2. To do it efficiently you want to allocate the blocks in the
bitmaps without writing to the actual allocated blocks (otherwise
it would be as slow as the manual write-every-block-from-userspace trick).
These blocks could still contain data from other files, and the new owner
of a block must not be allowed to see the old data, for security reasons.

You suddenly get a new special kind of block in the file system with
different semantics: ignore the old data and supply only zeroes to the
reader, unless the block has actually been written.

This ``ignore old data until written'' information about the block would
need to be persistent on disk -- you cannot just hold it in memory,
otherwise it would not be known anymore after a reboot/crash. Filling
the unwritten blocks with zeroes on shutdown would be too slow[1].

The problem is that ext2 has no space to store this information. The blocks
are allocated using simple bitmaps, and you cannot express three states
(free, allocated, allocated-but-unwritten) in a single bit. If you were using
an extent-based file system there would probably be enough space (you usually
find room for a single bit somewhere in the extent tree), but with
bitmaps it is tricky and would require on-disk format changes.
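
With extents you would have something like this (purely illustrative, not
any real on-disk format):

struct ext_extent {
        __u32 logical;          /* first logical block covered */
        __u32 physical;         /* first block on disk */
        __u16 count;            /* number of blocks in the extent */
        __u16 flags;
#define EXT_UNWRITTEN   0x0001  /* allocated, but reads must return
                                   zeroes until real data is written */
};

The read path checks EXT_UNWRITTEN before touching the disk and fills the
page with zeroes instead; the write path clears the flag once the data has
actually gone out.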

Apparently an extent-based ext2 is planned; maybe it would be useful to
include the feature then, but before that it looks too hairy.

JFS and XFS seem to support these things already. 

-Andi

[1] Imagine your system starting a 100MB write just as the UPS tries to force
a quick shutdown on power failure -- you really don't want that.




Re: NWFS Source Code Posted at 207.109.151.240

2000-03-29 Thread Andi Kleen

On Wed, Mar 29, 2000 at 10:17:12PM +0100, Matthew Kirkwood wrote:
> > a sample VFS code example for a NULL file system for folks who are
> > porting to Linux.
> 
> As I understand it, your nwfs is probably the first filesystem to
> have been successfully "ported" to Linux.  Pretty much everything
> else (with, perhaps, the exception of the abomination that is the
> NTFS driver) started off native.

Examples that were/are being ported: reiserfs (from some simulator,
I think), AFS, Coda (although they rewrote a large part of the vnode
driver), XFS (they finally released a compilable tree which can
read/write), Novell/Caldera (binary-only NetWare client fs module).


> The multiple times that I have written 30 to 70% of a filesystem,
> I found the romfs and minixfs code to be most instructive as a
> guide to the VFS interfaces.  The buffer and page cache stuff is
> rather harder to track down canonical examples for, though again
> minixfs is pretty helpful, if rather simplistic.

Unfortunately neither romfs nor minixfs is integrated well
with the page cache, so they are not good templates for general-purpose
filesystems (this causes problems like nwfs not supporting
shared mappings). nwfs already does one more copy than ext2 in
the write path in 2.3 because of that.
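
For comparison, a fs that is integrated with the page cache wires up the
address_space operations and lets the generic file read/write code copy
directly between user space and page cache pages. Roughly (quoting the
ext2 arrangement from memory, details may be off):

static struct address_space_operations ext2_aops = {
        readpage:       ext2_readpage,          /* block_read_full_page() */
        writepage:      ext2_writepage,         /* block_write_full_page() */
        prepare_write:  ext2_prepare_write,     /* block_prepare_write() */
        commit_write:   generic_commit_write,
        bmap:           ext2_bmap,
};

generic_file_read()/generic_file_write() then do the single
copy_to/from_user() against the page cache page; a fs that first goes
through its own buffers and memcpy()s into the page cache gets the extra
copy mentioned above.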


-Andi 



Re: Ext2 / VFS projects

2000-02-09 Thread Andi Kleen

On Thu, Feb 10, 2000 at 03:04:53AM +0100, Jeremy Fitzhardinge wrote:
> 
> On 09-Feb-00 Andi Kleen wrote:
> > On Wed, Feb 09, 2000 at 05:35:00PM +0100, Matthew Wilcox wrote:
> > 
> > [...]
> > 
> > How about secure deletion? 
> > 
> > 1.3 used to have some simple minded overwriting of deleted data when the 
> > 's' attribute was set.  That got lost with 2.0+. 
> > 
> > Secure disk overwriting that is immune to 
> > manual surface probing seems to take a lot more effort  (Colin Plumb's 
> > sterilize does 25 overwrites with varying bit patterns). Such a complicated
> > procedure is probably better kept in user space. What I would like is some
> > way to have a sterilize daemon running: when a file with the 's' attribute
> > gets deleted, the VFS would open a new file descriptor for it, pass it to
> > the sterilize daemon (sterild?) using a unix control message and let it do
> > its job.
> > 
> > What does the audience think? Should such a facility have kernel support
> > or not?  I think secure deletion is an interesting topic and it would be
> > nice if Linux supported it better.
> 
> You have to be careful that you don't leak the file you're trying to eradicate
> into the swap via the serilize daemon.  I guess simply never reading the file
> is a good start.

sterilize does that. You of course have to be careful that you didn't leak
its content to swap before (one way around that is encrypted swap).

> 
> The other question is whether you're talking about an ext2-specific thing, or
> whether its a general service all filesystems provide.  Many filesystem

I was actually only thinking about ext2 (because only it has an 's' bit,
and the thread is about ext2's future).

> designs, including ext3 w/ journalling, reiserfs(?) and the NetApp Wafl
> filesystem, don't let a process overwrite an existing block on disk.  Well,
> ext3 does, but only via the journal; wafl never does.  There's also the
> question of what happens when you have a RAID device under the filesystem,
> especially with hot-swappable disks.

reiserfs lets you overwrite in place as long as you don't change the file
size (if you do, it is possible that the file is migrated from a formatted
node to an unformatted node). sterilize does not change file sizes.

ext3 only stops you overwriting in place when you do data journaling
(good point, I forgot that).

RAID0/RAID1 are no problem I think, because you always have well-defined
block(s) to write to. The wipe data does not depend on the old data on
the disk, so e.g. on a simple mirrored configuration both blocks would
be sterilized in parallel.

RAID5 devices could be a problem, especially when they do data journaling
(I think most only journal some metadata). It is not clear how the sterilize
algorithms interact with the XORed parity blocks.

If you swap your disks in between, you lose.

> 
> Perhaps a better approach, since we're talking about a privileged process, is
> to get a list of raw blocks and go directly to the disk.  You'd have to be very
> careful to synchronize with the filesystem...

Not too much. The file still exists, but there are no references to it
outside sterild, so no other process can access it. Assuming the file system
does not have fragments, the raw I/O has block granularity, and the file was
fdatasync'ed before, you could directly access it without worrying about
any file system interference. If the fs has fragments you need the
infrastructure needed for O_DIRECT (I think that is planned anyway).

With a "invalidate all dirty buffers for file X" call you could optimize
part of the fdatasync writes away, but a good sterilize needs so many 
writes anyways (25+) that it probably does not make much difference. 

The data would be only really deleted when the system is turned off,
because it could partly still exist in some not yet reused buffers.


-Andi



Re: Ext2 / VFS projects

2000-02-09 Thread Andi Kleen

On Wed, Feb 09, 2000 at 05:35:00PM +0100, Matthew Wilcox wrote:

[...]

How about secure deletion? 

1.3 used to have some simple minded overwriting of deleted data when the 
's' attribute was set.  That got lost with 2.0+. 

Secure disk overwriting that is immune to 
manual surface probing seems to take a lot more effort  (Colin Plumb's 
sterilize does 25 overwrites with varying bit patterns). Such a complicated
procedure is probably better kept in user space. What I would like is some
way to have a sterilize daemon running: when a file with the 's' attribute
gets deleted, the VFS would open a new file descriptor for it, pass it to
the sterilize daemon (sterild?) using a unix control message and let it do
its job.
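
The user space side of that is standard SCM_RIGHTS descriptor passing,
something like this in sterild (sketch, error handling trimmed):

#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

/* receive one file descriptor over a unix domain socket */
static int recv_fd(int sock)
{
        struct msghdr msg;
        struct iovec iov;
        struct cmsghdr *cmsg;
        char cbuf[CMSG_SPACE(sizeof(int))];
        char dummy;
        int fd = -1;

        memset(&msg, 0, sizeof(msg));
        iov.iov_base = &dummy;
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        if (recvmsg(sock, &msg, 0) <= 0)
                return -1;
        for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg))
                if (cmsg->cmsg_level == SOL_SOCKET &&
                    cmsg->cmsg_type == SCM_RIGHTS) {
                        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
                        break;
                }
        return fd;      /* sterild can now overwrite the file through fd */
}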

What does the audience think? Should such a facility have kernel support
or not?  I think secure deletion is an interesting topic and it would be
nice if Linux supported it better.

sterilize also does some tricks to overwrite entries in directories, but
I see no easy way to make that fit into the kernel. 

Comments? 

-Andi
-- 
This is like TV. I don't like TV.



Re: file system size limits

2000-01-10 Thread Andi Kleen

On Mon, Jan 10, 2000 at 05:19:13PM +0100, Theodore Y. Ts'o wrote:
>You use these disks for a few weeks, then your Alpha breaks down. You
>decide to attach these disks to your x86 computer.
>Now the RAID tools MUST NOT mount the 4 TB disk array, or you'll destroy
>your data.
>[the calculation
>  bh[i]->b_rsector = bh[i]->b_blocknr*(bh[i]->b_size>>9);
>in ll_rw_blk.c would overflow]
> 
> That's an MD (and block devlice layer) problem, not an ext2 problem.
> Granted, the customer won't care much whose problem it is; he just
> doesn't want his data trashed.  It shouldn't be that hard to fix,
> though.

I think it would require changes in all device drivers, which is a
"hard fix".
But these changes will be needed soon anyway for highmem-without-bounce-buffer
support.
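
To put rough numbers on it (assuming b_rsector and b_blocknr are unsigned
long, i.e. 32 bit on i386): with 4KB blocks a 4TB array has about 2^30
blocks, so

        b_rsector = b_blocknr * (b_size >> 9)
                  = ~2^30 * 8 sectors = ~2^33

which does not fit in 32 bits -- the high bits get silently dropped and the
I/O goes to the wrong sector. On the Alpha unsigned long is 64 bit, which is
why the same array works there.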

-Andi

-- 
This is like TV. I don't like TV.



Re: file system size limits

2000-01-06 Thread Andi Kleen

On Thu, Jan 06, 2000 at 04:03:38PM +0100, Manfred Spraul wrote:
> What's the current limit for ext2 filesystems, and what happens if a users
> creates a larger disk? I think we should document the current limits, and
> refuse to mount larger disks. I guess the current limit is somewhere around
> 1000 GB (512*INT_MAX)?

The limit is more like ~2TB with 1K blocks and ~8TB with 4K blocks.
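
(That is roughly a 2^31 block-number limit either way:

        2^31 blocks * 1 KB/block = 2 TB
        2^31 blocks * 4 KB/block = 8 TB)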

> 
> I'm posting this question because I've already seen a message that someone
> uses a 500 GB ext2 fs, and because (IIRC) certain versions of the Norton
> Commander silently corrupted disks > 2GB on the Macintosh when Apple removed
> the 2 GB limit a few years ago.

Some version of fsck compiled with the wrong llseek() did that too.

> The Linux filesystems/utilities [kernel, fsck,...] should avoid similar
> problems.

I think it is not a problem currently, because both 2TB and 8TB would
take several days of fsck, which makes them impractical.

-Andi



Re: Ext2 defragmentation

1999-11-15 Thread Andi Kleen

On Mon, Nov 15, 1999 at 03:00:20PM +0100, Pavel Machek wrote:
> Hi!
> 
> > > > > How necessary is it to defragment ones ext2 partitions? It just hit me
> > > > > that defragmentation is very important under the Wintendo filesystem.
> > > > 
> > > > It's not as important.  But... I had an idea for an ext2 defrag daemon,
> > > > e2defragd, which would take advantage of _disk_ idle time to reorganize
> > > > blocks, while the filesystem was mounted.  This daemon would be a good
> > > > candidate for disk optimizations like moving frequently-accessed files
> > > > to the middle of the disk in addition to background defragging.
> > > 
> > > There's one useful thing that could be done with e2defrag: putting
> > > directories at the beginning of the disk exactly in the order find /
> > > would use. One line hack, but e2defrag just does not work for me.
> > >   Pavel
> > 
> > Isn't it better to simply use locate / updatedb instead?
> 
> No. There are other operations (such as du -s ., search from midnight)
> which have a find-like access pattern. And you have no chance of getting
> out of date.

It just sounds silly to optimize the disk layout for such specific cases.
Maybe if you're only running du -s and find / all day, but somehow I doubt
that.


-Andi


-- 
This is like TV. I don't like TV.



Re: Ext2 defragmentation

1999-11-15 Thread Andi Kleen

On Sat, Nov 13, 1999 at 08:44:39PM +0100, Pavel Machek wrote:
> HI!
> 
> > > How necessary is it to defragment ones ext2 partitions? It just hit me
> > > that defragmentation is very important under the Wintendo filesystem.
> > 
> > It's not as important.  But... I had an idea for an ext2 defrag daemon,
> > e2defragd, which would take advantage of _disk_ idle time to reorganize
> > blocks, while the filesystem was mounted.  This daemon would be a good
> > candidate for disk optimizations like moving frequently-accessed files
> > to the middle of the disk in addition to background defragging.
> 
> There's one useful thing that could be done with e2defrag: putting
> directories at the beginning of the disk exactly in the order find /
> would use. One line hack, but e2defrag just does not work for me.
>   Pavel

Isn't it better to simply use locate / updatedb instead?


-Andi