Re: Userfs homepage (again)
In message <[EMAIL PROTECTED]>, "Andrew Morton" writes:

> Hi, J.
>
> I'd be interested in seeing some expansion on why shared mem was
> undesirable and why the NFS interface was not the way to go.
> Particularly the latter, as people have been suggesting this
> occasionally.
>
> This was thrashed through a number of months back on the gnome list.
> The debate concerned whether the exercise of unifying FTP/HTTP/etc with
> filesystems should be done in user space or the OS. User space won out
> (a VFS library, but limited to gnome apps only) because gnome is not
> just for linux.
>
> But the consensus was that if this were to be done in the OS, the NFS
> hook should be the way because that gives the best shot at cross-Unix
> portability.

I have some comments on this. My wrapfs is a stackable wrapper f/s with hooks for a higher-level language. Wrapfs is ported to linux (2.[012]), solaris, and freebsd, and offers the same functionality via its API. Ion and I wrote several file systems using wrapfs, and their in-kernel performance beats user-level NFS servers by anywhere from 50% to 10x. Most of the savings come from reduced context switches, of course.

For more info, see my research Web page, esp. the paper I'll be presenting at Usenix in June: "Extending File Systems Using Stackable Templates." I also have another paper, which I'll be presenting next month at LinuxExpo (.org), detailing the implementation of wrapfs for linux 2.1: "A Stackable File System Interface For Linux."

As to the question of NFS vs. something else, I'd agree that the vnode f/s interface is not the right API for f/s developers to use, esp. if you'd like portability. My wrapfs uses a simplified API to modify file data and file names, but that's still not enough. My FiST (f/s translator) language can describe a file system at a higher level, using an API that looks more like system calls and NFS: read, write, lookup, unlink, mkdir, etc. Using that common f/s description, it generates the right vnode-level code for whatever OS wrapfs was ported to. At a later date I intend to use FiST to generate user-level NFS server code. (I've had my share of user-level file servers: I maintain amd/am-utils, and wrote and maintain hlfsd.)

My research Web page is: http://www.cs.columbia.edu/~ezk/research/

> Another comment on the web page: s/lession/lesson/ :-)
>
> - Andrew.

PS. I'll be releasing a new wrapfs and kernel patches for 2.2.6 (new rename code) as soon as Ion has had a chance to verify my fixes.

Erez.
lookup() return val in 2.2.7
2.2.7 changed the return value of the ->lookup() that's called from fs/namei.c:real_lookup(). It used to return an int, and the f/s was to fill in the dentry in the preallocated out-arg "dentry". 2.2.7 changed the return value to a struct dentry pointer. Why? The new code is semantically the same. It used to be:

	int error = dir->i_op->lookup(dir, dentry);
	result = dentry;
	if (error) {
		dput(dentry);
		result = ERR_PTR(error);
	}

and now it is:

	result = dir->i_op->lookup(dir, dentry);
	if (result)
		dput(dentry);
	else
		result = dentry;

I was hoping that the 2.2 series wouldn't change f/s APIs. Each time something like this changes, I have to update my stackable f/s for linux. Will the prototype for lookup remain this way or not?

It seems to me that having the result available both as a return value and filled into the dentry out-arg is confusing. IMHO it may confuse programmers who don't know how to pass their result back to the VFS. If we want the return value to double as both a valid pointer and an ERR_PTR, then why not change lookup so it takes only the dir inode, not a second "dentry" argument, and is expected to return either the valid dentry looked up or an ERR_PTR? Of course, then the VFS won't be able to d_alloc the space for the new dentry, and each f/s will have to do so instead. So we're back where we were: the old prototype for lookup, which took an allocated dentry and returned a plain integer errno, was better, no?

Maybe someone can explain this to me? Are there historical reasons why this change was made now? Perhaps it is in preparation for a different lookup API? Either way, I think I can get my stackable f/s to work with little change.

Thanks, Erez.
Re: VFS question...
In message <[EMAIL PROTECTED]>, "John P. Looney" writes:

> I'm trying to write a program that would gather statistics on filesystem
> usage, over a long period of time.
>
> I *think* the best way of doing this is to write a kernel module that
> could replace some VFS functions, perhaps sys_write and the like, with
> a function that writes a "what happened" message to a userspace program,
> and then calls the original function.

This is called stacking... :-)

> Has something like this been done already ? If not, is it clean enough
> to work ? If I wished to log just:
>
> When a file is created/deleted/modified/read from
>
> what's are the VFS functions of most use ?

Off the top of my head: create, open, unlink, read, write, putpage, getpage. But there's more.

> Can a module override functions
> in the kernel proper ?

No, but it can intercept f/s functions right below the VFS, before the VFS calls the lower-level f/s (ext2fs, nfs, etc.).

> Kate
>
> --
> Microsoft - the best reason in the world to drink beer

You can use my wrapfs/lofs templates for linux. They are stackable f/s modules. They wrap on top of any directory and can modify f/s behavior, or, in your case, just observe it. That way you can measure any f/s, and you don't need to modify the VFS or other file systems.

In fact, I did something similar to what you want a while back, when I needed to count the number of lookups, unlinks, and creates on a news spool. I used my template and put simple integer counters at the entry points of several VFS functions. Then I added an ioctl that returned the values of those counters to a user process that polled them every few minutes. Unfortunately I didn't save that code b/c it was so simple; I didn't think someone else might find it useful... :-( Anyway, the idea is simple, and you can extend it to any number of integer counters and VFS ops. The performance overhead of lofs on 2.2 is only 2--4%, and adding the counters should not add more than 1--2%.
I would not recommend doing a printk from the kernel and counting it from syslog or some such; that will harm performance significantly and will fill up your logs quickly. My templates come with a lot of debugging messages that you can turn on/off or set to a given level. (Yes, I have printfs on entry/exit to every VFS function.)

You can get source code for these at http://www.cs.columbia.edu/~ezk/research/software/ The code there is for earlier 2.2.x kernels. I've updated the code for all kernels up to 2.2.10 and am testing it now; I'll release it within a few days. I've also been working on porting my templates to 2.3 kernels. So far 2.3.8 is giving me trouble, but I suspect the kernel itself has problems; 2.3.9-pre8 seems more stable. Once I get wrapfs/lofs to work on 2.3 kernels, I'll release that too.

My code requires small kernel patches. I've been working on getting those cleaned up and submitted to Linus, who agreed in principle to incorporate them. I will make all such announcements on linux-fsdevel. Stay tuned.

If you use my code, let me know if you have any questions. I'd like to help.

Cheers, Erez.
Updated Stackable f/s support for 2.2/2.3
I've released updates to my stackable file systems for linux. The updates work for up to 2.2.10 and 2.3.10. You can find all software in http://www.cs.columbia.edu/~ezk/research/software/. The packages include (small) kernel patches, and sources for several stackable file system modules: lofs, wrapfs, rot13fs, and cryptfs. Let me know if you have any questions. Happy stacking. :-) Erez.
Updated Stackable f/s support for 2.3.11-12
I've released updates to my stackable file systems for linux. The updates work for up to 2.3.12. You can find all software in http://www.cs.columbia.edu/~ezk/research/software:

  fist-linux-2.3-fs-0.1.1.tar.gz
  fist-linux-2.3-cryptfs-0.1.1.tar.gz (under "export controlled")

The packages include (small) kernel patches, and sources for several stackable file system modules: lofs, wrapfs, rot13fs, and cryptfs. Let me know if you have any questions. Erez.
2.3 NFS client expects monotonically increasing cookies
I found out that 2.3 kernels (see linux/fs/nfs/dir.c) expect NFS cookies passed from the server to be monotonically increasing; 2.2 kernels do not make that assumption, it seems. The cookies I'm talking about are the 'cookie' field in 'struct entry' (rpcsvc/nfs_prot.h). The NFS (v2) spec does not say whether the cookies should or should not be increasing. I quote from RFC 1094, section 2.2.17:

	``... and a "cookie" which is an opaque pointer to the next entry
	in the directory. The cookie is used in the next READDIR call to
	get more entries starting at a given point in the directory. The
	special cookie zero (all bits zero) can be used to get the entries
	starting at the beginning of the directory.''

The cookies are opaque and must not be interpreted by the client! Linux should not assume anything about them, only make sure it sends back to the server the last cookie in the direntry chain, so that the *server* can restart directory reading from the last entry just read.

I discovered this problem b/c directory reading in amd stopped working under 2.3 kernels. It turned out that my "browsable directories" code didn't generate monotonically increasing cookies. I rewrote amd's code so the cookies are monotonically increasing, and directory listing under amd works again. Nevertheless, I think this assumption in 2.3 kernels can cause problems when interacting with non-linux NFS servers that do not generate monotonically increasing cookies.

Erez.
Updated Stackable f/s support for 2.2.11 and 2.3.13
I've released updates to my stackable file systems for linux. The updates work for up to 2.2.11 and 2.3.13. There were no functional changes since the previous versions. You can find all software in http://www.cs.columbia.edu/~ezk/research/software:

For 2.2 kernels:
  fist-linux-2.2-fs-0.4.1.tar.gz
  fist-linux-2.2-cryptfs-0.4.1.tar.gz (under "export controlled")

For 2.3 kernels:
  fist-linux-2.3-fs-0.1.2.tar.gz
  fist-linux-2.3-cryptfs-0.1.2.tar.gz (under "export controlled")

The packages include (small) kernel patches, and sources for several stackable file system modules: lofs, wrapfs, rot13fs, and cryptfs. Let me know if you have any questions. Erez.
2.3.14 text for CONFIG_MSDOS_PARTITION
The text for the CONFIG_MSDOS_PARTITION option in 2.3.14 is a bit misleading:

	This option enables support for using hard disks that were
	partitioned on an MS-DOS system. This may be useful if you are
	sharing a hard disk between i386 and m68k Linux boxes, for
	example. Say Y if you need this feature; users who are only
	using their system-native partitioning scheme can say N here.

I think many PC-based systems will want that option, so the default for PC systems should be changed to 'Y'. If you've partitioned your disk using MSDOS fdisk, or linux fdisk/xxx, you probably want this option on. The text as it stands does not make it clear when to pick this option: it implies that the option is only necessary for cross-platform compatibility. Should I pick this option if I partitioned using some flavor of MS Windows? FreeBSD? Solaris? If someone will give me somewhat more accurate details of when to say Y/N, I'm willing to rewrite the text and produce a small patch.

Cheers, Erez.
Updated Stackable f/s support for 2.2.12 and 2.3.1[45]
I've released updates to my stackable file systems for linux. The updates work for up to 2.2.12 and 2.3.15. There were no functional changes since the previous versions. One small bug was fixed in my stackable file system templates, and better documentation was included. The kernel patches remain essentially the same. You can find all software in http://www.cs.columbia.edu/~ezk/research/software/

For 2.2 kernels:
  fist-linux-2.2-fs-0.4.2.tar.gz
  fist-linux-2.2-cryptfs-0.4.2.tar.gz (under "export controlled")

For 2.3 kernels:
  fist-linux-2.3-fs-0.1.3.tar.gz
  fist-linux-2.3-cryptfs-0.1.3.tar.gz (under "export controlled")

The packages include (small) kernel patches, and sources for several stackable file system modules: lofs, wrapfs, rot13fs, and cryptfs. Let me know if you have any questions.

Cheers, Erez Zadok.
---
Columbia University Department of Computer Science.
EMail: [EMAIL PROTECTED]
Web: http://www.cs.columbia.edu/~ezk
Re: Testing Filesystems
In message <[EMAIL PROTECTED]>, Steve Dodd writes:

> On Thu, Sep 09, 1999 at 03:17:06PM +0100, Angelo Masci wrote:
>
> > Is there a set of regression tests for filesystems
> > available?
>
> Not that I know of, but I didn't look very hard..
>
> > I'd like to start testing a filesystem and was wondering
> > if there's a set of IO operation tests lurking out there
> > somewhere.
>
> If you run into one, I'd be very interested
>
> > I'm looking for obvious stuff. Boundary tests for all
> > IO related operations.
>
> --
> "I decry the current tendency to seek patents on algorithms. There are
> better ways to earn a living than to prevent other people from making use of
> one's contributions to computer science." -- Donald E. Knuth, TAoCP vol 3

It would be nice if SPEC SFS '97 could test non-NFS file systems.

For my testing of stackable file systems, I unpack, configure, and build several large packages: am-utils, egcs-1.1.2, binutils, and emacs-20.4. I run a loop of "configure; make; make clean" about 10-20 times. I watch for any obvious bugs/oopses, odd kernel messages, creeping memory utilization, and such. I also check for any leftover locked vnodes/inodes, or ones with incorrect reference counts; to detect some of that, I try to unmount and remount my file system, and also unmount and remount the underlying file system. Any bad locks or refcounts will cause trouble when you try to unmount, remount, or fsck. I also use bonnie.

Over the past few years, I've written over two dozen small C programs intended to test specific features. For example, a lot of the complexity in my stackable f/s has to do with mmap'ed pages. So I have a program that opens and mmaps files, then reads/writes specific pages, page ranges, and bytes just around page boundaries. I used that to detect and fix boundary conditions. I'm pretty sure that everyone who has ever written a file system has written a few test programs. My small test programs are intended for testing stackable f/s.
I'm sure developers of other file systems have written tests specific to theirs. Maybe the fsdevel community should join forces and put together a real f/s regression test package. That would be useful for every filesystem on Unix systems, not just Linux. I'd be happy to contribute to such an effort if one were organized.

Erez.
Updated Stackable f/s support (+lofs) for 2.3.17/2.3.18
First, some good news: kernel 2.3.17 finally contains some of my stackable f/s patches, about 50% of them.

I've released updates to my stackable file systems for linux. The updates work up to 2.3.18. There were no functional changes since the previous versions, aside from a fix for one LRU-related bug in lofs/wrapfs/cryptfs. This time, I've included more in my kernel patches:

(1) what other patches are necessary to support stackable f/s
(2) an updated vfs.txt file
(3) complete lofs sources, integrated with the rest of the kernel

You can find all software in http://www.cs.columbia.edu/~ezk/research/software/

For 2.3 kernels:
  fist-linux-2.3-fs-0.1.4.tar.gz
  fist-linux-2.3-cryptfs-0.1.4.tar.gz (under "export controlled")

The packages include kernel patches, and sources for several stackable file system modules: lofs, wrapfs, rot13fs, and cryptfs.

If all you want is lofs for 2.3.18, then simply apply this patch to a 2.3.18 kernel: http://www.cs.columbia.edu/~ezk/research/software/fist-2.3.18.diff

Let me know if you have any questions.

Cheers, Erez Zadok.
---
Columbia University Department of Computer Science.
EMail: [EMAIL PROTECTED]
Web: http://www.cs.columbia.edu/~ezk
Re: page_cache: how does generic_file_read really work?
Heh, heh. Funny you should mention this, Peter. I struggle with this question every time a new kernel release is made, b/c I have to update my stackable f/s modules. If you tell me which kernel you're using (2.2 or 2.3), I'll take a stab at explaining this. Erez.
Re: page_cache: how does generic_file_read really work?
In message <[EMAIL PROTECTED]>, "Peter J. Braam" writes:

> Hi
>
> I wondered if someone could explain what is happening in
> generic_file_read:
>
> More generically, I'd like to understand how in a file system I can
> "get" a page and use it to copy date into/out of it. How do I then put
> the page away?

I think my example from Cryptfs (or Wrapfs, or any other of my stackable f/s) might help you. My file systems use generic_file_read as their read routine. Let's take the simple case of getting a page for the first time, meaning it's not in memory or in any cache. In the VFS, generic_file_read essentially calls do_generic_file_read, which does, in a loop:

- find the hash of the page

- try to find the page w/o locking it (__find_page_nolock)

- initially there won't be a page or cached page, so it allocates a page (page_cache_alloc()) and puts it in the cache (__add_to_page_cache()). The page is allocated already locked.

- call *your* file system's readpage routine (which must exist, b/c you've defined your f/s to use generic_file_read instead of your own read routine). This means that your readpage can assume that the page is allocated and locked. No matter how your readpage is called, you'll get a locked page.

- after returning from your readpage function, the VFS calls page_cache_release, which frees the page but does not remove it from LRU caches. (I find the name 'page_cache_release' a bit confusing.) This means that your readpage routine should have done all the necessary actions prior to this very last freeing of the page: that may include setting uptodate/locked/whatever bits, removing the page from LRU caches, and more.

Now let's go into my cryptfs_readpage function. Remember that my situation may be (slightly) more complicated than yours. My stackable file systems must emulate both a VFS and a lower-level f/s: they act as a VFS toward the lower-level f/s (say, ext2fs), and at the same time they look like a lower-level f/s to the real VFS.
This is what I do in cryptfs_readpage():

- find a page_hash() of the lower-level inode, for the same offset. This is part of how I emulate a VFS to the lower-level f/s: the VFS looked for a page hash at a given offset, so I repeat the same operation on the lower-level inode/filesystem (which I sometimes call the "hidden" inode or filesystem).

- find and lock a page at the lower level, for the same offset. Remember that the VFS called me w/ a locked page, so here I'm preparing a lower-level f/s page and locking it, before calling the lower-level file system's readpage().

- if I cannot find such a page, I allocate it in kernel space and add it to the page cache (add_to_page_cache).

- I call the readpage() routine of the lower-level f/s, and make sure I have valid data (wait_on_page). At this point, I have two pages: the hidden_page, which is the one I retrieved from the lower-level f/s, and the 'page' which was passed to me by the VFS. In cryptfs, the hidden_page is encrypted, so now I decrypt the data from the hidden_page into the page which was passed to me.

- I use page_address() to "map" a page's data into kernel memory, so I can copy and manipulate it as any other "char *" buffer. I map both pages, then I call my decryption routine to decode the hidden_page into the current page. This is done, of course, with the locks held on both pages.

- now I have valid, decrypted data in the page that I got from the VFS. I unlock it, set the uptodate flag, and wake up anyone who might be waiting on it.

- finally, right before I return, I call __free_page() (which is the same as page_cache_release) on the hidden_page. Since the VFS will do the same on my page, I must free the hidden page which I allocated.

In all this fun, sooner or later, your flushpage routine may be called (via truncate_inode_pages) from multiple places, such as iput(), vmtruncate(), and more.
Some are invoked [in]directly by your f/s code or the VFS, while at other times it's the result of a kernel thread that cleans up old unused pages (LRU). All this means that your flushpage() function must do a few more things, and emulate on the lower level what truncate_inode_pages does to cryptfs's pages:

- find the corresponding hidden page and lock it. The f/s's flushpage() routine gets a locked page, but must not unlock it, b/c the VFS will unlock the page.

- call the flushpage routine of the lower-level f/s.

- clear the uptodate flag of the hidden_page, remove it from the LRU cache (lru_cache_del), call remove_inode_page on it as well, unlock the hidden_page, and free it. These actions are mostly what truncate_inode_pages does to your page; therefore cryptfs must do the same to the lower-level f/s.

The above explanation is a simplified version of what really goes on, and of what my stackable f/s modules do. I didn't explain the other cases, nor the interaction with other parts of the same file system.
Re: d_path or way to get full pathname
Marc, regarding your dentry full-pathname function (and Serge's): I've not yet looked at either in detail, but what I think is needed (assuming it's not there already) is this:

- a flag to pass to the function: if true, return full pathnames starting w/ a '/' and crossing mount points. There are cases where you want one behavior and cases where you want the other.

- if the flag is false, return the pathname relative to this super_block.

- a faster method than constant shifting of the bytes. This is a serious one. If you keep shifting bytes for each component, your complexity is O(n^2). You can make it two linear passes, O(n), as follows: (1) first, scan the dentries and their parents in reverse, crossing mount points as needed; (2) sum up the total number of bytes needed, from the qstr structures; (3) allocate the correct number of bytes (or verify that the user passed enough space); (4) repeat the reverse traversal, but this time copy the bytes into the output buffer directly at their offsets into the buffer (don't copy any terminating nulls, so you won't trash the beginning of the component that follows).

I'll be happy to help anyone write or test such a version (I started something similar a while back). I think it would be a useful small addition to the kernel.

Erez.
Re: Web FS Q
In message <[EMAIL PROTECTED]>, "David Bialac" writes:

> For fun (and because I think it might be a useful feature), I'm working
> on a filesystem that allows a website to be mounted as a local
> filesystem. I'm starting to dive in, and successfully have the kernel
> recognizing that webfs exists, so it's now time to write some socket
> code. Amongst the thing I want to put into this system is caching of
> server data locally, specifically on the local filesystem. The
> question I have is, can one filesystem ask to write to another? I
> don't see anythinng in there that seems to attempt to do this, so I
> need to be sure said is possible.

As others have already said, this isn't a new idea, and there are several alternatives you should look at first. There are also issues wrt mapping the HTTP protocol to a file-system interface that you should be aware of. I believe this was discussed again on linux-fsdevel and the freebsd-fs mailing lists just in the past 4-6 weeks. However, there's nothing wrong with doing such a project for fun. But if you can find something that hasn't been done before, that would be even better. If you think that stackable file systems could help your project, see my stackable f/s templates work: http://www.cs.columbia.edu/~ezk/research/software/

> Why this is not as stupid as it sounds: Imagine the internet-enabled
> appliance scenario: today, if say a DVD manufacturer has a glitch in
> their DVD player, the only fix is to take it in for repair. If the
> device was internet-enabled, and further read its software off the web,
> it could conceivably update software on the fly without the inconvience
> of the user going without his player. Nother scenario: you could save
> your files to a website run anywhere, then download them anywhere.
This idea somewhat matches some of the ideas discussed in the Usenix '94 "unionfs" paper: a way to merge a read-only f/s with a writable f/s, where the latter includes patches and updates to the read-only stuff (which may come from a cdrom).

> David Bialac
> [EMAIL PROTECTED]

PS. I don't see a problem with writing *loadable* kernel code. It doesn't make the core kernel bigger; only run-time kernel memory consumption increases. Kernel modules aren't a solution for every application, though. If speed is not a concern, user-level file servers are easier to write and debug. Otherwise, I personally think that all file systems should be in the kernel (loadable or statically compiled) for performance reasons.

Erez.
Re: Minimal fs module won't unmount
In message <[EMAIL PROTECTED]>, Malcolm Beattie writes:

> I sent this to linux-kernel 10 days ago but got zero responses so I'll
> try here in case I get luckier.

I don't recall seeing your message from 10 days ago. Maybe it didn't get to others as well.

> I'm writing a little "fake" filesystem module which, when mounted on a
> mountpoint, makes the root of the new filesystem be a "magic" symlink.
> It all works fine except that the filesystem won't unmount. strace
> shows that oldumount is returning EINVAL. What the "magic symlink"
> does isn't important here and the following cut-down version displays
> the same problem. Is it something to do with the fact that the core fs
> code expects the root of the new filesystem to be an ordinary
> directory or am I missing something else? Here's the cut-down module
> which simply makes follow_link appear to be your cwd. You can compile
> (under 2.2 or 2.3, though 2.3 is untested) by
>
> cc -c -D__KERNEL__ -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -pipe
> -fno-strength-reduce -m486 -DCPU=486 -DMODULE -DMODVERSIONS -include
> /usr/src/linux/include/linux/modversions.h -I/usr/src/linux/include nullfs.c
>
> (modifying arch-specific options as necessary) and then doing
>
> # insmod ./nullfs.o
> # mkdir /tmp/nullcwd
> # mount -t null none /tmp/nullcwd
> # ls -l /tmp/nullcwd
> lrwxrwxrwx 1 root root 0 Nov 30 12:21 /tmp/nullcwd -> foo
> # ls -l /tmp/nullcwd/
> ...listing of your current working directory...
> # umount /tmp/nullcwd
> umount: none: not found
> umount: /tmp/foo: not mounted
>
> How can I get it to umount properly?

I've not looked at your code, but you might want to see what I do in my wrapfs/lofs during mount and unmount. Usually the main reason why something won't unmount is that you're holding some resources (inodes, dentries, etc.), in which case you get EBUSY. If you're getting EINVAL, the question is where: is your code being invoked at all, or is the VFS giving you the EINVAL?
If your code isn't called, then search the VFS (starting w/ do_umount) to find which code path could return you an EINVAL. I personally found it faster (and more fun :-) to debug VFS code myself by sticking printf's at certain places and building a test kernel with that. BTW, just to avoid any potential problems, mount w/ the real mount-point name instead of 'none'.

Erez.
Re: Oops with ext3 journaling
In message <[EMAIL PROTECTED]>, Pavel Machek writes:

> Hi!
>
> > No, and I'm pretty much convinced now that I'll move to having a
> > private, hidden inode for the journal in the future.
>
> Please don't do that. Current way of switching ext2/ext3 is very
> nice. If someone wants to shoot in their foot...
> Pavel

IMHO, as a long-term solution, ext3 should have as few ways to shoot oneself in the foot as possible. Hackers usually won't do "stupid" things (at least not unintentionally :-), but hordes of Joe-users will.

> I'm really [EMAIL PROTECTED] Look at http://195.113.31.123/~pavel. Pavel
> Hi! I'm a .signature virus! Copy me into your ~/.signature, please!

Erez.
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
In message <[EMAIL PROTECTED]>, Jeff Garzik writes:

> On Thu, 23 Dec 1999, Hans Reiser wrote:
>
> > All I'm going to ask is that if mark_buffer_dirty gets changed again,
> > whoever changes it please let us know this time. The last two times
> > it was changed we weren't informed, and the first time it happened it
> > took a long time to figure it out.
>
> Can't you figure this sort of thing out on your own? Generally if you
> want to stay updated on something, you are the one who needs to do the
> legwork. And grep'ing patches ain't that hard
>
> Jeff

Jeff, Hans is absolutely right. We can all figure it out on our own, and waste many hours re-discovering what others have discovered independently. It's a royal pain and a time sink. I'd rather write new code than try to figure out what changed b/t kernel versions. In my case (stackable f/s), every time there's a change to anything under linux/fs, linux/mm, or the headers, I've got to find out what changed and how it affects my code. It's NOT enough to grep the patches: unified diffs don't give you enough context to understand the overall changes that were made. I have to use emacs's ediff or other methods to find out the meaning and motivation behind a change.

There is no NEWS file for each release. There is no ChangeLog for each release. Actually, there are a few ChangeLog files sprinkled around the sources, but the last time linux-2.3.25/fs/ChangeLog was updated was in 1998. There is no one who summarizes kernel changes. A long time ago, someone used to. I don't remember his name. Is he still doing that?

I maintain a much smaller package (am-utils), and there's no way I could remember what changes I've made throughout the years. That's why I keep detailed ChangeLog and NEWS files w/ my releases. I realize the linux kernel is a much bigger and more complex beast, but shouldn't that be all the more motivation for everyone to keep ChangeLogs?
IMHO, if we want to speed linux development along, we should help document linux. Hans and linux-fsdevel folks: I have a proposal. How would you all feel about forming an informal group that would report changes relevant to f/s developers on this list? (Maybe even on a separate mailing list?) I'm willing to take the time to report whatever VFS changes I find each time I update my stackable f/s code for a new kernel, including when no relevant changes were made (which IMHO is just as important). This effort would help all of us f/s developers, but only if we each take the time to report our findings to this list. The few minutes each person takes to report the findings that relate to their f/s will save numerous other people many hours; overall this would help everyone. We can also make it easy to find these messages in the archives by giving the Subject of such messages a grep-able format, say:

	CHANGE 2.3.17-2.3.18: vm_area_struct->vm_pte renamed vm_private_data

Comments? Erez.
Re: kernel change logs (was Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?)
In message <[EMAIL PROTECTED]>, Jeff Garzik writes:

> [...]
> To sum, documenting changes is a very good idea, notifying specific
> hackers of specific kernel changes is a waste of time [unless they
> are the maintainers of the code being changed, of course].

I agree that notifying individuals doesn't scale. Notifying the list as a whole does.

> Jeff

Erez.
Re: Ext2 / VFS projects
In message <[EMAIL PROTECTED]>, Matthew Wilcox writes:

> Greetings. Ted Ts'o recently hosted an ext2 puffinfest where we
> discussed the future of the VFS and ext2. Ben LaHaise, Phil Schwan,
[...]

Also, I really hope that my remaining (small, passive) patches to the VFS to support stackable file systems will be incorporated soon.

Cheers, Erez.
Re: Ext2 / VFS projects
In message <[EMAIL PROTECTED]>, Tigran Aivazian writes: > I noticed the stackable fs item on Alan's list ages ago but there was no > pointer to the patch (I noticed FIST stuff but surely that is not a "small > passive patch" you are referring to?)

Yes, the patches are small and passive. No new vfs/mm code is added or changed! The most important part of my patches has already been included since 2.3.17; that was an addition/renaming of a private field in struct vm_area_struct. What's left are things that are necessary to support stacking for the first time in linux: exposing some functions/symbols from {mm,fs}/*.c, adding externs to headers, additions to ksyms.c, and moving some macros and inline functions from private .c files to a header so they can be included in any file system.

I've used these patches on dozens of linux machines for the past 2+ years and have had no problems. I constantly get people asking me when my patches will become part of the main kernel. I have about 9 active developers who write file systems using my templates. I've had more than 21,000 downloads of my templates in the past two years.

> So, my point is - if you point everyone to those patches, someone might > help Alan out if one feels like it (and has time).

http://www.cs.columbia.edu/~ezk/research/software/fist-patches/

The latest 2.3 patches at that URL include two things: my small main-kernel patches, and a fully working lofs. The lofs, of course, is several thousand lines of code, but it is not strictly necessary to include it with the main kernel; it can be distributed and built separately, just as my other f/s modules are. However, I do think that lofs is a useful enough f/s that it should be part of the main kernel. If you go to the 2.3 directory under the above URL, there's a README describing the latest 2.3 patches. I've included it below, so everyone can read it and see what my patches do, and how harmless they are.
BTW, I've got a prototype unionfs for linux if anyone is interested. > Regards, > Tigran. As always, I'll be delighted to help *anyone* use my work, and would love to help the linux maintainers incorporate my patches, answer any concerns they might have, etc. Cheers, Erez.

==

Summary of changes for 2.3.25 to support stackable file systems and lofs. (Note: some of my previous patches had been incorporated in 2.3.17.)

(1) Created a new header file include/linux/dcache_func.h. This header file contains dcache-related definitions (mostly static inlines) used by my stacking code and by fs/namei.c. Ion and I tried to put these definitions in fs.h and dcache.h to no avail. We would have to make lots of changes to fs.h or dcache.h and other .c files just to get these few definitions in. In the interest of simplicity and minimizing kernel changes, we opted for a new, small header file. This header file is included in fs/namei.c because everything in dcache_func.h was taken from fs/namei.c. And of course, these static inlines are useful for my stacking code.

If you don't like the name dcache_func.h, maybe you can suggest a better name. Maybe namei.h? If you don't like having a new header file, let me know what you'd prefer instead and I'll work on it, even if it means making more changes to fs.h, namei.c, and dcache.h...

(2) Inline functions moved from linux/{fs,mm}/*.c to header files so they can be included in the same original source code as well as stackable file systems:

    check_parent macro (fs/namei.c -> include/linux/dcache_func.h)
    lock_parent        (fs/namei.c -> include/linux/dcache_func.h)
    get_parent         (fs/namei.c -> include/linux/dcache_func.h)
    unlock_dir         (fs/namei.c -> include/linux/dcache_func.h)
    double_lock        (fs/namei.c -> include/linux/dcache_func.h)
    double_unlock      (fs/namei.c -> include/linux/dcache_func.h)

(3) Added to include/linux/fs.h an extern definition for default_llseek.
(4) include/linux/mm.h: also added extern definitions for

    filemap_swapout
    filemap_swapin
    filemap_sync
    filemap_nopage

so they can be included in other code (esp. stackable f/s modules).

(5) Added EXPORT_SYMBOL declarations in kernel/ksyms.c for functions which I now exposed to (stackable f/s) modules:

    EXPORT_SYMBOL(___wait_on_page);
    EXPORT_SYMBOL(add_to_page_cache);
    EXPORT_SYMBOL(default_llseek);
    EXPORT_SYMBOL(filemap_nopage);
    EXPORT_SYMBOL(filemap_swapout);
    EXPORT_SYMBOL(filemap_sync);
    EXPORT_SYMBOL(remove_inode_page);
    EXPORT_SYMBOL(swap_free);
    EXPORT_SYMBOL(nr_lru_pages);
    EXPORT_SYMBOL(console_loglevel);

(6) mm/filemap.c: made the function filemap_nopage non-static, so it can be called from other places. This was not an inline function, so there's no performance impact.
Re: Ext2 / VFS projects
In message <[EMAIL PROTECTED]>, Manfred Spraul writes: > Erez Zadok wrote: > > [...] > > (2) Inline functions moved from linux/{fs,mm}/*.c to header files so they > > can be included in the same original source code as well as stackable > > file systems: > > > > check_parent macro (fs/namei.c -> include/linux/dcache_func.h) > > lock_parent (fs/namei.c -> include/linux/dcache_func.h) > > get_parent (fs/namei.c -> include/linux/dcache_func.h) > > unlock_dir (fs/namei.c -> include/linux/dcache_func.h) > > double_lock (fs/namei.c -> include/linux/dcache_func.h) > > double_unlock (fs/namei.c -> include/linux/dcache_func.h) > > > That sounds like a good idea: fs/nfsd/vfs.c currently contains copies of > most of these functions... I agree. I didn't want to make copies of those b/c I got burnt in the past when they changed subtly and I didn't notice the change. > -- > Manfred Erez.
Re: [Announcement] inode_operations/super_operations changes
In message <[EMAIL PROTECTED]>, Alexander Viro writes: > Summary: > > 1) s_op->notify_change() - went into inode_operations (called > ->setattr(), otherwise the same). Thanks for this info Alexander. I also noticed that i_op->getattr was added, but it is not called from anywhere yet, right? So I left it NULL in my inode_operations structure for my stackable templates. Could you inform -fs when the VFS starts to use it? Thanks, Erez.
Re: [Announce] VFS changes (2.3.51-1)
Thanks Al. VFS changes are important to any F/S developer, but even more important to me since my stackable templates must behave like both a lower-level F/S and a VFS. Ion and I updated our templates to 2.3.49 just a couple of days ago, taking into account the previous set of VFS changes. I was under the impression that this late into 2.3, no such major changes were going to happen, so that we get a 2.4 soon, not another long series like 2.1. Do you know if there are more (VFS) changes planned in 2.3, and if so, which ones? I would prefer to wait until all changes are in, rather than spend time on my stacking templates for each change; it would be a smaller effort doing it all at once. BTW, the new vfs_* things are very nice. They are "stacking friendly." But Ion and I noticed other problems that make it hard to do clean stacking. For example, there are asymmetries b/t the creation and deletion of inodes and dentries; a file system can get notified when a refcnt of an object is decreased, but not when it is increased, and more. Ion will send a separate detailed mail about that a little later. If you're doing all this VFS work, are you open to suggestions that would make stacking cleaner and more flexible? We were going to hold off submitting such changes until 2.5, but if 2.3 is going to stretch further, we might as well do it now. Thanks, Erez.
the last remaining patches to support stacking modules (for 2.3.49)
Linus, Here are the last remaining kernel patches to support stacking, at least up to 2.3.49. As you can see, it's very small, passive stuff. Hopefully you can include it soon. I didn't include a full lofs with this patch, b/c more VFS changes are coming up soon, which will definitely require changes to the lofs templates code (but hopefully nothing to the kernel itself). Erez.

==
diff -ruN linux-2.3.49-vanilla/include/linux/fs.h linux-2.3.49-fist/include/linux/fs.h
--- linux-2.3.49-vanilla/include/linux/fs.h	Thu Mar  2 17:01:26 2000
+++ linux-2.3.49-fist/include/linux/fs.h	Sun Mar  5 03:26:34 2000
@@ -949,6 +949,8 @@
 typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long);
 
+/* needed for stackable file system support */
+extern loff_t default_llseek(struct file *file, loff_t offset, int origin);
 extern struct dentry * lookup_dentry(const char *, struct dentry *, unsigned int);
 extern struct dentry * __namei(const char *, unsigned int);
 
diff -ruN linux-2.3.49-vanilla/kernel/ksyms.c linux-2.3.49-fist/kernel/ksyms.c
--- linux-2.3.49-vanilla/kernel/ksyms.c	Sun Feb 27 01:34:27 2000
+++ linux-2.3.49-fist/kernel/ksyms.c	Tue Mar  7 04:22:28 2000
@@ -234,12 +234,12 @@
 EXPORT_SYMBOL(page_symlink_inode_operations);
 EXPORT_SYMBOL(block_symlink);
 
-/* for stackable file systems (lofs, wrapfs, etc.) */
-EXPORT_SYMBOL(add_to_page_cache);
+/* for stackable file systems (lofs, wrapfs, cryptfs, etc.) */
+EXPORT_SYMBOL(default_llseek);
 EXPORT_SYMBOL(filemap_nopage);
 EXPORT_SYMBOL(filemap_swapout);
 EXPORT_SYMBOL(filemap_sync);
-EXPORT_SYMBOL(remove_inode_page);
+EXPORT_SYMBOL(lock_page);
 
 #if !defined(CONFIG_NFSD) && defined(CONFIG_NFSD_MODULE)
 EXPORT_SYMBOL(do_nfsservctl);
==
2.3.99-pre1 VFS comments
Hi Al, Ion and I worked on updating our stackable templates for 2.3.99-pre1 for the past few days. We have found various oddities and other possible problems. We promised to report on anything interesting we find wrt the VFS, so here it is. We are willing to test and submit patches for anything below that you think is worth it.

(1) Asymmetry b/t double_lock and double_unlock: Only double_unlock does dput() on the two dentries. The only place where double_lock is called is in do_rename, and do_rename already calls get_parent(), which increments the reference counts. We can simplify the code and make it symmetric by moving the two get_parent() calls into double_lock().

(2) vfs_readlink: It would be nice if all vfs_* routines were essentially wrappers that did some checking and then called the file-system-specific method. This isn't the case for vfs_readlink. (BTW, we like the vfs_* routines very much!)

(3) "__" routines: In fs/namei.c, vfs_follow_link simply calls __vfs_follow_link with the same, unchanged args. Can't we simplify and get rid of the __vfs_follow_link routine? Then at least page_follow_link could call vfs_follow_link directly.

(4) permission: fs/namei.c:permission() should probably be renamed vfs_permission, b/c it is a generic VFS routine (and we make direct use of it in lofs). BTW, with stacking, "permission" gets called O(n^2) times in total. I'm not sure there's anything that can be done about it now, but it's something to keep in mind. Here's the recursive call sequence when we have lofs mounted on, say, ext2 (just one stack level):

    vfs_create
      may_create
        permission
          lofs_permission
            permission
              ext2_permission
      lofs_create
        vfs_create
          may_create
            permission
              ext2_permission

This happens b/c we use the nicer/newer vfs_* routines. However, since permission() is also called from places other than the vfs_* routines, we must define permission in lofs, and thus it gets called recursively.
We thought we could solve the problem by not defining our own permission method, b/c the real routines (mkdir, create, etc.) will call permission on the lower f/s via the vfs_* routines, but we couldn't do it b/c permission() is called explicitly in open_namei(). One possible solution is creating a vfs_open() routine which will do most of the checks in filp_open, including permission(), but will take a dentry and not a filename. Then filp_open could call vfs_open, and so could we; right now we have to duplicate most of the filp_open code in our ->open function. This would also nicely solve the recursive permission problem, as well as clean up filp_open().

(5) llseek: In fs/read_write.c, llseek should probably be renamed vfs_llseek, and the un/lock_kernel that it calls should be moved to sys_llseek. Then vfs_llseek should be exported so we can use it.

(6) vfs_readdir: vfs_readdir doesn't take the same argument list as ->readdir, which can be *very* confusing since all the other vfs_* routines use the same prototypes as the methods they wrap. I suggest you make the two the same: swap the "dirent" and "filldir" args in vfs_readdir() so they're the same everywhere. We've had some amusing (read: nasty :) kernel panics b/c of that.

Cheers, Erez & Ion.
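To make the O(n^2) claim concrete, here is a toy userspace model of the call pattern above. The function names echo the kernel's, but none of this is kernel code, and the counting logic is my own: each stacked layer's ->create re-enters vfs_create, and each entry into permission() cascades down to the real f/s.

```c
#include <assert.h>

/* Toy model: layers 0..depth-1 are stacked file systems (e.g. lofs);
 * layer `depth` is the real one (e.g. ext2). */
static int perm_calls;

static void permission(int layer, int depth)
{
	perm_calls++;			/* one entry into permission() */
	if (layer < depth)
		permission(layer + 1, depth);	/* lofs_permission -> lower */
}

static void vfs_create(int layer, int depth)
{
	permission(layer, depth);	/* may_create -> permission */
	if (layer < depth)
		vfs_create(layer + 1, depth);	/* lofs_create -> vfs_create */
}

static int count_permission_calls(int depth)
{
	perm_calls = 0;
	vfs_create(0, depth);
	return perm_calls;
}
```

With one stack level this yields the three permission() entries visible in the trace above; in general it is (d+1)(d+2)/2 entries for d stacked layers, i.e. quadratic growth.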
cleaning up 2.3.99-pre3 fs/exec.c:open_exec()
Al, this is the current (and new) open_exec():

    struct file *open_exec(const char *name)
    {
    	struct dentry *dentry;
    	struct file *file;

    	lock_kernel();
    	dentry = lookup_dentry(name, NULL, LOOKUP_FOLLOW);
    	file = (struct file*) dentry;
    	if (!IS_ERR(dentry)) {
    		file = ERR_PTR(-EACCES);
    		if (dentry->d_inode && S_ISREG(dentry->d_inode->i_mode)) {
    			int err = permission(dentry->d_inode, MAY_EXEC);
    			file = ERR_PTR(err);
    			if (!err) {
    				file = dentry_open(dentry, O_RDONLY);
    out:
    				unlock_kernel();
    				return file;
    			}
    		}
    		dput(dentry);
    	}
    	goto out;
    }

The exit conditions from it are rather odd. It ends with a "goto out" to the middle of the code, just so it can return an arg and unlock the kernel. Also, it has a few too many nested if's. Ion and I rewrote it more cleanly and clearly. Here's a small patch. Erez.

*** linux-2.3-vanilla/fs/exec.c	Fri Mar 24 12:34:59 2000
--- linux-2.3.bad/fs/exec.c	Fri Mar 24 22:13:59 2000
***************
*** 319,343 ****
  {
  	struct dentry *dentry;
  	struct file *file;
  
  	lock_kernel();
  	dentry = lookup_dentry(name, NULL, LOOKUP_FOLLOW);
! 	file = (struct file*) dentry;
! 	if (!IS_ERR(dentry)) {
  		file = ERR_PTR(-EACCES);
! 		if (dentry->d_inode && S_ISREG(dentry->d_inode->i_mode)) {
! 			int err = permission(dentry->d_inode, MAY_EXEC);
! 			file = ERR_PTR(err);
! 			if (!err) {
! 				file = dentry_open(dentry, O_RDONLY);
! out:
! 				unlock_kernel();
! 				return file;
! 			}
! 		}
! 		dput(dentry);
  	}
  	goto out;
  }
  
  int kernel_read(struct file *file, unsigned long offset,
--- 319,351 ----
  {
  	struct dentry *dentry;
  	struct file *file;
+ 	int err;
  
  	lock_kernel();
  	dentry = lookup_dentry(name, NULL, LOOKUP_FOLLOW);
! 	if (IS_ERR(dentry)) {
! 		file = (struct file*) dentry;
! 		goto out;
! 	}
! 	if (!dentry->d_inode || !S_ISREG(dentry->d_inode->i_mode)) {
  		file = ERR_PTR(-EACCES);
! 		goto out_dput;
! 	}
! 
! 	err = permission(dentry->d_inode, MAY_EXEC);
! 	if (err) {
! 		file = ERR_PTR(err);
! 		goto out_dput;
  	}
+ 
+ 	file = dentry_open(dentry, O_RDONLY);
  	goto out;
+ 
+ out_dput:
+ 	dput(dentry);
+ out:
+ 	unlock_kernel();
+ 	return file;
  }
  
  int kernel_read(struct file *file, unsigned long offset,
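For readers unfamiliar with the idiom the rewrite leans on: the kernel encodes an errno value in the returned pointer itself, so open_exec() can hand back either a valid struct file or an error through one return value. A minimal userspace re-implementation for illustration (the -1000 window is a simplification of the kernel's actual error-range check):

```c
#include <assert.h>
#include <errno.h>

/* Simplified copies of the kernel's ERR_PTR/PTR_ERR/IS_ERR helpers:
 * small negative errno values are cast into the very top of the address
 * space, where no valid object can live. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)(-1000L);
}

static int some_object;		/* stands in for a real struct file */
```

This is what makes the "one return variable, straight-line gotos" structure of the rewritten open_exec() possible.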
stacking patches and other cleanups for 2.3.99-pre3
Hi Al, Ion and I looked at -pre3 and found you began doing what we suggested earlier: splitting filp_open() into a generic part and an open(2)-specific part. Thanks! We've tried to use it, but we had to change the code to pass a "mode" variable to dentry_open(); otherwise, dentry_open munges the mode/flags in a way that is undesirable for stacking. Also, with our changes, the open(2)-specific stuff was moved back to where it belongs, and dentry_open() became more generic.

Next, we cleaned up the logic in dentry_open, filp_open, and open_exec. They all had hard-to-follow nested "if" statements. With our restructuring, it's easier to follow the execution flow of this relatively new code. Also, we were able to eliminate a couple of cases where variables were computed unnecessarily or more than once.

Finally, I'm including in our patch some very small stuff that's needed for stacking: exporting a couple more symbols to modules, and one extern for default_llseek() in fs.h. Could you please apply those? They are pretty harmless, very small, but necessary.

We tested the patch below with and without our stacking. We now have an lofs/wrapfs/cryptfs which work with 2.3.99-pre3, using all the latest VFS code changes, and we also fixed all known reported bugs in the templates. We'd love to [re]submit lofs for inclusion in the kernel, as soon as the stuff below is included. Enjoy. Erez.
==
diff -ruN linux-2.3.99-pre3-vanilla/fs/exec.c linux-2.3.99-pre3-fist/fs/exec.c
--- linux-2.3.99-pre3-vanilla/fs/exec.c	Fri Mar 24 01:38:50 2000
+++ linux-2.3.99-pre3-fist/fs/exec.c	Sat Mar 25 01:34:17 2000
@@ -319,25 +319,33 @@
 {
 	struct dentry *dentry;
 	struct file *file;
+	int err;
 
 	lock_kernel();
 	dentry = lookup_dentry(name, NULL, LOOKUP_FOLLOW);
-	file = (struct file*) dentry;
-	if (!IS_ERR(dentry)) {
+	if (IS_ERR(dentry)) {
+		file = (struct file*) dentry;
+		goto out;
+	}
+	if (!dentry->d_inode || !S_ISREG(dentry->d_inode->i_mode)) {
 		file = ERR_PTR(-EACCES);
-		if (dentry->d_inode && S_ISREG(dentry->d_inode->i_mode)) {
-			int err = permission(dentry->d_inode, MAY_EXEC);
-			file = ERR_PTR(err);
-			if (!err) {
-				file = dentry_open(dentry, O_RDONLY);
-out:
-				unlock_kernel();
-				return file;
-			}
-		}
-		dput(dentry);
+		goto out_dput;
+	}
+
+	err = permission(dentry->d_inode, MAY_EXEC);
+	if (err) {
+		file = ERR_PTR(err);
+		goto out_dput;
 	}
+
+	file = dentry_open(dentry, FMODE_READ, O_RDONLY);
 	goto out;
+
+out_dput:
+	dput(dentry);
+out:
+	unlock_kernel();
+	return file;
 }
 
 int kernel_read(struct file *file, unsigned long offset,
diff -ruN linux-2.3.99-pre3-vanilla/fs/open.c linux-2.3.99-pre3-fist/fs/open.c
--- linux-2.3.99-pre3-vanilla/fs/open.c	Thu Mar 23 16:11:49 2000
+++ linux-2.3.99-pre3-fist/fs/open.c	Sat Mar 25 07:30:52 2000
@@ -631,7 +631,7 @@
 }
 
 /*
- * Note that while the flag value (low two bits) for sys_open means:
+ * Note that while the open_mode value (low two bits) for sys_open means:
 * 00 - read-only
 * 01 - write-only
 * 10 - read-write
@@ -647,23 +647,23 @@
 struct file *filp_open(const char * filename, int flags, int mode, struct dentry * base)
 {
 	struct dentry * dentry;
-	int flag,error;
+	int open_mode, open_namei_mode;
 
-	flag = flags;
-	if ((flag+1) & O_ACCMODE)
-		flag++;
-	if (flag & O_TRUNC)
-		flag |= 2;
+	open_namei_mode = flags;
+	open_mode = ((flags + 1) & O_ACCMODE);
+	if (open_mode)
+		open_namei_mode++;
+	if (open_namei_mode & O_TRUNC)
+		open_namei_mode |= 2;
 
-	dentry = __open_namei(filename, flag, mode, base);
-	error = PTR_ERR(dentry);
-	if (!IS_ERR(dentry))
-		return dentry_open(dentry, flags);
+	dentry = __open_namei(filename, open_namei_mode, mode, base);
+	if (IS_ERR(dentry))
+		return (struct file *)dentry;
 
-	return ERR_PTR(error);
+	return dentry_open(dentry, open_mode, flags);
 }
 
-struct file *dentry_open(struct dentry *dentry, int flags)
+struct file *dentry_open(struct dentry *dentry, int mode, int flags)
 {
 	struct file * f;
 	struct inode *inode;
@@ -674,7 +674,7 @@
 	if (!f)
 		goto cleanup_dentry;
 	f->f_flags = flags;
-	f->f_mode = (flags+1) & O_ACCMODE;
+	f->f_mode = mode;
 	inode = dentry->d_inode;
 	if (f->f_mode & FMODE_WRITE) {
 		error = get_write_access(inode);
diff -ruN linux-2.3.99-p
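The (flags+1) & O_ACCMODE trick this patch hoists out of dentry_open() is worth spelling out: sys_open's low two flag bits are 00/01/10 for read-only/write-only/read-write, and adding one turns them into a read/write permission bitmask. A standalone sketch, assuming the usual Linux flag values (the FMODE_* constants are copied from the kernel; the helper name is mine):

```c
#include <assert.h>
#include <fcntl.h>

#define FMODE_READ  1
#define FMODE_WRITE 2

/* Map open(2) access flags (O_RDONLY=0, O_WRONLY=1, O_RDWR=2) to the
 * f_mode bitmask: adding 1 yields 01, 10, and 11 respectively. */
static int open_mode_from_flags(int flags)
{
	return (flags + 1) & O_ACCMODE;
}
```

Our patch makes the caller compute this once and hand the result to dentry_open(), instead of dentry_open() deriving it from flags it may not have been given faithfully.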
Re: __block_prepare_write(): bug?
In message <[EMAIL PROTECTED]>, Ion Badulescu writes: [...] > The current implementation will also populate the page cache with pages > that are not Uptodate, but are not Locked either, which is clearly a bug. > It will always happen if there is a partial write to a page, e.g. if a > program creates a file and then writes 1.5k worth of data, on a 1k-block > filesystem. > > It should be fixed either by getting all the buffers within the page > Uptodate, or by throwing away the page at the end of the write operation. > > > Ion

Right. This messed up our stacking code a bit. The VFS essentially does this in generic_file_read:

    read_cache_page()
    wait_on_page()
    if (!Page_uptodate()) {
        report an error
    }

Since our stacking behaves like a VFS, we have to reproduce the above code in our readpage(). For some file systems, such as cryptfs, there are two pages in memory for each normal page: one ciphertext and one cleartext. But now we have a problem in the following scenario:

(1) you copy a file through the lower-level file system (ext2) which has a 1k block size for a 4k page size (intel)
(2) the file you copy isn't an exact multiple of PAGE_CACHE_SIZE
(3) the caching at the ext2 level will put the last page in the cache, with only some of the buffers being BH_Uptodate. This is fs/buffer.c:__block_commit_write(), which ext2 uses. The code in that function will not set the page uptodate flag on partial pages that only have a few buffers uptodate.

So now we have a page in the cache that is not up-to-date.
Now see what happens when our stacking layer executes the code similar to generic_file_read() in our own readpage():

    read_cache_page(lower_page)       -> we find it
    wait_on_page(lower_page)          -> page not locked, no more wait
    if (!Page_uptodate(lower_page)) { -> page is NOT uptodate
        report an error               -> we flag an error
    }

There is no way for us to fix the problem in our stacking code b/c we cannot distinguish b/t a page that is truly not up-to-date and a partial page such as the last page of a file just written. Note also that there is no problem if the file is written *through* our stacking layer, b/c then we can force the up-to-date flag on the cached pages. There is also no problem if we read a file which was not cached at the ext2 level and we read it through our stacked layer; in that case, the page comes back up-to-date and we're ok. It's only when __block_commit_write() runs that we have a problem.

Summary: a page should not be in the cache and not be up-to-date. If it is not up-to-date, then it should also probably be locked, but only b/c it is probably in transit from the disk to the cache. We have tested the patch included here, which tries to ensure that no pages are left in the cache while not up-to-date; it fixes __block_prepare_write(). Can someone who knows the buffer.c code well comment on this? I can't see a way out of this situation without one of the following:

- no partial pages are left in the cache (our patch)
- partial pages are put in the cache if they are the last page of a file, but they are then marked up-to-date, and the rest of the code is changed to handle this special situation
- a new flag is added to pages that indicates that the page is partial (last page of a file) *and* up-to-date. That way, everyone can write code that handles this situation.

Thanks, Erez.
diff -ruN linux-2.3.99-pre3-vanilla/fs/buffer.c linux-2.3.99-pre3-fist/fs/buffer.c
--- linux-2.3.99-pre3-vanilla/fs/buffer.c	Tue Mar 21 14:30:08 2000
+++ linux-2.3.99-pre3-fist/fs/buffer.c	Mon Apr  3 08:59:11 2000
@@ -1448,10 +1448,6 @@
 		if (!bh)
 			BUG();
 		block_end = block_start+blocksize;
-		if (block_end <= from)
-			continue;
-		if (block_start >= to)
-			break;
 		bh->b_end_io = end_buffer_io_sync;
 		if (!buffer_mapped(bh)) {
 			err = get_block(inode, block, bh, 1);
@@ -1459,10 +1455,15 @@
 				goto out;
 			if (buffer_new(bh)) {
 				unmap_underlying_metadata(bh);
-				if (block_end > to)
-					memset(kaddr+to, 0, block_end-to);
-				if (block_start < from)
-					memset(kaddr+block_start, 0, from-block_start);
+				if (block_end <= from || block_start >= to)
+					memset(kaddr+block_start, 0, block_end);
+				else {
+					if (block_end > to)
+						memset(kaddr+to, 0, block_end-to);
+					if
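The failure mode described above is easy to model in userspace. In this toy model (all names and the fixed four-buffers-per-page layout are mine, not kernel code), a commit that covers only some of a page's buffers leaves the page unlocked in the cache yet not uptodate, which is exactly the state a stacked readpage must misread as an error:

```c
#include <assert.h>

#define BUFS_PER_PAGE 4		/* e.g. a 4k page over 1k blocks */

struct model_page {
	int locked;
	int uptodate;
	int buf_uptodate[BUFS_PER_PAGE];
};

/* Model of __block_commit_write(): mark buffers [0, nbufs) uptodate and
 * set the page uptodate only if *all* of its buffers are. */
static void model_commit_write(struct model_page *p, int nbufs)
{
	int i, all = 1;

	for (i = 0; i < nbufs; i++)
		p->buf_uptodate[i] = 1;
	for (i = 0; i < BUFS_PER_PAGE; i++)
		if (!p->buf_uptodate[i])
			all = 0;
	p->uptodate = all;
	p->locked = 0;		/* unlocked, but possibly still !uptodate */
}

/* What a stacked readpage sees: unlocked and !uptodate looks like a
 * genuine read error, even for a perfectly valid partial last page. */
static int stacked_read_sees_error(const struct model_page *p)
{
	return !p->locked && !p->uptodate;
}
```

The model shows why no amount of logic in the stacking layer can resolve the ambiguity: both the error case and the partial-last-page case present the same two flags.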
Re: __block_prepare_write(): bug?
In message <[EMAIL PROTECTED]>, Alexander Viro writes: > > > On Wed, 5 Apr 2000, Erez Zadok wrote: > > > - if (block_start >= to) > > - break; > > bh->b_end_io = end_buffer_io_sync; > > if (!buffer_mapped(bh)) { > > err = get_block(inode, block, bh, 1); > > And there you go: bloody thing bumps the size of every file to 4k > boundary. Which is _not_ going to make fsck[1] happy, since ->i_size is > not consistent with the block pointers in inode. get_block() has side > effects, damnit. > > [1] (8), that is. We were not sure our patch was right, and now we are certain it isn't. Thanks to you and Erik for pointing out these problems. Maybe all that's needed is more documentation in the code? Either way, we'll have to change our stacking code so that it'll probably do a readpage after a wait_on_page that isn't uptodate. Thanks, Erez.
new VFS method sync_page and stacking
Background: my stacking code for linux is minimal. I only stack on things I absolutely have to. By "stack on" I mean that I save a link/pointer to a lower-level object in the private data field of an upper-level object. I do so for struct file, inode, dentry, etc. But I do NOT stack on pages. Doing so would complicate stacking considerably. So far I was able to avoid this b/c every function that deals with pages is also passed a struct file/dentry, so I can find the correct lower page.

The new method, sync_page(), is only passed a struct page. So I cannot stack on it! If I have to stack on it, I'll have to either

(1) complicate my stacking code considerably by stacking on pages. This is impossible for my stackable compression file system, b/c the mapping of upper and lower pages is not 1-to-1.

(2) change the kernel so that every instance of sync_page is passed the corresponding struct file. This isn't pretty either.

Luckily, sync_page isn't used much. Only nfs seems to use it at the moment. All other file systems which define ->sync_page use block_sync_page(), which is defined as:

    int block_sync_page(struct page *page)
    {
        run_task_queue(&tq_disk);
        return 0;
    }

This is confusing. Why would block_sync_page ignore the page argument and call something else? The name "block_sync_page" might be misleading. The only thing I can think of is that block_sync_page is a placeholder for a time when it would actually do something with the page.

Anyway, since sync_page appears to be an optional method, I've tried my stacking without defining my own ->sync_page. Preliminary results show it seems to work. However, if at any point I have to define ->sync_page and call the lower file system's ->sync_page, I'd urge a change in the prototype of this method that would make it possible for me to stack this operation. Also, I don't understand what ->sync_page is for in the first place. The name of the function implies it might be something like a commit_write.
Thanks, Erez.
Re: new VFS method sync_page and stacking
In message <[EMAIL PROTECTED]>, "Roman V. Shaposhnick" writes: > On Sun, Apr 30, 2000 at 04:46:37AM -0400, Erez Zadok wrote: > > Background: my stacking code for linux is minimal. I only stack on > > things I absolutely have to. By "stack on" I mean that I save a > > link/pointer to a lower-level object in the private data field of an > > upper-level object. I do so for struct file, inode, dentry, etc. But I > > do NOT stack on pages. Doing so would complicate stacking considerably. > > So far I was able to avoid this b/c every function that deals with pages > > also passes a struct file/dentry to it so I can find the correct lower > > page. > > > > The new method, sync_page() is only passed a struct page. So I cannot > > stack on it! If I have to stack on it, I'll have to either > > If inode will be enough for you than ( as it is implemented in > nfs_sync_page ) you can do something like: >struct inode*inode = (struct inode *)page->mapping->host; Yes I can probably do that. I can get the inode, from it I can get the lower level inode since I stack on inodes. Then I can call grab_cache_page on the i_mapping of the lower inode and given this page's index. I'll give this idea a try. Thanks. > > (2) change the kernel so that every instance of sync_page is passed the > > corresponding struct file. This isn't pretty either. > > >Did you see my letter about readpage ? Nevertheless, I think that first > argument of every function from address_space_operations should be "struct > file *" and AFAIK this is 1) possible with the current kernel 2) will > simplify things a lot since it lets one to see the whole picture: > file->dentry->inode->pages, not the particular spot. Yes, I saw your post. I agree. I'm all for common-looking APIs. > Roman. Erez.
Re: new VFS method sync_page and stacking
In message <[EMAIL PROTECTED]>, Steve Dodd writes: > On Sun, Apr 30, 2000 at 01:44:50PM +0400, Roman V. Shaposhnick wrote: > > >Did you see my letter about readpage ? Nevertheless, I think that first > > argument of every function from address_space_operations should be > > "struct file *" and AFAIK this is 1) possible with the current kernel 2) will > > simplify things a lot since it lets one to see the whole picture: > > file->dentry->inode->pages, not the particular spot. > > But an address_space is (or could be) a completely generic cache. It might > never be associated with an inode, let alone a dentry or file structure. > > For example, I've got some experimental NTFS code which caches all metadata > in the page cache using the address_space stuff. (This /mostly/ works really > well, and makes the code a lot simpler. The only problem is > block_read_full_page() and friends, which do: > > struct inode *inode = (struct inode*)page->mapping->host; > > At the moment I have an evil hack in place -- I'm kludging up an inode > structure and temporarily changing mapping->host before I call > block_read_full_page. I'd really like to see this cleaned up, though I accept > it may not happen before 2.5.) It sounds like different people have possibly conflicting needs. I think any major changes should wait for 2.5. I would also suggest that such significant VFS changes be discussed on this list so we can ensure that we can all get what we need out of the VFS. Thanks. Erez.
Re: new VFS method sync_page and stacking
In message <[EMAIL PROTECTED]>, "Roman V. Shaposhnick" writes: > On Sun, Apr 30, 2000 at 03:28:18PM +0100, Steve Dodd wrote: [...] > > But an address_space is (or could be) a completely generic cache. It > > might never be associated with an inode, let alone a dentry or file > > structure. [...] > Thus my opinion is that address_space_operations should remain > file-oriented ( and if there are no good contras take the first argument > of "struct file *" type ). At the same time it is possible to have completely > different set of methods around the same address_space stuff, but from my > point of view this story has nothing in common with how an *existing* > file-oriented interface should work. > > Thanks, > Roman.

If you look at how various address_space ops are called, you'll see enough evidence of an attempt to make this interface both a file-based interface and a generic cache one (well, at least as far as I understood the code):

(1) generic_file_write (mm/filemap.c) can call ->commit_write with a normal non-NULL file.

(2) block_symlink (fs/buffer.c) calls ->commit_write with NULL for the file arg.

So perhaps, to satisfy the various needs, all address_space ops should be passed a struct file which may be NULL; the individual f/s will have to check for it being NULL and deal with it. (My stacking code already treats commit_write this way.) Erez.
Re: new VFS method sync_page and stacking
In message <[EMAIL PROTECTED]>, Steve Dodd writes: > On Sun, Apr 30, 2000 at 04:46:37AM -0400, Erez Zadok wrote: > > > Background: my stacking code for linux is minimal. I only stack on > > things I absolutely have to. By "stack on" I mean that I save a > > link/pointer to a lower-level object in the private data field of an > > upper-level object. I do so for struct file, inode, dentry, etc. But I > > do NOT stack on pages. Doing so would complicate stacking considerably. > > So far I was able to avoid this b/c every function that deals with pages > > also passes a struct file/dentry to it so I can find the correct lower > > page. > > You shouldn't need to "stack on" pages anyway, I wouldn't have thought. > For each page you can reference mapping->host, which should point to the > hosting structure (at the moment always an inode, but this may change). > > > The new method, sync_page() is only passed a struct page. So I cannot > > stack on it! If I have to stack on it, I'll have to either > > > > (1) complicate my stacking code considerably by stacking on pages. This is > > impossible for my stackable compression file system, b/c the mapping of > > upper and lower pages is not 1-to-1. > > Why can your sync_page implementation not grab the inode from mapping->host > and then call sync_page on the underlying fs' page(s) that hold the data? I can, and I do (at least now :-) I tried it last night and so far it seems to work just fine. > > (2) change the kernel so that every instance of sync_page is passed the > > corresponding struct file. This isn't pretty either. > > I'd like to the see the address_space methods /lose/ the struct file / > struct dentry pointer, but it may be there are situations which require > it. I took a closer look at my address_space ops for stacking. We don't do anything special with the struct file/dentry that we get. We just pass those along (or their lower unstacked counterparts) to other address_space ops which require them. 
We get the corresponding lower pages using the mapping->host inode. I also agree that pages should be associated with the inode, not the file/dentry. So I'm now leaning more towards losing the struct file/dentry from the address_space ops. Furthermore, since the address_space structure showed up relatively recently, we might consider cleaning up this API before 2.4. I believe my stacking code would work fine w/o the struct file/dentry being passed around (Ion, can you verify this please?) Thanks for the info, Steve. Erez.
Re: fs changes in 2.3
In message <[EMAIL PROTECTED]>, [EMAIL PROTECTED] writes:

> On Mon, May 01, 2000 at 06:25:43PM +0200, Peter Schneider-Kamp wrote:
> > I second that. I had to stop maintaining the steganographic file
> > system around 2.3.7 because I did not have that much time to
> > find out where my fs is "broken" and needs to be "fixed".
>
> FYI, the changes which broke filesystems in 2.3.8 were page cache /
> buffer cache changes and as such were VM changes, not VFS. They were
> a major change that was required to make Linux more scalable.

Ideally, developing file systems would involve only the VFS. In practice, it involves the VM as well. I've worked on stacking interfaces for several different OSs, and as much as they all want the VFS and VM to be two completely separate entities, in practice they are not. About half of the effort I spent on my stacking templates was related to the VM and changes to the VM (linux/mm/*.c). IOW, most people who maintain file systems must track changes in both the VFS and the VM.

That said, I'm quite pleased with the changes that happened in the late 2.3.40s: breaking some operations out into address_space ops, and more. IMHO the separation b/t the VFS and the VM became clearer then, and it allowed me to clean up my stacking code quite nicely, as well as make easy use of vfs_ calls, generic_file_{read,write}, and more.

Erez.
stackable f/s patches for 2.3.99-pre6
Hi Al.

First, thanks for applying my last set of patches. Here are a few more that we have, some of which are the result of recent VFS changes. The patches are passive. They are intended to make the Linux kernel more "stacking-friendly" so that my stacking templates can call VFS code directly rather than reproducing it. As a result of my patches, I hope the VFS is gradually becoming cleaner.

Here's what the patch below does:

- dentry_open cannot be called as is from stackable templates because it computes the 'mode' variable internally. So we changed dentry_open such that it accepts a specific mode variable, separate from "flags". Then we assign the mode which was passed to dentry_open to f->f_mode.

- we changed all 3 invocations of dentry_open to pass the correct mode information:

  o fs/exec.c:open_exec(): we pass FMODE_READ to the dentry_open call.

  o fs/open.c:filp_open(): we compute and pass the correct mode to dentry_open based on the flags passed to filp_open. This makes the code in filp_open clearer, rather than distributing mode computations inside and outside various functions.

  o ipc/shm.c:sys_shmat(): we pass "prot", which is already computed in this function, further showing that computing the mode flags inside dentry_open isn't correct; it should be done by the caller of dentry_open().

- we clarified filp_open() by adding a new variable called open_flags, which is computed from the flags passed to filp_open. This is passed to the dentry_open call as the "mode" argument.

- we moved the static inline function sync_page from filemap.c to mm.h so that it can be called from stackable file systems. In many ways this sync_page function is a "VFS" callable function that could be called by other file systems.

Comments are welcome. Let me know if you have any concerns about this patch, and we can work on it some more.

Thanks,
Erez.
## diff -ruN linux-2.3.99-pre6-vanilla/fs/exec.c linux-2.3.99-pre6-fist/fs/exec.c
--- linux-2.3.99-pre6-vanilla/fs/exec.c	Fri Apr 21 16:36:39 2000
+++ linux-2.3.99-pre6-fist/fs/exec.c	Fri Apr 28 22:36:32 2000
@@ -331,7 +331,7 @@
 	int err = permission(nd.dentry->d_inode, MAY_EXEC);
 	file = ERR_PTR(err);
 	if (!err) {
-		file = dentry_open(nd.dentry, nd.mnt, O_RDONLY);
+		file = dentry_open(nd.dentry, nd.mnt, FMODE_READ, O_RDONLY);
 out:
 	unlock_kernel();
 	return file;
diff -ruN linux-2.3.99-pre6-vanilla/fs/open.c linux-2.3.99-pre6-fist/fs/open.c
--- linux-2.3.99-pre6-vanilla/fs/open.c	Mon Apr 24 19:10:27 2000
+++ linux-2.3.99-pre6-fist/fs/open.c	Fri Apr 28 22:36:32 2000
@@ -615,12 +615,12 @@
 }

 /*
- * Note that while the flag value (low two bits) for sys_open means:
+ * Note that while the flags value (low two bits) for sys_open means:
  *	00 - read-only
  *	01 - write-only
  *	10 - read-write
  *	11 - special
- * it is changed into
+ * when it is copied into open_flags, it is changed into
  *	00 - no permissions needed
  *	01 - read-permission
  *	10 - write-permission
@@ -630,23 +630,24 @@
  */
 struct file *filp_open(const char * filename, int flags, int mode)
 {
-	int namei_flags, error;
+	int namei_flags, open_flags, error;
 	struct nameidata nd;

 	namei_flags = flags;
-	if ((namei_flags+1) & O_ACCMODE)
+	open_flags = ((flags + 1) & O_ACCMODE);
+	if (open_flags)
 		namei_flags++;
 	if (namei_flags & O_TRUNC)
 		namei_flags |= 2;

 	error = open_namei(filename, namei_flags, mode, &nd);
 	if (!error)
-		return dentry_open(nd.dentry, nd.mnt, flags);
+		return dentry_open(nd.dentry, nd.mnt, open_flags, flags);
 	return ERR_PTR(error);
 }

-struct file *dentry_open(struct dentry *dentry, struct vfsmount *mnt, int flags)
+struct file *dentry_open(struct dentry *dentry, struct vfsmount *mnt, int mode, int flags)
 {
 	struct file * f;
 	struct inode *inode;
@@ -657,7 +658,7 @@
 	if (!f)
 		goto cleanup_dentry;
 	f->f_flags = flags;
-	f->f_mode = (flags+1) & O_ACCMODE;
+	f->f_mode = mode;

 	inode = dentry->d_inode;
 	if (f->f_mode & FMODE_WRITE) {
 		error = get_write_access(inode);
diff -ruN linux-2.3.99-pre6-vanilla/include/linux/fs.h linux-2.3.99-pre6-fist/include/linux/fs.h
--- linux-2.3.99-pre6-vanilla/include/linux/fs.h	Wed Apr 26 18:29:07 2000
+++ linux-2.3.99-pre6-fist/include/linux/fs.h	Sun Apr 30 06:20:24 2000
@@ -855,7 +855,7 @@
 extern void put_unused_fd(unsigned int);	/* locked inside */
 extern struct file *
announcing stackable file system templates and code generator
It is my pleasure to announce fistgen-0.0.1, the first release of the FiST code generator, used to create stackable file systems out of templates and a high-level language. This package comes with stackable file system templates for Linux, Solaris, and FreeBSD. It also contains several sample file systems built using the FiST language: an encryption file system, a compression file system, and more --- all of which are written as portable stackable file systems. Linux 2.3 folks: my stackable templates now support Size Changing Algorithms (SCAs) such as compression, uuencoding, etc. See specific papers and sample file systems for more details. For more information, software, and papers, see the FiST home page: http://www.cs.columbia.edu/~ezk/research/fist/ Happy stacking. Erez Zadok. --- Columbia University Department of Computer Science. EMail: [EMAIL PROTECTED] Web: http://www.cs.columbia.edu/~ezk
Re: file checksums
In message <[EMAIL PROTECTED]>, Thomas Pornin writes:

> On Tue, May 09, 2000 at 03:13:40PM -0400, Theodore Y. Ts'o wrote:
> > ... and what prevents the attacker from simply updating the checksum
> > when he's modifying the blocks?
>
> As you may have not noticed, I am talking about a block device where
> every data is enciphered. To be more specific, each 64 bit (or 128 bit)
> block is enciphered with a different key. The attacker has not access to
> the data, neither to the checksum. However, he knows where these items
> are, and may perform modifications (although they would be essentially
> random). Hence the checksum.
>
> > Clearly you don't understand about cryptographic checksum.
>
> Sarcasm ignored. I have been studying cryptography for the last 5 years.

Thomas, I must agree with Ted. I've written a cryptographic f/s myself, mostly as an exercise in stacking, and have also written several papers on related topics. What you've described in this thread didn't convince me that it is a strong and secure design. Perhaps it wasn't explained in enough detail. Either way, you should follow Ted's advice and pass your detailed design by some of the security-oriented mailing lists. (Warning: security buffs aren't as polite in their criticism as the people on this list... :-)

The reasons you've given for separate checksums aren't very compelling. You can get most of what you want by using a different cipher, perhaps in CBC mode. This would allow you to detect corruptions or attacks in the middle of a file, if that's what you're concerned about. If you haven't already, I suggest you also read some of the more prominent papers in the area of secure file systems:

- Matt Blaze's CFS
- Mazieres's SFS (from the last SOSP)
- others you can follow from those two

> --Thomas Pornin

Since I'm partial to stacking anyway, let me suggest an alternate design using stacking. If you can pull that off, you'll have the advantage of a f/s that works on top of any other f/s.
This allows safe and fast backups of ciphertext, for example (assuming you're backing up via the f/s, not dump-ing the raw device).

- use overlay stacking, so you hide the mount point. This also helps in hiding certain files.

- for each file foo, you create a file called foo.ckm, which will contain your checksum information in whatever way you choose. You'll have to come up with a fast, reliable, incremental checksum algorithm. It may exist; I don't know. If you use the xor you've suggested, you're not that secure. If you use MD5, you waste CPU b/c you'll have to re-compute the checksum on every tiny change to the file. The .ckm file may contain checksum info for the whole file, per block, or whatever you choose. Your f/s will manage that file any way it wants.

- make sure that the .ckm files aren't listed by default: this means hacking readdir and lookup, possibly others.

- make sure that only authenticated, authorized users can view/access/modify the .ckm files (if you want that). You can do that using special-purpose ioctls that pass (one-time?) keys to the f/s. You can use new ioctls in general to create a whole API for updating the .ckm files.

- limit root attacks as much as possible. One of my cryptfs versions stored keys in memory only, hashed by the uid and the session-ID of the process. The SID was added to make it harder for root users to decrypt other users' files. It's not totally safe, but every bit helps.

You can do much of the above with my stacking templates. I've distributed f/s examples showing how to achieve various features. You can get more info from http://www.cs.columbia.edu/~ezk/research/fist/. Either way, note that this stackable method still doesn't give you all the security you want: attackers can still get at the raw device of the lower f/s; there is a window of opportunity b/t the lower mount and the stackable f/s mount; anything that's in memory or can swap/page over the network is vulnerable; and more.
You will find that there's a tremendous amount of effort and many details that must be addressed to build a secure file system. I for one would love to see ultra-secure, fast cryptographic file systems become a standard component of operating systems. Good luck. Erez.
Re: Multiple devfs mounts
In message <[EMAIL PROTECTED]>, Chris Wedgwood writes:

> On Tue, May 02, 2000 at 12:15:20AM -0400, Theodore Y. Ts'o wrote:
> > Date: Mon, 1 May 2000 11:27:04 -0400 (EDT)
> > From: Alexander Viro <[EMAIL PROTECTED]>
> >
> > Keep in mind that userland may need to be taught how to deal with getdents()
> > returning duplicates - there is no reasonable way to serve that in
> > the kernel.
>
> *BSD does this in libc, for the exactly same reason; there's no good way
> to do this in the kernel.

[...]

> I'm not sure how efficient and fast the code would be to make this
> work quickly, for large numbers of file systems it might prove
> horribly slow.

IMHO the BSD hacks to libc to support unionfs were ugly. To write unionfs, they used the existing nullfs "template", but then they had to modify the VFS *and* other user-land stuff.

It depends on what you mean by "reasonable way" and "good way". I've done it in my prototype implementation of unionfs, which uses a fan-out stackable f/s:

(1) you read directory 1, and store the names you see in a hash table.

(2) as you read each entry from subsequent directories, you check if it's in the hash table. If it is, skip it; if it's not, add it to the getdents output buf, and add the entry to the hash table.

This was a simple design and easy to implement. Yes, it added overhead to readdir(2), but not as much as you'd think. It was certainly not "horribly slow", nor did it chew up lots of ram. I tried it on several directories with several dozen entries each (i.e., typical directory sizes), not on directories with thousands or more entries.

I think that if we're adding directory unification features to the linux kernel, then we should add uniquification of names to the kernel as well. One possible way would be to take advantage of the fact that most readdir()'s are followed by lstat()s of each entry (hence NFSv3's READDIRPLUS): so when you do a readdir, maybe it's best to pre-create a mini-dentry for each such entry, in anticipation of its probable use.
The advantage there is that the dentry already has the name, and we already have code to do dentry lookups based on their name. > --cw Erez.
Re: [RFC] Possible design for "mount traps"
In message <[EMAIL PROTECTED]>, Alexander Viro writes: [...] > So what about the following trick: let's allow vfsmounts without > associated superblock and allow to "mount" them even on the negative > dentries? Notice that the latter will not break walk_name() - checks for > dentry being negative are done after we try to follow mounts. > Notice also that once we mount something atop of such vfsmount it > becomes completely invisible - it's wedged between two real objects and > following mounts will walk through it without stopping. > So the only case when these beasts count is when they are > "mounted", but nothing is mounted atop of them. But that's precisely the > class of situations we are interested in. In case of autofs we want > follow_down() into such animal to trigger mounting, in case of portalfs - > passing the rest of pathname to daemon, in case of devfs-with-automount > we want to kick devfsd. So let them have a method that would be called > upon such follow_down() (i.e. one when we have nothing mounted atop of > us). And that's it. > These objects are not filesystems - they rather look like a traps > set in the unified tree. Notice that they do not waste anon device like > "one node autofs" would do. > That way if autofs daemon mounted /mnt/net/foo it would not follow > up with /mnt/net/foo/bar - it would just set the trap in /mnt/net/foo/bar > and let the actual lookups trigger further mounts. [...] This sounds almost identical to what Sun did to solve similar problems in their first version of autofs. There's a paper in LISA '99 describing their enhancements to the original autofs. Your proposal, however, is better b/c it generalizes to more than autofs. Erez.
Re: file checksums
In message <[EMAIL PROTECTED]>, Thomas Pornin writes:

[...]
> To answer to the second question, I need the experience from the
> filesystem-guys, who know how a filesystem is typically used, and
> who are supposed to at least lurk in this mailing-list. Hence this
> discussion.
>
> It is my understanding that typical filesystem write usage is either
> creating new files and filling them, truncating files to zero length and
> filling them, and appending to files. For these operations, the checksum
> cost is zero (in terms of disk accesses). This is not true for mmaped()
> files (databases, production of executables...) but I think (and I beg
> for comments) that this may be handled with a per-file exception (for
> instance, producing the executable file in /tmp and then copying it
> into place -- since /tmp may be emptied at boottime, it is pointless to
> ensure data integrity in /tmp when the machine is not powered).
>
> Any comment is welcome, of course. I thank you for sharing part of your
> time reading my prose.
>
> --Thomas Pornin

Thomas, f/s usage patterns vary of course. There are at least three papers you should take a look at concerning this topic:

- L. Mummert and M. Satyanarayanan. "Long Term Distributed File Reference Tracing: Implementation and Experience." Technical Report CMU-CS-94-213, Carnegie Mellon University, Pittsburgh, PA, 1994.

- W. Vogels. "File System Usage in Windows NT 4.0." SOSP '99.

- D. Roselli, J. R. Lorch, and T. E. Anderson. "A Comparison of File System Workloads." To appear in USENIX Conf. Proc., June 2000. (You're going to have to wait to read that one, unless you can get an advance copy from the authors. It's an interesting paper.)

Also, I've had to visit this issue recently when I added size-changing support to my stackable templates (for compression, CBC encryption, all kinds of encoding, etc.)
You can get a copy of a paper (entitled "Performance of Size-Changing Algorithms in Stackable File Systems") in http://www.cs.columbia.edu/~ezk/research/fist/. Cheers, Erez.
Re: [prepatch] Directory Notification
In message <8ghn4m$965$[EMAIL PROTECTED]>, Ton Hospel writes:

> In article <[EMAIL PROTECTED]>,
> "Theodore Y. Ts'o" <[EMAIL PROTECTED]> writes:
> > This was discussed on IRC, but for those who weren't there it
> > should be clear that the current implementation uses dentries, so if you
> > have a file which is hard-linked to appear in two different directories,
> > only the parent directory which was used as an access path when the file
> > was changed would get notified.
> >
> > That is, if /usr/foo/changed_file and /usr/bar/changed_file are hard
> > links, and a user-program modifies /usr/foo/changed_file via that
> > pathname, a server who had asked for directory notification on /usr/bar
> > would not get notified that /usr/bar/changed_file had changed.
> >
> > This is a pretty fundamental limitation, and can't really be fixed
> > without using inode numbers as the notification path; but that requires
> > a very different architecture, and that design wouldn't work for those
> > filesystems that don't use inode numbers. Life is full of tradeoffs.
>
> Still, that's pretty yucky. Inode based notification should be the default
> behaviour, with the others the exception (for the others the filename path
> is usually the ONLY path).

I think there are uses for notification of a change on inodes and dentries, maybe even files. Some information is available in one and not in the others. Nevertheless, it's clear that inode notification is the most obvious first choice.

> I don't care HOW my files got changed, just if they got changed.
>
> Good thing directory hardlinking is very rare. on the other hand,
> multiple mounting is coming

AFAIK, directory hard-linking isn't available in Linux, and that's a good thing. Not only was it a seldom-used feature, it wreaked havoc on recursive programs like find, rm, and even backup tools like Legato Networker. The reason was that you cd to a directory one way, but when you cd out of it (cd ..) you find yourself ...
"transported through a wormhole to another dimension..." You bring an interesting point, however. With the new multiple mounting and vfsmount stuff, I hope that we're not re-introducing the same problems that directory hard-linking caused. Erez.
superblock refcount
Al, since we are now allowing multiple struct vfsmount objects to point to the same super_block, shouldn't struct super_block have a refcount variable? Erez.
->truncate may need struct file
Ion and I ported our stacking templates to the last few releases, including 2.3.99-pre9. We had to create a fake struct file inside our ->truncate method so we can pass it to some as-ops methods. We believe that sooner or later, truncate (and notify_change and friends) may need to have a struct file passed to them. This has become more evident since the recent cleanup of the as-ops.

Some background: our stacking templates support file systems that change data size, and we used that in our compression f/s, gzipfs. When data is written in the middle of a compressed file, the new sequence may be shorter or longer than the previous sequence, so we handle it by shifting the data in the compressed file inward/outward as needed. Truncate() can be used to shrink a file or extend it. When truncate is called to shrink a file, we first check if the truncation occurs in the middle of a compressed sequence, and we re-compress the remaining bytes to ensure data validity. When truncate is used to enlarge a file, we often have to compress a "chunk" of data that contains the bytes at the previous end of the file followed by the new zeros. A truncation that increases the file's size may also result in multiple holes being created. Anyway, there are lots more details to our stacking support for size-changing file systems, all of which we've outlined in a paper, if anyone is interested.

The key point for this message is that our code has to perform data and page operations on a file from inside the ->truncate method. But truncate doesn't have a struct file. It has a dentry (and hence an inode), but that's not enough, because the new address ops require passing a struct file to them. We've managed to hack around this by creating a fake struct file, stuffing in the dentry that we get in ->truncate, and passing that file around. It seems to work so far, but it's an ugly solution that may not work in the long run.
Having to create fake objects is a possible indication that something is missing from an API. For now we're OK, but I think we should consider passing a struct file to ->truncate, and maybe even moving ->truncate from the inode ops to the as-ops. Here is a brief list of reasons:

(1) truncate is fundamentally an operation that is asked to operate on a file without being given a file to operate on. Truncating or enlarging a file implies munging data pages.

(2) ftruncate(2) takes an open fd, meaning that a struct file must already exist in the kernel b/c a user process opened it. truncate(2) takes a pathname, which the kernel must translate somehow into a file and/or a dentry. This further suggests that the two syscalls are probably implemented in the kernel using a struct file, so it should be possible to pass that struct file further down.

(3) if we pass a struct file to ->truncate instead of a dentry, very little change will be required to actual file systems. The changes to the VFS may not be simple, however. My quick inspection showed that notify_change and friends don't have easy access to a struct file.

(4) our stacking code can certainly use a struct file passed to ->truncate, and that would help us get rid of the ugly fake file we have to create now. If at any point in the future someone (NFS?) needs to put something inside the private data of struct file, then it would be necessary to pass a struct file to ->truncate.

Comments? Thanks, Erez.
stackable f/s patches for 2.3.99-pre9
Al, Linus,

Ion and I ported our stacking templates to all versions up to 2.3.99-pre9. We've cut a new release, fistgen-0.0.3, available in http://www.cs.columbia.edu/~ezk/research/fist/. The following kernel patches are passive and are functionally identical to the patches we've submitted for -pre6. They are intended to make the Linux kernel more "stacking-friendly" so that my stacking templates can call VFS code directly rather than reproducing it.

Here's what the patch below does:

- dentry_open cannot be called as is from stackable templates because it computes the 'mode' variable internally. So we changed dentry_open such that it accepts a specific mode variable, separate from "flags". Then we assign the mode which was passed to dentry_open to f->f_mode.

- we changed all 3 invocations of dentry_open to pass the correct mode information:

  o fs/exec.c:open_exec(): we pass FMODE_READ to the dentry_open call.

  o fs/open.c:filp_open(): we compute and pass the correct mode to dentry_open based on the flags passed to filp_open. This makes the code in filp_open clearer, rather than distributing mode computations inside and outside various functions.

  o ipc/shm.c:sys_shmat(): we pass "prot", which is already computed in this function, further showing that computing the mode flags inside dentry_open isn't correct; it should be done by the caller of dentry_open().

- we clarified filp_open() by adding a new variable called open_flags, which is computed from the flags passed to filp_open. This is passed to the dentry_open call as the "mode" argument.

- we moved the static inline function sync_page from filemap.c to mm.h so that it can be called from stackable file systems. In many ways this sync_page function is a "VFS" callable function that could be called by other file systems.

Comments are welcome. Let me know if you have any concerns about this patch, and we can work on it some more.

Thanks,
Erez.
## diff -ruN linux-2.3.99-pre9-vanilla/fs/exec.c linux-2.3.99-pre9-fist/fs/exec.c
--- linux-2.3.99-pre9-vanilla/fs/exec.c	Sun May 21 14:38:47 2000
+++ linux-2.3.99-pre9-fist/fs/exec.c	Tue May 23 19:30:47 2000
@@ -334,7 +334,7 @@
 	file = ERR_PTR(err);
 	if (!err) {
 		lock_kernel();
-		file = dentry_open(nd.dentry, nd.mnt, O_RDONLY);
+		file = dentry_open(nd.dentry, nd.mnt, FMODE_READ, O_RDONLY);
 		unlock_kernel();
 out:
 	return file;
diff -ruN linux-2.3.99-pre9-vanilla/fs/open.c linux-2.3.99-pre9-fist/fs/open.c
--- linux-2.3.99-pre9-vanilla/fs/open.c	Mon May  8 16:31:40 2000
+++ linux-2.3.99-pre9-fist/fs/open.c	Tue May 23 19:30:47 2000
@@ -599,12 +599,12 @@
 }

 /*
- * Note that while the flag value (low two bits) for sys_open means:
+ * Note that while the flags value (low two bits) for sys_open means:
  *	00 - read-only
  *	01 - write-only
  *	10 - read-write
  *	11 - special
- * it is changed into
+ * when it is copied into open_flags, it is changed into
  *	00 - no permissions needed
  *	01 - read-permission
  *	10 - write-permission
@@ -614,23 +614,24 @@
  */
 struct file *filp_open(const char * filename, int flags, int mode)
 {
-	int namei_flags, error;
+	int namei_flags, open_flags, error;
 	struct nameidata nd;

 	namei_flags = flags;
-	if ((namei_flags+1) & O_ACCMODE)
+	open_flags = ((flags + 1) & O_ACCMODE);
+	if (open_flags)
 		namei_flags++;
 	if (namei_flags & O_TRUNC)
 		namei_flags |= 2;

 	error = open_namei(filename, namei_flags, mode, &nd);
 	if (!error)
-		return dentry_open(nd.dentry, nd.mnt, flags);
+		return dentry_open(nd.dentry, nd.mnt, open_flags, flags);
 	return ERR_PTR(error);
 }

-struct file *dentry_open(struct dentry *dentry, struct vfsmount *mnt, int flags)
+struct file *dentry_open(struct dentry *dentry, struct vfsmount *mnt, int mode, int flags)
 {
 	struct file * f;
 	struct inode *inode;
@@ -641,7 +642,7 @@
 	if (!f)
 		goto cleanup_dentry;
 	f->f_flags = flags;
-	f->f_mode = (flags+1) & O_ACCMODE;
+	f->f_mode = mode;

 	inode = dentry->d_inode;
 	if (f->f_mode & FMODE_WRITE) {
 		error = get_write_access(inode);
diff -ruN linux-2.3.99-pre9-vanilla/include/linux/fs.h linux-2.3.99-pre9-fist/include/linux/fs.h
--- linux-2.3.99-pre9-vanilla/include/linux/fs.h	Tue May 23 17:18:48 2000
+++ linux-2.3.99-pre9-fist/include/linux/fs.h	Tue May 23 19:48:52 2000
@@ -858,7 +858,7 @@
 extern void put_unused_fd(unsigned int);	/* locked inside
Re: ->truncate may need struct file
In message <[EMAIL PROTECTED]>, Alexander Viro writes:

> [...]
> I suspect that keeping parallel stacks in different levels will turn out
> to be a design mistake. IOW, I'm less than sure that struct file of
> underlying fs should be obtained from struct file of covering one. I think
> that when the case of device nodes will be finally sorted out we'll see
> what the real situation is.

If by "parallel stacks" you mean that each layer keeps its own state, I don't see how you can avoid it altogether. All stacking work in the past (Rosenthal, Skinner, Heidemann, etc.) has argued for layer independence, meaning that each layer keeps its own stuff. Remember that you have to support multiple layers, fan-in, fan-out, and all combinations of those. This makes stacking more modular.

Those guys argued for a massive overhaul of the VM, VFS, and cache system (centralized), and a rewrite of all file systems. That is out of the question for any OS vendor nowadays, and it is the main reason why no OS vendor has seriously adopted their work; but even their improved approaches had each layer maintain its own state.

My stacking work took the approach of not changing anything in the VM, VFS, or other file systems (I got away with very few changes on Linux, and none for Solaris/FreeBSD). This was important to me b/c no one would ever accept or use a piece of work that requires massive kernel changes.

I have every known stacking paper on my Web site, the old papers as well as mine, at http://www.cs.columbia.edu/~ezk/research/fist/. Anyone is welcome to look at them, as well as my stacking sources, and suggest alternatives. I doubt anyone would be able to come up w/ a cleaner stacking interface that does not require a major kernel overhaul. Nevertheless, I would be very happy if I could achieve the same level of flexibility and functionality in stacking that I have today with less code in the templates.
For example, Ion and I were able to get rid of stacking on the vm_area_struct structures some time ago, while keeping the same functionality. If I can avoid stacking on, say, struct file, and keep the same functionality, that sure would make the Wrapfs templates simpler. Erez.
Re: FS_SINGLE queries
In message <[EMAIL PROTECTED]>, Alexander Viro writes: > > > On Sat, 10 Jun 2000, Richard Gooch wrote: > > > I see your point. However, that suggests that the naming of > > /proc/mounts is wrong. Perhaps we should have a /proc/namespace that > > shows all these VFS bindings, and separately a list of real mounts. > > What's "real"? /proc/mounts would better left as it was (funny replacement > for /etc/mtab) and there should be something along the lines of > /proc/namespace (hell knows, we might make it compatible with /proc/ns > from new Plan 9). That something most definitely doesn't need to share the > format with /proc/mounts... On a related note, since we do have /proc/mounts, and assuming that procfs is pretty much necessary nowadays, are we going to get rid of /etc/mtab and completely move all getmntent info into the kernel? I never liked the fact that people doing mounts (such as automounters) have to ensure that they correctly maintain a separate text file in /etc. If we want to go crazy, we can implement mntfs ala Solaris 8, which moved the mnt info into the kernel, but allowed for "editing" /etc/mnttab which is now a special f/s mounted on top of a single file. Hmmm, maybe that's a question to the glibc folks. I guess as long as all the necessary tools and libraries will use /proc/mounts if available, and avoid using /etc/mtab, that'd be ok. Erez.
Re: FS_SINGLE queries
In message <[EMAIL PROTECTED]>, Alexander Viro writes:

> On Fri, 16 Jun 2000, Erez Zadok wrote:
>
> > On a related note, since we do have /proc/mounts, and assuming that procfs
> > is pretty much necessary nowadays, are we going to get rid of /etc/mtab and
> > completely move all getmntent info into the kernel? I never liked the fact
> > that people doing mounts (such as automounters) have to ensure that they
> > correctly maintain a separate text file in /etc.
>
> I'm not sure that we need to keep it on procfs - especially with the
> union-mounts coming into the game.

Procfs or not, I'm advocating keeping it in the kernel only, where it belongs, and removing the kludgy need (ala Sun and many others) to maintain a separate /etc/mtab file.

> > Hmmm, maybe that's a question to the glibc folks. I guess as long as all
> > the necessary tools and libraries will use /proc/mounts if available, and
> > avoid using /etc/mtab, that'd be ok.
>
> How many programs actually need this getmntent(), in the first
> place?

Programs like df(1) need to read mtab. Automounters (such as amd, which I maintain) and /bin/mount need to write it. The problem with a separate mtab file is that there's no way to guarantee that the file in /etc is in sync w/ the actual mounts in the kernel; there are many reasons why you can end up with an mtab file that's out of sync w/ the actual in-kernel mounts. AIX, Ultrix, and BSD44 did the right thing by moving the mtab list into the kernel and rewriting "[gs]etmntent" (they also renamed them) so they query the kernel via a syscall. Solaris 8 moved that way too, but kept backwards compatibility using their special mntfs.

Anyway, I'd like to see a new syscall that returns a list of mounts and associated info in linux. Currently that can be done by reading /proc/mounts, but not if procfs isn't available or if we're going to take /proc/mounts away.
It would make programs like df more reliable, and programs like /bin/mount
wouldn't have to rewrite the mtab file each time a mount(2) is made.  It
would also make amd a little faster (I already auto-detect in-kernel vs.
in-/etc mount tables and handle both in amd).  Anyway, it's not a big thing
or something we need to do right now.

Erez.
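[Editor's sketch, not part of the original post.  This illustrates the
df(1) pattern the thread is discussing: walking a mount table with
getmntent(3) and stat'ing each mount point.  Reading /proc/mounts gives the
in-kernel view directly; list_mounts() is an illustrative name, not a real
API.]

```c
/* Sketch of a df-like loop that reads the in-kernel mount table
 * (/proc/mounts) with getmntent(3) instead of trusting /etc/mtab.
 * Returns the number of entries read, or -1 on error. */
#include <stdio.h>
#include <mntent.h>
#include <sys/statvfs.h>

static int list_mounts(const char *tab)
{
    FILE *fp = setmntent(tab, "r");
    if (!fp)
        return -1;

    struct mntent *mnt;
    int count = 0;
    while ((mnt = getmntent(fp)) != NULL) {
        struct statvfs sv;
        count++;
        /* stat the mount point; an entry served by a hung user-level
         * file server could block here, hence the "ignore" option
         * discussed elsewhere in the thread */
        if (statvfs(mnt->mnt_dir, &sv) == 0)
            printf("%-24s %-24s %10lu blocks free\n",
                   mnt->mnt_fsname, mnt->mnt_dir,
                   (unsigned long) sv.f_bfree);
    }
    endmntent(fp);
    return count;
}
```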
Re: FS_SINGLE queries
In message <[EMAIL PROTECTED]>, [EMAIL PROTECTED] writes:

[...]
> so mount could keep a /etc/mtab2 to record this information, but that's
> freaking ugly.  or we could pass a new mount option down into the kernel
> which causes it to display `loop' in that entry, but this seems like a
> waste of a bit.  other alternatives gladly sought.

Not necessarily.  Several OSs use an "ignore" bit as a mount flag telling
programs like df(1) not to stat certain entries by default.  This is often
used for automounted/autofs entries, where normally no reasonable info can
be returned to statvfs(2); besides, it's a good idea not to slow down df(1)
by stat'ing file systems that may be served by slow user-level file servers
(amd w/o autofs support).  There are cases where such file servers can
return useful info back to statvfs(2) (as amd can).

BTW, the usual reason you don't see such automounted entries is that GNU df
automatically omits entries whose statfs values are all 0, but it still
statfs(2)'s them, which is slow (and will hang if the automounter is hung).
It's much better if the kernel records that certain entries were mounted
with the "ignore" option, so that df(1) simply doesn't statfs them at all.

Erez.
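[Editor's sketch, not part of the original post.  The "ignore" convention
described above can be checked from a mount-table entry with glibc's
hasmntopt(3); some systems instead mark the filesystem *type* as "ignore".
The helper name should_ignore() is hypothetical.]

```c
/* Sketch: decide whether a df-like tool should skip an entry
 * before calling statvfs(2) on it. */
#include <string.h>
#include <mntent.h>

/* Return 1 if the entry carries the "ignore" convention. */
static int should_ignore(const struct mntent *mnt)
{
    return hasmntopt(mnt, "ignore") != NULL ||      /* mount option */
           strcmp(mnt->mnt_type, "ignore") == 0;    /* MNTTYPE_IGNORE */
}
```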
Re: FS_SINGLE queries
In message <[EMAIL PROTECTED]>, Alexander Viro writes:

> On Fri, 16 Jun 2000, Erez Zadok wrote:
> [...]
> > Anyway, I'd like to see a new syscall that returns a list of mounts and
>
> Sigh...  We already have a crapload of syscalls that should not be there.
> If it can be done by open()/read()/write()/lseek()/close() it should be
> done that way.

Hey, we could make it yet another ioctl(2).  Then we'd trade a crapload of
syscalls for a crapload of ioctls --- a time-honored Unix tradition...
:-) :-)

Seriously, an open/read/.../close interface would work fine, but on what
file?  If it's something inside /proc, fine, but has the Linux community as
a whole accepted that procfs is a *must* for any working system "or else"?
If the file to open/read/close won't be in /proc, what type of file would
it be and how would it get created?

Erez.
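[Editor's sketch, not part of the original post.  Viro's point is that no
new syscall is needed if the mount list is exposed as a file: plain
open(2)/read(2)/close(2) suffice.  Assuming /proc is mounted, the whole
table can be slurped like this; slurp_mounts() is an illustrative name.]

```c
/* Sketch: read the mount list with nothing but open/read/close,
 * no stdio and no dedicated syscall.  Returns bytes read or -1. */
#include <unistd.h>
#include <fcntl.h>

static ssize_t slurp_mounts(char *buf, size_t len)
{
    int fd = open("/proc/mounts", O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t total = 0, n;
    /* /proc files must be read in a loop; short reads are normal */
    while ((n = read(fd, buf + total, len - 1 - (size_t)total)) > 0)
        total += n;
    close(fd);
    buf[total] = '\0';
    return total;
}
```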
Re: Are there any FS related docs on the net?
Vitaly, you may also find some of my papers and stackable templates useful
as a form of documentation:

http://www.cs.columbia.edu/~ezk/research/fist/

(I also have a collection of older VFS papers in there.)

Erez.