Re: Read/write counts

2007-06-04 Thread Bryan Henderson
>It is not strictly an error to read/write less than the requested amount,
>but you will find that a lot of applications don't handle this correctly.

I'd give it  a slightly different nuance.  It's not an error, and it's a 
reasonable thing to do, but there is value in not doing it.  POSIX and its 
predecessors back to the beginning of Unix say read()/write() don't have 
to transfer the full count (they must transfer at least one byte).  The 
main reason for this choice is that it may require more resources (e.g.  a 
memory buffer) than the system can allocate to do the whole request at 
once.

Programs that assume a full transfer are fairly common, but are 
universally regarded as either broken or just lazy, and when it does cause 
a problem, it is far more common to fix the application than the kernel.

Most application programs access files via libc's fread/fwrite, which 
don't have partial transfers.  GNU libc does handle partial (kernel) reads 
and writes correctly.  I'd be surprised if someone can name a major 
application that doesn't.
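
For illustration -- this sketch is not from the original message, and the
function name is invented -- a program that calls write() directly needs a
loop along these lines to handle partial transfers:

    #include <errno.h>
    #include <unistd.h>

    /* Write the whole buffer, coping with partial transfers.
       Returns 0 on success, -1 on error (errno set by write()). */
    static int write_all(int fd, const char *buf, size_t count)
    {
        while (count > 0) {
            ssize_t n = write(fd, buf, count);
            if (n < 0) {
                if (errno == EINTR)
                    continue;       /* interrupted before any byte moved */
                return -1;
            }
            buf   += n;             /* n >= 1, so the loop makes progress */
            count -= n;
        }
        return 0;
    }

Libc's fwrite is essentially this loop plus buffering, which is why programs
that go through stdio never see the issue.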

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems



Re: Versioning file system

2007-06-18 Thread Bryan Henderson
>Part of the problem is that "whenever you modify a file"
>is ill-defined, or rather, if you were to take the literal meaning of it
>you'd end up with an unmanageable number of revisions.

Let me expand on that.  Do you want to save a revision every time the user 
types in an editor?  Every time he runs a "save" command?  Every time a 
program does a write() system call?  Every time a program closes a 
modified file?  If you're adding to a C program, is every draft you 
compile a revision, or just the final modification after the bugs are 
worked out?

When I was very new to coding, I used VMS and thought the automatic 
revisioning would be a great thing because it would save me when I 
modified a program and later regretted it.  The system made a revision 
every time I exited the editor.  But I soon found that the "previous 
revision" to which I wanted to revert was always many editings back, since 
I spent a lot of time trying to make the regrettable code work before 
giving up.  VMS kept a fixed number of revisions per file, but keeping 20 
versions of every file just to reach back that far would have been wasteful 
of disk space, directory listing space, etc.

Later, I discovered what I think are superior alternatives:  RCS-style 
version management on top of the filesystem, and automatic versioning 
based on time instead of count of "modifications."  For example, make a 
copy of every changed file every hour and keep it for a day and keep one 
of those for a week, and keep one of those for a month, etc.  This works 
even without snapshot technology and even without sub-file deltas.  But of 
course, it's better with those.
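
A toy sketch of such a time-based retention rule (thresholds and names are
purely illustrative, not from any real system):

    #include <time.h>

    /* Keep every hourly copy for a day, one a day for a week, and one a
       week for a month; discard the rest. */
    static int keep_copy(time_t now, time_t copy_time)
    {
        double age = difftime(now, copy_time);
        struct tm t;

        localtime_r(&copy_time, &t);
        if (age <= 24*3600)
            return 1;                                  /* hourly, last day */
        if (age <= 7*24*3600)
            return t.tm_hour == 0;                     /* daily, last week */
        if (age <= 30*24*3600)
            return t.tm_hour == 0 && t.tm_wday == 0;   /* weekly, last month */
        return 0;
    }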

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems



Re: Versioning file system

2007-06-18 Thread Bryan Henderson
>The question remains is where to implement versioning: directly in
>individual filesystems or in the vfs code so all filesystems can use it?

Or not in the kernel at all.  I've been doing versioning of the types I 
described for years with user space code and I don't remember feeling that 
I compromised in order not to involve the kernel.

Of course, if you want to do it with snapshots and COW, you'll have to ask 
where in the kernel to put that, but that's not a file versioning 
question; it's the larger snapshot question.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems




Re: Versioning file system

2007-06-19 Thread Bryan Henderson
>We don't need a new special character for every new feature.  We've got 
>one, and it's flexible enough to do what you want, as proven by NetApp's 
>extremely successful implementation.

I don't know NetApp's implementation, but I assume it is more than just a 
choice of special character.  If you merely start the directory name with 
a dot, you don't fool anyone but 'ls' and shell wildcard expansion.  (And 
for some enlightened people like me, you don't even fool ls, because we 
use the --almost-all option to show the dot files by default, having been 
burned too many times by invisible files).

I assume NetApp flags the directory specially so that a POSIX directory 
read doesn't get it.  I've seen that done elsewhere.

The same thing, by the way, is possible with Jack's filename:version idea, 
and I assumed that's what he had in mind.  Not that that makes it all OK.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems



Re: Versioning file system

2007-06-20 Thread Bryan Henderson
>The directory is quite visible with a standard 'ls -a'. Instead,
>they simply mark it as a separate volume/filesystem: i.e. the fsid
>differs when you call stat(). The whole thing ends up acting rather like
>our bind mounts.

Hmm.  So it breaks user space quite a bit.  By break, I mean uses that 
work with more conventional filesystems stop working if you switch to 
NetApp.  Most programs that operate on directory trees willingly cross 
filesystems, right?  Even ones that give you an option to stay on one 
filesystem, such as GNU cp, cross by default.
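
(For illustration only: the test behind those one-file-system options is
just a comparison of st_dev against the tree's root, so a directory with a
different fsid is exactly what trips it.)

    #include <sys/types.h>
    #include <sys/stat.h>

    /* Sketch of the -xdev / --one-file-system style check. */
    static int crosses_filesystem(const struct stat *root, const char *path)
    {
        struct stat st;

        if (lstat(path, &st) != 0)
            return -1;
        return st.st_dev != root->st_dev;   /* nonzero: don't descend */
    }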

But if the implementation is, as described, wildly successful, that means 
users are willing to tolerate this level of breakage, so it could be used 
for versioning too.

But I think I'd rather see a truly hidden directory for this (visible only 
when looked up explicitly).

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems



Re: Patent or not patent a new idea

2007-06-25 Thread Bryan Henderson
>If your only purpose is to try generate a defensive patent, then just
>dumping the idea in the public domain serves the same purpose, probably
>better.
>
>I have a few patents, some of which are defensive. That has not prevented
>the USPTO issuing quite a few patents that are in clear violation of mine.

That's not what a defensive patent is.  Indeed, patenting something just 
so someone else can't patent it is ridiculous, because publishing is so 
much easier.

A defensive patent is one you file so that you can trade rights to it for 
rights to other patents that you need.




Re: Patent or not patent a new idea

2007-06-26 Thread Bryan Henderson
>md/raid already works happily with different sized drives from
>different manufacturers ...

>So I still cannot see anything particularly new.

As compared to md of conventional disk partitions, it brings the ability 
to create and delete arrays without shutting down all use of the physical 
disks (to update the partition tables).  (LVM gives you that too).  It 
also makes managing space much easier because the component devices don't 
have to be carved from contiguous space on the physical disks.

Neither of those benefits is specific to RAID, but you could probably say 
that RAID multiplies the problems they address.



Re: how do versioning filesystems take snapshot of opened files?

2007-07-03 Thread Bryan Henderson
> Consistent state means many different things. 

And, significantly, open/close has nothing to do with any of them 
(assuming we're talking about the system calls).  open/close does not 
identify a transaction; a program may open and close a file multiple times 
in the course of making a "single" update.  Also, data and metadata updates 
remain buffered at the kernel level after a close.  And don't forget that 
a single update may span multiple files.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems



Re: how do versioning filesystems take snapshot of opened files?

2007-07-03 Thread Bryan Henderson
>But you look around, you may find that many
>systems claim that they can take snapshot without shutdown the
>application.

The claim is true, because you can just pause the application and not shut 
it down.  While this means you can't simply add snapshot capability and 
solve your copy consistency problem (you need new applications too), this 
is a huge advance over what there was before.  Without snapshots, you do 
have to shut down the application.  Often for hours, and during that time 
any service request to the application fails.  With snapshots, you simply 
pause the application for a few seconds.  During that time it delays 
processing of service requests, but every request ultimately goes through, 
with the requester probably not noticing any difference.

If a system claims that snapshot function in the filesystem alone gets you 
consistent backups, it's wrong.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems



Re: how do versioning filesystems take snapshot of opened files?

2007-07-03 Thread Bryan Henderson
>>we want a open/close consistency in snapshots.
>
>This depends on the transaction engine in your filesystem.  None of the
>existing linux filesystems have a way to start a transaction when the
>file opens and finish it when the file closes, or a way to roll back
>individual operations that have happened inside a given transaction.
>
>It certainly could be done, but it would also introduce a great deal of
>complexity to the FS.

And I would be opposed as a matter of architecture to making open/close 
transactional.  People often read more into open/close than is there, but 
open is just about gaining access and close is just about releasing 
resources.  It isn't appropriate for close to _mean_ anything.

There are filesystems that have transactions.  They use separate start 
transaction / end transaction system calls (not POSIX).

>> Pausing apps itself
>> does not solve this problem, because a file could be already opened
>> and in the middle of write.

Just to be clear: we're saying "pause," but we mean "quiesce."  I.e., tell 
the application to reach a point where it's not in the middle of anything 
and then tell you it's there.  Indeed, whether you use open/close or some 
other kind of transaction, just pausing the application doesn't help.  If 
you were to implement open/close transactions, the filesystem driver would 
just wait for the application to close and in the meantime block all new 
opens.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems




Re: [ANNOUNCE] util-linux-ng 2.13-rc1

2007-07-05 Thread Bryan Henderson
>i dont see how blaming autotools for other people's misuse is relevant

Here's how other people's misuse of the tool can be relevant to the choice 
of the tool: some tools are easier to use right than others.  Probably the 
easiest thing to use right is the system you designed and built yourself. 
I've considered distributing code with an Autotools-based build system 
before and determined quickly that I am not up to that challenge.  (The 
bigger part of the challenge isn't writing the original input files; it's 
debugging when a user says his build doesn't work).  But as far as I know, 
my hand-rolled build system is used correctly by me.

>> checks the width of integers on i386 for projects not caring about that 
>> and fails to find installed libraries without telling how it was 
>> supposed to find them or how to make it find that library.
>
>no idea what this rant is about.

The second part sounds like my number 1 complaint as a user of 
Autotools-based packages: 'configure' often can't find my libraries.  I 
know exactly where they are, and even what compiler and linker options are 
needed to use them, but it often takes a half hour of tracing 'configure' 
or generated make files to figure out how to force the build to understand 
the same thing.  And that's with lots of experience.  The first five times 
it was much more frustrating.

>> Configuring the build of an autotools program is harder than necessary; 
>> if it used a config file, you could easily save it somewhere while 
>> adding comments on how and why you made *that* choice, and you could 
>> possibly use a set of default configs which you'd just include.
>
>history shows this is a pita to maintain.  every package has its own 
>build system and configuration file ...

It's my understanding that autotools _does_ provide that ability (as 
stated, though I think "config file" may have been meant here as 
"config.make").  The config file is a shell script that contains a 
'configure' command with a pile of options on it, and as many comments as 
you want, to tailor the build to your requirements.



Re: [ANNOUNCE] util-linux-ng 2.13-rc1

2007-07-06 Thread Bryan Henderson
>the maintainers of util-linux have well versed autotool people at their 
>disposal, so i really dont see this as being worrisome.

As long as that is true, I agree that the fact that so many autotool 
packages don't work well is irrelevant.

However, I think the difficulty of using autotools (I mean using by 
packagers), as evidenced by all the people who get it wrong, justifies 
people being skeptical that util-linux really has that expertise 
available.  Also, many open source projects are developed by a large 
diverse group of people, so even if there exist people who can do the 
autotools right, it doesn't mean they'll be done right.

One reason I try to minimize the number of tools/skills used in 
maintaining packages I distribute is to enable a larger group of people to 
help me maintain them.



Re: e2fsprogs-1.19 for v. old ext2 ?

2001-01-22 Thread Bryan Henderson

>All changes made to ext2 (even journalling) will work with the 
>same filesystem.

I can't figure out what this says.  Can you elaborate or reword?


I was dismayed just last week to find that Linux 2.0.36 (from a
rescue disk) could not mount an ext2 filesystem created recently.
The error message complained that the filesystem had a feature
that Linux didn't understand.  So apparently all is not strictly
backward and forward compatible in ext2.




Re: Partition IDs in the New World TM

2001-01-23 Thread Bryan Henderson

>OK.  s/Linux/Well behaved operating systems that look for
>file system signatures, rather than relying on stupid Partition IDs/

Well then you're still assigning partition IDs to operating 
systems, and my point was that partition types are not strictly 
tied to operating systems.

Allow me to reword to what you probably meant:  Have a partition
ID that means "generic partition - check signatures within for
details."  (And then get people who develop file systems for use
with Linux, at least, to have a policy of always using that).

Incidentally, I just realized that the common name "partition ID"
for this value is quite a misnomer.  As far as I know, it has
never identified the partition, but rather described its contents.




Re: rename ops and ctime/mtime

2001-03-20 Thread Bryan Henderson

>I quite like the mtime change that XFS does as, finally, the ".." entry 
>is rewritten when a directory is moved (if the new parent != old parent).

I don't understand this.  Can you explain?



Re: [RFC] sane access to per-fs metadata (was Re: [PATCH] Documentation/ioctl-number.txt)

2001-03-23 Thread Bryan Henderson

>How can it be used? Well, say you've mounted JFS on /usr/local
>% mount -t jfsmeta none /mnt -o jfsroot=/usr/local
>% ls /mnt
>stats control   bootcode whatever_I_bloody_want
>% cat /mnt/stats
>master is on /usr/local
>fragmentation = 5%
>696942 reads, yodda, yodda
>% echo "defrag 69 whatever 42 13" > /mnt/control
>% umount /mnt

There's a lot of cool simplicity in this, both in implementation and 
application, but it leaves something to be desired in functionality.  This 
is partly because the price you pay for being able to use existing, 
well-worn Unix interfaces is the ancient limitations of those interfaces 
-- like the inability to return adequate error information.

Specifically, transactional stuff looks really hard in this method.
If I want the user to know why his "defrag" command failed, how would I 
pass that information back to him?  What if I want to warn him of a 
filesystem inconsistency I found along the way?  Or inform him of how 
effective the defrag was?  And bear in mind that multiple processes may be 
issuing commands to /mnt/control simultaneously.

With ioctl, I can easily match a response of any kind to a request.  I can 
even return an English text message if I want to be friendly.
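
A purely hypothetical sketch of such an ioctl-based control call, seen from
user space -- every name here is invented for illustration, not a real JFS
or kernel interface:

    #include <stdio.h>
    #include <sys/ioctl.h>

    struct myfs_defrag_req {
        unsigned int group;        /* in: allocation group to defragment */
        int          status;       /* out: driver's result code */
        char         message[128]; /* out: human-readable explanation */
    };
    #define MYFS_IOC_DEFRAG  _IOWR('f', 1, struct myfs_defrag_req)

    int run_defrag(int fd, unsigned int group)
    {
        struct myfs_defrag_req req = { .group = group };

        /* The reply is unambiguously paired with this request, even if
           other processes issue control operations at the same time. */
        if (ioctl(fd, MYFS_IOC_DEFRAG, &req) != 0 || req.status != 0) {
            fprintf(stderr, "defrag failed: %s\n", req.message);
            return -1;
        }
        return 0;
    }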




Re: File Locking in Linux 2.5

2001-05-04 Thread Bryan Henderson


>Solaris man-page of dup() says:

If you read this with the proper lexicon, it does in fact specify the
broken-as-designed behavior people are complaining about.

>dup() returns a new file descriptor having the following  in
> common with the original open file descriptor fildes:
>
>  Same open file (or pipe).

Locks are associated with files, and consequently with open files, so dup()
creates a new file descriptor that is associated with the same locks as the
original.


> All locks associated with a file for  a  given  process  are
> removed  when  a  file descriptor for that file is closed by
> that process

So when you close one file descriptor for a file, the file's locks are
removed, and therefore the locks associated with all of that file's file
descriptors are as well.
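
A small illustration of that consequence (file name made up; this is just
the standard POSIX record-lock behavior being quoted above):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                            .l_start = 0, .l_len = 0 };
        int fd = open("somefile", O_RDWR);
        int fd2;

        fcntl(fd, F_SETLK, &fl);   /* lock the whole file */
        fd2 = dup(fd);             /* same open file, same locks */
        close(fd2);                /* drops this process's lock on the file,
                                      even though fd is still open */
        return 0;
    }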

>I would have thought that  dup()  creates new file object which
>does not share file state with the original one

I guess it depends on what a "file object" is.  The quoted man page doesn't
use that term, and I can see it applying to a file descriptor or an open
instance, or even to a file.  I don't think it's a well defined term.

I don't really believe the man page here.  I'll bet you can mount a
filesystem twice, and then Solaris sees a single file in it as two
different files, and closing the file doesn't cause all of the file's locks
to get dropped.  I usually use the term "file image" in a case like this
instead of "file."





Re: read_super()

2001-05-11 Thread Bryan Henderson


>all ->read_super() should do is read the superblock!

You're taking it kind of literally, aren't you?  These days, superblock is
a metaphor.  Lots of filesystems don't have real superblocks.  I think if
we were naming ->read_super() today, we'd name it ->mount().

>Ideally, we should let VFS do exclusion between mount/umount and remove
>lock_super() from there. Then it becomes fs-private thing. It's not too
>hard now - most of the stuff it depends on is already in the tree.

I don't think the filesystem driver can do mount/unmount exclusion.  What
you're trying to serialize is the very existence of the filesystem driver
-- once ->write_super() is running it's too late to lock out another
process which might be in the middle of unregistering the filesystem type
and eliminating write_super().

But since we're on the topic of filesystem driver coordination of mount and
unmount, I think a useful enhancement in this area would be for ->put_super
() to be able to refuse the unmount, with a bad return code.  Then the
concept of the filesystem being busy could be pushed into the VFS layer.
In complex filesystems, there may be more to it than just files in use.

>This deadlocks ext3, which wants to call ext3_truncate()
>inside ext3_read_super().

This is the part I don't get.  Does ext3_truncate() acquire the superblock
lock?  If so, that would seem to be the problem -- it's a layering
violation.

On the other hand, along the same lines of complex stuff being done in
->read_super(), I've run into grief because read_super() is expected to go
all the way to accessing the root directory and creating an inode and
dentry for it.  That's a big job on the network filesystem I'm working on,
and it's all done under the mount semaphore so everyone else has to wait
for it.  If it crashes, that's it for mounting and unmounting until a hard
reboot.

On AIX, the same filesystem type doesn't present a problem because the
->read_super() function is split in two.  The filesystem can be fully
mounted and registered and everything without the root directory having
been accessed.  The root directory gets accessed the first time someone
actually needs to get to a file.

In Linux, if we could just have separate ->read_super() and read_root(),
with the mount semaphore dropped in between, it would probably solve the
problem.





read-only mounts

2001-05-18 Thread Bryan Henderson

I have discovered, looking at Linux 2.4.2, that the read-only status of a
mount is considered in some places to be a matter of file permissions, and
in others as something separate from file permissions.  So in some cases,
it is the responsibility of a filesystem object's ->permission routine to
check the MS_RDONLY superblock flag and deny write permission, but in other
cases FS code checks MS_RDONLY itself.

This seems to me inconsistent to the point of surely causing mistakes.  Is
there a consistent philosophy here that I'm missing?

I noticed the problem when my filesystem driver had the following quirky
behavior:  I have an easy ->permission that grants write access in spite of
the MS_RDONLY flag.  When I open(O_RDWR | O_CREAT) a new file on  a
read-only mount, the open() fails, but the file gets created anyway.
open_namei() defers to the filesystem driver for the creation part, but
fails the open on its own authority later on.






Re: inode->i_blksize and inode->i_blocks

2001-06-04 Thread Bryan Henderson


>Are there any deeper reasons,
>why
>a) inode->i_blksize is set to PAGESIZE eg. 4096 independent of the actual
>block size of the file system?

Well, why not?  The field tells what is a good chunk size to read or write
for maximum performance.  If the I/O is done in PAGESIZE cache page units,
then that's the best number to use.

I suppose in the very first unix filesystems, the field may have meant
filesystem block size, which was identical to the highest performing
read/write size, and that may account for its name.

>b) the number of blocks is counted in 512 Bytes and not in the actual
>blocksize of the filesystem?

I can't see how the number of actual blocks would be helpful, especially
since as described above, we don't even know how big they are.  We don't
even know that they're fixed size or that a concept of a block even exists.

>(is this for historical reasons??)

That would be my guess.  Though I can't think of any particular programs
that would measure a file by multiplying this number by 512.

In any case, the inode fields are defined as they are because they
implement a standard stat() interface that includes these same numbers.





Re: inode->i_blksize and inode->i_blocks

2001-06-04 Thread Bryan Henderson


>> >a) inode->i_blksize is set to PAGESIZE eg. 4096 independent of the
>> > actual block size of the file system?
>
>>If the I/O is done in PAGESIZE(-size) cache
>> page units, then that's the best number to use.
>
>But we already know that from PAGE_SIZE, this seems like a complete
>waste.

OK, but who's "we"?  Individual filesystem drivers (some of them, that is)
know that the optimum read/write size is PAGESIZE.  But does the FS layer?
And if the FS layer doesn't, the user space program certainly can't.  The
ext2 driver sets i_blksize to PAGESIZE, but another driver might set it to
something else.  FS blindly passes i_blksize up to user space (via stat()).

>> In any case, the inode fields are defined as they are because they
>> implement a standard stat() interface that includes these same
>> numbers.
>
>We can fix things up in cp_old/new_stat if we want.

Just as long as all the information is in the inode.  Which I think is what
the issue is.

I'm more confused than ever about the i_blocks (filesize divided by 512)
field.  I note that in both the inode and the stat() result, the filesize
in bytes exists, and presumably always has.  So why would anyone ever want
to know separately how many 512 byte units of data are in the file?  FS
code appears to explicitly allow for a filesystem driver to denominate
i_blocks in other units, but any other unit would appear to break the
stat() interface.





Re: about BKL in VFS

2001-06-08 Thread Bryan Henderson

I think what may have gotten lost in Alexander's detailed reply is the big 
picture on the BKL in VFS.  The issue of the BKL protecting ->lookup is 
the same as for every other VFS call:

A whole bunch of filesystem drivers were designed in a time when there 
could be only one CPU, and coupled with a non-preemptive kernel, that 
meant these filesystem drivers could depend on uninterrupted access to 
data structures and filesystems.  When the multiple CPU case was 
introduced, it was not practical to update every filesystem driver, so the 
Big Kernel Lock (BKL) was added to give those drivers the uninterrupted 
access they (may) expect.  You may surmise that a "lookup" routine doesn't 
need such uninterrupted access, but you can never really assume that.

I think an individual filesystem driver that is specifically designed to 
do the fine-grained locking necessary to tolerate multiple CPUs can just 
release the BKL and avoid any bottleneck.




Re: about BKL in VFS

2001-06-08 Thread Bryan Henderson

Bryan:
>> introduced, it was not practical to update every filesystem driver, so 
>> the Big Kernel Lock (BKL) was added to give those drivers the 
>> uninterrupted access they (may) expect.  You may surmise that a "lookup" 
>> routine doesn't need such uninterrupted access, but you can never really 
>> assume that.
>
Al:
>Now, now. BKL _is_ worth the removal. The thing being, "oh, we just take
>BKL, so we don't have to worry about SMP races" is wrong attitude.

Yeah, I agree.  I took the question to be "why is the BKL there?" and "Can 
we just remove the lock_kernel() from FS?", not "is the BKL shield the 
best possible design for Linux?"

I'd like to see the single threaded guarantee to VFS routines revoked -- 
not only the UP side of it, but the non-preemption as well.  I have always 
been taught to assume that anything can happen between any two 
instructions, or even in the middle of one, unless I explicitly lock 
against it.

An "MP-safe" attribute that a filesystem driver can register for its VFS 
routines would be a good tool to get there.




Re: about BKL in VFS

2001-06-08 Thread Bryan Henderson

>IMO preemptive kernel patches are an
>exercise in masturbation (bad algorithm that can be preempted at any 
>point is still a bad algorithm and should be fixed, not hidden)

What does this mean?  What is a preemptive kernel patch and what kind of 
bad algorithm are you contemplating, and what does it mean to hide one?

You're apparently referring back to some well known argument, but I'm not 
familiar with it myself.




Re: about BKL in VFS

2001-06-11 Thread Bryan Henderson


>What we ought to do in 2.5.early (possibly - in 2.4) is to
>add ->max_page to address_space. I.e. ->i_size in pages

I don't get it.  What would address_space.max_page mean and how would you
use it?  Obviously, you don't really mean for it to be defined as
inode.i_size in pages, since then it would have to be updated in lockstep
with i_size and wouldn't buy you anything.

Although I've been all over this code and am actually in the midst of
writing a network filesystem driver, I don't understand most the language
in your post, so you may have to use extra detail.




read-only mounts

2001-06-12 Thread Bryan Henderson

I posted this earlier, but it was right at the time that linux-fsdevel
got swamped with a linux-kernel discussion, so I don't think anyone saw
it.


I have discovered, looking at Linux 2.4.2, that the read-only status of a
mount is considered in some places to be a matter of file permissions, and
in others as something separate from file permissions.  So in some cases,
it is the responsibility of a filesystem object's ->permission routine to
check the MS_RDONLY superblock flag and deny write permission, but in 
other cases FS code checks MS_RDONLY itself.

This seems to me inconsistent to the point of surely causing mistakes.  Is
there a consistent philosophy here that I'm missing?

I noticed the problem when my filesystem driver had the following quirky
behavior:  I have an easy ->permission that grants write access in spite 
of
the MS_RDONLY flag.  When I open(O_RDWR | O_CREAT) a new file on  a
read-only mount, the open() fails, but the file gets created anyway.
open_namei() defers to the filesystem driver for the creation part, but
fails the open on its own authority later on.






How to use page cache

2001-06-26 Thread Bryan Henderson

A filesystem driver is supposed to be able to use the page cache for file 
caching without involving the buffer cache, isn't it?  I can't find any 
examples of it, but I heard that was the case.

I had a filesystem driver doing just that with Linux 2.4.2, but now the 
interface it used is no longer available to a loadable kernel module.  How 
would a driver, having put file write data into a page cache page, get the 
page onto the dirty page list so that it might be written out to the file 
later?  I am able to do this easily by modifying the base kernel to export 
__set_page_dirty(), but there must be a better way.

Incidentally, what changed since 2.4.2 is that the lock that protects the 
page lists was moved from the address_space struct to the global 
pagecache_lock.  My code previously internally duplicated the function of 
__set_page_dirty(), but since pagecache_lock is not an exported symbol, it 
can't do so anymore.
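
For context, the operation in question is roughly the following -- a
reconstruction from memory of what __set_page_dirty() does in these
kernels, so field and function names may not match any particular 2.4.x
exactly:

    static void my_set_page_dirty(struct page *page)
    {
        struct address_space *mapping = page->mapping;

        if (!test_and_set_bit(PG_dirty, &page->flags)) {
            spin_lock(&pagecache_lock);   /* global, and not exported */
            list_del(&page->list);
            list_add(&page->list, &mapping->dirty_pages);
            spin_unlock(&pagecache_lock);
            mark_inode_dirty_pages(mapping->host);
        }
    }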



Re: Advice sought on how to lock multiple pages in ->prepare_write and ->writepage

2005-01-28 Thread Bryan Henderson
>Just putting up my hand to say "yeah, us too" - we could also make
>use of that functionality, so we can grok existing XFS filesystems
>that have blocksizes larger than the page size.

IBM Storage Tank has block size > page size and has the same problem. This 
is one of several ways that Storage Tank isn't generic enough to use 
generic_file_write() and generic_file_read(), so it doesn't.  That's not a 
terrible way to go, by the way.  At some point, making the generic 
interface complex enough to handle every possible filesystem becomes worse 
than every filesystem driver having its own code.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems



Re: Advice sought on how to lock multiple pages in ->prepare_write and ->writepage

2005-01-31 Thread Bryan Henderson
>OOC, have you folks measured any performance improvements at all
>using larger IOs (doing multi-page bios?) with larger blocksizes?

First, let me clarify that by "larger I/O" you mean a larger unit of I/O 
submitted to the block layer (doing multi-page bios), because people often 
say "larger I/O" to mean larger units of I/O from Linux to the device, and 
the two are only barely coupled.

Blocksize > page size doesn't mean multi-page bios as long as VM is still 
managing the file cache.  VM pages in and out one page at a time.

To get multi-page bios (in any natural way), you need to throw out not 
only the generic file read/write routines, but the page cache as well.

Every time I've looked at multi-page bios, I've been unable to see any 
reason that they would be faster than multiple single-page bios.  But I 
haven't seen any experiments.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems  


Re: Advice sought on how to lock multiple pages in ->prepare_write and ->writepage

2005-01-31 Thread Bryan Henderson
Thanks for the numbers, though there are enough variables here that it's 
hard to make any hard conclusions.

When I've seen these comparisons in the past, it turned out to be one of 
two things:

1) The system with the smaller I/Os (I/O = unit seen by the device) had 
more CPU time per megabyte in the code path to start I/O, so that it 
started less I/O.  The small I/Os are a consequence of the lower 
throughput, not a cause.  You can often rule this out just by looking at 
CPU utilization.

2) The system with the smaller I/Os had a window tuning problem in which 
it was waiting for previous I/O to complete before starting more, with 
queues not full, and thus starting less I/O.  Some devices, with good 
intentions, suck the Linux queue dry, one tiny I/O at a time, and then 
perform miserably processing those tiny I/Os.  Properly tuned, the device 
would buffer fewer I/Os and thus let the queues build inside Linux and 
thus cause Linux to send larger I/Os.

People have done ugly queue plugging algorithms to try to defeat this 
queue sucking by withholding I/O from a device willing to take it.  Others 
defeat it by withholding I/O from a willing Linux block layer, instead 
saving up I/O and submitting it in large bios.

>Ext3 (writeback mode)
>
>Device:  rrqm/s   wrqm/s   r/s    w/s  rsec/s     wsec/s  rkB/s     wkB/s  avgrq-sz avgqu-sz   await  svctm  %util
>sdc        0.00 21095.60 21.00 244.40  168.00  170723.20  84.00  85361.60    643.90    11.15   42.15   3.45  91.60
>
>We see 21k merges per second going on, and an average request size of 
>only 643 sectors where the device can handle up to 1Mb (2048 sectors).
>
>Here is iostat from the same test w/ JFS instead:
>
>Device:  rrqm/s   wrqm/s   r/s    w/s  rsec/s     wsec/s  rkB/s     wkB/s  avgrq-sz avgqu-sz   await  svctm  %util
>sdc        0.00  1110.58  0.00  97.80    0.00  201821.96   0.00 100910.98   2063.53   117.09 1054.11  10.21  99.84
>
>So, in this case I think it is making a difference: 1k merges and a big 
>difference in throughput, though there could be other issues. 

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - nopage alternative

2005-02-04 Thread Bryan Henderson
>Or actually we wouldn't
>even care if stale pages are added as they would still be cleared in
>readpage().  And pages found and uptodate and locked simply need to be
>marked dirty and released again and if not uptodate they need to be
>cleared first.

You do need some form of locking to make sure someone doesn't add a page, 
update it, and clean it while you're independently initializing the block 
under it.  The cache locks are usually what coordinate this kind of 
activity, but I think we've established those locks aren't available at 
this level (inside a pageout).  Maybe a block lock or file lock could 
serve. 

I believe modifying a page and its status while it's locked by another 
process is a violation of page management ethics.  I wouldn't dare.

If you do figure something out with direct clearing of the block upon 
pageout of the first page in it, remember to have some reserved memory for 
the I/O buffer and bio/bh, because you can't wait for memory inside a 
pageout.

>Is your driver's source available to look at?

Not easily.  An old version is available for download (under GPL) at 
http://www-1.ibm.com/servers/storage/software/virtualization/sfs/implementation.html .  
You have to register.  I could email a copy (1.2M) to you, but I don't 
know if it would be worth your time to plow through it.  This (Storage 
Tank) is a multi-disk shared filesystem (multiple computers access the 
same disks, but the block maps are kept on a separate metadata server) 
with multiple block size and page size and copy-on-write snapshots.  And 
the same code works in a wide variety of 2.4 and 2.6 Linux kernels.  These 
things all complicate this area of initializing a new block.  And it's 
20,000 lines of code not counting the metadata server.



Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - loop device

2005-02-03 Thread Bryan Henderson
>I did a patch which switched loop to use the file_operations.read/write
>about a year ago.  Forget what happened to it.  It always seemed the 
>right thing to do..

This is unquestionably the right thing to do (at least compared to what we 
have now).  The loop device driver has no business assuming that the 
underlying filesystem uses the generic routines.  I always assumed it was 
a simple design error that it did.  (Such errors are easy to make because 
prepare_write and commit_write are declared as address space operations, 
when they're really private to the buffer cache and generic writer).

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - nopage alternative

2005-02-03 Thread Bryan Henderson
>> > > And for the vmscan->writepage() side of things I wonder if it would 
>> > > be possible to overload the mapping's ->nopage handler.  If the 
>> > > target page lies in a hole, go off and allocate all the necessary 
>> > > pagecache pages, zero them, mark them dirty?
>> > 
>> > I guess it would be possible but ->nopage is used for the read case 
>> > and why would we want to then cause writes/allocations?
>> 
>> yup, we'd need to create a new handler for writes, or pass 
>> `write_access' into ->nopage.  I think others (dwdm2?) have seen a need 
>> for that.
>
>That would work as long as all writable mappings are actually written to
>everywhere.  Otherwise you still get that reading the whole mmap()ped
>area but writing a small part of it would still instantiate all of it on
>disk.  As far as I understand this there is no way to hook into the mmap
>system such that we have a hook whenever a mmap()ped page gets written
>to for the first time.  (I may well be wrong on that one so please
>correct me if that is the case.)

I think the point is that we can't have a "handler for writes," because 
the writes are being done by simple CPU Store instructions in a user 
program.  The handler we're talking about is just for page faults.  Other 
operating systems approach this by actually _having_ a handler for a CPU 
store instruction, in the form of a page protection fault handler -- the 
nopage routine adds the page to the user's address space, but write 
protects it.  The first time the user tries to store into it, the 
filesystem driver gets a chance to do what's necessary to support a dirty 
cache page -- allocate a block, add additional dirty pages to the cache, 
etc.  It would be wonderful to have that in Linux.  I saw hints of such 
code in a Linux kernel once (a "write_protect" address space operation or 
something like that); I don't know what happened to it.

Short of that, I don't see any way to avoid sometimes filling in holes due 
to reads.  It's not a huge problem, though -- it requires someone to do a 
shared writable mmap and then read lots of holes and not write to them, 
which is a pretty rare situation for a normal file.

I didn't follow how the helper function solves this problem.  If it's 
something involving adding the required extra pages to the cache at 
pageout time, then that's not going to work -- you can't make adding pages 
to the cache a prerequisite for cleaning a page -- that would be Deadlock 
City.

My large-block filesystem driver does the nopage thing, and does in fact 
fill in files unnecessarily in this scenario.  :-(  The driver for the 
same filesystems on AIX does not, though.  It has the write protection 
thing.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems



Re: ext3 writepages ?

2005-02-09 Thread Bryan Henderson
>I see much larger IO chunks and better throughput. So, I guess its
>worth doing it

I hate to see something like this go ahead based on empirical results 
without theory.  It might make things worse somewhere else.

Do you have an explanation for why the IO chunks are larger?  Is the I/O 
scheduler not building large I/Os out of small requests?  Is the queue 
running dry while the device is actually busy?

--
Bryan Henderson   San Jose California
IBM Almaden Research Center   Filesystems


Re: ext3 writepages ?

2005-02-10 Thread Bryan Henderson
>I am inferring this using iostat which shows that average device
>utilization fluctuates between 83 and 99 percent and the average
>request size is around 650 sectors (going to the device) without
>writepages. 
>
>With writepages, device utilization never drops below 95 percent and
>is usually about 98 percent utilized, and the average request size to
>the device is around 1000 sectors.

Well that blows away the only two ways I know that this effect can happen. 
 The first has to do with certain code being more efficient than other 
code at assembling I/Os, but the fact that the CPU utilization is the same 
in both cases pretty much eliminates that.  The other is where the 
interactivity of the I/O generator doesn't match the buffering in the 
device so that the device ends up 100% busy processing small I/Os that 
were sent to it because it said all the while that it needed more work. 
But in the small-I/O case, we don't see a 100% busy device.

So why would the device be up to 17% idle, since the writepages case makes 
it apparent that the I/O generator is capable of generating much more 
work?  Is there some queue plugging (I/O scheduler delays sending I/O to 
the device even though the device is idle) going on?

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems



Re: ext3 writepages ?

2005-02-10 Thread Bryan Henderson
>Don't you think, filesystems submitting biggest chunks of IO
>possible is better than submitting 1k-4k chunks and hoping that
>IO schedulers do the perfect job ? 

No, I don't see why it would be better.  In fact, intuitively, I think the 
I/O scheduler, being closer to the device, should do a better job of 
deciding in what packages I/O should go to the device.  After all, there 
exist block devices that don't process big chunks faster than small ones.

So this starts to look like something where you withhold data from the I/O 
scheduler in order to prevent it from scheduling the I/O wrongly because 
you (the pager/filesystem driver) know better.  That shouldn't be the 
architecture.

So I'd still like to see a theory that explains why submitting the 
I/O a little at a time (i.e. including the bio_submit() in the loop that 
assembles the I/O) causes the device to be idle more.

>We all learnt thro 2.4 RAW code about the overhead of doing 512bytes
>IO and making the elevator merge all the peices together.

That was CPU time, right?  In the present case, the numbers say it takes 
the same amount of CPU time to assemble the I/O above the I/O scheduler as 
inside it.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems


Re: ext3 writepages ?

2005-02-10 Thread Bryan Henderson
>Its possible that by doing larger
>IOs we save CPU and use that CPU to push more data ?

This is absolutely right; my mistake -- the relevant number is CPU seconds 
per megabyte moved, not CPU seconds per elapsed second.
But I don't think we're close enough to 100% CPU utilization that this 
explains much.

In fact, the curious thing here is that neither the disk nor the CPU seems 
to be a bottleneck in the slow case.  Maybe there's some serialization I'm 
not seeing that makes less parallelism between I/O and execution.  Is this 
a single thread doing writes and syncs to a single file?

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems


Re: ext3 writepages ?

2005-02-10 Thread Bryan Henderson
I went back and looked more closely and see that you did more than add a 
->writepages method.  You replaced the ->prepare_write with one that 
doesn't involve the buffer cache, right?  And from your answer to Badari's 
question about that, I believe you said this is not an integral part of 
having ->writepages, but an additional enhancement.  Well, that could 
explain a lot.  First of all, there's a significant amount of CPU time 
involved in managing buffer heads.  In the profile you posted, it's one of 
the differences in CPU time between the writepages and non-writepages 
case.  But it also changes the whole way the file cache is managed, 
doesn't it?  That might account for the fact that in one case you see 
cache cleaning happening via balance_dirty_pages() (i.e. memory fills up), 
but in the other it happens via Pdflush.  I'm not really up on the buffer 
cache; I haven't used it in my own studies for years.

I also saw that while you originally said CPU utilization was 73% in both 
cases, in one of the profiles I add up at least 77% for the writepages 
case, so I'm not sure we're really comparing straight across.

To investigate these effects further, I think you should monitor 
/proc/meminfo.  And/or make more isolated changes to the code.

>So yes, there could be better parallelism in the writepages case, but
>again this behavior could be a symptom and not a cause,

I'm not really suggesting that there's better parallelism in the 
writepages case.  I'm suggesting that there's poor parallelism (compared 
to what I expect) in both cases, which means that adding CPU time directly 
affects throughput.  If the CPU time were in parallel with the I/O time, 
adding an extra 1.8ms per megabyte to the CPU time (which is what one of 
my calculation from your data gave) wouldn't affect throughput.

But I believe we've at least established doubt that submitting an entire 
file cache in one bio is faster than submitting a bio for each page and 
that smaller I/Os (to the device) cause lower throughput in the 
non-writepages case (it seems more likely that the lower throughput causes 
the smaller I/Os).

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems


Re: Efficient handling of sparse files

2005-02-28 Thread Bryan Henderson
>This is very similar to the Windows ability to do a query to
>get the block map of a sparse file. Might be worth looking at
>that interface to see what we can learn.

XDSM (better but incorrectly known by the generic term DMAPI) also has one 
of those, for use in migrating or backing up sparse files and restoring 
them to their original sparseness.

I'd resist any interface that exposes implementation details like that. 
The user program shouldn't know anything about block allocations.

On the other hand, I can see the value in exposing the concept of a clear 
section of file (a hole), as distinct from one filled with zeroes.

I once had to deal with this in a system that would have to transfer mass 
quantities of zero bytes over a network for sparse files.  I found then 
that the most convenient interface was a new form of the read call.  It 
returned an indicator of whether the offsets being read were clear or 
filled plus, if filled, the values.  If clear, the values are by 
definition zero.  At boundaries between clear and filled sections of the 
file, it would do a short read.  Otherwise, the semantics were pretty much 
the same as classic Unix character stream read.

My interface didn't have the ability to tell you how far the hole extends 
without you having to allocate a buffer that big (because you don't know 
until you do the read if you're reading a hole or not), but that seems 
like a reasonable addition.
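
As a purely hypothetical shape for such an interface (none of these names
exist anywhere):

    #include <sys/types.h>

    enum sparse_kind { SPARSE_FILLED, SPARSE_HOLE };

    /* Like read(), but reports whether the bytes came from a hole and
       does a short read at any hole/data boundary. */
    ssize_t read_sparse(int fd, void *buf, size_t count,
                        enum sparse_kind *kind);

A backup or network-copy program would loop over read_sparse() and, when
*kind comes back SPARSE_HOLE, seek over the destination instead of writing
zeroes.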

If someone's expending development effort on exploiting file sparseness, 
I'd rather see it spent implementing a clear (aka punch) system call 
first.  Or has that been done when I wasn't looking?

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems


Re: Efficient handling of sparse files

2005-03-01 Thread Bryan Henderson
>A database or file scanner that must read a lot of data can benefit
>from having even a rough idea of the layout of the data on disk.

True.  There's always room for interfaces that dive into the lower layers 
for those users who want to be there.  (Of course, you end up crossing a 
line fairly quickly where you shouldn't be pretending to use a filesystem 
at all and should just use a block disk).

But I first want to see an abstract interface where an application can 
recognize cleared regions of file without actually knowing anything about 
how the filesystem represents them or what the filesystem does with them. 
In particular, there's no reason to give up the character stream notion of 
a file and start talking about blocks just to have visible cleared regions 
(holes).

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems



Re: Max mounted filesystems ?

2005-03-02 Thread Bryan Henderson
>I cannot seem to increase the maximum number of
>filesystems on my Red hat system...

What is your evidence of the maximum that you can't increase?  (E.g. does 
something fail?  How?)

--
Bryan Henderson   San Jose California
IBM Almaden Research Center   Filesystems


Re: Max mounted filesystems ?

2005-03-03 Thread Bryan Henderson
>>>118 total. When I attempt to mount the 57th one, I
>>>get "Too many mounted Filesystems"
>> 
>> 
>> Sorry I don't know what the limitations are for non-anonymous 
filesystems.
>> 57 seems a bit unusual though.
>
>Is that the exact error message?
>Can you post the kernel message log with that message in it?

I don't think he said it was a kernel message.  And since the kernel has 
not traditionally issued a message when failing a mount for any reason, I 
suspect it's a message from the program doing the mount.  But I don't know 
how a mount program would recognize such a condition.  util-linux 'mount' 
has a message that includes "too many mounted filesystems" in a list of 
possible reasons a mount may have failed.

By the way, it's always a good idea to use the -t (type) option on 
util-linux 'mount'.  You get better diagnostic information and less 
arbitrary behavior that way.  Without -t, 'mount' tries types until one 
works and can't tell the difference between a bad guess at the type and a 
legitimate mount failure.



Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-14 Thread Bryan Henderson
>Hmm, it's a bit confusing that we call both things "reservation".

I think "reservation" is wrong for one of them and anyone using it that 
way should stop.  I believe the common terminology is:

- choosing the blocks is "placement."

- committing the required number of blocks from the resource pool for the 
instant use is "reservation."

- the combination of reservation and placement is "allocation."

Obviously, traditional filesystem drivers haven't split placement from 
reservation, so don't bother to use those terms.

Most delaying schemes delay the placement but not the reservation because 
they don't want to accept the possibility that a write would fail for lack 
of space after the write() system call succeeded.

Even in non-filesystem areas, "allocate" usually means to assign 
particular resources, while "reserve" just means to make arrangements so 
that a future allocate will succeed.  For example, if you know you need up 
to 10 blocks of memory to complete a task without deadlocking, but you 
don't yet know exactly how many, you would reserve 10 blocks and 
later, if necessary, allocate the actual blocks.
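
A toy sketch of that split, with invented names just to make the
distinction concrete (reserve commits a count against the pool;
allocation later does the placement):

    /* Toy illustration (invented names): reserve() commits a count
     * against the free pool; allocate_reserved() later picks blocks. */
    #include <stdbool.h>

    #define POOL_SIZE 1024

    static long pool_free = POOL_SIZE;  /* neither placed nor reserved */
    static bool in_use[POOL_SIZE];      /* which blocks are placed     */

    /* Reservation: guarantee a later allocate of up to n blocks works. */
    static bool reserve(long n)
    {
        if (pool_free < n)
            return false;
        pool_free -= n;
        return true;
    }

    /* Allocation: consume one reserved block and decide which block it
     * is (the placement); cannot fail if the caller reserved first. */
    static long allocate_reserved(void)
    {
        for (long i = 0; i < POOL_SIZE; i++)
            if (!in_use[i]) {
                in_use[i] = true;
                return i;
            }
        return -1;
    }

    /* Give back the reserved-but-never-allocated remainder. */
    static void unreserve(long n)
    {
        pool_free += n;
    }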

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-15 Thread Bryan Henderson
>Sounds reasonable. The thing with "reservation" is that people use
>it in daily life with all kinds of meanings,

That's the way it is all over.  Normal people are very sloppy in their 
language.  Engineers have to try to narrow the meanings of the common 
words to avoid totally confusing each other in these complex discussions.

But I think "reserve" in common usage is a lot less ambiguous than you 
say.  I believe when you reserve a seat on an airplane, most of the time 
it isn't a particular seat.  When it is, the airline will call it a "seat 
assignment" and you get it only after you turn your reservation into a 
purchased ticket.

I've never worked in a restaurant, but I've always assumed that when I 
make a reservation, even the restaurant doesn't know which table it is 
until I show up.  That way, it can load balance and give people choices 
when they come in.

>E.g. if we "reserve" the next hundred blocks, so that allocation is
>contiguous, we may want to be able to take them away if some other
>file needs them.

I would not call that a reservation.  I did, incidentally, design such a 
system once, and I called it "pencilled in."  I might also call it 
preliminary placement.

But I agree that reservations can be more or less firm, owing to the fact 
that sometimes they can be broken, with more or less ease.  E.g. you might 
reserve a megabyte of space for a file, and under pathological conditions 
still be told when you go to write that there's no space for you and 
you're screwed.  Just like you can get to the restaurant and be told 
there's no table for you.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: files of size larger than fs size

2005-03-16 Thread Bryan Henderson
>But anyway it's interesting why the resulting sparse 
>files have different size on different fs?

That looks like a bug.  Assuming you didn't see any seeks or writes fail, 
the file size on all filesystems should be 2^56 + 4.  I suspect this is 
beyond the maximum file size allowed by the filesystem in some cases, so 
the write isn't happening, which means you should get a failure return 
code.

In the results you showed, the filesize ends up being a little less than 
2^48, which is not a place that you wrote ever.
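
For reference, a minimal version of that kind of test (not the original
poster's program) looks something like this; it writes 4 bytes at offset
2^56 and then asks the filesystem what size it thinks the file is:

    /* Write a few bytes at a huge offset, then report st_size.  A
     * filesystem with a smaller maximum file size should fail the
     * lseek() or the write(), not silently truncate. */
    #define _FILE_OFFSET_BITS 64
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>

    int main(void)
    {
        int fd = open("sparse.test", O_RDWR | O_CREAT | O_TRUNC, 0644);
        off_t offset = (off_t)1 << 56;
        struct stat st;

        if (fd < 0 || lseek(fd, offset, SEEK_SET) < 0 ||
            write(fd, "data", 4) != 4) {
            perror("sparse.test");
            return 1;
        }
        fstat(fd, &st);
        printf("size = %lld, expected %lld\n",
               (long long)st.st_size, (long long)offset + 4);
        return 0;
    }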

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: files of size larger than fs size

2005-03-17 Thread Bryan Henderson
>I found
>that for larger values, your test program is returning -1, but unsigned
>it appears as 18446744073709551615.

You mean you ran it?  Then what about the more interesting question of 
what your filesize ends up to be?  You say JFS allows files up to 2**52 
bytes, so I expect the test case would succeed up through the write at 
2**48 and leave the filesize 2**48 + 8.  But Max reports seeing 2**48 - 
4080.

It's conceivable that the reporting of the filesize is wrong, by the way.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: files of size larger than fs size

2005-03-17 Thread Bryan Henderson
>The problem appears to be mixing calls to lseek64 with calls to fread
>and fwrite.

Oh, of course.  I didn't see that.  You can't use the file descriptor of a 
file that is opened as a stream.  This test case uses the fileno() 
function to mess with the internals of the stream.

fseeko64() is the proper function to position a stream.
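
A corrected fragment, staying entirely inside stdio (a sketch; the
original test's file names and offsets are not reproduced here):

    /* Position a stream at a large offset through stdio itself, instead
     * of reaching around it with lseek64(fileno(f), ...). */
    #define _LARGEFILE64_SOURCE
    #include <stdio.h>
    #include <sys/types.h>

    int main(void)
    {
        FILE *f = fopen64("bigfile", "w+");
        if (!f)
            return 1;

        if (fseeko64(f, (off64_t)1 << 48, SEEK_SET) != 0)
            return 1;
        if (fwrite("data", 1, 4, f) != 4)
            return 1;
        fclose(f);
        return 0;
    }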

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mmap question

2005-03-21 Thread Bryan Henderson
>I want all my file system's operations to be complete uncached and
>synchronous, but I also want to support mmap.
>...
>What am I doing wrong?  Is what I'm trying to do impossible, and if
>so, how can I get as close as possible?

It looks to me like you're running into the fundamental limitation that 
the CPU doesn't notify Linux every time you store into a memory location. 
It does, though, set the dirty flag in the page table, and Linux 
eventually inspects that flag and finds out that you have stored in the 
past.  At that time, it can call set_page_dirty.

Without knowing what properties of not having a cache you were hoping for, 
I couldn't say what alternative would be closest to this.

Hypothetically, if you had a backing storage device that could do memory 
mapped I/O, you could have mmapped direct I/O.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mmap question

2005-03-21 Thread Bryan Henderson
>well, we *could* know ... never map this page writable.  have a per-vma
>flag that says "emulate writes", and call the filesystem to update
>backing storage before returning to the application.

Ah yes, you mean, I take it, that the page fault handler would look at the 
user's program and emulate the faulting store instruction and return to 
the instruction after it.  Very clever.

And as long as we're going down that path, we should also consider 
changing exec() so instead of branching into the program, it just 
interprets it!

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mmap question

2005-03-21 Thread Bryan Henderson
I forgot you were talking about code inside the kernel.  In that case, 
filemap_sync().

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mmap question

2005-03-21 Thread Bryan Henderson
>Is there an existing interface to force it to check if the page is
>dirty

The msync() system call and libc function does that.  And then it does the 
same thing as fsync().
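
From user space that looks like this (a minimal sketch, assuming a
non-empty file):

    /* Sketch: flush dirty mmap'ed pages to the file, the mmap analog
     * of calling fsync() after write(). */
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    int flush_mapping(const char *path)
    {
        int fd = open(path, O_RDWR);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0)
            return -1;

        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }

        p[0] ^= 1;              /* store => CPU marks the page dirty */
        p[0] ^= 1;              /* restore the original byte         */

        /* Have the kernel find the dirty pages and write them back. */
        int rc = msync(p, st.st_size, MS_SYNC);

        munmap(p, st.st_size);
        close(fd);
        return rc;
    }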

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Address space operations questions

2005-03-31 Thread Bryan Henderson
>So, semantics of ->sync_page() are roughly "kick underlying storage
>driver to actually perform all IO queued for this page, and, maybe, for
>other pages on this device too".

I prefer to think of it in a more modular sense.  To preserve modularity, 
the caller of sync_page() can't know anything about I/O scheduling.  So I 
think the semantics of ->sync_page() are "Someone is about to wait for the 
results of the previously requested write_page on this page."  It's 
completely up to the owner of the address space to figure out what would 
be appropriate to do given that information.

I agree that for the conventional filesystem and device types for which 
this interface was designed, the appropriate response would be to start 
any queued I/O.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Address space operations questions

2005-03-31 Thread Bryan Henderson
>what it
>*really* means to be called in sync_page() is that you're being told
>that some process is about to block on that page.  For what reason, you
>can't know from the call alone.

Ugh.  IOW it barely means anything.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Address space operations questions

2005-03-31 Thread Bryan Henderson
>It reflects the fact that the page lock can be held for a variety of
>reasons, some of which require you to kick the filesystem and some which
>don't.

So then what I don't understand is why you would make a call that tells 
you someone is trying to hold the page lock?  Why not a call that tells 
you something meaningful like, "someone is trying to read this page"?  Or 
"someone is waiting for this page to get clean?"

>I introduced the sync_page() call in 2.4.x partly in order to get rid of
>all those pathetic hard-coded calls to "run_task_queue(&tq_disk)"

That was pathetic all right, and sync_page() would be a clear improvement 
if it just replaced those modularity-busting I/O scheduling calls.  But 
did it?  Were there run_task_queue's every time the kernel waited for page 
status to change?  I thought they were in more eclectic places.

>the NFS client itself had to defer actually
>putting reads on the wire until someone requested the lock

But really, you mean the client had to defer putting reads on the wire 
until someone was ready to use the data.  That suggests a call to 
->sync_page in file read or page fault code rather than deep in page 
management.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Access content of file via inodes

2005-04-05 Thread Bryan Henderson
>How do I access/read the content of the files via using inodes
>or blocks that belong to the inode, at sys_link and vfs_link layer?

This is tricky because many interfaces that one would expect to use an 
inode as a file handle use a dentry instead.  To read the contents of a 
file via the VFS interface, you need a file pointer (struct file), and the 
file pointer identifies the file by dentry.  So you need to create a dummy 
dentry, which you can do with d_alloc_root(), and then create the file 
pointer with dentry_open(), then read the file with vfs_read().

That's for "via inodes."  I don't know what "via blocks" means.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Access content of file via inodes

2005-04-06 Thread Bryan Henderson
>What I meant by
>> via blocks is to gain knowledge of the physical blocks used by the inodes
>> and retrieve the content from it directly, by accessing b_data.
>
>The problem with that approach is that some filesystems may store part
>of the file outside of a complete block.

There's an even more basic problem with this approach:  The question is 
specifically about the filesystem-type-independent layer above the VFS 
interface.  At this layer, you don't even know that there is a block 
device involved.  And if you do, you don't know that the filesystem driver 
uses the buffer cache to access it.  And if you do know that it uses the 
buffer cache, you don't know that the file data you're looking for is 
presently in the buffer cache, or how to get it there if it isn't.

If you believe in the layering at all, the only interface you can consider 
at this layer for getting at file data is VFS ->read.

--
Bryan Henderson   San Jose California
IBM Almaden Research Center   Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Address space operations - >bmap

2005-04-07 Thread Bryan Henderson
>We are about to start implementing a fs where data can move around the
>device and so a physical block address is not really useful. I have
>understood from other postings to this list that reiserfs and ntfs
>don't implement this method so I suppose we'll do the same. I'll just
>find some nice error to return.

It's appropriate only for the most classic of filesystems, really.  It was 
always a layering violation, but is handy for hackish things.

Interfaces that expose block addresses are in the same boat as all those 
fsstat fields -- block size, blocks used, blocks free, inodes used, inodes 
free.  They make sense for the original Unix File System, but get harder 
to give meaning with every new generation.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS4 mount problem

2005-04-15 Thread Bryan Henderson
>We already have compat_sys_mount that treats the mount data for smbfs and
>ncpfs specially, so you could you add an nsfv4 specific bit there?

Do we really want to pile filesystem-type-specific stuff into fs/compat.c? 
 It's bad enough that it's there for smbfs and ncpfs (and similar stuff 
for NFS server).  It's only going to get worse.

fs/compat.c is fine for interfaces implemented by fs/ code, but the 32/64 
bit translations for other interfaces ought to be done by the modules that 
know those interfaces.

A mount option structure that contains addresses should contain 
information as to whether it's in 32-bit-address format or 64-bit-address 
format.  The nfsv4 read_super method can use that to translate its own 
mount options.

Another option would be for Linux to pass that information (essentially, 
whether the mount() system call is being handled by sys_mount() or 
compat_sys_mount()) as another argument to read_super.  This would allow 
better backward compatibility with user space binaries, if there are 
already 32 bit and 64bit binaries using indistinguishable mount option 
structures.

The same issue, by the way, applies to ioctls, some of which have an 
argument which is the address of a block of memory that contains other 
addresses.  fs/compat.c approaches these in a more 
filesystem-type-independent way than it does mount(), but still not 
independent enough.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS4 mount problem

2005-04-15 Thread Bryan Henderson
>Make a ->compat_read_super() just like we have a ->compat_ioctl()
>method for files, if you want to suggest a solution like what
>you describe.

Even better.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS4 mount problem

2005-04-18 Thread Bryan Henderson
>On Fri, Apr 15, 2005 at 01:22:59PM -0700, David S. Miller wrote:
>> 
>> Make a ->compat_read_super() just like we have a ->compat_ioctl()
>> method for files, if you want to suggest a solution like what
>> you describe.
>
>I don't think we should encourage filesystem writers to do such stupid
>things as ncfps/smbfs do.  In fact I'm totally unhappy thay nfs4 went
>down that road.

Which road is that?

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS4 mount problem

2005-04-18 Thread Bryan Henderson
>mount() is not a documented syscall.  The binary formats for filesystems
>like NFS are only documented inside the kernels to which they apply.

What  _is_ a documented system call?  Linux is famous for not having 
documented interfaces (or, put another way, not distinguishing between an 
interface you can read in an official document and one you discover by 
reading kernel source code).  But of all interfaces in Linux, the system 
call interface is probably the most accepted as one a user of the kernel 
can rely on.

I don't think a filesystem driver designer should expect mount options to 
be private to one particular user space program.  Especially one that 
isn't even packaged with the driver.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Lilo requirements (Was: Re: Address space operations questions)

2005-04-18 Thread Bryan Henderson
>- unit of disk space allocation for the kernel image file is
> block. That is, optimizations like UFS fragments or reiserfs tails are
> not applied, and
>
> - blocks that kernel image is stored into are real disk blocks (i.e.,
> there is a way to disable "delayed allocation"), and
>
> - kernel image file is not relocated, i.e., data are not moved into
> another blocks on the fly.

It also has to implement the ioctl that tells you what blocks a file is in 
(that kind of implies much of the above).  Except if the LILO installer 
makes special provisions as for Reiserfs, of course.
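
That ioctl is FIBMAP.  For reference, a minimal sketch of how an
installer asks where a file's first block lives (needs CAP_SYS_RAWIO,
and only works on filesystem types that implement it):

    /* Ask the filesystem for the physical block number behind logical
     * block 0 of a file -- the mapping a LILO-style installer records. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>       /* FIBMAP */

    int main(int argc, char **argv)
    {
        int fd = argc > 1 ? open(argv[1], O_RDONLY) : -1;
        int block = 0;          /* in: logical block; out: physical block */

        if (fd < 0 || ioctl(fd, FIBMAP, &block) < 0) {
            perror("FIBMAP");
            return 1;
        }
        printf("logical block 0 -> physical block %d\n", block);
        return 0;
    }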

To be really exact, it's OK for the blocks to move, as long as they don't 
move so subtly that the user doesn't know to rerun the LILO installer. 
E.g. you can move the blocks of the kernel file if someone overwrites it.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS4 mount problem

2005-04-18 Thread Bryan Henderson
>> Architecture-dependent blob passed to mount(2) (aka nfs4_mount_data).
>> If you want it to be a blob, at least have a decency to use encoding
>> that would not depend on alignment rules and word size.  Hell, you
>> could use XDR - it's not that nfs would need something new to handle
>> it.  Or, better yet, use a normal string.
>
>Mount doesn't appear to permit a big enough blob though. It has a hard limit
>of PAGE_SIZE.

That seems to me to be orthogonal to Al's point.  You could make an 
architecture-independent format for that page that still contains 
addresses in user space of additional information.  Which would presumably 
also have an architecture-independent format.

But why is mount() special here?  It's ancient tradition for Linux system 
calls to take as parameters, and return as results, in-memory structures 
that are dependent on local word size and endianness.  Lots of them do.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS4 mount problem

2005-04-18 Thread Bryan Henderson
>(1) The kernel is returning EFAULT to the 32-bit userspace; this implies that
> userspace is handing over a bad address. It isn't, the kernel is
> malfunctioning as it stands.
>...
>Either the kernel should return ENOSYS for any 32-bit mount on a 64-bit kernel
>or it must support it fully.

So this point is just the error code?  If so, where do you get ENOSYS?  A 
more usual errno for where a particular filesystem type can't be mounted 
is ENODEV.  Choosing errnos is a pretty whimsical thing anyway, since 
there are so many more kinds of errors than the authors of the errno space 
contemplated, but EFAULT and ENOSYS are two that have a pretty solid 
definition.  ENOSYS is for when an entire system call type is missing.

I'm not sure we can complain about EFAULT, though, because you really are 
supplying an invalid address.  You're doing it because you're using the 
wrong mount option format, so what you think of as 4 bytes of flags 
followed by 4 bytes of address is really 8 bytes of address.
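
To illustrate with invented structures (not the real nfs4_mount_data
layout):

    /* Hypothetical illustration of the mismatch.  A 32-bit mount helper
     * builds the first layout; a 64-bit kernel with no compat handling
     * reads the same bytes as the second, so the flags word and the
     * 32-bit pointer fuse into one bogus 64-bit user address -- hence
     * EFAULT. */
    #include <stdint.h>

    struct opts_as_built_by_32bit_userland {
        uint32_t flags;         /* 4 bytes of flags ...           */
        uint32_t addr;          /* ... then a 4-byte user address */
    };

    struct opts_as_read_by_64bit_kernel {
        uint64_t addr;          /* the same 8 bytes, taken as one pointer */
    };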

I do understand the more important issue of there being a kernel that 
understands both mount option formats; but since you enumerated the errno 
issue, I wanted to comment on that one independently.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS4 mount problem

2005-04-18 Thread Bryan Henderson
>My concern is that we are slowly but surely building up a bigger
>in-kernel library for parsing the binary structure than it would take to
>parse the naked mount option string.
>
>...
>If people really do need a fully documented NFS mount interface, then
>the only one that makes sense is a string interface. Looking back at the
>manpages, the string mount options are the only thing that have remained
>constant over the last 10 years.
>
>We're already up to version 6 of the binary interfaces for v2/v3, and if
>you count NFSv4 too, then that makes 7. 

I don't know the NFS mount option format, but I'm having a hard time 
imagining how a string-based format can take less code to parse and be 
more forward compatible than a binary one.  People don't even use the term 
"parse" for binary structures, because parsing typically means turning 
strings into binary structures.

Having 6 separate formats isn't the only way to have an evolving binary 
interface.  People do make extensible binary formats.
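
For example (purely hypothetical field names), one common way to make a
binary structure extensible is a version plus a self-describing length,
so one parser handles every revision:

    /* Hypothetical sketch of an extensible binary mount-data format:
     * the header says which revision the caller built and how big it
     * was, so older and newer structures go through the same code. */
    #include <stdint.h>
    #include <string.h>

    struct mount_opts {
        uint32_t version;       /* bumped when fields are appended     */
        uint32_t length;        /* sizeof(struct) as the caller saw it */
        uint32_t flags;
        uint32_t retrans;
        /* later revisions append fields here */
    };

    /* Copy only what the caller supplied; newer fields keep defaults. */
    static void parse_opts(const void *raw, uint32_t raw_len,
                           struct mount_opts *out)
    {
        memset(out, 0, sizeof(*out));
        memcpy(out, raw, raw_len < sizeof(*out) ? raw_len : sizeof(*out));
    }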

>There are only 2 reasons for doing
>that parsing in userland:
>
>  1) DNS lookups
>  2) Keeping the kernel parsing code small

I personally almost never worry about the number of bytes of code, but I 
worry a lot about its simplicity.  User space code is less costly to 
develop and less risky to make a mistake in.  I would add,
 
3) Keeping the kernel parsing code simple.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Bryan Henderson
>> routines will fail - since they assume that page->private represents
>> bufferheads. So we need a better way to do this.
>
>They are not generic then. Some file systems store things completely
>different from buffer head ring in page->private.

I've seen these instances (and worked around them because I maintain 
filesystem code that does in fact use private pages but not use the buffer 
cache to manage them).  I've always assumed they're just errors -- corners 
that were cut in the original project to abstract out the buffer cache. 
Anyone who has a problem with them should just fix them.

>I think that one reasonable way to add generic support for journalling
>is to split struct address_space into two objects: lower layer that
>represents "file" (say, struct vm_file), in which pages are linearly
>ordered, and on top of this vm_cache (representing transaction) that
>keeps track of pages from various vm_file's. vm_file is embedded into
>inode, and vm_cache has a pointer to (the analog of) struct
>address_space_operations.
>
>vm_cache's are created by file system back-end as necessary (can be
>embedded into inode for non-journalled file systems). VM scanner and
>balance_dirty_pages() call vm_cache operations to do write-out.

That looks entirely reasonable to me, but should be combined with 
divorcing address spaces from files.  An address space (or the "lower 
level" above) should be a simple virtual memory object, managed by the 
virtual memory manager.  It can be used for a file data cache, but also 
for anything else you want to participate in system memory management / 
page replacement.

We're already practically there.  Address spaces are tied to files only in 
these ways:

  1) The code is in the fs/ directory.  It needs to be in mm/.

  2) The "host" field is a struct inode *.  It needs to be void *.

  3) In a handful of places (and they keep moving), memory manager 
 code dereferences 'host' and looks in the inode.  I know these 
 are trivial connections, because I work around them by supplying
 a dummy inode (and sometimes a dummy superblock) with a few 
 fields filled in.

(Incidentally, _I_ am actually using address spaces for file caches; I 
just can't tie them to the files in the traditional way; the cache exists 
even when there are no inodes for the file).

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] User CLONE_NEWNS permission and rlimits

2005-04-20 Thread Bryan Henderson
>In essense, I was
>thinking of splitting up the concepts of 1) accessing the filesystem on
>the HDD/device and 2) setting up a namespace for accessing the files
>into two separate concepts

I've been crusading for years to get people to understand that a classic 
Unix mount is composed of these two parts, and they don't have to be 
married together.  (1) is called creating a filesystem image and (2) is 
called mounting a filesystem image.

(2) isn't actually "setting up" a namespace.  There's one namespace. 
Mounting is adding the names in a filesystem to that namespace, and 
thereby making the named filesystem objects accessible.

The two pieces have been slowly divorcing over the years.  We now have a 
little-used ability to have a filesystem image exist without being mounted 
at all (you get that by forcibly unmounting a filesystem image that has 
open files.  The unmount happens right away, but the filesystem image 
continues to exist until the last file is closed).  We also have the bind 
mounts that add to the namespace without creating a new filesystem image. 
I would like someday to see the ability to create  a filesystem image 
without ever mounting it, and access a file in it without ever adding it 
to the master file namespace.

>bringing up 2) completely in the userspace.

That part's another issue.  The user-controls-his-namespace aspect of it 
has been commented on at length in this and another current thread.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Bryan Henderson
>> But that shouldn't be the only option - because it would be horrible
>> to use.  If I login on multiple terminals, I normally want to mount
>> filesystems in /home/jamie/mnt on one terminal, and use them on another.
>
>And when you log in on several terminals you usually want same $PATH.
>You don't do that by sharing VM between shell processes, do you?

I share Al's view, and would expand:  You'd _like_ to be able to add 
something to your namespace once and have it show up in multiple process' 
namespaces, but you wouldn't expect it, because Unix has been horrible to 
use in that way forever.  I am frequently frustrated when I decide to 
change my environment either by setting an environment variable or shell 
variable or alias, and I have to do it separately in every existing shell. 
 And forget about the background jobs.  But at least it's consistent.  And 
there are other times when I exploit the fact that I can set something 
differently in different shells of the same user.

We do have a few areas where a group of processes can share the same 
kernel state, but it's always based on common ancestry.  It would take a 
major new concept to have a different kind of group of processes for 
namespace purposes, and then we probably wouldn't want to base it on uid, 
because uid means other things already.  Why tie them together?

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Bryan Henderson
>How about making namespace's as first class objects with some associated
>name or device in the device tree having owner/permissions etc.  any
>process which forks off a namespace shall create the
>device node for the namespace.  If some other process wants to use
>the same namespace, it can do so by attaching itself to the namespace
>dynamically? Offcourse children processes inherit the same namespace.

For the issues being discussed here, I don't think that's materially 
different from what we started with; it has the same issue concerning 
whether a user should be allowed to change his namespace and whether a 
process' namespace should change automatically when another process does 
something.

Here's one more proposal, kind of a compromise among various previous 
ones.

  - When you mount(), you say whether the names should be visible by 
default or not.  It takes system privilege to make them visible by 
default, but an ordinary user can mount a willing filesystem over a 
directory he's permitted to modify unconditionally, invisible by default

  - A process can explicitly request to see an invisible-by-default 
mounted filesystem.  Anyone can do this, but permissions on the root 
directory of the mount determine if he can actually see anything.

  - A process inherits the parent's namespace (i.e. sees the mounts the 
parent does).

This accomplishes:

  - not much of a philosophical break from where we are now.

  - users can mount their own stuff without system privilege.

  - no one, not even a fully permitted administrative process, sees user 
junk by default.

  - setuid programs see standard files where the system administrator put 
them.

  - setuid programs see user files where the user put them.

  - multiple processes, with or without the same uid, can see user-mounted 
files if they want.

  - a process can opt not to see user-mounted files, even if it has the 
same uid as processes that do.

I'm not saying how I would implement this; there's enough discussion over 
the desired result that I thought we should start there.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Bryan Henderson
>How would you request to make the mountpoint visible from _any_
>program.  It's not acceptable to expect every program to include a
>menu, command, etc. to be able to modify the visibility of
>mountpoints.

OK, I overlooked the problem of having to add commands to the shell and 
everything else.  While there's plenty of precedent for this style 
(current directory, ulimits, umask), I wouldn't like to extend it, even to 
adding a command to Bash.  But it could follow the 'nice' and 'renice' 
model.

>Would it not be better if you could specify the visibility policy when
>mounting?  Something simple like the user-group-other permission
>model would do nicely.  That would also have the advantage of being
>bound to the mountpoint, not the process.

I just don't think that gives you enough policy flexibility.  If processes 
can control visibility on a per-process basis independent from the mount 
action, they can use a much greater variety of policy, and do it in user 
space.

As for user-group-other, let me first point out that this whole namespace 
discussion started when a design based on actual file permission bits, but 
not a true implementation of Unix security (root didn't get carte blanche) 
was found unpalatable by some.  So as you say, it would be something 
_like_ the permission model, not a part of it.

We've been straining against the limitations of user/group/other for 
decades.  Sophisticated systems don't even use them for file permissions. 
So I hesitate to tie anything else to them.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Bryan Henderson
>That assumes that everyone has the same stuff in the same places.  I.e.
>that there is a universal tree with different subset hidden from different
>processes.  But that is obviously a wrong approach - e.g. it loses ability
>to bind different stuff on the same place in different namespaces.

Aren't you trying to boil another egg in my pot?  In Linux today, everyone 
(every process on the same Linux system, that is) has the same stuff in 
the same place.  I'm trying to propose an incremental improvement, and 
relaxing that restriction isn't part of it.

The only change would be that some processes wouldn't have some stuff in 
_any_ place.  (Either because they didn't ask to see a particular mount, 
or because they did and it covered up something else).

>IOW, notion that every directory has its "real" absolute pathname
>(and that's what your approach boils down to) won't match the reality
>anyway.

Not sure which reality you're talking about.  I don't think a directory 
has a real absolute pathname, because I think the person who mounts the 
filesystem that contains it chooses part of its absolute pathname for the 
lifetime of the mount.  But as between multiple processes on the same 
system at the same time, yeah, the directory has one name.

(statements above have to be modified for chroot, btw).

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Bryan Henderson
>Well I am not aware of issues that can arise if a user is allowed to
>change to some namespace for which it has permission to switch.

I think I misunderstood your proposal.

>A user 'ram' creates a namespace 'n1' with a device node /dev/n1 having
>permission 700 owned by the user 'ram'. The user than tailors his
>namespace with a bunch of mount/umount/binds etc to meet his
>requirement.

How does that address the setuid problem -- that a setuid program is 
installed with the expectation that when it runs, certain names will 
identify certain files (e.g. /etc/shadow)?  But also that certain other 
names will identify a file of the invoker's choosing?

>Trying to understand your proposal to see how it could be used to solve
>the problem faced by the FUSE project.  Are you trying to use a single
>namespace with invisible mounts capability? 

Essentially.  It's a compromise.  A user can customize his namespace, but 
only within limits that preserve the integrity of the system.

Technically, we have to admit it's not one namespace today or with 
invisible mounts.  Because of the way mounts cover up mountpoints, it's 
technically possible for two processes to see different files as the same 
name, if one opened the directory before a mount and the other after. 
"Mounting over" is a curse.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-21 Thread Bryan Henderson
>It would still not work for ftp-server style programs,

True.  Users might want the mounts to show up to an ftp or not, and this 
handles only "not."

>If used in conjuction with CLONE_NEWNS it would have all the needed
>flexibility.

I don't see how.  What if my policy is that processes with a certain 
process name (command) see the mount?  What if my policy is that users in 
a certain filesystem ACL can see it?  That's the kind of flexibility you 
can't get if the policy is set up via the mount() system call.

>But the non-sophisticated case is by far the most abundant.  And for
>that the traditional UNIX permission modell is not only good enough,
>it is in fact _better_ than any sophisticated access control mechanism
>because of it's _simplicity_.

Absolutely.  And that's why I speak of flexibility.  Let the simple users 
have their simple U-G-0 and the more creative ones do something more 
complex.

I'm not opposed, by the way, to an implementation that just does U-G-O (or 
even just U) if it's done in a way amenable to future extension.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: share/private/slave a subtree - define vs enum

2005-07-08 Thread Bryan Henderson
I wasn't aware anyone preferred defines to enums for declaring enumerated 
data types.  The practical advantages of enums are slight, but as far as I 
know, the practical advantages of defines are zero.  Isn't the only 
argument for defines, "that's what I'm used to."?

Two advantages of the enum declaration that haven't been mentioned yet, 
that help me significantly:

- if you have a typo in a define, it can be really hard to interpret the 
compiler error messages.  The same typo in an enum gets a pointed error 
message referring to the line that has the typo.

- Gcc warns you if a switch statement doesn't handle every case.  I often 
add an enumeration and Gcc lets me know where I forgot to consider it.

The macro language is one the most hated parts of the C language; it makes 
sense to try to avoid it as a general rule.
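
To illustrate the switch statement point above, with invented names:

    /* gcc -Wall flags a switch that doesn't handle a newly added
     * enumerator (as long as the switch has no default label). */
    enum pnode_state { PNODE_PRIVATE, PNODE_SHARED, PNODE_SLAVE };

    static const char *state_name(enum pnode_state s)
    {
        switch (s) {   /* warning: 'PNODE_SLAVE' not handled in switch */
        case PNODE_PRIVATE: return "private";
        case PNODE_SHARED:  return "shared";
        }
        return "?";
    }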

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: share/private/slave a subtree - define vs enum

2005-07-08 Thread Bryan Henderson
>If it's really enumerated data types, that's fine, but this example was 
>about bitfield masks.

Ah.  In that case, enum is a pretty tortured way to declare it, though it 
does have the practical advantages over define that have been mentioned 
because the syntax is more rigorous.

The proper way to do bitfield masks is usually C bit field declarations, 
but I understand that tradition works even more strongly against using 
those than against using enum to declare enumerated types.
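
For the record, the bit-field form would be something like this (names
invented to mirror the patch under discussion):

    /* The C bit-field way of declaring the same flags: the compiler
     * does the masking and shifting, at the price of an
     * implementation-defined layout, which is the usual objection to
     * using these in kernel interfaces. */
    struct pnode_flags {
        unsigned int member_vfs : 1;
        unsigned int slave_vfs  : 1;
    };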

>there is _nothing_ wrong with using defines for constants.

I disagree with that; I find practical and, more importantly, 
philosophical reasons not to use defines for constants.  I'm sure you've 
heard the arguments; I just didn't want to let that statement go 
uncontested.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: share/private/slave a subtree - define vs enum

2005-07-08 Thread Bryan Henderson
>I don't see how the following is tortured: 
>
>enum {
>   PNODE_MEMBER_VFS  = 0x01,
>   PNODE_SLAVE_VFS   = 0x02
>}; 

Only because it's using a facility that's supposed to be for enumerated 
types for something that isn't.  If it were a true enumerated type, the 
codes for the enumerations (0x01, 0x02) would be quite arbitrary, whereas 
here they must fundamentally be integers whose pure binary cipher has 
exactly one 1 bit (because, as I understand it, these are used as bitmasks 
somewhere).

I can see that this paradigm has practical advantages over using macros 
(or a middle ground - integer constants), but only as a byproduct of what 
the construct is really for.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What happens to pages that failed to be written to disk?

2005-07-28 Thread Bryan Henderson
>On Thu, 28 Jul 2005, Andrew Morton wrote:
>> Martin Jambor <[EMAIL PROTECTED]> wrote:
>> >
>> > Do filesystems try to relocate the data from bad blocks of the
>> > device?
>
>Only Windows NTFS, not others AFAIK (most filesytems can mark them during
>mkfs, that's all).
>
>> Nope.  Disks will do that internally.  If a disk gets a write I/O error
>> it's generally dead.
>
>That's what I thought also for over a decade (that they are basically dead
>soon) so originally I disabled NTFS resizing support for such disks (the
>tool is quite widely used since it's the only free, open source NTFS
>resizer).
>
>However over the last three years users convinced me that it's quite ok
>having a few bad sectors

There's a common misunderstanding in this area.  First of all, Andrew and 
Szakacsits are talking about different things:  Szakacsits is saying that 
you don't have to throw away your whole disk because of one media error (a 
spot on the disk that won't hold data).  Andrew is saying that if you get 
an error when writing, the disk is dead, and the reasoning goes that if it 
were just a media error, the write wouldn't have failed -- the disk would 
have relocated the sector somewhere else and succeeded.

Szakacsits is right.  Andrew is too, but for a different reason.

A normal disk doesn't give you a write error when a media error prevents 
writing the data.  The disk doesn't know that the data it wrote did not 
get actually stored.  It's not going to wait for the disk to come around 
again and try to read it back to verify.  And even if it did, a lot of 
media errors cause the data to disappear after a short while, so that 
wouldn't help much.  So if a write fails, it isn't because of a media 
error; i.e. can't be fixed by relocation.  The write fails because the 
whole drive is broken.  The disk won't turn, a wire is broken, etc.

(The drive relocates a bad sector when you write to it after a previously 
failed read.  I.e. after data has already been lost).

As Andrew pointed out, write errors are becoming much more common these 
days because of network storage.  The write fails because the disk isn't 
plugged in, the network switch isn't properly configured, the storage 
server isn't up and running yet, or any of a bunch of other fairly common 
problems.

What makes this really interesting in relation to the question about what 
to do with these failed writes is not just that they're so common, but 
that they're all easily repairable.  If you had a few megabytes stuck in 
your cache because the storage server isn't up yet, it would be nice if 
the system could just write them out a few seconds later when the problem 
is resolved.  Or if they're stuck because the drive isn't properly plugged 
in, it would be nice if you could tell an operator to either plug it in or 
explicitly delete the file.  But the memory management issue is a major 
stumbling block.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mount behavior question.

2005-07-28 Thread Bryan Henderson
I don't know enough about shared subtrees to have an opinion on what 
should happen with those, but you fundamentally asked about a perceived 
weirdness in existing Linux code, and I do have an opinion on that (which 
is that there's no weirdness).

>On analysis it turns out the culprit is the current rule which says
>'expose the most-recent-mount and not the topmost mount'

I don't think the current rule is "expose the most-recent-mount."  I see 
it as "expose the topmost mount."

I think the issue is what does "mount F over directory D" mean?

Does it mean to mount F immediately over D, in spite of anything that 
might be stacked above D right now?  Or does it mean to throw F onto the 
stack which is currently sitting over D?  Your analysis assumes it's the 
former, whereas what Linux does is consistent with the latter.

Neither of them actually makes sense.  mount over "." simply doesn't make 
sense.  Mount is a namespace operation.  "mount over D" says, "when 
someone looks up name D, ignore what's really in the directory and instead 
give him this other filesystem object."  "Mount over /mnt/cdrom" doesn't 
mean mount over the directory /mnt/cdrom.  It means mount under the name 
"cdrom" in the directory /mnt.  So "mount over '.'" means any future 
lookup of "." in that directory should hyperjump to the other mount. 
That's clearly not what anyone wants, so mount ought to recognize the 
special nature of the "." directory entry and not allow mounts over it.

If you did that, and made mount into the namespace operation it's meant to 
be, there would be no such thing as inserting a mount into the stack, 
since you have no way to refer to the covered directory -- it's no longer 
in the namespace.

I have no idea if that clarifies the shared subtree dilemma, but you ask 
if there's any pressing need for the current behavior, and I would have to 
say no, because a) neither behavior has any business existing; and b) I 
have a hard time imagining anyone depending on it.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mount behavior question.

2005-07-28 Thread Bryan Henderson
>> Does it mean to mount F immediately over D, in spite of anything that 
>> might be stacked above D right now?  Or does it mean to throw F onto the
>> stack which is currently sitting over D?  Your analysis assumes it's the
>> former, whereas what Linux does is consistent with the latter.
>
>In fact those two are indistinguishable.  What linux does is an
>internal implementation detail.

Then you must have misunderstood what I meant to say, because I didn't 
touch on Linux implementation at all; I'm talking only about what a user 
sees (distinguishes).  I say a user perceives a stack of mounts over a 
directory entry D.  A lookup sees the mount which is on top of the stack. 
One could conceivably 1) add a mount to the middle of that stack -- above 
D but below everything else, such that it isn't visible until everything 
above it gets removed, or 2) add the mount to the top of the stack so it's 
visible now.

>The semantics are simple: if you
>mount over a directory, that mount will be visible (no matter what was
>previously visible) on lookup of that directory.

So in my terms, Linux adds to the top of the stack, not to the middle. 
Note that saying this is stronger than what you say above, because it 
tells you not only that the mount is visible now, but when it will be 
visible in the future as people do mounts and unmounts.

>Well, mounting over '.' may not be perfect in the mathematical sense
>of namespace operations, but it does make some practical sense.  I bet
>you anything that some script/tool/person out there depends on it.

It wouldn't surprise me if someone is depending on mount over ".".  But 
I'd be surprised if someone is doing it to a directory that's already been 
mounted over (such that the stacking behavior is relevant).  That seems 
really eccentric.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mount behavior question.

2005-07-28 Thread Bryan Henderson
>Bryan, what would you expect the behavior to be when somebody mounts on
>a directory what is already mounted over? 

Well, I've tried to beg the question.  I said I don't think it's 
meaningful to mount over a directory; that one actually mounts at a name. 
And that Linux's peculiar "mount over '.'" (which is in fact mounting over 
a directory and not at a name) is weird enough that there is no natural 
expectation of it except that it should fail.

But if I had to try to merge "mount over '.'" into as consistent a model 
as possible with one of the two behaviors we've been discussing, I'd say 
that "." stands for the name by which you looked up that directory in the 
first place (so in this case, it's equivalent to mount ... /mnt).  And 
that means I would expect the new mount to obscure the already existing 
mount.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mount behavior question.

2005-07-28 Thread Bryan Henderson
>One problem with 1) [mounting into the middle of a mount stack]
>is that it breaks the assumption that an 'mount X;
>umount X' pair is a no-op.

A very good point.  Since unmounts are always from the top of the stack, 
for symmetry mounts should be there too.

Here's another tidbit of information I just verified:  umount of "." 
unmounts from the top of the stack, as opposed to unmounting the stuff you 
would see if you did "ls .".  So this is all consistent.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] atomic open(..., O_CREAT | ...)

2005-08-09 Thread Bryan Henderson
>Intents are meant as optimisations, not replacements for existing
>operations. I'm therefore not really comfortable about having them
>return errors at all.

That's true of normal intents, but not what are called intents here.  A 
normal intent merely expresses an intent, and it can be totally ignored 
without harm to correctness.  But these "intents" were designed to be 
responded to by actually performing the foreshadowed operation now - 
irreversibly.

Linux needs an atomic lookup/open/create in order to participate in a 
shared filesystem and provide a POSIX interface (where shared filesystem 
means a filesystem that is simultaneously accessed by something besides 
the Linux system in question).  Some operating systems do this simply with 
a VFS lookup/open/create function.  Linux does it with this intents 
interface.

It's hard to merge the concepts in code or in one's mind, which is why 
we're here now.  A filesystem driver that needs to do atomic 
lookup/open/create has to bend over backwards to split the operation 
across the three filesystem driver calls that Linux wants to make.

I've always preferred just to have a new inode operation for 
lookup/open/create (mirroring the POSIX open operation, used for all opens 
if available), but if enough arguments to lookup can do it, that's 
practically as good.  But that means returning final status from lookup, 
and not under any circumstance proceeding to create or open when the 
filesystem driver has said the entire operation is complete.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] atomic open(..., O_CREAT | ...)

2005-08-09 Thread Bryan Henderson
>Have you looked at how we're dealing with this in NFSv4?

No.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GFS, what's remaining

2005-09-02 Thread Bryan Henderson
I have to correct an error in perspective, or at least in the wording of 
it, in the following, because it affects how people see the big picture in 
trying to decide how the filesystem types in question fit into the world:

>Shared storage can be more efficient than network file
>systems like NFS because the storage access is often more efficient
>than network access

The shared storage access _is_ network access.  In most cases, it's a 
fibre channel/FCP network.  Nowadays, it's more and more common for it to 
be a TCP/IP network just like the one folks use for NFS (but carrying 
ISCSI instead of NFS).  It's also been done with a handful of other 
TCP/IP-based block storage protocols.

The reason the storage access is expected to be more efficient than the 
NFS access is because the block access network protocols are supposed to 
be more efficient than the file access network protocols.

In reality, I'm not sure there really is such a difference in efficiency 
between the protocols.  The demonstrated differences in efficiency, or at 
least in speed, are due to other things that are different between a given 
new shared block implementation and a given old shared file 
implementation.

But there's another advantage to shared block over shared file that hasn't 
been mentioned yet:  some people find it easier to manage a pool of blocks 
than a pool of filesystems.

>it is more reliable because it doesn't have a
>single point of failure in form of the NFS server.

This advantage isn't because it's shared (block) storage, but because it's 
a distributed filesystem.  There are shared storage filesystems (e.g. IBM 
SANFS, ADIC StorNext) that have a centralized metadata or locking server 
that makes them unreliable (or unscalable) in the same ways as an NFS 
server.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-15 Thread Bryan Henderson
>The idea behind the cloneset is that most of the blocks (or files)
>do not change in either source or target.  This being the case its only necessary
>to update the changed elements.  This means updates are incremental. Once
>the system has figured out what it needs to update its usable and if you access
>an element that should be updated you will see the correctly updated version - even
>though backgound resyncing is still in progress.

I still can't tell what you're describing.  With RAID1 as well, only 
changed elements ever get updated.  I have two identical filesystems, 
members of a RAIF set.  I change one file.  One file in each member 
filesystem gets updated, and I again have two identical filesystems.

How would a cloneset work differently, and how would it be better?

>This type of logic is great for backups.

Can you give an example of using it for backup?

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems



-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-18 Thread Bryan Henderson
>Although, it is not possible with the current code, it should be possible
>to do via failing the branches.  First, you fail the branch intended for
>backups and it becomes a backup copy.  Later you can "unfail" the same
>branch and fail the newer branch to start the on-line recovery.  If you
>enable atime updates on these lower file systems incremental (delta)
>updates should not be a problem.

So I guess you're saying that what you have now doesn't have the ability 
to recover from a temporary absence of a member by updating just the areas 
that changed while it was absent.  Given how complex the path to one of 
these member filesystems might be, and how big a filesystem can be, I 
would think that's pretty important for making RAIF practical.

Actually getting to the cloneset-like thing is a step further, though, 
because it doesn't have the instantaneous resync property -- if you fail a 
branch while it's being resynced, you can't then access that branch and 
expect to get current data.

But I didn't actually understand,  "Later you can 'unfail' the same branch 
and fail the newer branch to start the on-line recovery," so maybe you're 
talking about something different.  I would think that if you fail the 
only branch that has current data on it (the "newer branch"?) that 
recovery would be pretty much over.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-18 Thread Bryan Henderson
>A cloneset is only syncronized at the point in time that you tell it to
>resync.  The source and target fs are useable independently.  When you
>resync the target is reset to be indentical to the source at the point in
>time of the sync.  Its also immediatly useable - the sync and access to
>the source and target are coordinated so users of the target see the
>correct data, even if the sync is still running in background.
>
>This allows things likes:
>
>...

These applications sure seem like a better fit for ordinary snapshots.  It 
looks like with the cloneset, there's a whole superfluous copy of the 
filesystem, whereas with a snapshot, you have to have storage space and 
I/O time only for data that changes after the snapshot.

I'm sure I could dream up an application for this -- maybe you want that 
second copy as a backup or it gives you additional data transfer capacity. 
 I just don't see the panacea so far.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: stacked filesystem cache waste

2006-12-19 Thread Bryan Henderson
>Every stackable file system caches the
>data at its own level and copies it from/to the lower file system's
>cached pages when necessary.  ... this effectively
>reduces the system's cache memory size by two or more times.

It should not be that bad with a decent cache replacement policy; I wonder 
whether, in observing the problem (which you corrected in the various ways 
you've described), you got some insight into what exactly was happening.

In the classic case of multiple caches, where each cache has a fixed size 
(example: cache in the disk drive + cache in the operating system), the 
caches tend to contain different data.  The most frequently accessed data 
is in the near cache and the less frequently in the far cache (that's 
because frequent accesses to a piece of data are always near cache hits, 
so the far cache never sees them and considers that data once-only).

In the stacked filesystem case, it should be even better because it's all 
one pool of memory.  The far cache should shrink down to nothing, since 
anything that might have been a hit in that cache is a hit in the near 
cache first.

There are certainly simplistic cache replacement algorithms, and specific 
workloads that defeat that.  Straight LRU with lots of once-only accesses 
would tend to generate twice as much cache waste.  But the reduction in 
useful cache space would be less than half, because at least some of the 
pages are frequently accessed, so stored only once.

I lost track of the Linux cache replacement policy years ago, but it used 
to have a second-chance element that should measure frequency well enough 
to stop this cache duplication -- a page read from a file was on the 
inactive list until it got referenced again, so it could not stay in 
memory long when there was contention for memory.  I believe this would 
make the far cache pages always inactive, so essentially not consuming 
resource.
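
To make that concrete, here's a toy sketch of such a second-chance, 
two-list policy -- much simplified, and not the actual Linux reclaim code 
-- just to show why pages that are only ever touched once, like the 
duplicate copies in a far cache, would be the first to go:

  /* Toy second-chance policy (hypothetical, greatly simplified; not the
   * real Linux code).  A page starts on the inactive list; only a second
   * reference promotes it to the active list. */
  enum which_list { INACTIVE, ACTIVE };

  struct page {
      enum which_list list;
      int referenced;            /* referenced since we last checked */
  };

  void mark_page_accessed(struct page *p)
  {
      if (p->list == INACTIVE && p->referenced)
          p->list = ACTIVE;      /* second reference: promote */
      else
          p->referenced = 1;     /* first reference: just note it */
  }

  /* The reclaim scan clears 'referenced' as it passes, so a page must be
   * re-referenced between scans to survive. */
  int reclaimable(const struct page *p)
  {
      return p->list == INACTIVE && !p->referenced;
  }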

So I'd be interested to know by what mechanism stacked filesystems have 
drastically reduced cache efficiency in your experiments, and whether a 
simple policy change might solve the problem as well as the more complex 
approach of getting an individual filesystem driver more involved in 
memory management.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Finding hardlinks

2006-12-28 Thread Bryan Henderson
>Adding a vfs call to check for file equivalence seems like a good idea to
>me.

That would be only barely useful.  It would let 'diff' say, "those are 
both the same file," but wouldn't be useful for something trying to 
duplicate a filesystem (e.g. a backup program).  Such a program can't do 
the comparison between every possible pairing of file names.

I'd rather just see a unique file identifier that's as big as it needs to 
be.  And the more unique the better.  (There are lots of degrees of 
uniqueness; unique as long as the files exist; as long as the filesystems 
are mounted, etc.).
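
To make the contrast concrete: with an identifier, a tree walker only has 
to remember what it has already seen -- one lookup per name, rather than a 
comparison against every other name.  A minimal sketch (hypothetical, not 
from any real backup program), using st_dev/st_ino as the identifier since 
that's the pair under discussion:

  /* Hypothetical sketch: remember each (st_dev, st_ino) the first time we
   * see it, so additional names for the same file are recognized without
   * comparing every possible pair of names. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/stat.h>

  struct seen { dev_t dev; ino_t ino; struct seen *next; };
  static struct seen *seen_list;

  /* Returns 1 the first time a (dev, ino) pair is presented, 0 after. */
  static int first_sighting(dev_t dev, ino_t ino)
  {
      struct seen *s;
      for (s = seen_list; s; s = s->next)
          if (s->dev == dev && s->ino == ino)
              return 0;
      s = malloc(sizeof(*s));
      s->dev = dev; s->ino = ino; s->next = seen_list; seen_list = s;
      return 1;
  }

  int main(int argc, char **argv)
  {
      int i;
      for (i = 1; i < argc; i++) {
          struct stat st;
          if (lstat(argv[i], &st) != 0) { perror(argv[i]); continue; }
          if (st.st_nlink > 1 && !first_sighting(st.st_dev, st.st_ino))
              printf("%s: another name for a file already archived\n",
                     argv[i]);
          else
              printf("%s: archive contents\n", argv[i]);
      }
      return 0;
  }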

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Finding hardlinks

2006-12-28 Thread Bryan Henderson
>> Well, the NFS protocol allows that [see rfc1813, p. 21: "If two file
>> handles from the same server are equal, they must refer to the same
>> file, but if they are not equal, no conclusions can be drawn."]
>> 
>Interesting. That does seem to break the method of st_dev/st_ino for
>finding hardlinks. For Linux fileservers I think we generally do have 1:1
>correspondence so that's not generally an issue.
>
>If we're getting into changing specs, though, I think it would be better
>to change it to enforce a 1:1 filehandle to inode correspondence rather
>than making new NFS ops. That does mean you can't use the filehandle for
>carrying other info, but it seems like there ought to be better mechanisms
>for that.

The filehandle is very much the appropriate mechanism for that.  A handle 
is opaque.  The client has no business doing anything with it besides 
sending it back to the server.

Though you seem to want to avoid adding new NFS operations, what you're 
proposing is changing the nature of existing ones so that new operations 
would have to be added to get back what the existing ones do!

If it's important to know that two names refer to the same file in a 
remote filesystem, I don't see any way around adding a new concept of file 
identifier to the protocol.

BTW, a primary characteristic of an "identifier" is that it can be used to 
tell whether you've got the object you're looking for, but often can't be 
used to _find_ that object.  For the latter, you need an address.  There 
are lots of examples where you can't practically use the same value for 
both an identifier and an address.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Finding hardlinks

2006-12-28 Thread Bryan Henderson
>Statement 1:
>If two files have identical st_dev and st_ino, they MUST be hardlinks of
>each other/the same file.
>
>Statement 2:
>If two "files" are a hardlink of each other, they MUST be detectable
>(for example by having the same st_dev/st_ino)
>
>I personally consider statement 1 a mandatory requirement in terms of
>quality of implementation if not Posix compliance.
>
>Statement 2 for me is "nice but optional"

Statement 1 without Statement 2 provides one of those facilities where the 
computer tells you something is "maybe" or "almost certainly" true.  While 
it's useful in plenty of practical cases, in my experience, it leaves 
computer engineers uncomfortable.  Recently, there was a discussion on 
this list of a proposed case in which stat() results are "maybe correct, 
but maybe garbage" that covered some of that philosophy.

>it's an optimization for a program like tar to not have to
>back a file up twice,

I think it's a stronger need than just to make a tarball smaller.  When 
you restore the tarball in which 'foo' and 'bar' are different files, you 
get a fundamentally different tree of files than the one you started with 
in which 'foo' and 'bar' were two different names for the same file.  If, 
in the restored tree, you write to 'foo', you won't see the result in 
'bar'.  If you remove read permission from 'foo', the world can still see 
the information in 'bar'.  Plus, in some cases optimization is a matter of 
life or death -- the extra resources (storage space, cache space, access 
time, etc) for the duplicated files might be enough to move you from 
practical to impractical.
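
To see what "fundamentally different" means in practice, here's a tiny 
hypothetical demo (made-up file names, not from any real restore program): 
restored as one file with two names, a write through one name shows up 
through the other; restored as two copies, it doesn't.

  /* Hypothetical demo of the difference between one file with two names
   * and two independent copies. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      char buf[16];
      int fd;

      fd = open("foo", O_CREAT | O_WRONLY | O_TRUNC, 0644);
      write(fd, "old\n", 4);
      close(fd);

      link("foo", "bar");              /* one file, two names            */

      fd = open("foo", O_WRONLY | O_TRUNC);
      write(fd, "new\n", 4);           /* write through 'foo' ...        */
      close(fd);

      fd = open("bar", O_RDONLY);      /* ... and 'bar' sees it, because */
      read(fd, buf, sizeof buf);       /* it is the same file.  Had 'bar'*/
      close(fd);                       /* been restored as a copy, it    */
      printf("%.4s", buf);             /* would still say "old".         */
      return 0;
  }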

People tend to demand that restore programs faithfully restore what was 
backed up.  (I've even seen requirements that the inode numbers upon 
restore be the same).  Given the difficulty of dealing with multi-linked 
files, not to mention various nonstandard file attributes fancy filesystem 
types have, I suppose they probably don't have really high expectations of 
that nowadays, but it's still a worthy goal not to turn one file into two.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Finding hardlinks

2006-12-28 Thread Bryan Henderson
>The chance of an accidental
>collision is infinitesimally small.  For a set of 
>
> 100 files: 0.03%
>   1,000,000 files: 0.03%

Hey, if you're going to use a mathematical term, use it right.  :-) 
.03% isn't infinitesimal.  It's just insignificant.  And I think 
infinitesimally small must mean infinite.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Finding hardlinks

2006-12-29 Thread Bryan Henderson
>On Thu, 2006-12-28 at 16:44 -0800, Bryan Henderson wrote:
>> >Statement 1:
>> >If two files have identical st_dev and st_ino, they MUST be hardlinks
>> >of each other/the same file.
>> >
>> >Statement 2:
>> >If two "files" are a hardlink of each other, they MUST be detectable
>> >(for example by having the same st_dev/st_ino)
>> >
>> >I personally consider statement 1 a mandatory requirement in terms of
>> >quality of implementation if not Posix compliance.
>> >
>> >Statement 2 for me is "nice but optional"
>> 
>> Statement 1 without Statement 2 provides one of those facilities where
>> the computer tells you something is "maybe" or "almost certainly" true.
>
>No it's not a "almost certainly". It's a "these ARE".

There are various "these AREs" here, but the "almost certainly" I'm 
talking about is where Statement 1 is true and Statement 2 is false and 
the inode numbers you read through two links are different.  (For example, 
consider a filesystem in which the reported inode number is the internal 
inode number truncated to 32 bits).  The links are almost certainly to 
different files.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Finding hardlinks

2006-12-29 Thread Bryan Henderson
>On Fri, 2006-12-29 at 10:08 -0800, Bryan Henderson wrote:
>> >On Thu, 2006-12-28 at 16:44 -0800, Bryan Henderson wrote:
>> >> >Statement 1:
>> >> >If two files have identical st_dev and st_ino, they MUST be
>> >> >hardlinks of each other/the same file.
>> >> >
>> >> >Statement 2:
>> >> >If two "files" are a hardlink of each other, they MUST be
>> >> >detectable (for example by having the same st_dev/st_ino)
>> >> >
>> >> >I personally consider statement 1 a mandatory requirement in terms
>> >> >of quality of implementation if not Posix compliance.
>> >> >
>> >> >Statement 2 for me is "nice but optional"
>> >> 
>> >> Statement 1 without Statement 2 provides one of those facilities
>> >> where the 
>
>> There are various "these AREs" here, but the "almost certainly" I'm
>> talking about is where Statement 1 is true and Statement 2 is false and
>> the inode numbers you read through two links are different.  (For
>> example, consider a filesystem in which the reported inode number is the
>> internal inode number truncated to 32 bits).  The links are almost
>> certainly to different files.
>> 
>
>but then statement 1 is false and violated.

Whoops; wrong example.  It doesn't matter, though, since clearly there 
exist correct examples: where Statement 1 is true and Statement 2 is 
false, and in that case when the inode numbers are different, the links 
are "almost certainly" to different files.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Finding hardlinks

2007-01-03 Thread Bryan Henderson
>On any decent filesystem st_ino should uniquely identify an object and
>reliably provide hardlink information. The UNIX world has relied upon
>this for decades. A filesystem with st_ino collisions without being
>hardlinked (or the other way around) needs a fix.

But for at least the last of those decades, filesystems that could not do 
that were not uncommon.  They had to present 32 bit inode numbers and 
either allowed more than 4G files or just didn't have the means of 
assigning inode numbers with the proper uniqueness to files.  And the sky 
did not fall.  I don't have an explanation why, but it makes it look to me 
like there are worse things than not having total one-one correspondence 
between inode numbers and files.  Having a stat or mount fail because 
inodes are too big, having fewer than 4G files, and waiting for the 
filesystem to generate a suitable inode number might fall in that 
category.
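
Just to illustrate the bind such a filesystem is in (purely hypothetical 
code, not any particular filesystem): with an internal file identifier 
wider than 32 bits, st_ino has to be derived by folding it down somehow, 
and then two distinct files can report the same inode number.

  /* Hypothetical: fold a 64-bit internal file identifier into the 32 bits
   * available for st_ino.  Distinct files can collide. */
  #include <stdint.h>
  #include <stdio.h>

  static uint32_t fold_to_ino32(uint64_t internal_id)
  {
      /* Mix the high half into the low half rather than just truncating;
       * collisions become rare, but they don't go away. */
      return (uint32_t)(internal_id ^ (internal_id >> 32));
  }

  int main(void)
  {
      uint64_t a = 0x0000000100000042ULL;   /* one file...             */
      uint64_t b = 0x0000000200000041ULL;   /* ...a different file...  */

      /* ...but both fold to 67, so stat() would report the same st_ino. */
      printf("%u %u\n", (unsigned)fold_to_ino32(a),
                        (unsigned)fold_to_ino32(b));
      return 0;
  }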

I fully agree that much effort should be put into making inode numbers 
work the way POSIX demands, but I also know that that sometimes requires 
more than just writing some code.

--
Bryan Henderson   San Jose California
IBM Almaden Research Center   Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [nfsv4] RE: Finding hardlinks

2007-01-04 Thread Bryan Henderson
>>> "Clients MUST use filehandle comparisons only to improve
>>> performance, not for correct behavior. All clients need to
>>> be prepared for situations in which it cannot be determined
>>> whether two filehandles denote the same object and in such
>>> cases, avoid making invalid assumptions which might cause incorrect
>>> behavior."
>> Don't you consider data corruption due to cache inconsistency an
>> incorrect behavior?
>
>Exactly where do you see us violating the close-to-open cache
>consistency guarantees?

Let me add the information that Trond is implying:  His answer is yes, he 
doesn't consider data corruption due to cache inconsistency to be 
incorrect behavior.  And the reason is that, contrary to what one would 
expect, NFS allows that (for reasons of implementation practicality).  It 
says when you open a file via an NFS client and read it via that open 
instance, you can legally see data as old as the moment you opened it. 
Ergo, you can't use NFS in cases where that would cause unacceptable data 
corruption.

We normally think of this happening when a different client updates the 
file, in which case there's no practical way for the reading client to 
know his cache is stale.  When the updater and reader use the same client, 
we can do better, but if I'm not mistaken, the NFS protocol does not 
require us to do so.  And probably more relevant: the user wouldn't expect 
cache consistency.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Finding hardlinks

2007-01-09 Thread Bryan Henderson
>but you can get a large number of >1 linked
>files, when you copy full directories with "cp -rl".  Which I do a lot
>when developing. I've done that a few times with the Linux tree.

Can you shed some light on how you use this technique?  (I.e. what does it 
do for you?)

Many people are of the opinion that since the invention of symbolic links, 
multiple hard links to files have been more trouble than they're worth.  I 
purged the last of them from my personal system years ago.  This thread 
has been a good overview of the negative side of hardlinking; it would be 
good to see what the positives are.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Finding hardlinks

2007-01-10 Thread Bryan Henderson
>I did cp -rl his-tree my-tree (which completed
>quickly), edited the two files that needed to be patched, then did
>diff -urp his-tree my-tree, which also completed quickly, as diff knows
>that if two files have the same inode, they don't need to be opened.

>... download one tree from kernel.org, do a bunch of cp -lr for
>each arch you plan to play with, and then go and work on each of the
>trees separately.

Cool.  It's like a poor man's directory overlay (same basic concept as 
union mount, Make's VPATH, and Subversion branching).  And I guess this 
explains why the diff optimization is so important.
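
The check behind that optimization is cheap.  A sketch of the idea (not 
GNU diff's actual code): stat both names, and if device and inode match, 
there's nothing to read.

  /* Sketch of the same-file short cut (not GNU diff's actual code). */
  #include <stdio.h>
  #include <sys/stat.h>

  static int same_file(const char *a, const char *b)
  {
      struct stat sa, sb;

      if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
          return 0;               /* can't tell; fall back to reading */
      return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
  }

  int main(int argc, char **argv)
  {
      if (argc == 3 && same_file(argv[1], argv[2]))
          printf("%s and %s are the same file; no need to open either\n",
                 argv[1], argv[2]);
      return 0;
  }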

--
Bryan Henderson   San Jose California
IBM Almaden Research Center   Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Symbolic links vs hard links

2007-01-10 Thread Bryan Henderson
>Other people are of the opinion that the invention of the symbolic link
>was a huge mistake.

I guess I haven't heard that one.  What is the argument that we were 
better off without symbolic links?

--
Bryan Henderson   San Jose California
IBM Almaden Research Center   Filesystems

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Symbolic links vs hard links

2007-01-11 Thread Bryan Henderson
>On Wed, Jan 10, 2007 at 09:38:11AM -0800, Bryan Henderson wrote:
>> >Other people are of the opinion that the invention of the symbolic
>> >link was a huge mistake.
>> 
>> I guess I haven't heard that one.  What is the argument that we were 
>> better off without symbolic links?
>
>I suppose http://www.cs.bell-labs.com/sys/doc/lexnames.html is as good
>a presentation of that argument as any ...

Thanks.

For those who didn't read it, this refers to the problem of ".." being 
ambiguous when there are many paths to a directory.  I.e. it's about the 
ability of a symbolic link to link to a directory, not just a file (like a 
hard link).
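
A small demo of the ambiguity (made-up paths under /tmp): after following 
a symbolic link into a directory, ".." names the physical parent, not the 
directory the link lives in, which is the surprise the paper is about.

  /* Hypothetical demo: ".." after entering a directory via a symlink. */
  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void)
  {
      char buf[4096];

      mkdir("/tmp/real", 0755);
      mkdir("/tmp/real/sub", 0755);
      mkdir("/tmp/elsewhere", 0755);
      symlink("/tmp/real/sub", "/tmp/elsewhere/alias");

      chdir("/tmp/elsewhere/alias");   /* arrived via the symlink ...   */
      chdir("..");                     /* ... but ".." is /tmp/real,    */
      puts(getcwd(buf, sizeof buf));   /* not /tmp/elsewhere            */
      return 0;
  }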

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

