Re: [PATCH 1/2] Make cramfs little endian only

2007-12-06 Thread Linus Torvalds


On Thu, 6 Dec 2007, Jörn Engel wrote:
> 
> How about shifting and masking _before_ converting to host endianness?
> 
>   static inline u32 cramfs_offset(struct cramfs_inode *inode)
>   {
>   return le32_to_cpu(node->namelen_offset >> 
> CRAMFS_NAMELEN_WIDTH);
>   }

Stop this idiocy. The code I sent out was correct. You're just making it 
worse.

"le32_to_cpu()" is a no-op on little-endian machines. So your change won't 
make any difference there. But the point is, that if we want the disk 
layout to be the *same* for both big-endian and little-endian, we need to 
switch the word as it is loaded from memory, and not do *any* operations 
on it before we've done that equalization.

This is not something unusual. It's bog-standard procedure for a lot of 
filesystems. And sparse will help you (and would have complained about 
your code that tries to shift a little-endian entity before changing it 
into a CPU-endian one).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] Make cramfs little endian only

2007-12-06 Thread Linus Torvalds


On Thu, 6 Dec 2007, Andi Drebes wrote:
>
> This requires changing the on-disk-structure (even the current "little endian 
> only" one).

That makes no sense.

Your whole patch is about making it little-endian only.

If you do that, then big-endian is screwed. I'm ok with that, and support 
it.

But then you just make sure that the little-endian bits are the same, and 
now you're *done*.

> For little endian machines, the data arranged in the following way:
> 
> |o02.o01.n06.n05.n04.n03.n02.n01|o10.o09.o08.o07.o06.o05.o04.o03|
> |o18.o17.o16.o15.o14.o13.o12.o11|o26.o25.o24.o23.o22.o21.o20.o19| 

That's a singularly confused way or looking at it. The point is, the 
first byte in a little-endian machine is the lowest bits, so the *correct* 
way of looking at it is to see the above as one 32-bit word, and then it 
looks like this:

   bit-in-word:  31   .. 6  5 ..  0
   bit-in-bitfield  o25 ..  o0 n5 .. n0

and my code works *perfectly*. When you do

static inline u32 cramfs_offset(struct cramfs_inode *inode)
{
return le32_to_cpu(node->namelen_offset) >> 
CRAMFS_NAMELEN_WIDTH;
}

you are getting *exactly* the bits "o25..o0" that you want (ie the 
offset).

> So masking and then shifting *the whole* masked data doesn't solve the 
> problem.

Yes it does. Try it.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] Make cramfs little endian only

2007-12-05 Thread Linus Torvalds


On Wed, 5 Dec 2007, Andi Drebes wrote:
>
> +#ifdef __BIG_ENDIAN
> +
> +/* Converts a cramfs_inode's offset field
> +   from little endianto cpu endian */
> +static u32 cramfs_offset_to_cpu(struct cramfs_inode *inode)
> +{
> + u8 *inode_bytes = (u8 *)inode;
> + return ((inode_bytes[8] & 0xc0) >> 6) | (inode_bytes[9] << 2) |
> + (inode_bytes[10] << 10) | (inode_bytes[11] << 18);
> +}
...
> +#elif defined(__LITTLE_ENDIAN)
> +
> +/* Converts a cramfs_inode's offset field
> +   from little endian to cpu endian */
> +static u32 cramfs_offset_to_cpu(struct cramfs_inode *inode)
> +{
> + return inode->offset;
> +}

No, no, what I meant about not having any #ifdef __LITTLE_ENDIAN was to do 
the code do the same thing *regardless* of endianness. In other words, a 
simple:

struct cramfs_inode {
__le32 mode_uid;/* CRAMFS_MODE_WIDTH:CRAMFS_UID_WIDTH */
__le32 size_gid;/* CRAMFS_SIZE_WIDTH:CRAMFS_GID_WIDTH */
__le32 namelen_offset;  /* 
CRAMFS_NAMELEN_WIDTH:CRAMFS_OFFSET_WIDTH */
};

#define CRAMFS_UID_MASK ((1ul << CRAMFS_UID_WIDTH)-1)
#define CRAMFS_GID_MASK ((1ul << CRAMFS_GID_WIDTH)-1)
#define CRAMFS_NAMELEN_MASK ((1ul << CRAMFS_NAMELEN_WIDTH)-1)

static inline u32 cramfs_mode(struct cramfs_inode *inode)
{
return le32_to_cpu(node->mode_uid) >> CRAMFS_UID_WIDTH;
}

static inline u32 cramfs_uid(struct cramfs_inode *inode)
{
return le32_to_cpu(node->mode_uid) & CRAMFS_UID_MASK;
}

static inline u32 cramfs_size(struct cramfs_inode *inode)
{
return le32_to_cpu(node->size_gid) >> CRAMFS_GID_WIDTH;
}

static inline u32 cramfs_gid(struct cramfs_inode *inode)
{
return le32_to_cpu(node->size_gid) & CRAMFS_GID_MASK;
}

static inline u32 cramfs_offset(struct cramfs_inode *inode)
{
return le32_to_cpu(node->namelen_offset) >> 
CRAMFS_NAMELEN_WIDTH;
}

static inline u32 cramfs_namelen(struct cramfs_inode *inode)
{
return le32_to_cpu(node->namelen_offset) & CRAMFS_NAMELEN_MASK;
}

See? No #ifdef's required, no different code-paths, and the code generated 
will be pretty much optimal too.

(No, the above is not tested in any way, shape, or form, and no, I didn't 
double-check that I actually extracted the right bits, but you get the 
idea).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] Make cramfs little endian only

2007-12-04 Thread Linus Torvalds


On Tue, 4 Dec 2007, Andi Drebes wrote:
>
> Perhaps I'm missing somehting, but I think for cramfs, unfortunately,
> there has to be this statement. The bitfields in the cramfs_inode structure
> cause some problems.

I agree that bitfields can be painful, but they should likely be just 
rewritten to be accesses using actual masks and shifts. The thing is, 
bitfields aren't actually endianness safe *anyway*, in that a compiler may 
end up using a *different* bit order than the byte order.  So you cannot 
really use bitfields reliably on things like that (although Linux has a 
notion of a "__[BIG|LITTLE]_ENDIAN_BITFIELD", if you really want to).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] Make cramfs little endian only

2007-12-04 Thread Linus Torvalds


On Tue, 4 Dec 2007, Andi Drebes wrote:
>
> Sure. This saves some definitions (and lines of code)...
> Here's the new patch (tested on the same machines mentioned in the first 
> message).
> I tried to move as many lines as possible out of the endian dependent section.

This really is the totally wrong way to do this.

You should *not* convert inodes to CPU-endian mode at all. You should 
*keep* them in the native mode, and then just use "le32_to_cpu()" when 
actually using them.

Basically, if you ever have a

#ifdef __BIG_ENDIAN

or similar in the source code, you're simply doing something wrong. 

Btw, sparse can be a big help for things like this, by just marking the 
actual disk data structures as being the right type (ie "__le32" and 
friends), and then you can verify that any users will use "le32_to_cpu()" 
as required, because sparse will warn about bad endianness.

So don't copy things around and change them from one mode to the other. 
Keep all the data structures in native LE ordering, and only when you 
really need to use them for some real data do you need to do the 
conversion.

Now, for example, your code has two different "struct cramfs_inode" that 
you use: the unconverted and the converted one. That's confusing, but it's 
also error-prone. If you just have one type - the unconverted 
little-endian one - you have none of those issues.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] cramfs: Add mount option "swapendian"

2007-11-16 Thread Linus Torvalds


On Fri, 16 Nov 2007, Andi Drebes wrote:
> 
> I'll write a new patch for inclusion in the mainline kernel that makes 
> cramfs "little endian only". For people who really want to be able to 
> mount filesystems with both kinds of endianness there will be a seperate 
> patch (not intended to be merged into mainline) available somewhere on 
> the net (my website or so). It will definitely be marked as non-official 
> in order to prevent people from creating images in big endian. 
> Officially, cramfs should only support little endian images.

Good. We do actually have precedence for exactly this kind of situation 
before: it's what happened to ext2 as well (people were trying to push a 
switch-endian version, and I said no) and the m68k people had their own 
patches for a while but we're *so* much better off from being fixed-endian 
that it's not even funny.

> If squashfs will be merged, cramfs should be marked as obsolete.

Sounds like that, yes.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] cramfs: Add mount option "swapendian"

2007-11-15 Thread Linus Torvalds


On Thu, 15 Nov 2007, Andi Drebes wrote:
>
> Actually, in my eyes, it would be better to create a new version of cramfs
> that standardizes the endianness and the block size.

Perhaps even more importantly, you can do a much better job.

The thing is, when I did cramfs originally, I had a total "senior moment", 
and the reason I didn't compress the metadata was that it appeared hard to 
do and fill it in later (ie compressing the metadata would mean that the 
location of the data changes - and since the metadata obviously contains 
pointers to the data, there's a chicken-and-egg problem there).

But it should be *trivial* to compress the metadata too if the code just 
were to put the metadata at the beginning of the image, and the real data 
at the end, and then you can build up the image from both ends and you 
*can* have a fixed starting point for the data (up at the top of the 
image) even though you are changing the size of the metadata by 
compression.

But I literally designed and wrote the thing in a couple of days, and I 
really didn't think it through right. As a result, the metadata may be 
dense, but it's totally uncompressed. It would have been better to allow a 
less dense basic format (allowing bigger uid/gid values, and offsets and 
file sizes), but compress it.

So a "v2" cramfs would be a great idea. Not just for fixing endianness 
etc.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] cramfs: Add mount option "swapendian"

2007-11-15 Thread Linus Torvalds


On Thu, 15 Nov 2007, Christoph Hellwig wrote:
> 
> Actually there are as lots of initrd are cramfs.  This means you'd need
> to update mkcramfs all big endian machines out there.

So? Normally the initrd goes with the kernel anyway.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] cramfs: Add mount option "swapendian"

2007-11-15 Thread Linus Torvalds


On Thu, 15 Nov 2007, Linus Torvalds wrote:
> 
> It would be *much* better to just standardize on one endianness, and be 
> done with it. That way there are no config options, no confusion, and the 
> code is smaller, simpler, and faster.

.. it's also statically checkable with tools like "sparse", so it avoids 
bugs not only by being simpler, but by simply being fundamentally more 
robust to start with.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] cramfs: Add mount option "swapendian"

2007-11-15 Thread Linus Torvalds


On Thu, 15 Nov 2007, Andi Drebes wrote:
>
> This patch introduces the mount option "swapendian" for cramfs.
> When this option is set, cramfs is able to mount an image that
> was created on a machine whose endianness differs from the mounting
> machine's one.

Please don't do it this way.

It would be *much* better to just standardize on one endianness, and be 
done with it. That way there are no config options, no confusion, and the 
code is smaller, simpler, and faster. Because nn unconditional byte swap 
is generally faster than a conditional non-byte-swap!

So can you please just make it little-endian? 

There can't be that many big-endian machines that really care about old 
cramfs images..

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 07/30] IGET: Stop BEFS from using iget() and read_inode()

2007-10-01 Thread Linus Torvalds


On Mon, 1 Oct 2007, Christoph Hellwig wrote:
> 
> befs_lookup, which the above gem is from, returns a dentry *.

Ahh, ok. Then it actually makes sense. Although I'd prefer it if people 
planned on writing code like that more along the lines of

error = PTR_ERR(inode);
if (IS_ERR(inode)
return ERR_PTR(error);

because let's face it, the compiler can turn this into nice code, and 
humans can read it much better even if it's got an "unnecessary" extra 
variable etc..

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 07/30] IGET: Stop BEFS from using iget() and read_inode()

2007-10-01 Thread Linus Torvalds


On Mon, 1 Oct 2007, Zach Brown wrote:
>
> If you're soliciting opinions, I think I tend to prefer the feel of the
> code paths after the changes.  I don't know the benefits of the change
> are worth the risk in unmaintained file systems, though.
> 
> > +   return ERR_PTR(PTR_ERR(inode));
> 
> This caught my eye.  Surely we can do better :).  It seems to happen a
> few times in the patches, the instance in this patch was the first that
> I noticed.

Yeah. The above obviously *should* be the same as

return inode;

apart from a few casts. But if the casts matter, there's something else 
wrong.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Wed, 19 Sep 2007, Rene Herman wrote:
> 
> I do feel larger blocksizes continue to make sense in general though. Packet
> writing on CD/DVD is a problem already today since the hardware needs 32K or
> 64K blocks and I'd expect to see more of these and similiar situations when
> flash gets (even) more popular which it sort of inevitably is going to be.

.. that's what scatter-gather exists for.

What's so hard with just realizing that physical memory isn't contiguous?

It's why we have MMU's. It's why we have scatter-gather. 

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Wed, 19 Sep 2007, Rene Herman wrote:
> 
> Well, not so sure about that. What if one of your expected uses for example is
> video data storage -- lots of data, especially for multiple streams, and needs
> still relatively fast machinery. Why would you care for the overhead af
> _small_ blocks?

.. so work with an extent-based filesystem instead.

16k blocks are total idiocy. If this wasn't about a "support legacy 
customers", I think the whole patch-series has been a total waste of time.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Wed, 19 Sep 2007, Nathan Scott wrote:
> 
> FWIW (and I hate to let reality get in the way of a good conspiracy) -
> all SGI systems have always defaulted to using 4K blocksize filesystems;

Yes. And I've been told that:

> there's very few customers who would use larger

.. who apparently would like to  move to x86-64. That was what people 
implied at the kernel summit.

>especially as the Linux
> kernel limitations in this area are well known.  There's no "16K mess"
> that SGI is trying to clean up here (and SGI have offered both IA64 and
> x86_64 systems for some time now, so not sure how you came up with that
> whacko theory).

Well, if that is the case, then I vote that we drop the whole patch-series 
entirely. It clearly has no reason for existing at all.

There is *no* valid reason for 16kB blocksizes unless you have legacy 
issues. The performance issues have nothing to do with the block-size, and 
should be solvable by just making sure that your stupid "state of the art" 
crap SCSI controller gets contiguous physical memory, which is best done 
in the read-ahead code.

So get your stories straight, people.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Tue, 18 Sep 2007, Andrea Arcangeli wrote:
> 
> Many? I can't recall anything besides PF_MEMALLOC and the decision
> that the VM is oom.

*All* of the buddy bitmaps, *all* of the GPF_ATOMIC, *all* of the zone 
watermarks, everything that we depend on every single day, is in the end 
just about statistically workable.

We do 1- and 2-order allocations all the time, and we "know" they work. 
Yet Nick (and this whole *idiotic* thread) has all been about how they 
cannot work.

> In general every time reliability has a low priority than performance
> I've an hard time to enjoy it.

This is not about performance. Never has been. It's about SGI wanting a 
way out of their current 16kB mess.

The way to fix performance is to move to x86-64, and use 4kB pages and be 
happy. However, the SGI people want a 16kB (and possibly bigger) 
crap-option for their people who are (often _already_) running some 
special case situation that nobody else cares about.

It's not about "performance". If it was, they would never have used ia64 
in the first place.  It's about special-case users that do odd things.

Nobody sane would *ever* argue for 16kB+ blocksizes in general. 

Linus

PS. Yes, I realize that there's a lot of insane people out there. However, 
we generally don't do kernel design decisions based on them. But we can 
pat the insane users on the head and say "we won't guarantee it works, but 
if you eat your prozac, and don't bother us, go do your stupid things".
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Linus Torvalds


On Tue, 18 Sep 2007, Nick Piggin wrote:
> 
> ROFL! Yeah of course, how could I have forgotten about our trusty OOM killer
> as the solution to the fragmentation problem? It would only have been funnier
> if you had said to reboot every so often when memory gets fragmented :)

Can we please stop this *idiotic* thread.

Nick, you and some others seem to be arguing based on a totally flawed 
base, namely:
 - we can guarantee anything at all in the VM
 - we even care about the 16kB blocksize
 - second-class citizenry is "bad"

The fact is, *none* of those things are true. The VM doesn't guarantee 
anything, and is already very much about statistics in many places. You 
seem to be arguing as if Christoph was introducing something new and 
unacceptable, when it's largely just more of the same.

And the fact is, nobody but SGI customers would ever want the 16kB 
blocksize. IOW - NONE OF THIS MATTERS!

Can you guys stop this inane thread already, or at least take it private 
between you guys, instead of forcing everybody else to listen in on your 
flamefest.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Linus Torvalds


On Sun, 16 Sep 2007, Jörn Engel wrote:
> 
> My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc.
> which are pinned for their entire lifetime and another for regular
> files/inodes.  One could take a three-way approach and have
> always-pinned, often-pinned and rarely-pinned.
> 
> We won't get never-pinned that way.

That sounds pretty good. The problem, of course, is that most of the time, 
the actual dentry allocation itself is done before you really know which 
case the dentry will be in, and the natural place for actually giving the 
dentry lifetime hint is *not* at "d_alloc()", but when we "instantiate" 
it with d_add() or d_instantiate().

But it turns out that most of the filesystems we care about already use a 
special case of "d_add()" that *already* replaces the dentry with another 
one in some cases: "d_splice_alias()".

So I bet that if we just taught "d_splice_alias()" to look at the inode, 
and based on the inode just re-allocate the dentry to some other slab 
cache, we'd already handle a lot of the cases!

And yes, you'd end up with the reallocation overhead quite often, but at 
least it would now happen only when filling in a dentry, not in the 
(*much* more critical) cached lookup path.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Linus Torvalds


On Sun, 16 Sep 2007, Jörn Engel wrote:
> 
> I have been toying with the idea of having seperate caches for pinned
> and movable dentries.  Downside of such a patch would be the number of
> memcpy() operations when moving dentries from one cache to the other.

Totally inappropriate.

I bet 99% of all "dentry_lookup()" calls involve turning the last dentry 
from having a count of zero ("movable") to having a count of 1 ("pinned").

So such an approach would fundamentally be broken. It would slow down all 
normal dentry lookups, since the *common* case for leaf dentries is that 
they have a zero count.

So it's much better to do it on a "directory/file" basis, on the 
assumption that files are *mostly* movable (or just freeable). The fact 
that they aren't always (ie while kept open etc), is likely statistically 
not all that important.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] block_device_operations prototype changes

2007-08-27 Thread Linus Torvalds


On Mon, 27 Aug 2007, Al Viro wrote:
> 
> I have the beginning of that series done and the rest mapped out in enough
> details to implement it over this week.  If somebody has objections,
> questions or comments - yell.

>From your description, I have no objections - everything sounds good. My 
only concern is how painful the patch ends up being (and a worry about 
whether this will affect a metric truck-load of external modules? That 
said, I can't really see us worrying about those)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adding a security parameter to VFS functions

2007-08-16 Thread Linus Torvalds


On Wed, 15 Aug 2007, David Howells wrote:
> 
> Would you object greatly to functions like vfs_mkdir() gaining a security
> parameter?  What I'm thinking of is this:
> 
>   int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode,
> struct security *security)

I personally consider this an affront to everythign that is decent.

Why the *hell* would mkdir() be so magical as to need something like that?

Make it something sane, like a "struct nameidata" instead, and make it at 
least try to look like the path creation that is done by "open()".  Or 
create a "struct file *" or something.

I can imagine having "mkdir()" being passed similar data as "open()" (ie 
"lookup()"), but I cannot _possibly_ imagine it ever being valid to pass 
in something totally made-up to just mkdir(), and nothing else. There's 
something fundamentally wrong there.

What makes mkdir() so magical?

Also, what about all the other ops? Why is mkdir() special, but not 
"mknod()"? Why is "mkdir()" special, but not "rmdir()"? Really, none of 
this seems to make any sense unless you describe what is so magical about 
mkdir().

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Interface for the new fallocate() system call

2007-03-29 Thread Linus Torvalds


On Thu, 29 Mar 2007, Jan Engelhardt wrote:
>
> I have to disagree, since wrapping it into a struct and copying the struct
> in kernelspace from userspace requires more code.

Not just more code, but more security issues too.

Passing system call arguments by value means that there are no subtle 
security issues - the value you use is the value you got. But once you 
pass-by-reference, you have to make damn sure that you do the proper user 
space accesses and verify the pointer correctly.

User-space (aka "user-supplied") pointers are just more dangerous. We 
obviously can't avoid them, but they need much more care than just a 
random value directly passed in a register.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix d_path for lazy unmounts

2007-02-14 Thread Linus Torvalds


On Wed, 14 Feb 2007, Andreas Gruenbacher wrote:
> 
> Mountpoints are reported relative to the chroot if they are reachable from 
> the 
> chroot, and relative to the namespace they are defined in otherwise. This is 
> big nonsense, but it's unclear to me how to best fix it:

Well, it's also what a traditional "pwd" implementation would do, so it's 
not "nonsense" in that sense.

>   - don't report unreachable mount points,
>   - somehow indicate which mountpoints are reachable and which are not,
> like by prepending a question flag?

We could prepend another '/' (so that you'd have a path that starts with 
"//"). That's still a legal path, but it's also somethign that even POSIX 
says is valid to mean something else (eg "//ftp/.." or "//socket/.." to 
escape into another namespace).

But the fact is, some things just want a path. And it's generally *better* 
to get them a 

 - path that looks ok and starts from '/' than it is to give them 
   something that looks strange and doesn't start from root (because the 
   latter gives many many more possible attack vectors: if somebody 
   actually looks up the path, a bad user can much more easily fake a 
   relative path than fake an absolute one).

 - the path we've historically always given them.

> What's the point in reporting the rootfs at all -- it's never reachable to an 
> ordinary process?

All the paths are generally useful for USER INFORMATION. That's the 
primary use of paths for anything but "getcwd()". And the primary use for 
"getcwd()" is to do the same thing that any traditional cwd implementation 
has done, except faster (and _possibly_ better, but compatibility is more 
important than extensions - so the "better" is mainly an issue about 
non-readable or non-executable path component that we can show, and 
about being able to tell _how_ you got to a point that has multiple ways 
of getting there).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pipefs unique inode numbers

2007-01-30 Thread Linus Torvalds


On Tue, 30 Jan 2007, Jeff Layton wrote:
> 
> Also, that patch would break many 32-bit programs not compiled with large
> offsets when run in compatibility mode on a 64-bit kernel. If they were to
> do a stat on this inode, it would likely generate an EOVERFLOW error since
> the pointer address would probably not fit in a 32 bit field.
> 
> That problem was the whole impetus for this set of patches.

Well, we have that problem with the slowly incrementing "last_ino" too.

Should we make "last_ino" be "static unsigned int" instead of "long"?

Does anybody actually even use the old stat() with 32-bit interfaces? We 
warn for it, and we've done so for a long time.. I dont' remember people 
even complaining about the warning, so..

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pipefs unique inode numbers

2007-01-30 Thread Linus Torvalds


On Tue, 30 Jan 2007, Bodo Eggert wrote:
>
> change pipefs to use a unique inode number equal to the memory
> address unless it would be truncated.

I *really* wouldn't want to expose kernel addresses to user space, it just 
ends up being a piece of data that they shouldn't have. If we have some 
security issue, this is just too much kernel information that a bad user 
could get at.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Linus Torvalds


On Sat, 20 Aug 2005, Al Viro wrote:
> 
> That looks OK except for
>   * ncpfs fix is actually missing here

Well, the thing is, with the change to page_follow_link() and 
page_put_link(), ncpfs should now work fine - it doesn't need any fixing 
any more.

It was makign an assumption that used to be incorrect, but that assumption
became ok with the calling convention change - now the page_follow_link()
and page_put_link() functions work fine even if the page cache might get 
invalidated in between the calls.

So with your patch, things should be fine (and external filesystems will 
generate warnings on compiles, but should work fine, since the change 
maintains the ABI even if the API changed - which is pretty unusual).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Linus Torvalds


On Fri, 19 Aug 2005, Anton Altaparmakov wrote:
> 
> Yes, sure.  I have applied your patch to our 2.6.11.4 tree (with the one 
> liner change I emailed you just now) and have kicked off a compile.

Actually, hold on. The original patch had another problem: it returned an
uninitialized "page" pointer when page_getlink() failed.

This one should have that fixed, and has converted a few other 
filesystems. Most of them trivially, but I took the opportunity to just 
simplify NFS while I was at it, since it now has no reason to need to save 
off the "struct page *" any more.

It's still not tested, but at least I've looked at it a bit more ;)

Linus

---
diff --git a/fs/autofs/symlink.c b/fs/autofs/symlink.c
--- a/fs/autofs/symlink.c
+++ b/fs/autofs/symlink.c
@@ -12,11 +12,12 @@
 
 #include "autofs_i.h"
 
-static int autofs_follow_link(struct dentry *dentry, struct nameidata *nd)
+/* Nothing to release.. */
+static void *autofs_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
char *s=((struct autofs_symlink *)dentry->d_inode->u.generic_ip)->data;
nd_set_link(nd, s);
-   return 0;
+   return NULL;
 }
 
 struct inode_operations autofs_symlink_inode_operations = {
diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -83,8 +83,8 @@ extern int cifs_dir_notify(struct file *
 extern struct dentry_operations cifs_dentry_ops;
 
 /* Functions related to symlinks */
-extern int cifs_follow_link(struct dentry *direntry, struct nameidata *nd);
-extern void cifs_put_link(struct dentry *direntry, struct nameidata *nd);
+extern void *cifs_follow_link(struct dentry *direntry, struct nameidata *nd);
+extern void cifs_put_link(struct dentry *direntry, struct nameidata *nd, void 
*);
 extern int cifs_readlink(struct dentry *direntry, char __user *buffer, 
 int buflen);
 extern int cifs_symlink(struct inode *inode, struct dentry *direntry,
diff --git a/fs/cifs/link.c b/fs/cifs/link.c
--- a/fs/cifs/link.c
+++ b/fs/cifs/link.c
@@ -92,7 +92,7 @@ cifs_hl_exit:
return rc;
 }
 
-int
+void *
 cifs_follow_link(struct dentry *direntry, struct nameidata *nd)
 {
struct inode *inode = direntry->d_inode;
@@ -148,7 +148,7 @@ out:
 out_no_free:
FreeXid(xid);
nd_set_link(nd, target_path);
-   return 0;
+   return NULL;/* No cookie */
 }
 
 int
@@ -330,7 +330,7 @@ cifs_readlink(struct dentry *direntry, c
return rc;
 }
 
-void cifs_put_link(struct dentry *direntry, struct nameidata *nd)
+void cifs_put_link(struct dentry *direntry, struct nameidata *nd, void *cookie)
 {
char *p = nd_get_link(nd);
if (!IS_ERR(p))
diff --git a/fs/ext2/symlink.c b/fs/ext2/symlink.c
--- a/fs/ext2/symlink.c
+++ b/fs/ext2/symlink.c
@@ -21,11 +21,11 @@
 #include "xattr.h"
 #include 
 
-static int ext2_follow_link(struct dentry *dentry, struct nameidata *nd)
+static void *ext2_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
struct ext2_inode_info *ei = EXT2_I(dentry->d_inode);
nd_set_link(nd, (char *)ei->i_data);
-   return 0;
+   return NULL;
 }
 
 struct inode_operations ext2_symlink_inode_operations = {
diff --git a/fs/ext3/symlink.c b/fs/ext3/symlink.c
--- a/fs/ext3/symlink.c
+++ b/fs/ext3/symlink.c
@@ -23,11 +23,11 @@
 #include 
 #include "xattr.h"
 
-static int ext3_follow_link(struct dentry *dentry, struct nameidata *nd)
+static void * ext3_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
struct ext3_inode_info *ei = EXT3_I(dentry->d_inode);
nd_set_link(nd, (char*)ei->i_data);
-   return 0;
+   return NULL;
 }
 
 struct inode_operations ext3_symlink_inode_operations = {
diff --git a/fs/namei.c b/fs/namei.c
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -501,6 +501,7 @@ struct path {
 static inline int __do_follow_link(struct path *path, struct nameidata *nd)
 {
int error;
+   void *cookie;
struct dentry *dentry = path->dentry;
 
touch_atime(path->mnt, dentry);
@@ -508,13 +509,15 @@ static inline int __do_follow_link(struc
 
if (path->mnt == nd->mnt)
mntget(path->mnt);
-   error = dentry->d_inode->i_op->follow_link(dentry, nd);
-   if (!error) {
+   cookie = dentry->d_inode->i_op->follow_link(dentry, nd);
+   error = PTR_ERR(cookie);
+   if (!IS_ERR(cookie)) {
char *s = nd_get_link(nd);
+   error = 0;
if (s)
error = __vfs_follow_link(nd, s);
if (dentry->d_inode->i_op->put_link)
-   dentry->d_inode->i_op->put_link(dentry, nd);
+   dentry->d_inode->i_op->put_link(dentry, nd, cookie);
}
dput(dentry);
mntput(path->mnt);
@@ -2344,15 +2347,17 @@ out:
 int generic_readlink(struct dentry *dentry, char __user *buffer, int buflen)
 {
struct nameidata nd;
-   int res;
+   void *cookie;
+
nd.dept

Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Linus Torvalds


On Fri, 19 Aug 2005, Anton Altaparmakov wrote:
> 
> You are throwing away the error return from vfs_readlink().  I suspect you 
> wanted:
> 
> + cookie = ERR_PTR(res);

Yes, thanks.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Linus Torvalds


On Fri, 19 Aug 2005, Anton Altaparmakov wrote:
> 
> It does disable link caching.  But I didn't make this up.  This is exactly 
> what smbfs uses.  I just copied smbfs given ncpfs copies almost everything 
> smbfs does anyway...

Can you test whether the untested test-patch I sent out seems to work for 
your case that bugs out? You'll probably get a few new "initialization 
from incompatible pointer type" warnings, but I do believe that they 
should be harmless at least on normal architectures.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Linus Torvalds


On Fri, 19 Aug 2005, Al Viro wrote:
> 
> FWIW, I'd rather take page_symlink(), page_symlink_inode_operations,
> page_put_link(), page_follow_link_light(), page_readlink(), page_getlink(),
> generic_readlink() and vfs_readlink() to the same place where these guys
> would live.  They all belong together and none of them has any business
> in fs/namei.c.  Options: fs/libfs.c or separate library since fs/libfs.c
> is getting crowded.  Linus, do you have any objections to that or suggestions
> on filename here?

I'm not sure if this merits a new file or new organization (hey,
fs/lib/xxx might be good in theory). In particular, I had actually been
hoping to release 2.6.13 today, but this seems like a valid thing to hold 
things up for - but not if we're going to re-organize things.

The one thing that strikes me is that we might actually have less pain if
we just changed the calling convention for follow_link/put_link slightly
instead of creating a new library function. The existing "page cache"  
thing really _does_ work very well, and would work fine for NFS and ncpfs
too, if we just allowed an extra cookie to be passed along from
"follow_link()" to "put_link()".

A patch like this (totally untested, and you'd need to update any
filesystems that don't use the regular page_follow_link interface) would
seem to clean up the mess nicely.. The basic change is that follow_link() 
returns a error-pointer cookie instead of just zero or error, and that is 
passed into put_link().

That simple calling convention change solves all problems. It so _happens_ 
that any old binary code also continues to work (the cookie will be zero, 
and put_link will ignore it), so it shouldn't even break any unconverted 
stuff (unless they mix using their own functions _and_ using the helpher 
functions, which is of course possible).

The "shouldn't break nonconverted filesystems" makes me think this is a 
safe change. Comments?

NOTE NOTE NOTE! Let me say again that it's untested. It might not break 
nonconverted filesystems, but it equally well migth break even the 
converted ones ;)

Linus


diff --git a/fs/namei.c b/fs/namei.c
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -501,6 +501,7 @@ struct path {
 static inline int __do_follow_link(struct path *path, struct nameidata *nd)
 {
int error;
+   void *cookie;
struct dentry *dentry = path->dentry;
 
touch_atime(path->mnt, dentry);
@@ -508,13 +509,15 @@ static inline int __do_follow_link(struc
 
if (path->mnt == nd->mnt)
mntget(path->mnt);
-   error = dentry->d_inode->i_op->follow_link(dentry, nd);
-   if (!error) {
+   cookie = dentry->d_inode->i_op->follow_link(dentry, nd);
+   error = PTR_ERR(cookie);
+   if (!IS_ERR(cookie)) {
char *s = nd_get_link(nd);
+   error = 0;
if (s)
error = __vfs_follow_link(nd, s);
if (dentry->d_inode->i_op->put_link)
-   dentry->d_inode->i_op->put_link(dentry, nd);
+   dentry->d_inode->i_op->put_link(dentry, nd, cookie);
}
dput(dentry);
mntput(path->mnt);
@@ -2345,14 +2348,17 @@ int generic_readlink(struct dentry *dent
 {
struct nameidata nd;
int res;
+   void *cookie;
+
nd.depth = 0;
-   res = dentry->d_inode->i_op->follow_link(dentry, &nd);
-   if (!res) {
+   cookie = dentry->d_inode->i_op->follow_link(dentry, &nd);
+   if (!IS_ERR(cookie)) {
res = vfs_readlink(dentry, buffer, buflen, nd_get_link(&nd));
if (dentry->d_inode->i_op->put_link)
-   dentry->d_inode->i_op->put_link(dentry, &nd);
+   dentry->d_inode->i_op->put_link(dentry, &nd, cookie);
+   cookie = ERR_PTR(0);
}
-   return res;
+   return PTR_ERR(cookie);
 }
 
 int vfs_follow_link(struct nameidata *nd, const char *link)
@@ -2395,23 +2401,19 @@ int page_readlink(struct dentry *dentry,
return res;
 }
 
-int page_follow_link_light(struct dentry *dentry, struct nameidata *nd)
+void *page_follow_link_light(struct dentry *dentry, struct nameidata *nd)
 {
struct page *page;
nd_set_link(nd, page_getlink(dentry, &page));
-   return 0;
+   return page;
 }
 
-void page_put_link(struct dentry *dentry, struct nameidata *nd)
+void page_put_link(struct dentry *dentry, struct nameidata *nd, void *cookie)
 {
if (!IS_ERR(nd_get_link(nd))) {
-   struct page *page;
-   page = find_get_page(dentry->d_inode->i_mapping, 0);
-   if (!page)
-   BUG();
+   struct page *page = cookie;
kunmap(page);
page_cache_release(page);
-   page_cache_release(page);
}
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -993,8 +993,8 @@ struct inod

Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Linus Torvalds


On Fri, 19 Aug 2005, Linus Torvalds wrote:
> 
>  - document this as a fundamental fact, and apply the ncpfs patch. Local 
>filesystems can still continue to use the generic helper functions 
>(all other users _are_ local filesystems).

Actually, looking at the ncpfs patch, I'd rather not apply that patch 
as-is. It looks like it will totally disable symlink caching, which would 
be kind of sad. Somebody willing to do the same thing NFS does?

NFS hides away the "struct page *" pointer at the end of the page data, so
that when we pass the pointer to the virtual address around, we can 
trivially look up the "struct page".

An alternative is to make the symlink address-space use a gfp_mask of 
GFP_KERNEL (no highmem), at which point ncpfs_put_link() just becomes 
something like

void ncpfs_put_link(struct dentry *dentry, struct nameidata *nd)
{
char *addr = nd_get_link(nd);

if (!IS_ERR(addr))
page_cache_release(virt_to_page(addr));
}

which is pretty ugly, but even simpler than the NFS trick.

Anybody?

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Linus Torvalds


On Fri, 19 Aug 2005, Linus Torvalds wrote:
> 
> The generic "page cache for symlinks" code does _not_ support invalidating 
> the cache while it's being used. A local filesystem will obviously never 
> invalidate the cache at all. 
> 
> Hmm.. NFS _does_ use the page cache for symlinks [..]

Looking more and more at this, I'm convinced this is it.

Basically, page_follow_link_light() and page_put_link() depend on the fact
that the page in the page cache is the same one the whole time:  
page_follow_link_light() will increment the page count of the page it
finds at offset 0, and page_put_link() will decrement it. If the page has 
changed, they increment/decrement different pages.

There's two ways to fix this:

 - document this as a fundamental fact, and apply the ncpfs patch. Local 
   filesystems can still continue to use the generic helper functions 
   (all other users _are_ local filesystems).

 - make "nameidata" contain not just the virtual addresses of the names, 
   but also have a "struct page *pages[MAX_NESTED_LINKS + 1]", and save 
   away the page there. That will fix ncpfs, and we could then make NFS 
   also use the generic routines.

I suspect that #1 is the prudent one. We have a patch already, and we
don't want to grow nameidata. I'll commit a comment at the head of
page_follow_link_light() too.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Linus Torvalds


On Fri, 19 Aug 2005, Anton Altaparmakov wrote:
>
>   struct page *page;
>   page = find_get_page(dentry->d_inode->i_mapping, 0);
>   if (!page)
> > BUG();

Something has truncated the mapping. 

My guess is that you had a cache invalidate event that removed the page
from the mapping while it was being used. That might also explain why this
only happens for ncpfs. I bet that in the other cases, the mapping was 
also invalidated, but re-filled immediately, and your strace slowed the 
other process down enough that it didn't get to re-fill the cache it 
invalidated or something like that.

The fact that it happens only for cross-volume things might also be 
explained that way - is there something ncpfs does when switching volumes 
that might trigger a cache invalidate for another volume (either on the 
client side or the server side - maybe the server tends to try to break 
the connection for the old volume when you start traversing a new one?)

The generic "page cache for symlinks" code does _not_ support invalidating 
the cache while it's being used. A local filesystem will obviously never 
invalidate the cache at all. 

Hmm.. NFS _does_ use the page cache for symlinks, but uses it slightly 
differently: instead of relying on the page cache entry being the same 
when freeing the page, it just caches the page it looked up in the page 
cache (ie "nfs_follow_link()" does look up the page cache, but then hides 
the page pointer inside the page data itself (uglee), and thus does not 
depend on the mapping staying the same (nfs_put_link() just takes the page 
from the symlink data).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] inode->u.nfs_i allocated separately

2001-06-27 Thread Linus Torvalds


On Wed, 27 Jun 2001, Alexander Viro wrote:
>
> We get inode initilization (generic parts) spread all over the place and
> sooner or later it's going to bite us, for one thing.

I don't really think so. The "struct inode" has become less and less
important as far as the VFS layer is concerned, and most of the VFS layer
functions handling it are really all just helper functions anyway.

Instead of calling "get_empty_inode()", you'd just call
"initialize_inode()" instead.

The current exceptions to this is the "iget()" family of interfaces, but
those are on the killing block anyway, due to other problems.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [PATCH][RFC] inode->u.nfs_i allocated separately

2001-06-27 Thread Linus Torvalds


On Wed, 27 Jun 2001, Ben LaHaise wrote:
> On Wed, 27 Jun 2001, Linus Torvalds wrote:
>
> > If we're going to do this for a major filesystem, then I'd really just
> > rather see this being done generically during 2.5.x, making "u" be a
> > pointer only, and having the generic "iput()" just always free the dang
> > thing.
>
> Are you certain that adding yet another level of pointer indirection here
> is a good idea?  The whole cache miss and indirection issue is exactly why
> almost all systems in the world make use of a VNODE style interface.

Oh, I'm not. But if we do this for NFS, then we'd better think hard about
whether we should (a) do it at all or (b) do it for everything.

The alternative, of course, is to just have different allocation-pools for
different filesystems. There's nothing that says that we should be able to
re-use an inode allocation done for one filesystem for another filesystem.
This would be my favourite approach, in fact. Instead of having

struct inode {
generic fields
union {
specific-ext2 struct
specific-nfs struct
...
}
}

we could _easily_ have the setup

struct ext2_inode {
struct inode inode; /* Generic fields */
specific-ext2 struct;   /* specific fields */
};

and then when ext2 calls down to the generic VFS layer it just passes

&ext2_inode->inode

down, and when it gets a "struct inode *" it uses "inode_to_ext2()" to
convert it to an ext2 inode pointer.

This is what the "struct list_head" thing does, and it works remarkably
well. It allows for embedding a list (or a hundred) into any object. The
above would take the same approach, and allow embedding an inode (and
maybe several) into any object.

Advantages: no extra memory use, no indirection, no memory allocation
overhead.

Disadvantages: ??

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [PATCH][RFC] inode->u.nfs_i allocated separately

2001-06-27 Thread Linus Torvalds


If we're going to do this for a major filesystem, then I'd really just
rather see this being done generically during 2.5.x, making "u" be a
pointer only, and having the generic "iput()" just always free the dang
thing.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Bad interaction between sync and truncate

2001-06-12 Thread Linus Torvalds


On Tue, 12 Jun 2001, Alexander Viro wrote:
> 
> We need a bit more - set_blocksize() is another source of the same problem,
> AFAICS. How about the following?

Makes sense.

The "set_blocksize()" thing looks like it could have bit us before - it
didn't check "page->mapping", so it could make mapped pages do bad bad
things already. 

Although setting the blocksize on something that is already mounted is
obviously bad in itself, so maybe the old problems just weren't a real
issue ;)

Applied.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Bad interaction between sync and truncate

2001-06-12 Thread Linus Torvalds


On Tue, 12 Jun 2001, Alexander Viro wrote:
> 
> OK, let me describe the race again:

And let me descibe the BUG again.

> * block_flushpage() makes highmem page anonymous.
> * at some point, after the buffers are not busy, but before page_launder()
> sees the page again we get invalidate_buffers() (normally - from umount()).

And this is buggy. We already _have_ code to check that invalidate_buffers
shouldn't touch page-cache buffers. That one-liner was just buggy.

-   if (bh->b_page->mapping)
+   if (!bh->b_pprev)

Problem solved the right way, logically, so that invalidate_buffers() does
the same thing regardless of whether the mapping has been removed or not,
instead of having pseudo-random behaviour.

No?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Bad interaction between sync and truncate

2001-06-12 Thread Linus Torvalds


On Tue, 12 Jun 2001, Alexander Viro wrote:
> 
> invalidate_buffers() will happily take such buffer_head from LRU lists
> and put it on free list. With ->b_data still pointing to the highmem
> page.

Hey, that's easy to fix.

We have a real bug, and the real bug is that we have them on the dirty
list, and we already have this

/* Part of a mapping? */
if (bh->b_page->mapping)
continue;

special case. It's just that the test is wrong. It should probably be
something like

/* Not hashed? */
if (list_empty(bh->b_hash)
continue;

instead (yeah, yeah, I know the buffer heads don't use the list
structures, I wish it did. The real way to check for the bh being hashed
to check "bh->b_pprev" for NULL, but I wish it wasn't).

Don't blame the VM layer, I think page_launder() and friends do the right
thing. It's the act of invalidation that just isn't supposed to look at
non-hashed buffers.

(That makes a lot more sense than the current test anyway: if the buffer
isn't hashed, then it isn't reachable by "dev", so it clearly also
shouldn't be invalidated by "dev").

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Bad interaction between sync and truncate

2001-06-12 Thread Linus Torvalds


On Tue, 12 Jun 2001, Alexander Viro wrote:
> 
> On Tue, 12 Jun 2001, Linus Torvalds wrote:
> 
> > release the page cache part of the page count, but we'll leave the buffers
> > untouched because somebody is still using them. It's really nothing but a
> > "the page is still busy, we'll delay the free" thing.
> 
> Erm... Sorry, but what is going to free them? It will not happen until
> all buffer_heads are on the free list. And AFAICS it won't happen until
> we call set_blocksize() or invalidate_buffers().

What?

They'll get free'd the next time through the page_launder() code, when
there is no longer anybody trying to push them out or otherwise keeping
their counts elevated.

"page->buffers". That's the important part.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Bad interaction between sync and truncate

2001-06-12 Thread Linus Torvalds


On Tue, 12 Jun 2001, Alexander Viro wrote:
> 
> On Tue, 12 Jun 2001, Steve Lord wrote:
> 
> > We now return to truncate_complete_page() and remove the page from the inode,
> > the only reference to the page is now the buffer head.
> 
> The real problem is in block_flushpage() - we blindly convert page into
> buffer cache one, polluting the latter with highmem pages.

Tell me the problem again..

Yes, we will convert a mapped page into a "buffer page", but we will _not_
add any of the buffers to any LRU lists or anything like that, so the
buffers won't be reachable from anything that they weren't reachable from
before already.

So either the buffers had been reachable before - in which case the
anonymization adds no new cases, or they had not been reachable before -
in which case the anonymization will basically "lose" the buffers until
the page can later be reclaimed.

> Do we really need that logics at all? Reuse of page cache pages for
> buffer cache, that is...

We _do_ need the logic. And don't think of it as a "reuse of page cache
pages" issue. Think of it as a pure memory management issue - we will
release the page cache part of the page count, but we'll leave the buffers
untouched because somebody is still using them. It's really nothing but a
"the page is still busy, we'll delay the free" thing.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-22 Thread Linus Torvalds


On Tue, 22 May 2001, Andreas Dilger wrote:
> 
> Actually, the LVM snapshot interface has (optional) hooks into the filesystem
> to ensure that it is consistent at the time the snapshot is created.

Note that this is still fundamentally a broken interface: the filesystem
may not _have_ a block device underneath it, yet you might very well like
to do defragmentation and backup none-the-less.

Also, lvm snapshots are fundamentally limited to read-only data, which
means that the LVM interfaces cannot be used for defragmentation and lazy
fsck etc anyway. You _have_ to do those at a filesystem level.

disk snapshots are useful, but they are not the answer.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-22 Thread Linus Torvalds


On Tue, 22 May 2001, Jan Harkes wrote:
> 
> something like,
> 
> ssize_t kioctl(int fd, int type, int cmd, void *inbuf, size_t inlen,
>  void *outbuf, size_t outlen);
> 
> As far as functionality and errors it works like read/write in a single
> call, pretty much what Richard proposed earlier with a new 'transaction'
> syscall. Maybe type is not needed, and cmd can be part of the inbuf in
> which case it would be identical.

I'd rather have type and cmd there, simply because right now the
"cmd" passed in to the ioctl is not well-defined, as several different
drivers use the same numbers for different things (which is why I want to
expand that to  to get uniqueness).

Also, I think the cmd is separate from the data, so I don't think it
necessarily makes sense to mix the two. Even if we want to have an ASCII
command, I'd think that should be separate from the arguments, ie we'd
have 

ssize_t kioctl(int fd, const char *cmd, const void *inbuf ...

instead of trying to mix them. This is especially true as the
"inbuf" would be a user-mode pointer, while "cmd" would come from kernel
space (whether in the form of a  number pair or as a kernel
string).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-21 Thread Linus Torvalds



On Tue, 22 May 2001, Matthew Wilcox wrote:
>
> > command + rw-transaction: "dear device please mangle this data"
> >(crypto processors come to mind...)
>
> I can't think of a reasonable tool-based approach to this, but I can
> definitely see that a program could use this well.  It simply requires
> that you use the filp to store your state.

Nope. You can (and people do, quite often) share filps. So you can't
associate state with it.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-21 Thread Linus Torvalds



On Mon, 21 May 2001, Alan Cox wrote:
>
> > Sure. But we have to do two syscalls only if ioctl has both in- and out-
> > arguments that way. Moreover, we are talking about non-trivial in- arguments.
> > How many of these are in hotspots?
>
> There is also a second question. How do you ensure the read is for the right
> data when you are sharing a file handle with another thread..
>
> ioctl() has the nice property that an in/out ioctl is implicitly synchronized

I don't think we can generically replace ioctl's with read-write, and we
shouldn't bend over backwards even _trying_.

The important thing would be to give them more structure, and as far as
I'm personally concerned I don't even care if the user-level access method
ends up being the same old thing. After all, we have magic numbers
everywhere: even a system call uses magic numbers for the syscall entry
numbering. The thing that makes system call numbers nice is that the
number gets turned into a more structured thing with proper type checking
and well-defined semantics very very early on indeed.

It shouldn't be impossible to do the same thing to ioctl numbers. Nastier,
yes. No question about it. But we don't necessarily have to redesign the
whole approach - we only want to re-design the internal kernel interfaces.

That, in turn, might be as simple as changing the ioctl incoming arguments
of  into a structure like .

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-20 Thread Linus Torvalds


On Mon, 21 May 2001, Ingo Molnar wrote:
> 
> On Sun, 20 May 2001, Alexander Viro wrote:
> 
> > Linus, as much as I'd like to agree with you, you are hopeless
> > optimist. 90% of drivers contain code written by stupid gits.
> 
> 90% of drivers contain code written by people who do driver development in
> their spare time, with limited resources, most of the time serving as a
> learning excercise. And they do this freely and for fun. Accusing them of
> being 'stupid gits' is just micharacterising the situation.

I would disagree with both of you.

The problem is not whether people do it with limited resources or time, or
whether they are stupid or not.

The problem is that if you expect to get nice code, you have to have nice
interfaces and infratructure. And ioctl's aren't it.

The reason we _can_ write beautiful filesystems these days is that the VFS
layer _supports_ it. In fact, the VFS layer has tons of infrastructure and
structure that makes it _hard_ to write bad filesystem code (which is not
to say that we don't have ugly code there - but much of it is due to
historically not having had quite the same level of infrastructure).

If we had nice infrastructure to make ioctl's more palatable, we could
probably make do even with the current binary-number interfaces, simply
because people would use the infrastructure without ever even _seeing_ how
lacking the user-level accesses are.

But that absolutely _requires_ that the driver writers should never see
the silly "pass a random number and a random argument type" kind of
interface with no structure or infrastructure in place.

Because right now even _good_ programmers make a mess of the fact that
they get passed a bad interface.

Think of it this way: the user interface to opening a file is
"open()" with pathnames and magic flags. But a filesystem never even
_sees_ that interface, it sees a very nicely structured setup where all
the argument parsing and locking has already been done for it, and the
magic flags don't even exist any more as far as the low-level FS is
concerned. Which is why filesystems _can_ be clean.

In contrast, ioctl's are passed through directly, with no help to make
them clean. 

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-20 Thread Linus Torvalds


Davem, check the last thing, please.

On Sun, 20 May 2001, Alexander Viro wrote:
> 
> On Sun, 20 May 2001, Linus Torvalds wrote:
> 
> > > How about moratorium on new ioctls in the meanwhile? Whatever we do in
> > > fs/ioctl.c, it _will_ take time.
> > 
> > Ehh.. Telling people "don't do that" simply doesn't work. Not if they can
> > do it easily anyway. Things really don't get fixed unless people have a
> > certain pain-level to induce it to get fixed.
> 
> Umm... How about the following:  you hit delete on patches that introduce
> new ioctls, I help to provide required level of pain.  Deal?

It still doesn't work.

That only makes people complain about my fascist tendencies. See the
thread about device numbers, where Alan just says "ok, I'll do it without
Linus then". 

The whole point of open source is that I don't have that kind of power. I
can only guide, but the most powerful guide is by guiding the _design_,
not micro-managing.

> BTW, -pre4 got new bunch of ioctls. On procfs, no less.

I know. David has zero taste. 

Davem, why didn't you just make new entries in /proc/bus/pci and let
people do "mmap(/proc/bus/pci//mem)" instead of having idiotic ioctl's
to set "this is a IO handle" and "this is a MEM handle"? This particular
braindamage is not too late to fix..

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-20 Thread Linus Torvalds


On Sun, 20 May 2001, Alexander Viro wrote:
> 
> On Sun, 20 May 2001, Matthew Wilcox wrote:
> 
> > On Sun, May 20, 2001 at 03:11:53PM -0400, Alexander Viro wrote:
> > > Pheeew... Could you spell "about megabyte of stuff in ioctl.c"?
> > 
> > No.
> > 
> > $ ls -l arch/*/kernel/ioctl32*.c
> > -rw-r--r--1 willywilly   22479 Jan 24 16:59 
>arch/mips64/kernel/ioctl32.c
> > -rw-r--r--1 willywilly  109475 May 18 16:39 
>arch/parisc/kernel/ioctl32.c
> > -rw-r--r--1 willywilly  117605 Feb  1 20:35 
>arch/sparc64/kernel/ioctl32.c
> > 
> > only about 100k.
> 
> You are missing all x86-only drivers.

Now, the point is that it _is_ doable, and by doing it in one standard
place (instead of letting each architecture fight it on its own) we'd
expose the problem better, and maybe get rid of some of those
architecture-specific ones.

For example, right now the fact that part of the work _has_ been done by
things like Sparc64 has not actually had any advantages: the sparc64 work
has not allowed people to say "let's try to merge this work", because it
has not been globally relevant, and a sparc64-only file has not been a
single point of contact that could be used to clean up things.

In contrast, a generic file has the possibility of creating new VFS or
device-level interfaces. You can catch block device ioctl's and turn them
into proper block device requests - and send them down the right request
queue. Suddenly a block device driver doesn't just get READ/WRITE
requests, it gets EJECT/SERIALIZE requests too. Without having to add
magic ioctl's that are specific to just one device driver. 

So by having a common point of access, you can actually encourage _fixing_
some of the problems. Historically, sparc64 etc have not been able to do
that - they can only try to convert different ioctl's into another format
and then re-submitting it..

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-20 Thread Linus Torvalds


On Sun, 20 May 2001, Alexander Viro wrote:
> 
> Pheeew... Could you spell "about megabyte of stuff in ioctl.c"?

I agree. But it would certainly force people to think about this. And it
may turn out that a lot of it can be streamlined, and not that much ends
up being used very much.

It would also allow a single place of catching the generic ones, and as
such be a place to try to make things like the network ioctl's more
regular: setting things like network device duplex with _real_ interfaces
instead of hiding it in ioctl routines.

Also, note that many ioctl's actually do have fairly regular meaning, and
that it _is_ possible to catch a number of them with those regular
things:

switch (_IOC_TYPE(number)) {
case 'x':
xfs_ioctl(..);

and actually try to enforce the things that Documentation/ioctl-number.txt
tries to document. And make the clashes _explicit_ and thus make people
have more incentive to really try to fix it.

> How about moratorium on new ioctls in the meanwhile? Whatever we do in
> fs/ioctl.c, it _will_ take time.

Ehh.. Telling people "don't do that" simply doesn't work. Not if they can
do it easily anyway. Things really don't get fixed unless people have a
certain pain-level to induce it to get fixed.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-20 Thread Linus Torvalds


On Sun, 20 May 2001, Russell King wrote:
>
> On Sun, May 20, 2001 at 11:46:33AM -0700, Linus Torvalds wrote:
> > Nobody will expect the above to work, and everybody will agree that the
> > above is a BUG if the read() call will actually follow the pointer.
> 
> I didn't say anything about read().  I said write().  Obviously it
> wouldn't work for read()!

No, but the point is, everybody _would_ consider it a bug if a
low-level driver "write()" did anything but touched the explicit buffer.

Code like that would not pass through anybody's yuck-o-meter. People would
point fingers and say "That is not a legal write() function". Anybody who
tried to make write() follow pointers would be laughed at as a stupid git.

Anybody who makes "ioctl()" do the same is just following years of
standard practice, and the yuck-o-meter doesn't even register.

THAT is the importance of psychology.

Technology is meaningless. What matters is how people _think_ of it.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-20 Thread Linus Torvalds


On Sun, 20 May 2001, David Woodhouse wrote:
> 
> If that had been done right the first time, you wouldn't have had to either.
> For that matter, it's often the case that if the ioctl had been done right
> the first time, nobody would have had to fix it up for any architecture.

The problem with ioctl's is, let me repeat, not technology.

It's people.

ioctl's are a way to do ugly things. That's what they have ALWAYS been.
And because of that, people don't care about following the rules - if
ioctl's followed the rules, they wouldn't _be_ ioctls in the first place,
but instead have a good interface (say, read()/write()).

Basically, ioctl's will _never_ be done right, because of the way people
think about them. They are a back door. They are by design typeless and
without rules. They are, in fact, the Microsoft of UNIX.

The only way to fix ioctl's is to force people to think about them in
another way. Because if you don't, there is always going to be another
driver writer who adds his own ioctl because it's the easy way to do
whatever he wants without giving it a second of _design_ thought.

Now, a good way to force the issue may be to just remove the "ioctl"
function pointer from the file operations structure altogether. We don't
have to force peopel to use "read/write" - we can just make it clear that
ioctl's _have_ to be wrapped, and that the only ioctl's that are
acceptable are the ones that are well-designed enough to be wrappable. So
we'd have a "linux/fs/ioctl.c" that would do all the wrapping, and would
also be able to do all the stuff that is currently done by pretty much
every single architecture out there (ie emulation of ioctl's for different
native modes).

It would probably not be that horrible. Many ioctl's are probably not all
that much used by any real programs any more. The most common ones by far
are the tty ones - and the truly generic ones like "FIONREAD" that it
actually would make sense to generalize more.

Catching stuff like EJECT at a higher layer and turning THOSE kinds of
things into real block device operations would clean up drivers and make
them more uniform.

Would fs/ioctl.c be an ugly mess of some special cases? Yes. But would
that make the ugliness explicit and possibly easier to try to manage and
fix? Very probably. And it would mean that driver writers could not just
say "fuck design, I'm going to do this my own really ugly way". 

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-20 Thread Linus Torvalds


On Sun, 20 May 2001, Russell King wrote:
>
> On Sat, May 19, 2001 at 08:26:20PM -0700, Linus Torvalds wrote:
> > You're missing the point.
> 
> I don't think Richard is actually.  I think Richard has hit a nail
> dead on its head.
> 
> > It's ok to do "read()/write()" on structures.
> 
> Ok, we can read()/write() structures.  So someone invents the following
> structure:
> 
>   struct foo {
>   int cmd;
>   void *data;
>   } foo;
> 
> Now they use write(fd, &foo, sizeof(foo)); Haven't they just swapped
> the ioctl() interface for write() instead?

Wrong.

Nobody will expect the above to work, and everybody will agree that the
above is a BUG if the read() call will actually follow the pointer.

Read my email. And read the last line: "psychology is important".

Step #1 in programming: understand people.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-19 Thread Linus Torvalds


On Sat, 19 May 2001, Richard Gooch wrote:
>
> Matthew Wilcox writes:
> > On Sat, May 19, 2001 at 10:22:55PM -0400, Richard Gooch wrote:
> > > The transaction(2) syscall can be just as easily abused as ioctl(2) in
> > > this respect.
> > 
> > But read() and write() cannot.
> 
> Sure they can. I can pass a pointer to a structure to either of them.

You're missing the point.

It's ok to do "read()/write()" on structures. In fact, people do that all
the time (and then they complain about the file not being portable ;)

The problem with ioctl is that not only are people passing ioctl's
pointers to structures, but:
 - they're not telling how big the structure is
 - the structure can have pointers to other places
 - sometimes it modifies the structure passed in

None of which are "network-nice". Basically, ioctl() is historically used
as a "pass any crap into driver , and the driver - and ONLY the driver
- will know what to do with it".

And when _only_ a driver knows what the arguments mean, upper layers can't
encapsulate them. Upper layers cannot make a packet of the argument and
send it over the network to another machine. Upper layers cannot do
sanity-checking on things like "is this argument a valid pointer". Which
means, for example, that not only can you not send the ioctl arguments
anywhere, but ioctl's have also historically been a hot-bed of bugs.

Example traditional ioctl bugs: use kernel pointers to access the argument
(because it just happens to work on x86, never mind the fact that if the
argument is bad you'll get a kernel oops and/or a serious security error).
Other example: different drivers/f ilesystems implementing the same ioctl,
but disagreeing on what the argument means (is it a pointer to an integer
argument, or the integer itself?).

Now, the advantage of using read()/write() is (a) that it's unambiguous
where the argument comes from and how big it is and (b) because of that
the _psychology_ is different. You don't get into this "pass random crap
around, let the kernel modify user data structures directly" mentality.

And psychology is important.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion codeinuserspace

2001-05-19 Thread Linus Torvalds


On Sat, 19 May 2001, Brad Boyer wrote:
> 
> If I understand the status of stuff correctly, I think this would make it
> a lot more painful to admin if it became a requirement to use initrd on
> everything just to be able to boot.

Don't get too hung up on initrd. Symbolic links really _are_ workable ways
to basically cache the information across boots, and the real problems are
elsewhere.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-19 Thread Linus Torvalds


On Sat, 19 May 2001, Alan Cox wrote:
>
> > Now that I'm awake and refreshed, yeah, that's awful.  But
> > echo "hot-add,slot=5,device=/dev/sda" >/dev/md0/control *is* sane.  Heck,
> > the system can even send back result codes that way.
> 
> Only to an English speaker. I suspect Quebec City canadians would prefer a
> different command set.

I was waiting for the "anglo-saxon" argument.

I don't think it's a valid argument. You already have "/dev". You already
have english names for the numbers in ioctl's (and let's not be mentally
dishonest and say "numbers are cross-cultural", because NOBODY MUST EVER
USE THE RAW NUMBERS - you have to use the anglo-saxon #define'd names
because the numbers aren't even cross-platform on Linux, much less
portable to other systems).

So the "English is bad" argument is a complete non-argument.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion codein userspace

2001-05-19 Thread Linus Torvalds


On Sat, 19 May 2001, Ben LaHaise wrote:
> 
> 1. Generic lookup method and argument parsiing (fs/lookupargs.c)

Looks sane.

> 2. Restricted block device (drivers/block/blkrestrict.c)

This is not very user-friendly, but along with symlinks this makes perfect
sense. It would make partition handling a _lot_ simpler.

Note, however, that I think the "restricted block device" is a much more
generic issue than just block devices. I've already discussed with Alan
the possibility of making _all_ file descriptors have the notion of
"restrictions", notably the "start, end" kind of things.

It is very useful for other things too - imagine opening /dev/mem, and
wanting to pass a restricted portiong of it to other processes with the
standard file descriptor passing facilities (think "secure DGA" for the X
server, but also think untrusted users that can read parts of shared files
etc - a suid program that opens a file, restricts it, drops privileges and
knows that the program can only access a specific part of the file)

> 3. Userspace partition code proposal

Yes and no.

I absolutely thihnk the idea that users actually _using_ these names is a
horrible one, and fraught with potential for much too easy mistakes that
end up being disastrous.

But having symlinks that are created by a special program would be ok.

[ Also, note how symlinks would make the point of initrd completely
  moot. You don't have to have initrd to initialize the thing, you can
  initialize the thing at installation time and when doing fdisk, and the
  symlinks would act as the permanent markers. ]

HOWEVER, you have to realize that there are serious security and
maintenance issues here, and I think your idea breaks down completely
because of that.

The thing is, you only have permissions on a "per-object" basis, and it's
common practice to have different permissions for different partitions.
Your scheme does not allow this. Which means that it is fundamentally
broken. Sorry.

So don't go overboard. The name-based thing is useful, but it's useful for
only certain things. And you must _never_ forget the security and
management issues.

For example, if you can open a serial port in the first place, you can set
its baud-rate. So it's ok to make baud-rate part of the name. And once you
have permission to read /dev/fd0 it doesn't make sense to limit you to one
particular format. So it's ok to have the disk format be part of the name.

But it's not possible to make the partition be a "name" issue. Because
while you obviously need different names, you _also_ need different
permissions.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-19 Thread Linus Torvalds


On Sat, 19 May 2001, Alexander Viro wrote:
>
>   Folks, before you get all excited about cramming side effects into
> open(2), consider the following case:

Your argument is stupid, imnsho.

Side-effects are perfectly fine if they are _local_ to the file
descriptor. Your example is contrieved and idiotic.

Filename extensions would not replace ioctl's. But they are wonderful ways
to avoid unnecessary binary name-spaces, like the ones we have with
"callout" TTY names, and the one that the fb people had.

For example, do a "ls -l /dev/fd0*", and ponder. Also, realize that we
have these hard-coded names in _addition_ to the magic ioctl to set even
more parameters. These are all stupid and bad, and it would have been a
_lot_ cleaner to be able to do

open("/dev/fd0/H1440", O_RDWR)..

or

open("/dev/fd0/HD,18,85", O_RDWD)

to open special non-standard high-density modes.

We already did this, in a very limited and stupid way, by encoding the
minor number and generating a standard naming scheme. We can do the same
thing in a _much_ more generic way by just realizing that we wanted the
open to be name-based in the first place.

These are _not_ side effects. They are very much naming conventions. If I
want to open a the floppy in one of the special extended modes, it makes a
LOT more sense to just open it with the naming, than to open a "generic"
floppy device only to them use a magic and very unreadable ioctl to set
the mode of the device.

In short, I don't buy your arguments for one single second.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: The INN/mmap bug

2000-09-18 Thread Linus Torvalds



On Mon, 18 Sep 2000, Alexander Viro wrote:
> Umm... OK. Let me put it that way:
>   * uptodate pages should never become non-uptodate.

Agreed. That "ClearPageUptodate()" thing definitely looks like a bug.

>   * we do multiple read requests on the same buffer.

Right now, yes. And that is a bad bug.

>   * the only thing saving us from data loss is that non-uptodate
> pages never lose buffers when they have data newer than on disk (uptodate
> pages do, but they never have data _older_ than on disk). Thus we can
> afford rereading a buffer on non-uptodate page if bh was lost and never
> have to reread it on uptodate page.

Right. Basically we will drop the buffers only after they have been
written out, so normal read/write can never lose data.

It's only the case of shared mmap's that can lose data due to the stupid
"let's read it in from disk again" thing.

>   * look at the explanation above and see if it looks brittle. It
> really needs to be documented

Oh, I certainly agree.

And I also agree that we should work on making the "create_page_buffers"
thing more easily used - the code duplication in the different parts of
fs/buffer.c is quite horrible. It tends to be the same kind of logic,
except it has small differences in four-five different versions.

I think it would be better to do the code part first, actually: once that
is done, a single comment above "create_page_buffers" will explain
everything that is going on..

>   That's what makes me unhappy about the current situation + obvious
> fixes. It works, but the proof is... well, not pretty.

Oh, agreed. I think we should clean up the code. I looked at it yesterday,
and it didn't look all that horribly bad, but I lost interest. I don't
know if it is worth doing before 2.4.x, as the current code certainly
should work with the small changes already proposed.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]