Re: Finding hardlinks

2007-01-03 Thread Pavel Machek
Hi!

   the use of a good hash function.  The chance of an accidental
   collision is infinitesimally small.  For a set of 
   
100 files: 0.03%
  1,000,000 files: 0.03%
  
  I do not think we want to play with probability like this. I mean...
  imagine 4G files, 1KB each. That's 4TB disk space, not _completely_
  unreasonable, and collision probability is going to be ~100% due to
  birthday paradox.
  
  You'll still want to back up your 4TB server...
 
 Certainly, but tar isn't going to remember all the inode numbers.
 Even if you solve the storage requirements (not impossible) it would
 have to do (4e9^2)/2 = 8e18 comparisons, for which computers don't have
 enough CPU power just yet.

Storage requirements would be 16GB of RAM... that's small enough. If
you sort, you'll only need 32*2^32 comparisons, and that's doable.
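
To make the sort-and-scan concrete, here is a minimal user-space
sketch (the helper names are invented, and it assumes the inode
numbers have already been collected into an array):

#include <stdint.h>
#include <stdlib.h>

/* Hedged sketch of the sort-then-scan check described above; the
 * helper names are invented.  qsort() needs O(n log n) comparisons:
 * for n = 2^32 that is roughly the 32*2^32 figure quoted above. */
static int cmp_ino(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
	return (x > y) - (x < y);
}

static int any_ino_collision(uint64_t *inos, size_t n)
{
	size_t i;

	qsort(inos, n, sizeof(*inos), cmp_ino);
	for (i = 1; i < n; i++)
		if (inos[i] == inos[i - 1])
			return 1;	/* duplicate inode number */
	return 0;
}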

I do not claim it is _likely_. You'd need hardlinks, as you
noticed. But a system should work, not just work with high probability,
and I believe we should solve this in the long term.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: Finding hardlinks

2007-01-03 Thread Miklos Szeredi
the use of a good hash function.  The chance of an accidental
collision is infinitesimally small.  For a set of 

 100 files: 0.03%
   1,000,000 files: 0.03%
   
   I do not think we want to play with probability like this. I mean...
   imagine 4G files, 1KB each. That's 4TB disk space, not _completely_
   unreasonable, and collision probability is going to be ~100% due to
   birthday paradox.
   
   You'll still want to back up your 4TB server...
  
  Certainly, but tar isn't going to remember all the inode numbers.
  Even if you solve the storage requirements (not impossible) it would
  have to do (4e9^2)/2 = 8e18 comparisons, for which computers don't have
  enough CPU power just yet.
 
 Storage requirements would be 16GB of RAM... that's small enough. If
 you sort, you'll only need 32*2^32 comparisons, and that's doable.
 
 I do not claim it is _likely_. You'd need hardlinks, as you
 noticed. But a system should work, not just work with high probability,
 and I believe we should solve this in the long term.

High probability is all you have.  Cosmic radiation hitting your
computer will more likely cause problems than colliding 64-bit inode
numbers ;)

But you could add a new interface for the extra paranoid.  The
proposed 'samefile(fd1, fd2)' syscall is severely limited by the heavy
weight of file descriptors.

Another idea is to export the filesystem-internal ID as an
arbitrary-length cookie through the extended attribute interface.  That
could be stored/compared by the filesystem quite efficiently.
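
A minimal sketch of how such a cookie could be consumed from user
space (the attribute name "system.fileid" is purely hypothetical; no
filesystem exports it today):

#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Sketch only: compare two files by a filesystem-internal ID
 * exported as an extended attribute.  "system.fileid" is a
 * hypothetical name.  Returns 1 if same object, 0 if different,
 * -1 on error. */
static int same_file_by_cookie(const char *p1, const char *p2)
{
	char c1[256], c2[256];
	ssize_t l1 = getxattr(p1, "system.fileid", c1, sizeof(c1));
	ssize_t l2 = getxattr(p2, "system.fileid", c2, sizeof(c2));

	if (l1 < 0 || l2 < 0)
		return -1;
	return l1 == l2 && !memcmp(c1, c2, (size_t)l1);
}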

But I think most apps will still opt for the portable interfaces
which, while not perfect, are good enough.

Miklos


Re: [nfsv4] RE: Finding hardlinks

2007-01-03 Thread Benny Halevy
Trond Myklebust wrote:
 On Sun, 2006-12-31 at 16:25 -0500, Halevy, Benny wrote:
 Trond Myklebust wrote:
  
 On Thu, 2006-12-28 at 15:07 -0500, Halevy, Benny wrote:
 Mikulas Patocka wrote:
 BTW, how does (or how should?) the NFS client deal with cache coherency
 if filehandles for the same file differ?

 Trond can probably answer this better than me...
 As I read it, currently the nfs client matches both the fileid and the
 filehandle (in nfs_find_actor). This means that different filehandles
 for the same file would result in different inodes :(.
 Strictly following the nfs protocol, comparing only the fileid should
 be enough IF fileids are indeed unique within the filesystem.
 Comparing the filehandle works as a workaround when the exported filesystem
 (or the nfs server) violates that.  From a user standpoint I think that
 this should be configurable, probably per mount point.
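
 For reference, the matching logic described above lives in
 fs/nfs/inode.c; roughly paraphrased (not a verbatim excerpt):

 /* Rough paraphrase of nfs_find_actor(): an inode is reused only
  * when BOTH the fileid and the filehandle match, which is why a
  * second filehandle for the same file yields a second inode. */
 static int nfs_find_actor(struct inode *inode, void *opaque)
 {
	struct nfs_find_desc *desc = opaque;

	if (NFS_FILEID(inode) != desc->fattr->fileid)
		return 0;
	if (nfs_compare_fh(NFS_FH(inode), desc->fh))
		return 0;
	return 1;
 }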
 Matching files by fileid instead of filehandle is a lot more trouble
 since fileids may be reused after a file has been deleted. Every time
 you look up a file, and get a new filehandle for the same fileid, you
 would at the very least have to do another GETATTR using one of the
 'old' filehandles in order to ensure that the file is the same object as
 the one you have cached. Then there is the issue of what to do when you
 open(), read() or write() to the file: which filehandle do you use, are
 the access permissions the same for all filehandles, ...

 All in all, much pain for little or no gain.
 See my answer to your previous reply.  It seems like the current
 implementation is in violation of the nfs protocol and the extra pain
 is required.
 
 ...and we should care because...?
 
 Trond
 

Believe it or not, but server companies like Panasas try to follow the standard
when designing and implementing their products while relying on client vendors
to do the same.

I sincerely expect you, or anybody else for that matter, to try to
provide feedback and object to the protocol specification in case you
disagree with it (or think it's ambiguous or self-contradicting),
rather than ignoring it and implementing something else. I think we're
shooting ourselves in the foot when doing so, and it is in our common
interest to strive to reach a realistic standard we can all comply
with and use to interoperate with each other.

Benny



Re: Finding hardlinks

2007-01-03 Thread Pavel Machek
Hi!

 the use of a good hash function.  The chance of an accidental
 collision is infinitesimally small.  For a set of 
 
  100 files: 0.03%
1,000,000 files: 0.03%

I do not think we want to play with probability like this. I mean...
imagine 4G files, 1KB each. That's 4TB disk space, not _completely_
unreasonable, and collision probability is going to be ~100% due to
birthday paradox.

You'll still want to back up your 4TB server...
   
   Certainly, but tar isn't going to remember all the inode numbers.
   Even if you solve the storage requirements (not impossible) it would
   have to do (4e9^2)/2 = 8e18 comparisons, for which computers don't have
   enough CPU power just yet.
  
  Storage requirements would be 16GB of RAM... that's small enough. If
  you sort, you'll only need 32*2^32 comparisons, and that's doable.
  
  I do not claim it is _likely_. You'd need hardlinks, as you
  noticed. But a system should work, not just work with high probability,
  and I believe we should solve this in the long term.
 
 High probability is all you have.  Cosmic radiation hitting your
 computer will more likely cause problems than colliding 64-bit inode
 numbers ;)

As I have shown... no, that's not right. 32*2^32 operations is small
enough not to have problems with cosmic radiation.

 But you could add a new interface for the extra paranoid.  The
 proposed 'samefile(fd1, fd2)' syscall is severely limited by the heavy
 weight of file descriptors.

I guess that is the way to go. samefile(path1, path2) is unfortunately
inherently racy.
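
The fd-based check can at least be approximated today with fstat() -
a sketch that trusts exactly the (st_dev, st_ino) uniqueness under
debate here:

#include <sys/stat.h>

/* Approximation of the proposed samefile(fd1, fd2): holding the
 * descriptors pins both objects, avoiding the rename race of a
 * path-based check.  Still trusts (st_dev, st_ino) uniqueness. */
static int samefile_fd(int fd1, int fd2)
{
	struct stat s1, s2;

	if (fstat(fd1, &s1) || fstat(fd2, &s2))
		return -1;
	return s1.st_dev == s2.st_dev && s1.st_ino == s2.st_ino;
}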

 Another idea is to export the filesystem-internal ID as an
 arbitrary-length cookie through the extended attribute interface.  That
 could be stored/compared by the filesystem quite efficiently.

How will that work for FAT?

Or maybe we can relax the requirements that the inode number may not
change over rename and that zero-length files need unique inode
numbers...

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: Finding hardlinks

2007-01-03 Thread Martin Mares
Hello!

 High probability is all you have.  Cosmic radiation hitting your
 computer will more likely cause problems than colliding 64-bit inode
 numbers ;)

No.

If you assign 64-bit inode numbers randomly, 2^32 of them are sufficient
to generate a collision with probability around 50%.
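
The back-of-envelope birthday estimate behind that figure:

	P(n) \approx 1 - e^{-n(n-1)/(2 \cdot 2^{64})},
	\qquad P(2^{32}) \approx 1 - e^{-1/2} \approx 0.39

so 2^32 random 64-bit numbers already collide with probability close
to one half (n \approx 1.18 \cdot 2^{32} crosses 50%).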

Have a nice fortnight
-- 
Martin `MJ' Mares  [EMAIL PROTECTED]   
http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
A Bash poem: time for echo in canyon; do echo $echo $echo; done


Re: [RFC] Heads up on a series of AIO patchsets

2007-01-03 Thread Evgeniy Polyakov
On Tue, Jan 02, 2007 at 02:38:13PM -0700, Dan Williams ([EMAIL PROTECTED]) 
wrote:
 Would you have time to comment on the approach I have taken to
 implement a standard asynchronous memcpy interface?  It seems it would
 be a good complement to what you are proposing.  The entity that
 describes the aio operation could take advantage of asynchronous
 engines to carry out copies or other transforms (maybe an acrypto tie
 in as well).
 
 Here is the posting for 2.6.19.  There has since been updates for
 2.6.20, but the overall approach remains the same.
 intro: http://marc.theaimsgroup.com/?l=linux-raid&m=116491661527161&w=2
 async_tx: http://marc.theaimsgroup.com/?l=linux-raid&m=116491753318175&w=2

My first impression is that it has too many lists :)

Looks good, but IMHO there are further steps to implement.
I have not found any kind of scheduler there - what if the system has
two async engines? What if a sync engine is faster than an async one
in some cases (and it is indeed the case for small buffers) and should
be selected at that time? What if you want to add additional
transformations for some devices, like crypto processing or
checksumming?

I would just create a driver for the low-level engine and export its
functionality - iop3_async_copy(), iop3_async_checksum(),
iop3_async_crypto_1(), iop3_async_crypto_2() and so on.

There will be a lot of potential users of exactly that functionality,
not just strictly hardcoded higher-layer operations like raidX.

A more generic solution must be used to select the appropriate device.
We had a very brief discussion about the asynchronous crypto layer
(acrypto) and how its ideas could be used for async dma engines - the
user should not even know how his data has been transferred. He calls
async_copy(), which selects an appropriate device from the list of
devices that exported the functionality (a sync copy is just an
additional usual device in that case); the selection can be done in
millions of different ways, from taking the first one from the list
(this is essentially how your approach is implemented right now) to
using special, even run-time updated, heuristics (like it is done in
acrypto).

Thinking further, async_copy() is just a usual member of the async
class of operations, so the same logic must be applied on that layer
too.

But 'layers are the way to design protocols, not implement them'.
   -- David Miller, on netchannels

So, the user should not even know about layers - he should just say
'copy data from pointer A to pointer B', or 'copy data from pointer A
to socket B', or even 'copy it from file /tmp/file to
192.168.0.1:80:tcp', without ever knowing that there are sockets
and/or memcpy() calls underneath; and if the user requests to perform
it asynchronously, he must be notified later (one might expect that I
will prefer to use kevent :)
The same approach can thus be used by NFS/SAMBA/CIFS and other users.

That is how I started to implement AIO (it looks like it is becoming
popular):
1. the system exports the set of operations it supports (send, receive,
copy, crypto, ...)
2. each operation has a subsequent set of suboptions (different crypto
types, for example)
3. each operation has a set of low-level drivers which support it (with
optional performance or other parameters)
4. each driver, when loaded, publishes its capabilities (async copy
with speed A, XOR and so on)

From the user's point of view, aio_sendfile() or async_copy() will look
like the following:
1. call aio_schedule_pointer(source='0xaabbccdd', dest='0x123456578')
1. call aio_schedule_file_socket(source='/tmp/file', dest='socket')
1. call aio_schedule_file_addr(source='/tmp/file',
dest='192.168.0.1:80:tcp')

or any other similar call

then wait for received descriptor in kevent_get_events() or provide own
cookie in each call.

Each request is then converted into a FIFO of smaller requests like
'open file', 'open socket', 'get in user pages' and so on, each of
which is handled on an appropriate device (hardware or software);
completion of each request starts processing of the next one.
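
A toy sketch of the device-selection step (every name below is
invented for illustration; this is not an existing API):

#include <stddef.h>

/* Toy sketch of capability-based engine selection for async_copy().
 * All names are invented; a sync memcpy engine would simply be one
 * more entry in the list, as described above. */
struct async_engine {
	const char *name;
	unsigned long score;	/* capability metric published at load */
	int (*copy)(void *dst, const void *src, size_t len,
		    void (*done)(void *cookie), void *cookie);
	struct async_engine *next;
};

static struct async_engine *engines;

static int async_copy(void *dst, const void *src, size_t len,
		      void (*done)(void *cookie), void *cookie)
{
	struct async_engine *e, *best = NULL;

	/* Simplest heuristic: highest published score wins; a real
	 * scheduler could weigh buffer size, current load, etc. */
	for (e = engines; e; e = e->next)
		if (!best || e->score > best->score)
			best = e;
	return best ? best->copy(dst, src, len, done, cookie) : -1;
}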

Reading the microthreading design notes, I recall a comparison of the
NPTL and Erlang threading models on a Debian site - they are
_completely_ different models. NPTL creates real threads, which is what
I suppose (I hope NOT) the microthreading design will do too. It is
slow. (Or is it not, Zach? We are intrigued :)
It's damn bloody slow to create a thread compared to a correct
non-blocking state machine. The TUX state machine is similar to what I
had in my first kevent-based FS and network AIO patchset, and to what
I will use for the current async processing work.


A bit of empty words actually, but it may provide some food for
thought.

 Regards,
 
 Dan

-- 
Evgeniy Polyakov


Re: Finding hardlinks

2007-01-03 Thread Matthew Wilcox
On Wed, Jan 03, 2007 at 01:33:31PM +0100, Miklos Szeredi wrote:
 High probability is all you have.  Cosmic radiation hitting your
 computer will more likely cause problems than colliding 64-bit inode
 numbers ;)

Some of us have machines designed to cope with cosmic rays, and would be
unimpressed with a decrease in reliability.


[PATCH] fix memory corruption from misinterpreted bad_inode_ops return values

2007-01-03 Thread Eric Sandeen
CVE-2006-5753 is for a case where an inode can be marked bad, switching 
the ops to bad_inode_ops, which are all connected as:

static int return_EIO(void)
{
return -EIO;
}

#define EIO_ERROR ((void *) (return_EIO))

static struct inode_operations bad_inode_ops =
{
.create = bad_inode_create
...etc...

The problem here is that the void cast causes return types to not be 
promoted, and for ops such as listxattr which expect more than 32 bits of
return value, the 32-bit -EIO is interpreted as a large positive 64-bit 
number, i.e. 0x00000000fffffffa instead of 0xfffffffffffffffa.

This goes particularly badly when the return value is taken as a number of
bytes to copy into, say, a user's buffer for example...
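
A user-space demonstration of the effect (this deliberately calls
through a mismatched function pointer type, which is undefined
behaviour; the result described below is only what one typically sees
on x86-64, where the callee fills just the low 32 bits of the return
register):

#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

static int return_EIO(void)
{
	return -EIO;
}

int main(void)
{
	/* Mimics the old bad_inode_ops cast: a caller expecting
	 * ssize_t may see a huge positive value (the 32-bit -EIO
	 * bit pattern with zero/garbage upper bits) instead of -5. */
	ssize_t (*op)(void) = (ssize_t (*)(void))return_EIO;

	printf("%zd\n", op());
	return 0;
}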

I originally had coded up the fix by creating a return_EIO_TYPE macro
for each return type, like this:

static int return_EIO_int(void)
{
return -EIO;
}
#define EIO_ERROR_INT ((void *) (return_EIO_int))

static struct inode_operations bad_inode_ops =
{
.create = EIO_ERROR_INT,
...etc...

but Al felt that it was probably better to create an EIO-returner for
each actual op signature.  Since so few ops share a signature, I just
went ahead and created an EIO function for each individual file and
inode op that returns a value.

So here's the first stab at fixing it.  I'm sure there are style points
to be hashed out.  Putting all the functions as static inlines in a header
was just to avoid hundreds of lines of simple function declarations before 
we get to the meat of bad_inode.c, but it's probably technically wrong to 
put it in a header.  Also if putting a copyright on that trivial header file
is going overboard, just let me know.  Or if anyone has a less verbose
but still correct way to address this problem, I'm all ears.

Thanks,

-Eric

Signed-off-by: Eric Sandeen [EMAIL PROTECTED]

Index: linux-2.6.20-rc3/fs/bad_inode.h
===
--- /dev/null
+++ linux-2.6.20-rc3/fs/bad_inode.h
@@ -0,0 +1,266 @@
+/* fs/bad_inode.h
+ * bad_inode / bad_file internal definitions
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by Eric Sandeen ([EMAIL PROTECTED])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+/* Bad file ops */
+
+static inline loff_t bad_file_llseek(struct file *file, loff_t offset,
+   int origin)
+{
+   return -EIO;
+}
+
+static inline ssize_t bad_file_read(struct file *filp, char __user *buf,
+   size_t size, loff_t *ppos)
+{
+return -EIO;
+}
+
+static inline ssize_t bad_file_write(struct file *filp, const char __user *buf,
+   size_t siz, loff_t *ppos)
+{
+return -EIO;
+}
+
+static inline ssize_t bad_file_aio_read(struct kiocb *iocb,
+   const struct iovec *iov, unsigned long nr_segs, loff_t pos)
+{
+   return -EIO;
+}
+
+static inline ssize_t bad_file_aio_write(struct kiocb *iocb,
+   const struct iovec *iov, unsigned long nr_segs, loff_t pos)
+{
+   return -EIO;
+}
+
+static inline int bad_file_readdir(struct file * filp, void * dirent,
+   filldir_t filldir)
+{
+   return -EIO;
+}
+
+static inline unsigned int bad_file_poll(struct file *filp, poll_table *wait)
+{
+   return -EIO;
+}
+
+static inline int bad_file_ioctl (struct inode * inode, struct file * filp,
+   unsigned int cmd, unsigned long arg)
+{
+   return -EIO;
+}
+
+static inline long bad_file_unlocked_ioctl(struct file *file, unsigned cmd,
+   unsigned long arg)
+{
+   return -EIO;
+}
+
+static inline long bad_file_compat_ioctl(struct file *file, unsigned int cmd,
+   unsigned long arg)
+{
+   return -EIO;
+}
+
+static inline int bad_file_mmap(struct file * file, struct vm_area_struct * vma)
+{
+   return -EIO;
+}
+
+static inline int bad_file_open(struct inode * inode, struct file * filp)
+{
+   return -EIO;
+}
+
+static inline int bad_file_flush(struct file *file, fl_owner_t id)
+{
+   return -EIO;
+}
+
+static inline int bad_file_release(struct inode * inode, struct file * filp)
+{
+   return -EIO;
+}
+
+static inline int bad_file_fsync(struct file * file, struct dentry *dentry,
+   int datasync)
+{
+   return -EIO;
+}
+
+static inline int bad_file_aio_fsync(struct kiocb *iocb, int datasync)
+{
+   return -EIO;
+}
+
+static inline int bad_file_fasync(int fd, struct file *filp, int on)
+{
+   return -EIO;
+}
+
+static inline int bad_file_lock(struct file *file, int cmd,
+   struct file_lock *fl)
+{
+   return -EIO;
+}
+
+static inline ssize_t bad_file_sendfile(struct file *in_file, loff_t *ppos,
+   size_t 

Re: Finding hardlinks

2007-01-03 Thread Frank van Maarseveen
On Tue, Jan 02, 2007 at 01:04:06AM +0100, Mikulas Patocka wrote:
 
 I didn't hardlink directories, I just patched stat, lstat and fstat to 
 always return st_ino == 0 --- and I've seen those failures. These failures 
 are going to happen on non-POSIX filesystems in real world too, very 
 rarely.

I don't want to spoil your day but testing with st_ino==0 is a bad choice
because it is a special number. Anyway, one can only find breakage,
not prove that all the other programs handle this correctly so this is
kind of pointless.

On any decent filesystem st_ino should uniquely identify an object and
reliably provide hardlink information. The UNIX world has relied upon this
for decades. A filesystem with st_ino collisions without being hardlinked
(or the other way around) needs a fix.

Synthetic filesystems such as /proc are special due to their dynamic
nature and I think st_ino uniqueness is far more important than being able
to provide hardlinks there. Most tree handling programs (cp, rm, ...)
break horribly when the tree underneath changes at the same time.

-- 
Frank


Re: Finding hardlinks

2007-01-03 Thread Mikulas Patocka



On Wed, 3 Jan 2007, Frank van Maarseveen wrote:


On Tue, Jan 02, 2007 at 01:04:06AM +0100, Mikulas Patocka wrote:


I didn't hardlink directories, I just patched stat, lstat and fstat to
always return st_ino == 0 --- and I've seen those failures. These failures
are going to happen on non-POSIX filesystems in real world too, very
rarely.


I don't want to spoil your day but testing with st_ino==0 is a bad choice
because it is a special number. Anyway, one can only find breakage,
not prove that all the other programs handle this correctly so this is
kind of pointless.

On any decent filesystem st_ino should uniquely identify an object and
reliably provide hardlink information. The UNIX world has relied upon this
for decades. A filesystem with st_ino collisions without being hardlinked
(or the other way around) needs a fix.


... and that's the problem --- the UNIX world specified something that
isn't implementable in the real world.


You can take a closed box and say it is POSIX certified --- but how
useful would such a box be if you can't access CDs, diskettes and USB
sticks with it?


Mikulas


Re: Finding hardlinks

2007-01-03 Thread Mikulas Patocka

I didn't hardlink directories, I just patched stat, lstat and fstat to
always return st_ino == 0 --- and I've seen those failures. These failures
are going to happen on non-POSIX filesystems in real world too, very
rarely.


I don't want to spoil your day but testing with st_ino==0 is a bad choice
because it is a special number. Anyway, one can only find breakage,
not prove that all the other programs handle this correctly so this is
kind of pointless.

On any decent filesystem st_ino should uniquely identify an object and
reliably provide hardlink information. The UNIX world has relied upon this
for decades. A filesystem with st_ino collisions without being hardlinked
(or the other way around) needs a fix.


... and that's the problem --- the UNIX world specified something that
isn't implementable in the real world.


Sure it is. Numerous popular POSIX filesystems do that. There is a lot
of inode number space in 64 bits (of course it is only a matter of time
before it jumps to 128 bits and more).


If the filesystem was designed by someone not from the Unix world (FAT,
SMB, ...), then not. And users still want to access these filesystems.


The 64-bit inode number space is not yet implemented on Linux --- the
problem is that if you return ino >= 2^32, programs compiled without
-D_FILE_OFFSET_BITS=64 will fail with stat() returning -EOVERFLOW ---
this failure is specified in POSIX, but not very useful.


Mikulas


Re: Finding hardlinks

2007-01-03 Thread Bryan Henderson
On any decent filesystem st_ino should uniquely identify an object and
reliably provide hardlink information. The UNIX world has relied upon 
this
for decades. A filesystem with st_ino collisions without being hardlinked
(or the other way around) needs a fix.

But for at least the last of those decades, filesystems that could not do 
that were not uncommon.  They had to present 32 bit inode numbers and 
either allowed more than 4G files or just didn't have the means of 
assigning inode numbers with the proper uniqueness to files.  And the sky 
did not fall.  I don't have an explanation why, but it makes it look to me 
like there are worse things than not having total one-one correspondence 
between inode numbers and files.  Having a stat or mount fail because 
inodes are too big, having fewer than 4G files, and waiting for the 
filesystem to generate a suitable inode number might fall in that 
category.

I fully agree that much effort should be put into making inode numbers 
work the way POSIX demands, but I also know that that sometimes requires 
more than just writing some code.

--
Bryan Henderson   San Jose California
IBM Almaden Research Center   Filesystems



Re: Finding hardlinks

2007-01-03 Thread Frank van Maarseveen
On Wed, Jan 03, 2007 at 01:09:41PM -0800, Bryan Henderson wrote:
 On any decent filesystem st_ino should uniquely identify an object and
 reliably provide hardlink information. The UNIX world has relied upon 
 this
 for decades. A filesystem with st_ino collisions without being hardlinked
 (or the other way around) needs a fix.
 
 But for at least the last of those decades, filesystems that could not do 
 that were not uncommon.  They had to present 32 bit inode numbers and 
 either allowed more than 4G files or just didn't have the means of 
 assigning inode numbers with the proper uniqueness to files.  And the sky 
 did not fall.  I don't have an explanation why,

I think it's mostly high-end use, and high-end users tend to understand
more. But we're going to see more really large filesystems in normal
use, so...

Currently, large file support is already necessary to handle dvd and
video. It's also useful for images for virtualization. So the failing stat()
calls should already be a thing of the past with modern distributions.

-- 
Frank


Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

2007-01-03 Thread Andrew Morton
On Thu, 28 Dec 2006 13:53:08 +0530
Suparna Bhattacharya [EMAIL PROTECTED] wrote:

 This patchset implements changes to make filesystem AIO read
 and write asynchronous for the non O_DIRECT case.

Unfortunately the unplugging changes in Jens's block tree have trashed
these patches to a degree that I'm not confident in my repair attempts.
So I'll drop the fsaio patches from -mm.

Zach's observations regarding this code's reliance upon things at *current
sounded pretty serious, so I expect we'd be seeing changes for that anyway?

Plus Jens's unplugging changes add more reliance upon context inside
*current, for the plugging and unplugging operations.  I expect that the
fsaio patches will need to be aware of the protocol which those proposed
changes add.



Re: Finding hardlinks

2007-01-03 Thread Pavel Machek
Hi!

 Sure it is. Numerous popular POSIX filesystems do that. There is a lot of
 inode number space in 64 bit (of course it is a matter of time for it to
 jump to 128 bit and more)
 
 If the filesystem was designed by someone not from Unix world (FAT, SMB, 
 ...), then not. And users still want to access these filesystems.
 
 The 64-bit inode number space is not yet implemented on Linux --- the
 problem is that if you return ino >= 2^32, programs compiled without
 -D_FILE_OFFSET_BITS=64 will fail with stat() returning -EOVERFLOW ---
 this failure is specified in POSIX, but not very useful.

Hehe, can we simply return -EOVERFLOW on VFAT all the time? ...probably
not useful :-(. But the ability to say unknown in the st_ino field
would help...

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [PATCH] fix memory corruption from misinterpreted bad_inode_ops return values

2007-01-03 Thread Stephen Rothwell
Hi Eric,

On Wed, 03 Jan 2007 12:42:47 -0600 Eric Sandeen [EMAIL PROTECTED] wrote:

 So here's the first stab at fixing it.  I'm sure there are style points
 to be hashed out.  Putting all the functions as static inlines in a header
 was just to avoid hundreds of lines of simple function declarations before
 we get to the meat of bad_inode.c, but it's probably technically wrong to
 put it in a header.  Also if putting a copyright on that trivial header file
 is going overboard, just let me know.  Or if anyone has a less verbose
 but still correct way to address this problem, I'm all ears.

Since the only use of these functions is to take their addresses, the
inline gains you nothing; and since the only uses are in the one file,
you should just define them in that file.

--
Cheers,
Stephen Rothwell[EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/




Re: Finding hardlinks

2007-01-03 Thread Frank van Maarseveen
On Thu, Jan 04, 2007 at 12:43:20AM +0100, Mikulas Patocka wrote:
 On Wed, 3 Jan 2007, Frank van Maarseveen wrote:
 Currently, large file support is already necessary to handle dvd and
 video. It's also useful for images for virtualization. So the failing 
 stat()
 calls should already be a thing of the past with modern distributions.
 
 As long as glibc compiles by default with 32-bit ino_t, the problem exists 
 and is severe --- programs handling large files, such as coreutils, tar, 
 mc, mplayer, already compile with 64-bit ino_t and off_t, but the user (or 
 script) may type something like:
 
 cat >file.c <<EOF
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 int main(void)
 {
   int h;
   struct stat st;
   if ((h = creat("foo", 0600)) < 0) perror("creat"), exit(1);
   if (fstat(h, &st)) perror("stat"), exit(1);
   close(h);
   return 0;
 }
 EOF
 gcc file.c; ./a.out
 
 --- and you certainly do not want this to fail (unless you are out of disk 
 space).
 
 The difference is that with a 32-bit program and 64-bit off_t, you get
 a deterministic failure on large files; with a 32-bit program and
 64-bit ino_t, you get random failures.

What's (technically) the problem with changing the gcc default?

Alternatively we could make the error deterministic in various ways.
Start st_ino numbering from 4G (except for a few special ones maybe,
such as root/mounts). Or make old and new programs distinguishable at
the ELF level or by sys_personality(), and/or check against an ino64
mount flag/filesystem feature. Lots of possibilities.

-- 
Frank


Re: [nfsv4] RE: Finding hardlinks

2007-01-03 Thread Trond Myklebust
On Wed, 2007-01-03 at 14:35 +0200, Benny Halevy wrote:
 Believe it or not, but server companies like Panasas try to follow the 
 standard
 when designing and implementing their products while relying on client vendors
 to do the same.

I personally have never given a rat's arse about standards if they make
no sense to me. If the server is capable of knowing about hard links,
then why does it need all this extra crap in the filehandle that just
obfuscates the hard link info?

The bottom line is that nothing in our implementation will result in
such a server performing sub-optimally w.r.t. the client. The only
result is that we will conform to close-to-open semantics instead of
strict POSIX caching semantics when two processes have opened the same
file via different hard links.

 I sincerely expect you or anybody else for this matter to try to provide
 feedback and object to the protocol specification in case they disagree
 with it (or think it's ambiguous or self contradicting) rather than ignoring
 it and implementing something else. I think we're shooting ourselves in the
 foot when doing so and it is in our common interest to strive to reach a
 realistic standard we can all comply with and interoperate with each other.

This has nothing to do with the protocol itself: it has only to do with
caching semantics. As far as caching goes, the only guarantees that NFS
clients give are the close-to-open semantics, and this should indeed be
respected by the implementation in question.

Trond



Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

2007-01-03 Thread Suparna Bhattacharya
On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
 On Thu, 28 Dec 2006 13:53:08 +0530
 Suparna Bhattacharya [EMAIL PROTECTED] wrote:
 
  This patchset implements changes to make filesystem AIO read
  and write asynchronous for the non O_DIRECT case.
 
 Unfortunately the unplugging changes in Jens's block tree have trashed
 these patches to a degree that I'm not confident in my repair attempts.
 So I'll drop the fsaio patches from -mm.

I took a quick look and the conflicts seem pretty minor to me; the
unplugging changes mostly touch nearby code. Please let me know how you
want this fixed up.

From what I can tell the comments in the unplug patches seem to say that
it needs more work and testing, so perhaps a separate fixup patch may be
a better idea rather than make the fsaio patchset dependent on this.

 
 Zach's observations regarding this code's reliance upon things at *current
 sounded pretty serious, so I expect we'd be seeing changes for that anyway?

Not really, at least nothing that I can see needing a change.
As I mentioned there is no reliance on *current in the code that
runs in the aio threads that we need to worry about. 

The generic_write_checks etc. that Zach was referring to all happen in
the context of the submitting process, not in retry context. The model
is to perform all validation at the time of io submission. And of
course things like copy_to_user() are already taken care of by
use_mm().
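
For context, the worker-side pattern in fs/aio.c looks roughly like
this (a paraphrase, not a verbatim excerpt):

/* Paraphrase of the fs/aio.c retry path: the worker temporarily
 * adopts the submitting task's address space so that copy_to_user()
 * resolves against the right mm. */
static void aio_retry_in_worker(struct kioctx *ctx, struct kiocb *iocb)
{
	use_mm(ctx->mm);	/* borrow the submitter's mm */
	aio_run_iocb(iocb);	/* may now safely copy_to_user() */
	unuse_mm(ctx->mm);	/* and give it back */
}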

Let's look at it this way - the kernel already has the ability to do
background writeout on behalf of a task from a kernel thread, and
likewise to read (ahead) pages that may be consumed by another task.
There is also the ability to operate on another task's address space
(as used by ptrace).

So there is nothing groundbreaking here.

In fact on most occasions all the IO is initiated in the context of the
submitting task, so the aio threads mainly deal with checking for
completion and transferring completed data to user space.

 
 Plus Jens's unplugging changes add more reliance upon context inside
 *current, for the plugging and unplugging operations.  I expect that the
 fsaio patches will need to be aware of the protocol which those proposed
 changes add.

Whatever logic applies to background writeout etc. should also just
apply as is to aio worker threads, shouldn't it? At least at a quick
glance I don't see anything special that needs to be done for fsaio,
but it's good to be aware of this anyway, thanks!

Regards
Suparna

 

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

2007-01-03 Thread Nick Piggin

Suparna Bhattacharya wrote:

On Thu, Jan 04, 2007 at 04:51:58PM +1100, Nick Piggin wrote:



So long as AIO threads do the same, there would be no problem (plugging
is optional, of course).



Yup, the AIO threads run the same code as for regular IO, i.e. in the
rare situations where they actually end up submitting IO, so there
should be no problem. And you have already added plug/unplug at the
appropriate places in those paths, so things should just work.


Yes I think it should.


This (is supposed to) give a number of improvements over the traditional
plugging (although some downsides too). Most notably for me, the VM gets
cleaner ;)

However AIO could be an interesting case to test for explicit plugging
because of the way they interact. What kind of improvements do you see
with samba and do you have any benchmark setups?



I think aio-stress would be a good way to test/benchmark this sort of
stuff, at least for a start.

Samba (if I understand this correctly based on my discussions with
Tridge) is less likely to generate the kind of io patterns that could
benefit from explicit plugging (because the file server has no way to
tell what the next request is going to be, it ends up submitting each
independently instead of batching iocbs).


OK, but I think that after IO submission, you do not run sync_page to
unplug the block device, like the normal IO path would (via lock_page,
before the explicit plug patches).

However, with explicit plugging, AIO requests will be started
immediately. Maybe this won't be noticeable if the device is always
busy, but I would like to know there isn't a regression.


In future there may be optimization possibilities to consider when
submitting batches of iocbs, i.e. on the io submission path. Maybe
AIO - O_DIRECT would be interesting to play with first in this regard?


Well I've got some simple per-process batching in there now, each process
has a list of pending requests. Request merging is done locklessly against
the last request added; and submission at unplug time is batched under a
single block device lock.

I'm sure more merging or batching could be done, but also consider that
most programs will not ever make use of any added complexity.

Regarding your patches, I've just had a quick look and have a question --
what do you do about blocking in page reclaim and dirty balancing? Aren't
those major points of blocking with buffered IO? Did your test cases
dirty enough to start writeout or cause a lot of reclaim? (admittedly,
blocking in reclaim will now be much less common since the dirty mapping
accounting).

--
SUSE Labs, Novell Inc.

