Re: Finding hardlinks
Hi!

> > > > the use of a good hash function. The chance of an accidental collision is infinitesimally small. For a set of 100 files: 0.03% 1,000,000 files: 0.03%
> > >
> > > I do not think we want to play with probability like this. I mean... imagine 4G files, 1KB each. That's 4TB disk space, not _completely_ unreasonable, and collision probability is going to be ~100% due to the birthday paradox.
> >
> > You'll still want to back up your 4TB server...
>
> Certainly, but tar isn't going to remember all the inode numbers. Even if you solve the storage requirements (not impossible) it would have to do (4e9^2)/2=8e18 comparisons, which computers don't have enough CPU power for just yet.

Storage requirements would be 16GB of RAM... that's small enough. If you sort, you'll only need 32*2^32 comparisons, and that's doable.

I do not claim it is _likely_. You'd need hardlinks, as you noticed. But the system should work, not merely work with high probability, and I believe we should solve this in the long term.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Finding hardlinks
> > > > > the use of a good hash function. The chance of an accidental collision is infinitesimally small. For a set of 100 files: 0.03% 1,000,000 files: 0.03%
> > > >
> > > > I do not think we want to play with probability like this. I mean... imagine 4G files, 1KB each. That's 4TB disk space, not _completely_ unreasonable, and collision probability is going to be ~100% due to the birthday paradox.
> > >
> > > You'll still want to back up your 4TB server...
> >
> > Certainly, but tar isn't going to remember all the inode numbers. Even if you solve the storage requirements (not impossible) it would have to do (4e9^2)/2=8e18 comparisons, which computers don't have enough CPU power for just yet.
>
> Storage requirements would be 16GB of RAM... that's small enough. If you sort, you'll only need 32*2^32 comparisons, and that's doable.
>
> I do not claim it is _likely_. You'd need hardlinks, as you noticed. But the system should work, not merely work with high probability, and I believe we should solve this in the long term.

High probability is all you have. Cosmic radiation hitting your computer will more likely cause problems than colliding 64-bit inode numbers ;)

But you could add a new interface for the extra paranoid. The proposed samefile(fd1, fd2) syscall is severely limited by the heavy weight of file descriptors.

Another idea is to export the filesystem-internal ID as an arbitrary length cookie through the extended attribute interface. That could be stored/compared by the filesystem quite efficiently.

But I think most apps will still opt for the portable interfaces which, while not perfect, are good enough.

Miklos
Re: [nfsv4] RE: Finding hardlinks
Trond Myklebust wrote:
> On Sun, 2006-12-31 at 16:25 -0500, Halevy, Benny wrote:
> > Trond Myklebust wrote:
> > > On Thu, 2006-12-28 at 15:07 -0500, Halevy, Benny wrote:
> > > > Mikulas Patocka wrote:
> > > > > BTW. how does (or how should?) NFS client deal with cache coherency if filehandles for the same file differ?
> > > >
> > > > Trond can probably answer this better than me... As I read it, currently the nfs client matches both the fileid and the filehandle (in nfs_find_actor). This means that different filehandles for the same file would result in different inodes :(. Strictly following the nfs protocol, comparing only the fileid should be enough IF fileids are indeed unique within the filesystem. Comparing the filehandle works as a workaround when the exported filesystem (or the nfs server) violates that. From a user standpoint I think that this should be configurable, probably per mount point.
> > >
> > > Matching files by fileid instead of filehandle is a lot more trouble since fileids may be reused after a file has been deleted. Every time you look up a file, and get a new filehandle for the same fileid, you would at the very least have to do another GETATTR using one of the 'old' filehandles in order to ensure that the file is the same object as the one you have cached. Then there is the issue of what to do when you open(), read() or write() to the file: which filehandle do you use, are the access permissions the same for all filehandles, ... All in all, much pain for little or no gain.
> >
> > See my answer to your previous reply. It seems like the current implementation is in violation of the nfs protocol and the extra pain is required.
>
> ...and we should care because...?
>
> Trond

Believe it or not, but server companies like Panasas try to follow the standard when designing and implementing their products while relying on client vendors to do the same.

I sincerely expect you or anybody else for this matter to try to provide feedback and object to the protocol specification in case they disagree with it (or think it's ambiguous or self contradicting) rather than ignoring it and implementing something else. I think we're shooting ourselves in the foot when doing so and it is in our common interest to strive to reach a realistic standard we can all comply with and interoperate with each other.

Benny
Re: Finding hardlinks
Hi!

> > I do not claim it is _likely_. You'd need hardlinks, as you noticed. But the system should work, not merely work with high probability, and I believe we should solve this in the long term.
>
> High probability is all you have. Cosmic radiation hitting your computer will more likely cause problems than colliding 64-bit inode numbers ;)

As I have shown... no, that's not right. 32*2^32 operations is small enough not to have problems with cosmic radiation.

> But you could add a new interface for the extra paranoid. The proposed samefile(fd1, fd2) syscall is severely limited by the heavy weight of file descriptors.

I guess that is the way to go. samefile(path1, path2) is unfortunately inherently racy.

> Another idea is to export the filesystem-internal ID as an arbitrary length cookie through the extended attribute interface. That could be stored/compared by the filesystem quite efficiently.

How will that work for FAT? Or maybe we can relax the rule that an inode number may not change over rename, and that zero-length files need unique inode numbers...
Pavel
Re: Finding hardlinks
Hello!

> High probability is all you have. Cosmic radiation hitting your computer will more likely cause problems than colliding 64-bit inode numbers ;)

No. If you assign 64-bit inode numbers randomly, 2^32 of them are sufficient to generate a collision with probability around 50%.

Have a nice fortnight
-- 
Martin `MJ' Mares [EMAIL PROTECTED] http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
A Bash poem: time for echo in canyon; do echo $echo $echo; done
Re: [RFC] Heads up on a series of AIO patchsets
On Tue, Jan 02, 2007 at 02:38:13PM -0700, Dan Williams ([EMAIL PROTECTED]) wrote:
> Would you have time to comment on the approach I have taken to implement a standard asynchronous memcpy interface? It seems it would be a good complement to what you are proposing. The entity that describes the aio operation could take advantage of asynchronous engines to carry out copies or other transforms (maybe an acrypto tie-in as well). Here is the posting for 2.6.19. There have since been updates for 2.6.20, but the overall approach remains the same.
> intro: http://marc.theaimsgroup.com/?l=linux-raid&m=116491661527161&w=2
> async_tx: http://marc.theaimsgroup.com/?l=linux-raid&m=116491753318175&w=2

My first impression is that it has too many lists :)

Looks good, but IMHO there are further steps to implement. I have not found any kind of scheduler there - what if the system has two async engines? What if the sync engine is faster than the async one in some cases (and it is indeed the case for small buffers), and should be selected at that time? What if you want to add additional transformations for some devices, like crypto processing or checksumming?

I would just create a driver for the low-level engine and export its functionality - iop3_async_copy(), iop3_async_checksum(), iop3_async_crypto_1(), iop3_async_crypto_2() and so on. There will be a lot of potential users of exactly that functionality, but not strictly hardcoded higher-layer operations like raidX. A more generic solution must be used to select the appropriate device.
We had a very brief discussion about the asynchronous crypto layer (acrypto) and how its ideas could be used for async dma engines - the user should not even know how his data has been transferred - it calls async_copy(), which selects an appropriate device (and sync copy is just an additional usual device in that case) from the list of devices that exported its functionality. Selection can be done in millions of different ways, from getting the first one from the list (this is essentially how your approach is implemented right now) to using special (including run-time updated) heuristics (like it is done in acrypto).

Thinking further, async_copy() is just a usual case for the async class of operations. So the same logic above must be applied on this layer too.

But 'layers are the way to design protocols, not implement them' - David Miller on netchannels.

So the user should not even know about layers - it should just say 'copy data from pointer A to pointer B', or 'copy data from pointer A to socket B', or even 'copy it from file /tmp/file to 192.168.0.1:80:tcp', without ever knowing that there are sockets and/or memcpy() calls, and if the user requests to perform it asynchronously, it must be notified later (one might expect that I will prefer to use kevent :). The same approach thus can be used by NFS/SAMBA/CIFS and other users.

That is how I start to implement AIO (it looks like it becomes popular):
1. system exports the set of operations it supports (send, receive, copy, crypto, ...)
2. each operation has a subsequent set of suboptions (different crypto types, for example)
3. each operation has a set of low-level drivers which support it (with optional performance or any other parameters)
4. each driver when loaded publishes its capabilities (async copy with speed A, XOR and so on)

From the user's point of view, aio_sendfile() or async_copy() will look like the following:
1. call aio_schedule_pointer(source='0xaabbccdd', dest='0x123456578')
1. call aio_schedule_file_socket(source='/tmp/file', dest='socket')
1. call aio_schedule_file_addr(source='/tmp/file', dest='192.168.0.1:80:tcp')
or any other similar call, then wait for the received descriptor in kevent_get_events() or provide your own cookie in each call.

Each request is then converted into a FIFO of smaller requests like 'open file', 'open socket', 'get in user pages' and so on, each of which should be handled on an appropriate device (hardware or software); completeness of each request starts processing of the next one.

Reading microthreading design notes I recall a comparison of the NPTL and Erlang threading models on the Debian site - they are _completely_ different models; NPTL creates real threads, which is supposed (I hope NOT) to be implemented in the microthreading design too. It is slow. (Or is it not, Zach, we are intrigued :) It's damn bloody slow to create a thread compared to the correct non-blocking state machine. The TUX state machine is similar to what I had in my first kevent based FS and network AIO patchset, and what I will use for the current async processing work.

A bit of empty words actually, but it can provide some food for thought.

> Regards, Dan

-- 
Evgeniy Polyakov
Re: Finding hardlinks
On Wed, Jan 03, 2007 at 01:33:31PM +0100, Miklos Szeredi wrote:
> High probability is all you have. Cosmic radiation hitting your computer will more likely cause problems than colliding 64-bit inode numbers ;)

Some of us have machines designed to cope with cosmic rays, and would be unimpressed with a decrease in reliability.
[PATCH] fix memory corruption from misinterpreted bad_inode_ops return values
CVE-2006-5753 is for a case where an inode can be marked bad, switching the ops to bad_inode_ops, which are all connected as:

static int return_EIO(void) { return -EIO; }
#define EIO_ERROR ((void *) (return_EIO))
static struct inode_operations bad_inode_ops = {
	.create = bad_inode_create
...etc...

The problem here is that the void cast causes return types to not be promoted, and for ops such as listxattr which expect more than 32 bits of return value, the 32-bit -EIO is interpreted as a large positive 64-bit number, i.e. 0x00000000fffffffa instead of 0xfffffffffffffffa. This goes particularly badly when the return value is taken as a number of bytes to copy into, say, a user's buffer for example...

I originally had coded up the fix by creating a return_EIO_TYPE macro for each return type, like this:

static int return_EIO_int(void) { return -EIO; }
#define EIO_ERROR_INT ((void *) (return_EIO_int))
static struct inode_operations bad_inode_ops = {
	.create = EIO_ERROR_INT,
...etc...

but Al felt that it was probably better to create an EIO-returner for each actual op signature. Since so few ops share a signature, I just went ahead and created an EIO function for each individual file and inode op that returns a value.

So here's the first stab at fixing it. I'm sure there are style points to be hashed out. Putting all the functions as static inlines in a header was just to avoid hundreds of lines of simple function declarations before we get to the meat of bad_inode.c, but it's probably technically wrong to put it in a header. Also if putting a copyright on that trivial header file is going overboard, just let me know. Or if anyone has a less verbose but still correct way to address this problem, I'm all ears.

Thanks,
-Eric

Signed-off-by: Eric Sandeen [EMAIL PROTECTED]

Index: linux-2.6.20-rc3/fs/bad_inode.h
===================================================================
--- /dev/null
+++ linux-2.6.20-rc3/fs/bad_inode.h
@@ -0,0 +1,266 @@
+/* fs/bad_inode.h
+ * bad_inode / bad_file internal definitions
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by Eric Sandeen ([EMAIL PROTECTED])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+/* Bad file ops */
+
+static inline loff_t bad_file_llseek(struct file *file, loff_t offset,
+			int origin)
+{
+	return -EIO;
+}
+
+static inline ssize_t bad_file_read(struct file *filp, char __user *buf,
+			size_t size, loff_t *ppos)
+{
+	return -EIO;
+}
+
+static inline ssize_t bad_file_write(struct file *filp, const char __user *buf,
+			size_t siz, loff_t *ppos)
+{
+	return -EIO;
+}
+
+static inline ssize_t bad_file_aio_read(struct kiocb *iocb,
+			const struct iovec *iov, unsigned long nr_segs, loff_t pos)
+{
+	return -EIO;
+}
+
+static inline ssize_t bad_file_aio_write(struct kiocb *iocb,
+			const struct iovec *iov, unsigned long nr_segs, loff_t pos)
+{
+	return -EIO;
+}
+
+static inline int bad_file_readdir(struct file * filp, void * dirent,
+			filldir_t filldir)
+{
+	return -EIO;
+}
+
+static inline unsigned int bad_file_poll(struct file *filp, poll_table *wait)
+{
+	return -EIO;
+}
+
+static inline int bad_file_ioctl (struct inode * inode, struct file * filp,
+			unsigned int cmd, unsigned long arg)
+{
+	return -EIO;
+}
+
+static inline long bad_file_unlocked_ioctl(struct file *file, unsigned cmd,
+			unsigned long arg)
+{
+	return -EIO;
+}
+
+static inline long bad_file_compat_ioctl(struct file *file, unsigned int cmd,
+			unsigned long arg)
+{
+	return -EIO;
+}
+
+static inline int bad_file_mmap(struct file * file, struct vm_area_struct * vma)
+{
+	return -EIO;
+}
+
+static inline int bad_file_open(struct inode * inode, struct file * filp)
+{
+	return -EIO;
+}
+
+static inline int bad_file_flush(struct file *file, fl_owner_t id)
+{
+	return -EIO;
+}
+
+static inline int bad_file_release(struct inode * inode, struct file * filp)
+{
+	return -EIO;
+}
+
+static inline int bad_file_fsync(struct file * file, struct dentry *dentry,
+			int datasync)
+{
+	return -EIO;
+}
+
+static inline int bad_file_aio_fsync(struct kiocb *iocb, int datasync)
+{
+	return -EIO;
+}
+
+static inline int bad_file_fasync(int fd, struct file *filp, int on)
+{
+	return -EIO;
+}
+
+static inline int bad_file_lock(struct file *file, int cmd,
+			struct file_lock *fl)
+{
+	return -EIO;
+}
+
+static inline ssize_t bad_file_sendfile(struct file *in_file, loff_t *ppos,
+			size_t
Re: Finding hardlinks
On Tue, Jan 02, 2007 at 01:04:06AM +0100, Mikulas Patocka wrote:
> I didn't hardlink directories, I just patched stat, lstat and fstat to always return st_ino == 0 --- and I've seen those failures. These failures are going to happen on non-POSIX filesystems in real world too, very rarely.

I don't want to spoil your day but testing with st_ino==0 is a bad choice because it is a special number. Anyway, one can only find breakage, not prove that all the other programs handle this correctly, so this is kind of pointless.

On any decent filesystem st_ino should uniquely identify an object and reliably provide hardlink information. The UNIX world has relied upon this for decades. A filesystem with st_ino collisions without being hardlinked (or the other way around) needs a fix.

Synthetic filesystems such as /proc are special due to their dynamic nature and I think st_ino uniqueness is far more important than being able to provide hardlinks there. Most tree handling programs (cp, rm, ...) break horribly when the tree underneath changes at the same time.

-- 
Frank
Re: Finding hardlinks
On Wed, 3 Jan 2007, Frank van Maarseveen wrote:
> On Tue, Jan 02, 2007 at 01:04:06AM +0100, Mikulas Patocka wrote:
> > I didn't hardlink directories, I just patched stat, lstat and fstat to always return st_ino == 0 --- and I've seen those failures. These failures are going to happen on non-POSIX filesystems in real world too, very rarely.
>
> I don't want to spoil your day but testing with st_ino==0 is a bad choice because it is a special number. Anyway, one can only find breakage, not prove that all the other programs handle this correctly, so this is kind of pointless. On any decent filesystem st_ino should uniquely identify an object and reliably provide hardlink information. The UNIX world has relied upon this for decades. A filesystem with st_ino collisions without being hardlinked (or the other way around) needs a fix.

... and that's the problem --- the UNIX world specified something that isn't implementable in the real world. You can take a closed box and say "this is POSIX certified" --- but how useful would such a box be if you can't access CDs, diskettes and USB sticks with it?

Mikulas
Re: Finding hardlinks
> > > > I didn't hardlink directories, I just patched stat, lstat and fstat to always return st_ino == 0 --- and I've seen those failures. These failures are going to happen on non-POSIX filesystems in real world too, very rarely.
> > >
> > > I don't want to spoil your day but testing with st_ino==0 is a bad choice because it is a special number. Anyway, one can only find breakage, not prove that all the other programs handle this correctly, so this is kind of pointless. On any decent filesystem st_ino should uniquely identify an object and reliably provide hardlink information. The UNIX world has relied upon this for decades. A filesystem with st_ino collisions without being hardlinked (or the other way around) needs a fix.
> >
> > ... and that's the problem --- the UNIX world specified something that isn't implementable in the real world.
>
> Sure it is. Numerous popular POSIX filesystems do that. There is a lot of inode number space in 64 bit (of course it is a matter of time for it to jump to 128 bit and more)

If the filesystem was designed by someone not from the Unix world (FAT, SMB, ...), then not. And users still want to access these filesystems. A 64-bit inode number space is not yet implemented on Linux --- the problem is that if you return ino >= 2^32, programs compiled without -D_FILE_OFFSET_BITS=64 will fail with stat() returning -EOVERFLOW --- this failure is specified in POSIX, but not very useful.

Mikulas
Re: Finding hardlinks
> On any decent filesystem st_ino should uniquely identify an object and reliably provide hardlink information. The UNIX world has relied upon this for decades. A filesystem with st_ino collisions without being hardlinked (or the other way around) needs a fix.

But for at least the last of those decades, filesystems that could not do that were not uncommon. They had to present 32 bit inode numbers and either allowed more than 4G files or just didn't have the means of assigning inode numbers with the proper uniqueness to files. And the sky did not fall. I don't have an explanation why, but it makes it look to me like there are worse things than not having total one-one correspondence between inode numbers and files. Having a stat or mount fail because inodes are too big, having fewer than 4G files, and waiting for the filesystem to generate a suitable inode number might fall in that category.

I fully agree that much effort should be put into making inode numbers work the way POSIX demands, but I also know that that sometimes requires more than just writing some code.

-- 
Bryan Henderson
San Jose California
IBM Almaden Research Center
Filesystems
Re: Finding hardlinks
On Wed, Jan 03, 2007 at 01:09:41PM -0800, Bryan Henderson wrote:
> > On any decent filesystem st_ino should uniquely identify an object and reliably provide hardlink information. The UNIX world has relied upon this for decades. A filesystem with st_ino collisions without being hardlinked (or the other way around) needs a fix.
>
> But for at least the last of those decades, filesystems that could not do that were not uncommon. They had to present 32 bit inode numbers and either allowed more than 4G files or just didn't have the means of assigning inode numbers with the proper uniqueness to files. And the sky did not fall. I don't have an explanation why,

I think it's mostly high end use, and high end users tend to understand more. But we're going to see more really large filesystems in normal use so...

Currently, large file support is already necessary to handle dvd and video. It's also useful for images for virtualization. So the failing stat() calls should already be a thing of the past with modern distributions.

-- 
Frank
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Thu, 28 Dec 2006 13:53:08 +0530 Suparna Bhattacharya [EMAIL PROTECTED] wrote:
> This patchset implements changes to make filesystem AIO read and write asynchronous for the non O_DIRECT case.

Unfortunately the unplugging changes in Jens's block tree have trashed these patches to a degree that I'm not confident in my repair attempts. So I'll drop the fsaio patches from -mm.

Zach's observations regarding this code's reliance upon things at *current sounded pretty serious, so I expect we'd be seeing changes for that anyway? Plus Jens's unplugging changes add more reliance upon context inside *current, for the plugging and unplugging operations. I expect that the fsaio patches will need to be aware of the protocol which those proposed changes add.
Re: Finding hardlinks
Hi!

> > Sure it is. Numerous popular POSIX filesystems do that. There is a lot of inode number space in 64 bit (of course it is a matter of time for it to jump to 128 bit and more)
>
> If the filesystem was designed by someone not from the Unix world (FAT, SMB, ...), then not. And users still want to access these filesystems. A 64-bit inode number space is not yet implemented on Linux --- the problem is that if you return ino >= 2^32, programs compiled without -D_FILE_OFFSET_BITS=64 will fail with stat() returning -EOVERFLOW --- this failure is specified in POSIX, but not very useful.

Hehe, can we simply -EOVERFLOW on VFAT all the time? ...probably not useful :-(. But the ability to say "unknown" in the st_ino field would help...

Pavel
Re: [PATCH] fix memory corruption from misinterpreted bad_inode_ops return values
Hi Eric,

On Wed, 03 Jan 2007 12:42:47 -0600 Eric Sandeen [EMAIL PROTECTED] wrote:
> So here's the first stab at fixing it. I'm sure there are style points to be hashed out. Putting all the functions as static inlines in a header was just to avoid hundreds of lines of simple function declarations before we get to the meat of bad_inode.c, but it's probably technically wrong to put it in a header. Also if putting a copyright on that trivial header file is going overboard, just let me know. Or if anyone has a less verbose but still correct way to address this problem, I'm all ears.

Since the only use of these functions is to take their addresses, the inline gains you nothing, and since the only uses are in the one file, you should just define them in that file.

-- 
Cheers,
Stephen Rothwell [EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/
Re: Finding hardlinks
On Thu, Jan 04, 2007 at 12:43:20AM +0100, Mikulas Patocka wrote:
> On Wed, 3 Jan 2007, Frank van Maarseveen wrote:
> > Currently, large file support is already necessary to handle dvd and video. It's also useful for images for virtualization. So the failing stat() calls should already be a thing of the past with modern distributions.
>
> As long as glibc compiles by default with 32-bit ino_t, the problem exists and is severe --- programs handling large files, such as coreutils, tar, mc, mplayer, already compile with 64-bit ino_t and off_t, but the user (or script) may type something like:
>
> cat >file.c <<EOF
> #include <sys/types.h>
> #include <sys/stat.h>
> main()
> {
> 	int h;
> 	struct stat st;
> 	if ((h = creat("foo", 0600)) < 0)
> 		perror("creat"), exit(1);
> 	if (fstat(h, &st))
> 		perror("stat"), exit(1);
> 	close(h);
> 	return 0;
> }
> EOF
> gcc file.c; ./a.out
>
> --- and you certainly do not want this to fail (unless you are out of disk space). The difference is, that with a 32-bit program and 64-bit off_t, you get deterministic failure on large files; with a 32-bit program and 64-bit ino_t, you get random failures.

What's (technically) the problem with changing the gcc default?

Alternatively we could make the error deterministic in various ways. Start st_ino numbering from 4G (except for a few special ones maybe, such as root/mounts). Or make old and new programs look different at the ELF level or by sys_personality(), and/or check against an ino64 mount flag/filesystem feature. Lots of possibilities.

-- 
Frank
Re: [nfsv4] RE: Finding hardlinks
On Wed, 2007-01-03 at 14:35 +0200, Benny Halevy wrote:
> Believe it or not, but server companies like Panasas try to follow the standard when designing and implementing their products while relying on client vendors to do the same.

I personally have never given a rats arse about standards if they make no sense to me. If the server is capable of knowing about hard links, then why does it need all this extra crap in the filehandle that just obfuscates the hard link info?

The bottom line is that nothing in our implementation will result in such a server performing sub-optimally w.r.t. the client. The only result is that we will conform to close-to-open semantics instead of strict POSIX caching semantics when two processes have opened the same file via different hard links.

> I sincerely expect you or anybody else for this matter to try to provide feedback and object to the protocol specification in case they disagree with it (or think it's ambiguous or self contradicting) rather than ignoring it and implementing something else. I think we're shooting ourselves in the foot when doing so and it is in our common interest to strive to reach a realistic standard we can all comply with and interoperate with each other.

This has nothing to do with the protocol itself: it has only to do with caching semantics. As far as caching goes, the only guarantees that NFS clients give are the close-to-open semantics, and this should indeed be respected by the implementation in question.

Trond
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote: On Thu, 28 Dec 2006 13:53:08 +0530 Suparna Bhattacharya [EMAIL PROTECTED] wrote: This patchset implements changes to make filesystem AIO read and write asynchronous for the non O_DIRECT case. Unfortunately the unplugging changes in Jen's block tree have trashed these patches to a degree that I'm not confident in my repair attempts. So I'll drop the fasio patches from -mm. I took a quick look and the conflicts seem pretty minor to me, the unplugging changes mostly touch nearby code. Please let know how you want this fixed up. From what I can tell the comments in the unplug patches seem to say that it needs more work and testing, so perhaps a separate fixup patch may be a better idea rather than make the fsaio patchset dependent on this. Zach's observations regarding this code's reliance upon things at *current sounded pretty serious, so I expect we'd be seeing changes for that anyway? Not really, at least nothing that I can see needing a change. As I mentioned there is no reliance on *current in the code that runs in the aio threads that we need to worry about. The generic_write_checks etc that Zach was referring to all happens in the context of submitting process, not in retry context. The model is to perform all validation at the time of io submission. And of course things like copy_to_user() are already taken care of by use_mm(). Lets look at it this way - the kernel already has the ability to do background writeout on behalf of a task from a kernel thread and likewise read(ahead) pages that may be consumed by another task. There is also the ability to operate another task's address space (as used by ptrace). So there is nothing groundbreaking here. In fact on most occasions, all the IO is initiated in the context of the submitting task, so the aio threads mainly deal with checking for completion and transfering completed data to user space. 
> Plus Jens's unplugging changes add more reliance upon context inside
> *current, for the plugging and unplugging operations. I expect that the
> fsaio patches will need to be aware of the protocol which those proposed
> changes add.

Whatever logic applies to background writeout etc. should also just apply as-is to AIO worker threads, shouldn't it? At least at a quick glance I don't see anything special that needs to be done for fsaio, but it's good to be aware of this anyway, thanks!

Regards
Suparna
--
To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to [EMAIL PROTECTED]
For more info on Linux AIO, see: http://www.kvack.org/aio/
--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
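[Editor's aside: the submission-time-validation model described above can be sketched in miniature. This is a hypothetical userspace analogy, not kernel code: `submit()` stands in for the syscall path doing its generic_write_checks-style validation in the submitting task's context, and the worker thread stands in for an AIO thread that only continues an already-validated transfer and never needs the submitter's *current.]

```python
# Sketch of "validate at submission, complete in a worker" (invented
# names; a userspace analogy for the fsaio retry model, not kernel code).
import queue
import threading

class Iocb:
    def __init__(self, buf, offset, length):
        self.buf, self.offset, self.length = buf, offset, length
        self.result = None
        self.done = threading.Event()

FILE_SIZE = 100  # pretend file size, for the range check below

def submit(q, iocb):
    # All validation happens here, synchronously, in the submitter's
    # context -- analogous to generic_write_checks and friends.
    if iocb.offset < 0 or iocb.offset + iocb.length > FILE_SIZE:
        raise ValueError("EINVAL: range check failed at submission time")
    q.put(iocb)              # hand the pre-validated iocb to a worker

def worker(q):
    while True:
        iocb = q.get()
        if iocb is None:
            return
        # No re-validation in retry context: just finish the transfer.
        iocb.result = iocb.length
        iocb.done.set()

q = queue.Queue()
t = threading.Thread(target=worker, args=(q,))
t.start()

ok = Iocb(b"x" * 10, offset=0, length=10)
submit(q, ok)
ok.done.wait()

try:
    submit(q, Iocb(b"x" * 10, offset=95, length=10))
    rejected = False
except ValueError:
    rejected = True          # bad request never reaches a worker

q.put(None)
t.join()
print(ok.result, rejected)
```

The point of the structure is the one made in the mail: the asynchronous part of the path has nothing left to check against the submitting task's state.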
Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write
Suparna Bhattacharya wrote:
> On Thu, Jan 04, 2007 at 04:51:58PM +1100, Nick Piggin wrote:
> > So long as AIO threads do the same, there would be no problem (plugging
> > is optional, of course).
>
> Yup, the AIO threads run the same code as for regular IO, i.e. in the
> rare situations where they actually end up submitting IO, so there
> should be no problem. And you have already added plug/unplug at the
> appropriate places in those paths, so things should just work.

Yes, I think it should. This (is supposed to) give a number of improvements over the traditional plugging (although some downsides too). Most notably for me, the VM gets cleaner ;) However, AIO could be an interesting case to test for explicit plugging because of the way they interact.

> > What kind of improvements do you see with samba and do you have any
> > benchmark setups?
>
> I think aio-stress would be a good way to test/benchmark this sort of
> stuff, at least for a start. Samba (if I understand this correctly based
> on my discussions with Tridge) is less likely to generate the kind of IO
> patterns that could benefit from explicit plugging (because the file
> server has no way to tell what the next request is going to be, it ends
> up submitting each iocb independently instead of batching them).

OK, but I think that after IO submission you do not run sync_page to unplug the block device, like the normal IO path would (via lock_page, before the explicit plug patches). However, with explicit plugging, AIO requests will be started immediately. Maybe this won't be noticeable if the device is always busy, but I would like to know there isn't a regression.

> In future there may be optimization possibilities to consider when
> submitting batches of iocbs, i.e. on the IO submission path. Maybe
> AIO + O_DIRECT would be interesting to play with first in this regard?

Well, I've got some simple per-process batching in there now; each process has a list of pending requests.
Request merging is done locklessly against the last request added, and submission at unplug time is batched under a single block device lock. I'm sure more merging or batching could be done, but also consider that most programs will never make use of any added complexity.

Regarding your patches, I've just had a quick look and have a question: what do you do about blocking in page reclaim and dirty balancing? Aren't those major points of blocking with buffered IO? Did your test cases dirty enough to start writeout or cause a lot of reclaim? (Admittedly, blocking in reclaim will now be much less common since the dirty mapping accounting.)

--
SUSE Labs, Novell Inc.
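[Editor's aside: the per-process batching scheme described above can be illustrated with a toy model. All names here are invented, and this is only a sketch of the idea — it resembles what later landed in Linux as the blk_start_plug/blk_finish_plug machinery. A new request is merge-tested only against the last request on the per-process plug list (which needs no lock, being per-process), and the whole list is submitted as one batch at unplug time.]

```python
# Toy sketch of per-process plugging with merge-against-last (invented
# names, not real block-layer code).

class Request:
    def __init__(self, start, nsect):
        self.start, self.nsect = start, nsect   # sector range

class Plug:
    def __init__(self):
        self.pending = []    # per-process list, so no locking needed here
        self.batches = 0     # how many batched submissions we have done

    def add(self, start, nsect):
        if self.pending:
            last = self.pending[-1]
            # Merge-test only against the most recently added request:
            # back-merge if the new IO starts where the last one ends.
            if last.start + last.nsect == start:
                last.nsect += nsect
                return
        self.pending.append(Request(start, nsect))

    def unplug(self):
        # The real scheme would take the block device lock once and
        # submit the whole batch under it.
        self.batches += 1
        submitted, self.pending = self.pending, []
        return submitted

plug = Plug()
for start in (0, 8, 16):     # three contiguous 8-sector writes
    plug.add(start, 8)
plug.add(100, 8)             # non-contiguous: starts a new request

reqs = plug.unplug()
print([(r.start, r.nsect) for r in reqs], plug.batches)
```

The three contiguous writes collapse into a single 24-sector request, and the lone discontiguous write becomes a second request; both go to the device in one batch.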