Re: [PATCH 8/11] eCryptfs: Convert mmap functions to use persistent file

2007-09-19 Thread Andrew Morton
On Mon, 17 Sep 2007 16:50:16 -0500 Michael Halcrow <[EMAIL PROTECTED]> wrote:

> Convert readpage, prepare_write, and commit_write to use read_write.c
> routines. Remove sync_page; I cannot think of a good reason for
> implementing that in eCryptfs.
> 
> Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>
> ---
>  fs/ecryptfs/mmap.c |  199 
> +++-
>  1 files changed, 103 insertions(+), 96 deletions(-)
> 
> diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c
> index 60e635e..dd68dd3 100644
> --- a/fs/ecryptfs/mmap.c
> +++ b/fs/ecryptfs/mmap.c
> @@ -267,9 +267,78 @@ static void set_header_info(char *page_virt,
>  }
>  
>  /**
> + * ecryptfs_copy_up_encrypted_with_header
> + * @page: Sort of a ``virtual'' representation of the encrypted lower
> + *file. The actual lower file does not have the metadata in
> + *the header.
> + * @crypt_stat: The eCryptfs inode's cryptographic context
> + *
> + * The ``view'' is the version of the file that userspace winds up
> + * seeing, with the header information inserted.
> + */
> +static int
> +ecryptfs_copy_up_encrypted_with_header(struct page *page,
> +struct ecryptfs_crypt_stat *crypt_stat)
> +{
> + loff_t extent_num_in_page = 0;
> + loff_t num_extents_per_page = (PAGE_CACHE_SIZE
> +/ crypt_stat->extent_size);
> + int rc = 0;
> +
> + while (extent_num_in_page < num_extents_per_page) {
> + loff_t view_extent_num = ((page->index * num_extents_per_page)

overflow?

> +   + extent_num_in_page);
> +
> + if (view_extent_num < crypt_stat->num_header_extents_at_front) {
> + /* This is a header extent */
> + char *page_virt;
> +
> + page_virt = kmap_atomic(page, KM_USER0);
> + memset(page_virt, 0, PAGE_CACHE_SIZE);
> + /* TODO: Support more than one header extent */
> + if (view_extent_num == 0) {
> + rc = ecryptfs_read_xattr_region(
> + page_virt, page->mapping->host);
> + set_header_info(page_virt, crypt_stat);
> + }
> + kunmap_atomic(page_virt, KM_USER0);
> + flush_dcache_page(page);
> + if (rc) {
> + ClearPageUptodate(page);
> + printk(KERN_ERR "%s: Error reading xattr "
> +"region; rc = [%d]\n", __FUNCTION__, rc);
> + goto out;
> + }
> + SetPageUptodate(page);

I don't know what sort of page `page' refers to here, but normally we only
manipulate the page uptodate status under lock_page().


> + } else {
> + /* This is an encrypted data extent */
> + loff_t lower_offset =
> + ((view_extent_num -
> +   crypt_stat->num_header_extents_at_front)
> +  * crypt_stat->extent_size);

overflow?

> + rc = ecryptfs_read_lower_page_segment(
> + page, (lower_offset >> PAGE_CACHE_SHIFT),
> + (lower_offset & ~PAGE_CACHE_MASK),
> + crypt_stat->extent_size, page->mapping->host);
> + if (rc) {
> + printk(KERN_ERR "%s: Error attempting to read "
> +"extent at offset [%lld] in the lower "
> +"file; rc = [%d]\n", __FUNCTION__,
> +lower_offset, rc);
> + goto out;
> + }
> + }
> + extent_num_in_page++;
> + }
> +out:
> + return rc;
> +}
> +
> +/**
>   * ecryptfs_readpage
> - * @file: This is an ecryptfs file
> - * @page: ecryptfs associated page to stick the read data into
> + * @file: An eCryptfs file
> + * @page: Page from eCryptfs inode mapping into which to stick the read data
>   *
>   * Read in a page, decrypting if necessary.
>   *
> @@ -277,59 +346,35 @@ static void set_header_info(char *page_virt,
>   */
>  static int ecryptfs_readpage(struct file *file, struct page *page)
>  {
> + struct ecryptfs_crypt_stat *crypt_stat =
> + 
> &ecryptfs_inode_to_private(file->f_path.dentry->d_inode)->crypt_stat;
>   int rc = 0;
> - struct ecryptfs_crypt_stat *crypt_stat;
>  
> - BUG_ON(!(file && file->f_path.dentry && file->f_path.dentry->d_inode));
> - crypt_stat = &ecryptfs_inode_to_private(file->f_path.dentry->d_inode)
> - ->crypt_stat;
>   if (!crypt_stat
>   || !(crypt_stat->flags & ECRYPTFS_ENCRYPTED)
>   || (cr

Re: [PATCH 6/11] eCryptfs: Update metadata read/write functions

2007-09-19 Thread Andrew Morton
On Mon, 17 Sep 2007 16:48:44 -0500 Michael Halcrow <[EMAIL PROTECTED]> wrote:

> + if ((rc = ecryptfs_write_lower(ecryptfs_dentry->d_inode,

checkpatch missed the assignment-in-an-if here.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/11] eCryptfs: Replace encrypt, decrypt, and inode size write

2007-09-19 Thread Andrew Morton
On Mon, 17 Sep 2007 16:47:10 -0500 Michael Halcrow <[EMAIL PROTECTED]> wrote:

> Replace page encryption and decryption routines and inode size write
> routine with versions that utilize the read_write.c functions.
> 
> Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>
> ---
>  fs/ecryptfs/crypto.c  |  427 ++--
>  fs/ecryptfs/ecryptfs_kernel.h |   14 +-
>  fs/ecryptfs/inode.c   |   12 +-
>  fs/ecryptfs/mmap.c|  131 -
>  fs/ecryptfs/read_write.c  |   12 +-
>  5 files changed, 290 insertions(+), 306 deletions(-)
> 
> diff --git a/fs/ecryptfs/crypto.c b/fs/ecryptfs/crypto.c
> index 5d8a553..b829d3c 100644
> --- a/fs/ecryptfs/crypto.c
> +++ b/fs/ecryptfs/crypto.c
> @@ -467,8 +467,91 @@ out:
>  }
>  
>  /**
> + * ecryptfs_lower_offset_for_extent
> + *
> + * Convert an eCryptfs page index into a lower byte offset
> + */
> +void ecryptfs_lower_offset_for_extent(loff_t *offset, loff_t extent_num,
> +   struct ecryptfs_crypt_stat *crypt_stat)
> +{
> + (*offset) = ((crypt_stat->extent_size
> +   * crypt_stat->num_header_extents_at_front)
> +  + (crypt_stat->extent_size * extent_num));
> +}
> +
> +/**
> + * ecryptfs_encrypt_extent
> + * @enc_extent_page: Allocated page into which to encrypt the data in
> + *   @page
> + * @crypt_stat: crypt_stat containing cryptographic context for the
> + *  encryption operation
> + * @page: Page containing plaintext data extent to encrypt
> + * @extent_offset: Page extent offset for use in generating IV
> + *
> + * Encrypts one extent of data.
> + *
> + * Return zero on success; non-zero otherwise
> + */
> +static int ecryptfs_encrypt_extent(struct page *enc_extent_page,
> +struct ecryptfs_crypt_stat *crypt_stat,
> +struct page *page,
> +unsigned long extent_offset)
> +{
> + unsigned long extent_base;
> + char extent_iv[ECRYPTFS_MAX_IV_BYTES];
> + int rc;
> +
> + extent_base = (page->index
> +* (PAGE_CACHE_SIZE / crypt_stat->extent_size));

I think this is vulnerable to overflow on 32-bit machines?

> + rc = ecryptfs_derive_iv(extent_iv, crypt_stat,
> + (extent_base + extent_offset));
> + if (rc) {
> + ecryptfs_printk(KERN_ERR, "Error attempting to "
> + "derive IV for extent [0x%.16x]; "
> + "rc = [%d]\n", (extent_base + extent_offset),
> + rc);
> + goto out;
> + }
> + if (unlikely(ecryptfs_verbosity > 0)) {
> + ecryptfs_printk(KERN_DEBUG, "Encrypting extent "
> + "with iv:\n");
> + ecryptfs_dump_hex(extent_iv, crypt_stat->iv_bytes);
> + ecryptfs_printk(KERN_DEBUG, "First 8 bytes before "
> + "encryption:\n");
> + ecryptfs_dump_hex((char *)
> +   (page_address(page)
> ++ (extent_offset * crypt_stat->extent_size)),
> +   8);
> + }
> + rc = ecryptfs_encrypt_page_offset(crypt_stat, enc_extent_page, 0,
> +   page, (extent_offset
> +  * crypt_stat->extent_size),

maybe that is too, dunno.

> +   crypt_stat->extent_size, extent_iv);
> + if (rc < 0) {
> + printk(KERN_ERR "%s: Error attempting to encrypt page with "
> +"page->index = [%ld], extent_offset = [%ld]; "
> +"rc = [%d]\n", __FUNCTION__, page->index, extent_offset,
> +rc);
> + goto out;
> + }
> + rc = 0;
> + if (unlikely(ecryptfs_verbosity > 0)) {
> + ecryptfs_printk(KERN_DEBUG, "Encrypt extent [0x%.16x]; "
> + "rc = [%d]\n", (extent_base + extent_offset),
> + rc);
> + ecryptfs_printk(KERN_DEBUG, "First 8 bytes after "
> + "encryption:\n");
> + ecryptfs_dump_hex((char *)(page_address(enc_extent_page)), 8);
> + }
> +out:
> + return rc;
> +}
> +
> +/**
>   * ecryptfs_encrypt_page
> - * @ctx: The context of the page
> + * @page: Page mapped from the eCryptfs inode for the file; contains
> + *decrypted content that needs to be encrypted (to a temporary
> + *page; not in place) and written out to the lower file
>   *
>   * Encrypt an eCryptfs page. This is done on a per-extent basis. Note
>   * that eCryptfs pages may straddle the lower pages -- for instance,
> @@ -478,128 +561,121 @@ out:
>   * file, 24K of page 0 of the lower file will be read and decrypted,
>   * and then 8K of page 1 of the lower file will be read and decrypted.

Re: [PATCH 3/11] eCryptfs: read_write.c routines

2007-09-19 Thread Andrew Morton
On Mon, 17 Sep 2007 16:46:32 -0500 Michael Halcrow <[EMAIL PROTECTED]> wrote:

> Add a set of functions through which all I/O to lower files is
> consolidated. This patch adds a new inode_info reference to a
> persistent lower file for each eCryptfs inode; another patch later in
> this series will set that up. This persistent lower file is what the
> read_write.c functions use to call vfs_read() and vfs_write() on the
> lower filesystem, so even when reads and writes come in through
> aops->readpage and aops->writepage, we can satisfy them without
> resorting to direct access to the lower inode's address space.
> Several function declarations are going to be changing with this
> patchset. For now, in order to keep from breaking the build, I am
> putting dummy parameters in for those functions.
> 
> ..
>
> +/**
> + * ecryptfs_write_lower_page_segment
> + * @ecryptfs_inode: The eCryptfs inode
> + * @page_for_lower: The page containing the data to be written to the
> + *  lower file
> + * @offset_in_page: The offset in the @page_for_lower from which to
> + *  start writing the data
> + * @size: The amount of data from @page_for_lower to write to the
> + *lower file
> + *
> + * Determines the byte offset in the file for the given page and
> + * offset within the page, maps the page, and makes the call to write
> + * the contents of @page_for_lower to the lower inode.
> + *
> + * Returns zero on success; non-zero otherwise
> + */
> +int ecryptfs_write_lower_page_segment(struct inode *ecryptfs_inode,
> +   struct page *page_for_lower,
> +   size_t offset_in_page, size_t size)
> +{
> + char *virt;
> + loff_t offset;
> + int rc;
> +
> + offset = (page_for_lower->index << PAGE_CACHE_SHIFT) + offset_in_page;

bug.  You need to cast page.index to loff_t before shifting.

I'd fix it on the spot, but this would be a good time to review the whole
patchset and perhaps the whole fs for this easy-to-do, hard-to-find bug.

> + virt = kmap(page_for_lower);
> + rc = ecryptfs_write_lower(ecryptfs_inode, virt, offset, size);
> + kunmap(page_for_lower);
> + return rc;
> +}

argh, kmap.  http://lkml.org/lkml/2007/9/15/55

> +/**
> + * ecryptfs_write
> + * @ecryptfs_file: The eCryptfs file into which to write
> + * @data: Virtual address where data to write is located
> + * @offset: Offset in the eCryptfs file at which to begin writing the
> + *  data from @data
> + * @size: The number of bytes to write from @data
> + *
> + * Write an arbitrary amount of data to an arbitrary location in the
> + * eCryptfs inode page cache. This is done on a page-by-page, and then
> + * by an extent-by-extent, basis; individual extents are encrypted and
> + * written to the lower page cache (via VFS writes). This function
> + * takes care of all the address translation to locations in the lower
> + * filesystem; it also handles truncate events, writing out zeros
> + * where necessary.
> + *
> + * Returns zero on success; non-zero otherwise
> + */
> +int ecryptfs_write(struct file *ecryptfs_file, char *data, loff_t offset,
> +size_t size)
> +{
> + struct page *ecryptfs_page;
> + char *ecryptfs_page_virt;
> + u64 ecryptfs_file_size = i_size_read(ecryptfs_file->f_dentry->d_inode);

Not loff_t?

> + loff_t data_offset = 0;
> + loff_t pos;
> + int rc = 0;
> +
> + if (offset > ecryptfs_file_size)
> + pos = ecryptfs_file_size;

loff_t = u64.  The typing seems a bit confused?

> + else
> + pos = offset;
> + while (pos < (offset + size)) {
> + pgoff_t ecryptfs_page_idx = (pos >> PAGE_CACHE_SHIFT);
> + size_t start_offset_in_page = (pos & ~PAGE_CACHE_MASK);
> + size_t num_bytes = (PAGE_CACHE_SIZE - start_offset_in_page);
> + size_t total_remaining_bytes = ((offset + size) - pos);
> +
> + if (num_bytes > total_remaining_bytes)
> + num_bytes = total_remaining_bytes;
> + if (pos < offset) {
> + size_t total_remaining_zeros = (offset - pos);
> +
> + if (num_bytes > total_remaining_zeros)
> + num_bytes = total_remaining_zeros;
> + }
> + ecryptfs_page = ecryptfs_get1page(ecryptfs_file,
> +   ecryptfs_page_idx);
> + if (IS_ERR(ecryptfs_page)) {
> + rc = PTR_ERR(ecryptfs_page);
> + printk(KERN_ERR "%s: Error getting page at "
> +"index [%ld] from eCryptfs inode "
> +"mapping; rc = [%d]\n", __FUNCTION__,
> +ecryptfs_page_idx, rc);
> + goto out;
> + }
> + if (start_offset_in_page) {
> + /* Read in the page from the lower
> +

Re: [PATCH 3/11] eCryptfs: read_write.c routines

2007-09-19 Thread Andrew Morton
On Mon, 17 Sep 2007 16:46:32 -0500 Michael Halcrow <[EMAIL PROTECTED]> wrote:

> +/**
> + * ecryptfs_write_lower
> + * @ecryptfs_inode: The eCryptfs inode
> + * @data: Data to write
> + * @offset: Byte offset in the lower file to which to write the data
> + * @size: Number of bytes from @data to write at @offset in the lower
> + *file
> + *
> + * Write data to the lower file.
> + *
> + * Returns zero on success; non-zero on error
> + */

That might come out looking odd in the kernel doc?  Normally the documentation
would start out with

+/**
+ * ecryptfs_write_lower - write data to the lower file

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD: use GFP_NOFS in kmalloc

2007-09-19 Thread Andreas Dilger
On Sep 19, 2007  12:22 -0700, Mingming Cao wrote:
> Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> with the rest of kmalloc flag used in the JBD/JBD2 layer.
> 
> @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
> - journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> + journal = kmalloc(sizeof(*journal), GFP_NOFS);
> @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);

Is there a reason for this change except "it's in a filesystem, so it
should be GFP_NOFS"?  We are only doing journal setup during mount so
there shouldn't be any problem using GFP_KERNEL.  I don't think it will
inject any defect into the code, but I don't think it is needed either.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-19 Thread David Chinner
On Wed, Sep 19, 2007 at 04:04:30PM +0200, Andrea Arcangeli wrote:
> On Wed, Sep 19, 2007 at 03:09:10PM +1000, David Chinner wrote:
> > Ok, let's step back for a moment and look at a basic, fundamental
> > constraint of disks - seek capacity. A decade ago, a terabyte of
> > filesystem had 30 disks behind it - a seek capacity of about
> > 6000 seeks/s. Nowdays, that's a single disk with a seek
> > capacity of about 200/s. We're going *rapidly* backwards in
> > terms of seek capacity per terabyte of storage.
> > 
> > Now fill that terabyte of storage and index it in the most efficient
> > way - let's say btrees are used because lots of filesystems use
> > them. Hence the depth of the tree is roughly O((log n)/m) where m is
> > a factor of the btree block size.  Effectively, btree depth = seek
> > count on lookup of any object.
> 
> I agree. btrees will clearly benefit if the nodes are larger. We've an
> excess of disk capacity and an huge gap between seeking and contiguous
> bandwidth.
> 
> You don't need largepages for this, fsblocks is enough.

Sure, and that's what I meant when I said VPC + large pages was
a means to the end, not the only solution to the problem.

> Plus of course you don't like fsblock because it requires work to
> adapt a fs to it, I can't argue about that.

No, I don't like fsblock because it is inherently a "struture
per filesystem block" construct, just like buggerheads. You
still need to allocate millions of them when you have millions
dirty pages around. Rather than type it all out again, read
the fsblocks thread from here:

http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2

FWIW, with Chris mason's extent-based block mapping (which btrfs
is using and Christoph Hellwig is porting XFS over to) we completely
remove buggerheads from XFS and so fsblock would be a pretty
major step backwards for us if Chris's work goes into mainline.

> > Ok, so let's set the record straight. There were 3 justifications
> > for using *large pages* to *support* large filesystem block sizes
> > The justifications for the variable order page cache with large
> > pages were:
> > 
> > 1. little code change needed in the filesystems
> > -> still true
> 
> Disagree, the mmap side is not a little change.

That's not in the filesystem, though. ;)

However, I agree that if you don't have mmap then it's not
worthwhile and the changes for VPC aren't trivial.

> > 3. avoiding the need for vmap() as it has great
> >overhead and does not scale
> > -> Nick is starting to work on that and has
> >already had good results.
> 
> Frankly I don't follow this vmap thing. Can you elaborate?

We current support metadata blocks larger than page size for
certain types of metadata in XFS. e.g. directory blocks.
This however, requires vmap()ing a bunch of individual,
non-contiguous pages out of a block device address space
in exactly the fashion that was proposed by Nick with fsblock
originally.

vmap() has severe scalability problems - read this subthread
of this discussion between Nick and myself:

http://lkml.org/lkml/2007/9/11/508

> > Everyone seems to be focussing on #2 as the entire justification for
> > large block sizes in filesystems and that this is an "SGI" problem.
> 
> I agree it's not a SGI problem and this is why I want a design that
> has a _slight chance_ to improve performance on x86-64 too. If
> variable order page cache will provide any further improvement on top
> of fsblock will be only because your I/O device isn't fast with small
> sg entries.



There we go - back to the bloody I/O devices. Can ppl please stop
bringing this up because it *is not an issue any more*.

> config-page-shift + fsblock IMHO is the way to go for x86-64, with one
> additional 64k PAGE_SIZE rpm. config-page-shift will stack nicely on
> top of fsblocks.

Hmm - so you'll need page cache tail packing as well in that case
to prevent memory being wasted on small files. That means any way
we look at it (VPC+mmap or config-page-shift+fsblock+pctails)
we've got some non-trivial VM  modifications to make. 

If VPC can be separated from the large contiguous page requirement
(i.e. virtually mapped compound page support), I still think it
comes out on top because it doesn't require every filesystem to be
modified and you can use standard pages where they are optimal
(i.e. on filesystems were block size <= PAGE_SIZE).

But, I'm not going to argue endlessly for one solution or another;
I'm happy to see different solutions being chased, so may the
best VM win ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [05/17] vunmap: return page array

2007-09-19 Thread KAMEZAWA Hiroyuki
On Wed, 19 Sep 2007 15:15:58 -0700 (PDT)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Wed, 19 Sep 2007, KAMEZAWA Hiroyuki wrote:
> 
> > Hmm, I don't like returning array which someone allocated in past and 
> > forgot.
> 
> But that is exactly the point. There is no need to keep track of the 
> information that is of no interest until the disposition of the mapping.
> 
yes. But I think neat style of this kind function is
==
/* If array != NULL, pointer of unmapped pages are stored in array[] */
extern int vunmap(const void *addr, struct page **array);
==
But yes, this costs.

> > And, area->page[] array under vmalloc() is allocated as following (in 
> > -rc6-mm1)
> > ==
> >if (array_size > PAGE_SIZE) {
> > pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
> > PAGE_KERNEL, node);
> > area->flags |= VM_VPAGES;
> > } else {
> > pages = kmalloc_node(array_size,
> > (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO,
> > node);
> > }
> > ==
> > A bit complicating.
> 
> Not at all. You can pass a __vmalloc'ed entity to vmap if you add VM_VPAGES 
> to the flags passed to it.
> 
> > At least, please add comments "How to free page-array returned by vummap"
> 
> But that depends on how the vmap was called. The caller knows what he has 
> done to acquire the memory and therefore also knows how to get rid of it.
>  
Hm, it seems all your patch are for hiding usage of vmalloc()/vmap().
If freeing pages[] array is also hidden in your patch's context, no problem.

Thanks,
-Kame


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [05/17] vunmap: return page array

2007-09-19 Thread Christoph Lameter
On Wed, 19 Sep 2007, KAMEZAWA Hiroyuki wrote:

> Hmm, I don't like returning array which someone allocated in past and forgot.

But that is exactly the point. There is no need to keep track of the 
information that is of no interest until the disposition of the mapping.

> And, area->page[] array under vmalloc() is allocated as following (in 
> -rc6-mm1)
> ==
>if (array_size > PAGE_SIZE) {
> pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
> PAGE_KERNEL, node);
> area->flags |= VM_VPAGES;
> } else {
> pages = kmalloc_node(array_size,
> (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO,
> node);
> }
> ==
> A bit complicating.

Not at all. You can pass a __vmalloc'ed entity to vmap if you add VM_VPAGES 
to the flags passed to it.

> At least, please add comments "How to free page-array returned by vummap"

But that depends on how the vmap was called. The caller knows what he has 
done to acquire the memory and therefore also knows how to get rid of it.
 
The knowledge of how to displose of things is only important when we use
vfree() to free up a vmap() (we never do that today) and expect the page 
array to go with it.

In that case the user needs to specific VM_VPAGES if the page array was 
vmalloced.

I can add a comment explaining that?

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD slab cleanups

2007-09-19 Thread Mingming Cao
On Wed, 2007-09-19 at 13:48 -0600, Andreas Dilger wrote:
> On Sep 19, 2007  12:15 -0700, Mingming Cao wrote:
> > @@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
> >  
> >  alloc_transaction:
> > if (!journal->j_running_transaction) {
> > -   new_transaction = kmalloc(sizeof(*new_transaction),
> > -   GFP_NOFS|__GFP_NOFAIL);
> > +   new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
> 
> This should probably be a __GFP_NOFAIL if we are trying to start a new
> handle in truncate, as there is no way to propagate an error to the caller.
> 

Thanks, updated version.

Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2, most cases
they are not needed.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/journal.c  |2 +-
 fs/jbd2/journal.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c  2007-09-19 11:47:58.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c   2007-09-19 14:23:45.0 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
 
-   journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+   journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:48:14.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c  2007-09-19 14:23:45.0 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
 
-   journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+   journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD: use GFP_NOFS in kmalloc

2007-09-19 Thread Mingming Cao
On Wed, 2007-09-19 at 14:34 -0700, Andrew Morton wrote:
> On Wed, 19 Sep 2007 12:22:09 -0700
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> 
> > Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> > with the rest of kmalloc flag used in the JBD/JBD2 layer.
> > 
> > Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> > 
> > ---
> >  fs/jbd/journal.c  |6 +++---
> >  fs/jbd/revoke.c   |8 
> >  fs/jbd2/journal.c |6 +++---
> >  fs/jbd2/revoke.c  |8 
> >  4 files changed, 14 insertions(+), 14 deletions(-)
> > 
> > Index: linux-2.6.23-rc6/fs/jbd/journal.c
> > ===
> > --- linux-2.6.23-rc6.orig/fs/jbd/journal.c  2007-09-19 11:51:10.0 
> > -0700
> > +++ linux-2.6.23-rc6/fs/jbd/journal.c   2007-09-19 11:51:57.0 
> > -0700
> > @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
> > journal_t *journal;
> > int err;
> >  
> > -   journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> > +   journal = kmalloc(sizeof(*journal), GFP_NOFS);
> > if (!journal)
> > goto fail;
> > memset(journal, 0, sizeof(*journal));
> > @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
> > journal->j_blocksize = blocksize;
> > n = journal->j_blocksize / sizeof(journal_block_tag_t);
> > journal->j_wbufsize = n;
> > -   journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> > +   journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> > if (!journal->j_wbuf) {
> > printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> > __FUNCTION__);
> > @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
> > /* journal descriptor can store up to n blocks -bzzz */
> > n = journal->j_blocksize / sizeof(journal_block_tag_t);
> > journal->j_wbufsize = n;
> > -   journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> > +   journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> > if (!journal->j_wbuf) {
> > printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> > __FUNCTION__);
> > Index: linux-2.6.23-rc6/fs/jbd/revoke.c
> > ===
> > --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c   2007-09-19 11:51:30.0 
> > -0700
> > +++ linux-2.6.23-rc6/fs/jbd/revoke.c2007-09-19 11:52:34.0 
> > -0700
> > @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
> > while((tmp >>= 1UL) != 0UL)
> > shift++;
> >  
> > -   journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, 
> > GFP_KERNEL);
> > +   journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, 
> > GFP_NOFS);
> > if (!journal->j_revoke_table[0])
> > return -ENOMEM;
> > journal->j_revoke = journal->j_revoke_table[0];
> > @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
> > journal->j_revoke->hash_shift = shift;
> >  
> > journal->j_revoke->hash_table =
> > -   kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> > +   kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> > if (!journal->j_revoke->hash_table) {
> > kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> > journal->j_revoke = NULL;
> > @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
> > for (tmp = 0; tmp < hash_size; tmp++)
> > INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
> >  
> > -   journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
> > GFP_KERNEL);
> > +   journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
> > GFP_NOFS);
> > if (!journal->j_revoke_table[1]) {
> > kfree(journal->j_revoke_table[0]->hash_table);
> > kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> > @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
> > journal->j_revoke->hash_shift = shift;
> >  
> > journal->j_revoke->hash_table =
> > -   kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> > +   kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> > if (!journal->j_revoke->hash_table) {
> > kfree(journal->j_revoke_table[0]->hash_table);
> > kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> 
> These were all OK using GFP_KERNEL.
> 
> GFP_NOFS should only be used when the caller is holding some fs locks which
> might cause a deadlock if that caller reentered the fs in ->writepage (and
> maybe put_inode and such).  That isn't the case in any of the above code,
> which is all mount time stuff (I think).
> 

You are right they are all occur at initialization time.

> ext3/4 should be using GFP_NOFS when the caller has a transaction open, has
> a page locked, is holding i_mutex, etc.
> 

Thanks for your 

Re: [PATCH] JBD: use GFP_NOFS in kmalloc

2007-09-19 Thread Andrew Morton
On Wed, 19 Sep 2007 12:22:09 -0700
Mingming Cao <[EMAIL PROTECTED]> wrote:

> Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> with the rest of kmalloc flag used in the JBD/JBD2 layer.
> 
> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> 
> ---
>  fs/jbd/journal.c  |6 +++---
>  fs/jbd/revoke.c   |8 
>  fs/jbd2/journal.c |6 +++---
>  fs/jbd2/revoke.c  |8 
>  4 files changed, 14 insertions(+), 14 deletions(-)
> 
> Index: linux-2.6.23-rc6/fs/jbd/journal.c
> ===
> --- linux-2.6.23-rc6.orig/fs/jbd/journal.c2007-09-19 11:51:10.0 
> -0700
> +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.0 -0700
> @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
>   journal_t *journal;
>   int err;
>  
> - journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> + journal = kmalloc(sizeof(*journal), GFP_NOFS);
>   if (!journal)
>   goto fail;
>   memset(journal, 0, sizeof(*journal));
> @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
>   journal->j_blocksize = blocksize;
>   n = journal->j_blocksize / sizeof(journal_block_tag_t);
>   journal->j_wbufsize = n;
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
>   if (!journal->j_wbuf) {
>   printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
>   __FUNCTION__);
> @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
>   /* journal descriptor can store up to n blocks -bzzz */
>   n = journal->j_blocksize / sizeof(journal_block_tag_t);
>   journal->j_wbufsize = n;
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
>   if (!journal->j_wbuf) {
>   printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
>   __FUNCTION__);
> Index: linux-2.6.23-rc6/fs/jbd/revoke.c
> ===
> --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.0 
> -0700
> +++ linux-2.6.23-rc6/fs/jbd/revoke.c  2007-09-19 11:52:34.0 -0700
> @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
>   while((tmp >>= 1UL) != 0UL)
>   shift++;
>  
> - journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, 
> GFP_KERNEL);
> + journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, 
> GFP_NOFS);
>   if (!journal->j_revoke_table[0])
>   return -ENOMEM;
>   journal->j_revoke = journal->j_revoke_table[0];
> @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
>   journal->j_revoke->hash_shift = shift;
>  
>   journal->j_revoke->hash_table =
> - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
>   if (!journal->j_revoke->hash_table) {
>   kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
>   journal->j_revoke = NULL;
> @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
>   for (tmp = 0; tmp < hash_size; tmp++)
>   INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
>  
> - journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
> GFP_KERNEL);
> + journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
> GFP_NOFS);
>   if (!journal->j_revoke_table[1]) {
>   kfree(journal->j_revoke_table[0]->hash_table);
>   kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
>   journal->j_revoke->hash_shift = shift;
>  
>   journal->j_revoke->hash_table =
> - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
>   if (!journal->j_revoke->hash_table) {
>   kfree(journal->j_revoke_table[0]->hash_table);
>   kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);

These were all OK using GFP_KERNEL.

GFP_NOFS should only be used when the caller is holding some fs locks which
might cause a deadlock if that caller reentered the fs in ->writepage (and
maybe put_inode and such).  That isn't the case in any of the above code,
which is all mount time stuff (I think).

ext3/4 should be using GFP_NOFS when the caller has a transaction open, has
a page locked, is holding i_mutex, etc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/6] Unionfs: cache coherency after lower objects are removed

2007-09-19 Thread Erez Zadok
Prevent an oops if a lower file is deleted and then it is stat'ed from the
upper layer.  Ensure that we return a negative dentry so the user will get
an ENOENT.  Properly dput/mntput so we don't leak references at the lower
file system.

Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
Acked-by: Josef Sipek <[EMAIL PROTECTED]>
---
 fs/unionfs/lookup.c |   14 --
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/unionfs/lookup.c b/fs/unionfs/lookup.c
index 8eb9749..963d622 100644
--- a/fs/unionfs/lookup.c
+++ b/fs/unionfs/lookup.c
@@ -92,7 +92,7 @@ struct dentry *unionfs_lookup_backend(struct dentry *dentry,
struct dentry *lower_dir_dentry = NULL;
struct dentry *parent_dentry = NULL;
struct dentry *d_interposed = NULL;
-   int bindex, bstart, bend, bopaque;
+   int bindex, bstart = -1, bend, bopaque;
int dentry_count = 0;   /* Number of positive dentries. */
int first_dentry_offset = -1; /* -1 is uninitialized */
struct dentry *first_dentry = NULL;
@@ -413,7 +413,14 @@ out:
if (!err && UNIONFS_D(dentry)) {
BUG_ON(dbend(dentry) > UNIONFS_D(dentry)->bcount);
BUG_ON(dbend(dentry) > sbmax(dentry->d_sb));
-   BUG_ON(dbstart(dentry) < 0);
+   if (dbstart(dentry) < 0 &&
+   dentry->d_inode && bstart >= 0 &&
+   (!UNIONFS_I(dentry->d_inode) ||
+!UNIONFS_I(dentry->d_inode)->lower_inodes)) {
+   unionfs_mntput(dentry->d_sb->s_root, bstart);
+   dput(first_lower_dentry);
+   UNIONFS_I(dentry->d_inode)->stale = 1;
+   }
}
kfree(whname);
if (locked_parent)
@@ -423,6 +430,9 @@ out:
unionfs_unlock_dentry(dentry);
if (!err && d_interposed)
return d_interposed;
+   if (dentry->d_inode && UNIONFS_I(dentry->d_inode)->stale &&
+   first_dentry_offset >= 0)
+   unionfs_mntput(dentry->d_sb->s_root, first_dentry_offset);
return ERR_PTR(err);
 }
 
-- 
1.5.2.2

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/6] Unionfs: check integrity only if validated dentry successfully

2007-09-19 Thread Erez Zadok
Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
Acked-by: Josef Sipek <[EMAIL PROTECTED]>
---
 fs/unionfs/dentry.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/unionfs/dentry.c b/fs/unionfs/dentry.c
index 91f9780..9e0742d 100644
--- a/fs/unionfs/dentry.c
+++ b/fs/unionfs/dentry.c
@@ -418,7 +418,8 @@ static int unionfs_d_revalidate(struct dentry *dentry, 
struct nameidata *nd)
unionfs_lock_dentry(dentry);
err = __unionfs_d_revalidate_chain(dentry, nd, false);
unionfs_unlock_dentry(dentry);
-   unionfs_check_dentry(dentry);
+   if (err > 0) /* true==1: dentry is valid */
+   unionfs_check_dentry(dentry);
 
unionfs_read_unlock(dentry->d_sb);
 
-- 
1.5.2.2

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/6] Unionfs: unionfs_lookup locking consistency

2007-09-19 Thread Erez Zadok
Ensure that our lookup locking is consistent and symmetric: if a lock
existed before calling lookup_backend, it should remain so; only if
performing a lookup of a known new dentry, should lookup_backend return a
newly-locked dentry-inode info (and only if there was no error).  Document
this behavior.  This cleanup allowed us to remove two unnecessary int
declarations.

Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
Acked-by: Josef Sipek <[EMAIL PROTECTED]>
---
 fs/unionfs/inode.c  |6 +-
 fs/unionfs/lookup.c |   38 +++---
 2 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/fs/unionfs/inode.c b/fs/unionfs/inode.c
index 687b9a7..9638b64 100644
--- a/fs/unionfs/inode.c
+++ b/fs/unionfs/inode.c
@@ -163,7 +163,10 @@ static struct dentry *unionfs_lookup(struct inode *parent,
path_save.mnt = nd->mnt;
}
 
-   /* The locking is done by unionfs_lookup_backend. */
+   /*
+* unionfs_lookup_backend returns a locked dentry upon success,
+* so we'll have to unlock it below.
+*/
ret = unionfs_lookup_backend(dentry, nd, INTERPOSE_LOOKUP);
 
/* restore the dentry & vfsmnt in namei */
@@ -176,6 +179,7 @@ static struct dentry *unionfs_lookup(struct inode *parent,
dentry = ret;
/* parent times may have changed */
unionfs_copy_attr_times(dentry->d_parent->d_inode);
+   unionfs_unlock_dentry(dentry);
}
 
unionfs_check_inode(parent);
diff --git a/fs/unionfs/lookup.c b/fs/unionfs/lookup.c
index 7fa6310..8eb9749 100644
--- a/fs/unionfs/lookup.c
+++ b/fs/unionfs/lookup.c
@@ -77,6 +77,11 @@ out:
  *
  * Returns: NULL (ok), ERR_PTR if an error occurred, or a non-null non-error
  * PTR if d_splice returned a different dentry.
+ *
+ * If lookupmode is INTERPOSE_PARTIAL/REVAL/REVAL_NEG, the passed dentry's
+ * inode info must be locked.  If lookupmode is INTERPOSE_LOOKUP (i.e., a
+ * newly looked-up dentry), then unionfs_lookup_backend will return a locked
+ * dentry's info, which the caller must unlock.
  */
 struct dentry *unionfs_lookup_backend(struct dentry *dentry,
  struct nameidata *nd, int lookupmode)
@@ -94,8 +99,6 @@ struct dentry *unionfs_lookup_backend(struct dentry *dentry,
struct dentry *first_lower_dentry = NULL;
struct vfsmount *first_lower_mnt = NULL;
int locked_parent = 0;
-   int locked_child = 0;
-   int allocated_new_info = 0;
int opaque;
char *whname = NULL;
const char *name;
@@ -109,24 +112,21 @@ struct dentry *unionfs_lookup_backend(struct dentry 
*dentry,
if (lookupmode == INTERPOSE_PARTIAL || lookupmode == INTERPOSE_REVAL ||
lookupmode == INTERPOSE_REVAL_NEG)
verify_locked(dentry);
-   else {
+   else/* this could only be INTERPOSE_LOOKUP */
BUG_ON(UNIONFS_D(dentry) != NULL);
-   locked_child = 1;
-   }
 
-   switch(lookupmode) {
-   case INTERPOSE_PARTIAL:
-   break;
-   case INTERPOSE_LOOKUP:
-   if ((err = new_dentry_private_data(dentry)))
-   goto out;
-   allocated_new_info = 1;
-   break;
-   default:
-   if ((err = realloc_dentry_private_data(dentry)))
-   goto out;
-   allocated_new_info = 1;
-   break;
+   switch (lookupmode) {
+   case INTERPOSE_PARTIAL:
+   break;
+   case INTERPOSE_LOOKUP:
+   if ((err = new_dentry_private_data(dentry)))
+   goto out;
+   break;
+   default:
+   /* default: can only be INTERPOSE_REVAL/REVAL_NEG */
+   if ((err = realloc_dentry_private_data(dentry)))
+   goto out;
+   break;
}
 
/* must initialize dentry operations */
@@ -419,7 +419,7 @@ out:
if (locked_parent)
unionfs_unlock_dentry(parent_dentry);
dput(parent_dentry);
-   if (locked_child || (err && allocated_new_info))
+   if (err && (lookupmode == INTERPOSE_LOOKUP))
unionfs_unlock_dentry(dentry);
if (!err && d_interposed)
return d_interposed;
-- 
1.5.2.2

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/6] Unionfs: add missing newlines to printks

2007-09-19 Thread Erez Zadok
Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
Acked-by: Josef Sipek <[EMAIL PROTECTED]>
---
 fs/unionfs/dentry.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/unionfs/dentry.c b/fs/unionfs/dentry.c
index f32922e..91f9780 100644
--- a/fs/unionfs/dentry.c
+++ b/fs/unionfs/dentry.c
@@ -438,13 +438,13 @@ static void unionfs_d_release(struct dentry *dentry)
unionfs_check_dentry(dentry);
/* this could be a negative dentry, so check first */
if (!UNIONFS_D(dentry)) {
-   printk(KERN_DEBUG "unionfs: dentry without private data: %.*s",
+   printk(KERN_DEBUG "unionfs: dentry without private data: 
%.*s\n",
   dentry->d_name.len, dentry->d_name.name);
goto out;
} else if (dbstart(dentry) < 0) {
/* this is due to a failed lookup */
printk(KERN_DEBUG "unionfs: dentry without lower "
-  "dentries: %.*s",
+  "dentries: %.*s\n",
   dentry->d_name.len, dentry->d_name.name);
goto out_free;
}
-- 
1.5.2.2

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/6] Unionfs: use bool type in dentry and file revalidation code

2007-09-19 Thread Erez Zadok
Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
Acked-by: Josef Sipek <[EMAIL PROTECTED]>
---
 fs/unionfs/commonfops.c |   12 ++--
 fs/unionfs/dentry.c |2 +-
 fs/unionfs/dirfops.c|4 ++--
 fs/unionfs/file.c   |   12 ++--
 fs/unionfs/mmap.c   |6 +++---
 fs/unionfs/union.h  |2 +-
 6 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/unionfs/commonfops.c b/fs/unionfs/commonfops.c
index 724128d..87cbb09 100644
--- a/fs/unionfs/commonfops.c
+++ b/fs/unionfs/commonfops.c
@@ -190,7 +190,7 @@ out:
 }
 
 /* open the highest priority file for a given upper file */
-static int open_highest_file(struct file *file, int willwrite)
+static int open_highest_file(struct file *file, bool willwrite)
 {
int bindex, bstart, bend, err = 0;
struct file *lower_file;
@@ -298,9 +298,9 @@ out:
 /*
  * Revalidate the struct file
  * @file: file to revalidate
- * @willwrite: 1 if caller may cause changes to the file; 0 otherwise.
+ * @willwrite: true if caller may cause changes to the file; false otherwise.
  */
-int unionfs_file_revalidate(struct file *file, int willwrite)
+int unionfs_file_revalidate(struct file *file, bool willwrite)
 {
struct super_block *sb;
struct dentry *dentry;
@@ -612,7 +612,7 @@ int unionfs_file_release(struct inode *inode, struct file 
*file)
 * This is important for open-but-unlinked files, as well as mmap
 * support.
 */
-   if ((err = unionfs_file_revalidate(file, 1)))
+   if ((err = unionfs_file_revalidate(file, true)))
goto out;
unionfs_check_file(file);
fileinfo = UNIONFS_F(file);
@@ -753,7 +753,7 @@ long unionfs_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 
unionfs_read_lock(file->f_path.dentry->d_sb);
 
-   if ((err = unionfs_file_revalidate(file, 1)))
+   if ((err = unionfs_file_revalidate(file, true)))
goto out;
 
/* check if asked for local commands */
@@ -791,7 +791,7 @@ int unionfs_flush(struct file *file, fl_owner_t id)
 
unionfs_read_lock(dentry->d_sb);
 
-   if ((err = unionfs_file_revalidate(file, 1)))
+   if ((err = unionfs_file_revalidate(file, true)))
goto out;
unionfs_check_file(file);
 
diff --git a/fs/unionfs/dentry.c b/fs/unionfs/dentry.c
index e46a53e..f32922e 100644
--- a/fs/unionfs/dentry.c
+++ b/fs/unionfs/dentry.c
@@ -262,7 +262,7 @@ static inline void purge_inode_data(struct inode *inode)
  * Revalidate a parent chain of dentries, then the actual node.
  * Assumes that dentry is locked, but will lock all parents if/when needed.
  *
- * If 'willwrite' is 1, and the lower inode times are not in sync, then
+ * If 'willwrite' is true, and the lower inode times are not in sync, then
  * *don't* purge_inode_data, as it could deadlock if ->write calls us and we
  * try to truncate a locked page.  Besides, if unionfs is about to write
  * data to a file, then there's the data unionfs is about to write is more
diff --git a/fs/unionfs/dirfops.c b/fs/unionfs/dirfops.c
index 980f125..c923e58 100644
--- a/fs/unionfs/dirfops.c
+++ b/fs/unionfs/dirfops.c
@@ -99,7 +99,7 @@ static int unionfs_readdir(struct file *file, void *dirent, 
filldir_t filldir)
 
unionfs_read_lock(file->f_path.dentry->d_sb);
 
-   if ((err = unionfs_file_revalidate(file, 0)))
+   if ((err = unionfs_file_revalidate(file, false)))
goto out;
 
inode = file->f_path.dentry->d_inode;
@@ -201,7 +201,7 @@ static loff_t unionfs_dir_llseek(struct file *file, loff_t 
offset, int origin)
 
unionfs_read_lock(file->f_path.dentry->d_sb);
 
-   if ((err = unionfs_file_revalidate(file, 0)))
+   if ((err = unionfs_file_revalidate(file, false)))
goto out;
 
rdstate = UNIONFS_F(file)->rdstate;
diff --git a/fs/unionfs/file.c b/fs/unionfs/file.c
index e9e63b7..d8eaaa5 100644
--- a/fs/unionfs/file.c
+++ b/fs/unionfs/file.c
@@ -24,7 +24,7 @@ static ssize_t unionfs_read(struct file *file, char __user 
*buf,
int err;
 
unionfs_read_lock(file->f_path.dentry->d_sb);
-   if ((err = unionfs_file_revalidate(file, 0)))
+   if ((err = unionfs_file_revalidate(file, false)))
goto out;
unionfs_check_file(file);
 
@@ -47,7 +47,7 @@ static ssize_t unionfs_aio_read(struct kiocb *iocb, const 
struct iovec *iov,
struct file *file = iocb->ki_filp;
 
unionfs_read_lock(file->f_path.dentry->d_sb);
-   if ((err = unionfs_file_revalidate(file, 0)))
+   if ((err = unionfs_file_revalidate(file, false)))
goto out;
unionfs_check_file(file);
 
@@ -72,7 +72,7 @@ static ssize_t unionfs_write(struct file *file, const char 
__user *buf,
int err = 0;
 
unionfs_read_lock(file->f_path.dentry->d_sb);
-   if ((err = unionfs_file_revalidate(file, 1)))
+   if ((err = unionfs_file_revalidate(file, true)))
goto out;
 

[GIT PULL -mm] 0/6 Unionfs updates/cleanups/fixes

2007-09-19 Thread Erez Zadok

The following is a series of patches related to Unionfs.  They represent a
few code cleanups (suggested or inspired by comments on the previous set of
patches), and a few more bug fixes (esp. to cache coherency).  These fixes
were tested on our 2.6.23-rc6 latest code, as well as the backports to
2.6.{22,21,20,19,18,9} on ext2/3/4, xfs, reiserfs, nfs, jffs2, ramfs, tmpfs,
cramfs, and squashfs (where available).  See http://unionfs.filesystems.org/
to download backported unionfs code.

Please pull from the 'master' branch of
git://git.kernel.org/pub/scm/linux/kernel/git/ezk/unionfs.git
  ^^^
(Note: the git URL to pull from had changed from "jsipek" to "ezk")

to receive the following:

Erez_Zadok (6):
  Unionfs: use bool type in dentry and file revalidation code
  Unionfs: remove unnecessary comment
  Unionfs: add missing newlines to printks
  Unionfs: check integrity only if validated dentry successfully
  Unionfs: unionfs_lookup locking consistency
  Unionfs: cache coherency after lower objects are removed

 commonfops.c |   12 ++--
 dentry.c |9 +
 dirfops.c|4 ++--
 fanout.h |1 -
 file.c   |   12 ++--
 inode.c  |6 +-
 lookup.c |   52 +++-
 mmap.c   |6 +++---
 union.h  |2 +-
 9 files changed, 59 insertions(+), 45 deletions(-)

Erez Zadok
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/6] Unionfs: remove unnecessary comment

2007-09-19 Thread Erez Zadok
Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
Acked-by: Josef Sipek <[EMAIL PROTECTED]>
---
 fs/unionfs/fanout.h |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/fs/unionfs/fanout.h b/fs/unionfs/fanout.h
index c5bf454..afeb9f6 100644
--- a/fs/unionfs/fanout.h
+++ b/fs/unionfs/fanout.h
@@ -320,7 +320,6 @@ static inline void unionfs_copy_attr_times(struct inode 
*upper)
upper->i_ctime = lower->i_ctime;
if (timespec_compare(&upper->i_atime, &lower->i_atime) < 0)
upper->i_atime = lower->i_atime;
-   /* XXX: should we notify_change on our upper inode? */
}
 }
 
-- 
1.5.2.2

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD slab cleanups

2007-09-19 Thread Mingming Cao
On Wed, 2007-09-19 at 19:28 +, Dave Kleikamp wrote:
> On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote:
> > On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:
> > 
> > > Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all
> > > cases except one handles memory allocation failure so I get rid of those
> > > GFP_NOFAIL flags.
> > > 
> > > Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
> > > in jbd/jbd2? I will send a separate patch to cleanup that.
> > 
> > No.  GFP_NOFS avoids deadlock.  It prevents the allocation from making
> > recursive calls back into the file system that could end up blocking on
> > jbd code.
> 
> Oh, I see your patch now.  You mean use GFP_NOFS instead of
> GFP_KERNEL.  :-)  OK then.
> 

oops, I did mean what you say here.:-)

> > Shaggy

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD slab cleanups

2007-09-19 Thread Andreas Dilger
On Sep 19, 2007  12:15 -0700, Mingming Cao wrote:
> @@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
>  
>  alloc_transaction:
>   if (!journal->j_running_transaction) {
> - new_transaction = kmalloc(sizeof(*new_transaction),
> - GFP_NOFS|__GFP_NOFAIL);
> + new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);

This should probably be a __GFP_NOFAIL if we are trying to start a new
handle in truncate, as there is no way to propagate an error to the caller.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD slab cleanups

2007-09-19 Thread Dave Kleikamp
On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote:
> On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:
> 
> > Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all
> > cases except one handles memory allocation failure so I get rid of those
> > GFP_NOFAIL flags.
> > 
> > Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
> > in jbd/jbd2? I will send a separate patch to cleanup that.
> 
> No.  GFP_NOFS avoids deadlock.  It prevents the allocation from making
> recursive calls back into the file system that could end up blocking on
> jbd code.

Oh, I see your patch now.  You mean use GFP_NOFS instead of
GFP_KERNEL.  :-)  OK then.

> Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD slab cleanups

2007-09-19 Thread Dave Kleikamp
On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:

> Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all
> cases except one handles memory allocation failure so I get rid of those
> GFP_NOFAIL flags.
> 
> Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
> in jbd/jbd2? I will send a separate patch to cleanup that.

No.  GFP_NOFS avoids deadlock.  It prevents the allocation from making
recursive calls back into the file system that could end up blocking on
jbd code.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] JBD: use GFP_NOFS in kmalloc

2007-09-19 Thread Mingming Cao
Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
with the rest of kmalloc flag used in the JBD/JBD2 layer.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>

---
 fs/jbd/journal.c  |6 +++---
 fs/jbd/revoke.c   |8 
 fs/jbd2/journal.c |6 +++---
 fs/jbd2/revoke.c  |8 
 4 files changed, 14 insertions(+), 14 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c  2007-09-19 11:51:10.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c   2007-09-19 11:51:57.0 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
 
-   journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+   journal = kmalloc(sizeof(*journal), GFP_NOFS);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
journal->j_blocksize = blocksize;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
-   journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+   journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
/* journal descriptor can store up to n blocks -bzzz */
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
-   journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+   journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c   2007-09-19 11:51:30.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c2007-09-19 11:52:34.0 -0700
@@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
while((tmp >>= 1UL) != 0UL)
shift++;
 
-   journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, 
GFP_KERNEL);
+   journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, 
GFP_NOFS);
if (!journal->j_revoke_table[0])
return -ENOMEM;
journal->j_revoke = journal->j_revoke_table[0];
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
journal->j_revoke->hash_shift = shift;
 
journal->j_revoke->hash_table =
-   kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+   kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
if (!journal->j_revoke->hash_table) {
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
journal->j_revoke = NULL;
@@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
for (tmp = 0; tmp < hash_size; tmp++)
INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
 
-   journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
GFP_KERNEL);
+   journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
GFP_NOFS);
if (!journal->j_revoke_table[1]) {
kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
@@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
journal->j_revoke->hash_shift = shift;
 
journal->j_revoke->hash_table =
-   kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+   kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
if (!journal->j_revoke->hash_table) {
kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:52:48.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c  2007-09-19 11:53:12.0 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
 
-   journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+   journal = kmalloc(sizeof(*journal), GFP_NOFS);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -724,7 +724,7 @@ journal_t * jbd2_journal_init_dev(struct
journal->j_blocksize = blocksize;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
-   journal->j_wbuf = kmalloc(

Re: [PATCH] JBD slab cleanups

2007-09-19 Thread Mingming Cao
On Tue, 2007-09-18 at 19:19 -0700, Andrew Morton wrote:
> On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao <[EMAIL PROTECTED]> wrote:
> 
> > JBD: Replace slab allocations with page cache allocations
> > 
> > JBD allocate memory for committed_data and frozen_data from slab. However
> > JBD should not pass slab pages down to the block layer. Use page allocator 
> > pages instead. This will also prepare JBD for the large blocksize patchset.
> > 
> > 
> > Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly
> 
> __GFP_NOFAIL should only be used when we have no way of recovering
> from failure.  The allocation in journal_init_common() (at least)
> _can_ recover and hence really shouldn't be using __GFP_NOFAIL.
> 
> (Actually, nothing in the kernel should be using __GFP_NOFAIL.  It is 
> there as a marker which says "we really shouldn't be doing this but
> we don't know how to fix it").
> 
> So sometime it'd be good if you could review all the __GFP_NOFAILs in
> there and see if we can remove some, thanks.

Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all
cases except one handles memory allocation failure so I get rid of those
GFP_NOFAIL flags.

Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
in jbd/jbd2? I will send a separate patch to cleanup that.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/journal.c  |2 +-
 fs/jbd/transaction.c  |3 +--
 fs/jbd2/journal.c |2 +-
 fs/jbd2/transaction.c |3 +--
 4 files changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c  2007-09-19 11:47:58.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c   2007-09-19 11:48:40.0 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
 
-   journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+   journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c  2007-09-19 11:48:05.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c   2007-09-19 11:49:10.0 
-0700
@@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
if (!journal->j_running_transaction) {
-   new_transaction = kmalloc(sizeof(*new_transaction),
-   GFP_NOFS|__GFP_NOFAIL);
+   new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
if (!new_transaction) {
ret = -ENOMEM;
goto out;
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:48:14.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c  2007-09-19 11:49:46.0 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
 
-   journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+   journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd2/transaction.c
===
--- linux-2.6.23-rc6.orig/fs/jbd2/transaction.c 2007-09-19 11:48:08.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd2/transaction.c  2007-09-19 11:50:12.0 
-0700
@@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
if (!journal->j_running_transaction) {
-   new_transaction = kmalloc(sizeof(*new_transaction),
-   GFP_NOFS|__GFP_NOFAIL);
+   new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
if (!new_transaction) {
ret = -ENOMEM;
goto out;




-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 13/26] SLUB: Add SlabReclaimable() to avoid repeated reclaim attempts

2007-09-19 Thread Christoph Lameter
On Wed, 19 Sep 2007, Rik van Riel wrote:

> Why is it safe to not use the normal page flag bit operators
> for these page flags operations?

Because SLUB always modifies page flags under PageLock.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries

2007-09-19 Thread Anton Altaparmakov

On 19 Sep 2007, at 18:29, Christoph Lameter wrote:

On Wed, 19 Sep 2007, Anton Altaparmakov wrote:
I suspect that is_vmalloc_addr() should not be in linux/mm.h at  
all and should
be in linux/vmalloc.h instead and vmalloc.h should include linux/ 
highmem.h.
That would be more sensible than sticking a vmalloc related  
function into

linux/mm.h where it does not belong...


Tried it and that leads to all sort of other failures.


Ah, a new header file "vaddr.h" or something then?

Best regards,

Anton
--
Anton Altaparmakov  (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries

2007-09-19 Thread Anton Altaparmakov

On 19 Sep 2007, at 18:29, Christoph Lameter wrote:

On Wed, 19 Sep 2007, Anton Altaparmakov wrote:
There even is such a function already in mm/ 
sparse.c::vaddr_in_vmalloc_area()
with pretty identical content.  I would suggest either this new  
inline should
go away completely and use the existing one and export it or the  
existing one
should go away and the inline should be used.  It seems silly to  
have two

functions with different names doing exactly the same thing!


Just in case you have not noticed: This patchset removes the
vaddr_in_vmalloc_area() in sparse.c


That's great!  And sorry I did not notice that bit...

Best regards,

Anton
--
Anton Altaparmakov  (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [14/17] Allow bit_waitqueue to wait on a bit in a vmalloc area

2007-09-19 Thread Christoph Lameter
On Wed, 19 Sep 2007, Gabriel C wrote:

> Christoph Lameter wrote:
> 
> >  
> > +   if (is_vmalloc_addr(word))
> > +   page = vmalloc_to_page(word)
>   ^^
> Missing ' ; '

Argh. Late beautification attempts are backfiring
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/17] [RFC] Virtual Compound Page Support

2007-09-19 Thread Christoph Lameter
On Wed, 19 Sep 2007, Andi Kleen wrote:

> > alloc_page(GFP_VFALLBACK, order, )
> 
> Is there a reason this needs to be a GFP flag versus a wrapper
> around alloc_page/free_page ?  page_alloc.c is already too complicated
> and it's better to keep new features separated. The only drawback
> would be that free_pages would need a different call, but that
> doesn't seem like a big problem.

I tried to make this a wrapper but there is a lot of logic in 
__alloc_pages() that would have to be replicated. Also there are specific 
places in __alloc_pages() were we can establish that we have enough memory
but its the memory fragmentation that prevents us from satisfying the 
request for a larger page.
 
> I'm also a little dubious about your attempts to do vmalloc in
> interrupt context. Is that really needed? GFP_ATOMIC allocations of
> large areas seem to be extremly unreliable to me and not design. Even
> if it works sometimes free probably wouldn't work there due to the
> flushes, which is very nasty. It would be better to drop that.

The flushes are only done on virtuall mapped architectures (xtensa) and 
are simple ASM code that can run in an interrupt context
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/17] [RFC] Virtual Compound Page Support

2007-09-19 Thread Christoph Lameter
On Wed, 19 Sep 2007, Eric Dumazet wrote:

> 1) Only power of two allocations are good candidates, or we waste RAM

Correct.

> 2) On i386 machines, we have a small vmalloc window. (128 MB default value)
>   Many servers with >4GB memory (PAE) like to boot with vmalloc=32M option to 
> get 992MB of LOWMEM.
>   If we allow some slub caches to fallback to vmalloc land, we'll have 
> problems to tune this.

We would first do the vmalloc conversion to GFP_VFALLBACK which would 
reduce the vmalloc requirements of drivers and core significantly. The 
patchset should actually reduce the vmalloc space requirements 
significantly. They are only needed in situations where the page allocator 
cannot provide a contiguous mapping and that gets rarer the better Mel's 
antifrag code works.
 
> 4) vmalloc() currently uses a linked list of vm_struct. Might need something 
> more scalable.

If its rarely used then its not that big of a deal. The better the anti 
fragmentation measures the less vmalloc use.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries

2007-09-19 Thread Christoph Lameter
On Wed, 19 Sep 2007, Anton Altaparmakov wrote:

> I suspect that is_vmalloc_addr() should not be in linux/mm.h at all and should
> be in linux/vmalloc.h instead and vmalloc.h should include linux/highmem.h.
> That would be more sensible than sticking a vmalloc related function into
> linux/mm.h where it does not belong...

Tried it and that leads to all sort of other failures.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries

2007-09-19 Thread Christoph Lameter
On Wed, 19 Sep 2007, Anton Altaparmakov wrote:

> There even is such a function already in mm/sparse.c::vaddr_in_vmalloc_area()
> with pretty identical content.  I would suggest either this new inline should
> go away completely and use the existing one and export it or the existing one
> should go away and the inline should be used.  It seems silly to have two
> functions with different names doing exactly the same thing!

Just in case you have not noticed: This patchset removes the 
vaddr_in_vmalloc_area() in sparse.c

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 18/18] exportfs: update documentation

2007-09-19 Thread Christoph Hellwig
Update deocumentation to the current state of affairs.  Remove duplicated
method descruptions in exportfs.h and point to Documentation/filesystems/
Exporting instead.  Add a little file header comment in expfs.c describing
what's going on and mentioning Neils and my copyright [1].

[1] Neil, in case you want a different/additional attribution just change
the patch in your queue to reflect the preferred version.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6-xfs/Documentation/filesystems/Exporting
===
--- linux-2.6-xfs.orig/Documentation/filesystems/Exporting  2007-09-17 
14:35:01.0 +0200
+++ linux-2.6-xfs/Documentation/filesystems/Exporting   2007-09-17 
14:37:19.0 +0200
@@ -2,9 +2,12 @@
 Making Filesystems Exportable
 =
 
-Most filesystem operations require a dentry (or two) as a starting
+Overview
+
+
+All filesystem operations require a dentry (or two) as a starting
 point.  Local applications have a reference-counted hold on suitable
-dentrys via open file descriptors or cwd/root.  However remote
+dentries via open file descriptors or cwd/root.  However remote
 applications that access a filesystem via a remote filesystem protocol
 such as NFS may not be able to hold such a reference, and so need a
 different way to refer to a particular dentry.  As the alternative
@@ -13,14 +16,14 @@ server-reboot (among other things, thoug
 problematic), there is no simple answer like 'filename'.
 
 The mechanism discussed here allows each filesystem implementation to
-specify how to generate an opaque (out side of the filesystem) byte
+specify how to generate an opaque (outside of the filesystem) byte
 string for any dentry, and how to find an appropriate dentry for any
 given opaque byte string.
 This byte string will be called a "filehandle fragment" as it
 corresponds to part of an NFS filehandle.
 
 A filesystem which supports the mapping between filehandle fragments
-and dentrys will be termed "exportable".
+and dentries will be termed "exportable".
 
 
 
@@ -89,11 +92,9 @@ For a filesystem to be exportable it mus
1/ provide the filehandle fragment routines described below.
2/ make sure that d_splice_alias is used rather than d_add
   when ->lookup finds an inode for a given parent and name.
-  Typically the ->lookup routine will end:
-   if (inode)
-   return d_splice(inode, dentry);
-   d_add(dentry, inode);
-   return NULL;
+  Typically the ->lookup routine will end with a:
+
+   return d_splice_alias(inode, dentry);
}
 
 
@@ -101,67 +102,39 @@ For a filesystem to be exportable it mus
   A file system implementation declares that instances of the filesystem
 are exportable by setting the s_export_op field in the struct
 super_block.  This field must point to a "struct export_operations"
-struct which could potentially be full of NULLs, though normally at
-least get_parent will be set.
+struct which has the following members:
 
- The primary operations are decode_fh and encode_fh.  
-decode_fh takes a filehandle fragment and tries to find or create a
-dentry for the object referred to by the filehandle.
-encode_fh takes a dentry and creates a filehandle fragment which can
-later be used to find/create a dentry for the same object.
-
-decode_fh will probably make use of "find_exported_dentry".
-This function lives in the "exportfs" module which a filesystem does
-not need unless it is being exported.  So rather that calling
-find_exported_dentry directly, each filesystem should call it through
-the find_exported_dentry pointer in it's export_operations table.
-This field is set correctly by the exporting agent (e.g. nfsd) when a
-filesystem is exported, and before any export operations are called.
-
-find_exported_dentry needs three support functions from the
-filesystem:
-  get_name.  When given a parent dentry and a child dentry, this
-should find a name in the directory identified by the parent
-dentry, which leads to the object identified by the child dentry.
-If no get_name function is supplied, a default implementation is
-provided which uses vfs_readdir to find potential names, and
-matches inode numbers to find the correct match.
-
-  get_parent.  When given a dentry for a directory, this should return 
-a dentry for the parent.  Quite possibly the parent dentry will
-have been allocated by d_alloc_anon.  
-The default get_parent function just returns an error so any
-filehandle lookup that requires finding a parent will fail.
-->lookup("..") is *not* used as a default as it can leave ".."
-entries in the dcache which are too messy to work with.
-
-  get_dentry.  When given an opaque datum, this should find the
-implied object and create a dentry for it (possibly with
-d_alloc_anon). 
-The opaque datum is 

[PATCH 17/18] exportfs: make struct export_operations const

2007-09-19 Thread Christoph Hellwig
Now that nfsd has stopped writing to the find_exported_dentry member
we an mark the export_operations const


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/efs/super.c
===
--- linux-2.6.orig/fs/efs/super.c   2007-09-13 15:06:51.0 +0200
+++ linux-2.6/fs/efs/super.c2007-09-13 15:09:05.0 +0200
@@ -113,7 +113,7 @@ static const struct super_operations efs
.remount_fs = efs_remount,
 };
 
-static struct export_operations efs_export_ops = {
+static const struct export_operations efs_export_ops = {
.fh_to_dentry   = efs_fh_to_dentry,
.fh_to_parent   = efs_fh_to_parent,
.get_parent = efs_get_parent,
Index: linux-2.6/fs/ext2/super.c
===
--- linux-2.6.orig/fs/ext2/super.c  2007-09-13 15:05:26.0 +0200
+++ linux-2.6/fs/ext2/super.c   2007-09-13 15:09:05.0 +0200
@@ -292,7 +292,7 @@ static struct dentry *ext2_fh_to_parent(
  * systems, but can be improved upon.
  * Currently only get_parent is required.
  */
-static struct export_operations ext2_export_ops = {
+static const struct export_operations ext2_export_ops = {
.fh_to_dentry = ext2_fh_to_dentry,
.fh_to_parent = ext2_fh_to_parent,
.get_parent = ext2_get_parent,
Index: linux-2.6/fs/ext3/super.c
===
--- linux-2.6.orig/fs/ext3/super.c  2007-09-13 15:06:39.0 +0200
+++ linux-2.6/fs/ext3/super.c   2007-09-13 15:09:05.0 +0200
@@ -670,7 +670,7 @@ static const struct super_operations ext
 #endif
 };
 
-static struct export_operations ext3_export_ops = {
+static const struct export_operations ext3_export_ops = {
.fh_to_dentry = ext3_fh_to_dentry,
.fh_to_parent = ext3_fh_to_parent,
.get_parent = ext3_get_parent,
Index: linux-2.6/fs/ext4/super.c
===
--- linux-2.6.orig/fs/ext4/super.c  2007-09-13 15:06:44.0 +0200
+++ linux-2.6/fs/ext4/super.c   2007-09-13 15:09:05.0 +0200
@@ -721,7 +721,7 @@ static const struct super_operations ext
 #endif
 };
 
-static struct export_operations ext4_export_ops = {
+static const struct export_operations ext4_export_ops = {
.fh_to_dentry = ext4_fh_to_dentry,
.fh_to_parent = ext4_fh_to_parent,
.get_parent = ext4_get_parent,
Index: linux-2.6/fs/fat/inode.c
===
--- linux-2.6.orig/fs/fat/inode.c   2007-09-13 15:08:12.0 +0200
+++ linux-2.6/fs/fat/inode.c2007-09-13 15:09:05.0 +0200
@@ -769,7 +769,7 @@ out:
return parent;
 }
 
-static struct export_operations fat_export_ops = {
+static const struct export_operations fat_export_ops = {
.encode_fh  = fat_encode_fh,
.fh_to_dentry   = fat_fh_to_dentry,
.get_parent = fat_get_parent,
Index: linux-2.6/fs/gfs2/ops_export.c
===
--- linux-2.6.orig/fs/gfs2/ops_export.c 2007-09-13 15:08:53.0 +0200
+++ linux-2.6/fs/gfs2/ops_export.c  2007-09-13 15:09:05.0 +0200
@@ -294,7 +294,7 @@ static struct dentry *gfs2_fh_to_parent(
}
 }
 
-struct export_operations gfs2_export_ops = {
+const struct export_operations gfs2_export_ops = {
.encode_fh = gfs2_encode_fh,
.fh_to_dentry = gfs2_fh_to_dentry,
.fh_to_parent = gfs2_fh_to_parent,
Index: linux-2.6/fs/isofs/export.c
===
--- linux-2.6.orig/fs/isofs/export.c2007-09-13 15:08:18.0 +0200
+++ linux-2.6/fs/isofs/export.c 2007-09-13 15:09:05.0 +0200
@@ -207,7 +207,7 @@ static struct dentry *isofs_fh_to_parent
fh_len > 4 ? ifid->parent_generation : 0);
 }
 
-struct export_operations isofs_export_ops = {
+const struct export_operations isofs_export_ops = {
.encode_fh  = isofs_export_encode_fh,
.fh_to_dentry   = isofs_fh_to_dentry,
.fh_to_parent   = isofs_fh_to_parent,
Index: linux-2.6/fs/isofs/isofs.h
===
--- linux-2.6.orig/fs/isofs/isofs.h 2007-09-11 16:23:34.0 +0200
+++ linux-2.6/fs/isofs/isofs.h  2007-09-13 15:09:05.0 +0200
@@ -178,4 +178,4 @@ isofs_normalize_block_and_offset(struct 
 extern const struct inode_operations isofs_dir_inode_operations;
 extern const struct file_operations isofs_dir_operations;
 extern const struct address_space_operations isofs_symlink_aops;
-extern struct export_operations isofs_export_ops;
+extern const struct export_operations isofs_export_ops;
Index: linux-2.6/fs/jfs/super.c
===
--- linux-2.6.orig/fs/jfs/super.c   2007-09-13 15:07:17.0 +0200
+++ linu

[PATCH 13/18] reiserfs: new export ops

2007-09-19 Thread Christoph Hellwig
Another nice little cleanup by using the new methods.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/reiserfs/inode.c
===
--- linux-2.6.orig/fs/reiserfs/inode.c  2007-09-13 15:10:45.0 +0200
+++ linux-2.6/fs/reiserfs/inode.c   2007-09-13 15:21:12.0 +0200
@@ -1514,19 +1514,20 @@ struct inode *reiserfs_iget(struct super
return inode;
 }
 
-struct dentry *reiserfs_get_dentry(struct super_block *sb, void *vobjp)
+static struct dentry *reiserfs_get_dentry(struct super_block *sb,
+   u32 objectid, u32 dir_id, u32 generation)
+
 {
-   __u32 *data = vobjp;
struct cpu_key key;
struct dentry *result;
struct inode *inode;
 
-   key.on_disk_key.k_objectid = data[0];
-   key.on_disk_key.k_dir_id = data[1];
+   key.on_disk_key.k_objectid = objectid;
+   key.on_disk_key.k_dir_id = dir_id;
reiserfs_write_lock(sb);
inode = reiserfs_iget(sb, &key);
-   if (inode && !IS_ERR(inode) && data[2] != 0 &&
-   data[2] != inode->i_generation) {
+   if (inode && !IS_ERR(inode) && generation != 0 &&
+   generation != inode->i_generation) {
iput(inode);
inode = NULL;
}
@@ -1543,14 +1544,9 @@ struct dentry *reiserfs_get_dentry(struc
return result;
 }
 
-struct dentry *reiserfs_decode_fh(struct super_block *sb, __u32 * data,
- int len, int fhtype,
- int (*acceptable) (void *contect,
-struct dentry * de),
- void *context)
+struct dentry *reiserfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
 {
-   __u32 obj[3], parent[3];
-
/* fhtype happens to reflect the number of u32s encoded.
 * due to a bug in earlier code, fhtype might indicate there
 * are more u32s then actually fitted.
@@ -1563,32 +1559,28 @@ struct dentry *reiserfs_decode_fh(struct
 *   6 - as above plus generation of directory
 * 6 does not fit in NFSv2 handles
 */
-   if (fhtype > len) {
-   if (fhtype != 6 || len != 5)
+   if (fh_type > fh_len) {
+   if (fh_type != 6 || fh_len != 5)
reiserfs_warning(sb,
-"nfsd/reiserfs, fhtype=%d, len=%d - 
odd",
-fhtype, len);
-   fhtype = 5;
+   "nfsd/reiserfs, fhtype=%d, len=%d - odd",
+   fh_type, fh_len);
+   fh_type = 5;
}
 
-   obj[0] = data[0];
-   obj[1] = data[1];
-   if (fhtype == 3 || fhtype >= 5)
-   obj[2] = data[2];
-   else
-   obj[2] = 0; /* generation number */
+   return reiserfs_get_dentry(sb, fid->raw[0], fid->raw[1],
+   (fh_type == 3 || fh_type >= 5) ? fid->raw[2] : 0);
+}
 
-   if (fhtype >= 4) {
-   parent[0] = data[fhtype >= 5 ? 3 : 2];
-   parent[1] = data[fhtype >= 5 ? 4 : 3];
-   if (fhtype == 6)
-   parent[2] = data[5];
-   else
-   parent[2] = 0;
-   }
-   return sb->s_export_op->find_exported_dentry(sb, obj,
-fhtype < 4 ? NULL : parent,
-acceptable, context);
+struct dentry *reiserfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   if (fh_type < 4)
+   return NULL;
+
+   return reiserfs_get_dentry(sb,
+   (fh_type >= 5) ? fid->raw[3] : fid->raw[2],
+   (fh_type >= 5) ? fid->raw[4] : fid->raw[3],
+   (fh_type == 6) ? fid->raw[5] : 0);
 }
 
 int reiserfs_encode_fh(struct dentry *dentry, __u32 * data, int *lenp,
Index: linux-2.6/fs/reiserfs/super.c
===
--- linux-2.6.orig/fs/reiserfs/super.c  2007-09-13 15:10:45.0 +0200
+++ linux-2.6/fs/reiserfs/super.c   2007-09-13 15:21:12.0 +0200
@@ -651,9 +651,9 @@ static struct quotactl_ops reiserfs_qctl
 
 static struct export_operations reiserfs_export_ops = {
.encode_fh = reiserfs_encode_fh,
-   .decode_fh = reiserfs_decode_fh,
+   .fh_to_dentry = reiserfs_fh_to_dentry,
+   .fh_to_parent = reiserfs_fh_to_parent,
.get_parent = reiserfs_get_parent,
-   .get_dentry = reiserfs_get_dentry,
 };
 
 /* this struct is used in reiserfs_getopt () for containing the value for those
Index: linux-2.6/include/linux/reiserfs_fs.h
===
--- linux-2.6.orig/include/linux/reiserfs_fs.h  2007-09-13 15:10:45.0 
+0200
+++ linux-2.6/

[PATCH 14/18] gfs2: new export ops

2007-09-19 Thread Christoph Hellwig
Convert gfs2 to the new ops.   Uses a similar structure to the generic
helpers, but gfs2 has it's own file handle formats.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>


Index: linux-2.6/fs/gfs2/ops_export.c
===
--- linux-2.6.orig/fs/gfs2/ops_export.c 2007-07-19 15:56:46.0 +0200
+++ linux-2.6/fs/gfs2/ops_export.c  2007-07-20 19:58:06.0 +0200
@@ -31,40 +31,6 @@
 #define GFS2_LARGE_FH_SIZE 8
 #define GFS2_OLD_FH_SIZE 10
 
-static struct dentry *gfs2_decode_fh(struct super_block *sb,
-__u32 *p,
-int fh_len,
-int fh_type,
-int (*acceptable)(void *context,
-  struct dentry *dentry),
-void *context)
-{
-   __be32 *fh = (__force __be32 *)p;
-   struct gfs2_inum_host inum, parent;
-
-   memset(&parent, 0, sizeof(struct gfs2_inum));
-
-   switch (fh_len) {
-   case GFS2_LARGE_FH_SIZE:
-   case GFS2_OLD_FH_SIZE:
-   parent.no_formal_ino = ((u64)be32_to_cpu(fh[4])) << 32;
-   parent.no_formal_ino |= be32_to_cpu(fh[5]);
-   parent.no_addr = ((u64)be32_to_cpu(fh[6])) << 32;
-   parent.no_addr |= be32_to_cpu(fh[7]);
-   case GFS2_SMALL_FH_SIZE:
-   inum.no_formal_ino = ((u64)be32_to_cpu(fh[0])) << 32;
-   inum.no_formal_ino |= be32_to_cpu(fh[1]);
-   inum.no_addr = ((u64)be32_to_cpu(fh[2])) << 32;
-   inum.no_addr |= be32_to_cpu(fh[3]);
-   break;
-   default:
-   return NULL;
-   }
-
-   return gfs2_export_ops.find_exported_dentry(sb, &inum, &parent,
-   acceptable, context);
-}
-
 static int gfs2_encode_fh(struct dentry *dentry, __u32 *p, int *len,
  int connectable)
 {
@@ -189,10 +155,10 @@ static struct dentry *gfs2_get_parent(st
return dentry;
 }
 
-static struct dentry *gfs2_get_dentry(struct super_block *sb, void *inum_obj)
+static struct dentry *gfs2_get_dentry(struct super_block *sb,
+   struct gfs2_inum_host *inum)
 {
struct gfs2_sbd *sdp = sb->s_fs_info;
-   struct gfs2_inum_host *inum = inum_obj;
struct gfs2_holder i_gh, ri_gh, rgd_gh;
struct gfs2_rgrpd *rgd;
struct inode *inode;
@@ -289,11 +255,50 @@ fail:
return ERR_PTR(error);
 }
 
+static struct dentry *gfs2_fh_to_dentry(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   struct gfs2_inum_host this;
+   __be32 *fh = (__force __be32 *)fid->raw;
+
+   switch (fh_type) {
+   case GFS2_SMALL_FH_SIZE:
+   case GFS2_LARGE_FH_SIZE:
+   case GFS2_OLD_FH_SIZE:
+   this.no_formal_ino = ((u64)be32_to_cpu(fh[0])) << 32;
+   this.no_formal_ino |= be32_to_cpu(fh[1]);
+   this.no_addr = ((u64)be32_to_cpu(fh[2])) << 32;
+   this.no_addr |= be32_to_cpu(fh[3]);
+   return gfs2_get_dentry(sb, &this);
+   default:
+   return NULL;
+   }
+}
+
+static struct dentry *gfs2_fh_to_parent(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   struct gfs2_inum_host parent;
+   __be32 *fh = (__force __be32 *)fid->raw;
+
+   switch (fh_type) {
+   case GFS2_LARGE_FH_SIZE:
+   case GFS2_OLD_FH_SIZE:
+   parent.no_formal_ino = ((u64)be32_to_cpu(fh[4])) << 32;
+   parent.no_formal_ino |= be32_to_cpu(fh[5]);
+   parent.no_addr = ((u64)be32_to_cpu(fh[6])) << 32;
+   parent.no_addr |= be32_to_cpu(fh[7]);
+   return gfs2_get_dentry(sb, &parent);
+   default:
+   return NULL;
+   }
+}
+
 struct export_operations gfs2_export_ops = {
-   .decode_fh = gfs2_decode_fh,
.encode_fh = gfs2_encode_fh,
+   .fh_to_dentry = gfs2_fh_to_dentry,
+   .fh_to_parent = gfs2_fh_to_parent,
.get_name = gfs2_get_name,
.get_parent = gfs2_get_parent,
-   .get_dentry = gfs2_get_dentry,
 };
 

--
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/18] isofs: new export ops

2007-09-19 Thread Christoph Hellwig
Nice little cleanup by consolidating things a little and using   
a structure for the special file handle format.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/isofs/export.c
===
--- linux-2.6.orig/fs/isofs/export.c2007-02-11 10:31:19.0 +0100
+++ linux-2.6/fs/isofs/export.c 2007-02-11 10:45:25.0 +0100
@@ -42,16 +42,6 @@
return result;
 }
 
-static struct dentry *
-isofs_export_get_dentry(struct super_block *sb, void *vobjp)
-{
-   __u32 *objp = vobjp;
-   unsigned long block = objp[0];
-   unsigned long offset = objp[1];
-   __u32 generation = objp[2];
-   return isofs_export_iget(sb, block, offset, generation);
-}
-
 /* This function is surprisingly simple.  The trick is understanding
  * that "child" is always a directory. So, to find its parent, you
  * simply need to find its ".." entry, normalize its block and offset,
@@ -182,43 +172,44 @@
return type;
 }
 
+struct isofs_fid {
+   u32 block;
+   u16 offset;
+   u16 parent_offset;
+   u32 generation;
+   u32 parent_block;
+   u32 parent_generation;
+};
 
-static struct dentry *
-isofs_export_decode_fh(struct super_block *sb,
-  __u32 *fh32,
-  int fh_len,
-  int fileid_type,
-  int (*acceptable)(void *context, struct dentry *de),
-  void *context)
+static struct dentry *isofs_fh_to_dentry(struct super_block *sb,
+   struct fid *fid, int fh_len, int fh_type)
 {
-   __u16 *fh16 = (__u16*)fh32;
-   __u32 child[3];   /* The child is what triggered all this. */
-   __u32 parent[3];  /* The parent is just along for the ride. */
+   struct isofs_fid *ifid = (struct isofs_fid *)fid;
 
-   if (fh_len < 3 || fileid_type > 2)
+   if (fh_len < 3 || fh_type > 2)
return NULL;
 
-   child[0] = fh32[0];
-   child[1] = fh16[2];  /* fh16 [sic] */
-   child[2] = fh32[2];
-
-   parent[0] = 0;
-   parent[1] = 0;
-   parent[2] = 0;
-   if (fileid_type == 2) {
-   if (fh_len > 2) parent[0] = fh32[3];
-   parent[1] = fh16[3];  /* fh16 [sic] */
-   if (fh_len > 4) parent[2] = fh32[4];
-   }
-
-   return sb->s_export_op->find_exported_dentry(sb, child, parent,
-acceptable, context);
+   return isofs_export_iget(sb, ifid->block, ifid->offset,
+   ifid->generation);
 }
 
+static struct dentry *isofs_fh_to_parent(struct super_block *sb,
+   struct fid *fid, int fh_len, int fh_type)
+{
+   struct isofs_fid *ifid = (struct isofs_fid *)fid;
+
+   if (fh_type != 2)
+   return NULL;
+
+   return isofs_export_iget(sb,
+   fh_len > 2 ? ifid->parent_block : 0,
+   ifid->parent_offset,
+   fh_len > 4 ? ifid->parent_generation : 0);
+}
 
 struct export_operations isofs_export_ops = {
-   .decode_fh  = isofs_export_decode_fh,
.encode_fh  = isofs_export_encode_fh,
-   .get_dentry = isofs_export_get_dentry,
+   .fh_to_dentry   = isofs_fh_to_dentry,
+   .fh_to_parent   = isofs_fh_to_parent,
.get_parent = isofs_export_get_parent,
 };

--
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 16/18] exportfs: remove old methods

2007-09-19 Thread Christoph Hellwig
Now that all filesystems are converted remove support for the
old methods.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/exportfs/expfs.c
===
--- linux-2.6.orig/fs/exportfs/expfs.c  2007-08-29 13:52:01.0 +0200
+++ linux-2.6/fs/exportfs/expfs.c   2007-08-29 14:02:41.0 +0200
@@ -13,19 +13,6 @@ static int get_name(struct dentry *dentr
struct dentry *child);
 
 
-static struct dentry *exportfs_get_dentry(struct super_block *sb, void *obj)
-{
-   struct dentry *result = ERR_PTR(-ESTALE);
-
-   if (sb->s_export_op->get_dentry) {
-   result = sb->s_export_op->get_dentry(sb, obj);
-   if (!result)
-   result = ERR_PTR(-ESTALE);
-   }
-
-   return result;
-}
-
 static int exportfs_get_name(struct dentry *dir, char *name,
struct dentry *child)
 {
@@ -214,125 +201,6 @@ reconnect_path(struct super_block *sb, s
return 0;
 }
 
-/**
- * find_exported_dentry - helper routine to implement 
export_operations->decode_fh
- * @sb:The &super_block identifying the filesystem
- * @obj:   An opaque identifier of the object to be found - passed to
- * get_inode
- * @parent:An optional opqaue identifier of the parent of the object.
- * @acceptable:A function used to test possible &dentries to see if 
they are
- * acceptable
- * @context:   A parameter to @acceptable so that it knows on what basis to
- * judge.
- *
- * find_exported_dentry is the central helper routine to enable file systems
- * to provide the decode_fh() export_operation.  It's main task is to take
- * an &inode, find or create an appropriate &dentry structure, and possibly
- * splice this into the dcache in the correct place.
- *
- * The decode_fh() operation provided by the filesystem should call
- * find_exported_dentry() with the same parameters that it received except
- * that instead of the file handle fragment, pointers to opaque identifiers
- * for the object and optionally its parent are passed.  The default decode_fh
- * routine passes one pointer to the start of the filehandle fragment, and
- * one 8 bytes into the fragment.  It is expected that most filesystems will
- * take this approach, though the offset to the parent identifier may well be
- * different.
- *
- * find_exported_dentry() will call get_dentry to get an dentry pointer from
- * the file system.  If any &dentry in the d_alias list is acceptable, it will
- * be returned.  Otherwise find_exported_dentry() will attempt to splice a new
- * &dentry into the dcache using get_name() and get_parent() to find the
- * appropriate place.
- */
-
-struct dentry *
-find_exported_dentry(struct super_block *sb, void *obj, void *parent,
-int (*acceptable)(void *context, struct dentry *de),
-void *context)
-{
-   struct dentry *result, *alias;
-   int err = -ESTALE;
-
-   /*
-* Attempt to find the inode.
-*/
-   result = exportfs_get_dentry(sb, obj);
-   if (IS_ERR(result))
-   return result;
-
-   if (S_ISDIR(result->d_inode->i_mode)) {
-   if (!(result->d_flags & DCACHE_DISCONNECTED)) {
-   if (acceptable(context, result))
-   return result;
-   err = -EACCES;
-   goto err_result;
-   }
-
-   err = reconnect_path(sb, result);
-   if (err)
-   goto err_result;
-   } else {
-   struct dentry *target_dir, *nresult;
-   char nbuf[NAME_MAX+1];
-
-   alias = find_acceptable_alias(result, acceptable, context);
-   if (alias)
-   return alias;
-
-   if (parent == NULL)
-   goto err_result;
-
-   target_dir = exportfs_get_dentry(sb,parent);
-   if (IS_ERR(target_dir)) {
-   err = PTR_ERR(target_dir);
-   goto err_result;
-   }
-
-   err = reconnect_path(sb, target_dir);
-   if (err) {
-   dput(target_dir);
-   goto err_result;
-   }
-
-   /*
-* As we weren't after a directory, have one more step to go.
-*/
-   err = exportfs_get_name(target_dir, nbuf, result);
-   if (!err) {
-   mutex_lock(&target_dir->d_inode->i_mutex);
-   nresult = lookup_one_len(nbuf, target_dir,
-strlen(nbuf));
-   mutex_unlock(&target_dir->d_inode->i_mutex);
-   if (!IS_ERR(nresult)) {
-   if (nresult->d_inode) {
- 

[PATCH 09/18] xfs: new export ops

2007-09-19 Thread Christoph Hellwig
This one is a lot more complicated than the previous ones.  XFS already
had a very clever scheme for supporting 64bit inode numbers in filehandles,
and I've reworked this to be some kind of a prototype for the generic
64bit inode filehandle support.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6-xfs/fs/xfs/linux-2.6/xfs_export.c
===
--- linux-2.6-xfs.orig/fs/xfs/linux-2.6/xfs_export.c2007-09-17 
14:01:25.0 +0200
+++ linux-2.6-xfs/fs/xfs/linux-2.6/xfs_export.c 2007-09-17 14:06:24.0 
+0200
@@ -33,62 +33,25 @@
 static struct dentry dotdot = { .d_name.name = "..", .d_name.len = 2, };
 
 /*
- * XFS encodes and decodes the fileid portion of NFS filehandles
- * itself instead of letting the generic NFS code do it.  This
- * allows filesystems with 64 bit inode numbers to be exported.
- *
- * Note that a side effect is that xfs_vget() won't be passed a
- * zero inode/generation pair under normal circumstances.  As
- * however a malicious client could send us such data, the check
- * remains in that code.
+ * Note that we only accept fileids which are long enough rather than allow
+ * the parent generation number to default to zero.  XFS considers zero a
+ * valid generation number not an invalid/wildcard value.
  */
-
-STATIC struct dentry *
-xfs_fs_decode_fh(
-   struct super_block  *sb,
-   __u32   *fh,
-   int fh_len,
-   int fileid_type,
-   int (*acceptable)(
-   void*context,
-   struct dentry   *de),
-   void*context)
+static int xfs_fileid_length(int fileid_type)
 {
-   xfs_fid_t   ifid;
-   xfs_fid_t   pfid;
-   void*parent = NULL;
-   int is64 = 0;
-   __u32   *p = fh;
-
-#if XFS_BIG_INUMS
-   is64 = (fileid_type & XFS_FILEID_TYPE_64FLAG);
-   fileid_type &= ~XFS_FILEID_TYPE_64FLAG;
-#endif
-
-   /*
-* Note that we only accept fileids which are long enough
-* rather than allow the parent generation number to default
-* to zero.  XFS considers zero a valid generation number not
-* an invalid/wildcard value.  There's little point printk'ing
-* a warning here as we don't have the client information
-* which would make such a warning useful.
-*/
-   if (fileid_type > 2 ||
-   fh_len < xfs_fileid_length((fileid_type == 2), is64))
-   return NULL;
-
-   p = xfs_fileid_decode_fid2(p, &ifid, is64);
-
-   if (fileid_type == 2) {
-   p = xfs_fileid_decode_fid2(p, &pfid, is64);
-   parent = &pfid;
+   switch (fileid_type) {
+   case FILEID_INO32_GEN:
+   return 2;
+   case FILEID_INO32_GEN_PARENT:
+   return 4;
+   case FILEID_INO32_GEN | XFS_FILEID_TYPE_64FLAG:
+   return 3;
+   case FILEID_INO32_GEN_PARENT | XFS_FILEID_TYPE_64FLAG:
+   return 6;
}
-
-   fh = (__u32 *)&ifid;
-   return sb->s_export_op->find_exported_dentry(sb, fh, parent, 
acceptable, context);
+   return 255; /* invalid */
 }
 
-
 STATIC int
 xfs_fs_encode_fh(
struct dentry   *dentry,
@@ -96,21 +59,21 @@ xfs_fs_encode_fh(
int *max_len,
int connectable)
 {
+   struct fid  *fid = (struct fid *)fh;
+   struct xfs_fid64*fid64 = (struct xfs_fid64 *)fh;
struct inode*inode = dentry->d_inode;
-   int type = 1;
-   __u32   *p = fh;
+   int fileid_type;
int len;
-   int is64 = 0;
-#if XFS_BIG_INUMS
-   if (!(XFS_M(inode->i_sb)->m_flags & XFS_MOUNT_SMALL_INUMS)) {
-   /* filesystem may contain 64bit inode numbers */
-   is64 = XFS_FILEID_TYPE_64FLAG;
-   }
-#endif
 
/* Directories don't need their parent encoded, they have ".." */
if (S_ISDIR(inode->i_mode))
-   connectable = 0;
+   fileid_type = FILEID_INO32_GEN;
+   else
+   fileid_type = FILEID_INO32_GEN_PARENT;
+
+   /* filesystem may contain 64bit inode numbers */
+   if (!(XFS_M(inode->i_sb)->m_flags & XFS_MOUNT_SMALL_INUMS))
+   fileid_type |= XFS_FILEID_TYPE_64FLAG;
 
/*
 * Only encode if there is enough space given.  In practice
@@ -118,39 +81,118 @@ xfs_fs_encode_fh(
 * over NFSv2 with the subtree_check export option; the other
 * seven combinations work.  The real answer is "don't use v2".
 */
-   len = xfs_fileid_length(connectable, is64);
+   len = xfs_fileid_length(fileid_type);
if (*max_len < len)
return 255;
*max_len = len;

[PATCH 15/18] ocfs2: new export ops

2007-09-19 Thread Christoph Hellwig
OCFS2 has it's own 64bit-firendly filehandle format so we can't use
the generic helpers here.  I'll add a struct for the types later.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>


Index: linux-2.6/fs/ocfs2/export.c
===
--- linux-2.6.orig/fs/ocfs2/export.c2007-05-06 13:51:17.0 +0200
+++ linux-2.6/fs/ocfs2/export.c 2007-06-12 15:54:44.0 +0200
@@ -45,9 +45,9 @@ struct ocfs2_inode_handle
u32 ih_generation;
 };
 
-static struct dentry *ocfs2_get_dentry(struct super_block *sb, void *vobjp)
+static struct dentry *ocfs2_get_dentry(struct super_block *sb,
+   struct ocfs2_inode_handle *handle)
 {
-   struct ocfs2_inode_handle *handle = vobjp;
struct inode *inode;
struct dentry *result;
 
@@ -200,54 +200,37 @@ bail:
return type;
 }
 
-static struct dentry *ocfs2_decode_fh(struct super_block *sb, u32 *fh_in,
- int fh_len, int fileid_type,
- int (*acceptable)(void *context,
-   struct dentry *de),
- void *context)
+static struct dentry *ocfs2_fh_to_dentry(struct super_block *sb,
+   struct fid *fid, int fh_len, int fh_type)
 {
-   struct ocfs2_inode_handle handle, parent;
-   struct dentry *ret = NULL;
-   __le32 *fh = (__force __le32 *) fh_in;
-
-   mlog_entry("(0x%p, 0x%p, %d, %d, 0x%p, 0x%p)\n",
-  sb, fh, fh_len, fileid_type, acceptable, context);
+   struct ocfs2_inode_handle handle;
 
-   if (fh_len < 3 || fileid_type > 2)
-   goto bail;
+   if (fh_len < 3 || fh_type > 2)
+   return NULL;
 
-   if (fileid_type == 2) {
-   if (fh_len < 6)
-   goto bail;
-
-   parent.ih_blkno = (u64)le32_to_cpu(fh[3]) << 32;
-   parent.ih_blkno |= (u64)le32_to_cpu(fh[4]);
-   parent.ih_generation = le32_to_cpu(fh[5]);
-
-   mlog(0, "Decoding parent: blkno: %llu, generation: %u\n",
-(unsigned long long)parent.ih_blkno,
-parent.ih_generation);
-   }
-
-   handle.ih_blkno = (u64)le32_to_cpu(fh[0]) << 32;
-   handle.ih_blkno |= (u64)le32_to_cpu(fh[1]);
-   handle.ih_generation = le32_to_cpu(fh[2]);
+   handle.ih_blkno = (u64)le32_to_cpu(fid->raw[0]) << 32;
+   handle.ih_blkno |= (u64)le32_to_cpu(fid->raw[1]);
+   handle.ih_generation = le32_to_cpu(fid->raw[2]);
+   return ocfs2_get_dentry(sb, &handle);
+}
 
-   mlog(0, "Encoding fh: blkno: %llu, generation: %u\n",
-(unsigned long long)handle.ih_blkno, handle.ih_generation);
+static struct dentry *ocfs2_fh_to_parent(struct super_block *sb,
+   struct fid *fid, int fh_len, int fh_type)
+{
+   struct ocfs2_inode_handle parent;
 
-   ret = ocfs2_export_ops.find_exported_dentry(sb, &handle, &parent,
-   acceptable, context);
+   if (fh_type != 2 || fh_len < 6)
+   return NULL;
 
-bail:
-   mlog_exit_ptr(ret);
-   return ret;
+   parent.ih_blkno = (u64)le32_to_cpu(fid->raw[3]) << 32;
+   parent.ih_blkno |= (u64)le32_to_cpu(fid->raw[4]);
+   parent.ih_generation = le32_to_cpu(fid->raw[5]);
+   return ocfs2_get_dentry(sb, &parent);
 }
 
 struct export_operations ocfs2_export_ops = {
-   .decode_fh  = ocfs2_decode_fh,
.encode_fh  = ocfs2_encode_fh,
-
+   .fh_to_dentry   = ocfs2_fh_to_dentry,
+   .fh_to_parent   = ocfs2_fh_to_parent,
.get_parent = ocfs2_get_parent,
-   .get_dentry = ocfs2_get_dentry,
 };

--
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 12/18] shmem: new export ops

2007-09-19 Thread Christoph Hellwig
I'm not sure what people were thinking when adding support to
export tmpfs, but here's the conversion anyway:


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/mm/shmem.c
===
--- linux-2.6.orig/mm/shmem.c   2007-02-11 10:46:30.0 +0100
+++ linux-2.6/mm/shmem.c2007-02-11 10:53:12.0 +0100
@@ -1977,33 +1977,25 @@
return ino->i_ino == inum && fh[0] == ino->i_generation;
 }
 
-static struct dentry *shmem_get_dentry(struct super_block *sb, void *vfh)
+static struct dentry *shmem_fh_to_dentry(struct super_block *sb,
+   struct fid *fid, int fh_len, int fh_type)
 {
-   struct dentry *de = NULL;
struct inode *inode;
-   __u32 *fh = vfh;
-   __u64 inum = fh[2];
-   inum = (inum << 32) | fh[1];
+   struct dentry *dentry = NULL;
+   u64 inum = fid->raw[2];
+   inum = (inum << 32) | fid->raw[1];
 
-   inode = ilookup5(sb, (unsigned long)(inum+fh[0]), shmem_match, vfh);
+   if (fh_len < 3)
+   return NULL;
+
+   inode = ilookup5(sb, (unsigned long)(inum + fid->raw[0]),
+   shmem_match, fid->raw);
if (inode) {
-   de = d_find_alias(inode);
+   dentry = d_find_alias(inode);
iput(inode);
}
 
-   return de? de: ERR_PTR(-ESTALE);
-}
-
-static struct dentry *shmem_decode_fh(struct super_block *sb, __u32 *fh,
-   int len, int type,
-   int (*acceptable)(void *context, struct dentry *de),
-   void *context)
-{
-   if (len < 3)
-   return ERR_PTR(-ESTALE);
-
-   return sb->s_export_op->find_exported_dentry(sb, fh, NULL, acceptable,
-   context);
+   return dentry;
 }
 
 static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
@@ -2038,9 +2030,8 @@
 
 static struct export_operations shmem_export_ops = {
.get_parent = shmem_get_parent,
-   .get_dentry = shmem_get_dentry,
.encode_fh  = shmem_encode_fh,
-   .decode_fh  = shmem_decode_fh,
+   .fh_to_dentry   = shmem_fh_to_dentry,
 };
 
 static int shmem_parse_options(char *options, int *mode, uid_t *uid,

--
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/18] jfs: new export ops

2007-09-19 Thread Christoph Hellwig
Trivial switch over to the new generic helpers.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/jfs/jfs_inode.h
===
--- linux-2.6.orig/fs/jfs/jfs_inode.h   2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/jfs/jfs_inode.h2007-09-13 15:19:06.0 +0200
@@ -18,6 +18,8 @@
 #ifndef_H_JFS_INODE
 #define _H_JFS_INODE
 
+struct fid;
+
 extern struct inode *ialloc(struct inode *, umode_t);
 extern int jfs_fsync(struct file *, struct dentry *, int);
 extern int jfs_ioctl(struct inode *, struct file *,
@@ -32,7 +34,10 @@ extern void jfs_truncate_nolock(struct i
 extern void jfs_free_zero_link(struct inode *);
 extern struct dentry *jfs_get_parent(struct dentry *dentry);
 extern void jfs_get_inode_flags(struct jfs_inode_info *);
-extern struct dentry *jfs_get_dentry(struct super_block *sb, void *vobjp);
+extern struct dentry *jfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type);
+extern struct dentry *jfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type);
 extern void jfs_set_inode_flags(struct inode *);
 extern int jfs_get_block(struct inode *, sector_t, struct buffer_head *, int);
 
Index: linux-2.6/fs/jfs/namei.c
===
--- linux-2.6.orig/fs/jfs/namei.c   2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/jfs/namei.c2007-09-13 15:19:42.0 +0200
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "jfs_incore.h"
 #include "jfs_superblock.h"
 #include "jfs_inode.h"
@@ -1477,13 +1478,10 @@ static struct dentry *jfs_lookup(struct 
return dentry;
 }
 
-struct dentry *jfs_get_dentry(struct super_block *sb, void *vobjp)
+static struct inode *jfs_nfs_get_inode(struct super_block *sb,
+   u64 ino, u32 generation)
 {
-   __u32 *objp = vobjp;
-   unsigned long ino = objp[0];
-   __u32 generation = objp[1];
struct inode *inode;
-   struct dentry *result;
 
if (ino == 0)
return ERR_PTR(-ESTALE);
@@ -1493,20 +1491,25 @@ struct dentry *jfs_get_dentry(struct sup
 
if (is_bad_inode(inode) ||
(generation && inode->i_generation != generation)) {
-   result = ERR_PTR(-ESTALE);
-   goto out_iput;
+   iput(inode);
+   return ERR_PTR(-ESTALE);
}
 
-   result = d_alloc_anon(inode);
-   if (!result) {
-   result = ERR_PTR(-ENOMEM);
-   goto out_iput;
-   }
-   return result;
+   return inode;
+}
 
- out_iput:
-   iput(inode);
-   return result;
+struct dentry *jfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+   jfs_nfs_get_inode);
+}
+
+struct dentry *jfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+   jfs_nfs_get_inode);
 }
 
 struct dentry *jfs_get_parent(struct dentry *dentry)
Index: linux-2.6/fs/jfs/super.c
===
--- linux-2.6.orig/fs/jfs/super.c   2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/jfs/super.c2007-09-13 15:19:06.0 +0200
@@ -738,7 +738,8 @@ static const struct super_operations jfs
 };
 
 static struct export_operations jfs_export_operations = {
-   .get_dentry = jfs_get_dentry,
+   .fh_to_dentry   = jfs_fh_to_dentry,
+   .fh_to_parent   = jfs_fh_to_parent,
.get_parent = jfs_get_parent,
 };
 

--
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/18] fat: new export ops

2007-09-19 Thread Christoph Hellwig
Very little changes here, fat had a mostly no op decode_fh before
and does not store any parent information.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/fat/inode.c
===
--- linux-2.6.orig/fs/fat/inode.c   2007-03-13 19:22:40.0 +0100
+++ linux-2.6/fs/fat/inode.c2007-03-13 19:23:13.0 +0100
@@ -651,24 +651,15 @@ static const struct super_operations fat
  * of i_logstart is used to store the directory entry offset.
  */
 
-static struct dentry *
-fat_decode_fh(struct super_block *sb, __u32 *fh, int len, int fhtype,
- int (*acceptable)(void *context, struct dentry *de),
- void *context)
-{
-   if (fhtype != 3)
-   return ERR_PTR(-ESTALE);
-   if (len < 5)
-   return ERR_PTR(-ESTALE);
-
-   return sb->s_export_op->find_exported_dentry(sb, fh, NULL, acceptable, 
context);
-}
-
-static struct dentry *fat_get_dentry(struct super_block *sb, void *inump)
+static struct dentry *fat_fh_to_dentry(struct super_block *sb,
+   struct fid *fid, int fh_len, int fh_type)
 {
struct inode *inode = NULL;
struct dentry *result;
-   __u32 *fh = inump;
+   u32 *fh = fid->raw;
+
+   if (fh_len < 5 || fh_type != 3)
+   return NULL;
 
inode = iget(sb, fh[0]);
if (!inode || is_bad_inode(inode) || inode->i_generation != fh[1]) {
@@ -782,9 +773,8 @@ out:
 }
 
 static struct export_operations fat_export_ops = {
-   .decode_fh  = fat_decode_fh,
.encode_fh  = fat_encode_fh,
-   .get_dentry = fat_get_dentry,
+   .fh_to_dentry   = fat_fh_to_dentry,
.get_parent = fat_get_parent,
 };
 

--
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/18] efs: new export ops

2007-09-19 Thread Christoph Hellwig
Trivial switch over to the new generic helpers.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/efs/namei.c
===
--- linux-2.6.orig/fs/efs/namei.c   2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/efs/namei.c2007-09-13 15:19:00.0 +0200
@@ -10,6 +10,8 @@
 #include 
 #include 
 #include 
+#include 
+
 
 static efs_ino_t efs_find_entry(struct inode *inode, const char *name, int 
len) {
struct buffer_head *bh;
@@ -75,13 +77,10 @@ struct dentry *efs_lookup(struct inode *
return NULL;
 }
 
-struct dentry *efs_get_dentry(struct super_block *sb, void *vobjp)
+static struct inode *efs_nfs_get_inode(struct super_block *sb, u64 ino,
+   u32 generation)
 {
-   __u32 *objp = vobjp;
-   unsigned long ino = objp[0];
-   __u32 generation = objp[1];
struct inode *inode;
-   struct dentry *result;
 
if (ino == 0)
return ERR_PTR(-ESTALE);
@@ -91,20 +90,25 @@ struct dentry *efs_get_dentry(struct sup
 
if (is_bad_inode(inode) ||
(generation && inode->i_generation != generation)) {
-   result = ERR_PTR(-ESTALE);
-   goto out_iput;
+   iput(inode);
+   return ERR_PTR(-ESTALE);
}
 
-   result = d_alloc_anon(inode);
-   if (!result) {
-   result = ERR_PTR(-ENOMEM);
-   goto out_iput;
-   }
-   return result;
+   return inode;
+}
 
- out_iput:
-   iput(inode);
-   return result;
+struct dentry *efs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+   efs_nfs_get_inode);
+}
+
+struct dentry *efs_fh_to_parent(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+   efs_nfs_get_inode);
 }
 
 struct dentry *efs_get_parent(struct dentry *child)
Index: linux-2.6/fs/efs/super.c
===
--- linux-2.6.orig/fs/efs/super.c   2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/efs/super.c2007-09-13 15:18:26.0 +0200
@@ -114,7 +114,8 @@ static const struct super_operations efs
 };
 
 static struct export_operations efs_export_ops = {
-   .get_dentry = efs_get_dentry,
+   .fh_to_dentry   = efs_fh_to_dentry,
+   .fh_to_parent   = efs_fh_to_parent,
.get_parent = efs_get_parent,
 };
 
Index: linux-2.6/include/linux/efs_fs.h
===
--- linux-2.6.orig/include/linux/efs_fs.h   2007-09-13 15:10:46.0 
+0200
+++ linux-2.6/include/linux/efs_fs.h2007-09-13 15:18:26.0 +0200
@@ -35,6 +35,7 @@ static inline struct efs_sb_info *SUPER_
 }
 
 struct statfs;
+struct fid;
 
 extern const struct inode_operations efs_dir_inode_operations;
 extern const struct file_operations efs_dir_operations;
@@ -45,7 +46,10 @@ extern efs_block_t efs_map_block(struct 
 extern int efs_get_block(struct inode *, sector_t, struct buffer_head *, int);
 
 extern struct dentry *efs_lookup(struct inode *, struct dentry *, struct 
nameidata *);
-extern struct dentry *efs_get_dentry(struct super_block *sb, void *vobjp);
+extern struct dentry *efs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type);
+extern struct dentry *efs_fh_to_parent(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type);
 extern struct dentry *efs_get_parent(struct dentry *);
 extern int efs_bmap(struct inode *, int);
 

--
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/18] ntfs: new export ops

2007-09-19 Thread Christoph Hellwig
Trivial switch over to the new generic helpers.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>


Index: linux-2.6/fs/ntfs/namei.c
===
--- linux-2.6.orig/fs/ntfs/namei.c  2007-09-13 15:10:45.0 +0200
+++ linux-2.6/fs/ntfs/namei.c   2007-09-13 15:20:12.0 +0200
@@ -450,58 +450,40 @@ try_next:
return parent_dent;
 }
 
-/**
- * ntfs_get_dentry - find a dentry for the inode from a file handle 
sub-fragment
- * @sb:super block identifying the mounted ntfs volume
- * @fh:the file handle sub-fragment
- *
- * Find a dentry for the inode given a file handle sub-fragment.  This function
- * is called from fs/exportfs/expfs.c::find_exported_dentry() which in turn is
- * called from the default ->decode_fh() which is export_decode_fh() in the
- * same file.  The code is closely based on the default ->get_dentry() helper
- * fs/exportfs/expfs.c::get_object().
- *
- * The @fh contains two 32-bit unsigned values, the first one is the inode
- * number and the second one is the inode generation.
- *
- * Return the dentry on success or the error code on error (IS_ERR() is true).
- */
-static struct dentry *ntfs_get_dentry(struct super_block *sb, void *fh)
+static struct inode *ntfs_nfs_get_inode(struct super_block *sb,
+   u64 ino, u32 generation)
 {
-   struct inode *vi;
-   struct dentry *dent;
-   unsigned long ino = ((u32 *)fh)[0];
-   u32 gen = ((u32 *)fh)[1];
-
-   ntfs_debug("Entering for inode 0x%lx, generation 0x%x.", ino, gen);
-   vi = ntfs_iget(sb, ino);
-   if (IS_ERR(vi)) {
-   ntfs_error(sb, "Failed to get inode 0x%lx.", ino);
-   return (struct dentry *)vi;
-   }
-   if (unlikely(is_bad_inode(vi) || vi->i_generation != gen)) {
-   /* We didn't find the right inode. */
-   ntfs_error(sb, "Inode 0x%lx, bad count: %d %d or version 0x%x "
-   "0x%x.", vi->i_ino, vi->i_nlink,
-   atomic_read(&vi->i_count), vi->i_generation,
-   gen);
-   iput(vi);
-   return ERR_PTR(-ESTALE);
-   }
-   /* Now find a dentry.  If possible, get a well-connected one. */
-   dent = d_alloc_anon(vi);
-   if (unlikely(!dent)) {
-   iput(vi);
-   return ERR_PTR(-ENOMEM);
+   struct inode *inode;
+
+   inode = ntfs_iget(sb, ino);
+   if (!IS_ERR(inode)) {
+   if (is_bad_inode(inode) || inode->i_generation != generation) {
+   iput(inode);
+   inode = ERR_PTR(-ESTALE);
+   }
}
-   ntfs_debug("Done for inode 0x%lx, generation 0x%x.", ino, gen);
-   return dent;
+
+   return inode;
+}
+
+static struct dentry *ntfs_fh_to_dentry(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+   ntfs_nfs_get_inode);
+}
+
+static struct dentry *ntfs_fh_to_parent(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+   ntfs_nfs_get_inode);
 }
 
 /**
  * Export operations allowing NFS exporting of mounted NTFS partitions.
  *
- * We use the default ->decode_fh() and ->encode_fh() for now.  Note that they
+ * We use the default ->encode_fh() for now.  Note that they
  * use 32 bits to store the inode number which is an unsigned long so on 64-bit
  * architectures is usually 64 bits so it would all fail horribly on huge
  * volumes.  I guess we need to define our own encode and decode fh functions
@@ -520,7 +502,6 @@ static struct dentry *ntfs_get_dentry(st
 struct export_operations ntfs_export_ops = {
.get_parent = ntfs_get_parent,  /* Find the parent of a given
   directory. */
-   .get_dentry = ntfs_get_dentry,  /* Find a dentry for the inode
-  given a file handle
-  sub-fragment. */
+   .fh_to_dentry   = ntfs_fh_to_dentry,
+   .fh_to_parent   = ntfs_fh_to_parent,
 };

--
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 02/18] exportfs: add new methods

2007-09-19 Thread Christoph Hellwig
Add the guts for the new filesystem API to exportfs.

There's now a fh_to_dentry method that returns a dentry for the 
object looked for given a filehandle fragment, and a fh_to_parent
operation that returns the dentry for the encoded parent directory
in case the file handle contains it.

There are default implementations for these methods that only take
a callback for an nfs-enhanced iget variant and implement the
rest of the semantics.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/include/linux/exportfs.h
===
--- linux-2.6.orig/include/linux/exportfs.h 2007-09-13 15:11:11.0 
+0200
+++ linux-2.6/include/linux/exportfs.h  2007-09-13 15:13:57.0 +0200
@@ -4,6 +4,7 @@
 #include 
 
 struct dentry;
+struct inode;
 struct super_block;
 struct vfsmount;
 
@@ -101,6 +102,21 @@ struct fid {
  *the filehandle fragment.  encode_fh() should return the number of bytes
  *stored or a negative error code such as %-ENOSPC
  *
+ * fh_to_dentry:
+ *@fh_to_dentry is given a &struct super_block (@sb) and a file handle
+ *fragment (@fh, @fh_len). It should return a &struct dentry which refers
+ *to the same file that the file handle fragment refers to.  If it cannot,
+ *it should return a %NULL pointer if the file was found but no acceptable
+ *&dentries were available, or an %ERR_PTR error code indicating why it
+ *couldn't be found (e.g. %ENOENT or %ENOMEM).  Any suitable dentry can be
+ *returned including, if necessary, a new dentry created with d_alloc_root.
+ *The caller can then find any other extant dentries by following the
+ *d_alias links.
+ *
+ * fh_to_parent:
+ *Same as @fh_to_dentry, except that it returns a pointer to the parent
+ *dentry if it was encoded into the filehandle fragment by @encode_fh.
+ *
  * get_name:
  *@get_name should find a name for the given @child in the given @parent
  *directory.  The name should be stored in the @name (with the
@@ -139,6 +155,10 @@ struct export_operations {
void *context);
int (*encode_fh)(struct dentry *de, __u32 *fh, int *max_len,
int connectable);
+   struct dentry * (*fh_to_dentry)(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type);
+   struct dentry * (*fh_to_parent)(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type);
int (*get_name)(struct dentry *parent, char *name,
struct dentry *child);
struct dentry * (*get_parent)(struct dentry *child);
@@ -161,4 +181,14 @@ extern struct dentry *exportfs_decode_fh
int fh_len, int fileid_type, int (*acceptable)(void *, struct dentry *),
void *context);
 
+/*
+ * Generic helpers for filesystems.
+ */
+extern struct dentry *generic_fh_to_dentry(struct super_block *sb,
+   struct fid *fid, int fh_len, int fh_type,
+   struct inode *(*get_inode) (struct super_block *sb, u64 ino, u32 gen));
+extern struct dentry *generic_fh_to_parent(struct super_block *sb,
+   struct fid *fid, int fh_len, int fh_type,
+   struct inode *(*get_inode) (struct super_block *sb, u64 ino, u32 gen));
+
 #endif /* LINUX_EXPORTFS_H */
Index: linux-2.6/fs/exportfs/expfs.c
===
--- linux-2.6.orig/fs/exportfs/expfs.c  2007-09-13 15:13:02.0 +0200
+++ linux-2.6/fs/exportfs/expfs.c   2007-09-13 15:14:42.0 +0200
@@ -514,17 +514,141 @@ struct dentry *exportfs_decode_fh(struct
int (*acceptable)(void *, struct dentry *), void *context)
 {
struct export_operations *nop = mnt->mnt_sb->s_export_op;
-   struct dentry *result;
+   struct dentry *result, *alias;
+   int err;
 
-   if (nop->decode_fh) {
-   result = nop->decode_fh(mnt->mnt_sb, fid->raw, fh_len,
+   /*
+* Old way of doing things.  Will go away soon.
+*/
+   if (!nop->fh_to_dentry) {
+   if (nop->decode_fh) {
+   return nop->decode_fh(mnt->mnt_sb, fid->raw, fh_len,
fileid_type, acceptable, context);
+   } else {
+   return export_decode_fh(mnt->mnt_sb, fid->raw, fh_len,
+   fileid_type, acceptable, context);
+   }
+   }
+
+   /*
+* Try to get any dentry for the given file handle from the filesystem.
+*/
+   result = nop->fh_to_dentry(mnt->mnt_sb, fid, fh_len, fileid_type);
+   if (!result)
+   result = ERR_PTR(-ESTALE);
+   if (IS_ERR(result))
+   return result;
+
+   if (S_ISDIR(result->d_inode->i_mode)) {
+   /*
+* This request is for a directory.
+*
+* On the positive side there is only one dentry for 

[PATCH 05/18] ext4: new export ops

2007-09-19 Thread Christoph Hellwig
Trivial switch over to the new generic helpers.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/ext4/super.c
===
--- linux-2.6.orig/fs/ext4/super.c  2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/ext4/super.c   2007-09-13 15:18:21.0 +0200
@@ -613,13 +613,10 @@ static int ext4_show_options(struct seq_
 }
 
 
-static struct dentry *ext4_get_dentry(struct super_block *sb, void *vobjp)
+static struct inode *ext4_nfs_get_inode(struct super_block *sb,
+   u64 ino, u32 generation)
 {
-   __u32 *objp = vobjp;
-   unsigned long ino = objp[0];
-   __u32 generation = objp[1];
struct inode *inode;
-   struct dentry *result;
 
if (ino < EXT4_FIRST_INO(sb) && ino != EXT4_ROOT_INO)
return ERR_PTR(-ESTALE);
@@ -642,15 +639,22 @@ static struct dentry *ext4_get_dentry(st
iput(inode);
return ERR_PTR(-ESTALE);
}
-   /* now to find a dentry.
-* If possible, get a well-connected one
-*/
-   result = d_alloc_anon(inode);
-   if (!result) {
-   iput(inode);
-   return ERR_PTR(-ENOMEM);
-   }
-   return result;
+
+   return inode;
+}
+
+static struct dentry *ext4_fh_to_dentry(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+   ext4_nfs_get_inode);
+}
+
+static struct dentry *ext4_fh_to_parent(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+   ext4_nfs_get_inode);
 }
 
 #ifdef CONFIG_QUOTA
@@ -720,8 +724,9 @@ static const struct super_operations ext
 };
 
 static struct export_operations ext4_export_ops = {
+   .fh_to_dentry = ext4_fh_to_dentry,
+   .fh_to_parent = ext4_fh_to_parent,
.get_parent = ext4_get_parent,
-   .get_dentry = ext4_get_dentry,
 };
 
 enum {

--
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/18] ext2: new export ops

2007-09-19 Thread Christoph Hellwig
Trivial switch over to the new generic helpers.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/ext2/super.c
===
--- linux-2.6.orig/fs/ext2/super.c  2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/ext2/super.c   2007-09-13 15:16:25.0 +0200
@@ -248,13 +248,10 @@ static const struct super_operations ext
 #endif
 };
 
-static struct dentry *ext2_get_dentry(struct super_block *sb, void *vobjp)
+static struct inode *ext2_nfs_get_inode(struct super_block *sb,
+   u64 ino, u32 generation)
 {
-   __u32 *objp = vobjp;
-   unsigned long ino = objp[0];
-   __u32 generation = objp[1];
struct inode *inode;
-   struct dentry *result;
 
if (ino < EXT2_FIRST_INO(sb) && ino != EXT2_ROOT_INO)
return ERR_PTR(-ESTALE);
@@ -275,15 +272,21 @@ static struct dentry *ext2_get_dentry(st
iput(inode);
return ERR_PTR(-ESTALE);
}
-   /* now to find a dentry.
-* If possible, get a well-connected one
-*/
-   result = d_alloc_anon(inode);
-   if (!result) {
-   iput(inode);
-   return ERR_PTR(-ENOMEM);
-   }
-   return result;
+   return inode;
+}
+
+static struct dentry *ext2_fh_to_dentry(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+   ext2_nfs_get_inode);
+}
+
+static struct dentry *ext2_fh_to_parent(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+   ext2_nfs_get_inode);
 }
 
 /* Yes, most of these are left as NULL!!
@@ -292,8 +295,9 @@ static struct dentry *ext2_get_dentry(st
  * Currently only get_parent is required.
  */
 static struct export_operations ext2_export_ops = {
+   .fh_to_dentry = ext2_fh_to_dentry,
+   .fh_to_parent = ext2_fh_to_parent,
.get_parent = ext2_get_parent,
-   .get_dentry = ext2_get_dentry,
 };
 
 static unsigned long get_sb_block(void **data)

--
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/18] ext3: new export ops

2007-09-19 Thread Christoph Hellwig
Trivial switch over to the new generic helpers.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/ext3/super.c
===
--- linux-2.6.orig/fs/ext3/super.c  2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/ext3/super.c   2007-09-13 15:16:57.0 +0200
@@ -562,13 +562,10 @@ static int ext3_show_options(struct seq_
 }
 
 
-static struct dentry *ext3_get_dentry(struct super_block *sb, void *vobjp)
+static struct inode *ext3_nfs_get_inode(struct super_block *sb,
+   u64 ino, u32 generation)
 {
-   __u32 *objp = vobjp;
-   unsigned long ino = objp[0];
-   __u32 generation = objp[1];
struct inode *inode;
-   struct dentry *result;
 
if (ino < EXT3_FIRST_INO(sb) && ino != EXT3_ROOT_INO)
return ERR_PTR(-ESTALE);
@@ -591,15 +588,22 @@ static struct dentry *ext3_get_dentry(st
iput(inode);
return ERR_PTR(-ESTALE);
}
-   /* now to find a dentry.
-* If possible, get a well-connected one
-*/
-   result = d_alloc_anon(inode);
-   if (!result) {
-   iput(inode);
-   return ERR_PTR(-ENOMEM);
-   }
-   return result;
+
+   return inode;
+}
+
+static struct dentry *ext3_fh_to_dentry(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+   ext3_nfs_get_inode);
+}
+
+static struct dentry *ext3_fh_to_parent(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+   ext3_nfs_get_inode);
 }
 
 #ifdef CONFIG_QUOTA
@@ -669,8 +673,9 @@ static const struct super_operations ext
 };
 
 static struct export_operations ext3_export_ops = {
+   .fh_to_dentry = ext3_fh_to_dentry,
+   .fh_to_parent = ext3_fh_to_parent,
.get_parent = ext3_get_parent,
-   .get_dentry = ext3_get_dentry,
 };
 
 enum {

--
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/18] exportfs: add fid type

2007-09-19 Thread Christoph Hellwig
Add a structured fid type so that we don't have to pass an array
of u32 values around everywhere.  It's a union of possible layouts.

As a start there's only the u32 array and the traditional 32bit
inode format, but there will be more in one of my next patchset
when I start to document the various filehandle formats we have
in lowlevel filesystems better.

Also add an enum that gives the various filehandle types human-
readable names.

Note:  Some people might think the struct containing an anonymous
union is ugly, but I didn't want to pass around a raw union type.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/include/linux/exportfs.h
===
--- linux-2.6.orig/include/linux/exportfs.h 2007-09-13 15:10:59.0 
+0200
+++ linux-2.6/include/linux/exportfs.h  2007-09-13 15:11:11.0 +0200
@@ -7,6 +7,44 @@ struct dentry;
 struct super_block;
 struct vfsmount;
 
+/*
+ * The fileid_type identifies how the file within the filesystem is encoded.
+ * In theory this is freely set and parsed by the filesystem, but we try to
+ * stick to conventions so we can share some generic code and don't confuse
+ * sniffers like ethereal/wireshark.
+ *
+ * The filesystem must not use the value '0' or '0xff'.
+ */
+enum fid_type {
+   /*
+* The root, or export point, of the filesystem.
+* (Never actually passed down to the filesystem.
+*/
+   FILEID_ROOT = 0,
+
+   /*
+* 32bit inode number, 32 bit generation number.
+*/
+   FILEID_INO32_GEN = 1,
+
+   /*
+* 32bit inode number, 32 bit generation number,
+* 32 bit parent directory inode number.
+*/
+   FILEID_INO32_GEN_PARENT = 2,
+};
+
+struct fid {
+   union {
+   struct {
+   u32 ino;
+   u32 gen;
+   u32 parent_ino;
+   u32 parent_gen;
+   } i32;
+   __u32 raw[6];
+   };
+};
 
 /**
  * struct export_operations - for nfsd to communicate with file systems
@@ -117,9 +155,9 @@ extern struct dentry *find_exported_dent
void *parent, int (*acceptable)(void *context, struct dentry *de),
void *context);
 
-extern int exportfs_encode_fh(struct dentry *dentry, __u32 *fh, int *max_len,
-   int connectable);
-extern struct dentry *exportfs_decode_fh(struct vfsmount *mnt, __u32 *fh,
+extern int exportfs_encode_fh(struct dentry *dentry, struct fid *fid,
+   int *max_len, int connectable);
+extern struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
int fh_len, int fileid_type, int (*acceptable)(void *, struct dentry *),
void *context);
 
Index: linux-2.6/fs/nfsd/nfsfh.c
===
--- linux-2.6.orig/fs/nfsd/nfsfh.c  2007-09-13 15:10:59.0 +0200
+++ linux-2.6/fs/nfsd/nfsfh.c   2007-09-13 15:13:32.0 +0200
@@ -115,8 +115,7 @@ fh_verify(struct svc_rqst *rqstp, struct
dprintk("nfsd: fh_verify(%s)\n", SVCFH_fmt(fhp));
 
if (!fhp->fh_dentry) {
-   __u32 *datap=NULL;
-   __u32 tfh[3];   /* filehandle fragment for oldstyle 
filehandles */
+   struct fid *fid = NULL, sfid;
int fileid_type;
int data_left = fh->fh_size/4;
 
@@ -128,7 +127,6 @@ fh_verify(struct svc_rqst *rqstp, struct
 
if (fh->fh_version == 1) {
int len;
-   datap = fh->fh_auth;
if (--data_left<0) goto out;
switch (fh->fh_auth_type) {
case 0: break;
@@ -144,9 +142,11 @@ fh_verify(struct svc_rqst *rqstp, struct
fh->fh_fsid[1] = fh->fh_fsid[2];
}
if ((data_left -= len)<0) goto out;
-   exp = rqst_exp_find(rqstp, fh->fh_fsid_type, datap);
-   datap += len;
+   exp = rqst_exp_find(rqstp, fh->fh_fsid_type,
+   fh->fh_auth);
+   fid = (struct fid *)(fh->fh_auth + len);
} else {
+   __u32 tfh[2];
dev_t xdev;
ino_t xino;
if (fh->fh_size != NFS_FHSIZE)
@@ -190,22 +190,22 @@ fh_verify(struct svc_rqst *rqstp, struct
error = nfserr_badhandle;
 
if (fh->fh_version != 1) {
-   tfh[0] = fh->ofh_ino;
-   tfh[1] = fh->ofh_generation;
-   tfh[2] = fh->ofh_dirino;
-   datap = tfh;
+   sfid.i32.ino = fh->ofh_ino;
+   sfid.i32.gen = fh->ofh_generation;
+   sfid.i32.parent_ino = fh->ofh_dirino;
+   f

[PATCH 00/18] export operations rewrite

2007-09-19 Thread Christoph Hellwig
This patchset is a medium scale rewrite of the export operations
interface.  The goal is to make the interface less complex, and
easier to understand from the filesystem side, aswell as preparing
generic support for exporting of 64bit inode numbers.

This touches all nfs exporting filesystems, and I've done testing
on all of the filesystems I have here locally (xfs, ext2, ext3, reiserfs,   
jfs)

Unlike the last posting this series is against -mm (2.6.23-rc6-mm1
specificly), so the first patch has been dropped as it's in git-xfs
already, and the XFS conversion has gotten some major context changes
due to the changes in git-xfs.  Except for those and some typo fixes
in the documentation patch it's identical to the previous one.

This series is sent for inclusion into -mm.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] fix setuid/setgid clearing in networked filesystems (take 6)

2007-09-19 Thread Jeff Layton
On Mon, 17 Sep 2007 07:29:05 -0400
Jeff Layton <[EMAIL PROTECTED]> wrote:

> This patchset is the latest one for fixing the clearing of setuid/setgid
> bits in networked filesystems. It should apply cleanly to 2.6.23-rc4-mm1.
> This is basically the same patchset as take 5. The main differences are
> that the patches have been reordered to make the tree cleanly bisectable,
> and the comment in notify_change is now a bit more descriptive. I've also
> moved the description of the main rationale for the patch into patch 5.
> 
> Andrew, would it be possible to go ahead and get this committed to -mm?
> 
> Signed-off-by: Jeff Layton <[EMAIL PROTECTED]>

Andrew,
  This will need some changes before it can apply cleanly to
2.6.23-rc6-mm1. You can ignore this for now. I'll respin, retest and
repost...

Thanks,
-- 
Jeff Layton <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 13/26] SLUB: Add SlabReclaimable() to avoid repeated reclaim attempts

2007-09-19 Thread Rik van Riel

Christoph Lameter wrote:

Add a flag SlabReclaimable() that is set on slabs with a method
that allows defrag/reclaim. Clear the flag if a reclaim action is not
successful in reducing the number of objects in a slab. The reclaim
flag is set again if all objects have been allocated from it.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 mm/slub.c |   42 --
 1 file changed, 36 insertions(+), 6 deletions(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-08-28 20:10:37.0 -0700
+++ linux-2.6/mm/slub.c 2007-08-28 20:10:47.0 -0700
@@ -107,6 +107,8 @@
 #define SLABDEBUG 0
 #endif
 
+#define SLABRECLAIMABLE (1 << PG_dirty)

+
 static inline int SlabFrozen(struct page *page)
 {
return page->flags & FROZEN;
@@ -137,6 +139,21 @@ static inline void ClearSlabDebug(struct
page->flags &= ~SLABDEBUG;
 }
 
+static inline int SlabReclaimable(struct page *page)

+{
+   return page->flags & SLABRECLAIMABLE;
+}
+
+static inline void SetSlabReclaimable(struct page *page)
+{
+   page->flags |= SLABRECLAIMABLE;
+}
+
+static inline void ClearSlabReclaimable(struct page *page)
+{
+   page->flags &= ~SLABRECLAIMABLE;
+}


Why is it safe to not use the normal page flag bit operators
for these page flags operations?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-19 Thread Andrea Arcangeli
On Wed, Sep 19, 2007 at 03:09:10PM +1000, David Chinner wrote:
> Ok, let's step back for a moment and look at a basic, fundamental
> constraint of disks - seek capacity. A decade ago, a terabyte of
> filesystem had 30 disks behind it - a seek capacity of about
> 6000 seeks/s. Nowdays, that's a single disk with a seek
> capacity of about 200/s. We're going *rapidly* backwards in
> terms of seek capacity per terabyte of storage.
> 
> Now fill that terabyte of storage and index it in the most efficient
> way - let's say btrees are used because lots of filesystems use
> them. Hence the depth of the tree is roughly O((log n)/m) where m is
> a factor of the btree block size.  Effectively, btree depth = seek
> count on lookup of any object.

I agree. btrees will clearly benefit if the nodes are larger. We've an
excess of disk capacity and an huge gap between seeking and contiguous
bandwidth.

You don't need largepages for this, fsblocks is enough.

Largepages for you are a further improvement to avoid reducing the SG
entries and potentially reducing the cpu utilization a bit (not much
though, only the pagecache works with largepages and especially with
small sized random I/O you'll be taking the radix tree lock the same
number of times...).

Plus of course you don't like fsblock because it requires work to
adapt a fs to it, I can't argue about that.

> Ok, so let's set the record straight. There were 3 justifications
> for using *large pages* to *support* large filesystem block sizes
> The justifications for the variable order page cache with large
> pages were:
> 
>   1. little code change needed in the filesystems
>   -> still true

Disagree, the mmap side is not a little change. If you do it just for
the not-mmapped I/O that truly is an hack, but then frankly I would
prefer only the read/write hack (without mmap) so it will not reject
heavily with my stuff and it'll be quicker to nuke it out of the
kernel later.

>   3. avoiding the need for vmap() as it has great
>  overhead and does not scale
>   -> Nick is starting to work on that and has
>  already had good results.

Frankly I don't follow this vmap thing. Can you elaborate? Is this
about allowing the blkdev pagecache for metadata to go in highmemory?
Is that the kmap thing? I think we can stick to a direct mapped b_data
and avoid all overhead of converting a struct page to a virtual
address. It takes the same 64bit size anyway in ram and we avoid one
layer of indirection and many modifications. If we wanted to switch to
kmap for blkdev pagecache we should have done years ago, now it's far
too late to worry about it.

> Everyone seems to be focussing on #2 as the entire justification for
> large block sizes in filesystems and that this is an "SGI" problem.

I agree it's not a SGI problem and this is why I want a design that
has a _slight chance_ to improve performance on x86-64 too. If
variable order page cache will provide any further improvement on top
of fsblock will be only because your I/O device isn't fast with small
sg entries.

For the I/O layout the fsblock is more than enough, but I don't think
your variable order page cache will help in any significant way on
x86-64. Furthermore the complexity of handle page faults on largepages
is almost equivalent to the complexity of config-page-shift, but
config-page-shift gives you the whole cpu-saving benefits that you can
never remotely hope to achieve with variable order page cache.

config-page-shift + fsblock IMHO is the way to go for x86-64, with one
additional 64k PAGE_SIZE rpm. config-page-shift will stack nicely on
top of fsblocks.

fsblock will provide the guarantee of "mounting" all fs anywhere no
matter which config-page-shift you selected at compile time, as well
as dvd writing. Then config-page-shift will provide the cpu
optimization on all fronts, not just for the pagecache I/O for the
large ram systems, without fragmentation issues and with 100%
reliability in the "free" numbers (not working by luck). That's all we
need as far as I can tell.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries

2007-09-19 Thread Anton Altaparmakov

On 19 Sep 2007, at 10:19, David Rientjes wrote:

On Wed, 19 Sep 2007, Anton Altaparmakov wrote:
Although it may cause a problem as highmem.h also includes mm.h so  
a bit of

trickery may be needed to get it to compile...

I suspect that is_vmalloc_addr() should not be in linux/mm.h at  
all and should
be in linux/vmalloc.h instead and vmalloc.h should include linux/ 
highmem.h.
That would be more sensible than sticking a vmalloc related  
function into

linux/mm.h where it does not belong...


That is why I suggested include/linux/vmalloc.h as its home in the  
first
place.  And no, adding an include for linux/highmem.h (and asm/ 
pgtable.h)

to linux/vmalloc.h does not work.


I am sure Christoph can figure out somewhere that it will work.   
After all, the code in that function already exists both as another  
function and open coded in several places and compiles fine there...


Best regards,

Anton
--
Anton Altaparmakov  (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Jfs-discussion] [2/4] 2.6.23-rc6: known regressions

2007-09-19 Thread Dave Kleikamp
On Wed, 2007-09-19 at 13:44 +0200, Oliver Neukum wrote:
> Am Donnerstag 13 September 2007 schrieb Dave Kleikamp:
> > On Wed, 2007-09-12 at 18:58 +0200, Michal Piotrowski wrote:
> > 
> > > Subject : umount triggers a warning in jfs and takes almost a 
> > > minute
> > > References  : http://lkml.org/lkml/2007/9/4/73
> > > Last known good : ?
> > > Submitter   : Oliver Neukum <[EMAIL PROTECTED]>
> > > Caused-By   : ?
> > > Handled-By  : ?
> > > Status  : unknown
> > 
> > I'm still waiting to hear from Oliver whether or not this is actually a
> > regression.
> 
> OK, I've done test. On 2.6.22 I was unable to trigger the warning.
> On 2.6.23-rc6 I get the following warning in about 3/4 of all unmounts:
> 
> Sep 19 13:08:04 oenone kernel: WARNING: at fs/jfs/jfs_logmgr.c:1643 
> jfs_flush_journal()
> Sep 19 13:08:04 oenone kernel:
> Sep 19 13:08:04 oenone kernel: Call Trace:
> Sep 19 13:08:04 oenone kernel:  [] 
> :jfs:jfs_flush_journal+0x26a/0x27d
> Sep 19 13:08:04 oenone kernel:  [] dispose_list+0xde/0xf7
> Sep 19 13:08:04 oenone kernel:  [] :jfs:jfs_umount+0x30/0xe5
> Sep 19 13:08:04 oenone kernel:  [] 
> :jfs:jfs_put_super+0xd/0x5e
> Sep 19 13:08:04 oenone kernel:  [] 
> generic_shutdown_super+0x60/0xf0
> Sep 19 13:08:04 oenone kernel:  [] kill_block_super+0xd/0x1e
> Sep 19 13:08:04 oenone kernel:  [] 
> deactivate_super+0x6a/0x82
> Sep 19 13:08:04 oenone kernel:  [] sys_umount+0x249/0x25a
> Sep 19 13:08:04 oenone kernel:  [] 
> audit_syscall_entry+0x141/0x174
> Sep 19 13:08:04 oenone kernel:  [] tracesys+0xdc/0xe1
> Sep 19 13:08:04 oenone kernel:
> 
> This is on a partition of a usb mass storage device with 2K sector size.
> IMHO it is a regression. I am recompiling with CONFIG_JFS_DEBUG.

Looks like a regression then.  I'll take a look at the debug output and get 
back to you.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Jfs-discussion] [2/4] 2.6.23-rc6: known regressions

2007-09-19 Thread Oliver Neukum
Am Donnerstag 13 September 2007 schrieb Dave Kleikamp:
> On Wed, 2007-09-12 at 18:58 +0200, Michal Piotrowski wrote:
> 
> > Subject         : umount triggers a warning in jfs and takes almost a minute
> > References      : http://lkml.org/lkml/2007/9/4/73
> > Last known good : ?
> > Submitter       : Oliver Neukum <[EMAIL PROTECTED]>
> > Caused-By       : ?
> > Handled-By      : ?
> > Status          : unknown
> 
> I'm still waiting to hear from Oliver whether or not this is actually a
> regression.

OK, I've done test. On 2.6.22 I was unable to trigger the warning.
On 2.6.23-rc6 I get the following warning in about 3/4 of all unmounts:

Sep 19 13:08:04 oenone kernel: WARNING: at fs/jfs/jfs_logmgr.c:1643 
jfs_flush_journal()
Sep 19 13:08:04 oenone kernel:
Sep 19 13:08:04 oenone kernel: Call Trace:
Sep 19 13:08:04 oenone kernel:  [] 
:jfs:jfs_flush_journal+0x26a/0x27d
Sep 19 13:08:04 oenone kernel:  [] dispose_list+0xde/0xf7
Sep 19 13:08:04 oenone kernel:  [] :jfs:jfs_umount+0x30/0xe5
Sep 19 13:08:04 oenone kernel:  [] :jfs:jfs_put_super+0xd/0x5e
Sep 19 13:08:04 oenone kernel:  [] 
generic_shutdown_super+0x60/0xf0
Sep 19 13:08:04 oenone kernel:  [] kill_block_super+0xd/0x1e
Sep 19 13:08:04 oenone kernel:  [] deactivate_super+0x6a/0x82
Sep 19 13:08:04 oenone kernel:  [] sys_umount+0x249/0x25a
Sep 19 13:08:04 oenone kernel:  [] 
audit_syscall_entry+0x141/0x174
Sep 19 13:08:04 oenone kernel:  [] tracesys+0xdc/0xe1
Sep 19 13:08:04 oenone kernel:

This is on a partition of a usb mass storage device with 2K sector size.
IMHO it is a regression. I am recompiling with CONFIG_JFS_DEBUG.

Regards
Oliver

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/17] [RFC] Virtual Compound Page Support

2007-09-19 Thread Eric Dumazet
On Wed, 19 Sep 2007 08:34:47 +0100
Anton Altaparmakov <[EMAIL PROTECTED]> wrote:

> Hi Christoph,
> 
> On 19 Sep 2007, at 04:36, Christoph Lameter wrote:
> > Currently there is a strong tendency to avoid larger page  
> > allocations in
> > the kernel because of past fragmentation issues and the current
> > defragmentation methods are still evolving. It is not clear to what  
> > extend
> > they can provide reliable allocations for higher order pages (plus the
> > definition of "reliable" seems to be in the eye of the beholder).
> >
> > Currently we use vmalloc allocations in many locations to provide a  
> > safe
> > way to allocate larger arrays. That is due to the danger of higher  
> > order
> > allocations failing. Virtual Compound pages allow the use of regular
> > page allocator allocations that will fall back only if there is an  
> > actual
> > problem with acquiring a higher order page.
> >
> > This patch set provides a way for a higher page allocation to fall  
> > back.
> > Instead of a physically contiguous page a virtually contiguous page
> > is provided. The functionality of the vmalloc layer is used to provide
> > the necessary page tables and control structures to establish a  
> > virtually
> > contiguous area.
> 
> I like this a lot.  It will get rid of all the silly games we have to  
> play when needing both large allocations and efficient allocations  
> where possible.  In NTFS I can then just allocated higher order pages  
> instead of having to mess about with the allocation size and  
> allocating a single page if the requested size is <= PAGE_SIZE or  
> using vmalloc() if the size is bigger.  And it will make it faster  
> because a lot of the time a higher order page allocation will succeed  
> with your patchset without resorting to vmalloc() so that will be a  
> lot faster.
> 
> So where I currently have fs/ntfs/malloc.h the below mess I could get  
> rid of it completely and just use the normal page allocator/ 
> deallocator instead...
> 
> static inline void *__ntfs_malloc(unsigned long size, gfp_t gfp_mask)
> {
>  if (likely(size <= PAGE_SIZE)) {
>  BUG_ON(!size);
>  /* kmalloc() has per-CPU caches so is faster for  
> now. */
>  return kmalloc(PAGE_SIZE, gfp_mask & ~__GFP_HIGHMEM);
>  /* return (void *)__get_free_page(gfp_mask); */
>  }
>  if (likely(size >> PAGE_SHIFT < num_physpages))
>  return __vmalloc(size, gfp_mask, PAGE_KERNEL);
>  return NULL;
> }
> 
> And other places in the kernel can make use of the same.  I think XFS  
> does very similar things to NTFS in terms of larger allocations at  
> least and there are probably more places I don't know about off the  
> top of my head...
> 
> I am looking forward to your patchset going into mainline.  (-:

Sure, it sounds *really* good. But...

1) Only power of two allocations are good candidates, or we waste RAM

2) On i386 machines, we have a small vmalloc window. (128 MB default value)
  Many servers with >4GB memory (PAE) like to boot with vmalloc=32M option to 
get 992MB of LOWMEM.
  If we allow some slub caches to fallback to vmalloc land, we'll have problems 
to tune this.

3) A fallback to vmalloc means an allocation of one vm_struct per compound page.

4) vmalloc() currently uses a linked list of vm_struct. Might need something 
more scalable.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-19 Thread Nick Piggin
On Wednesday 19 September 2007 04:30, Linus Torvalds wrote:
> On Tue, 18 Sep 2007, Nick Piggin wrote:
> > ROFL! Yeah of course, how could I have forgotten about our trusty OOM
> > killer as the solution to the fragmentation problem? It would only have
> > been funnier if you had said to reboot every so often when memory gets
> > fragmented :)
>
> Can we please stop this *idiotic* thread.
>
> Nick, you and some others seem to be arguing based on a totally flawed
> base, namely:
>  - we can guarantee anything at all in the VM
>  - we even care about the 16kB blocksize
>  - second-class citizenry is "bad"
>
> The fact is, *none* of those things are true. The VM doesn't guarantee
> anything, and is already very much about statistics in many places. You
> seem to be arguing as if Christoph was introducing something new and
> unacceptable, when it's largely just more of the same.

I will stop this idiotic thread.

However, at the VM and/or vm/fs things we had, I was happy enough
for this thing of Christoph's to get merged. Actually I even didn't care
if it had mmap support, so long as it solved their problem.

But a solution to the general problem of VM and IO scalability, it is not.
IMO.


> And the fact is, nobody but SGI customers would ever want the 16kB
> blocksize. IOW - NONE OF THIS MATTERS!

Maybe. Maybe not.


> Can you guys stop this inane thread already, or at least take it private
> between you guys, instead of forcing everybody else to listen in on your
> flamefest.

Will do. Sorry.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-19 Thread Alex Tomas
On 9/19/07, David Chinner <[EMAIL PROTECTED]> wrote:
> The problem is this: to alter the fundamental block size of the
> filesystem we also need to alter the data block size and that is
> exactly the piece that linux does not support right now.  So while
> we have the capability to use large block sizes in certain
> filesystems, we can't use that capability until the data path
> supports it.

it's much simpler to teach fs to understand multipage data (like
multipage bitmap scan, multipage extent search, etc) then deal with mm
fragmentation. IMHO. at same time you don't bust IO traffic with
non-used space.

-- 
thanks, Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries

2007-09-19 Thread David Rientjes
On Wed, 19 Sep 2007, Anton Altaparmakov wrote:

> Although it may cause a problem as highmem.h also includes mm.h so a bit of
> trickery may be needed to get it to compile...
> 
> I suspect that is_vmalloc_addr() should not be in linux/mm.h at all and should
> be in linux/vmalloc.h instead and vmalloc.h should include linux/highmem.h.
> That would be more sensible than sticking a vmalloc related function into
> linux/mm.h where it does not belong...
> 

That is why I suggested include/linux/vmalloc.h as its home in the first 
place.  And no, adding an include for linux/highmem.h (and asm/pgtable.h) 
to linux/vmalloc.h does not work.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries

2007-09-19 Thread Anton Altaparmakov

On 19 Sep 2007, at 09:09, David Rientjes wrote:

On Wed, 19 Sep 2007, Anton Altaparmakov wrote:

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h   2007-09-17 21:46:06.0
-0700
+++ linux-2.6/include/linux/mm.h	2007-09-17 23:56:54.0  
-0700

@@ -1158,6 +1158,14 @@ static inline unsigned long vma_pages(st
return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
}

+/* Determine if an address is within the vmalloc range */
+static inline int is_vmalloc_addr(const void *x)
+{
+   unsigned long addr = (unsigned long)x;
+
+   return addr >= VMALLOC_START && addr < VMALLOC_END;
+}


This breaks on i386 because VMALLOC_END is defined in terms of  
PKMAP_BASE

in the CONFIG_HIGHMEM case.


That is incorrect.  This works perfectly on i386 and on ALL  
architectures
supported by Linux.  A lot of places in the kernel already do this  
today

(mostly hand coded though, eg XFS and NTFS)...


Hmm, really?

After applying patches 1-3 in this series and compiling on my i386  
with

defconfig, I get this:

In file included from include/linux/suspend.h:11,
 from arch/i386/kernel/asm-offsets.c:11:
include/linux/mm.h: In function 'is_vmalloc_addr':
include/linux/mm.h:1166: error: 'PKMAP_BASE' undeclared (first use  
in this function)
include/linux/mm.h:1166: error: (Each undeclared identifier is  
reported only once

include/linux/mm.h:1166: error: for each function it appears in.)

so I don't know what you're talking about.


Just a compile failure not inherently broken!

Add:

#include  to the top of linux/mm.h and it should  
compile just fine.


Although it may cause a problem as highmem.h also includes mm.h so a  
bit of trickery may be needed to get it to compile...


I suspect that is_vmalloc_addr() should not be in linux/mm.h at all  
and should be in linux/vmalloc.h instead and vmalloc.h should include  
linux/highmem.h.  That would be more sensible than sticking a vmalloc  
related function into linux/mm.h where it does not belong...


Best regards,

Anton
--
Anton Altaparmakov  (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/17] [RFC] Virtual Compound Page Support

2007-09-19 Thread Andi Kleen
Christoph Lameter <[EMAIL PROTECTED]> writes:

It seems like a good idea simply because the same functionality
is already open coded in a couple of places and unifying
that would be a good thing. But ...
 
> The patchset provides this functionality in stages. Stage 1 introduces
> the basic fall back mechanism necessary to replace vmalloc allocations
> with
> 
>   alloc_page(GFP_VFALLBACK, order, )

Is there a reason this needs to be a GFP flag versus a wrapper
around alloc_page/free_page ?  page_alloc.c is already too complicated
and it's better to keep new features separated. The only drawback
would be that free_pages would need a different call, but that
doesn't seem like a big problem.

Especially integrating it into slab would seem wrong to me.
slab is already too complicated and for users who need that
large areas page granuality rounding to pages is probably fine.

Also such a wrapper could do the old alloc_page_exact() trick:
instead of always rounding up to next order return the left over
pages to the VM. In some cases this can save significant memory.

I'm also a little dubious about your attempts to do vmalloc in
interrupt context. Is that really needed? GFP_ATOMIC allocations of
large areas seem to be extremly unreliable to me and not design. Even
if it works sometimes free probably wouldn't work there due to the
flushes, which is very nasty. It would be better to drop that.

-Andi



> which signifies to the page allocator that a higher order is to be found
> but a virtual mapping may stand in if there is an issue with fragmentation.
> 
> Stage 1 functionality does not allow allocation and freeing of virtual
> mappings from interrupt contexts.
> 
> The stage 1 series ends with the conversion of a few key uses of vmalloc
> in the VM to alloc_pages() for the allocation of sparsemems memmap table
> and the wait table in each zone. Other uses of vmalloc could be converted
> in the same way.
> 
> 
> Stage 2 functionality enhances the fallback even more allowing allocation
> and frees in interrupt context.
> 
> SLUB is then modified to use the virtual mappings for slab caches
> that are marked with SLAB_VFALLBACK. If a slab cache is marked this way
> then we drop all the restraints regarding page order and allocate
> good large memory areas that fit lots of objects so that we rarely
> have to use the slow paths.
> 
> Two slab caches--the dentry cache and the buffer_heads--are then flagged
> that way. Others could be converted in the same way.
> 
> The patch set also provides a debugging aid through setting
> 
>   CONFIG_VFALLBACK_ALWAYS
> 
> If set then all GFP_VFALLBACK allocations fall back to the virtual
> mappings. This is useful for verification tests. The test of this
> patch set was done by enabling that options and compiling a kernel.
> 
> 
> Stage 3 functionality could be the adding of support for the large
> buffer size patchset. Not done yet and not sure if it would be useful
> to do.
> 
> Much of this patchset may only be needed for special cases in which the
> existing defragmentation methods fail for some reason. It may be better to
> have the system operate without such a safety net and make sure that the
> page allocator can return large orders in a reliable way.
> 
> The initial idea for this patchset came from Nick Piggin's fsblock
> and from his arguments about reliability and guarantees. Since his
> fsblock uses the virtual mappings I think it is legitimate to
> generalize the use of virtual mappings to support higher order
> allocations in this way. The application of these ideas to the large
> block size patchset etc are straightforward. If wanted I can base
> the next rev of the largebuffer patchset on this one and implement
> fallback.
> 
> Contrary to Nick, I still doubt that any of this provides a "guarantee".
> Have said that I have to deal with various failure scenarios in the VM
> daily and I'd certainly like to see it work in a more reliable manner.
> 
> IMHO getting rid of the various workarounds to deal with the small 4k
> pages and avoiding additional layers that group these pages in subsystem
> specific ways is something that can simplify the kernel and make the
> kernel more reliable overall.
> 
> If people feel that a virtual fall back is needed then so be it. Maybe
> we can shed our security blanket later when the approaches to deal
> with fragmentation have matured.
> 
> The patch set is also available via git from the largeblock git tree via
> 
> git pull
>   git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git
> vcompound
> 
> -- 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries

2007-09-19 Thread David Rientjes
On Wed, 19 Sep 2007, Anton Altaparmakov wrote:

> > > Index: linux-2.6/include/linux/mm.h
> > > ===
> > > --- linux-2.6.orig/include/linux/mm.h 2007-09-17 21:46:06.0
> > > -0700
> > > +++ linux-2.6/include/linux/mm.h  2007-09-17 23:56:54.0 -0700
> > > @@ -1158,6 +1158,14 @@ static inline unsigned long vma_pages(st
> > >   return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> > > }
> > > 
> > > +/* Determine if an address is within the vmalloc range */
> > > +static inline int is_vmalloc_addr(const void *x)
> > > +{
> > > + unsigned long addr = (unsigned long)x;
> > > +
> > > + return addr >= VMALLOC_START && addr < VMALLOC_END;
> > > +}
> > 
> > This breaks on i386 because VMALLOC_END is defined in terms of PKMAP_BASE
> > in the CONFIG_HIGHMEM case.
> 
> That is incorrect.  This works perfectly on i386 and on ALL architectures
> supported by Linux.  A lot of places in the kernel already do this today
> (mostly hand coded though, eg XFS and NTFS)...
> 

Hmm, really?

After applying patches 1-3 in this series and compiling on my i386 with 
defconfig, I get this:

In file included from include/linux/suspend.h:11,
 from arch/i386/kernel/asm-offsets.c:11:
include/linux/mm.h: In function 'is_vmalloc_addr':
include/linux/mm.h:1166: error: 'PKMAP_BASE' undeclared (first use in this 
function)
include/linux/mm.h:1166: error: (Each undeclared identifier is reported only 
once
include/linux/mm.h:1166: error: for each function it appears in.)

so I don't know what you're talking about.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [05/17] vunmap: return page array

2007-09-19 Thread KAMEZAWA Hiroyuki
On Tue, 18 Sep 2007 20:36:10 -0700
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> Make vunmap return the page array that was used at vmap. This is useful
> if one has no structures to track the page array but simply stores the
> virtual address somewhere. The disposition of the page array can be
> decided upon after vunmap. vfree() may now also be used instead of
> vunmap which will release the page array after vunmap'ping it.
> 
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> 

Hmm, I don't like returning array which someone allocated in past and forgot.
And, area->page[] array under vmalloc() is allocated as following (in -rc6-mm1)
==
   if (array_size > PAGE_SIZE) {
pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO,
PAGE_KERNEL, node);
area->flags |= VM_VPAGES;
} else {
pages = kmalloc_node(array_size,
(gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO,
node);
}
==
A bit complicating.

At least, please add comments "How to free page-array returned by vummap"

Thanks,
-Kame

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [03/17] is_vmalloc_addr(): Check if an address is within the vmalloc boundaries

2007-09-19 Thread Anton Altaparmakov

On 19 Sep 2007, at 07:32, David Rientjes wrote:

On Tue, 18 Sep 2007, Christoph Lameter wrote:

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h	2007-09-17  
21:46:06.0 -0700

+++ linux-2.6/include/linux/mm.h2007-09-17 23:56:54.0 -0700
@@ -1158,6 +1158,14 @@ static inline unsigned long vma_pages(st
return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 }

+/* Determine if an address is within the vmalloc range */
+static inline int is_vmalloc_addr(const void *x)
+{
+   unsigned long addr = (unsigned long)x;
+
+   return addr >= VMALLOC_START && addr < VMALLOC_END;
+}


This breaks on i386 because VMALLOC_END is defined in terms of  
PKMAP_BASE

in the CONFIG_HIGHMEM case.


That is incorrect.  This works perfectly on i386 and on ALL  
architectures supported by Linux.  A lot of places in the kernel  
already do this today (mostly hand coded though, eg XFS and NTFS)...


There even is such a function already in mm/ 
sparse.c::vaddr_in_vmalloc_area() with pretty identical content.  I  
would suggest either this new inline should go away completely and  
use the existing one and export it or the existing one should go away  
and the inline should be used.  It seems silly to have two functions  
with different names doing exactly the same thing!


Best regards,

Anton
--
Anton Altaparmakov  (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/17] [RFC] Virtual Compound Page Support

2007-09-19 Thread Anton Altaparmakov

Hi Christoph,

On 19 Sep 2007, at 04:36, Christoph Lameter wrote:
Currently there is a strong tendency to avoid larger page  
allocations in

the kernel because of past fragmentation issues and the current
defragmentation methods are still evolving. It is not clear to what  
extend

they can provide reliable allocations for higher order pages (plus the
definition of "reliable" seems to be in the eye of the beholder).

Currently we use vmalloc allocations in many locations to provide a  
safe
way to allocate larger arrays. That is due to the danger of higher  
order

allocations failing. Virtual Compound pages allow the use of regular
page allocator allocations that will fall back only if there is an  
actual

problem with acquiring a higher order page.

This patch set provides a way for a higher page allocation to fall  
back.

Instead of a physically contiguous page a virtually contiguous page
is provided. The functionality of the vmalloc layer is used to provide
the necessary page tables and control structures to establish a  
virtually

contiguous area.


I like this a lot.  It will get rid of all the silly games we have to  
play when needing both large allocations and efficient allocations  
where possible.  In NTFS I can then just allocated higher order pages  
instead of having to mess about with the allocation size and  
allocating a single page if the requested size is <= PAGE_SIZE or  
using vmalloc() if the size is bigger.  And it will make it faster  
because a lot of the time a higher order page allocation will succeed  
with your patchset without resorting to vmalloc() so that will be a  
lot faster.


So where I currently have fs/ntfs/malloc.h the below mess I could get  
rid of it completely and just use the normal page allocator/ 
deallocator instead...


static inline void *__ntfs_malloc(unsigned long size, gfp_t gfp_mask)
{
if (likely(size <= PAGE_SIZE)) {
BUG_ON(!size);
/* kmalloc() has per-CPU caches so is faster for  
now. */

return kmalloc(PAGE_SIZE, gfp_mask & ~__GFP_HIGHMEM);
/* return (void *)__get_free_page(gfp_mask); */
}
if (likely(size >> PAGE_SHIFT < num_physpages))
return __vmalloc(size, gfp_mask, PAGE_KERNEL);
return NULL;
}

And other places in the kernel can make use of the same.  I think XFS  
does very similar things to NTFS in terms of larger allocations at  
least and there are probably more places I don't know about off the  
top of my head...


I am looking forward to your patchset going into mainline.  (-:

Best regards,

Anton


Advantages:

- If higher order allocations are failing then virtual compound pages
  consisting of a series of order-0 pages can stand in for those
  allocations.

- "Reliability" as long as the vmalloc layer can provide virtual  
mappings.


- Ability to reduce the use of vmalloc layer significantly by using
  physically contiguous memory instead of virtual contiguous memory.
  Most uses of vmalloc() can be converted to page allocator calls.

- The use of physically contiguous memory instead of vmalloc may  
allow the
  use larger TLB entries thus reducing TLB pressure. Also reduces  
the need

  for page table walks.

Disadvantages:

- In order to use fall back the logic accessing the memory must be
  aware that the memory could be backed by a virtual mapping and take
  precautions. virt_to_page() and page_address() may not work and
  vmalloc_to_page() and vmalloc_address() (introduced through this
  patch set) may have to be called.

- Virtual mappings are less efficient than physical mappings.
  Performance will drop once virtual fall back occurs.

- Virtual mappings have more memory overhead. vm_area control  
structures
  page tables, page arrays etc need to be allocated and managed to  
provide

  virtual mappings.

The patchset provides this functionality in stages. Stage 1 introduces
the basic fall back mechanism necessary to replace vmalloc allocations
with

alloc_page(GFP_VFALLBACK, order, )

which signifies to the page allocator that a higher order is to be  
found
but a virtual mapping may stand in if there is an issue with  
fragmentation.


Stage 1 functionality does not allow allocation and freeing of virtual
mappings from interrupt contexts.

The stage 1 series ends with the conversion of a few key uses of  
vmalloc
in the VM to alloc_pages() for the allocation of sparsemems memmap  
table
and the wait table in each zone. Other uses of vmalloc could be  
converted

in the same way.


Stage 2 functionality enhances the fallback even more allowing  
allocation

and frees in interrupt context.

SLUB is then modified to use the virtual mappings for slab caches
that are marked with SLAB_VFALLBACK. If a slab cache is marked this  
way

then we drop all the restraints regarding page order and allocate
good large memory areas that fit lots of objects so that we rarely
have to use the slow paths.

Tw