Re: [RFC 03/11] slim down debugfs

2008-02-23 Thread Hugh Dickins
On Sat, 23 Feb 2008, Al Viro wrote:
> On Tue, Feb 19, 2008 at 05:04:38AM +0100, Arnd Bergmann wrote:
> > With most of debugfs now copied to generic code in libfs,
> > we can remove the original copy and replace it with thin
> > wrappers around libfs.
> > 
> > Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
> > Index: linux-2.6/fs/Kconfig
> > ===
> > --- linux-2.6.orig/fs/Kconfig
> > +++ linux-2.6/fs/Kconfig
> > @@ -1001,6 +1001,14 @@ config CONFIGFS_FS
> >   Both sysfs and configfs can and should exist together on the
> >   same system. One is not a replacement for the other.
> >  
> > +config LIBFS
> > +   tristate
> > +   default m
> > +   help
> > + libfs is a helper library used by many of the simpler file
> > + systems. Parts of libfs can be modular when all of its users
> > + are modules as well, and the users should select this symbol.
> 
> NAK.  For one thing, you need dependencies or selects and none of the
> patches later in the series introduces them.  For another, neither
> dependencies nor selects work well if you have non-modular users of
> that sucker.  Seriously, try to add those and do allmodconfig.  Then
> look what a mess it produces.

For sure.  Luckily, ramfs makes extensive use of simple_ libfs functions
(ramfs is where they came from originally), and is always configured in,
so libfs should always be in (unless Arnd's patches change things around
and ramfs no longer uses it).

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unionfs_copy_attr_times oopses

2008-02-19 Thread Hugh Dickins
On Sat, 16 Feb 2008, Erez Zadok wrote:
> 
> Check out my latest set of patches (which correspond to release 2.2.4 of
> Unionfs).  Thanks to your info and the patch, I was able to trigger several
> races more frequently, and fix them.  I've tested my code with make -j N
> (for N=4 and N=20), on a 4 cpu machine a well as a 2 cpu machine (w/
> different amounts of memory and CPU speeds, also 32-bit vs 64-bit); I ran a
> kernel compile for ~10-12 hours.  With the patches I just posted, I wasn't
> able to trigger any of the WARN_ON's in unionfs_copy_attr_times.  I also
> tried it while flushing caches via /proc, and/or performing branch-mgmt
> commands in unionfs.
> 
> Give it a good shake and let me know what you find.

I've now shaken it for 26 hours each on three machines, while running
a few other tests (including LTP, with CONFIG_LOCKDEP=y) on another;
used 2.6.25-rc2-mm1 plus your patches plus undo dependence on BROKEN.

I took the precaution of running with my WARN_ONs
reinstated in unionfs_copy_attr_times, i.e.

int bindex, bend;
struct inode *lower;
struct inode **lower_inodes;

if (!upper)
return;
bindex = ibstart(upper);
if (bindex < 0)
return;
while (1) {
bend = ibend(upper);
if (WARN_ON(bend < 0))
break;
if (bindex > bend)
break;
lower_inodes = UNIONFS_I(upper)->lower_inodes;
if (WARN_ON(!lower_inodes))
break;
lower = lower_inodes[bindex++];
if (!lower)
continue; /* not all lower dir objects may exist */
...

Things look much better with this version than they did, but in that
26 hours one of the machines did issue one of those warnings, the
WARN_ON(bend < 0): that does indicate still some raciness,
doesn't it, since bindex >= 0 before entering the loop?
Here's the accompanying stacktrace:

[ cut here ]
WARNING: at fs/unionfs/subr.c:258 unionfs_copy_attr_times+0x70/0x11c()
Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device 
acpi_cpufreq processor button
Pid: 10438, comm: pdflush Not tainted 2.6.25-rc2-mm1 #3

Call Trace:
 [] warn_on_slowpath+0x62/0x8d
 [] ? preempt_schedule_irq+0x63/0x7c
 [] ? unionfs_copy_attr_times+0x85/0x11c
 [] unionfs_copy_attr_times+0x70/0x11c
 [] ? __writepage+0x0/0x45
 [] unionfs_writepages+0x5c/0x69
 [] do_writepages+0x36/0x51
 [] __sync_single_inode+0x71/0x1a1
 [] __writeback_single_inode+0x131/0x13e
 [] ? preempt_schedule+0x5d/0x77
 [] ? iput+0x55/0x8f
 [] generic_sync_sb_inodes+0x157/0x257
 [] sync_sb_inodes+0x3b/0x3d
 [] writeback_inodes+0x6d/0xc7
 [] ? background_writeout+0x35/0xe9
 [] background_writeout+0xa6/0xe9
 [] __pdflush+0x148/0x1f5
 [] ? pdflush+0x0/0x50
 [] pdflush+0x4e/0x50
 [] ? background_writeout+0x0/0xe9
 [] kthread+0x56/0x86
 [] ? schedule_tail+0x36/0x72
 [] child_rip+0xb/0x12
 [] ? kthread+0x0/0x86
 [] ? child_rip+0x0/0x12

---[ end trace 1b6402a1105c37cb ]---

Another of the machines, with CONFIG_UNION_FS_DEBUG=y,
occasionally issued one of your debug warnings, six in all:

PC:fs/unionfs/rename.c:unionfs_rename:536
 Ci7: inode/linode=f08e9bec: bindex=0 istart/end=0:0
...
PC:fs/unionfs/dentry.c:unionfs_d_revalidate:481
 CI8: bindex=0 mtime/lmtime=1203370546.467901272/1203370546.467901272 
ctime/lctime=1203370546.467901272/1203370546.467901272
...
PC:fs/unionfs/rename.c:unionfs_rename:536
 Ci7: inode/linode=efabb31c: bindex=0 istart/end=0:0
...
PC:fs/unionfs/rename.c:unionfs_rename:536
 Ci7: inode/linode=ee06dbec: bindex=0 istart/end=0:0
...
PC:fs/unionfs/rename.c:unionfs_rename:536
 Ci4: no lower_inodes e1442494
...
PC:fs/unionfs/rename.c:unionfs_rename:536
 Ci7: inode/linode=d71731a4: bindex=0 istart/end=0:0

The third (like the first, without UNION_FS_DEBUG) had a clean run.

The LTP run generally went fine, but one anomaly I happened to
notice this time, probably been there for months: when it's testing
swapon on unionfs over tmpfs, the kernel's "Adding 32k swap" message
said "across:4k" each time, which is odd since it's hard to fit 32k
into 4k - unless you're compressing, which is someone else's project!

I think that's because bmap() is not very well defined, and actually,
you need to be sure to return sector 0 whenever there's an error
(including the lower level not supporting bmap, as in the tmpfs case),
not -EINVAL.  I think that's how generic_block_bmap tries to play it.

Previously I'd felt indulgent about you "supporting" swapon of a
unionfs file even when the lower level didn't support it; but now
I think it's probably dangerously wrong (though limited to root).
I didn't dare try what happens when you actually get to swapping to
such a beast; and in fact, I didn't even try to wrap my head around
even your "good" bmap support in unionfs - it's not obvious to me

unionfs_copy_attr_times oopses

2008-02-01 Thread Hugh Dickins
Hi Erez,

Aside from the occasional "unionfs: new lower inode mtime" messages
on directories (which I've got into the habit of ignoring now), the
only problem I'm still suffering with unionfs over tmpfs (not tested
any other fs's below it recently) is oops in unionfs_copy_attr_times.

I believe I'm working with your latest: 2.6.24-rc8-mm1 plus the four
patches you posted to lkml on 26 Jan.  But this problem has been around
for a while before that: I'd been hoping to debug it myself, but taken
too long to make too little progress, so now handing over to you.

The oops occurs while doing repeated "make -j20" kernel builds in a
unionfs mount of a tmpfs (though I doubt tmpfs is relevant): most of
my testing was while swapping, but today I find that's irrelevant,
and it should happen much quicker without.  SMP kernels (4 cpus),
I haven't tried UP; happens with or without PREEMPT, may just be
coincidence that it happens quicker on the machines with PREEMPT.

Most commonly it's unionfs_copy_attr_times called from unionfs_create,
but that's probably just the most common route in this workload:
I've seen it also when called from unionfs_rename or unionfs_open or
unionfs_unlink.  It looks like there's a locking or refcounting bug,
hence a race: the unionfs_inode_info which unionfs_copy_attr_times
is working on gets changed underneath it, so it oopses on NULL
lower_inodes.  It took me far too long to realize that ibstart
(and ibend) are -1 when it oopses, yet the function returns
immediately if ibstart is negative on entry.  I've not a clue
what it is that's changing it.

What I did make progress with yesterday was a workaround patch, which
additionally makes the problem more manifest: by WARNing in much more
common cases which were invisible (didn't go so far as to oops) before.
The oops had typically taken 12 hours to happen, but I'm getting like
one warning per hour (varies from machine to machine) with the patch
while swapping, and now much more frequently giving it more memory.

I've two patches.  The first isn't interesting, so I just attached
it.  It moves unionfs_copy_attr_times (and unionfs_copy_attr_all)
from inline in fanout.h to out-of-line in copyup.c: the function's
too big and too widely used to be suitable for inlining, this reduces
kernel text by about 6k.  But since making that patch, I've seen that
copyup.c is really about copying up the data, and you may well prefer
to move the functions to subr.c or elsewhere.

The patch to go on top of that is appended below: makes unionfs_copy
_attr_times more paranoid about the fields it's accessing; but this
can only be a temporary workaround - so long as there's such a race,
it could be accessing freed/reused memory with unpredictable results.

Most of the warnings I see with this patch are the first, on bend < 0:
the old "bindex <= ibend(upper)" condition would often have terminated
the loop in that case, without noticing the discrepancy that ibend
had gone negative.  Occasionally I'm seeing the second warning, on
lower_inodes, which would correspond to the oops: here's a trace
(cutting out the ? lines, just stale addresses left on the stack):

WARNING: at fs/unionfs/copyup.c:906 unionfs_copy_attr_times+0x5f/0xcc()
Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq 
snd_seq_device thermal button processor
Pid: 13791, comm: cc1 Not tainted 2.6.24-rc8-mm1 #19
 [] warn_on_slowpath+0x3e/0x51
 [] unionfs_copy_attr_times+0x5f/0xcc
 [] unionfs_create+0x160/0x1e7
 [] vfs_create+0x62/0xa8
 [] __open_namei_create+0x44/0x8e
 [] open_pathname+0x14d/0x55c
 [] do_sys_open+0x41/0xbd
 [] sys_open+0x13/0x15
 [] sysenter_past_esp+0x5f/0x85

--- a/fs/unionfs/copyup.c   2008-01-29 16:59:13.0 +
+++ b/fs/unionfs/copyup.c   2008-01-31 12:24:58.0 +
@@ -887,13 +887,25 @@ void unionfs_postcopyup_release(struct d
 /* copy a/m/ctime from the lower branch with the newest times */
 void unionfs_copy_attr_times(struct inode *upper)
 {
-   int bindex;
+   int bindex, bend;
struct inode *lower;
+   struct inode **lower_inodes;
 
-   if (!upper || ibstart(upper) < 0)
+   if (!upper)
return;
-   for (bindex = ibstart(upper); bindex <= ibend(upper); bindex++) {
-   lower = unionfs_lower_inode_idx(upper, bindex);
+   bindex = ibstart(upper);
+   if (bindex < 0)
+   return;
+   while (1) {
+   bend = ibend(upper);
+   if (WARN_ON(bend < 0))
+   break;
+   if (bindex > bend)
+   break;
+   lower_inodes = UNIONFS_I(upper)->lower_inodes;
+   if (WARN_ON(!lower_inodes))
+   break;
+   lower = lower_inodes[bindex++];
if (!lower)
continue; /* not all lower dir objects may exist */
if (unlikely(timespec_compare(&upper->i_mtime,--- a/fs/unionfs/copyup.c	2008-01-27 13:00:24.0 +

Re: [patch 24/26] mount options: fix tmpfs

2008-01-29 Thread Hugh Dickins
On Mon, 28 Jan 2008, Miklos Szeredi wrote:
> 
> I'll add functions for calculating the default max values, so the
> calculations won't accidentally become different for the
> initialization and the option showing.

Excellent.

> > If you agree with the version below, please take it into your collection
> > and insert your Signed-off-by.  I should admit, I've not yet tested how
> > the NUMA policies look: you'll hear from me again tomorrow morning if
> > those turn out to wrong.
> 
> OK, I'll send this to Andrew.  Maybe I'll wait until tomorrow to hear
> if it's working on NUMA.

I've just now checked: the NUMA options are showing fine.

Thanks,
Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 24/26] mount options: fix tmpfs

2008-01-27 Thread Hugh Dickins
On Thu, 24 Jan 2008, Miklos Szeredi wrote:
>

Thanks Miklos, that's a welcome enhancement, nicely done.  I've only
noticed one thing wrong (MPOL_PREFERRED shown as "default"); but thought
shmem_config didn't add much value - I'd rather avoid those syntactic
changes to unchanged code; and several tmpfs defaults being relative
(e.g. to totalram_pages, or to mounter's fsuid), I ended up preferring
to do real tests in shmem_show_options.

Thus, for example, if memory is hotplugged in or out later, what started
out as an unspecified size option will then get shown as explicit size.
(I did think for a while that I wanted to show explicit size in all
cases; but it looked pretty silly on udev.)  I think that's the correct
behaviour, that otherwise would be misleading; but I may be looking at
this the wrong way round, what's your view?

If you agree with the version below, please take it into your collection
and insert your Signed-off-by.  I should admit, I've not yet tested how
the NUMA policies look: you'll hear from me again tomorrow morning if
those turn out to wrong.

Hugh

From: Miklos Szeredi <[EMAIL PROTECTED]>

Add .show_options super operation to tmpfs.

Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
---

 include/linux/shmem_fs.h |5 
 mm/shmem.c   |  189 -
 2 files changed, 129 insertions(+), 65 deletions(-)

--- 2.6.24-rc8-mm1/include/linux/shmem_fs.h 2006-11-29 21:57:37.0 
+
+++ linux/include/linux/shmem_fs.h  2008-01-27 22:42:52.0 +
@@ -30,9 +30,12 @@ struct shmem_sb_info {
unsigned long free_blocks;  /* How many are left for allocation */
unsigned long max_inodes;   /* How many inodes are allowed */
unsigned long free_inodes;  /* How many are left for allocation */
+   spinlock_t stat_lock;   /* Serialize shmem_sb_info changes */
+   uid_t uid;  /* Mount uid for root directory */
+   gid_t gid;  /* Mount gid for root directory */
+   mode_t mode;/* Mount mode for root directory */
int policy; /* Default NUMA memory alloc policy */
nodemask_t policy_nodes;/* nodemask for preferred and bind */
-   spinlock_tstat_lock;
 };
 
 static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
--- 2.6.24-rc8-mm1/mm/shmem.c   2008-01-17 16:51:21.0 +
+++ linux/mm/shmem.c2008-01-27 23:41:31.0 +
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1068,7 +1069,8 @@ redirty:
 }
 
 #ifdef CONFIG_NUMA
-static inline int shmem_parse_mpol(char *value, int *policy, nodemask_t 
*policy_nodes)
+#ifdef CONFIG_TMPFS
+static int shmem_parse_mpol(char *value, int *policy, nodemask_t *policy_nodes)
 {
char *nodelist = strchr(value, ':');
int err = 1;
@@ -1117,6 +1119,42 @@ out:
return err;
 }
 
+static void shmem_show_mpol(struct seq_file *seq, int policy,
+   const nodemask_t policy_nodes)
+{
+   char *policy_string;
+
+   switch (policy) {
+   case MPOL_PREFERRED:
+   policy_string = "prefer";
+   break;
+   case MPOL_BIND:
+   policy_string = "bind";
+   break;
+   case MPOL_INTERLEAVE:
+   policy_string = "interleave";
+   break;
+   default:
+   /* MPOL_DEFAULT */
+   return;
+   }
+
+   seq_printf(seq, ",mpol=%s", policy_string);
+
+   if (policy != MPOL_INTERLEAVE ||
+   !nodes_equal(policy_nodes, node_states[N_HIGH_MEMORY])) {
+   char buffer[64];
+   int len;
+
+   len = nodelist_scnprintf(buffer, sizeof(buffer), policy_nodes);
+   if (len < sizeof(buffer))
+   seq_printf(seq, ":%s", buffer);
+   else
+   seq_printf(seq, ":?");
+   }
+}
+#endif /* CONFIG_TMPFS */
+
 static struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp,
struct shmem_inode_info *info, unsigned long idx)
 {
@@ -1148,13 +1186,20 @@ static struct page *shmem_alloc_page(gfp
mpol_free(pvma.vm_policy);
return page;
 }
-#else
+#else /* !CONFIG_NUMA */
+#ifdef CONFIG_TMPFS
 static inline int shmem_parse_mpol(char *value, int *policy,
nodemask_t *policy_nodes)
 {
return 1;
 }
 
+static inline void shmem_show_mpol(struct seq_file *seq, int policy,
+   const nodemask_t policy_nodes)
+{
+}
+#endif /* CONFIG_TMPFS */
+
 static inline struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp,
struct shmem_inode_info *info, unsigned long idx)
 {
@@ -1166,7 +1211,7 @@ static inline struct page *shmem_alloc_

Re: [PATCH] xip: fix get_zeroed_page with __GFP_HIGHMEM

2007-12-25 Thread Hugh Dickins
On Tue, 25 Dec 2007, Akinobu Mita wrote:
> The use of get_zeroed_page() with __GFP_HIGHMEM is invalid.
> Use alloc_page() with __GFP_ZERO instead of invalid get_zeroed_page().
> 
> (This patch is only compile tested)
> 
> Cc: Carsten Otte <[EMAIL PROTECTED]>
> Signed-off-by: Akinobu Mita <[EMAIL PROTECTED]>

Good find!  You got me very worried, how this escaped testing before.
Presumed explanation: it hasn't been needed beyond s390, which has no
CONFIG_HIGHMEM; and it has never been tested with CONFIG_DEBUG_VM on.

Acked-by: Hugh Dickins <[EMAIL PROTECTED]>

But I haven't tested it either: let's wait for Carsten to report.
I believe Nick has changes on the way which will make it possible for
ordinary mortals to test XIP: here's a good argument to bring them on.

May I cross-reference my "prep_zero_page: remove bogus BUG_ON"
09f345da758fca1222b0971b65b2fddbdf78bb83 in 2.6.24-rc: that bogus
(actually VM_)BUG_ON would have stood in the way too, so there's
no point in backporting this without that.  But if only non-HIGHMEM
architectures can have been using XIP, a backport is not essential.

Hugh

p.s. Nick's ZERO_PAGE changes, in 2.6.24-rc, actually cancel the need
for a special xip_sparse_page distinct from ZERO_PAGE.  But let's not
become dependent on those: keep this doing it the way it does now.

> 
> ---
>  mm/filemap_xip.c |9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> Index: 2.6-git/mm/filemap_xip.c
> ===
> --- 2.6-git.orig/mm/filemap_xip.c
> +++ 2.6-git/mm/filemap_xip.c
> @@ -25,14 +25,15 @@ static struct page *__xip_sparse_page;
>  static struct page *xip_sparse_page(void)
>  {
>   if (!__xip_sparse_page) {
> - unsigned long zeroes = get_zeroed_page(GFP_HIGHUSER);
> - if (zeroes) {
> + struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> +
> + if (page) {
>   static DEFINE_SPINLOCK(xip_alloc_lock);
>   spin_lock(&xip_alloc_lock);
>   if (!__xip_sparse_page)
> - __xip_sparse_page = virt_to_page(zeroes);
> + __xip_sparse_page = page;
>   else
> - free_page(zeroes);
> + __free_page(page);
>   spin_unlock(&xip_alloc_lock);
>   }
>   }
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-11-17 Thread Hugh Dickins
On Tue, 13 Nov 2007, Erez Zadok wrote:
> 
> I posted all of these patches just now.  You're CC'ed.  Hopefully Andrew can
> pull from my unionfs.git branch soon.
> 
> You also reported in your previous emails some hangs/oopses while doing make
> -j 20 in unionfs on top of a single tmpfs, using -mm.  After several days,
> I've not been able to reproduce this w/ my latest set of patches.  If you
> can send me your .config and the specs on the h/w you're using (cpus, mem,
> etc.), I'll see if I can find something similar to it on my end and run the
> same tests.

I'm glad to report that this unionfs, not the one in 2.6.24-rc2-mm1
but the one including those 9 patches you posted, now gets through
my testing with tmpfs without a problem.  I do still get occasional
"unionfs: new lower inode mtime (bindex=0, name=)"
messages, but nothing worse seen yet: a big improvement.

I deceived myself for a while that the danger of shmem_writepage
hitting its BUG_ON(entry->val) was dealt with too; but that's wrong,
I must go back to working out an escape from that one (despite never
seeing it).

I did think you could clean up the doubled set_page_dirtys,
but it's of no consequence.

Hugh

--- 2.6.24-rc2-mm1+9/fs/unionfs/mmap.c  2007-11-17 12:23:30.0 +
+++ linux/fs/unionfs/mmap.c 2007-11-17 20:22:29.0 +
@@ -56,6 +56,7 @@ static int unionfs_writepage(struct page
copy_highpage(lower_page, page);
flush_dcache_page(lower_page);
SetPageUptodate(lower_page);
+   set_page_dirty(lower_page);
 
/*
 * Call lower writepage (expects locked page).  However, if we are
@@ -66,12 +67,11 @@ static int unionfs_writepage(struct page
 * success.
 */
if (wbc->for_reclaim) {
-   set_page_dirty(lower_page);
unlock_page(lower_page);
goto out_release;
}
+
BUG_ON(!lower_mapping->a_ops->writepage);
-   set_page_dirty(lower_page);
clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
err = lower_mapping->a_ops->writepage(lower_page, wbc);
if (err < 0)
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-11-12 Thread Hugh Dickins
On Fri, 9 Nov 2007, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> 
> > Three, I believe you need to add a flush_dcache_page(lower_page)
> > after the copy_highpage(lower_page): some architectures will need
> > that to see the new data if they have lower_page mapped (though I
> > expect it's anyway shaky ground to be accessing through the lower
> > mount at the same time as modifying through the upper).
> 
> OK.

While looking into something else entirely, I realize that _here_
you are missing a SetPageUptodate(lower_page): should go in after
the flush_dcache_page(lower_page) I'm suggesting.  (Nick would argue
for some kind of barrier there too, but I don't think unionfs has a
special need to be ahead of the pack on that issue.)

Think about it:
when find_or_create_page has created a fresh page in the cache,
and you've just done copy_highpage to put the data into it, you
now need to mark it as Uptodate: otherwise a subsequent vfs_read
or whatever on the lower level will find that page !Uptodate and
read stale data back from disk instead of what you just copied in,
unless its dirtiness has got it written back to disk meanwhile.

Odd that that hasn't been noticed at all: I guess it may be hard
to get testing to reclaim lower/upper pages in such a way as to
test out such paths thoroughly.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-11-11 Thread Hugh Dickins
On Fri, 9 Nov 2007, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> > 
> > One, I think you would be safer to do a set_page_dirty(lower_page)
> > before your clear_page_dirty_for_io(lower_page).  I know that sounds
> > silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io:
> > there's a lot of subtlety hereabouts, and I think you'd be mimicing the
> > usual path closer if you set_page_dirty first - there's nothing else
> > doing it on that lower_page, is there?  I'm not certain that you need
> > to, but I think you'd do well to look into it and make up your own mind.
> 
> Hugh, my code looks like:
> 
>   if (wbc->for_reclaim) {
>   set_page_dirty(lower_page);
>   unlock_page(lower_page);
>   goto out_release;
>   }
>   BUG_ON(!lower_mapping->a_ops->writepage);
>   clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
>   err = lower_mapping->a_ops->writepage(lower_page, wbc);
> 
> Do you mean I should set_page_dirty(lower_page) unconditionally before
> clear_page_dirty_for_io?  (I already do that in the 'if' statement above it.)

Yes.  Whether you're wrong not to be doing that already, I've not checked;
but I think doing so will make unionfs safer against any future changes
in the relationship between set_page_dirty and clear_page_dirty_for_io.

For example, look at clear_page_dirty_for_io: it's decrementing some
statistics which __set_page_dirty_nobuffers increments.  Does use of
unionfs (over some filesystems) make those numbers wrap increasingly
negative?  Does adding this set_page_dirty(lower_page) correct that?
I suspect so, but may be wrong.

> > Two, I'm unsure of the way you're clearing or setting PageUptodate on
> > the upper page there.  The rules for PageUptodate are fairly obvious
> > when reading, but when a write fails, it's not so obvious.  Again, I'm
> > not saying what you've got is wrong (it may be unavoidable, to keep
> > synch between lower and upper), but it deserves a second thought.
> 
> I looked at all mainline filesystems's ->writepage to see what, if any, they
> do with their page's uptodate flag.  Most f/s don't touch the flag one way
> or another.

I'm not going to try and guess what assorted filesystems are up to,
and not all of them will be bugfree.  The crucial point of PageUptodate
is that we insert a filesystem page into the page cache before it's had
any data read in: it needs to be !PageUptodate until the data is there,
and then marked PageUptodate to say the data is good and others can
start using it.  See mm/filemap.c.

> And finally, unionfs clears the uptodate flag on error from the lower
> ->writepage, and otherwise sets the flag on success from the lower
> ->writepage.  My gut feeling is that unionfs shouldn't change the page
> uptodate flag at all: if the VFS passes unionfs_writepage a page which isn't
> uptodate, then the VFS has a serious problem b/c it'd be asking a f/s to
> write out a page which isn't up-to-date, right?  Otherwise, whether
> unionfs_writepage manages to write the lower page or not, why should that
> invalidate the state of the unionfs page itself?  Come to think of it, I
> think clearing pageuptodate on error from ->writepage(lower_page) may be
> bad.  Imagine if after such a failed unionfs_writepage, I get a
> unionfs_readpage: that ->readpage will get data from the lower f/s page and
> copy it *over* the unionfs page, even if the upper page's data was more
> recent prior to the failed call to unionfs_writepage.  IOW, we could be
> reverting a user-visible mmap'ed page to a previous on-disk version.  What
> do you think: could this happen?  Anyway, I'll run some exhaustive testing
> next and see what happens if I don't set/clear the uptodate flag in
> unionfs_writepage.

That was my point, and I don't really have more to add.  It's unusual
to do anything with PageUptodate when writing.  By clearing it when the
lower level has an error, you're throwing away the changes already made
at the upper level.  You might have some good reason for that, but it's
worth questioning.  If you don't know why you're Set/Clear'ing it there,
better to just take that out.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-11-05 Thread Hugh Dickins
On Mon, 5 Nov 2007, Dave Hansen wrote:
> 
> Actually, I think your s/while/if/ change is probably a decent fix.

Any resemblance to a decent fix is purely coincidental.

> Barring any other races, that loop should always have made progress on
> mnt->__mnt_writers the way it is written.  If we get to:
> 
> > lock_and_coalesce_cpu_mnt_writer_counts();
> ->HERE
> > mnt_unlock_cpus();
> 
> and don't have a positive mnt->__mnt_writers, we know something is going
> badly.  We WARN_ON() there, which should at least give an earlier
> warning that the system is not doing well.  But it doesn't fix the
> inevitable.  Could you try the attached patch and see if it at least
> warns you earlier?

Thanks, Dave, yes, that gives me a nice warning:

leak detected on mount(c25ebd80) writers count: -65537
WARNING: at fs/namespace.c:249 handle_write_count_underflow()
 [] show_trace_log_lvl+0x1b/0x2e
 [] show_trace+0x16/0x1b
 [] dump_stack+0x19/0x1e
 [] handle_write_count_underflow+0x4c/0x60
 [] mnt_drop_write+0x69/0x8e
 [] __fput+0xff/0x162
 [] fput+0x2e/0x33
 [] unionfs_file_release+0xc2/0x1c5
 [] __fput+0x8f/0x162
 [] fput+0x2e/0x33
 [] filp_close+0x50/0x5d
 [] sys_close+0x74/0xb4
 [] sysenter_past_esp+0x5f/0x85

and the test then goes quietly on its way instead of hanging.  Though
I imagine, with your patch or mine, that it's then making an unfortunate
frequency of calls to lock_and_coalesce_longer_name_than_I_care_to_type
thereafter.  But it's hardly your responsibility to optimize for bugs
elsewhere.

The 2.6.23-mm1 tree has MNT_USER at 0x200, so I adjusted your flag to
#define MNT_IMBALANCED_WRITE_COUNT  0x400 /* just for debugging */

> 
> I have a decent guess what the bug is, too.  In the unionfs code:

I'll let Erez take it from there...

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-11-05 Thread Hugh Dickins
[Dave, I've Cc'ed you re handle_write_count_underflow, see below.]

On Wed, 31 Oct 2007, Erez Zadok wrote:
> 
> Hi Hugh, I've addressed all of your concerns and am happy to report that the
> newly revised unionfs_writepage works even better, including under my
> memory-pressure conditions.  To summarize my changes since the last time:
> 
> - I'm only masking __GFP_FS, not __GFP_IO
> - using find_or_create_page to avoid locking issues around mapping mask
> - handle for_reclaim case more efficiently
> - using copy_highpage so we handle KM_USER*
> - un/locking upper/lower page as/when needed
> - updated comments to clarify what/why
> - unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs used
>   to have it)
> 
> Below is the newest version of unionfs_writepage.  Let me know what you
> think.
> 
> I have to say that with these changes, unionfs appears visibly faster under
> memory pressure.  I suspect the for_reclaim handling is probably the largest
> contributor to this speedup.

That's good news, and that unionfs_writepage looks good to me -
with three reservations I've not observed before.

One, I think you would be safer to do a set_page_dirty(lower_page)
before your clear_page_dirty_for_io(lower_page).  I know that sounds
silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io:
there's a lot of subtlety hereabouts, and I think you'd be mimicing the
usual path closer if you set_page_dirty first - there's nothing else
doing it on that lower_page, is there?  I'm not certain that you need
to, but I think you'd do well to look into it and make up your own mind.

Two, I'm unsure of the way you're clearing or setting PageUptodate on
the upper page there.  The rules for PageUptodate are fairly obvious
when reading, but when a write fails, it's not so obvious.  Again, I'm
not saying what you've got is wrong (it may be unavoidable, to keep
synch between lower and upper), but it deserves a second thought.

Three, I believe you need to add a flush_dcache_page(lower_page)
after the copy_highpage(lower_page): some architectures will need
that to see the new data if they have lower_page mapped (though I
expect it's anyway shaky ground to be accessing through the lower
mount at the same time as modifying through the upper).

I've been trying this out on 2.6.23-mm1 with your 21 Oct 1-9/9
and your 2 Nov 1-8/8 patches applied (rejects being patches which
were already in 2.6.23-mm1).  I was hoping to reproduce the
BUG_ON(entry->val) that I fear from shmem_writepage(), before
fixing it; but not seen that at all yet - that might be good
news, but it's more likely I just haven't tried hard enough yet.

For now I'm doing repeated make -j20 kernel builds, pushing into
swap, in a unionfs mount of just a single dir on tmpfs.  This has
shown up several problems, two of which I've had to hack around to
get further.

The first: I very quickly hit "BUG: atomic counter underflow"
from -mm's i386 atomic_dec_and_test: from filp_close calling
unionfs_flush.  I did a little test fork()ing while holding a file
open on unionfs, and indeed it appears that your totalopens code is
broken, being unaware of how fork() bumps up a file count without
an open.  That's rather basic, I'm puzzled that this has remained
undiscovered until now - or perhaps it's just a recent addition.

It looked to me as if the totalopens count was about avoiding some
d_deleted processing in unionfs_flush, which actually should be left
until unionfs_release (and that your unionfs_flush ought to be calling
the lower level flush in all cases).  To get going, I've been running
with the quick hack patch below: but I've spent very little time
thinking about it, plus it's a long time since I've done VFS stuff;
so that patch may be nothing but an embarrassment that reflects
neither your intentions nor the VFS rules!  And it may itself be
responsible for the further problems I've seen.

The second problem was a hang: all cpus in handle_write_count_underflow
doing lock_and_coalesce_cpu_mnt_writer_counts: new -mm stuff from Dave
Hansen.  At first I thought that was a locking problem in Dave's code,
but I now suspect it's that your unionfs reference counting is wrong
somewhere, and the error accumulates until __mnt_writers drops below
MNT_WRITER_UNDERFLOW_LIMIT, but the coalescence does nothing to help
and we're stuck in that loop.  My even greater hack to solve that one
was to change Dave's "while" to "if"!  Then indeed tests can run for
some while.  As I say, my suspicion is that the actual error is within
unionfs (perhaps introduced by my first hack); but I hope Dave can
also make handle_write_count_underflow more robust, it's unfortunate
if refcount errors elsewhere first show up as a hang there.

I've had CONFIG_UNION_FS_DEBUG=y but will probably turn it off when
I come back to this, since it's rather noisy at present.  I've not
checked whether its reports are peculiar to having tmpfs below or not.
I get lots of "unionfs: new lower inode mtime" r

Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-29 Thread Hugh Dickins
On Sun, 28 Oct 2007, Erez Zadok wrote:
> 
> I took your advise regarding ~(__GFP_FS|__GFP_IO), AOP_WRITEPAGE_ACTIVATE,
> and such.  I revised my unionfs_writepage and unionfs_sync_page, and tested
> it under memory pressure: I have a couple of live CDs that use tmpfs and can
> deterministically reproduce the conditions resulting in A_W_A.  I also went
> back to using grab_cache_page but with the gfp_mask suggestions you made.
> 
> I'm happy to report that it all works great now!

That's very encouraging...

> Below is the entirety of
> the new unionfs_mmap and unionfs_sync_page code.  I'd appreciate if you and
> others can look it over and see if you find any problems.

... but still a few problems, I'm afraid.

The greatest problem is a tmpfs one, that would be for me to solve.
But first...

> static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
> {
>   int err = -EIO;
>   struct inode *inode;
>   struct inode *lower_inode;
>   struct page *lower_page;
>   char *kaddr, *lower_kaddr;
>   struct address_space *mapping; /* lower inode mapping */
>   gfp_t old_gfp_mask;
> 
>   inode = page->mapping->host;
>   lower_inode = unionfs_lower_inode(inode);
>   mapping = lower_inode->i_mapping;
> 
>   /*
>* find lower page (returns a locked page)
>*
>* We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under

On reflection, I think I went too far in asking you to mask off __GFP_IO.
Loop has to do so because it's a block device, down towards the IO layer;
but unionfs is a filesystem, so masking off __GFP_FS is enough to prevent
recursion into the FS layer with danger of deadlock, and leaving __GFP_IO
on gives a better chance of success.

>* memory pressure conditions.  This is similar to how the loop
>* driver behaves (see loop_set_fd in drivers/block/loop.c).
>* If we can't find the lower page, we redirty our page and return
>* "success" so that the VM will call us again in the (hopefully
>* near) future.
>*/
>   old_gfp_mask = mapping_gfp_mask(mapping);
>   mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));
> 
>   lower_page = grab_cache_page(mapping, page->index);
>   mapping_set_gfp_mask(mapping, old_gfp_mask);

Hmm, several points on that.

When suggesting something like this, I did remark "what locking needed?".
You've got none: which is problematic if two stacked mounts are playing
with the same underlying file concurrently (yes, userspace would have
a data coherency problem in such a case, but the kernel still needs to
worry about its own internal integrity) - you'd be in danger of masking
(__GFP_IO|__GFP_FS) permanently off the underlying file; and furthermore,
losing error flags (AS_EIO, AS_ENOSPC) which share the same unsigned long.
Neither likely but both wrong.

See the comment on mapping_set_gfp_mask() in include/pagemap.h:
 * This is non-atomic.  Only to be used before the mapping is activated.
Strictly speaking, I guess loop was a little bit guilty even when just
loop_set_fd() did it: the underlying mapping might already be active.
It appears to be just as guilty as you in its do_loop_switch() case
(done at BIO completion time), but that's for a LOOP_CHANGE_FD ioctl
which would only be expected to be called once, during installation;
whereas you're using mapping_set_gfp_mask here with great frequency.

Another point on this is: loop masks __GFP_IO|__GFP_FS off the file
for the whole duration while it is looped, whereas you're flipping it
just in this preliminary section of unionfs_writepage.  I think you're
probably okay to be doing it only here within ->writepage: I think
loop covered every operation because it's at the block device level,
perhaps both reads and writes needed to serve reclaim at the higher
FS level; and also easier to do it once for all.

Are you wrong to be doing it only around the grab_cache_page,
leaving the lower level ->writepage further down unprotected?
Certainly doing it around the grab_cache_page is likely to be way
more important than around the ->writepage (but rather depends on
filesystem).  And on reflection, I think that the lower filesystem's
writepage should already be using GFP_NOFS to avoid deadlocks in
any of its allocations when wbc->for_reclaim, so you should be
okay just masking off around the grab_cache_page.

(Actually, in the wbc->for_reclaim case, I think you don't really
need to call the lower level writepage at all.  Just set_page_dirty
on the lower page, unlock it and return.  In due course that memory
pressure which has called unionfs_writepage, will come around to the
lower level page and do writepage upon it.  Whether that's a better
strategy or not, I'm do not know.)

There's an attractively simple answer to the mapping_set_gfp_mask
locking problem, if we're confident that it's only needed around
the grab_cache_page.  Look at the declaration of grab_cache_page
in linux/pag

Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-26 Thread Hugh Dickins
On Fri, 26 Oct 2007, Neil Brown wrote:
> On Thursday October 25, [EMAIL PROTECTED] wrote:
> 
> The patch looks like it makes perfect sense to me.

Great, thanks a lot for looking at it, Neil and Pekka.

> Before the change, ->writepage could return AOP_WRITEPAGE_ACTIVATE
> without unlocking the page, and this has precisely the effect of:
>ClearPageReclaim();  (if the call path was through pageout)
>SetPageActive();  (if the call was through shrink_page_list)
>unlock_page();
> 
> With the patch, the ->writepage method does the SetPageActive and the
> unlock_page, which on the whole seems cleaner.
> 
> We seem to have lost a call to ClearPageReclaim - I don't know if that
> is significant.

It doesn't show up in the diff at all, but pageout() already has
if (!PageWriteback(page)) {
/* synchronous write or broken a_ops? */
ClearPageReclaim(page);
}
which will clear it since we've never set PageWriteback.

I think no harm would come from leaving it set there, since it only
takes effect in end_page_writeback (its effect being to let the just
written page be moved to the end of the LRU, so that it will then
be soon reclaimed), and clear_page_dirty_for_io clears it before
coming down this way.  But I'd never argue for that: I hate having
leftover flags hanging around outside the scope of their relevance.

> > Special, hidden, undocumented, secret hack!  Then in 2.6.7 Andrew
> > stole his own secret and used it when concocting ramdisk_writepage.
> > Oh, and NFS made some kind of use of it in 2.6.6 only.  Then Neil
> > revealed the secret to the uninitiated in 2.6.17: now, what's the
> > appropriate punishment for that?
> 
> Surely the punishment should be for writing hidden undocumented hacks
> in the first place!  I vote we just make him maintainer for the whole
> kernel - that will keep him so busy that he will never have a chance
> to do it again :-)

That is a splendid retort, which has won you absolution.
But it makes me a little sad: that smiley should be a weepy.

> > --- 2.6.24-rc1/Documentation/filesystems/vfs.txt2007-10-24 
> > 07:15:11.0 +0100
> > +++ linux/Documentation/filesystems/vfs.txt 2007-10-24 08:42:07.0 
> > +0100
> > @@ -567,9 +567,7 @@ struct address_space_operations {
> >If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
> >try too hard if there are problems, and may choose to write out
> >other pages from the mapping if that is easier (e.g. due to
> > -  internal dependencies).  If it chooses not to start writeout, it
> > -  should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
> > -  calling ->writepage on that page.
> > +  internal dependencies).
> >  
> 
> It seems that the new requirement is that if the address_space
> chooses not to write out the page, it should now call SetPageActive().
> If that is the case, I think it should be explicit in the
> documentation - please?

No, it's not the case; but you're right that I should add something
there, to put an end to the idea.  It'll be something along the lines
of "You may notice shmem setting PageActive there, but please don't do
that; or if you insist, be sure never to do so in the !wbc->for_reclaim
case".

The PageActive thing is for when a filesystem regrets that it even
had a ->writepage (it replicates the behaviour of the writepage == NULL
case or the VM_LOCKED SWAP_FAIL case or the !add_to_swap case, delaying
the return of this page to writepage for as long as it can).  It's done
in shmem_writepage because shm_lock (equivalent to VM_LOCKED) is only
discovered within that writepage, and no-swap is discovered there too.

ramdisk does it too: I've not tried to understand ramdisk as Nick and
Eric have, but it used to have no writepage, and would prefer to have
no writepage, but appears to need one for some PageUptodate reasons.

It's fairly normal for a filesystem to find that for some reason it
cannot carry out a writepage on this page right now (in the reclaim
case: the sync case demands action, IIRC); so it then simply does
set_page_dirty and unlock_page and returns "success".

I'll try to condense this down for the Doc when finalizing the patch;
which I've still not yet tested properly - thanks for the eyes, but
I can't submit it until I've checked in detail that it really gets
to do what we think it does.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-25 Thread Hugh Dickins
On Thu, 25 Oct 2007, Erez Zadok wrote:
> 
> On a related note, I would just love to get rid of calling the lower
> ->writepage in unionfs b/c I can't even tell if I have a lower page to use
> all the time.  I'd prefer to call vfs_write() if I can, but I'll need a
> struct file, or at least a dentry.

Why do you want to do that?  You gave a good reason why it's easier
for ecryptfs, but I doubt it's robust.  The higher the level you
choose to use, the harder to guarantee it won't deadlock.

Or that's my gut feeling anyway.  It's several years since I've
thought about such issues: just because I came into this knowing
about shmem_writepage, is perhaps not a good reason to choose me
as advisor!

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-25 Thread Hugh Dickins
On Mon, 22 Oct 2007, Erez Zadok wrote:
> 
> What's the precise semantics of AOP_WRITEPAGE_ACTIVATE?

Sigh - not at you, at it!  It's a secret that couldn't be kept secret,
a hack for tmpfs reclaim, let's just look forward to it going away.

> Is it considered an error or not?

No, it's definitely not an error.  It'a a private note from tmpfs
(or ramdisk) to vmscan, saying "don't waste your time coming back
to me with this page until you have to, please move on to another
more likely to be freeable".

> If it's an error, then I usually feel that it's important for
> a stacked f/s to return that error indication upwards.

Indeed, but this is not an error.  Remember, neither ramdisk nor
tmpfs is stable storage: okay, tmpfs can go out to disk by using
swap, but that's not stable storage - it's not reconstituted after
reboot.  (If there's an error in writing to swap, well, that's a
different issue; and there's few filesystems where such an I/O
error would be reported from ->writepage.)

> 
> The unionfs page and the lower page are somewhat tied together, at least
> logically.  For unionfs's page to be considered to have been written
> successfully, the lower page has to be written successfully.  So again, if
> the lower f/s returns AOP_WRITEPAGE_ACTIVATE, should I consider my unionfs
> page to have been written successfully or not?

Consider it written successfully.  (What does written mean with tmpfs?
it means a page can be freed, it doesn't mean the data is forever safe.)

> If I don't return
> AOP_WRITEPAGE_ACTIVATE up, can there be any chance that some vital data may
> never get flushed out?

Things should work better if you don't return AOP_WRITEPAGE_ACTIVATE.
If you mark your page as clean and successfully written, vmscan will
be able to free it.  If needed, we can get the data back from the
lower page on demand, but meanwhile a page has been freed, which
is what vmscan reclaim is all about.  (But of course, in the case
where you couldn't get hold of a page for the lower, you must redirty
yours before returning.)

> > unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot
> > find_lock_page: that case may be appropriate.  Though I don't really
> > understand it: seems dangerous to be relying upon the lower level page
> > just happening to be there already.  Isn't memory pressure then likely
> > to clog up with lots of upper level dirty pages which cannot get
> > written out to the lower level?
> 
> Based on vfs.txt (which perhaps should be revised :-), I was trying to do
> the best I can to ensure that no data is lost if the current page cannot be
> written out to the lower f/s.
> 
> I used to do grab_cache_page() before, but that caused problems: writepage
> is not the right place to _increase_ memory pressure by allocating a new
> page...

Yes, but just hoping the lower page will be there, and doing nothing
to encourage it to become there, sounds an even poorer strategy to me.

It's not easy, I know.  Your position reminds me of the loop driver
(drivers/block/loop.c), which has long handled this situation (with
great success, though I doubt an absolute guarantee) by taking
__GFP_IO|__GFP_FS off the mapping_gfp_mask of the underlying file:
look for gfp_mask in loop_set_fd() (and I think ignore do_loop_switch(),
that's new to me and seems to be for a very special case).

I grepped for gfp in unionfs, and there seems to be nothing: I doubt
you can be robust under memory pressure without doing something about
that.  If you can take __GFP_IO|__GFP_FS off the lower mapping (just
while in unionfs_writepage, or longer term? what locking needed?),
then you should be able to go back to using grab_cache_page().

> 
> One solution I thought of is do what ecryptfs does: keep an open struct file
> in my inode and call vfs_write(), but I don't see that as a significantly
> cleaner/better solution.

I agree with you.

> (BTW, ecrypfts kinda had to go for vfs_write b/c
> it changes the data size and content of what it writes below; unionfs is
> simpler in that manner b/c it needs to write the same data to the lower file
> at the same offset.)

Ah, yes.

> 
> Another idea we've experimented with before is "page pointer flipping."  In
> writepage, we temporarily set the page->mapping->host to the lower_inode;
> then we call the lower writepage with OUR page; then fix back the
> page->mapping->host to the upper inode.  This had two benefits: first we can
> guarantee that we always have a page to write below, and second we don't
> need to keep both upper and lower pages (reduces memory pressure).  Before
> we did this page pointer flipping, we verified that the page is locked so no
> other user could be written the page->mapping->host in this transient state,
> and we ensured that no lower f/s was somehow caching the temporarily changed
> value of page->mapping->host for later use.  But, mucking with the pointers
> in this manner is kinda ugly, to say the least.  Still, I'd love to find a
> clean and simple way t

Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-25 Thread Hugh Dickins
On Mon, 22 Oct 2007, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> > 
> > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> > Both of those set BDI_CAP_NO_WRITEBACK.  ramdisk never returned it
> > if !wbc->for_reclaim.  I contend that shmem shouldn't either: it's
> > a special code to get the LRU rotation right, not useful elsewhere.
> > Though Documentation/filesystems/vfs.txt does imply wider use.
> 
> Yes, based on vfs.txt I figured unionfs should return
> AOP_WRITEPAGE_ACTIVATE.

unionfs_writepage returns it in two different cases: when it can't
find the underlying page; and when the underlying writepage returns
it.  I'd say it's wrong to return it in both cases.

In the first case, you don't really want your page put back to the head
of the active list, you want to come back to try it again quite soon
(I think): so you should just redirty and unlock and pretend success.

ramdisk uses A_W_A because none of its pages will ever become freeable
(and comment points out it'd be better if they weren't even on the
LRUs - I think several people have recently been putting forward
patches to keep such timewasters off the LRUs).

shmem uses A_W_A when there's no swap (left), or when the underlying
shm is marked as locked in memory: in each case, best to move on to
look for other pages to swap out.  (But I'm not quite convincing myself
that the temporarily out-of-swap case is different from yours above.)
It's about fixing some horrid busy loops where vmscan kept going
over the same hopeless pages repeatedly, instead of moving on to
better candidates.  Oh, there's a third case, when move_to_swap_cache
fails: that's rare, and I think I was just too lazy to separate them.

In your second case, I fail to see why the unionfs level should
mimic the lower level: you've successfully copied data and marked
the lower level pages as dirty, vmscan will come back to those in
due course, but it's just a waste of time for it to come back to
the unionfs pages again - isn't it?

> But, now that unionfs has ->writepages which won't
> even call the lower writepage if BDI_CAP_NO_WRITEBACK is on, then perhaps I
> no longer need unionfs_writepage to bother checking for
> AOP_WRITEPAGE_ACTIVATE, or even return it up?

unionfs_writepages handles the sync/msync/fsync leaking of A_W_A to
userspace issue, as does Pekka & Andrew's patch to write_cache_pages,
as does my patch to shmem_writepage.  And I'm contending that
unionfs_writepage should in no case return A_W_A up.

But so long as A_W_A is still defined, unionfs_writepage does
still need to check for it after calling the lower level ->writepage
(because it needs to do the missing unlock_page): unionfs_writepages
prevents unionfs_writepage being called on the normal writeout path,
but it's still getting called by vmscan under memory pressure.

(I'm in the habit of saying "vmscan" rather than naming the functions
in question, because every few months someone restructures that file
and changes their names.  I exaggerate, but it's happened often enough.)

> But, a future file system _could_ return AOP_WRITEPAGE_ACTIVATE w/o setting
> BDI_CAP_NO_WRITEBACK, right?  In that case, unionfs will still need to
> handle AOP_WRITEPAGE_ACTIVATE in ->writepage, right?

For so long as AOP_WRITEPAGE_ACTIVATE exists, unionfs_writepage needs to
check for it coming from the lower level ->writepage, as I said above.

But your/Pekka's unionfs_writepages doesn't need to worry about it
at all, because Andrew/Pekka's write_cache_pages fix prevents it
leaking up in the !reclaim case (as does my shmem_writepage fix):
please remove that AOP_WRITEPAGE_ACTIVATE comment from unionfs_writepages.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-25 Thread Hugh Dickins
On Mon, 22 Oct 2007, Pekka Enberg wrote:
> On 10/22/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> > Both of those set BDI_CAP_NO_WRITEBACK.  ramdisk never returned it
> > if !wbc->for_reclaim.  I contend that shmem shouldn't either: it's
> > a special code to get the LRU rotation right, not useful elsewhere.
> > Though Documentation/filesystems/vfs.txt does imply wider use.
> >
> > I think this is where people use the phrase "go figure" ;)
> 
> Heh. As far as I can tell, the implication of "wider use" was added by
> Neil in commit "341546f5ad6fce584531f744853a5807a140f2a9 Update some
> VFS documentation", so perhaps he might know? Neil?

I take as gospel this extract from Andrew's original 2.5.52 comment:

  So the locking rules for writepage() are unchanged.  They are:
  
  - Called with the page locked
  - Returns with the page unlocked
  - Must redirty the page itself if it wasn't all written.
  
  But there is a new, special, hidden, undocumented, secret hack for
  tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM to move
  the page to the active list.  The page must be kept locked in this one
  case.

Special, hidden, undocumented, secret hack!  Then in 2.6.7 Andrew
stole his own secret and used it when concocting ramdisk_writepage.
Oh, and NFS made some kind of use of it in 2.6.6 only.  Then Neil
revealed the secret to the uninitiated in 2.6.17: now, what's the
appropriate punishment for that?

In the full 2.5.52 comment, Andrew explains how prior to this secret
code, we used fail_writepage, which in the memory pressure case did
an activate_page, with the intention of moving the page to the active
list - but that didn't actually work, because the page is off the
LRUs at this point, being passed around between pagevecs.

I've always preferred the way it was originally trying to do it, which
seems clearer and less error-prone than having a special code which
people then have to worry about.  Here's the patch I'd like to present
in due course (against 2.6.24-rc1, so unionfs absent): tmpfs and ramdisk
simply SetPageActive for this case (and go back to obeying the usual
unlocking rule for writepage), vmscan.c observe and act accordingly.

But I've not tested it at all (well, I've run with it in, but not
actually going down the paths in question): it may suffer from
something silly like the original fail_writepage.  Plus I might be
persuaded into making inc_zone_page_state(page, NR_VMSCAN_WRITE)
conditional on !PageActive(page), just to produce the same stats
as before (though they don't make a lot of sense, counting other
non-writes as writes).  And would it need a deprecation phase?

Hugh

 Documentation/filesystems/Locking |6 +-
 Documentation/filesystems/vfs.txt |4 +---
 drivers/block/rd.c|5 ++---
 include/linux/fs.h|   10 --
 mm/migrate.c  |5 ++---
 mm/page-writeback.c   |4 
 mm/shmem.c|   11 ---
 mm/vmscan.c   |   17 ++---
 8 files changed, 20 insertions(+), 42 deletions(-)

--- 2.6.24-rc1/Documentation/filesystems/Locking2007-10-24 
07:15:11.0 +0100
+++ linux/Documentation/filesystems/Locking 2007-10-24 08:42:07.0 
+0100
@@ -228,11 +228,7 @@ If the filesystem is called for sync the
 in-progress I/O and then start new I/O.
 
 The filesystem should unlock the page synchronously, before returning to the
-caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE
-value. WRITEPAGE_ACTIVATE means that page cannot really be written out
-currently, and VM should stop calling ->writepage() on this page for some
-time. VM does this by moving page to the head of the active list, hence the
-name.
+caller.
 
 Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
 and return zero, writepage *must* run set_page_writeback() against the page,
--- 2.6.24-rc1/Documentation/filesystems/vfs.txt2007-10-24 
07:15:11.0 +0100
+++ linux/Documentation/filesystems/vfs.txt 2007-10-24 08:42:07.0 
+0100
@@ -567,9 +567,7 @@ struct address_space_operations {
   If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
   try too hard if there are problems, and may choose to write out
   other pages from the mapping if that is easier (e.g. due to
-  internal dependencies).  If it chooses not to start writeout, it
-  should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
-  calling ->writepage on that page.
+  internal dependencies).
 
   See the file "Locking" for more details.
 
--- 2.6.24-rc1/drivers/block/rd.c   2007-10-24 07:15:23.0 +0100
+++ linux/drivers

Re: [PATCH+comment] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE

2007-10-24 Thread Hugh Dickins
On Thu, 25 Oct 2007, Pekka Enberg wrote:
> On 10/25/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > --- 2.6.24-rc1/mm/shmem.c   2007-10-24 07:16:04.0 +0100
> > +++ linux/mm/shmem.c2007-10-24 22:31:09.0 +0100
> > @@ -915,6 +915,21 @@ static int shmem_writepage(struct page *
> > struct inode *inode;
> >
> > BUG_ON(!PageLocked(page));
> > +   /*
> > +* shmem_backing_dev_info's capabilities prevent regular writeback 
> > or
> > +* sync from ever calling shmem_writepage; but a stacking filesystem
> > +* may use the ->writepage of its underlying filesystem, in which 
> > case
> 
> I find the above bit somewhat misleading as it implies that the
> !wbc->for_reclaim case can be removed after ecryptfs has similar fix
> as unionfs. Can we just say that while BDI_CAP_NO_WRITEBACK does
> prevent some callers from entering ->writepage(), it's just an
> optimization and ->writepage() must deal with !wbc->for_reclaim case
> properly?

Sorry for being obtuse, but I don't see how that's misleading at all.

ecryptfs already has a (dissimilar) fix in 2.6.24-rc1, not using the
writepage route at all.  But it remains the case that some stacking
filesystem may (would you prefer "might" to "may"?  "may" has a nice
double meaning of "might" and "we'll allow it", but this patch does
indeed allow it) use the ->writepage of its underlying filesystem.

With unionfs also fixed, we don't know of an absolute need for this
patch (and so, on that basis, the !wbc->for_reclaim case could indeed
be removed very soon); but as I see it, the unionfs case has shown
that it's time to future-proof this code against whatever stacking
filesystems come along.  Hence I didn't mention the names of such
filesystems in the source comment.

The !page_mapped assumption has been built in there since earliest
2.4, but it took a while for us to get a way to express it in a BUG.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH+comment] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE

2007-10-24 Thread Hugh Dickins
It's possible to provoke unionfs (not yet in mainline, though in mm
and some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)).
I expect it's possible to provoke the 2.6.23 ecryptfs in the same way
(but the 2.6.24 ecryptfs no longer calls lower level's ->writepage).

This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE
could leak from tmpfs via write_cache_pages and unionfs to userspace.
There's already a fix (e423003028183df54f039dfda8b58c49e78c89d7 -
writeback: don't propagate AOP_WRITEPAGE_ACTIVATE) in the tree for
that, and it's okay so far as it goes; but insufficient because it
doesn't address the underlying issue, that shmem_writepage expects
to be called only by vmscan (relying on backing_dev_info capabilities
to prevent the normal writeback path from ever approaching it).

That's an increasingly fragile assumption, and ramdisk_writepage
(the other source of AOP_WRITEPAGE_ACTIVATEs) is already careful
to check wbc->for_reclaim before returning it.  Make the same check
in shmem_writepage, thereby sidestepping the page_mapped BUG also.

Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
---
Unionfs intends its own, third fix to these issues, checking
backing_dev_info capabilities as the normal writeback path does.
And I intend a fourth fix, getting rid of AOP_WRITEPAGE_ACTIVATE
entirely (mainly to put a stop to everybody asking what it means
and when it happens and how to handle it) - but that's a slightly
bigger patch, needing a little more testing, probably for 2.6.25.

I've CC'ed this to stable as you did for the write_cache_pages
fix: it's probably required for ecryptfs (but unionfs was much
easier to set up and test), and helpful to distros using unionfs
and checking stable for fixes.  Does this make the write_cache_pages
fix redundant?  Probably, but let's have both in for safety.

 mm/shmem.c |   15 +++
 1 file changed, 15 insertions(+)

--- 2.6.24-rc1/mm/shmem.c   2007-10-24 07:16:04.0 +0100
+++ linux/mm/shmem.c2007-10-24 22:31:09.0 +0100
@@ -915,6 +915,21 @@ static int shmem_writepage(struct page *
struct inode *inode;
 
BUG_ON(!PageLocked(page));
+   /*
+* shmem_backing_dev_info's capabilities prevent regular writeback or
+* sync from ever calling shmem_writepage; but a stacking filesystem
+* may use the ->writepage of its underlying filesystem, in which case
+* we want to do nothing when that underlying filesystem is tmpfs
+* (writing out to swap is useful as a response to memory pressure, but
+* of no use to stabilize the data) - just redirty the page, unlock it
+* and claim success in this case.  AOP_WRITEPAGE_ACTIVATE, and the
+* page_mapped check below, must be avoided unless we're in reclaim.
+*/
+   if (!wbc->for_reclaim) {
+   set_page_dirty(page);
+   unlock_page(page);
+   return 0;
+   }
BUG_ON(page_mapped(page));
 
mapping = page->mapping;
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE

2007-10-24 Thread Hugh Dickins
It's possible to provoke unionfs (not yet in mainline, though in mm
and some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)).
I expect it's possible to provoke the 2.6.23 ecryptfs in the same way
(but the 2.6.24 ecryptfs no longer calls lower level's ->writepage).

This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE
could leak from tmpfs via write_cache_pages and unionfs to userspace.
There's already a fix (e423003028183df54f039dfda8b58c49e78c89d7 -
writeback: don't propagate AOP_WRITEPAGE_ACTIVATE) in the tree for
that, and it's okay so far as it goes; but insufficient because it
doesn't address the underlying issue, that shmem_writepage expects
to be called only by vmscan (relying on backing_dev_info capabilities
to prevent the normal writeback path from ever approaching it).

That's an increasingly fragile expectation, and ramdisk_writepage
(the other source of AOP_WRITEPAGE_ACTIVATEs) is already careful
to check wbc->for_reclaim before returning it.  Make the same check
in shmem_writepage, thereby sidestepping the page_mapped BUG also.

Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
---
Unionfs intends its own, third fix to these issues, checking
backing_dev_info capabilities as the normal writeback path does.
And I intend a fourth fix, getting rid of AOP_WRITEPAGE_ACTIVATE
entirely (mainly to put a stop to everybody asking what it means
and when it happens and how to handle it) - but that's a slightly
bigger patch, needing a little more testing, probably for 2.6.25.

I've CC'ed this to stable as you did for the write_cache_pages
fix: it's probably required for ecryptfs (but unionfs was much
easier to set up and test), and helpful to distros using unionfs
and checking stable for fixes.  Does this make the write_cache_pages
fix redundant?  Probably, but let's have both in for safety.

 mm/shmem.c |5 +
 1 file changed, 5 insertions(+)

--- 2.6.24-rc1/mm/shmem.c   2007-10-24 07:16:04.0 +0100
+++ linux/mm/shmem.c2007-10-24 20:24:31.0 +0100
@@ -915,6 +915,11 @@ static int shmem_writepage(struct page *
struct inode *inode;
 
BUG_ON(!PageLocked(page));
+   if (!wbc->for_reclaim) {
+   set_page_dirty(page);
+   unlock_page(page);
+   return 0;
+   }
BUG_ON(page_mapped(page));
 
mapping = page->mapping;
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-22 Thread Hugh Dickins
On Mon, 15 Oct 2007, Pekka Enberg wrote:
> 
> I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that
> ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for
> !wbc->for_reclaim case which would explain why we haven't hit this bug
> before. Hugh, Andrew?

Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
Both of those set BDI_CAP_NO_WRITEBACK.  ramdisk never returned it
if !wbc->for_reclaim.  I contend that shmem shouldn't either: it's
a special code to get the LRU rotation right, not useful elsewhere.
Though Documentation/filesystems/vfs.txt does imply wider use.

I think this is where people use the phrase "go figure" ;)

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-22 Thread Hugh Dickins
On Sun, 14 Oct 2007, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, Pekka J Enberg writes:
> > 
> > Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be 
> > calling unionfs_writepage() _at all_ if the lower mapping has 
> > BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally 
> > untested patch below?
...

I don't disagree with your unionfs_writepages patch, Pekka, but I think
it should be viewed as an optimization (don't waste time trying to write
a group of pages when we know that nothing will be done) rather than as
essential.

Prior to unionfs's own use of AOP_WRITEPAGE_ACTIVATE, there have only
been ramdisk and shmem generating it.  ramdisk is careful only to
return it in the wbc->for_reclaim case: I think (as in the patch
I sent out before) shmem now ought to do so too for safety.

Back in 2.4 days it was reasonable to assume that ->writepage would
only get called from certain places, but things move faster nowadays,
and the unionfs example shows others are liable to start ab/using it.
I'll send Andrew that patch tomorrow (it's simple enough, but I'd
like at least to try to reproduce the page_mapped bug first).

> 
> Pekka, with a small change to your patch (to handle time-based cache
> coherency), your patch worked well and passed all my tests.  Thanks.
> 
> So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE
> from being returned to userland.  I guess we still need it, b/c even with
> your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to
> the VFS and we need to ensure that doesn't "leak" outside the kernel.

Can it now?  Current git has a patch from Andrew which bears a striking
resemblance to that from Pekka, stopping the leak from write_cache_pages.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-22 Thread Hugh Dickins
Sorry for my delay, here are a few replies.

On Sun, 14 Oct 2007, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, "Pekka Enberg" writes:
> > 
> > However, I don't think the mapping_cap_writeback_dirty() check in
> > __filemap_fdatawrite_range() works as expected when tmpfs is a lower
> > mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability
> > for unionfs mappings so do_fsync() will call write_cache_pages() that
> > unconditionally invokes shmem_writepage() via unionfs_writepage().
> > Unless, of course, there's some other unionfs magic I am missing.

Thanks, Pekka, yes that made a lot of sense.

> 
> In unionfs_writepage() I tried to emulate as best possible what the lower
> f/s will have returned to the VFS.  Since tmpfs's ->writepage can return
> AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in
> unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE.

I think that's inappropriate.  Why should unionfs_writepage re-mark its
page as dirty when the lower level does so?  Unionfs has successfully
done its write to the lower level, what the lower level then gets up to
(writing then or not) is its own business: needn't be propagated upwards.

The fewer places that supply AOP_WRITEPAGE_ACTIVATE the better.
What I'd like most of all is to eliminate it, in favour of vmscan.c
working out the condition for itself: but I've given that no thought,
it may not be reasonable.

unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot
find_lock_page: that case may be appropriate.  Though I don't really
understand it: seems dangerous to be relying upon the lower level page
just happening to be there already.  Isn't memory pressure then likely
to clog up with lots of upper level dirty pages which cannot get
written out to the lower level?

> 
> Should I be doing something different when unionfs stacks on top of tmpfs?

I think not.

> (BTW, this is probably also relevant to ecryptfs.)

You're both agreed on that, but I don't see how: ecryptfs writes the
lower level via vfs_write, it's not using the lower level's writepage,
is it?

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-14 Thread Hugh Dickins
On Sat, 13 Oct 2007, Pekka Enberg wrote:
> On 10/12/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > But I keep suspecting that the answer might be the patch below (which
> > rather follows what drivers/block/rd.c is doing).  I'm especially
> > worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned
> > to userspace, bad enough in itself, you might be liable to hit that
> > BUG_ON(page_mapped(page)).  shmem_writepage does not expect to be
> > called by anyone outside mm/vmscan.c, but unionfs can now get to it?
> 
> Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages()
> without unionfs even?

I believe not.  Please do double-check my assertions, I've always found
the _writepages paths rather twisty; but my belief (supported by the
fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page))
in five years) is that tmpfs/shmem opts out of all of that with its
.capabilities   = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
in shmem_backing_dev_info, which avoids all those _writepages avenues
(e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is
just a subfunction of the _writepages.

So, while I don't disagree with your patch to write_cache_pages (though
it wasn't clear to me whether it should break from or continue the loop
if it ever does meet an AOP_WRITEPAGE_ACTIVATE), I don't think that's
really the root of the problem.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-11 Thread Hugh Dickins
On Thu, 11 Oct 2007, Ryan Finnie wrote:
> On 10/11/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > shit.  That's a nasty bug.  Really userspace should be testing for -1, but
> > the msync() library function should only ever return 0 or -1.
> >
> > Does this fix it?
> >
> > --- a/mm/page-writeback.c~a
> > +++ a/mm/page-writeback.c
> > @@ -850,8 +850,10 @@ retry:
> >
> > ret = (*writepage)(page, wbc, data);
> >
> > -   if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE))
> > +   if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> > unlock_page(page);
> > +   ret = 0;
> > +   }
> > if (ret || (--(wbc->nr_to_write) <= 0))
> > done = 1;
> > if (wbc->nonblocking && bdi_write_congested(bdi)) {
> > _
> >
> 
> Pekka Enberg replied with an identical patch a few days ago, but for
> some reason the same condition flows up to msync as -1 EIO instead of
> AOP_WRITEPAGE_ACTIVATE with that patch applied.  The last part of the
> thread is below.  Thanks.

Each time I sit down to follow what's going on with writepage and
unionfs and msync, I get distracted: I really haven't researched
this properly.

But I keep suspecting that the answer might be the patch below (which
rather follows what drivers/block/rd.c is doing).  I'm especially
worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned
to userspace, bad enough in itself, you might be liable to hit that
BUG_ON(page_mapped(page)).  shmem_writepage does not expect to be
called by anyone outside mm/vmscan.c, but unionfs can now get to it?

Please let us know if this patch does fix it:
then I'll try harder to work out what goes on.

Thanks,
Hugh

--- 2.6.23/mm/shmem.c   2007-10-09 21:31:38.0 +0100
+++ linux/mm/shmem.c2007-10-12 01:25:46.0 +0100
@@ -916,6 +916,11 @@ static int shmem_writepage(struct page *
struct inode *inode;
 
BUG_ON(!PageLocked(page));
+   if (!wbc->for_reclaim) {
+   set_page_dirty(page);
+   unlock_page(page);
+   return 0;
+   }
BUG_ON(page_mapped(page));
 
mapping = page->mapping;
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH]fix VM_CAN_NONLINEAR check in sys_remap_file_pages

2007-10-08 Thread Hugh Dickins
On Mon, 8 Oct 2007, Yan Zheng wrote:
> 
> The test for VM_CAN_NONLINEAR always fails

Good catch indeed.  Though I was puzzled how we do nonlinear at all,
until I realized it's "The test for not VM_CAN_NONLINEAR always fails".

It's not as serious as it appears, since code further down has been
added more recently to simulate nonlinear on non-RAM-backed filesystems,
instead of going the real nonlinear way; so most filesystems are now not
required to do what VM_CAN_NONLINEAR was put in to ensure they could do.

I'm confused as to where that leaves us: is this actually a fix that
needs to go into 2.6.23?  or will it suddenly disable a system call
which has been silently working fine on various filesystems which did
not add VM_CAN_NONLINEAR?  could we just rip out VM_CAN_NONLINEAR?
I hope Nick or Miklos is clearer on what the risks are.

(Apologies for all the "not"s and "non"s here, I'm embarrassed
after just criticizing Ingo's SCHED_NO_NO_OMIT_FRAME_POINTER!)

Hugh

> 
> Signed-off-by: Yan Zheng<[EMAIL PROTECTED]>
> 
> diff -ur linux-2.6.23-rc9/mm/fremap.c linux/mm/fremap.c
> --- linux-2.6.23-rc9/mm/fremap.c  2007-10-07 15:03:33.0 +0800
> +++ linux/mm/fremap.c 2007-10-08 19:33:44.0 +0800
> @@ -160,7 +160,7 @@
>   if (vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR))
>   goto out;
> 
> - if (!vma->vm_flags & VM_CAN_NONLINEAR)
> + if (!(vma->vm_flags & VM_CAN_NONLINEAR))
>   goto out;
> 
>   if (end <= start || start < vma->vm_start || end > vma->vm_end)
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-21 Thread Hugh Dickins
On Thu, 20 Sep 2007, Christoph Lameter wrote:
> On Thu, 20 Sep 2007, David Chinner wrote:
> > > Disagree, the mmap side is not a little change.
> > 
> > That's not in the filesystem, though. ;)
> 
> And its really only a minimal change for some function to loop over all 
> 4k pages and elsewhere index the right 4k subpage.

I agree with you on that: the changes you had to make to support mmap
were _much_ less bothersome than I'd been fearing, and I'm surprised
some people still see that side of it as a sticking point.

But I've kept very quiet because I remain quite ambivalent about the
patchset: I'm somewhere on the spectrum between you and Nick, shifting
my position from hour to hour.  Don't expect any decisiveness from me.

In some senses I'm even further off the scale away from you: I'm
dubious even of Nick and Andrew's belief in opportunistic contiguity.  
Just how hard should we trying for contiguity?  How far should we
go in sacrificing our earlier "LRU" principles?  It's easy to bump
up PAGE_ALLOC_COSTLY_ORDER, but what price do we pay when we do?

I agree with those who want to see how the competing approaches
work out in practice: which is frustrating for you, yes, because
you are so close to ready.  (I've not glanced at virtual compound,
but had been wondering in that direction before you suggested it.)

I do think your patchset is, for the time being at least, a nice
out-of-tree set, and it's grand to be able to bring a filesystem
from another arch with larger pagesize and get at the data from it.

I've found some fixes needed on top of your Large Blocksize Support
patches: I'll send those to you in a moment.  Looks like you didn't
try much swapping!

I only managed to get ext2 working with larger blocksizes:
reiserfs -b 8192 wouldn't mount ("reiserfs_fill_super: can not find
reiserfs on /dev/sdb1"); ext3 gave me mysterious errors ("JBD: tar
wants too many credits", even after adding JBD patches that you
turned out to be depending on); and I didn't try ext4 or xfs
(I'm guessing the latter has been quite well tested ;)

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 11/26] tmpfs white-out support

2007-08-01 Thread Hugh Dickins
On Mon, 30 Jul 2007, Jan Blunck wrote:

> Introduce white-out support to tmpfs.
> 
> Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
> ---
>  include/linux/shmem_fs.h |1 
>  mm/shmem.c   |   54 
> +++
>  2 files changed, 55 insertions(+)

I see there's debate about whether this (and its fellows) give the
right semantic to whiteouts; and I've not begun to think about that.

But as a patch to tmpfs for what you're trying to do, it looks just
about fine.  I say "just about" because the reference counting looks
right, but I wouldn't dare say that it _is_ right without testing.

And I'd probably want to add a minor adjustment, so that a mount with
nr_inodes=1000 could still support exactly 1000 inodes, despite your
allocating one for the whiteout (usually never used) at mount time.
But that can follow along later, no problem.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] AFS: Implement shared-writable mmap

2007-05-16 Thread Hugh Dickins
On Thu, 17 May 2007, Nick Piggin wrote:
> 
> ... and I don't want to change the
> VM in a way that people are not unhappy with :)

I'm hoping you intended one less negative ;)
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] AFS: Implement shared-writable mmap

2007-05-16 Thread Hugh Dickins
On Wed, 16 May 2007, Nick Piggin wrote:
> Andrew Morton wrote:
> > I need to work out what to do with
> > 
> > mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> > mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> > mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-doc-fix.patch
> > mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-fix.patch
> > mm-merge-nopfn-into-fault.patch
> > convert-hugetlbfs-to-use-vm_ops-fault.patch
> > mm-remove-legacy-cruft.patch
> > mm-debug-check-for-the-fault-vs-invalidate-race.patch
> > mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> > 
> > Probably merge them, I guess.  Hugh had concerns, I think over small
> > additional overhead from the lock_page()?

That's right, the overhead of the lock_page()/unlock_page() in the common
path of faulting, and of the extra call to unmap_mapping_range() when
truncating (because page lock doesn't satisfactorily replace the old
sequence count when COWing).

> Yes he did. It seems to only be noticable in microbenchmarks.

So far, yes.  I expect it'll surface in some reallife workload
sometime, but let's not get too depressed about that.  I guess
your blithe "Scalability is not an issue" comment still irks me.

> In my opinion not enough to withhold pagecache corruption bug fixes.

It is a pity to be adding overhead to a common path in order to fix
such very rare cases, but yes, we probably have to live with that.

> Still, I have some lock_page speedup work that eliminates that regression
> anyway.

Again, rather too blithely said.  You have a deep well of ingenuity,
but I doubt it can actually wash away all of the small overhead added.

> However, Hugh hasn't exactly said yes or no yet...

Getting a "yes" or "no" out of me is very hard work indeed.
But I didn't realize that was gating this work: if the world
had to wait on me, we'd be in trouble.

I think there are quite a few people eager to see the subsequent
->fault changes go in.  And I think I'd just like to minimize the
difference between -mm and mainline, so maximize comprehensibilty,
by getting this all in.  I've not heard of any correctness issues
with it, and we all agree that the page lock there is attractively
simple.

If I ever find a better way of handling it,
I'm free to send patches in future, after all.

Did that sound something like a "yes"?

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 14/14] tmpfs whiteout support

2007-05-14 Thread Hugh Dickins
On Mon, 14 May 2007, Jan Blunck wrote:
> On 5/14/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > >
> > > /* Pretend that each entry is of this size in directory's i_size */
> > > -#define BOGO_DIRENT_SIZE 20
> > > +#define BOGO_DIRENT_SIZE 1
> >
> > Why would that change be needed for whiteout support?
> 
> Good question. It seems that this a survivor of the changes necessary
> for union readdir.

(I'd be asking the same question in that case, but don't worry about it!)

> This isn't necessary for white-outs.

Phew, thanks, please drop that hunk.

> BTW: Why do we claim this to be 20??? Is there any meaning behind this?

No great meaning, hence "BOGO".  I put that in when hpa (IIRC) found
tmpfs directory size 0 didn't suit some apps.  I thought it would be
nice to have a size which indicates the current number of entries
(which your 1 would do), looks plausible (for short filenames),
and easy to make sense of in an "ls -l".  Bogus, yes; but I'd
resist changing it after all this time, without very good reason.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 14/14] tmpfs whiteout support

2007-05-14 Thread Hugh Dickins
On Mon, 14 May 2007, Bharata B Rao wrote:
> From: Jan Blunck <[EMAIL PROTECTED]>
> Subject: tmpfs whiteout support
> 
> Introduce whiteout support to tmpfs.
> 
> Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
> Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
> ---
>  mm/shmem.c |9 -
>  1 files changed, 8 insertions(+), 1 deletion(-)
> 
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -74,7 +74,7 @@
>  #define LATENCY_LIMIT 64
>  
>  /* Pretend that each entry is of this size in directory's i_size */
> -#define BOGO_DIRENT_SIZE 20
> +#define BOGO_DIRENT_SIZE 1

Why would that change be needed for whiteout support?

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1 of 2] Implement generic block_page_mkwrite() functionality

2007-02-07 Thread Hugh Dickins
On Thu, 8 Feb 2007, David Chinner wrote:
> On Wed, Feb 07, 2007 at 01:00:28PM +0000, Hugh Dickins wrote:
> > 
> > I'm worried about concurrent truncation.  Isn't it the case that
> > i_mutex is held when prepare_write and commit_write are normally
> > called?  But not here when page_mkwrite is called.
> 
> I'm not holding i_mutex. I assumed that it was probably safe to do
> because we are likely to be reading the page off disk just before we
> call mkwrite and that has to be synchronised with truncate in some
> manner

"assumed"..."probably"..."likely"..."just before"..."in some manner"
doesn't sound very safe, does it :-?

The well-established paths are almost safe against truncation (I insert
"almost" there because although we like to think they're entirely safe,
from time to time a hole is discovered, and Nick has been wrestling
with filling them for some while now).

But page_mkwrite is something new: so far, it's up to the implementor
(the filesystem) to work out how to guard against truncation.

> 
> So, do I need to grab the i_mutex here? Is that safe to do that in
> the middle of a page fault?

It's certainly easier to think about if you don't grab i_mutex there:
sys_msync used to take i_mutex within down_read of mmap_sem, but we
were quite happy to get rid of that, because usually it's down_read
of mmap_sem within i_mutex (page fault when writing from userspace
to file).  I can't at this moment put my finger on an actual deadlock
if you take i_mutex in page_mkwrite, but it feels wrong: hmm, if you
add another thread waiting to down_write the mmap_sem, I think you
would be able to deadlock.

You don't need to lock out all truncation, but you do need to lock
out truncation of the page in question.  Instead of your i_size
checks, check page->mapping isn't NULL after the lock_page?

But aside from the truncation issue, if prepare_write and commit_write
are always called with i_mutex held at present, I'm doubtful whether
you can provide a generic default page_mkwrite which calls them without.
Which would take us back to grabbing i_mutex within page_mkwrite.  Ugh.

> If we do race with a truncate and the
> page is now beyond EOF, what am I supposed to return?

Something negative.  Nothing presently reports the error code in
question, it just does SIGBUS; but it would be better for the
truncation case to avoid -ENOMEM and -ENOSPC, which could easily
have meanings here.  I don't see a good choice, so maybe -EINVAL.

> 
> I'm fishing for what I'm supposed to be doing here because there's
> zero implementations of this callout in the kernel and the comments
> in the code explaining the interface constraints are
> non-existant

Well, you seem to be the first to implement it.  Hmm, that's not true,
David was: what magic saved him from addressing the truncation issue?

Don't be surprised if it turns out page_mkwrite needs more thought.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1 of 2] Implement generic block_page_mkwrite() functionality

2007-02-07 Thread Hugh Dickins
On Wed, 7 Feb 2007, David Chinner wrote:

> On Christoph's suggestion, take the guts of the proposed
> xfs_vm_page_mkwrite function and implement it as a generic
> core function as it used no specific XFS code at all.
> 
> This allows any filesystem to easily hook the ->page_mkwrite()
> VM callout to allow them to set up pages dirtied by mmap
> writes correctly for later writeout.
> 
> Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]>

I'm worried about concurrent truncation.  Isn't it the case that
i_mutex is held when prepare_write and commit_write are normally
called?  But not here when page_mkwrite is called.

Hugh

> 
> ---
>  fs/buffer.c |   30 ++
>  include/linux/buffer_head.h |2 ++
>  2 files changed, 32 insertions(+)
> 
> Index: 2.6.x-xfs-new/fs/buffer.c
> ===
> --- 2.6.x-xfs-new.orig/fs/buffer.c2007-02-07 23:00:05.0 +1100
> +++ 2.6.x-xfs-new/fs/buffer.c 2007-02-07 23:09:47.642356116 +1100
> @@ -2194,6 +2194,36 @@ int generic_commit_write(struct file *fi
>   return 0;
>  }
>  
> +/*
> + * block_page_mkwrite() is not allowed to change the file size as
> + * it gets called from a page fault handler when a page is first
> + * dirtied. Hence we must be careful to check for EOF conditions
> + * here. We set the page up correctly for a written page which means
> + * we get ENOSPC checking when writing into holes and correct
> + * delalloc and unwritten extent mapping on filesystems that support
> + * these features.
> + */
> +int
> +block_page_mkwrite(struct vm_area_struct *vma, struct page *page,
> +get_block_t get_block)
> +{
> + struct inode*inode = vma->vm_file->f_path.dentry->d_inode;
> + unsigned long   end;
> + int ret = 0;
> +
> + if (((page->index + 1) << PAGE_CACHE_SHIFT) > i_size_read(inode))
> + end = i_size_read(inode) & ~PAGE_CACHE_MASK;
> + else
> + end = PAGE_CACHE_SIZE;
> +
> + lock_page(page);
> + ret = block_prepare_write(page, 0, end, get_block);
> + if (!ret)
> + ret = block_commit_write(page, 0, end);
> + unlock_page(page);
> +
> + return ret;
> +}
>  
>  /*
>   * nobh_prepare_write()'s prereads are special: the buffer_heads are freed
> Index: 2.6.x-xfs-new/include/linux/buffer_head.h
> ===
> --- 2.6.x-xfs-new.orig/include/linux/buffer_head.h2007-02-07 
> 23:00:02.0 +1100
> +++ 2.6.x-xfs-new/include/linux/buffer_head.h 2007-02-07 23:12:33.156749344 
> +1100
> @@ -206,6 +206,8 @@ int cont_prepare_write(struct page*, uns
>  int generic_cont_expand(struct inode *inode, loff_t size);
>  int generic_cont_expand_simple(struct inode *inode, loff_t size);
>  int block_commit_write(struct page *page, unsigned from, unsigned to);
> +int block_page_mkwrite(struct vma_area_struct *vma, struct page *page,
> + get_block_t get_block);
>  void block_sync_page(struct page *);
>  sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
>  int generic_commit_write(struct file *, struct page *, unsigned, unsigned);
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - nopage alternative

2005-02-04 Thread Hugh Dickins
On Fri, 4 Feb 2005, Anton Altaparmakov wrote:
> On Thu, 2005-02-03 at 11:23 -0800, Bryan Henderson wrote:
> > 
> > I think the point is that we can't have a "handler for writes," because 
> > the writes are being done by simple CPU Store instructions in a user 
> > program.  The handler we're talking about is just for page faults.  Other 
> 
> That was my understanding.
> 
> > operating systems approach this by actually _having_ a handler for a CPU 
> > store instruction, in the form of a page protection fault handler -- the 
> > nopage routine adds the page to the user's address space, but write 
> > protects it.  The first time the user tries to store into it, the 
> > filesystem driver gets a chance to do what's necessary to support a dirty 
> > cache page -- allocate a block, add additional dirty pages to the cache, 
> > etc.  It would be wonderful to have that in Linux.
> 
> It would.  This would certainly solve this problem.

Isn't this exactly what David Howells' page_mkwrite stuff in -mm's
add-page-becoming-writable-notification.patch is designed for?

Though it looks a little broken to me as it stands (beyond the two
fixup patches already there).  I've not found time to double-check
or test, apologies in advance if I'm libelling, but...

(a) I thought the prot bits do_nopage gives a pte in a shared writable
mapping include write permission, even when it's a read fault:
that can't be allowed if there's a page_mkwrite.

(b) I don't understand how do_wp_page's "reuse" logic for whether it
can just go ahead and use the existing anonymous page, would have
any relevance to calling page_mkwrite on a shared writable page,
which must be used and not COWed however many references there are.

Didn't look further, think you should take a look at page_mkwrite,
but don't expect the implementation to be correct yet.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html