Re: [PATCH 0/3] btrfs: Convert kmap/memset/kunmap to memzero_user()

2021-03-10 Thread Andrew Morton
On Tue,  9 Mar 2021 13:21:34 -0800 ira.we...@intel.com wrote:

> Previously this was submitted to convert to zero_user()[1].  zero_user() is 
> not
> the same as memzero_user() and in fact some zero_user() calls may be better 
> off
> as memzero_user().  Regardless it was incorrect to convert btrfs to
> zero_user().
> 
> This series corrects this by lifting memzero_user(), converting it to
> kmap_local_page(), and then using it in btrfs.

This impacts btrfs more than MM.  I suggest the btrfs developers grab
it, with my

Acked-by: Andrew Morton 



Re: [PATCH 1/4] zstd: pass pointer rathen than structure to functions

2019-06-05 Thread Andrew Morton
On Wed, 5 Jun 2019 14:32:53 +0200 David Sterba  wrote:

> > >  
> > > -static ZSTD_parameters zstd_get_btrfs_parameters(unsigned int level,
> > > +static ZSTD_parameters *zstd_get_btrfs_parameters(unsigned int level,
> > >size_t src_len)
> > >  {
> > > - ZSTD_parameters params = ZSTD_getParams(level, src_len, 0);
> > > + static ZSTD_parameters params;
> > 
> > > +
> > > + params = ZSTD_getParams(level, src_len, 0);
> > 
> > No thats' broken, the params can't be static as it depends on level and
> > src_len. What happens if there are several requests in parallel with
> > eg. different levels?

I wondered.  I'll drop the patch series as some more serious thinking
is needed.

> > Would be really great if the mailinglist is CCed when the code is
> > changed in a non-trivial way.

Well we didn't actually change btrfs until I discovered that Maninder
had missed it.

> So this does not compile fs/btrfs/zstd.o which Andrew probably found
> too, otherwise btrfs is the only in-tree user of the function outside of
> lib/ and crypto/.

Worked for me - I might have sent the wrong version.

> I think that Nick Terrell should have been CCed too, as he ported zstd
> to linux.


Re: [PATCH 1/4] zstd: pass pointer rathen than structure to functions

2019-06-04 Thread Andrew Morton
On Mon,  3 Jun 2019 14:32:03 +0530 Maninder Singh  
wrote:

> currently params structure is passed in all functions, which increases
> stack usage in all the function and lead to stack overflow on target like
> ARM with kernel stack size of 8 KB so better to pass pointer.
> 
> Checked for ARM:
> 
> Original   Patched
> Call FLow Size:  1264   1040
> 
> (HUF_sort)  -> 296
> (HUF_buildCTable_wksp)  -> 144
> (HUF_compress4X_repeat) -> 88
> (ZSTD_compressBlock_internal)   -> 200
> (ZSTD_compressContinue_internal)-> 136  -> 88
> (ZSTD_compressCCtx) -> 192  -> 64
> (zstd_compress) -> 144  -> 96
> (crypto_compress)   -> 32
> (zcomp_compress)-> 32
> 
> 
> Signed-off-by: Maninder Singh 
> Signed-off-by: Vaneet Narang 
> 

You missed btrfs.  This needs review, please - particularly the
kernel-wide static ZSTD_parameters in zstd_get_btrfs_parameters().

The base patch is here:

http://lkml.kernel.org/r/1559552526-4317-2-git-send-email-maninder...@samsung.com
  

--- a/fs/btrfs/zstd.c~zstd-pass-pointer-rathen-than-structure-to-functions-fix
+++ a/fs/btrfs/zstd.c
@@ -27,15 +27,17 @@
 /* 307s to avoid pathologically clashing with transaction commit */
 #define ZSTD_BTRFS_RECLAIM_JIFFIES (307 * HZ)
 
-static ZSTD_parameters zstd_get_btrfs_parameters(unsigned int level,
+static ZSTD_parameters *zstd_get_btrfs_parameters(unsigned int level,
 size_t src_len)
 {
-   ZSTD_parameters params = ZSTD_getParams(level, src_len, 0);
+   static ZSTD_parameters params;
+
+   params = ZSTD_getParams(level, src_len, 0);
 
if (params.cParams.windowLog > ZSTD_BTRFS_MAX_WINDOWLOG)
params.cParams.windowLog = ZSTD_BTRFS_MAX_WINDOWLOG;
WARN_ON(src_len > ZSTD_BTRFS_MAX_INPUT);
-   return params;
+   return ¶ms;
 }
 
 struct workspace {
@@ -155,11 +157,12 @@ static void zstd_calc_ws_mem_sizes(void)
unsigned int level;
 
for (level = 1; level <= ZSTD_BTRFS_MAX_LEVEL; level++) {
-   ZSTD_parameters params =
-   zstd_get_btrfs_parameters(level, ZSTD_BTRFS_MAX_INPUT);
-   size_t level_size =
-   max_t(size_t,
- ZSTD_CStreamWorkspaceBound(params.cParams),
+   ZSTD_parameters *params;
+   size_t level_size;
+
+   params = zstd_get_btrfs_parameters(level, ZSTD_BTRFS_MAX_INPUT);
+   level_size = max_t(size_t,
+ ZSTD_CStreamWorkspaceBound(params->cParams),
  ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT));
 
max_size = max_t(size_t, max_size, level_size);
@@ -385,8 +388,9 @@ static int zstd_compress_pages(struct li
unsigned long len = *total_out;
const unsigned long nr_dest_pages = *out_pages;
unsigned long max_out = nr_dest_pages * PAGE_SIZE;
-   ZSTD_parameters params = zstd_get_btrfs_parameters(workspace->req_level,
-  len);
+   ZSTD_parameters *params;
+
+   params = zstd_get_btrfs_parameters(workspace->req_level, len);
 
*out_pages = 0;
*total_out = 0;
_



Re: [PATCH v7 2/6] mm: export add_swap_extent()

2018-10-12 Thread Andrew Morton
On Tue, 11 Sep 2018 15:34:45 -0700 Omar Sandoval  wrote:

> From: Omar Sandoval 
> 
> Btrfs will need this for swap file support.
> 

Acked-by: Andrew Morton 


Re: [PATCH v7 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS

2018-10-12 Thread Andrew Morton
On Tue, 11 Sep 2018 15:34:44 -0700 Omar Sandoval  wrote:

> From: Omar Sandoval 
> 
> The SWP_FILE flag serves two purposes: to make swap_{read,write}page()
> go through the filesystem, and to make swapoff() call
> ->swap_deactivate(). For Btrfs, we want the latter but not the former,
> so split this flag into two. This makes us always call
> ->swap_deactivate() if ->swap_activate() succeeded, not just if it
> didn't add any swap extents itself.
> 
> This also resolves the issue of the very misleading name of SWP_FILE,
> which is only used for swap files over NFS.
> 

Acked-by: Andrew Morton 


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-05 Thread Andrew Morton
On Wed, 6 Jun 2018 06:22:25 +0900 Tetsuo Handa 
 wrote:

> On 2018/06/06 5:03, Andrew Morton wrote:
> > 
> > (switched to email.  Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
> > 
> > On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org 
> > wrote:
> > 
> >> https://bugzilla.kernel.org/show_bug.cgi?id=199931
> >>
> >> Bug ID: 199931
> >>Summary: systemd/rtorrent file data corruption when using echo
> >> 3 >/proc/sys/vm/drop_caches
> > 
> > A long tale of woe here.  Chris, do you think the pagecache corruption
> > is a general thing, or is it possible that btrfs is contributing?
> 
> According to timestamp of my testcases, I was observing corrupted-bytes issue 
> upon OOM-kill
> (without using btrfs) as of 2017 Aug 11. Thus, I don't think that this is 
> specific to btrfs.
> But I can't find which patch fixed this issue.

That sounds different.  Here, the corruption is caused by performing
drop_caches, not by an oom-killing.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-05 Thread Andrew Morton


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=199931
> 
> Bug ID: 199931
>Summary: systemd/rtorrent file data corruption when using echo
> 3 >/proc/sys/vm/drop_caches

A long tale of woe here.  Chris, do you think the pagecache corruption
is a general thing, or is it possible that btrfs is contributing?

Also, that 4.4 oom-killer regression sounds very serious.

>Product: Memory Management
>Version: 2.5
> Kernel Version: 4.14.33
>   Hardware: All
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: Other
>   Assignee: a...@linux-foundation.org
>   Reporter: bugzilla.kernel@plan9.de
> Regression: No
> 
> We found that
> 
>echo 3 >/proc/sys/vm/drop_caches
> 
> causes file data corruption. We found this because we saw systemd journal
> corruption (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=897266) and
> tracked this to a cron job dropping caches every hour. The filesystem in use 
> is
> btrfs, but I don't know if it only happens with this filesystem. btrfs scrub
> reports no problems, so this is not filesystem metdata corruption.
> 
> Basically:
> 
># journalctl --verify
>[everything fine at this point]
># echo 3 >/proc/sys/vm/drop_caches
># journalctl --verify
>[journalctl now reporting corruption problems]
> 
> This is not always reproducible, but when deleting our journal, creating log
> messages for a few hours and then doing the above manually has a ~50% chance 
> of
> corrupting the journal.
> 
> After investigating we found that rtorrent also suffers from corrupted
> downloads when using the above echo - basically, downloading torrents is fine,
> except when executing the above echo a few times during a download, after 
> which
> rtorrent very likely reports a failed hash check.
> 
> All of this is reproducible on two different boxes, so is unlikely to be a
> hardware issue.
> 
> On one affected server we have over 50TB of files, many that have been created
> with the cronjob in place, and none of them are corrupted (we have md5sums of
> everything), so it seems to be related to something that systemd and rtorrent
> do, rather than a generic file corruption issue.
> 
> I also was able to "cmp -l" two corrupted files with their correct version, 
> and
> the corruption manifests itself as streaks of ~100-3000 zero bytes instead of
> the real data. The start offset sems random, but the end offset seems to be
> always aligned to a 4K offset - speculating without the hindrance of knowledge
> this feels like a race somewhere between writing to a mmapped area and freeing
> it, or so.
> 
> Here is the output of cmp -l between a working and a corrupted file, for two
> files:
> 
> http://data.plan9.de/01.cmp.txt
> http://data.plan9.de/02.cmp.txt
> 
> We also have a mysql database with hundreds of gigabytes of writes per day on
> one server which also does not seem to suffer from any corruption.
> 
> As for why we would do something silly as dropping the caches every hour (in a
> cronjob), we started doing this recently because after kernel 4.4, we got
> frequent OOM kills despite having gigabytes of available memory (e.g. 12GB in
> use, 20GB page cache and 16GB empty swap and bang, mysql gets killed). We 
> found
> that that the debian 4.9 kernel is unusable, and 4.14 works, *iff* we use the
> above as an hourly cron job, so we did that, and afterwards run into
> rtorrent/journald corruption issues. Without the echo in place, mysql usually
> gets oom-killed after a few days of uptime.
> 
> -- 
> You are receiving this mail because:
> You are the assignee for the bug.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-13 Thread Andrew Morton
On Mon, 12 Mar 2018 21:28:57 -0700 Kees Cook  wrote:

> On Mon, Mar 12, 2018 at 4:57 PM, Linus Torvalds
>  wrote:
> > On Mon, Mar 12, 2018 at 3:55 PM, Andrew Morton
> >  wrote:
> >>
> >> Replacing the __builtin_choose_expr() with ?: works of course.
> >
> > Hmm. That sounds like the right thing to do. We were so myopically
> > staring at the __builtin_choose_expr() problem that we overlooked the
> > obvious solution.
> >
> > Using __builtin_constant_p() together with a ?: is in fact our common
> > pattern, so that should be fine. The only real reason to use
> > __builtin_choose_expr() is if you want to get the *type* to vary
> > depending on which side you choose, but that's not an issue for
> > min/max.
> 
> This doesn't solve it for -Wvla, unfortunately. That was the point of
> Josh's original suggestion of __builtin_choose_expr().
> 
> Try building with KCFLAGS=-Wval and checking net/ipv6/proc.c:
> 
> net/ipv6/proc.c: In function ‘snmp6_seq_show_item’:
> net/ipv6/proc.c:198:2: warning: ISO C90 forbids array ‘buff’ whose
> size can’t be evaluated [-Wvla]
>   unsigned long buff[SNMP_MIB_MAX];
>   ^~~~

PITA.  Didn't we once have a different way of detecting VLAs?  Some
post-compilation asm parser, iirc.

I suppose the world wouldn't end if we had a gcc version ifdef in
kernel.h.  We'll get to remove it in, oh, ten years.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-12 Thread Andrew Morton
On Fri, 9 Mar 2018 17:30:15 -0800 Kees Cook  wrote:

> > It's one reason why I wondered if simplifying the expression to have
> > just that single __builtin_constant_p() might not end up working..
> 
> Yeah, it seems like it doesn't bail out as "false" for complex
> expressions given to __builtin_constant_p().
> 
> If no magic solution, then which of these?
> 
> - go back to my original "multi-eval max only for constants" macro (meh)
> - add gcc version checks around this and similarly for -Wvla in the future 
> (eww)
> - raise gcc version (yikes)

Replacing the __builtin_choose_expr() with ?: works of course.  What
will be the runtime effects?

I tried replacing

__builtin_choose_expr(__builtin_constant_p(x) &&
  __builtin_constant_p(y),

with

__builtin_choose_expr(__builtin_constant_p(x + y),

but no success.

I'm not sure what else to try to trick gcc into working.

--- 
a/include/linux/kernel.h~kernelh-skip-single-eval-logic-on-literals-in-min-max-v3-fix
+++ a/include/linux/kernel.h
@@ -804,13 +804,10 @@ static inline void ftrace_dump(enum ftra
  * accidental VLA.
  */
 #define __min(t1, t2, x, y)\
-   __builtin_choose_expr(__builtin_constant_p(x) &&\
- __builtin_constant_p(y),  \
- (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
- __single_eval_min(t1, t2, \
-   __UNIQUE_ID(min1_), \
-   __UNIQUE_ID(min2_), \
-   x, y))
+   ((__builtin_constant_p(x) && __builtin_constant_p(y)) ? \
+   ((t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y)) :   \
+   (__single_eval_min(t1, t2, __UNIQUE_ID(min1_),  \
+ __UNIQUE_ID(min2_), x, y)))
 
 /**
  * min - return minimum of two values of the same or compatible types
@@ -826,13 +823,11 @@ static inline void ftrace_dump(enum ftra
max1 > max2 ? max1 : max2; })
 
 #define __max(t1, t2, x, y)\
-   __builtin_choose_expr(__builtin_constant_p(x) &&\
- __builtin_constant_p(y),  \
- (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\
- __single_eval_max(t1, t2, \
-   __UNIQUE_ID(max1_), \
-   __UNIQUE_ID(max2_), \
-   x, y))
+   ((__builtin_constant_p(x) && __builtin_constant_p(y)) ? \
+   ((t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y)) :   \
+   (__single_eval_max(t1, t2, __UNIQUE_ID(max1_),  \
+ __UNIQUE_ID(max2_), x, y)))
+
 /**
  * max - return maximum of two values of the same or compatible types
  * @x: first value
_

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Andrew Morton
On Fri, 9 Mar 2018 16:28:51 -0800 Linus Torvalds 
 wrote:

> On Fri, Mar 9, 2018 at 4:07 PM, Andrew Morton  
> wrote:
> >
> > A brief poke failed to reveal a workaround - gcc-4.4.4 doesn't appear
> > to know that __builtin_constant_p(x) is a constant.  Or something.
> 
> LOL.
> 
> I suspect it might be that it wants to evaluate
> __builtin_choose_expr() at an earlier stage than it evaluates
> __builtin_constant_p(), so it's not that it doesn't know that
> __builtin_constant_p() is a constant, it just might not know it *yet*.
> 
> Maybe.
> 
> Side note, if it's not that, but just the "complex" expression that
> has the logical 'and' etc, maybe the code could just use
> 
>   __builtin_constant_p((x)+(y))
> 
> or something.

I'll do a bit more poking at it.

> But yeah:
> 
> > Sigh.  Wasn't there some talk about modernizing our toolchain
> > requirements?
> 
> Maybe it's just time to give up on 4.4.  We wanted 4.5 for "asm goto",
> and once we upgrade to 4.5 I think Arnd said that no distro actually
> ships it, so we might as well go to 4.6.
> 
> So maybe this is just the excuse to finally make that official, if
> there is no clever workaround any more.

I wonder which gcc versions actually accept Kees's addition.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Andrew Morton
On Fri, 9 Mar 2018 12:05:36 -0800 Kees Cook  wrote:

> When max() is used in stack array size calculations from literal values
> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
> thinks this is a dynamic calculation due to the single-eval logic, which
> is not needed in the literal case. This change removes several accidental
> stack VLAs from an x86 allmodconfig build:
> 
> $ diff -u before.txt after.txt | grep ^-
> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
> variable length array ‘ids’ [-Wvla]
> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length 
> array ‘namebuf’ [-Wvla]
> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
> [-Wvla]
> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array 
> ‘buff64’ [-Wvla]
> 
> Based on an earlier patch from Josh Poimboeuf.

v1, v2 and v3 of this patch all fail with gcc-4.4.4:

./include/linux/jiffies.h: In function 'jiffies_delta_to_clock_t':
./include/linux/jiffies.h:444: error: first argument to '__builtin_choose_expr' 
not a constant

That's with

#define __max(t1, t2, x, y) \
__builtin_choose_expr(__builtin_constant_p(x) &&\
  __builtin_constant_p(y) &&\
  __builtin_types_compatible_p(t1, t2), \
  (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\
  __single_eval_max(t1, t2, \
__UNIQUE_ID(max1_), \
__UNIQUE_ID(max2_), \
x, y))
/**
 * max - return maximum of two values of the same or compatible types
 * @x: first value
 * @y: second value
 */
#define max(x, y)   __max(typeof(x), typeof(y), x, y)


A brief poke failed to reveal a workaround - gcc-4.4.4 doesn't appear
to know that __builtin_constant_p(x) is a constant.  Or something.

Sigh.  Wasn't there some talk about modernizing our toolchain
requirements?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Andrew Morton
On Thu, 8 Mar 2018 13:40:45 -0800 Kees Cook  wrote:

> When max() is used in stack array size calculations from literal values
> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
> thinks this is a dynamic calculation due to the single-eval logic, which
> is not needed in the literal case. This change removes several accidental
> stack VLAs from an x86 allmodconfig build:
> 
> $ diff -u before.txt after.txt | grep ^-
> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
> variable length array ‘ids’ [-Wvla]
> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length 
> array ‘namebuf’ [-Wvla]
> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
> [-Wvla]
> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array 
> ‘buff64’ [-Wvla]
> 
> Based on an earlier patch from Josh Poimboeuf.
> 
> ...
>
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -787,37 +787,57 @@ static inline void ftrace_dump(enum ftrace_dump_mode 
> oops_dump_mode) { }
>   * strict type-checking.. See the
>   * "unnecessary" pointer comparison.
>   */
> -#define __min(t1, t2, min1, min2, x, y) ({   \
> +#define __single_eval_min(t1, t2, min1, min2, x, y) ({   \
>   t1 min1 = (x);  \
>   t2 min2 = (y);  \
>   (void) (&min1 == &min2);\
>   min1 < min2 ? min1 : min2; })
>  
> +/*
> + * In the case of builtin constant values, there is no need to do the
> + * double-evaluation protection, so the raw comparison can be made.
> + * This allows min()/max() to be used in stack array allocations and
> + * avoid the compiler thinking it is a dynamic value leading to an
> + * accidental VLA.
> + */
> +#define __min(t1, t2, x, y)  \
> + __builtin_choose_expr(__builtin_constant_p(x) &&\
> +   __builtin_constant_p(y) &&\
> +   __builtin_types_compatible_p(t1, t2), \
> +   (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
> +   __single_eval_min(t1, t2, \
> + __UNIQUE_ID(max1_), \
> + __UNIQUE_ID(max2_), \
> + x, y))
> +

Holy crap.

I suppose gcc will one day be fixed and we won't need this.

Is there a good reason to convert min()?  Surely nobody will be using
min to dimension an array - always max?  Just for symmetry, I guess.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Remove accidental VLA usage

2018-03-08 Thread Andrew Morton
On Thu, 8 Mar 2018 09:02:36 -0600 Josh Poimboeuf  wrote:

> On Wed, Mar 07, 2018 at 07:30:44PM -0800, Kees Cook wrote:
> > This series adds SIMPLE_MAX() to be used in places where a stack array
> > is actually fixed, but the compiler still warns about VLA usage due to
> > confusion caused by the safety checks in the max() macro.
> > 
> > I'm sending these via -mm since that's where I've introduced SIMPLE_MAX(),
> > and they should all have no operational differences.
> 
> What if we instead simplify the max() macro's type checking so that GCC
> can more easily fold the array size constants?  The below patch seems to
> work:
> 
> -/*
> - * min()/max()/clamp() macros that also do
> - * strict type-checking.. See the
> - * "unnecessary" pointer comparison.
> - */
> -#define __min(t1, t2, min1, min2, x, y) ({   \
> - t1 min1 = (x);  \
> - t2 min2 = (y);  \
> - (void) (&min1 == &min2);\
> - min1 < min2 ? min1 : min2; })
> +extern long __error_incompatible_types_in_min_macro;
> +extern long __error_incompatible_types_in_max_macro;
> +
> +#define __min(t1, t2, x, y)  \
> + __builtin_choose_expr(__builtin_types_compatible_p(t1, t2), \
> +   (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
> +   (t1)__error_incompatible_types_in_min_macro)

This will move the error detection from compile-time to link-time. 
That's tolerable I guess, but a bit sad and should be flagged in the
changelog at least.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/10] remove mapping from balance_dirty_pages*()

2017-11-21 Thread Andrew Morton
On Tue, 14 Nov 2017 16:56:47 -0500 Josef Bacik  wrote:

> From: Josef Bacik 
> 
> The only reason we pass in the mapping is to get the inode in order to see if
> writeback cgroups is enabled, and even then it only checks the bdi and a super
> block flag.  balance_dirty_pages() doesn't even use the mapping.  Since
> balance_dirty_pages*() works on a bdi level, just pass in the bdi and super
> block directly so we can avoid using mapping.  This will allow us to still use
> balance_dirty_pages for dirty metadata pages that are not backed by an
> address_mapping.
>
> ...
> 
> @@ -71,7 +72,8 @@ static int _block2mtd_erase(struct block2mtd_dev *dev, 
> loff_t to, size_t len)
>   memset(page_address(page), 0xff, PAGE_SIZE);
>   set_page_dirty(page);
>   unlock_page(page);
> - balance_dirty_pages_ratelimited(mapping);
> + 
> balance_dirty_pages_ratelimited(inode_to_bdi(inode),
> + inode->i_sb);
>   break;
>   }

So we do a bunch more work in each caller and we pass two args rather
than one.  That doesn't make things better!

I see that this is enablement for "dirty metadata pages that are not
backed by an address_mapping" (address_space) so I look into [7/10] and
the changelog doesn't tell me much.

So color me confused.  What is this patchset actually *for*?  Is there
some filesystem which has non-address_space-backed metadata?  Or will
there be so soon?  Or what.

I think we need a [0/n] email please.  One which fully describes the
intent of the patchset.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7] mm: introduce memalloc_nofs_{save,restore} API

2017-03-06 Thread Andrew Morton
On Mon,  6 Mar 2017 14:14:05 +0100 Michal Hocko  wrote:

> From: Michal Hocko 
> 
> GFP_NOFS context is used for the following 5 reasons currently
>   - to prevent from deadlocks when the lock held by the allocation
> context would be needed during the memory reclaim
>   - to prevent from stack overflows during the reclaim because
> the allocation is performed from a deep context already
>   - to prevent lockups when the allocation context depends on
> other reclaimers to make a forward progress indirectly
>   - just in case because this would be safe from the fs POV
>   - silence lockdep false positives
> 
> Unfortunately overuse of this allocation context brings some problems
> to the MM. Memory reclaim is much weaker (especially during heavy FS
> metadata workloads), OOM killer cannot be invoked because the MM layer
> doesn't have enough information about how much memory is freeable by the
> FS layer.
> 
> In many cases it is far from clear why the weaker context is even used
> and so it might be used unnecessarily. We would like to get rid of
> those as much as possible. One way to do that is to use the flag in
> scopes rather than isolated cases. Such a scope is declared when really
> necessary, tracked per task and all the allocation requests from within
> the context will simply inherit the GFP_NOFS semantic.
> 
> Not only this is easier to understand and maintain because there are
> much less problematic contexts than specific allocation requests, this
> also helps code paths where FS layer interacts with other layers (e.g.
> crypto, security modules, MM etc...) and there is no easy way to convey
> the allocation context between the layers.
> 
> Introduce memalloc_nofs_{save,restore} API to control the scope
> of GFP_NOFS allocation context. This is basically copying
> memalloc_noio_{save,restore} API we have for other restricted allocation
> context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
> just an alias for PF_FSTRANS which has been xfs specific until recently.
> There are no more PF_FSTRANS users anymore so let's just drop it.
> 
> PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
> implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
> is renamed to current_gfp_context because it now cares about both
> PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
> their semantic. kmem_flags_convert() doesn't need to evaluate the flag
> anymore.
> 
> This patch shouldn't introduce any functional changes.
> 
> Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
> usage as much as possible and only use a properly documented
> memalloc_nofs_{save,restore} checkpoints where they are appropriate.
> 
> 
>
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -210,8 +210,16 @@ struct vm_area_struct;
>   *
>   * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
>   *   that do not require the starting of any physical IO.
> + *   Please try to avoid using this flag directly and instead use
> + *   memalloc_noio_{save,restore} to mark the whole scope which cannot
> + *   perform any IO with a short explanation why. All allocation requests
> + *   will inherit GFP_NOIO implicitly.
>   *
>   * GFP_NOFS will use direct reclaim but will not use any filesystem 
> interfaces.
> + *   Please try to avoid using this flag directly and instead use
> + *   memalloc_nofs_{save,restore} to mark the whole scope which 
> cannot/shouldn't
> + *   recurse into the FS layer with a short explanation why. All allocation
> + *   requests will inherit GFP_NOFS implicitly.

I wonder if these are worth a checkpatch rule.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] lib: add size unit t/p/e to memparse

2014-06-12 Thread Andrew Morton
On Wed, 2 Apr 2014 16:54:37 +0800 Gui Hecheng  wrote:

> For modern filesystems such as btrfs, t/p/e size level operations
> are common.
> add size unit t/p/e parsing to memparse
> 
> Signed-off-by: Gui Hecheng 
> ---
> changelog
>   v1->v2: replace kilobyte with kibibyte, and others
>   v2->v3: add missing unit "bytes" in comment
> ---
>  lib/cmdline.c | 25 -
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/cmdline.c b/lib/cmdline.c
> index eb67911..511b9be 100644
> --- a/lib/cmdline.c
> +++ b/lib/cmdline.c
> @@ -119,11 +119,17 @@ char *get_options(const char *str, int nints, int *ints)
>   *   @retptr: (output) Optional pointer to next char after parse completes
>   *
>   *   Parses a string into a number.  The number stored at @ptr is
> - *   potentially suffixed with %K (for kilobytes, or 1024 bytes),
> - *   %M (for megabytes, or 1048576 bytes), or %G (for gigabytes, or
> - *   1073741824).  If the number is suffixed with K, M, or G, then
> - *   the return value is the number multiplied by one kilobyte, one
> - *   megabyte, or one gigabyte, respectively.
> + *   potentially suffixed with
> + *   %K (for kibibytes, or 1024 bytes),
> + *   %M (for mebibytes, or 1048576 bytes),
> + *   %G (for gibibytes, or 1073741824 bytes),
> + *   %T (for tebibytes, or 1099511627776 bytes),
> + *   %P (for pebibytes, or 1125899906842624 bytes),
> + *   %E (for exbibytes, or 1152921504606846976 bytes).

I'm afraid I find these names quite idiotic - we all know what the
traditional terms mean so why go and muck with it.

Also, kibibytes sounds like cat food.

> @@ -133,6 +139,15 @@ unsigned long long memparse(const char *ptr, char 
> **retptr)
>   unsigned long long ret = simple_strtoull(ptr, &endptr, 0);
>  
>   switch (*endptr) {
> + case 'E':
> + case 'e':
> + ret <<= 10;
> + case 'P':
> + case 'p':
> + ret <<= 10;
> + case 'T':
> + case 't':
> + ret <<= 10;
>   case 'G':
>   case 'g':
>   ret <<= 10;

That bit makes sense.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] rwsem: add rwsem_is_contended

2013-09-16 Thread Andrew Morton
On Fri, 30 Aug 2013 10:14:01 -0400 Josef Bacik  wrote:

> Btrfs uses an rwsem to control access to its extent tree.  Threads will hold a
> read lock on this rwsem while they scan the extent tree, and if need_resched()
> they will drop the lock and schedule.  The transaction commit needs to take a
> write lock for this rwsem for a very short period to switch out the commit
> roots.  If there are a lot of threads doing this caching operation we can 
> starve
> out the committers which slows everybody out.  To address this we want to add
> this functionality to see if our rwsem has anybody waiting to take a write 
> lock
> so we can drop it and schedule for a bit to allow the commit to continue.
> Thanks,
> 

This sounds rather nasty and hacky.  Rather then working around a
locking shortcoming in a caller it would be better to fix/enhance the
core locking code.  What would such a change need to do?

Presently rwsem waiters are fifo-queued, are they not?  So the commit
thread will eventually get that lock.  Apparently that's not working
adequately for you but I don't fully understand what it is about these
dynamics which is causing observable problems.

> I've cc'ed people who seemed like they may be in charge/familiar with this 
> code,
> hopefully I got the right people.
> 
>  include/linux/rwsem.h |1 +
>  lib/rwsem.c   |   17 +

This will break CONFIG_RWSEM_GENERIC_SPINLOCK=n?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] fs: remove obsolete simple_strto

2013-01-17 Thread Andrew Morton
On Fri,  7 Dec 2012 17:25:19 +0530
Abhijit Pawar  wrote:

> This patch replace the obsolete simple_strto with kstrto
> 

The XFS part (or something like it) has been applied.

>
> ...
>
> --- a/fs/9p/v9fs.c
> +++ b/fs/9p/v9fs.c
> @@ -112,7 +112,7 @@ static int v9fs_parse_options(struct v9fs_session_info 
> *v9ses, char *opts)
>   substring_t args[MAX_OPT_ARGS];
>   char *p;
>   int option = 0;
> - char *s, *e;
> + char *s;
>   int ret = 0;
>  
>   /* setup defaults */
> @@ -249,8 +249,8 @@ static int v9fs_parse_options(struct v9fs_session_info 
> *v9ses, char *opts)
>   v9ses->flags |= V9FS_ACCESS_CLIENT;
>   } else {
>   v9ses->flags |= V9FS_ACCESS_SINGLE;
> - v9ses->uid = simple_strtoul(s, &e, 10);
> - if (*e != '\0') {
> + ret = kstrtouint(s, 10, &v9ses->uid);
> + if (ret) {
>   ret = -EINVAL;
>   pr_info("Unknown access argument %s\n",
>   s);

Here we should propagate the kstrtouint() errno back to the caller
rather than overwriting it with EINVAL.

> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 5b3429a..95d9e09 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1335,7 +1335,11 @@ static noinline int btrfs_ioctl_resize(struct 
> btrfs_root *root,
>   sizestr = devstr + 1;
>   *devstr = '\0';
>   devstr = vol_args->name;
> - devid = simple_strtoull(devstr, &end, 10);
> + ret = kstrtoull(devstr, 10, &devid);
> + if (ret) {
> + ret = -EINVAL;
> + goto out_free;
> + }

Propagate the kstrtoull errno back to the caller.

>   printk(KERN_INFO "btrfs: resizing devid %llu\n",
>  (unsigned long long)devid);
>   }
>
> ...
>
> @@ -609,8 +610,9 @@ static ssize_t cifs_security_flags_proc_write(struct file 
> *file,
>   }
>   /* else we have a number */
>  
> - flags = simple_strtoul(flags_string, NULL, 0);
> -
> + rc = kstrtouint(flags_string, 0, &flags);
> + if (rc)
> + return -EINVAL;

Here we should propagate the return value.

But if this error path is taken, we might already have altered
global_secflags.  Perhaps that change should be undone.  Or, better,
check the string before starting to change state.

>   cFYI(1, "sec flags 0x%x", flags);
>  
>   if (flags <= 0)  {
> --- a/fs/dlm/config.c
> +++ b/fs/dlm/config.c
> @@ -156,11 +156,14 @@ static ssize_t cluster_set(struct dlm_cluster *cl, 
> unsigned int *cl_field,
>  const char *buf, size_t len)
>  {
>   unsigned int x;
> + int rc;
>  
>   if (!capable(CAP_SYS_ADMIN))
>   return -EPERM;
>  
> - x = simple_strtoul(buf, NULL, 0);
> + rc = kstrtouint(buf, 0, &x);
> + if (rc)
> + return -EINVAL;

Propagate it back.

>   if (check_zero && !x)
>   return -EINVAL;
> @@ -729,7 +732,10 @@ static ssize_t comm_nodeid_read(struct dlm_comm *cm, 
> char *buf)
>  static ssize_t comm_nodeid_write(struct dlm_comm *cm, const char *buf,
>size_t len)
>  {
> - cm->nodeid = simple_strtol(buf, NULL, 0);
> + int rc;
> + rc = kstrtoint(buf, 0, &cm->nodeid);
> + if (rc)
> + return -EINVAL;

Ditto

>   return len;
>  }
>  
> @@ -741,7 +747,10 @@ static ssize_t comm_local_read(struct dlm_comm *cm, char 
> *buf)
>  static ssize_t comm_local_write(struct dlm_comm *cm, const char *buf,
>   size_t len)
>  {
> - cm->local= simple_strtol(buf, NULL, 0);
> + int rc;
> + rc = kstrtoint(buf, 0, &cm->local);
> + if (rc)
> + return -EINVAL;

Ditto

>   if (cm->local && !local_comm)
>   local_comm = cm;
>   return len;
> @@ -845,7 +854,10 @@ static ssize_t node_nodeid_write(struct dlm_node *nd, 
> const char *buf,
>size_t len)
>  {
>   uint32_t seq = 0;
> - nd->nodeid = simple_strtol(buf, NULL, 0);
> + int rc;
> + rc = kstrtoint(buf, 0, &nd->nodeid);
> + if (rc)
> + return -EINVAL;

Ditto

>   dlm_comm_seq(nd->nodeid, &seq);
>   nd->comm_seq = seq;
>   return len;
> @@ -859,7 +871,10 @@ static ssize_t node_weight_read(struct dlm_node *nd, 
> char *buf)
>  static ssize_t node_weight_write(struct dlm_node *nd, const char *buf,
>size_t len)
>  {
> - nd->weight = simple_strtol(buf, NULL, 0);
> + int rc;
> + rc = kstrtoint(buf, 0, &nd->weight);
> + if (rc)
> + return -EINVAL;

Ditto

>   return len;
>  }
>  
> diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c
> index 2e99fb0..e83abfb 100644
> -

Re: [PATCH 0/3] Update LZO compression

2012-10-09 Thread Andrew Morton
On Sun, 7 Oct 2012 17:07:55 +0200
"Markus F.X.J. Oberhumer"  wrote:

> As requested by akpm I am sending my "lzo-update" branch at
> 
>   git://github.com/markus-oberhumer/linux.git lzo-update
> 
> to lkml as a patch series created by "git format-patch -M v3.5..lzo-update".
> 
> You can also browse the branch at
> 
>   https://github.com/markus-oberhumer/linux/compare/lzo-update
> 
> and review the three patches at
> 
>   
> https://github.com/markus-oberhumer/linux/commit/7c979cebc0f93dc692b734c12665a6824d219c20
>   
> https://github.com/markus-oberhumer/linux/commit/10f6781c8591fe5fe4c8c733131915e5ae057826
>   
> https://github.com/markus-oberhumer/linux/commit/5f702781f158cb59075cfa97e5c21f52275057f1

The changes look OK to me.  Please ask Stephen to include the tree in
linux-next, for a 3.7 merge.



The changelog for patch 2/3 says:

: This commit updates the kernel LZO code to the current upsteam version
: which features a significant speed improvement - benchmarking the Calgary
: and Silesia test corpora typically shows a doubled performance in
: both compression and decompression on modern i386/x86_64/powerpc machines.


There are significant clients of the LZO library - crypto, btrfs,
jffs2, ubifs, squashfs and zcache.  So let's give all those people a cc
and ask that they test the LZO changes once they land in linux-next. 
For correctness and performance, please.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL v2] Update LZO compression

2012-10-03 Thread Andrew Morton
On Wed, 3 Oct 2012 23:19:04 +0200
Andi Kleen  wrote:

> > No, lib/lzo has no identifiable maintainer.  I suggest you proceed as
> > follows:
> > 
> > - Post the entire patch series to lkml for review (I'd like a cc please)
> 
> Already happened, multiple people reviewed and tested.

um, I would not consider "Looks ok to me from a quick look." and "I
couldn't tell from the github view, but I assume you follow standard
coding style." to indicate a rigorous code review!

That's the problem with the git presentation: hardly anyone reads the
patches and there is no patch for a reviewer to reply to.

So please send the patches out for review.  One at a time, via email.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL v2] Update LZO compression

2012-10-03 Thread Andrew Morton
On Wed, 3 Oct 2012 12:48:46 +0200
richard -rw- weinberger  wrote:

> CC'in akpm.

Thanks.

> >>> Hi all,
> >>>
> >>> I finally have prepared a small package that updates the LZO version
> >>> in the Linux kernel. Please get it from:
> >>>
> >>> http://www.oberhumer.com/opensource/lzo/download/Testing/linux-kernel-lzo-20120716.tar.gz
> >>>
> >>> As stated in the README this version is significantly faster (typically 
> >>> more
> >>> than 2 times faster!) than the current version, has been thoroughly 
> >>> tested on
> >>> x86_64/i386/powerpc platforms and is intended to get included into the
> >>> official Linux 3.6 or 3.7 release.
> >>>
> >>> I encourage all compression users to test and benchmark this new version,
> >>> and I also would ask some official LZO maintainer to convert the updated
> >>> source files into a GIT commit and possibly push it to Linus or 
> >>> linux-next.

No, lib/lzo has no identifiable maintainer.  I suggest you proceed as
follows:

- Post the entire patch series to lkml for review (I'd like a cc please)

- After that process has played out, ask Stephen to add this git tree
  to linux-next.

- After that process has played out, ask Linus to pull the tree
  during a merge window.

I haven't actually looked at the patches yet, but if they are as
extensive as they sound, it would be appropriate for you become the
formal maintainer of lib/lzo.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH, RESEND] fs: push rcu_barrier() from deactivate_locked_super() to filesystems

2012-06-09 Thread Andrew Morton
On Sat, 09 Jun 2012 09:06:28 +0200 Marco Stornelli  
wrote:

> Il 09/06/2012 02:28, Andrew Morton ha scritto:
> > On Fri, 8 Jun 2012 16:46:47 -0700 Linus 
> > Torvalds  wrote:
> >
> >> Of course, if you just mean having a VFS wrapper that does
> >>
> >>  static void vfs_inode_kmem_cache_destroy(struct kmem_cache *cachep)
> >>  {
> >>  rcu_barrier();
> >>  kmem_cache_destroy(cachep);
> >>  }
> >>
> >> then we could do that. Not much better than what Kirill's patch did,
> >> but at least we could have that comment in just one single place.
> >
> > That's conceptually what I meant.  But it has the problem that new and
> > out-of-tree filesystems might forget to do it.  Which is why I suggest
> > adding a kmem_cache* argument to unregister_filesystem() for this.
> >
> > It's a bit awkward, and the fs can pass in NULL if it knows what it's
> > doing.  But it's reliable.
> > --
> 
> The call of rcu_barrier should be mandatory for the "unload fs module" 
> problem, right? If the fs is compiled statically maybe we could avoid 
> it, but (eventually) this kind of decision is per-fs, so this could be a 
> clue that the call of rcu_barrier maybe is inside each fs not in VFS.
> 

No, this is unrelated to module unloading and the problem affects
statically linked filesystems also.  The requirement is that all inodes
which are pending rcu freeing be flushed (and freed) before their cache
is destroyed in kmem_cache_destroy().

And...  it seems that I misread what's going on.  The individual
filesystems are doing the rcu freeing of their inodes, so it is
appropriate that they also call rcu_barrier() prior to running
kmem_cache_free().  Which is what Kirill's patch does.  oops.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH, RESEND] fs: push rcu_barrier() from deactivate_locked_super() to filesystems

2012-06-08 Thread Andrew Morton
On Fri, 8 Jun 2012 16:46:47 -0700 Linus Torvalds 
 wrote:

> Of course, if you just mean having a VFS wrapper that does
> 
> static void vfs_inode_kmem_cache_destroy(struct kmem_cache *cachep)
> {
> rcu_barrier();
> kmem_cache_destroy(cachep);
> }
> 
> then we could do that. Not much better than what Kirill's patch did,
> but at least we could have that comment in just one single place.

That's conceptually what I meant.  But it has the problem that new and
out-of-tree filesystems might forget to do it.  Which is why I suggest
adding a kmem_cache* argument to unregister_filesystem() for this.

It's a bit awkward, and the fs can pass in NULL if it knows what it's
doing.  But it's reliable.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH, RESEND] fs: push rcu_barrier() from deactivate_locked_super() to filesystems

2012-06-08 Thread Andrew Morton
On Sat, 9 Jun 2012 02:31:27 +0300
"Kirill A. Shutemov"  wrote:

> On Fri, Jun 08, 2012 at 03:31:20PM -0700, Andrew Morton wrote:
> > On Fri, 8 Jun 2012 23:27:34 +0100
> > Al Viro  wrote:
> > 
> > > On Fri, Jun 08, 2012 at 03:25:50PM -0700, Andrew Morton wrote:
> > > 
> > > > A neater implementation might be to add a kmem_cache* argument to
> > > > unregister_filesystem().  If that is non-NULL, unregister_filesystem()
> > > > does the rcu_barrier() and destroys the cache.  That way we get to
> > > > delete (rather than add) a bunch of code from all filesystems and new
> > > > and out-of-tree filesystems cannot forget to perform the rcu_barrier().
> > > 
> > > There's often enough more than one cache, so that one is no-go.
> > 
> > kmem_cache** ;)
> > 
> > Which filesystems have multiple inode caches?
> 
> Multiple inode caches? No.
> Multiple caches with call_rcu() free? See btrfs or gfs2.

OK.  But for those non-inode caches, the rcu treatment is private to
the filesystem.  Hence it is appropriate that the filesystem call
rcu_barrier() for those caches.

But in the case of the inode caches, the rcu treatment is a vfs thing,
so it is the vfs which should perform the rcu_barrier().

This is a red herring - those non-inode caches have nothing to do with
the issue we're dicussing.


So how about open-coding the rcu_barrier() in btrfs and gfs2 for the
non-inode caches (which is the appropriate place), and hand the inode
cache over to the vfs for treatment (which is the appropriate place).

The downside is that btrfs and gfs2 will do an extra rcu_barrier() at
umount time.  Shrug.

If they really want to super-optimise that, they can skip the private
rcu_barrier() call and assume that the vfs will be doing it.  Not a
good idea, IMO.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH, RESEND] fs: push rcu_barrier() from deactivate_locked_super() to filesystems

2012-06-08 Thread Andrew Morton
On Fri, 8 Jun 2012 23:27:34 +0100
Al Viro  wrote:

> On Fri, Jun 08, 2012 at 03:25:50PM -0700, Andrew Morton wrote:
> 
> > A neater implementation might be to add a kmem_cache* argument to
> > unregister_filesystem().  If that is non-NULL, unregister_filesystem()
> > does the rcu_barrier() and destroys the cache.  That way we get to
> > delete (rather than add) a bunch of code from all filesystems and new
> > and out-of-tree filesystems cannot forget to perform the rcu_barrier().
> 
> There's often enough more than one cache, so that one is no-go.

kmem_cache** ;)

Which filesystems have multiple inode caches?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH, RESEND] fs: push rcu_barrier() from deactivate_locked_super() to filesystems

2012-06-08 Thread Andrew Morton
On Sat, 9 Jun 2012 01:14:46 +0300
"Kirill A. Shutemov"  wrote:

> On Fri, Jun 08, 2012 at 03:02:53PM -0700, Andrew Morton wrote:
> > On Sat,  9 Jun 2012 00:41:03 +0300
> > "Kirill A. Shutemov"  wrote:
> > 
> > > There's no reason to call rcu_barrier() on every 
> > > deactivate_locked_super().
> > > We only need to make sure that all delayed rcu free inodes are flushed
> > > before we destroy related cache.
> > > 
> > > Removing rcu_barrier() from deactivate_locked_super() affects some
> > > fas paths. E.g. on my machine exit_group() of a last process in IPC
> > > namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.
> > 
> > What an unpleasant patch.  Is final-process-exiting-ipc-namespace a
> > sufficiently high-frequency operation to justify the change?

This, please.

> > I don't really understand what's going on here.  Are you saying that
> > there is some filesystem against which we run deactivate_locked_super()
> > during exit_group(), and that this filesystem doesn't use rcu-freeing
> > of inodes?  The description needs this level of detail, please.

You still haven't explained where this deactivate_locked_super() call
is coming from.  Oh well.

> I think the rcu_barrier() is in wrong place. We need it to safely destroy
> inode cache. deactivate_locked_super() is part of umount() path, but all
> filesystems I've checked have inode cache for whole filesystem, not
> per-mount.

Well from a design perspective, putting the rcu_barrier() in the vfs is
the *correct* place.  Individual filesystems shouldn't be hard-coding
knowledge about vfs internal machinery.

A neater implementation might be to add a kmem_cache* argument to
unregister_filesystem().  If that is non-NULL, unregister_filesystem()
does the rcu_barrier() and destroys the cache.  That way we get to
delete (rather than add) a bunch of code from all filesystems and new
and out-of-tree filesystems cannot forget to perform the rcu_barrier().

> > The implementation would be less unpleasant if we could do the
> > rcu_barrier() in kmem_cache_destroy().  I can't see a way of doing that
> > without adding a dedicated slab flag, which would require editing all
> > the filesystems anyway.
> 
> I think rcu_barrier() for all kmem_cache_destroy() would be too expensive.

That is not what I proposed.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH, RESEND] fs: push rcu_barrier() from deactivate_locked_super() to filesystems

2012-06-08 Thread Andrew Morton
On Sat,  9 Jun 2012 00:41:03 +0300
"Kirill A. Shutemov"  wrote:

> There's no reason to call rcu_barrier() on every deactivate_locked_super().
> We only need to make sure that all delayed rcu free inodes are flushed
> before we destroy related cache.
> 
> Removing rcu_barrier() from deactivate_locked_super() affects some
> fas paths. E.g. on my machine exit_group() of a last process in IPC
> namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

What an unpleasant patch.  Is final-process-exiting-ipc-namespace a
sufficiently high-frequency operation to justify the change?

I don't really understand what's going on here.  Are you saying that
there is some filesystem against which we run deactivate_locked_super()
during exit_group(), and that this filesystem doesn't use rcu-freeing
of inodes?  The description needs this level of detail, please.


The implementation would be less unpleasant if we could do the
rcu_barrier() in kmem_cache_destroy().  I can't see a way of doing that
without adding a dedicated slab flag, which would require editing all
the filesystems anyway.


(kmem_cache_destroy() already has an rcu_barrier().  Can we do away
with the private rcu games in the vfs and switch to
SLAB_DESTROY_BY_RCU?)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] Rework KERN_

2012-06-05 Thread Andrew Morton
On Tue, 05 Jun 2012 17:40:05 -0700 Joe Perches  wrote:

> On Tue, 2012-06-05 at 17:37 -0700, Andrew Morton wrote:
> > On Tue, 05 Jun 2012 17:07:27 -0700 Joe Perches  wrote:
> > 
> > > On Tue, 2012-06-05 at 16:58 -0700, Andrew Morton wrote:
> > > >  echo "\0014Hello Joe" > /dev/kmsg
> > > 
> > > # echo -e "\x014Hello Me" > /dev/kmsg
> > > gives:
> > > 12,778,4057982669;Hello Me
> > 
> > That's changed behavior.
> 
> Which is an improvement too.

No it isn't.  It exposes internal kernel implementation details in
random weird inexplicable ways.  It doesn't seem at all important
though.

> I very much doubt a single app will change
> because of this.

I doubt it as well.

> > printk_emit() does parse the leading \0014, and then skips over it,
> > removing it from the output stream.  printk_emit() then throws away the
> > resulting level because devkmsg_writev() did not pass in level==-1.
> 
> I'm glad you know how it works now.

It's nice to see you learning about it as well.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] Rework KERN_

2012-06-05 Thread Andrew Morton
On Tue, 05 Jun 2012 17:07:27 -0700 Joe Perches  wrote:

> On Tue, 2012-06-05 at 16:58 -0700, Andrew Morton wrote:
> >  echo "\0014Hello Joe" > /dev/kmsg
> 
> # echo -e "\x014Hello Me" > /dev/kmsg
> gives:
> 12,778,4057982669;Hello Me

That's changed behavior.

On Wed, 6 Jun 2012 02:28:39 +0200 Kay Sievers  wrote:

> Yeah, but printk_emit() will not try to parse it? I did not check, but
> with your change, the prefix parsing in printk_emit() is still skipped
> if a level is given as a parameter to printk_emit(), right?

printk_emit() does parse the leading \0014, and then skips over it,
removing it from the output stream.  printk_emit() then throws away the
resulting level because devkmsg_writev() did not pass in level==-1.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] Rework KERN_

2012-06-05 Thread Andrew Morton
On Tue, 05 Jun 2012 16:52:25 -0700
Joe Perches  wrote:

> On Wed, 2012-06-06 at 01:48 +0200, Kay Sievers wrote:
> > On Wed, Jun 6, 2012 at 1:43 AM, Joe Perches  wrote:
> > > On Wed, 2012-06-06 at 01:39 +0200, Kay Sievers wrote:
> > 
> > >> > # echo "\001Hello Andrew" > /dev/kmsg
> > >> > /dev/kmsg has
> > >> > 12,774,2462339252;\001Hello Andrew
> > >>
> > >> Try "echo -e"? The stuff is copied verbatim otherwise.
> > >
> > > # echo -e "\001Hello Kay" > /dev/kmsg
> > > gives
> > > 12,776,3046752764;\x01Hello Kay
> > 
> > Don't you need two bytes to trigger the logic?
> 
> Yes.  Angle brackets fore and aft.

he means

echo "\0014Hello Joe" > /dev/kmsg
  ^

It needs one of [0-7cd] to trigger the prefix handling.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] Rework KERN_

2012-06-05 Thread Andrew Morton
On Tue, 05 Jun 2012 15:49:32 -0700
Joe Perches  wrote:

> On Tue, 2012-06-05 at 15:17 -0700, Andrew Morton wrote:
> > On Tue, 05 Jun 2012 15:11:43 -0700
> > Joe Perches  wrote:
> > 
> > > On Tue, 2012-06-05 at 14:28 -0700, Andrew Morton wrote:
> > > > Unfortunately the  thing is part of the kernel ABI:
> > > > 
> > > > echo "<4>foo" > /dev/kmsg
> > > 
> > > Which works the same way it did before.
> > 
> > I didn't say it didn't.
> > 
> > What I did say is that echo "\0014">/dev/kmsg will subvert the intent
> > of the new logging code.  Or might.  But you just ignored all that,
> > forcing me to repeat myself, irritatedly.
> 
> It works the same way before and after the patch.
> 
> Any write to /dev/kmsg without a KERN_
> emits at (1 << 3) + KERN_DEFAULT.
> 
> Writes with  values >= 8 are emitted at
> that level.

What about writes starting with \001n?  AFACIT, that will be stripped
away and the printk level will be altered.  This is new behavior.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] Rework KERN_

2012-06-05 Thread Andrew Morton
On Wed, 06 Jun 2012 00:55:00 +0200
Kay Sievers  wrote:

> On Tue, 2012-06-05 at 14:28 -0700, Andrew Morton wrote:
> 
> > devkmsg_writev() does weird and wonderful things with
> > facilities/levels.  That function incorrectly returns "success" when
> > copy_from_user() faults, btw.
> 
> Oh. Better?
> 
> Thanks,
> Kay
> 
> 
> From: Kay Sievers 
> Subject: kmsg: /dev/kmsg - properly return possible copy_from_user() failure
> 
> Reported-By: Andrew Morton 
> Signed-off-by: Kay Sievers  ---
>  printk.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/printk.c b/kernel/printk.c
> index 32462d2..6bdacab 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -365,8 +365,10 @@ static ssize_t devkmsg_writev(struct kiocb *iocb, const 
> struct iovec *iv,
>  
>   line = buf;
>   for (i = 0; i < count; i++) {
> - if (copy_from_user(line, iv[i].iov_base, iv[i].iov_len))
> + if (copy_from_user(line, iv[i].iov_base, iv[i].iov_len)) {
> + ret = -EFAULT;
>   goto out;
> + }
>   line += iv[i].iov_len;
>   }

Strictly speaking, when write() encounters an error it should return
number-of-bytes-written, or an errno if it wrote zero bytes.  So
something like

--- a/kernel/printk.c~a
+++ a/kernel/printk.c
@@ -355,7 +355,7 @@ static ssize_t devkmsg_writev(struct kio
int level = default_message_loglevel;
int facility = 1;   /* LOG_USER */
size_t len = iov_length(iv, count);
-   ssize_t ret = len;
+   ssize_t ret;
 
if (len > LOG_LINE_MAX)
return -EINVAL;
@@ -365,8 +365,12 @@ static ssize_t devkmsg_writev(struct kio
 
line = buf;
for (i = 0; i < count; i++) {
-   if (copy_from_user(line, iv[i].iov_base, iv[i].iov_len))
+   if (copy_from_user(line, iv[i].iov_base, iv[i].iov_len)) {
+   ret = line - buf;
+   if (!ret)
+   ret = -EFAULT;
goto out;
+   }
line += iv[i].iov_len;
}
 
@@ -396,6 +400,7 @@ static ssize_t devkmsg_writev(struct kio
line[len] = '\0';
 
printk_emit(facility, level, NULL, 0, "%s", line);
+   ret = 0;
 out:
kfree(buf);
return ret;
_

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] Rework KERN_

2012-06-05 Thread Andrew Morton
On Tue, 05 Jun 2012 15:11:43 -0700
Joe Perches  wrote:

> On Tue, 2012-06-05 at 14:28 -0700, Andrew Morton wrote:
> > Unfortunately the  thing is part of the kernel ABI:
> > 
> > echo "<4>foo" > /dev/kmsg
> 
> Which works the same way it did before.

I didn't say it didn't.

What I did say is that echo "\0014">/dev/kmsg will subvert the intent
of the new logging code.  Or might.  But you just ignored all that,
forcing me to repeat myself, irritatedly.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] Rework KERN_

2012-06-05 Thread Andrew Morton
On Tue,  5 Jun 2012 02:46:29 -0700
Joe Perches  wrote:

> KERN_ currently takes up 3 bytes.
> Shrink the kernel size by using an ASCII SOH and then the level byte.
> Remove the need for KERN_CONT.
> Convert directly embedded uses of <.> to KERN_

What an epic patchset.  I guess that saving a byte per printk does make
the world a better place, and forcibly ensuring that nothing is
dependent upon the internal format of the KERN_foo strings is nice.


Unfortunately the  thing is part of the kernel ABI:

echo "<4>foo" > /dev/kmsg

devkmsg_writev() does weird and wonderful things with
facilities/levels.  That function incorrectly returns "success" when
copy_from_user() faults, btw.  It also babbles on about LOG_USER and
LOG_KERN without ever defining these things.  I guess they're
userspace-only concepts and are hardwired to 0 and 1 in the kernel.  Or
not.

So what to do about /dev/kmsg?  I'd say "nothing": we retain "" as
the externally-presented kernel format for a facility level, and the
fact that the kernel internally uses a different encoding is hidden
from userspace.

And if the user does

echo "\0014foo" > /dev/kmsg

then I guess we should pass it straight through, retaining the \0014. 
But from my reading of your code, this doesn't work - vprintk_emit()
will go ahead and strip and interpret the \0014, evading the stuff
which devkmsg_writev() did.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] treewide: fix memory corruptions when TASK_COMM_LEN != 16

2012-02-23 Thread Andrew Morton
On Thu, 23 Feb 2012 12:19:28 +0100 (CET) Jan Engelhardt  
wrote:

> 
> On Thursday 2012-02-23 10:57, Andrew Morton wrote:
> >>
> But there's more,
> >> 
> >> 24931 ?S  0:00  \_ [btrfs-endio-met]
>\_ [kconservative/5]
>\_ [ext4-dio-unwrit]
> >> 
> >> [with a wondersome patch:] $ grep Name /proc/{29431,29432}/stat*
> >> /proc/29431/status:Name: btrfs-endio-meta-1
> >> /proc/29432/status:Name: btrfs-endio-meta-write-1
>   Name: kconservative/512
>   Name: ext4-dio-unwritten
> >
> >doh.  The fix for that is to have less clueless btrfs developers.
> 
> And truncate their names to SUNWbtfs, ORCLintg and EXT4diou?
> I think not :)

Teach ps(1) to look in /proc/pid/status for kernel threads?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] treewide: fix memory corruptions when TASK_COMM_LEN != 16

2012-02-23 Thread Andrew Morton
On Thu, 23 Feb 2012 10:09:33 +0100 (CET) Jan Engelhardt  
wrote:

> On Wednesday 2012-02-22 21:58, Andrew Morton wrote:
> 
> >On Wed, 22 Feb 2012 13:48:08 +0100 (CET)
> >Jan Engelhardt  wrote:
> >
> >> task: provide a larger task command buffer
> >
> >
> >
> >Why are we bothering ourselves about this?
> 
> Some prefer to know what's going on in the system. Every other or 
> so kernel release there are some new happy kthreads, such as
> 
> 24930 ?S  0:00  \_ [btrfs-endio-1]
> 24931 ?S  0:00  \_ [btrfs-endio-met]
> 24932 ?S  0:00  \_ [btrfs-endio-met]
> 24933 ?S  0:00  \_ [btrfs-endio-wri]
> 24934 ?S  0:00  \_ [btrfs-freespace]
> 
> at which point one is curious to find out the rest of the met and why 
> there are two of them. If expanded one actually sees they are different 
> kthreads (rather than just per-cpu instances for a WQ, for example)
> 
> $ grep Name /proc/{29431,29432}/stat*
> /proc/29431/status:Name: btrfs-endio-meta-1
> /proc/29432/status:Name: btrfs-endio-meta-write-1
> 
> That's all.

doh.  The fix for that is to have less clueless btrfs developers.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/4] mm: writeback: distribute write pages across allowable zones

2011-09-21 Thread Andrew Morton
On Tue, 20 Sep 2011 15:45:13 +0200
Johannes Weiner  wrote:

> This patch allows allocators to pass __GFP_WRITE when they know in
> advance that the allocated page will be written to and become dirty
> soon.  The page allocator will then attempt to distribute those
> allocations across zones, such that no single zone will end up full of
> dirty, and thus more or less, unreclaimable pages.

Across all zones, or across the zones within the node or what?  Some
more description of how all this plays with NUMA is needed, please.

> The global dirty limits are put in proportion to the respective zone's
> amount of dirtyable memory

I don't know what this means.  How can a global limit be controlled by
what is happening within each single zone?  Please describe this design
concept fully.

> and allocations diverted to other zones
> when the limit is reached.

hm.

> For now, the problem remains for NUMA configurations where the zones
> allowed for allocation are in sum not big enough to trigger the global
> dirty limits, but a future approach to solve this can reuse the
> per-zone dirty limit infrastructure laid out in this patch to have
> dirty throttling and the flusher threads consider individual zones.
> 
> ...
>
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -36,6 +36,7 @@ struct vm_area_struct;
>  #endif
>  #define ___GFP_NO_KSWAPD 0x40u
>  #define ___GFP_OTHER_NODE0x80u
> +#define ___GFP_WRITE 0x100u
>  
>  /*
>   * GFP bitmasks..
> 
> ...
>
> +static unsigned long zone_dirtyable_memory(struct zone *zone)

Appears to return the number of pages in a particular zone which are
considered "dirtyable".  Some discussion of how this decision is made
would be illuminating.

> +{
> + unsigned long x;
> + /*
> +  * To keep a reasonable ratio between dirty memory and lowmem,
> +  * highmem is not considered dirtyable on a global level.

Whereabouts in the kernel is this policy implemented? 
determine_dirtyable_memory()?  It does (or can) consider highmem
pages?  Comment seems wrong?

Should we rename determine_dirtyable_memory() to
global_dirtyable_memory(), to get some sense of its relationship with
zone_dirtyable_memory()?

> +  * But we allow individual highmem zones to hold a potentially
> +  * bigger share of that global amount of dirty pages as long
> +  * as they have enough free or reclaimable pages around.
> +  */
> + x = zone_page_state(zone, NR_FREE_PAGES) - zone->totalreserve_pages;
> + x += zone_reclaimable_pages(zone);
> + return x;
> +}
> +
> 
> ...
>
> -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> +static void dirty_limits(struct zone *zone,
> +  unsigned long *pbackground,
> +  unsigned long *pdirty)
>  {
> + unsigned long uninitialized_var(zone_memory);
> + unsigned long available_memory;
> + unsigned long global_memory;
>   unsigned long background;
> - unsigned long dirty;
> - unsigned long uninitialized_var(available_memory);
>   struct task_struct *tsk;
> + unsigned long dirty;
>  
> - if (!vm_dirty_bytes || !dirty_background_bytes)
> - available_memory = determine_dirtyable_memory();
> + global_memory = determine_dirtyable_memory();
> + if (zone)
> + available_memory = zone_memory = zone_dirtyable_memory(zone);
> + else
> + available_memory = global_memory;
>  
> - if (vm_dirty_bytes)
> + if (vm_dirty_bytes) {
>   dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> - else
> + if (zone)

So passing zone==NULL alters dirty_limits()'s behaviour.  Seems that it
flips the function between global_dirty_limits and zone_dirty_limits?

Would it be better if we actually had separate global_dirty_limits()
and zone_dirty_limits() rather than a magical mode?

> + dirty = dirty * zone_memory / global_memory;
> + } else
>   dirty = (vm_dirty_ratio * available_memory) / 100;
>  
> - if (dirty_background_bytes)
> + if (dirty_background_bytes) {
>   background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> - else
> + if (zone)
> + background = background * zone_memory / global_memory;
> + } else
>   background = (dirty_background_ratio * available_memory) / 100;
>  
>   if (background >= dirty)
> 
> ...
>
> +bool zone_dirty_ok(struct zone *zone)

Full description of the return value, please.

> +{
> + unsigned long background_thresh, dirty_thresh;
> +
> + dirty_limits(zone, &background_thresh, &dirty_thresh);
> +
> + return zone_page_state(zone, NR_FILE_DIRTY) +
> + zone_page_state(zone, NR_UNSTABLE_NFS) +
> + zone_page_state(zone, NR_WRITEBACK) <= dirty_thresh;
> +}

We never needed to calculate &background_thresh,.  I wonder if that
matters.

> 
> ...
>

--
To unsubscribe from this list:

Re: [PATCH V8 4/8] mm/fs: add hooks to support cleancache

2011-04-15 Thread Andrew Morton
On Fri, 15 Apr 2011 07:47:57 -0700 (PDT) Dan Magenheimer 
 wrote:

> Hi Minchan --
> 
> > First of all, thanks for resolving conflict with my patch.
> 
> You're welcome!  As I pointed out offlist, yours was the first
> change in MM that caused any semantic changes to the cleancache
> core hooks patch since before 2.6.18.
>  
> > Before I suggested a thing about cleancache_flush_page,
> > cleancache_flush_inode.
> > 
> > what's the meaning of flush's semantic?
> > I thought it means invalidation.
> > AFAIC, how about change flush with invalidate?
> 
> I'm not sure the words "flush" and "invalidate" are defined
> precisely or used consistently everywhere in computer
> science, but I think that "invalidate" is to destroy
> a "pointer" to some data, but not necessarily destroy the
> data itself.   And "flush" means to actually remove
> the data.  So one would "invalidate a mapping" but one
> would "flush a cache".
> 
> Since cleancache_flush_page and cleancache_flush_inode
> semantically remove data from cleancache, I think flush
> is a better name than invalidate.
> 
> Does that make sense?
> 

nope ;)

Kernel code freely uses "flush" to refer to both invalidation and to
writeback, sometimes in confusing ways.  In this case,
cleancache_flush_inode and cleancache_flush_page rather sound like they
might write those things to backing store.  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bugme-new] [Bug 29302] New: Null pointer dereference with large max_sectors_kb

2011-02-18 Thread Andrew Morton
On Fri, 18 Feb 2011 14:16:12 -0500
Chris Mason  wrote:

> Are there any more kernel messages involved before the oops starts? 

The full dmesg is in bugzilla. https://bugzilla.kernel.org/show_bug.cgi?id=29302
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bugme-new] [Bug 29302] New: Null pointer dereference with large max_sectors_kb

2011-02-17 Thread Andrew Morton

(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 17 Feb 2011 13:20:20 GMT
bugzilla-dae...@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=29302
> 
>Summary: Null pointer dereference with large max_sectors_kb
>Product: IO/Storage
>Version: 2.5
> Kernel Version: 2.6.36 - 2.6.38-rc5
>   Platform: All
> OS/Version: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: Block Layer
> AssignedTo: ax...@kernel.dk
> ReportedBy: f...@murder.cz
> Regression: No
> 
> 
> Created an attachment (id=48132)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=48132)
> dmesg with error displayed.
> 
> Hello, 
> I'm not really sure I identified the problem product/component correctly, but
> this seems most appropriate.
> 
> 
> [140517]stg-bot ~ # echo 256 >  /sys/block/sdb/queue/max_sectors_kb
> [140523]stg-bot ~ # mkfs.btrfs /dev/sdb
> 
> WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
> WARNING! - see http://btrfs.wiki.kernel.org before using
> 
> fs created label (null) on /dev/sdb
> nodesize 4096 leafsize 4096 sectorsize 4096 size 2.73TB
> Btrfs Btrfs v0.19
> [140532]stg-bot ~ # mount /dev/sdb /mnt
> [140540]stg-bot ~ # umount /mnt
> [140543]stg-bot ~ # cat /sys/block/sdb/queue/max_hw_sectors_kb >
> /sys/block/sdb/queue/max_sectors_kb
> [140710]stg-bot ~ # mkfs.btrfs /dev/sdb
> 
> WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
> WARNING! - see http://btrfs.wiki.kernel.org before using
> 
> fs created label (null) on /dev/sdb
> nodesize 4096 leafsize 4096 sectorsize 4096 size 2.73TB
> Btrfs Btrfs v0.19
> [140713]stg-bot ~ # mount /dev/sdb /mnt
> Killed
> [140715]stg-bot ~ #
> 
> Now there is a bug in dmesg (output attached) and another attempt to mount the
> device kind of freezes it. The mount blocks, sync blocks, but i can read/write
> the device using dd. And if I, instead of trying to mount again, zero out 
> first
> 1MB, mkfs.btrfs and mount, I get the bug again. Freeze again on second mount
> attempt after that. 
> 
> This happens on 2.6.36 and 2.6.37 with scst patches, 2.6.37 vanilla and on
> 2.6.38-rc5 it just freezes the first time I try to mount. No outuput in dmesg.
> 
> The hardware is Dual Xeon E5620, 12GB ram, Areca 1880 with 3 arrays (testing 
> on
> 3TB raid10 and 10TB raid6), SuperMicro X8DTU-F.
> 
> If I left out any important info, please let me know ;).
> 

A btrfs bug, I suspect.


> [  605.109630] BUG: unable to handle kernel NULL pointer dereference at 
> 0010
> [  605.109928] IP: [] bio_add_page+0xa/0x40
> [  605.110089] PGD 277d70067 PUD 277e0a067 PMD 0 
> [  605.110247] Oops:  [#1] SMP 
> [  605.110394] last sysfs file: 
> /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
> [  605.110686] CPU 6 
> [  605.110698] Modules linked in: ip6table_filter ip6_tables nf_nat_tftp 
> nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp 
> nf_nat_amanda nf_conntrack_amanda nf_conntrack_tftp nf_conntrack_sip 
> nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre 
> nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_irc 
> nf_conntrack_h323 nf_conntrack_ftp xt_physdev xt_hashlimit nfs ib_iser 
> libiscsi scsi_transport_iscsi ib_ucm ib_ipoib rdma_ucm rdma_cm ib_cm iw_cm 
> ib_sa ib_addr ib_uverbs ib_umad mlx4_ib ib_mthca ib_mad ib_core i7core_edac 
> edac_core mlx4_core iTCO_wdt iTCO_vendor_support
> [  605.112285] 
> [  605.112419] Pid: 1, comm: mount Not tainted 2.6.37stg #6 X8DTU/X8DTU
> [  605.112586] RIP: 0010:[]  [] 
> bio_add_page+0xa/0x40
> [  605.112879] RSP: :8801833b39b8  EFLAGS: 00010296
> [  605.113035] RAX:  RBX:  RCX: 
> 
> [  605.113207] RDX: 1000 RSI: ea000c3cd200 RDI: 
> 
> [  605.113382] RBP: 8801833b3ba0 R08:  R09: 
> 
> [  605.113554] R10:  R11:  R12: 
> 
> [  605.113723] R13:  R14: a000 R15: 
> 88024a19ab98
> [  605.113895] FS:  7fbcfd971740() GS:880339c8() 
> knlGS:
> [  605.114188] CS:  0010 DS:  ES:  CR0: 80050033
> [  605.114352] CR2: 0010 CR3: 0001c17d5000 CR4: 
> 06e0
> [  605.114525] DR0:  DR1:  DR2: 
> 
> [  605.114695] DR3:  DR6: 0ff0 DR7: 
> 0400
> [  605.114864] Process mount (pid: 1, threadinfo 8801833b2000, task 
> 8801b8b48cf0)
> [  605.115157] Stack:
> [  605.115290]   81251384 0140 
> ea000c3cd200
> [  605.115590]   4a19ab88 8801b966f380 
> 1000
> [  605.115884]  81255810 00

Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs

2011-01-19 Thread Andrew Morton
On Thu, 20 Jan 2011 14:19:50 +0800 Wu Fengguang  wrote:

> On Thu, Jan 20, 2011 at 02:12:33PM +0800, Li, Shaohua wrote:
> > On Thu, 2011-01-20 at 13:55 +0800, Andrew Morton wrote:
> > > On Thu, 20 Jan 2011 13:38:18 +0800 Shaohua Li  
> > > wrote:
> > > 
> > > > > ext2, minix and probably others create an address_space for each
> > > > > directory.  Heaven knows what xfs does (for example).
> > > > yes, this is for one directiory, but the all files's metadata are in
> > > > block_dev address_space.
> > > > I thought you mean there are several block_dev address_space like
> > > > address_space in some filesystems, which doesn't fit well in my
> > > > implementation. for ext like filesystem, there is only one
> > > > address_space. for filesystems with several address_space, my proposal
> > > > is map them to a virtual big address_space in the new ioctls.
> > > 
> > > ext2 and minixfs (and I think sysv and ufs) have a separate
> > > address_space for each directory.  I don't see how those can be
> > > represented with a single "virtual big address_space" - we also need
> > > identifiers in there so each directory's address_space can be created
> > > and appropriately populated.
> > Oh, I misunderstand your comments. you are right, the ioctl methods
> > don't work for ext2. the dir's address_space can't be readahead either.
> > Looks we could only do the metadata readahead in filesystem specific
> > way.
> 
> There should be little interest on ext2 boot time optimization.

We're discussing the userspace interface design here.  If that design
is unsuitable for ext2, minixfs, sysvfs and udf then it's not the right
design!

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs

2011-01-19 Thread Andrew Morton
On Thu, 20 Jan 2011 14:12:33 +0800 Shaohua Li  wrote:

> On Thu, 2011-01-20 at 13:55 +0800, Andrew Morton wrote:
> > On Thu, 20 Jan 2011 13:38:18 +0800 Shaohua Li  wrote:
> > 
> > > > ext2, minix and probably others create an address_space for each
> > > > directory.  Heaven knows what xfs does (for example).
> > > yes, this is for one directiory, but the all files's metadata are in
> > > block_dev address_space.
> > > I thought you mean there are several block_dev address_space like
> > > address_space in some filesystems, which doesn't fit well in my
> > > implementation. for ext like filesystem, there is only one
> > > address_space. for filesystems with several address_space, my proposal
> > > is map them to a virtual big address_space in the new ioctls.
> > 
> > ext2 and minixfs (and I think sysv and ufs) have a separate
> > address_space for each directory.  I don't see how those can be
> > represented with a single "virtual big address_space" - we also need
> > identifiers in there so each directory's address_space can be created
> > and appropriately populated.
> Oh, I misunderstand your comments. you are right, the ioctl methods
> don't work for ext2. the dir's address_space can't be readahead either.
> Looks we could only do the metadata readahead in filesystem specific
> way.

Another way of doing all this would be to implement some sort of
lookaside cache at the vfs->block boundary.  At boot time, load that
cache up with all the disk blocks which we know the boot will need (a
single ascending pass across the disk) and then when the vfs/fs goes to
read a disk block take a peek in that cache first and if it's a hit,
either steal the page or memcpy it.

It has the obvious coherence problems which would be pretty simple to
solve by hooking into the block write path as well.  The list of needed
blocks can be very simply generated with existing blktrace
infrastructure.  It does add permanent runtime overhead - once the
cache is invalidated and disabled, every IO operation would incur a
test-n-not-taken-branch.  Maybe not too bad.

Need to handle small-memory systems somehow, where the cache simply
ooms the machine or becomes ineffective because it's causing eviction
elsewhere.

It could perhaps all be implemented as an md or dm driver.

Or even as an IO scheduler.  I say this because IO schedulers can be
replaced on-the-fly, so the caching layer can be unloaded from the
stack once it is finished with.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs

2011-01-19 Thread Andrew Morton
On Thu, 20 Jan 2011 13:38:18 +0800 Shaohua Li  wrote:

> > ext2, minix and probably others create an address_space for each
> > directory.  Heaven knows what xfs does (for example).
> yes, this is for one directiory, but the all files's metadata are in
> block_dev address_space.
> I thought you mean there are several block_dev address_space like
> address_space in some filesystems, which doesn't fit well in my
> implementation. for ext like filesystem, there is only one
> address_space. for filesystems with several address_space, my proposal
> is map them to a virtual big address_space in the new ioctls.

ext2 and minixfs (and I think sysv and ufs) have a separate
address_space for each directory.  I don't see how those can be
represented with a single "virtual big address_space" - we also need
identifiers in there so each directory's address_space can be created
and appropriately populated.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs

2011-01-19 Thread Andrew Morton
On Thu, 20 Jan 2011 11:21:49 +0800 Shaohua Li  wrote:

> > It seems to return a single offset/length tuple which refers to the
> > btrfs metadata "file", with the intent that this tuple later be fed
> > into a btrfs-specific readahead ioctl.
> > 
> > I can see how this might be used with say fatfs or ext3 where all
> > metadata resides within the blockdev address_space.  But how is a
> > filesytem which keeps its metadata in multiple address_spaces supposed
> > to use this interface?
> Oh, this looks like a big problem, thanks for letting me know such
> filesystems. is it possible specific filesystem mapping multiple
> address_space ranges to a virtual big ranges? the new ioctls handle the
> mapping.

I'm not sure what you mean by that.

ext2, minix and probably others create an address_space for each
directory.  Heaven knows what xfs does (for example).

> If the issue can't be solved, we can only add the metadata readahead for
> specific implementation like my initial post instead of a generic
> interface.

Well.  One approach would be for the kernel to report the names of all
presently-cached files.  And for each file, report the offsets of all
the pages which are presently in pagecache.  This all gets put into a
database.

At cold-boot time we open all those files and read the relevant files.

To optimise that further, userspace would need to use fibmap to work
out the LBA(s) of each page, and then read the pages in an optimised order.

To optimise that even further, userspace would need to find the on-disk
locations all the metadata for each file, generate the metadata->data
dependencies and then incorporate that into the reading order.

I actually wrote code to do all this.  Gad, it was ten years ago.  I
forget how it works, but I do recall that it pioneered the technology
of doing (effecticely) a sys_write(1, ...) from a kernel module, so the
module's output appears on modprobe's stdout and can be redirected to
another file or a pipe.  So sue me!  It's in
http://userweb.kernel.org/~akpm/stuff/fboot.tar.gz.  Good luck with
that ;)



It walked mem_map[], indentifying pagecache pages, walking back from
the page* all the way to the filename then logging the pathname and the
file's pagecache indexes.  It also handled the blockdev superblock,
where all the ext3 metadata resides.

There are much smarter ways of doing this of course, especially with
the vfs data structures which we later added.



According to http://kerneltrap.org/node/2157 it sped up cold boot by
"10%", whatever that means.  Seems that I wasn't sufficiently impressed
by that and got distracted.


I'm not sure any of that was very useful, really.  A full-on coldboot
optimiser really wants visibility into every disk block which need to
be read, and then mechanisms to tell the kernel to load those blocks
into the correct address_spaces.  That's hard, because file data
depends on file metadata.  A vast simplification would be to do it in
two disk passes: read all the metadata on pass 1 then all the data on
pass 2.

A totally different approach is to reorder all the data and metadata
on-disk, so no special cold-boot processing is needed at all.

And a third approach is to save all the cache into a special
file/partition/etc and to preload all that into kernel data structures
at boot.  Obviously this one is ricky/tricky because the on-disk
replica of the real data can get out of sync with the real data.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs

2011-01-19 Thread Andrew Morton
On Thu, 20 Jan 2011 10:48:33 +0800 Shaohua Li  wrote:

> On Thu, 2011-01-20 at 10:42 +0800, Andrew Morton wrote:
> > On Thu, 20 Jan 2011 10:30:47 +0800 Shaohua Li  wrote:
> > 
> > > > I don't know if this is worth addressing.  Perhaps require that the
> > > > filp refers to the root of the fs?
> > > I didn't see why this is needed, but I can limit the fip to the root of
> > > the fs.
> > 
> > I don't think it matters much either.  The only problem I can see is if
> > we were to later try to extend the ioctl into a per-file thing.
> since we return page range, a metadata page might be shared by several
> files, which makes the per-file thing doesn't work. For a fs using
> trees, it's even more hard to distinguish a file's metadata

hm, why.  A query for "which blocks need to be read to access this
file" may return blocks which are shared with other files, but it's
still useful info.  Because it will represent vastly less data (and
hence IO) than the current fs-wide thing.

Now I actually look at it, I cannot find any documentation for the ioctl!  

It seems to return a single offset/length tuple which refers to the
btrfs metadata "file", with the intent that this tuple later be fed
into a btrfs-specific readahead ioctl.

I can see how this might be used with say fatfs or ext3 where all
metadata resides within the blockdev address_space.  But how is a
filesytem which keeps its metadata in multiple address_spaces supposed
to use this interface?

So.  Please fully document the proposed userspace APIs!  This should be
the first thing we look at.  Then we can take a look at how applicable
that is to other-than-btrfs filesystems.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/5]add new ioctls to do metadata readahead in btrfs

2011-01-19 Thread Andrew Morton
On Thu, 20 Jan 2011 10:34:18 +0800 Shaohua Li  wrote:

> > >   Under a harddisk based netbook with Meego, the metadata readahead
> > > reduced about 3.5s boot time in average from total 16s.
> > 
> > That's a respectable speedup.  And it *needs* to be a good speedup,
> > given how hacky all of this is!
> > 
> > But then..  reducing bootup time on a laptop/desktop/server by 3.5s
> > isn't exactly a world-shattering benefit, is it?  Is it worth all the
> > hacky code?
> a laptop/desktop/server need read more data from hard disks, this will
> give more bootup time saving I think, though not tested yet.

Well, the whole point of the patch is to improve boot times, so the
more boot-time testing you can do, the better that is!

> > It would be much more valuable if those 3.5 seconds were available to
> > devices which really really care about bootup times, but very few of
> > those devices use rotating disks nowadays, I expect?
> Currently most popular netbooks are using rotating disks actually. And
> this will benefit laptop/desktop too.

But my point is that three seconds boot-time improvement for a system
which has an uptime of days or months isn't terribly exciting.

What *would* be terribly exciting is a three-second improvement for
cameras, cellphones, etc.  But they don't use spinning disks.

Can we expect *any* benefit for flash-type storage devices?  If so, how
much?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs

2011-01-19 Thread Andrew Morton
On Thu, 20 Jan 2011 10:30:47 +0800 Shaohua Li  wrote:

> > I don't know if this is worth addressing.  Perhaps require that the
> > filp refers to the root of the fs?
> I didn't see why this is needed, but I can limit the fip to the root of
> the fs.

I don't think it matters much either.  The only problem I can see is if
we were to later try to extend the ioctl into a per-file thing.

> > Also, is this a privileged operation?  If not, then that might be a
> > problem - could it be used by unprivileged users to work out which
> > files have been opened recently or something like that?
> it's harmless even a unprivileged user uses it. I don't think
> unprivileged user can decode the data returned from the ioctl.

um.

Well, by doing a before-and-after thing I can use this ioctl to work
out what metadata blocks are used when someone reads
/my/super/secret-directory/foo.  Then I can write a program which sits
there waiting until someone else reads /my/super/secret-directory/foo. 
Then I can use that information to start WWIII or something.

I dunno, strange things happen.  Unless there's a good *need* to make
this available to unprivileged users then we should not do so.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs

2011-01-19 Thread Andrew Morton
On Wed, 19 Jan 2011 09:15:18 +0800
Shaohua Li  wrote:

> Subject: add metadata_incore ioctl in vfs
> 
> Add an ioctl to dump filesystem's metadata in memory in vfs. Userspace 
> collects
> such info and uses it to do metadata readahead.
> Filesystem can hook to super_operations.metadata_incore to get metadata in
> specific approach. Next patch will give an example how to implement
> .metadata_incore in btrfs.
> 
> ...
>  
>  /*
> + * Copy info about metadata in memory to userspace
> + * Returns:
> + * = 1, one metadata range copied to userspace
> + * = 0, no more metadata
> + * < 0, error
> + */
> +static int ioctl_metadata_incore(struct file *filp, void __user *argp)
> +{
> + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;
> + struct metadata_incore_args args;
> + loff_t offset;
> + ssize_t size;
> +
> + if (!sb->s_op->metadata_incore)
> + return -EINVAL;
> +
> + if (copy_from_user(&args, argp, sizeof(args)))
> + return -EFAULT;
> +
> + /* we check metadata info in page unit */
> + if (args.offset & ~PAGE_CACHE_MASK)
> + return -EINVAL;
> +
> + offset = args.offset;
> +
> + if (sb->s_op->metadata_incore(sb, &offset, &size) < 0)
> + return 0;
> +
> + args.address = offset;
> + args.size = size;
> + args.unused = 0;
> +
> + if (copy_to_user(argp, &args, sizeof(args)))
> + return -EFAULT;
> +
> + return 1;
> +}

So userspace opens any file on the fs and runs this ioctl against it?

That's a pretty awkward interface - we're doing an fs-wide operation
but the fs is identified by a single file which happens to live on that
fs.  For example, this precludes a future extension whereby userspace
can query the incore metadata for a particular file.  The statfs
syscall sucks in the same manner.

I don't know if this is worth addressing.  Perhaps require that the
filp refers to the root of the fs?


Also, is this a privileged operation?  If not, then that might be a
problem - could it be used by unprivileged users to work out which
files have been opened recently or something like that?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/5]add new ioctls to do metadata readahead in btrfs

2011-01-19 Thread Andrew Morton
On Wed, 19 Jan 2011 09:15:15 +0800
Shaohua Li  wrote:

>   We have file readahead to do asyn file read, but has no metadata
> readahead. For a list of files, their metadata is stored in fragmented
> disk space and metadata read is a sync operation, which impacts the
> efficiency of readahead much. The patches try to add meatadata readahead
> for btrfs. It has two advantages. One is make metadata read async, the
> other is significant reducing disk I/O seek.
>   In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> the inode to a fd so we could use existing syscalls (readahead, mincore
> or upcoming fincore) to do readahead, but the inode is hidden, there is
> no easy way for this from my understanding. Another problem is we need
> check page referenced bit to make sure if a page is valid, which isn't
> ok doing this in fincore/mincore. And in metadata readahead, filesystem
> need specific checking like the patch4. Doing the checking in current
> API (for example fadvise) will mess things too. So we add two ioctls for
> this. One is like readahead syscall, the other is like micore/fincore
> syscall.

Has anyone looked at implementing this for filesystems other than
btrfs?  Have the ext4 guys taken a look?  Did they see any impediments
to implementing it for ext4?

>   Under a harddisk based netbook with Meego, the metadata readahead
> reduced about 3.5s boot time in average from total 16s.

That's a respectable speedup.  And it *needs* to be a good speedup,
given how hacky all of this is!

But then..  reducing bootup time on a laptop/desktop/server by 3.5s
isn't exactly a world-shattering benefit, is it?  Is it worth all the
hacky code?

It would be much more valuable if those 3.5 seconds were available to
devices which really really care about bootup times, but very few of
those devices use rotating disks nowadays, I expect?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 26242] New: BUG: unable to handle kernel NULL pointer dereference at (null)

2011-01-06 Thread Andrew Morton

(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 6 Jan 2011 20:59:08 GMT
bugzilla-dae...@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=26242
> 
>Summary: BUG: unable to handle kernel NULL pointer dereference
> at   (null)
>Product: Memory Management
>Version: 2.5
> Kernel Version: 2.6.37
>   Platform: All
> OS/Version: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: low
>   Priority: P1
>  Component: Other
> AssignedTo: a...@linux-foundation.org
> ReportedBy: stmicha...@web.de
> Regression: No
> 
> 
> My system crashed with the following output:
> 
> ___
> Jan  6 20:06:22 eser kernel: [19365.562621] BUG: unable to handle kernel NULL
> pointer dereference at   (null)
> Jan  6 20:06:22 eser kernel: [19365.562675] IP: []
> kmap_atomic_prot+0x1b/0x100
> Jan  6 20:06:22 eser kernel: [19365.562709] *pde =  
> Jan  6 20:06:22 eser kernel: [19365.562726] Oops:  [#1] PREEMPT SMP 
> Jan  6 20:06:22 eser kernel: [19365.562752] last sysfs file:
> /sys/devices/platform/coretemp.0/temp1_input
> Jan  6 20:06:22 eser kernel: [19365.562777] Modules linked in: isofs usblp
> usb_storage uas nls_utf8 udf crc_itu_t fuse ipt_MASQUERADE xt_pkttype 
> xt_TCPMSS
> xt_tcpudp ipt_LOG xt_limit iptable_nat nf_nat snd_pcm_oss snd_mixer_oss 
> snd_seq
> snd_seq_device xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter
> nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 
> ip_tables
> cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf
> speedstep_lib ip6_tables x_tables loop arc4 ecb b43 snd_hda_codec_si3054
> mac80211 snd_hda_codec_realtek snd_hda_intel r8169 snd_hda_codec cfg80211
> sdhci_pci mii snd_hwdep acer_wmi sdhci snd_pcm rfkill iTCO_wdt yenta_socket 
> ssb
> tifm_7xx1 iTCO_vendor_support sg sr_mod mmc_core snd_timer pcmcia_core
> tifm_core cdrom pcspkr wmi pcmcia_rsrc psmouse snd i2c_i801 shpchp evdev
> soundcore battery rng_core ac snd_page_alloc pci_hotplug dm_crypt usbhid hid
> nouveau ttm drm_kms_helper drm uhci_hcd rtc_cmos ata_piix i2c_algo_bit 
> i2c_core
> rtc_core cfbcopyarea ehci_hcd usb
> Jan  6 20:06:22 eser kernel: core video cfbimgblt cfbfillrect rtc_lib output
> button nls_base dm_snapshot sha512_generic sha256_generic xts cbc aes_i586
> aes_generic cfq_iosched blk_cgroup btrfs zlib_deflate libcrc32c reiserfs ahci
> libahci libata coretemp hwmon fan thermal processor unix [last unloaded:
> pktcdvd]
> Jan  6 20:06:22 eser kernel: [19365.563014] 
> Jan  6 20:06:22 eser kernel: [19365.563014] Pid: 15675, comm: gimp-2.6 Not
> tainted 2.6.37 #1 Myall2 /Aspire 9410
> Jan  6 20:06:22 eser kernel: [19365.563014] EIP: 0060:[] EFLAGS:
> 00010202 CPU: 0
> Jan  6 20:06:22 eser kernel: [19365.563014] EIP is at
> kmap_atomic_prot+0x1b/0x100
> Jan  6 20:06:22 eser kernel: [19365.563014] EAX:  EBX: 0600 ECX:
> f3a82000 EDX: 0163
> Jan  6 20:06:23 eser kernel: [19365.563014] ESI: f3a83eac EDI:  EBP:
> f3a83db8 ESP: f3a83da8
> Jan  6 20:06:23 eser kernel: [19365.563014]  DS: 007b ES: 007b FS: 00d8 GS:
> 0033 SS: 0068
> Jan  6 20:06:23 eser kernel: [19365.563014] Process gimp-2.6 (pid: 15675,
> ti=f3a82000 task=eaf28000 task.ti=f3a82000)
> Jan  6 20:06:23 eser kernel: [19365.563014] Stack:
> Jan  6 20:06:23 eser kernel: [19365.563014]  f3a83dc0 0600 f3a83eac
>  f3a83dc0 c022998e f3a83dd8 c0299c0c
> Jan  6 20:06:23 eser kernel: [19365.563014]  e0359240 0600 1000
> 1000 f3a83dfc f828d6da 0600 1008
> Jan  6 20:06:23 eser kernel: [19365.563014]  0002  0002
> 2000 1608 f3a83ed0 f828e1ff 1608
> Jan  6 20:06:23 eser kernel: [19365.563014] Call Trace:
> Jan  6 20:06:23 eser kernel: [19365.563014]  [] ?
> __kmap_atomic+0xe/0x10
> Jan  6 20:06:23 eser kernel: [19365.563014]  [] ?
> iov_iter_copy_from_user_atomic+0x3c/0x90
> Jan  6 20:06:23 eser kernel: [19365.563014]  [] ?
> btrfs_copy_from_user+0x5a/0xb0 [btrfs]
> Jan  6 20:06:23 eser kernel: [19365.563014]  [] ?
> btrfs_file_aio_write+0x52f/0x9c0 [btrfs]
> Jan  6 20:06:23 eser kernel: [19365.563014]  [] ?
> __mem_cgroup_commit_charge+0x70/0xe0
> Jan  6 20:06:23 eser kernel: [19365.563014]  [] ?
> do_sync_write+0x9c/0xd0
> Jan  6 20:06:23 eser kernel: [19365.563014]  [] ?
> rw_verify_area+0x65/0x100
> Jan  6 20:06:23 eser kernel: [19365.563014]  [] ?
> vfs_write+0x9a/0x160
> Jan  6 20:06:23 eser kernel: [19365.563014]  [] ?
> fget_light+0x91/0xb0
> Jan  6 20:06:23 eser kernel: [19365.563014]  [] ?
> do_sync_write+0x0/0xd0
> Jan  6 20:06:23 eser kernel: [19365.563014]  [] ? 
> sys_write+0x3d/0x70
> Jan  6 20:06:23 eser kernel: [19365.563014]  [] ?
> sysenter_do_call+0x12/0x28
> Jan  6 20:06:23 eser kernel: [19365.563014]  [] ?
> quirk_amd_ide_mode+0x40/0x95
> Jan  6 20:06:23 eser kernel

Re: [RFC 3/5]add metadata_readahead ioctl in vfs

2010-12-13 Thread Andrew Morton
On Mon, 13 Dec 2010 15:22:14 +0800
Shaohua Li  wrote:

> Add metadata readahead ioctl in vfs. Filesystem can hook to
> super_operations.metadata_readahead to handle filesystem specific task.
> Next patch will give an example how btrfs implements it.
> 
> Signed-off-by: Shaohua Li 
> 
> ---
>  fs/compat_ioctl.c  |1 +
>  fs/ioctl.c |   21 +
>  include/linux/fs.h |8 
>  3 files changed, 30 insertions(+)
> 
> Index: linux/fs/ioctl.c
> ===
> --- linux.orig/fs/ioctl.c 2010-12-13 14:01:56.0 +0800
> +++ linux/fs/ioctl.c  2010-12-13 14:10:42.0 +0800
> @@ -605,6 +605,24 @@ static int ioctl_metadata_incore(struct
>   return entries;
>  }
>  
> +static int ioctl_metadata_readahead(struct file *filp, void __user *argp)
> +{
> + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;
> + struct metadata_readahead_args args;
> +
> + if (!sb->s_op->metadata_readahead)
> + return -EOPNOTSUPP;

-ENOTNETWORKING

> + if (copy_from_user(&args, (struct metadata_readahead_args __user *)argp,

Unneeded cast.

> + sizeof(args)))
> + return -EFAULT;
> +
> + if ((args.offset & ~PAGE_CACHE_MASK) || (args.size & ~PAGE_CACHE_MASK))

Why?

> + return -EINVAL;
> +
> + return sb->s_op->metadata_readahead(sb, args.offset, args.size);
> +}
> +

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 2/5] implement metadata_incore in btrfs

2010-12-13 Thread Andrew Morton
On Mon, 13 Dec 2010 15:22:11 +0800
Shaohua Li  wrote:

> Implement btrfs specific .metadata_incore.
> In btrfs, all metadata pages are in a special btree_inode, we take pages from 
> it.
> we only account updated and referenced pages here. Say we collect metadata 
> info
> in one boot, do metadata readahead in next boot and we might collect metadata
> again. The readahead could read garbage data in as metadata could be changed
> from first run. If we only account updated pages, the metadata info collected
> by userspace will increase every run. Btrfs alloc_extent_buffer will do
> mark_page_accessed() for pages which will be used soon, so we could use
> referenced bit to filter some garbage pages.
> 
> Signed-off-by: Shaohua Li 
> 
> ---
>  fs/btrfs/super.c |   48 
>  1 file changed, 48 insertions(+)
> 
> Index: linux/fs/btrfs/super.c
> ===
> --- linux.orig/fs/btrfs/super.c   2010-12-07 10:10:20.0 +0800
> +++ linux/fs/btrfs/super.c2010-12-07 13:25:20.0 +0800
> @@ -39,6 +39,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "compat.h"
>  #include "ctree.h"
>  #include "disk-io.h"
> @@ -845,6 +846,52 @@ static int btrfs_unfreeze(struct super_b
>   return 0;
>  }
>  
> +static int btrfs_metadata_incore(struct super_block *sb, loff_t *offset,
> + ssize_t *size)
> +{
> + struct btrfs_root *tree_root = btrfs_sb(sb);
> + struct inode *btree_inode = tree_root->fs_info->btree_inode;
> + struct pagevec pvec;
> + loff_t index = (*offset) >> PAGE_CACHE_SHIFT;

pgoff_t would be a more appropriate type.

> + int i, nr_pages;
> +
> + *size = 0;
> +retry:
> + pagevec_init(&pvec, 0);
> + nr_pages = pagevec_lookup(&pvec, btree_inode->i_mapping, index,
> + PAGEVEC_SIZE);
> + if (nr_pages == 0)
> + goto out;
> + for (i = 0; i < nr_pages; i++) {
> + struct page *page = pvec.pages[i];
> +
> + /* Only take pages with 'referenced' bit set */

The comment describes the utterly obvious and doesn't explain the
utterly unobvious: "why?".

> + if (PageUptodate(page) && PageReferenced(page)) {
> + if (*size == 0) {
> + *size += PAGE_CACHE_SIZE;
> + *offset = page->index << PAGE_CACHE_SHIFT;
> + continue;
> + }
> + if (page->index !=
> + (*offset + *size) >> PAGE_CACHE_SHIFT)
> + break;
> + *size += PAGE_CACHE_SIZE;
> + } else if (*size > 0)
> + break;
> + else
> + index = page->index + 1;
> + }
> + pagevec_release(&pvec);
> +
> + if (nr_pages > 0 && *size == 0)
> + goto retry;

I don't think I know why this retry loop exists.  A comment would be
nice.

> +out:
> + if (*size > 0)
> + return 0;
> + else
> + return -ENOENT;
> +}
> +
>  static const struct super_operations btrfs_super_ops = {
>   .drop_inode = btrfs_drop_inode,
>   .evict_inode= btrfs_evict_inode,
> @@ -859,6 +906,7 @@ static const struct super_operations btr
>   .remount_fs = btrfs_remount,
>   .freeze_fs  = btrfs_freeze,
>   .unfreeze_fs= btrfs_unfreeze,
> + .metadata_incore = btrfs_metadata_incore,
>  };
>  
>  static const struct file_operations btrfs_ctl_fops = {
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 1/5] add metadata_incore ioctl in vfs

2010-12-13 Thread Andrew Morton
On Mon, 13 Dec 2010 15:22:10 +0800
Shaohua Li  wrote:

> Add an ioctl to dump filesystem's metadata in memory in vfs. Userspace 
> collects
> such info and uses it to do metadata readahead.
> Filesystem can hook to super_operations.metadata_incore to get metadata in
> specific approach. Next patch will give an example how to implement
> .metadata_incore in btrfs.
> 

Please cc Michael Kerrisk  and
linux-...@vger.kernel.org.  I'm sure that assistance writing the
manpage would be appreciated.

>
> ...
>
>  /*
> + * Copy info about metadata in memory to userspace
> + * Returns:
> + * > 0, number of metadata_incore_ent entries copied to userspace
> + * = 0, no more metadata
> + * < 0, error
> + */
> +static int ioctl_metadata_incore(struct file *filp, void __user *argp)
> +{
> + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;
> + struct metadata_incore_args args;
> + struct metadata_incore_ent ent;
> + loff_t offset, last_offset = 0;
> + ssize_t size, last_size = 0;
> + __u64 __user vec_addr;
> + int entries = 0;
> +
> + if (!sb->s_op->metadata_incore)
> + return -EOPNOTSUPP;

EOPNOTSUPP is a networking errno - it doesn't seem appropriate for an
fs ioctl.

> + if (copy_from_user(&args, (struct metadata_incore_args __user *)argp,

Unneeded typecast.

> + sizeof(args)))
> + return -EFAULT;
> +
> + /* Check the start address: needs to be page-aligned.. */

Why?  The comment should tell me this.

> + if (args.offset & ~PAGE_CACHE_MASK)
> + return -EINVAL;
> +
> + if ((args.vec_size % sizeof(struct metadata_incore_ent)) != 0)
> + return -EINVAL;
> +
> + if (!access_ok(VERIFY_WRITE, args.vec_addr, args.vec_size))

Seems unneccessary - copy_to_user() checks this.

> + return -EFAULT;
> +
> + offset = args.offset;
> +
> + ent.unused = 0;
> + vec_addr = args.vec_addr;
> +
> + while (vec_addr < args.vec_addr + args.vec_size) {
> + if (signal_pending(current))
> + return -EINTR;
> + cond_resched();
> +
> + if (sb->s_op->metadata_incore(sb, &offset, &size) < 0)
> + break;
> + /* A merge or offset == 0 */
> + if (offset == last_offset + last_size) {
> + last_size += size;
> + offset = offset + size;
> + continue;
> + }
> + ent.offset = last_offset;
> + ent.size = last_size;
> + if (copy_to_user((void *)(long)vec_addr, &ent, sizeof(ent)))
> + return -EFAULT;
> + vec_addr += sizeof(ent);
> + entries++;
> +
> + last_offset = offset;
> + last_size = size;
> + ent.unused = 0;
> + offset = offset + size;
> + }
> +
> + if (last_size > 0 && vec_addr < args.vec_addr + args.vec_size) {
> + ent.offset = last_offset;
> + ent.size = last_size;
> + if (copy_to_user((void *)(long)vec_addr, &ent, sizeof(ent)))
> + return -EFAULT;
> + entries++;
> + }
> +
> + return entries;
> +}
> +
> +/*
>   * When you add any new common ioctls to the switches above and below
>   * please update compat_sys_ioctl() too.
>   *
>
> ...
>

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fs: fix deadlocks in writeback_if_idle

2010-11-29 Thread Andrew Morton
On Thu, 25 Nov 2010 14:53:56 +1100
Nick Piggin  wrote:

> On Wed, Nov 24, 2010 at 02:10:28PM +0100, Jan Kara wrote:
> > On Wed 24-11-10 12:03:43, Nick Piggin wrote:
> > > > For the _nr variant that btrfs uses, it's worse for the filesystems
> > > > that don't have a 1:1 bdi<->sb mapping.  It might not actually write any
> > > > of the pages from the SB that is out of space.
> > > 
> > > That's true, but it might not write anything anyway (and after we
> > > check whether writeout is in progress, the writeout thread could go
> > > to sleep and not do anything anyway).
> > > 
> > > So it's a pretty hacky interface anyway. If you want to do anything
> > > deterministic, you obviously need real coupling between producer and
> > > consumer. This should only be a performance tweak (or a workaround
> > > hack in worst case).
> >   Yes, the current interface is a band aid for the problem and better
> > interface is welcome. But it's not trivial to do better...
> > 
> > > > > It makes no further guarantees, and anyway
> > > > > the sb has to compete for normal writeback within this bdi.
> > > > 
> > > > > 
> > > > > I think Christoph is right because filesystems should not really
> > > > > know about how bdi writeback queueing works. But I don't know if it's
> > > > > worth doing anything more complex for this functionality?
> > > > 
> > > > I think we should make a writeback_inodes_sb_unlocked() that doesn't
> > > > warn when the semaphore isn't held.  That should be enough given where
> > > > btrfs and ext4 are calling it from.
> > > 
> > > It doesn't solve the bugs -- calling and waiting for writeback is
> > > still broken because completion requires i_mutex and it is called
> > > from under i_mutex.
> >   Well, as I wrote in my previous email, only ext4 has the problem with
> > i_mutex and I personally view it as a bug. But ultimately it's Ted's call
> > to decide.
> 
> Well, for now, the easiest and simplest fix is my patch, I think. The
> objection is that we may not write out anything for the specified sb,
> but the current implementation provides no such guarantees at all
> anyway, so I don't think it's a big issue.

Well yes.  We take something which will fail occasionally and with your
patch replace it with something which will fail a bit more often.  Why
don't we go all the way and do something which will fail *even more
often*.  Namely, just delete the damn function in the hope that the
resulting failures will provoke the ext4 crew into doing something sane
this time?

Guys, this delalloc thing *sucks*.  And here we are just sticking new
bandaids on top of the old bandaids.  And the btrfs approach isn't
exactly a thing of glory, either.

So...  nope.  I won't be applying Nick's patch.  Please fix this thing
properly - you have a whole month!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fs: fix deadlocks in writeback_if_idle

2010-11-25 Thread Andrew Morton
On Thu, 25 Nov 2010 11:41:50 +0200 Boaz Harrosh  wrote:

> On 11/25/2010 12:47 AM, Andrew Morton wrote:
> > On Tue, 23 Nov 2010 07:34:07 -0500
> > Chris Mason  wrote:
> > 
> >> For btrfs there's only one bdi per SB, but for most everyone else a disk
> >> with a bunch of partitions is going to have multiple filesystems on the
> >> same bdi.
> > 
> > um, please explain why that wasn't idiotic?  The BDI is a
> > representation of a backing device and it's *supposed* to provide
> > visibility into what's happening against other partitions on the same
> > device.  Creating a BDI per SB (it didn't even occur to me to think
> > that a filesystem was even able to do this) breaks that.
> > 
> 
> In btrfs an SB my span multiple partitions. How else can it be solved?

Associate a number of bdi's with the superblock.  If necessary, convert
core kernel to operate on groups of BDI's.  Which shouldn't be too hard
given that core kernel already does this, for MD.

ie: something which faithfully models what is actually going on, rather
than simply bending reality.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fs: fix deadlocks in writeback_if_idle

2010-11-24 Thread Andrew Morton
On Wed, 24 Nov 2010 12:03:43 +1100
Nick Piggin  wrote:

> On Tue, Nov 23, 2010 at 01:58:24PM -0500, Chris Mason wrote:
> > > > My original btrfs patch just exported the bdi_ funcs so that btrfs could
> > > > do the above internally.  But Christoph objected, and I think he's
> > > > right.  We should either give everyone a bdi or make sure the writeback
> > > > func kicks only one filesystem.
> > > 
> > > Well it's just kicking the writeback thread, and it will writeback
> > > from that particular sb.
> > 
> > Hmmm?  It will writeback for all the SBs on that bdi.  In the current
> > form that ext4 uses, that gets pretty expensive if you have a bunch of
> > large partitions and you're only running out of space on one of them.
> 
> Right. But if the bdi has writeback in progress (which would be most
> of the time, on a busy filesystem), writeback_if_idle doesn't do
> anything, and it is happy just for the background writeback to
> eventually get around to writing out for us.

That doesn't work if you're running btfs (apparently short for
"busticated filesystem") because the bdi-per-sb thing carefully hid the
information which you're looking for.


We still don't have a fix for this bug yet, it appears, btw.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fs: fix deadlocks in writeback_if_idle

2010-11-24 Thread Andrew Morton
On Tue, 23 Nov 2010 07:34:07 -0500
Chris Mason  wrote:

> For btrfs there's only one bdi per SB, but for most everyone else a disk
> with a bunch of partitions is going to have multiple filesystems on the
> same bdi.

um, please explain why that wasn't idiotic?  The BDI is a
representation of a backing device and it's *supposed* to provide
visibility into what's happening against other partitions on the same
device.  Creating a BDI per SB (it didn't even occur to me to think
that a filesystem was even able to do this) breaks that.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-22 Thread Andrew Morton
On Wed, 17 Nov 2010 21:18:13 -0600
Eric Sandeen  wrote:

> On 11/17/10 12:10 AM, Nick Piggin wrote:
> > On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:
> >> On 11/16/10 10:38 PM, Nick Piggin wrote:
>  as for the locking problems ... sorry about that!
> >>>
> >>> That's no problem. So is that an ack? :)
> >>>
> >>
> >> I'd like to test it with the original case it was supposed to solve; will
> >> do that tomorrow.
> > 
> > OK, but it shouldn't make much difference, unless there is a lot of
> > strange activity happening on the sb (like mount / umount / remount /
> > freeze / etc).
> > 
> 
> Ok, good point.  And since I failed in my promise anyway, I won't
> hold you up;
> 
> Acked-by: Eric Sandeen 

That patch still sucks!

We all breathlessly await v2, which will have a complete changelog and
approximately correct comments.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-18 Thread Andrew Morton
On Thu, 18 Nov 2010 13:51:15 -0500
Chris Mason  wrote:

> > If those functions "fix" a testcase then it was by sheer luck, and the
> > fs's ENOSPC handling is still busted.
> > 
> > For a start writeback_inodes_sb_if_idle() is a no-op if the device
> > isn't idle!  Secondly, if the device _was_ idle,
> > writeback_inodes_sb_if_idle() uses a work handoff to another thread,
> > which means that the work might not get executed for another six weeks.
> > 
> > So no, your ENOSPC handling is still busted and I'll be doing you a
> > favour when I send that parport patch.
> 
> Btrfs uses it with this cool looping construct. It's an innovative
> combination of while, 1, schedule_timeout(),  and if all goes well, break.

If the calling code can do that then it doesn't need to pass the work
off to another thread at all.  Just sychronously call
writeback_inodes_sb(), then bye-bye goes writeback_inodes_sb_if_idle(),
to great sighs of relief.

And again: as btrfs is effectively making a synchronous call to
writeback_inodes_sb() via schedule(), then surely it does not need to
take s_umount to protect its own darn superblock!!

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-18 Thread Andrew Morton
On Thu, 18 Nov 2010 13:02:43 -0600
Eric Sandeen  wrote:

> On 11/18/10 12:36 PM, Andrew Morton wrote:
> > On Thu, 18 Nov 2010 12:04:21 -0600 Eric Sandeen  wrote:
> > 
> >> On 11/18/10 11:10 AM, Andrew Morton wrote:
> >>> On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen  
> >>> wrote:
> >>>
> >>>>> Can we just delete writeback_inodes_sb_nr_if_idle() and
> >>>>> writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> >>>>> pretty handwavy - do we know that deleting these things would make a
> >>>>> jot of difference?
> >>>>
> >>>> Really?  I thought it was pretty decent ;)
> >>>>
> >>>> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems."
> >>>> shows the problem clearly, IIRC.  I should have included that in the
> >>>> changelog, I suppose, sorry.
> >>>
> >>> Your email didn't really impart any information :(
> >>>
> >>> I suppose I could accidentally delete those nasty little functions in a
> >>> drivers/parport patch then wait and see if anyone notices.
> >>>
> >>
> >> Um, ok, then, to answer the question directly :
> >>
> >> No, please don't delete those functions, it will break ENOSPC handling
> >> in ext4 as shown by xfstests regression test #204 ...
> >>
> > 
> > If those functions "fix" a testcase then it was by sheer luck, and the
> > fs's ENOSPC handling is still busted.
> > 
> > For a start writeback_inodes_sb_if_idle() is a no-op if the device
> > isn't idle!  
> 
> so writeback is already in progress and it's already doing what we need,
> right?  Space is being freed up as we speak in that case.

With no guarantee that it's being freed at a sufficient rate.

> > Secondly, if the device _was_ idle,
> > writeback_inodes_sb_if_idle() uses a work handoff to another thread,
> > which means that the work might not get executed for another six weeks.
> 
> We start it quite early, before things are critical.
> 
> Yeah, it's not bulletproof but it is tons better.

Translation: "it papers over a bug".

Look, if this was a little best-effort poke-writeback-now performance
tweak then fine.  But as an attempt to prevent a synchronous and bogus
ENOSPC error it's just hopeless.

Guys, fix the thing for real, and take that hack out.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-18 Thread Andrew Morton
On Thu, 18 Nov 2010 12:04:21 -0600 Eric Sandeen  wrote:

> On 11/18/10 11:10 AM, Andrew Morton wrote:
> > On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen  wrote:
> > 
> >>> Can we just delete writeback_inodes_sb_nr_if_idle() and
> >>> writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> >>> pretty handwavy - do we know that deleting these things would make a
> >>> jot of difference?
> >>
> >> Really?  I thought it was pretty decent ;)
> >>
> >> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems."
> >> shows the problem clearly, IIRC.  I should have included that in the
> >> changelog, I suppose, sorry.
> > 
> > Your email didn't really impart any information :(
> > 
> > I suppose I could accidentally delete those nasty little functions in a
> > drivers/parport patch then wait and see if anyone notices.
> > 
> 
> Um, ok, then, to answer the question directly :
> 
> No, please don't delete those functions, it will break ENOSPC handling
> in ext4 as shown by xfstests regression test #204 ...
> 

If those functions "fix" a testcase then it was by sheer luck, and the
fs's ENOSPC handling is still busted.

For a start writeback_inodes_sb_if_idle() is a no-op if the device
isn't idle!  Secondly, if the device _was_ idle,
writeback_inodes_sb_if_idle() uses a work handoff to another thread,
which means that the work might not get executed for another six weeks.

So no, your ENOSPC handling is still busted and I'll be doing you a
favour when I send that parport patch.

No?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-18 Thread Andrew Morton
On Thu, 18 Nov 2010 19:18:22 +1100 Nick Piggin  wrote:

> On Wed, Nov 17, 2010 at 10:28:34PM -0800, Andrew Morton wrote:
>  
> > Logically I'd expect i_mutex to nest inside s_umount.  Because s_umount
> > is a per-superblock thing, and i_mutex is a per-file thing, and files
> > live under superblocks.  Nesting s_umount outside i_mutex creates
> > complex deadlock graphs between the various i_mutexes, I think.
> 
> You mean i_mutex outside s_umount?
> 

Nope.  i_mutex should nest inside s_umount.  Just as inodes nest inside
superblocks!  Seems logical to me ;)

> > Someone tell me if btrfs has the same bug, via its call to
> > writeback_inodes_sb_nr_if_idle()?
> > 
> > I don't see why these functions need s_umount at all, if they're called
> > from within ->write_begin against an inode on that superblock.  If the
> > superblock can get itself disappeared while we're running ->write_begin
> > on it, we have problems, no?
> > 
> > In which case I'd suggest just removing the down_read(s_umount) and
> > specifying that the caller must pin the superblock via some means.
> > 
> > Only we can't do that because we need to hold s_umount until the
> > bdi_queue_work() worker has done its work.
> > 
> > The fact that a call to ->write_begin can randomly return with s_umount
> > held, to be randomly released at some random time in the future is a
> > bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> > thing.
> 
> Yeah that whole writeback_inodes_if_idle is nasty
> 
>  
> > It'd be better if we pinned these superblocks via refcounting, not via
> > holding s_umount but even then, having ->write_begin randomly bump sb
> > refcounts for random periods of time is still pretty ugly.
> > 
> > What a pickle.
> > 
> > Can we just delete writeback_inodes_sb_nr_if_idle() and
> > writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> > pretty handwavy - do we know that deleting these things would make a
> > jot of difference?
> > 
> > And why _do_ we need to hold s_umount during the bdi_queue_work()
> > handover?  Would simply bumping s_count suffice?
> 
> s_count just prevents it from going away, but s_umount is still needed
> to keep umount, remount,ro, freezing etc activity away. I don't think
> there is an easy way to do it.
> 
> Perhaps filesystem should have access to the dirty throttling path, kick
> writeback or delayed allocation etc as needed, and throttle against
> outstanding work that needs to be done, going through the normal
> writeback paths?

I just cannot believe that we need s_mount inside ->write_begin.  Is it
really the case that someone can come along and unmount or remount or
freeze our filesystem while some other process is down performing a
->write_begin against one of its files?  Kidding?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-18 Thread Andrew Morton
On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen  wrote:

> > Can we just delete writeback_inodes_sb_nr_if_idle() and
> > writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> > pretty handwavy - do we know that deleting these things would make a
> > jot of difference?
> 
> Really?  I thought it was pretty decent ;)
> 
> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems."
> shows the problem clearly, IIRC.  I should have included that in the
> changelog, I suppose, sorry.

Your email didn't really impart any information :(

I suppose I could accidentally delete those nasty little functions in a
drivers/parport patch then wait and see if anyone notices.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-17 Thread Andrew Morton
On Thu, 18 Nov 2010 17:00:00 +1100 Nick Piggin  wrote:

> On Wed, Nov 17, 2010 at 07:29:00PM -0800, Andrew Morton wrote:
> > On Wed, 17 Nov 2010 22:06:13 -0500 "Ted Ts'o"  wrote:
> > 
> > > On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote:
> > > > On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:
> > > > > On 11/16/10 10:38 PM, Nick Piggin wrote:
> > > > > >> as for the locking problems ... sorry about that!
> > > > > > 
> > > > > > That's no problem. So is that an ack? :)
> > > > > > 
> > > > > 
> > > > > I'd like to test it with the original case it was supposed to solve; 
> > > > > will
> > > > > do that tomorrow.
> > > > 
> > > > OK, but it shouldn't make much difference, unless there is a lot of
> > > > strange activity happening on the sb (like mount / umount / remount /
> > > > freeze / etc).
> > > 
> > > This makes sense to me as well.
> > > 
> > > Acked-by: "Theodore Ts'o" 
> > > 
> > > So how do we want to send this patch to Linus?  It's a writeback
> > > change, so through some mm tree?
> > 
> > It's in my todo pile.  Even though the patch sucks, but not as much as
> > its changelog does.  Am not particularly happy merging an alleged
> > bugfix where the bug is, and I quote, "I saw a lock order warning on
> > ext4 trigger".  I mean, wtf?  How is anyone supposed to review the code
> > based on that??  Or to understand it a year from now?
> 
> Sorry bout the confusion, it was supposed to be "i_mutex", and then it
> would have been a bit more obvious.
> 
> 
> > When I get to it I'll troll this email thread and might be able to
> > kludge together a description which might be able to fool people into
> > thinking it makes sense.
> 
> "Lock order reversal between s_umount and i_mutex".
> 
> i_mutex nests inside s_umount in some writeback paths (it was the end
> io handler to convert unwritten extents IIRC). But hmm, wouldn't that
> be a bug? We aren't allowed to take i_mutex inside writeback, are we?

I'm not sure that s_umount versus i_mutex has come up before.

Logically I'd expect i_mutex to nest inside s_umount.  Because s_umount
is a per-superblock thing, and i_mutex is a per-file thing, and files
live under superblocks.  Nesting s_umount outside i_mutex creates
complex deadlock graphs between the various i_mutexes, I think.

Someone tell me if btrfs has the same bug, via its call to
writeback_inodes_sb_nr_if_idle()?

I don't see why these functions need s_umount at all, if they're called
from within ->write_begin against an inode on that superblock.  If the
superblock can get itself disappeared while we're running ->write_begin
on it, we have problems, no?

In which case I'd suggest just removing the down_read(s_umount) and
specifying that the caller must pin the superblock via some means.

Only we can't do that because we need to hold s_umount until the
bdi_queue_work() worker has done its work.

The fact that a call to ->write_begin can randomly return with s_umount
held, to be randomly released at some random time in the future is a
bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
thing.

It'd be better if we pinned these superblocks via refcounting, not via
holding s_umount but even then, having ->write_begin randomly bump sb
refcounts for random periods of time is still pretty ugly.

What a pickle.

Can we just delete writeback_inodes_sb_nr_if_idle() and
writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
pretty handwavy - do we know that deleting these things would make a
jot of difference?

And why _do_ we need to hold s_umount during the bdi_queue_work()
handover?  Would simply bumping s_count suffice?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-17 Thread Andrew Morton
On Wed, 17 Nov 2010 22:06:13 -0500 "Ted Ts'o"  wrote:

> On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote:
> > On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:
> > > On 11/16/10 10:38 PM, Nick Piggin wrote:
> > > >> as for the locking problems ... sorry about that!
> > > > 
> > > > That's no problem. So is that an ack? :)
> > > > 
> > > 
> > > I'd like to test it with the original case it was supposed to solve; will
> > > do that tomorrow.
> > 
> > OK, but it shouldn't make much difference, unless there is a lot of
> > strange activity happening on the sb (like mount / umount / remount /
> > freeze / etc).
> 
> This makes sense to me as well.
> 
> Acked-by: "Theodore Ts'o" 
> 
> So how do we want to send this patch to Linus?  It's a writeback
> change, so through some mm tree?

It's in my todo pile.  Even though the patch sucks, but not as much as
its changelog does.  Am not particularly happy merging an alleged
bugfix where the bug is, and I quote, "I saw a lock order warning on
ext4 trigger".  I mean, wtf?  How is anyone supposed to review the code
based on that??  Or to understand it a year from now?

When I get to it I'll troll this email thread and might be able to
kludge together a description which might be able to fool people into
thinking it makes sense.

And someone at least needs to fix the comment: s/i_lock/i_mutex/.  I
guess that would be me.

>  Or it lives in fs/fs-writeback.c
> (which I always thought was weird; why is it not in mm/?),

The idea was that the writeback walk goes
superblocks->inodes->address_spaces->pages->bios->drivers, so that code
needs to be spread over fs->mm->block->drivers.  The dividing line
between fs and block is approximately at the "address_spaces" mark. 
Yes, that line is a bit smeary.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] fix up lock order reversal in writeback

2010-11-16 Thread Andrew Morton
On Tue, 16 Nov 2010 22:00:58 +1100
Nick Piggin  wrote:

> I saw a lock order warning on ext4 trigger. This should solve it.

Send us the trace, please.

The code comment implies that someone is calling down_read() under
i_lock?  That would be bad, and I'd expect it to have produced a
might_sleep() warning, not a lockdep trace.  

And I don't see how we can call writeback_inodes_sb() under i_lock
anyway, so I don't really have a clue what's going on here!

> Raciness shouldn't matter much, because writeback can stop just
> after we make the test and return anyway (so the API is racy anyway).
> 
> Signed-off-by: Nick Piggin 
> 
> Index: linux-2.6/fs/fs-writeback.c
> ===
> --- linux-2.6.orig/fs/fs-writeback.c  2010-11-16 21:44:32.0 +1100
> +++ linux-2.6/fs/fs-writeback.c   2010-11-16 21:49:37.0 +1100
> @@ -1125,16 +1125,20 @@ EXPORT_SYMBOL(writeback_inodes_sb);
>   *
>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>   * Returns 1 if writeback was started, 0 if not.
> + *
> + * May be called inside i_lock. May not start writeback if locks cannot
> + * be acquired.
>   */
>  int writeback_inodes_sb_if_idle(struct super_block *sb)
>  {
>   if (!writeback_in_progress(sb->s_bdi)) {
> - down_read(&sb->s_umount);
> - writeback_inodes_sb(sb);
> - up_read(&sb->s_umount);
> - return 1;
> - } else
> - return 0;
> + if (down_read_trylock(&sb->s_umount)) {
> + writeback_inodes_sb(sb);
> + up_read(&sb->s_umount);
> + return 1;
> + }
> + }
> + return 0;

And it's pretty generous to describe a s/down_read/down_read_trylock/
as a "fix".  Terms like "bandaid" and "workaround" come to mind.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-23 Thread Andrew Morton

(Cc sta...@kernel.org)

On Wed, 22 Sep 2010 21:54:30 -0300
Cesar Eduardo Barros  wrote:

> This started appearing for me on v2.6.36-rc5-49-gc79bd89; it did not 
> happen on v2.6.36-rc5-33-g1ce1e41, probably because it does not have 
> commit 692ebd17c2905313fff3c504c249c6a0faad16ec which introduces the 
> warning.
> 
> [...]
> device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342 
> /dev/mapper/vg_cesarbinspiro-lv_home
> SELinux: initialized (dev dm-3, type btrfs), uses xattr
> [ cut here ]
> WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d()
> Hardware name: Inspiron N4010
> Dirtiable inode bdi default != sb bdi btrfs
> Modules linked in: ipv6 kvm_intel kvm uinput arc4 ecb 
> snd_hda_codec_intelhdmi snd_hda_codec_realtek iwlagn snd_hda_intel 
> iwlcore snd_hda_codec uvcvideo snd_hwdep mac80211 videodev snd_seq 
> snd_seq_device v4l1_compat snd_pcm atl1c v4l2_compat_ioctl32 btusb 
> cfg80211 snd_timer i2c_i801 bluetooth iTCO_wdt dell_wmi dell_laptop snd 
> pcspkr wmi dcdbas shpchp iTCO_vendor_support soundcore snd_page_alloc 
> rfkill joydev microcode btrfs zlib_deflate libcrc32c cryptd aes_x86_64 
> aes_generic xts gf128mul dm_crypt usb_storage i915 drm_kms_helper drm 
> i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
> Pid: 1073, comm: find Not tainted 2.6.36-rc5+ #8
> Call Trace:
>   [] warn_slowpath_common+0x85/0x9d
>   [] warn_slowpath_fmt+0x46/0x48
>   [] inode_to_bdi+0x62/0x6d
>   [] __mark_inode_dirty+0xd0/0x177
>   [] touch_atime+0x107/0x12a
>   [] ? filldir+0x0/0xd0
>   [] vfs_readdir+0x8d/0xb4
>   [] sys_getdents+0x81/0xd1
>   [] system_call_fastpath+0x16/0x1b

Thanks.  692ebd17c2905313fff3c504c249c6a0faad16ec had a cc:stable in
the changelog.  I'd suggest it not be merged into -stable until this
regression is sorted out!

>
> ...
> 
> These are only the first two; they keep happening, with several 
> different processes (it is not only find), but always btrfs (I have / 
> and /home on btrfs and /boot on ext4). In case it makes any difference, 
> /dev seems to be on devtmpfs, and everything uses relatime.
> 
> There is also the following one which is not on vfs_readdir:
> 
> [ cut here ]
> WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d()
> Hardware name: Inspiron N4010
> Dirtiable inode bdi default != sb bdi btrfs
> Modules linked in: ipv6 kvm_intel kvm uinput arc4 ecb 
> snd_hda_codec_intelhdmi snd_hda_codec_realtek iwlagn snd_hda_intel 
> iwlcore snd_hda_codec uvcvideo snd_hwdep mac80211 videodev snd_seq 
> snd_seq_device v4l1_compat snd_pcm atl1c v4l2_compat_ioctl32 btusb 
> cfg80211 snd_timer i2c_i801 bluetooth iTCO_wdt dell_wmi dell_laptop snd 
> pcspkr wmi dcdbas shpchp iTCO_vendor_support soundcore snd_page_alloc 
> rfkill joydev microcode btrfs zlib_deflate libcrc32c cryptd aes_x86_64 
> aes_generic xts gf128mul dm_crypt usb_storage i915 drm_kms_helper drm 
> i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
> Pid: 1104, comm: mkdir Tainted: GW   2.6.36-rc5+ #8
> Call Trace:
>   [] warn_slowpath_common+0x85/0x9d
>   [] warn_slowpath_fmt+0x46/0x48
>   [] inode_to_bdi+0x62/0x6d
>   [] __mark_inode_dirty+0xd0/0x177
>   [] btrfs_setattr+0x210/0x23a [btrfs]
>   [] notify_change+0x1a2/0x29d
>   [] sys_fchmod+0xac/0xfd
>   [] ? sys_newfstat+0x2e/0x39
>   [] system_call_fastpath+0x16/0x1b

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Btrfs-devel] [patch] O_DIRECT: fix the splitting up of contiguous I/O

2010-09-05 Thread Andrew Morton
On Sun, 5 Sep 2010 22:56:08 +1000 Chris Samuel  wrote:

> On Fri, 3 Sep 2010 12:25:01 am Christoph Hellwig wrote:
> 
> > Andrew, can you please send this on to Linus and -stable ASAP?
> > It's causing massive problems for our users.
> 
> Did this patch get dropped ? 

Nope.  I have it queued for 2.6.36 and -stable, will send it in soon. 
Everything is a bit lagged at present due to conferences, summer,
encroaching senility, etc.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 2/7] Cleancache (was Transcendent Memory): core files

2010-06-02 Thread Andrew Morton
On Fri, 28 May 2010 10:35:50 -0700
Dan Magenheimer  wrote:

> [PATCH V2 2/7] Cleancache (was Transcendent Memory): core files
> 
> Cleancache core files.
> 
> Credits: Cleancache_ops design derived from Jeremy Fitzhardinge
> design for tmem; sysfs code modelled after mm/ksm.c
> 
> Note that CONFIG_CLEANCACHE defaults to on; all hooks devolve
> to a compare-pointer-to-NULL so performance impact should
> be negligible, but can be reduced to zero impact if config'ed off.
> 
> ...
>
> --- linux-2.6.34/include/linux/cleancache.h   1969-12-31 17:00:00.0 
> -0700
> +++ linux-2.6.34-cleancache/include/linux/cleancache.h2010-05-24 
> 18:14:33.0 -0600
> @@ -0,0 +1,90 @@
> +#ifndef _LINUX_CLEANCACHE_H
> +#define _LINUX_CLEANCACHE_H
> +
> +#include 
> +#include 
> +
> +#define CLEANCACHE_GET_PAGE_SUCCESS 1
> +
> +struct cleancache_ops {
> + int (*init_fs)(size_t);
> + int (*init_shared_fs)(char *uuid, size_t);
> + int (*get_page)(int, ino_t, pgoff_t, struct page *);
> + int (*put_page)(int, ino_t, pgoff_t, struct page *);
> + int (*flush_page)(int, ino_t, pgoff_t);
> + int (*flush_inode)(int, ino_t);
> + void (*flush_fs)(int);
> +};
> +
> +extern struct cleancache_ops *cleancache_ops;

Why does this exist?  If there's only ever one cleancache_ops
system-wide then we'd be better off doing

(*cleancache_ops.init_fs)()

and save a zillion pointer hops.

If instead there are different flavours of cleancache_ops then making
this pointer a system-wide singleton seems an odd decision.

>
> ...
>
> +int __cleancache_get_page(struct page *page)
> +{
> + int ret = 0;
> + int pool_id = page->mapping->host->i_sb->cleancache_poolid;
> +
> + if (pool_id >= 0) {
> + ret = (*cleancache_ops->get_page)(pool_id,
> +   page->mapping->host->i_ino,
> +   page->index,
> +   page);
> + if (ret == CLEANCACHE_GET_PAGE_SUCCESS)
> + succ_gets++;
> + else
> + failed_gets++;
> + }
> + return ret;
> +}
> +EXPORT_SYMBOL(__cleancache_get_page);

All these undocumeted functions would appear to be racy and buggy if
the passed-in page isn't locked.  But because they're undocumented, I
don't know if "the page must be locked" was an API requirement and I
ain't going to go and review all callers.

> +#ifdef CONFIG_SYSFS
> +
> +#define CLEANCACHE_ATTR_RO(_name) \
> + static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> +
> +static ssize_t succ_gets_show(struct kobject *kobj,
> +struct kobj_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%lu\n", succ_gets);
> +}
> +CLEANCACHE_ATTR_RO(succ_gets);
> +
> +static ssize_t failed_gets_show(struct kobject *kobj,
> +struct kobj_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%lu\n", failed_gets);
> +}
> +CLEANCACHE_ATTR_RO(failed_gets);
> +
> +static ssize_t puts_show(struct kobject *kobj,
> +struct kobj_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%lu\n", puts);
> +}
> +CLEANCACHE_ATTR_RO(puts);
> +
> +static ssize_t flushes_show(struct kobject *kobj,
> +struct kobj_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%lu\n", flushes);
> +}
> +CLEANCACHE_ATTR_RO(flushes);
> +
> +static struct attribute *cleancache_attrs[] = {
> + &succ_gets_attr.attr,
> + &failed_gets_attr.attr,
> + &puts_attr.attr,
> + &flushes_attr.attr,
> + NULL,
> +};
> +
> +static struct attribute_group cleancache_attr_group = {
> + .attrs = cleancache_attrs,
> + .name = "cleancache",
> +};
> +
> +#endif /* CONFIG_SYSFS */

Please completely document the sysfs API, preferably in the changelogs.
It's the first thing reviewers should look at, because it's one thing
we can never change.  And Documentation/ABI/ is a place for permanent
documentation.


> --- linux-2.6.34/mm/Kconfig   2010-05-16 15:17:36.0 -0600
> +++ linux-2.6.34-cleancache/mm/Kconfig2010-05-24 12:14:44.0 
> -0600
> @@ -287,3 +287,25 @@ config NOMMU_INITIAL_TRIM_EXCESS
> of 1 says that all excess pages should be trimmed.
>  
> See Documentation/nommu-mmap.txt for more information.
> +
> +config CLEANCACHE
> + bool "Enable cleancache pseudo-RAM driver to cache clean pages"
> + default y
> + help
> +   Cleancache can be thought of as a page-granularity victim cache
> +   for clean pages that the kernel's pageframe replacement algorithm
> +   (PFRA) would like to keep around, but can't since there isn't enough
> +   memory.  So when the PFRA "evicts" a page, it first attempts to put
> +   it into a synchronous concurrency-safe page-oriented pseudo-RAM
> +   device (such as Xen's Transcendent Memory, aka "tmem") which is no

Re: [PATCH 4/4] Btrfs: add basic DIO read/write support V3

2010-05-13 Thread Andrew Morton
On Thu, 13 May 2010 14:01:37 -0400
Josef Bacik  wrote:

> On Thu, May 13, 2010 at 11:26:39AM -0400, Christoph Hellwig wrote:
> > On Thu, May 13, 2010 at 11:31:45AM -0400, Josef Bacik wrote:
> > > AIO's aio_complete does kmap with KM_IRQ0/1 and it gets called in the same
> > > context as the btrfs completion handler, so if it's ok for aio_complete it
> > > should be ok for btrfs right?  Thanks,
> > 
> > aio_complete does a spin_lock_irqsave before that, which disables
> > interrupts on the local CPU.
> > 
> 
> Ok how about I just do
> 
> local_irq_disable()
> kmap(KM_IRQ0)
> local_irq_enable()
> 
> would that be acceptable?  Thanks,

yup.

local_irq_disable() (or local_irq_save())
kmap_atomic(KM_IRQx);

kunmap_atomic(KM_IRQx);
local_irq_enable() (or local_irq_restore()).

then perhaps flush_dcache_page().
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] Btrfs: add basic DIO read/write support V3

2010-05-13 Thread Andrew Morton
On Thu, 13 May 2010 11:31:45 -0400 Josef Bacik  wrote:

> On Thu, May 13, 2010 at 11:14:30AM -0400, Christoph Hellwig wrote:
> > On Wed, May 12, 2010 at 04:40:53PM -0400, Josef Bacik wrote:
> > > V1->V2
> > > -Use __blockdev_direct_IO instead of helper
> > > -Use KM_IRQ0 for kmap instead of KM_USER0
> > 
> > I still don't think this is correct.  The completion can come from
> > softirq and hardirq context, and possibly even normal process context.
> > You either need to check for all these, or you need to use the generic
> > complete in user context helper btrfs has available for other types
> > of I/O.
> > 
> 
> AIO's aio_complete does kmap with KM_IRQ0/1 and it gets called in the same
> context as the btrfs completion handler, so if it's ok for aio_complete it
> should be ok for btrfs right?  Thanks,
> 

Using KM_USERx from irq context is a horrid, horrid bug.

Using KM_IRQx with local interrupts enabled is a ditto.

It's OK to use KM_IRQx from process context as long as local interrupts
are disabled.

kmap slots are just per-cpu variables.  Each one has a particular usage
convention: process-context-only, must-be-irq-safe, etc.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] adaptive spinning mutexes

2009-01-14 Thread Andrew Morton
On Wed, 14 Jan 2009 22:14:58 +0100
Ingo Molnar  wrote:

> 
> * Andrew Morton  wrote:
> 
> > On Wed, 14 Jan 2009 21:51:22 +0100
> > Ingo Molnar  wrote:
> > 
> > > 
> > > * Andrew Morton  wrote:
> > > 
> > > > > > Do people enable CONFIG_SCHED_DEBUG?
> > > > > 
> > > > > If they suspect performance problems and want to analyze them?
> > > > 
> > > > The vast majority of users do not and usually cannot compile their own 
> > > > kernels.
> > > 
> > > ... which they derive from distro kernels or some old .config they always 
> > > used, via 'make oldconfig'. You are arguing against well-established 
> > > facts 
> > > here.
> > > 
> > > If you dont believe my word for it, here's an analysis of all kernel 
> > > configs posted to lkml in the past 8 months:
> > > 
> > >$ grep ^CONFIG_SCHED_DEBUG linux-kernel | wc -l
> > >424
> > > 
> > >$ grep 'CONFIG_SCHED_DEBUG is not' linux-kernel | wc -l
> > >109
> > > 
> > > i.e. CONFIG_SCHED_DEBUG=y is set in 80% of the configs. A large majority 
> > > of testers has it enabled and /sys/debug/sched_features was always a good 
> > > mechanism that we used for runtime toggles.
> > 
> > You just disproved your own case :(
> 
> how so? 80% is not enough?

No.

It really depends on what distros do.

> I also checked Fedora and it has SCHED_DEBUG=y 
> in its kernel rpms.

If all distros set SCHED_DEBUG=y then fine.

But if they do this then we should do this at the kernel.org level, and
make it a hard-to-turn-off thing via CONFIG_EMBEDDED=y.

> note that there's also a performance issue here: we generally _dont want_ 
> a debug sysctl overhead in the mutex code or in any fastpath for that 
> matter. So making it depend on SCHED_DEBUG is useful.
> 
> sched_feat() features get optimized out at build time when SCHED_DEBUG is 
> disabled. So it gives us the best of two worlds: the utility of sysctls in 
> the SCHED_DEBUG=y, and they get compiled out in the !SCHED_DEBUG case.

I'm not detecting here a sufficient appreciation of the number of
sched-related regressions we've seen in recent years, nor of the
difficulty encountered in diagnosing and fixing them.  Let alone
the difficulty getting those fixes propagated out a *long* time
after the regression was added.

You're taking a whizzy new feature which drastically changes a critical
core kernel feature and jamming it into mainline with a vestigial
amount of testing coverage without giving sufficient care and thought
to the practical lessons which we have learned from doing this in the
past.

This is a highly risky change.  It's not that the probability of
failure is high - the problem is that the *cost* of the improbable
failure is high.  We should seek to minimize that cost.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] adaptive spinning mutexes

2009-01-14 Thread Andrew Morton
On Wed, 14 Jan 2009 21:51:22 +0100
Ingo Molnar  wrote:

> 
> * Andrew Morton  wrote:
> 
> > > > Do people enable CONFIG_SCHED_DEBUG?
> > > 
> > > If they suspect performance problems and want to analyze them?
> > 
> > The vast majority of users do not and usually cannot compile their own 
> > kernels.
> 
> ... which they derive from distro kernels or some old .config they always 
> used, via 'make oldconfig'. You are arguing against well-established facts 
> here.
> 
> If you dont believe my word for it, here's an analysis of all kernel 
> configs posted to lkml in the past 8 months:
> 
>$ grep ^CONFIG_SCHED_DEBUG linux-kernel | wc -l
>424
> 
>$ grep 'CONFIG_SCHED_DEBUG is not' linux-kernel | wc -l
>109
> 
> i.e. CONFIG_SCHED_DEBUG=y is set in 80% of the configs. A large majority 
> of testers has it enabled and /sys/debug/sched_features was always a good 
> mechanism that we used for runtime toggles.

You just disproved your own case :(

> > > Note that CONFIG_SCHED_DEBUG=y is also the default.
> > 
> > akpm:/usr/src/25> echo $ARCH
> > x86_64
> > akpm:/usr/src/25> make defconfig
> > *** Default configuration is based on 'x86_64_defconfig'
> 
> x86 defconfig is used too, but it's a pretty rare usage.
> 
> Under default i mean the customary meaning of default config: it's the 
> default if you come via 'make oldconfig' or if you derive your config from 
> a distro config:
> 
> | config SCHED_DEBUG
> |bool "Collect scheduler debugging info"
> |depends on DEBUG_KERNEL && PROC_FS
> |default y
> 

This simply isn't reliable.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] adaptive spinning mutexes

2009-01-14 Thread Andrew Morton
On Wed, 14 Jan 2009 21:27:36 +0100
Ingo Molnar  wrote:

> 
> * Peter Zijlstra  wrote:
> 
> > On Wed, 2009-01-14 at 11:36 -0800, Andrew Morton wrote:
> > 
> > > Do people enable CONFIG_SCHED_DEBUG?
> > 
> > Well, I have it always enabled, but I've honestly no idea if that makes
> > me weird.
> > 
> > > CONFIG_DEBUG_MUTEXES=n, CONFIG_SCHED_DEBUG=y is getting to be a pretty 
> > > small subset?
> > 
> > Could be, do you fancy me doing a sysctl? shouldn't be hard.
> 
> i dunno, why another fancy sysctl for something that fits quite nicely 
> into the existing sched_features scheme that we've been using for such 
> purposes for the past 3-4 kernel releases?
> 
> we always provided various toggles for new scheduler features via 
> /sys/debug/sched_features, so that people can do performance regression 
> testing, and it works quite well.
> 

If we know that this control will be reliably available in packaged
kernels then fine.  But how to we arrange for that?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] adaptive spinning mutexes

2009-01-14 Thread Andrew Morton
On Wed, 14 Jan 2009 21:14:35 +0100
Ingo Molnar  wrote:

> 
> * Andrew Morton  wrote:
> 
> > On Wed, 14 Jan 2009 20:00:08 +0100
> > Ingo Molnar  wrote:
> > 
> > > 
> > > * Andrew Morton  wrote:
> > > 
> > > > On Wed, 14 Jan 2009 19:33:19 +0100 Ingo Molnar  wrote:
> > > > 
> > > > > Please pull the adaptive-mutexes-for-linus git tree
> > > > 
> > > > 
> > > > 
> > > > - It seems a major shortcoming that the feature is disabled if
> > > >   CONFIG_DEBUG_MUTEXES=y.  It means that lots of people won't test it.
> > 
> > ^^^?
> >  
> > > > - When people hit performance/latency oddities, it would be nice if
> > > >   they had a /proc knob with which they could disable this feature at
> > > >   runtime.
> > > > 
> > > >   This would also be useful for comparative performance testing.
> > > 
> > > Yeah. From my other mail:
> > > 
> > > > > We still have the /sys/debug/sched_features tunable under 
> > > > > CONFIG_SCHED_DEBUG=y, so should this cause any performance 
> > > > > regressions 
> > > > > somewhere, it can be pinned down and blamed back on this change 
> > > > > easily, without bisection and without rebooting the box.
> > > 
> > > This kind of easy knob was included early on - this is how all those spin 
> > > versus no-spin numbers were done.
> > 
> > Do people enable CONFIG_SCHED_DEBUG?
> 
> If they suspect performance problems and want to analyze them?

The vast majority of users do not and usually cannot compile their own
kernels.

> Note that CONFIG_SCHED_DEBUG=y is also the default.

akpm:/usr/src/25> echo $ARCH
x86_64
akpm:/usr/src/25> make defconfig
*** Default configuration is based on 'x86_64_defconfig'
#
# configuration written to .config
#
akpm:/usr/src/25> grep SCHED_DEBUG .config 
# CONFIG_SCHED_DEBUG is not set

> > CONFIG_DEBUG_MUTEXES=n, CONFIG_SCHED_DEBUG=y is getting to be a pretty 
> > small subset?
> 
> Those two are the default config settings actually, so i'd expect it to be 
> the most commonly occuring combinations.

akpm:/usr/src/25> grep DEBUG_MUTEXES .config
# CONFIG_DEBUG_MUTEXES is not set

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] adaptive spinning mutexes

2009-01-14 Thread Andrew Morton
On Wed, 14 Jan 2009 20:50:50 +0100
Peter Zijlstra  wrote:

> On Wed, 2009-01-14 at 11:36 -0800, Andrew Morton wrote:
> 
> > Do people enable CONFIG_SCHED_DEBUG?
> 
> Well, I have it always enabled, but I've honestly no idea if that makes
> me weird.

It'd be weird if you're not weird.

> > CONFIG_DEBUG_MUTEXES=n, CONFIG_SCHED_DEBUG=y is getting to be a pretty
> > small subset?
> 
> Could be, do you fancy me doing a sysctl? shouldn't be hard.

The /sys/debug knob depends upon CONFIG_SCHED_DEBUG, I assume?

umm, yes, I do think that it would be prudent to make this control
unconditionally available.  Experience tells us that any regressions
which this change causes could take a long time to turn up - sometimes
years.  By which time the people who are reporting the regressions are
running packaged kernels, and if that packaged kernel didn't enable
CONFIG_SCHED_DEBUG, we're a bit screwed.

Did we end up deciding to remove the CONFIG_DEBUG_MUTEXES=n dependency?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] adaptive spinning mutexes

2009-01-14 Thread Andrew Morton
On Wed, 14 Jan 2009 20:00:08 +0100
Ingo Molnar  wrote:

> 
> * Andrew Morton  wrote:
> 
> > On Wed, 14 Jan 2009 19:33:19 +0100 Ingo Molnar  wrote:
> > 
> > > Please pull the adaptive-mutexes-for-linus git tree
> > 
> > 
> > 
> > - It seems a major shortcoming that the feature is disabled if
> >   CONFIG_DEBUG_MUTEXES=y.  It means that lots of people won't test it.

^^^?
 
> > - When people hit performance/latency oddities, it would be nice if
> >   they had a /proc knob with which they could disable this feature at
> >   runtime.
> > 
> >   This would also be useful for comparative performance testing.
> 
> Yeah. From my other mail:
> 
> > > We still have the /sys/debug/sched_features tunable under 
> > > CONFIG_SCHED_DEBUG=y, so should this cause any performance regressions 
> > > somewhere, it can be pinned down and blamed back on this change 
> > > easily, without bisection and without rebooting the box.
> 
> This kind of easy knob was included early on - this is how all those spin 
> versus no-spin numbers were done.

Do people enable CONFIG_SCHED_DEBUG?

CONFIG_DEBUG_MUTEXES=n, CONFIG_SCHED_DEBUG=y is getting to be a pretty
small subset?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] adaptive spinning mutexes

2009-01-14 Thread Andrew Morton
On Wed, 14 Jan 2009 19:33:19 +0100 Ingo Molnar  wrote:

> Please pull the adaptive-mutexes-for-linus git tree



- It seems a major shortcoming that the feature is disabled if
  CONFIG_DEBUG_MUTEXES=y.  It means that lots of people won't test it.

- When people hit performance/latency oddities, it would be nice if
  they had a /proc knob with which they could disable this feature at
  runtime.

  This would also be useful for comparative performance testing.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs for mainline

2009-01-11 Thread Andrew Morton

http://bugzilla.kernel.org/show_bug.cgi?id=12435

Congratulations ;)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Andrew Morton
On Sat, 10 Jan 2009 02:01:25 +0100 Ingo Molnar  wrote:

> 
> * Linus Torvalds  wrote:
> 
> > On Sat, 10 Jan 2009, Ingo Molnar wrote:
> > > 
> > > may_inline/inline_hint is a longer, less known and uglier keyword.
> > 
> > Hey, your choice, should you decide to accept it, is to just get rid of 
> > them entirely.
> > 
> > You claim that we're back to square one, but that's simply the way 
> > things are. Either "inline" means something, or it doesn't. You argue 
> > for it meaning nothing. I argue for it meaning something.
> > 
> > If you want to argue for it meaning nothing, then REMOVE it, instead of 
> > breaking it.
> > 
> > It really is that simple. Remove the inlines you think are wrong. 
> > Instead of trying to change the meaning of them.
> 
> Well, it's not totally meaningless. To begin with, defining 'inline' to 
> mean 'always inline' is a Linux kernel definition. So we already changed 
> the behavior - in the hope of getting it right most of the time and in the 
> hope of thus improving the kernel.
> 
> And now it appears that in our quest of improving the kernel we can 
> further tweak that (already non-standard) meaning to a weak "inline if the 
> compiler agrees too" hint. That gives us an even more compact kernel. It 
> also moves the meaning of 'inline' closer to what the typical programmer 
> expects it to be - for better or worse.
> 
> We could remove them completely, but there are a couple of practical 
> problems with that:
> 
>  - In this cycle alone, in the past ~2 weeks we added another 1300 inlines
>to the kernel.

Who "reviewed" all that?

> Do we really want periodic postings of:
> 
>   [PATCH 0/135] inline removal cleanups
> 
>... in the next 10 years? We have about 20% of all functions in the 
>kernel marked with 'inline'. It is a _very_ strong habit. Is it worth 
>fighting against it?

A side-effect of the inline fetish is that a lot of it goes into header
files, thus requiring that those header files #include lots of other
headers, thus leading to, well, the current mess.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Andrew Morton
On Fri, 9 Jan 2009 04:35:31 +0100 Andi Kleen  wrote:

> On Thu, Jan 08, 2009 at 05:44:25PM -0800, H. Peter Anvin wrote:
> > Harvey Harrison wrote:
> > >>
> > >> We might still try the second or third options, as i think we shouldnt 
> > >> go 
> > >> back into the business of managing the inline attributes of ~100,000 
> > >> kernel functions.
> > > 
> > > Or just make it clear that inline shouldn't (unless for a very good 
> > > reason)
> > > _ever_ be used in a .c file.
> > > 
> > 
> > The question is if that would produce acceptable quality code.  In
> > theory it should, but I'm more than wondering if it really will.
> 
> I actually often use noinline when developing code simply because it 
> makes it easier to read oopses when gcc doesn't inline ever static
> (which it normally does if it only has a single caller). You know
> roughly where it crashed without having to decode the line number.
> 
> I believe others do that too, I notice it's all over btrfs for example.
> 

Plus there is the problem where

foo()
{
char a[1000];
}

bar()
{
char a[1000];
}

zot()
{
foo();
bar();
}

uses 2000 bytes of stack.

Fortunately scripts/checkstack.pl can find these.

If someone runs it.

With the right kconfig settings.

And with the right compiler version.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Andrew Morton
On Wed, 7 Jan 2009 22:32:22 +0100
Ingo Molnar  wrote:

> > We could do the whole "oldfs = get_fs(); set_fs(KERNEL_DS); .. 
> > set_fs(oldfs);" crud, but it would probably be better to just add an 
> > architected accessor. Especially since it's going to generally just be a
> > 
> > #define get_kernel_careful(val,p) __get_user(val,p)
> > 
> > for most architectures.
> > 
> > We've needed that before (and yes, we've simply mis-used __get_user() on 
> > x86 before rather than add it).
> 
> for the oldfs stuff we already have probe_kernel_read(). OTOH, that 
> involves pagefault_disable() which is an atomic op

tisn't.  pagefault_disable() is just preempt_count()+=1;barrier() ?

Am suspecting that you guys might be over-optimising this
contended-path-were-going-to-spin-anyway code?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning

2009-01-07 Thread Andrew Morton
On Wed, 7 Jan 2009 22:37:40 +0100
Andi Kleen  wrote:

> > But we can do that with __get_user(thread_info->cpu) (very unlikely page 
> > fault protection due to the possibility of CONFIG_DEBUG_PAGEALLOC) and 
> > then validating the cpu. It it's in range, we can use it and verify 
> > whether cpu_rq(cpu)->curr has that thread_info.
> > 
> > So we can do all that locklessly and optimistically, just going back and 
> > verifying the results later. This is why "thread_info" is actually a 
> > better thing to use than "task_struct" - we can look up the cpu in it with 
> > a simple dereference. We knew the pointer _used_ to be valid, so in any 
> > normal situation, it will never page fault (and if you have 
> > CONFIG_DEBUG_PAGEALLOC and hit a very unlucky race, then performance isn't 
> > your concern anyway: we just need to make the page fault be non-lethal ;)
> 
> The problem with probe_kernel_address() is that it does lots of
> operations around the access in the hot path (set_fs, pagefault_disable 
> etc.), 
> so i'm not sure that's a good idea. 

probe_kernel_address() isn't to bad - a few reads and writes into
the task_struct and thread_struct.  And we're on the slow, contended
path here anyway..
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC]: mutex: adaptive spin

2009-01-06 Thread Andrew Morton
On Tue, 06 Jan 2009 12:40:31 +0100 Peter Zijlstra  wrote:

> Subject: mutex: adaptive spin
> From: Peter Zijlstra 
> Date: Tue Jan 06 12:32:12 CET 2009
> 
> Based on the code in -rt, provide adaptive spins on generic mutexes.
> 

How dumb is it to send a lump of uncommented, changelogged code as an
rfc, only to have Ingo reply, providing the changelog for you?

Sigh.

> --- linux-2.6.orig/kernel/mutex.c
> +++ linux-2.6/kernel/mutex.c
> @@ -46,6 +46,7 @@ __mutex_init(struct mutex *lock, const c
>   atomic_set(&lock->count, 1);
>   spin_lock_init(&lock->wait_lock);
>   INIT_LIST_HEAD(&lock->wait_list);
> + lock->owner = NULL;
>  
>   debug_mutex_init(lock, name, key);
>  }
> @@ -120,6 +121,28 @@ void __sched mutex_unlock(struct mutex *
>  
>  EXPORT_SYMBOL(mutex_unlock);
>  
> +#ifdef CONFIG_SMP
> +static int adaptive_wait(struct mutex_waiter *waiter,
> +  struct task_struct *owner, long state)
> +{
> + for (;;) {
> + if (signal_pending_state(state, waiter->task))
> + return 0;
> + if (waiter->lock->owner != owner)
> + return 0;
> + if (!task_is_current(owner))
> + return 1;
> + cpu_relax();
> + }
> +}

Each of the tests in this function should be carefully commented.  It's
really the core piece of the design.

> +#else
> +static int adaptive_wait(struct mutex_waiter *waiter,
> +  struct task_struct *owner, long state)
> +{
> + return 1;
> +}
> +#endif
> +
>  /*
>   * Lock a mutex (possibly interruptible), slowpath:
>   */
> @@ -127,7 +150,7 @@ static inline int __sched
>  __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>   unsigned long ip)
>  {
> - struct task_struct *task = current;
> + struct task_struct *owner, *task = current;
>   struct mutex_waiter waiter;
>   unsigned int old_val;
>   unsigned long flags;
> @@ -141,6 +164,7 @@ __mutex_lock_common(struct mutex *lock, 
>   /* add waiting tasks to the end of the waitqueue (FIFO): */
>   list_add_tail(&waiter.list, &lock->wait_list);
>   waiter.task = task;
> + waiter.lock = lock;
>  
>   old_val = atomic_xchg(&lock->count, -1);
>   if (old_val == 1)
> @@ -175,11 +199,19 @@ __mutex_lock_common(struct mutex *lock, 
>   debug_mutex_free_waiter(&waiter);
>   return -EINTR;
>   }
> - __set_task_state(task, state);
>  
> - /* didnt get the lock, go to sleep: */
> + owner = lock->owner;

What prevents *owner from exitting right here?

> + get_task_struct(owner);
>   spin_unlock_mutex(&lock->wait_lock, flags);
> - schedule();
> +
> + if (adaptive_wait(&waiter, owner, state)) {
> + put_task_struct(owner);
> + __set_task_state(task, state);
> + /* didnt get the lock, go to sleep: */
> + schedule();
> + } else
> + put_task_struct(owner);
> +
>   spin_lock_mutex(&lock->wait_lock, flags);
>   }
>  
> @@ -187,7 +219,7 @@ done:
>   lock_acquired(&lock->dep_map, ip);
>   /* got the lock - rejoice! */
>   mutex_remove_waiter(lock, &waiter, task_thread_info(task));
> - debug_mutex_set_owner(lock, task_thread_info(task));
> + lock->owner = task;
>  
>   /* set it to 0 if there are no waiters left: */
>   if (likely(list_empty(&lock->wait_list)))
> @@ -260,7 +292,7 @@ __mutex_unlock_common_slowpath(atomic_t 
>   wake_up_process(waiter->task);
>   }
>  
> - debug_mutex_clear_owner(lock);
> + lock->owner = NULL;
>  
>   spin_unlock_mutex(&lock->wait_lock, flags);
>  }
> @@ -352,7 +384,7 @@ static inline int __mutex_trylock_slowpa
>  
>   prev = atomic_xchg(&lock->count, -1);
>   if (likely(prev == 1)) {
> - debug_mutex_set_owner(lock, current_thread_info());
> + lock->owner = current;
>   mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
>   }
>   /* Set it back to 0 if there are no waiters: */
> Index: linux-2.6/kernel/sched.c
> ===
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -697,6 +697,11 @@ int runqueue_is_locked(void)
>   return ret;
>  }
>  
> +int task_is_current(struct task_struct *p)
> +{
> + return task_rq(p)->curr == p;
> +}

Please don't add kernel-wide infrastructure and leave it completely
undocumented.  Particularly functions which are as vague and dangerous
as this one.

What locking must the caller provide?  What are the semantics of the
return value?

What must the caller do to avoid oopses if *p is concurrently exiting?

etc.


The overall design intent seems very smart to me, as long as the races
can be plugged, if they're indeed presen

Re: Btrfs for mainline

2008-12-31 Thread Andrew Morton
On Wed, 31 Dec 2008 06:28:55 -0500 Chris Mason  wrote:

> Hello everyone,

Hi!

> I've done some testing against Linus' git tree from last night and the
> current btrfs trees still work well.

what's btrfs?  I think I've heard the name before, but I've never
seen the patches :)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Notes on support for multiple devices for a single filesystem

2008-12-17 Thread Andrew Morton
On Wed, 17 Dec 2008 08:23:44 -0500
Christoph Hellwig  wrote:

> FYI: here's a little writeup I did this summer on support for
> filesystems spanning multiple block devices:
> 
> 
> -- 
> 
> === Notes on support for multiple devices for a single filesystem ===
> 
> == Intro ==
> 
> Btrfs (and an experimental XFS version) can support multiple underlying block
> devices for a single filesystem instances in a generalized and flexible way.
> 
> Unlike the support for external log devices in ext3, jfs, reiserfs, XFS, and
> the special real-time device in XFS all data and metadata may be spread over a
> potentially large number of block devices, and not just one (or two)
> 
> 
> == Requirements ==
> 
> We want a scheme to support these complex filesystem topologies in way
> that is
> 
>  a) easy to setup and non-fragile for the users
>  b) scalable to a large number of disks in the system
>  c) recoverable without requiring user space running first
>  d) generic enough to work for multiple filesystems or other consumers
> 
> Requirement a) means that a multiple-device filesystem should be mountable
> by a simple fstab entry (UUID/LABEL or some other cookie) which continues
> to work when the filesystem topology changes.

"device topology"?

> Requirement b) implies we must not do a scan over all available block devices
> in large systems, but use an event-based callout on detection of new block
> devices.
> 
> Requirement c) means there must be some version to add devices to a filesystem
> by kernel command lines, even if this is not the default way, and might 
> require
> additional knowledge from the user / system administrator.
> 
> Requirement d) means that we should not implement this mechanism inside a
> single filesystem.
> 

One thing I've never seen comprehensively addressed is: why do this in
the filesystem at all?  Why not let MD take care of all this and
present a single block device to the fs layer?

Lots of filesystems are violating this, and I'm sure the reasons for
this are good, but this document seems like a suitable place in which to
briefly decribe those reasons.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs trees for linux-next

2008-12-10 Thread Andrew Morton
On Thu, 11 Dec 2008 14:14:36 +1100 Stephen Rothwell <[EMAIL PROTECTED]> wrote:

> Hi Andrew,
> 
> On Wed, 10 Dec 2008 21:34:56 -0500 Chris Mason <[EMAIL PROTECTED]> wrote:
> >
> > Just an update, while I still have a long todo list and plenty of things
> > to fix in the code, these src trees have been updated with a disk format
> > I hope to maintain compatibility with from here on.  There are still
> > format changes planned, but should go in through the compat mechanisms
> > in the sources now.
> > 
> > The btrfs trees are still at 2.6.28-rc5, but I just tested against
> > linux-next without problems.
> 
> Do you think this is ready to be added to the end of linux-next, yet?  Or
> is this more -mm material?

I'd prefer that it go into linux-next in the usual fashion.  But the
first step is review..
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html