Re: [PATCH mm 2/2] kfence: show access type in report

2021-01-11 Thread Jörn Engel
On Mon, Jan 11, 2021 at 10:15:44AM +0100, Marco Elver wrote:
> Show the access type in KFENCE reports by plumbing through read/write
> information from the page fault handler. Update the documentation and
> test accordingly.
> 
> Suggested-by: Jörn Engel 
> Signed-off-by: Marco Elver 

Reviewed-by: Jörn Engel 

Jörn

--
This above all: to thine own self be true.
-- Shakespeare


Re: [PATCH mm 1/2] kfence: add option to use KFENCE without static keys

2021-01-11 Thread Jörn Engel
On Mon, Jan 11, 2021 at 10:15:43AM +0100, Marco Elver wrote:
> For certain usecases, specifically where the sample interval is always
> set to a very low value such as 1ms, it can make sense to use a dynamic
> branch instead of static branches due to the overhead of toggling a
> static branch.

I ended up with 100µs and couldn't measure a performance problem in our
benchmarks.  My results don't have predictive value for anyone else, of
course.

> Therefore, add a new Kconfig option to remove the static branches and
> instead check kfence_allocation_gate if a KFENCE allocation should be
> set up.
> 
> Suggested-by: Jörn Engel 
> Signed-off-by: Marco Elver 

Reviewed-by: Jörn Engel 

Jörn

--
One of the things I’ve discovered over the years, is that you can
create change or you can receive credit – not both.
-- Stephen Downes


Re: [PATCH] random: make try_to_generate_entropy() more robust

2019-10-19 Thread Jörn Engel
On Sat, Oct 19, 2019 at 12:49:52PM +0200, Thomas Gleixner wrote:
> 
> One slightly related thing I was looking into is that the mixing of
> interrupt entropy is always done from hard interrupt context. That has a
> few issues:
> 
> 1) It's pretty visible in profiles for high frequency interrupt
>scenarios.
> 
> 2) The regs content can be pretty boring non-deterministic when the
>interrupt hits idle.
> 
>Not an issue in the try_to_generate_entropy() case probably, but
>that still needs some careful investigation.
> 
> For #1 I was looking into a trivial storage model with a per cpu ring
> buffer, where each entry contains the entropy data of one interrupt and let
> some thread or whatever handle the mixing later.

Or you can sum up all regs.

unsigned long regsum(struct pt_regs *regs)
{
	unsigned long *r = (void *)regs;
	unsigned long sum = r[0];
	int i;

	for (i = 1; i < sizeof(*regs) / sizeof(*r); i++) {
		sum += r[i];
	}
	return sum;
}

Takes 1 cycle per register in the current form, half as much if the
compiler can be convinced to unroll the loop.  That's cheaper than
rdtsc() on most/all CPUs.
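
(A sketch of the unrolled variant, assuming a GCC 8+ pragma is acceptable:)

unsigned long regsum_unrolled(struct pt_regs *regs)
{
	unsigned long *r = (void *)regs;
	unsigned long sum = 0;
	int i;

	/* Trip count is a compile-time constant, so the compiler
	 * can unroll fully. */
#pragma GCC unroll 8
	for (i = 0; i < sizeof(*regs) / sizeof(*r); i++)
		sum += r[i];
	return sum;
}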

If interrupt volume is high, the regsum should be good enough.  The
final mixing can be amortized as well.  Once the pool is initialized,
you can mix new entropy once per jiffy or so and otherwise just add to a
percpu counter or something like that.
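
A minimal sketch of that amortization (regsum() as above,
mix_pool_bytes()/input_pool as in drivers/char/random.c; the bookkeeping
names are assumed):

static DEFINE_PER_CPU(unsigned long, entropy_accum);
static unsigned long last_mix_jiffies;

static void add_regsum_entropy(struct pt_regs *regs)
{
	__this_cpu_add(entropy_accum, regsum(regs));

	/* Mix at most roughly once per jiffy; races between CPUs only
	 * mean an occasional extra mix and are harmless. */
	if (time_after(jiffies, READ_ONCE(last_mix_jiffies))) {
		WRITE_ONCE(last_mix_jiffies, jiffies);
		mix_pool_bytes(&input_pool, this_cpu_ptr(&entropy_accum),
			       sizeof(unsigned long));
	}
}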

> That would allow to filter out 'constant' data (#2) but it would also give
> Joerns approach a way to get to some 'random' register content independent
> of the context in which the timer softirq is running in.

Jörn

--
Given two functions foo_safe() and foo_fast(), the shorthand foo()
should be an alias for foo_safe(), never foo_fast().
-- me


Re: [PATCH] random: make try_to_generate_entropy() more robust

2019-10-18 Thread Jörn Engel
On Fri, Oct 18, 2019 at 01:37:04PM -0700, Jörn Engel wrote:
> Sorry for coming late to the discussion.  I generally like the approach
> in try_to_generate_entropy(), but I think we can do a little better
> still.  Would something like this work?

Fixed lkml address.

> From 90078333edb6e720f13f6668376a69c0f9c570f5 Mon Sep 17 00:00:00 2001
> From: Joern Engel 
> Date: Fri, 18 Oct 2019 13:25:52 -0700
> Subject: [PATCH] random: make try_to_generate_entropy() more robust
> 
> We can generate entropy on almost any CPU, even if it doesn't provide a
> high-resolution timer for random_get_entropy().  As long as the CPU is
> not idle, it changes the register file every few cycles.  As long as the
> ALU isn't fully synchronized with the timer, the drift between the
> register file and the timer is enough to generate entropy from.
> 
> Also print a warning on systems where entropy collection might be a
> problem.  I have good confidence in two unsynchronized timers generating
> entropy.  But I cannot tell whether timer and ALU are synchronized and
> we ought to warn users if all their crypto is likely to be broken.
> 
> Signed-off-by: Joern Engel 
> ---
>  drivers/char/random.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index de434feb873a..00a04efd0686 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -1748,6 +1748,16 @@ EXPORT_SYMBOL(get_random_bytes);
>   */
>  static void entropy_timer(struct timer_list *t)
>  {
> + struct pt_regs *regs = get_irq_regs();
> +
> + /*
> +  * Even if we don't have a high-resolution timer in our system,
> +  * the register file itself is a high-resolution timer.  It
> +  * isn't monotonic or particularly useful to read the current
> +  * time.  But it changes with every retired instruction, which
> +  * is enough to generate entropy from.
> +  */
> + mix_pool_bytes(&input_pool, regs, sizeof(*regs));
>   credit_entropy_bits(&input_pool, 1);
>  }
>  
> @@ -1764,9 +1774,8 @@ static void try_to_generate_entropy(void)
>  
>   stack.now = random_get_entropy();
>  
> - /* Slow counter - or none. Don't even bother */
> - if (stack.now == random_get_entropy())
> - return;
> + /* Slow counter - or none.  Warn user */
> + WARN_ON(stack.now == random_get_entropy());
>  
>   timer_setup_on_stack(&stack.timer, entropy_timer, 0);
>   while (!crng_ready()) {
> -- 
> 2.20.1
> 

Jörn

--
...one more straw can't possibly matter...
-- Kirby Bakken


[PATCH] btree: avoid variable-length allocations

2018-03-13 Thread Jörn Engel
I agree that the code is garbage.  In my defense, creating generic
iterator-type functions for multiple data types appears to be one of the
hardest problems in CS with many bad examples of what not to do.

Patch below should fix it.  We have tcm_qla2xxx systems, so I will stick
it into our test system as well.

Jörn

--
It is a cliché that most clichés are true, but then, like most clichés,
that cliché is untrue.
-- Stephen Fry

From 0077d19b11ec27c3287787d2413b26fc4cf0b3ca Mon Sep 17 00:00:00 2001
From: Joern Engel 
Date: Tue, 13 Mar 2018 11:36:49 -0700
Subject: [PATCH] btree: avoid variable-length allocations

geo->keylen cannot be larger than 4.  So we might as well make
fixed-size allocations.

Given the one remaining user, geo->keylen cannot even be larger than 1.
Logfs used to have 64bit and 128bit keys, tcm_qla2xxx only has 32bit
keys.  But let's not break the code if we don't have to.

Signed-off-by: Joern Engel 
---
 lib/btree.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/lib/btree.c b/lib/btree.c
index f93a945274af..590facba2c50 100644
--- a/lib/btree.c
+++ b/lib/btree.c
@@ -3,7 +3,7 @@
  *
  * As should be obvious for Linux kernel code, license is GPLv2
  *
- * Copyright (c) 2007-2008 Joern Engel 
+ * Copyright (c) 2007-2008 Joern Engel 
  * Bits and pieces stolen from Peter Zijlstra's code, which is
  * Copyright 2007, Red Hat Inc. Peter Zijlstra
  * GPLv2
@@ -76,6 +76,8 @@ struct btree_geo btree_geo128 = {
 };
 EXPORT_SYMBOL_GPL(btree_geo128);
 
+#define MAX_KEYLEN (2 * LONG_PER_U64)
+
 static struct kmem_cache *btree_cachep;
 
 void *btree_alloc(gfp_t gfp_mask, void *pool_data)
@@ -313,7 +315,7 @@ void *btree_get_prev(struct btree_head *head, struct btree_geo *geo,
 {
int i, height;
unsigned long *node, *oldnode;
-   unsigned long *retry_key = NULL, key[geo->keylen];
+   unsigned long *retry_key = NULL, key[MAX_KEYLEN];
 
if (keyzero(geo, __key))
return NULL;
@@ -639,8 +641,8 @@ EXPORT_SYMBOL_GPL(btree_remove);
 int btree_merge(struct btree_head *target, struct btree_head *victim,
struct btree_geo *geo, gfp_t gfp)
 {
-   unsigned long key[geo->keylen];
-   unsigned long dup[geo->keylen];
+   unsigned long key[MAX_KEYLEN];
+   unsigned long dup[MAX_KEYLEN];
void *val;
int err;
 
-- 
2.15.1



Re: [PATCH] random: add regrand

2017-11-11 Thread Jörn Engel
On Sat, Nov 11, 2017 at 01:22:38PM -0500, Theodore Ts'o wrote:
> On Fri, Nov 10, 2017 at 04:23:21PM -0800, Jörn Engel wrote:
> > On Fri, Nov 10, 2017 at 06:52:12PM -0500, Theodore Ts'o wrote:
> > > Couple of quick comments.  The code as it exists is horrifically x86
> > > specific.  For example: it uses rdtsc().
> > 
> > Sorry, I should have changed that to "#if 0".  The rdtsc() exists only
> > as a means to estimate entropy and is not meant for production.
> 
> But that means you've only done estimates on x86 and are blindly
> extrapolating to all other architectures?

Correct.

I would claim that regrand and dakarand are near-identical twins.
Dakarand seems to have survived all attempts so far, which gives me a
bit more confidence.  Then again, it is questionable how many people
actually tried to attack dakarand on non-x86.

> > - add_interrupt_randomness() mostly depends on a high-resolution timer,
> >   and has to use jiffies on many architectures.
> 
> It samples from registers as well, at which point it's actually not
> that different from your proposed regrand.  The main difference
> ispurel that you are trying to do the bulk of sampling in very early
> boot, which means before we've started using any of the devices.  So
> regrand is purely dependent on there being separate oscillators in use
> between the CPU clock and the timer interrupt.  If there is a single
> master oscillator which is then downsampled to obtain the other
> "clocks" --- how well do you think regrand works in that environment?
> 
> At *least* with the current system, at least some of the interrupts
> will be from off the SOC chip.  Such as the network interrupt, for
> example.

Regrand continues to collect entropy later on.  That is pretty important
for another reason: snapshot and restore of VMs.  In theory we could get
by with a well-seeded PRNG most of the time - except for boot and VM
restore.  We can detect boot, but we cannot detect VM restore.  So
sampling at some reasonable rate is the best we can do.

The reason I sample "enough" entropy at early boot is basically this:
https://factorable.net/weakkeys12.extended.pdf

If there is a way to mess up by reading from /dev/urandom, I would blame
the kernel and not the users.  If I have to delay bootup to initialize
the random pool, so be it.  The alternative, imo, is worse.

Now the only question is whether we can actually sample enough.  I have
demonstrated that I do when booting VMs on x86.  Not perfect, but a
starting point.

> > Your concern may still be right.  If it is, I think we have lost and
> > there is nothing we can do.
> 
> There is definitely more that we can do.  In particular, I dispute
> your counsel of hopelessness over a number of items you have rejected,
> but definitely this one:
> 
> > - Storing entropy across reboot requires a writeable filesystem and
> >   script support.
> 
> Funny thing; most systems have at least *some* amount of stateful
> store; they wouldn't be terribly useful computers if they didn't!

I have worked on systems that didn't.  Once you go to the cheap end of
embedded systems, you either have no writeable filesystem or a very
limited amount of flash.  Storing entropy at shutdown wouldn't work
because shutdown typically means yanking the power cord.  Storing
entropy every hour/day/whatever will quickly dominate the write load on
those systems and wear out your flash.

I also have limited faith in everyone working on those systems to get
the entropy save/restore script right.  Maybe 90% of them will, but
definitely not 100%.  "You cannot get it wrong" is much stronger than
"carefully read the documentation, write some code of your own and you
probably won't get it wrong".

> What I would do is to make it the responsibility of the bootloader, so
> we can get the entropy early enough that it can be used to seed the
> KASLR.  It may be that on different architectures we will have to
> store entropy in different places; I wouldn't put in the file system
> in most cases, if we can at all help it.  For example, on x86 systems,
> the best place to put it might be an explicit (but small) GPT
> partition if the GPT partitioning scheme is used, or immediately after
> the MBR and GRUB Stage 1.5 at offset 20 sectors from the beginning of
> the disk.

Some kind of bootloader is probably necessary for stackprotector.  You
have to somehow exclude a set of functions from stackprotector and use
only that set of functions to sample the initial entropy for your canary
value.  It would be nice if none of those functions would be reused
after stackprotector is turned on, which would imply something like a
bootloader.

I would prefer not using the regular bootloader, because there are just
too many of them and making sure 

Re: [PATCH] random: add regrand

2017-11-10 Thread Jörn Engel
Forgot to Cc: linux-kernel.

On Fri, Nov 10, 2017 at 01:30:39PM -0800, Jörn Engel wrote:
> Regrand is a replacement for drivers/char/random.c.  It is supposed to
> achieve the following design goals:
> 
> 1. /dev/random shall never block.
> 2. /dev/urandom shall never return bad randomness.
> 3. Any machine supported by Linux must have good randomness.
> 4. Any entropy source that is unavailable on some machines is useless.
> 5. By the time we start userspace, the random pool must be fully initialized.
> 6. Entropy never gets XOR'ed, always hashed.
> 7. In-kernel users never receive bad randomness.
> 
> I believe it achieves all but the last.  Sadly there are a few
> exceptions where random numbers are consumed before the entropy pool can
> be initialized.  In contrast, I believe current drivers/char/random.c
> achieves none of the above.
> 
> That is a pretty bold claim, so please try to poke holes.  For the
> moment, I think regrand should be optional and users should decide which
> random number generator they trust more.  Maybe after a year or two,
> provided noone manages to find a problem in regrand, it can be made the
> default.
> 
> Copyright for regrand proper is Public Domain.  Feel free to port the
> version in Documentation/regrand.c to any OS.  Most of the code in
> drivers/char/regrand.c is copied from either crypto/sha3_generic.c or
> drivers/char/random.c.  I didn't want to go through crypto API to access
> sha3 code and had to implement an interface compatible to the existing
> random number generator.  Regrand proper is tiny in comparison.
> 
> Happy reading.
> 
> Signed-off-by: Joern Engel <jo...@purestorage.com>
> ---
>  Documentation/regrand.c   | 365 ++
>  arch/Kconfig  |   1 +
>  arch/x86/include/asm/stackprotector.h |   2 +-
>  arch/x86/kernel/apic/apic.c   |   2 +
>  drivers/char/Kconfig  |  11 +
>  drivers/char/Makefile |   7 +-
>  drivers/char/regrand.c| 673 ++
>  include/linux/genhd.h |   5 +
>  include/linux/hw_random.h |   4 +
>  include/linux/random.h|  25 +-
>  kernel/fork.c |   2 +-
>  kernel/panic.c|   2 +-
>  mm/slab.c |   2 +-
>  mm/slab_common.c  |   4 +-
>  mm/slub.c |   5 +-
>  15 files changed, 1093 insertions(+), 17 deletions(-)
>  create mode 100644 Documentation/regrand.c
>  create mode 100644 drivers/char/regrand.c
> 
> diff --git a/Documentation/regrand.c b/Documentation/regrand.c
> new file mode 100644
> index ..07a1b87c2bb7
> --- /dev/null
> +++ b/Documentation/regrand.c
> @@ -0,0 +1,365 @@
> +/*
> + * regrand - Random number generator using register state at time of
> + * interrupt.  Uses the same principle as Dakarand, drift between two
> + * unsynchronized high-precision timers, as entropy source.  Should
> + * work reasonably well on any CPU that can generate interrupts.
> + *
> + * Public Domain
> + */
> +#error ADD INCLUDES
> +
> +/*
> + * Based on a million boots, each interrupt seems to yield about 10-15
> + * bits of entropy.  That means 20 samples might be enough for
> + * cryptographically strong random numbers.  Naturally we want more to
> + * give us a comfortable safety margin.  128 should be good enough.
> + */
> +#define SAMPLES_NEEDED   (128)   // [2]
> +
> +#define  ACCESS_ONCE(x)  (*(volatile __typeof(x) *)&(x))
> +
> +/* 256b hash value */
> +struct half_hash {   // [1]
> + uint64_t h[4];
> +};
> +
> +/* 512b hash value */
> +struct full_hash {   // [1]
> + union {
> + struct half_hash half[2];
> + uint64_t h[8];
> + };
> +};
> +
> +struct per_cpu_state {   // [3]
> + struct half_hash half;  /* 256 bits of state */
> + uint64_t p_time;/* last time we produced entropy */
> + uint64_t c_time;/* last time we consumed entropy */
> +};
> +
> +static struct half_hash global_pool; // [3]
> +static int uninitialized_count = INT_MAX;// [8]
> +static int global_lock;  // [4]
> +
> +static struct per_cpu_state *get_local_state(void)   // [5]
> +{
> +#error FILL ME IN
> + /*
> +  * Should return a pointer to per-cpu state.  Most RNG
> +  * operations are done on local state, without cachel

Re: [PATCH] mm: Fix overflow check in expand_upwards()

2017-06-30 Thread Jörn Engel
On Fri, Jun 30, 2017 at 09:34:24AM +0200, Helge Deller wrote:
> 
> I see there are some architectures which define TASK_SIZE not as
> multiple of PAGE_SIZE and as 0xffffffff for which the (address >=
> TASK_SIZE) check will not trigger:
> 
> arch/arm/include/asm/memory.h:#define TASK_SIZE UL(0xffffffff)
> arch/frv/include/asm/mem-layout.h:#define TASK_SIZE __UL(0xFFFFFFFFUL)
> arch/m68k/include/asm/processor.h:#define TASK_SIZE (0xFFFFFFFFUL)
> arch/blackfin/include/asm/processor.h:#define TASK_SIZE 0xFFFFFFFF
> arch/h8300/include/asm/processor.h:#define TASK_SIZE (0xFFFFFFFFUL)
> arch/xtensa/include/asm/processor.h:#define TASK_SIZE __XTENSA_UL_CONST(0xffffffff)
> 
> None of those have an upwards growing stack and thus I believe we don't
> run into issues, but nevertheless the checks could probably be changed
> like this (untested patch):

That would also work.  I have no preference which patch to use.

> diff --git a/mm/mmap.c b/mm/mmap.c
> index a5e3dcd..224bdc2 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2224,15 +2224,17 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  {
>   struct mm_struct *mm = vma->vm_mm;
>   struct vm_area_struct *next;
> - unsigned long gap_addr;
> + unsigned long gap_addr, max_task_size;
>   int error = 0;
>  
>   if (!(vma->vm_flags & VM_GROWSUP))
>   return -EFAULT;
>  
> + max_task_size = TASK_SIZE & PAGE_MASK;
> +
>   /* Guard against exceeding limits of the address space. */
>   address &= PAGE_MASK;
> - if (address >= TASK_SIZE)
> + if (address >= max_task_size)
>   return -ENOMEM;
>   address += PAGE_SIZE;
>  
> @@ -2240,8 +2242,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>   gap_addr = address + stack_guard_gap;
>  
>   /* Guard against overflow */
> - if (gap_addr < address || gap_addr > TASK_SIZE)
> - gap_addr = TASK_SIZE;
> + if (gap_addr < address || gap_addr > max_task_size)
> + gap_addr = max_task_size;
>  
>   next = vma->vm_next;
>   if (next && next->vm_start < gap_addr) {
> 
> Helge

Jörn

--
You cannot suppose that Moliere ever troubled himself to be original in the
matter of ideas. You cannot suppose that the stories he tells in his plays
have never been told before. They were culled, as you very well know.
-- Andre-Louis Moreau in Scarabouche


Re: [PATCH] mm: Fix overflow check in expand_upwards()

2017-06-30 Thread Jörn Engel
On Fri, Jun 30, 2017 at 08:57:27AM +0200, Helge Deller wrote:
> On 30.06.2017 01:02, Jörn Engel wrote:
> > I believe the overflow check was correct, then got subtly broken by
> > commit bd726c90b6b8
> > Author: Helge Deller <del...@gmx.de>
> > Date:   Mon Jun 19 17:34:05 2017 +0200
> > 
> > Allow stack to grow up to address space limit
> > 
> > The old overflow check may have been a bit subtle and I suppose Helge
> > missed its importance.
> > 
> > if (!address)
> > return -ENOMEM;
> > 
> > Functionally my check is identical to the old one.  I just hope the
> > alternative form (and comment!) make it harder to miss and break things
> > in a future patch.
> > 
> > Signed-off-by: Joern Engel <jo...@logfs.org>
> > ---
> >  mm/mmap.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index a5e3dcd75e79..7366f62c31f4 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2232,7 +2232,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> >  
> > /* Guard against exceeding limits of the address space. */
> > address &= PAGE_MASK;
> > -   if (address >= TASK_SIZE)
> > +   /* second check is for integer overflow */
> > +   if (address >= TASK_SIZE || address + PAGE_SIZE < address)
> > return -ENOMEM;
> > address += PAGE_SIZE;
> 
> That overflow check is still there.
> Look at the next few lines in mmap.c:
> 
>/* Enforce stack_guard_gap */
> gap_addr = address + stack_guard_gap;
> 
> /* Guard against overflow */
> if (gap_addr < address || gap_addr > TASK_SIZE)
> gap_addr = TASK_SIZE;
> 
> If the requested page plus the gap (=gap_addr) wraps around, then the
> code will limit it to TASK_SIZE.
> So, the code should already take care of all possible areas >=TASK_SIZE,
> including wrap-arounds.

Does it cover the case where address is (unsigned long)-PAGE_SIZE?

I believe you catch every other case, but not that one.
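
Spelled out for a 32-bit arch where TASK_SIZE is 0xffffffff (illustration
only, not kernel code):

	unsigned long address = (unsigned long)-PAGE_SIZE; /* 0xfffff000 */

	address &= PAGE_MASK;	/* unchanged: 0xfffff000 */
	/* check: 0xfffff000 >= TASK_SIZE (0xffffffff) is false,
	 * so the address is not rejected ... */
	address += PAGE_SIZE;	/* ... and wraps around to 0 */
	/* gap_addr = 0 + stack_guard_gap is now a small value, so the
	 * later "gap_addr < address" guard never sees the overflow. */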

> Did you face a real issue?

No.  I don't even own a computer with stacks growing up.  Just spotted
this while reviewing some patches going by.

Jörn

--
The Linux community is zillions of people with different cultures and ideas
all trying to convince the rest that their vision of the shared culture
is right.
-- Alan Cox


[PATCH] mm: Fix overflow check in expand_upwards()

2017-06-29 Thread Jörn Engel
I believe the overflow check was correct, then got subtly broken by
commit bd726c90b6b8
Author: Helge Deller 
Date:   Mon Jun 19 17:34:05 2017 +0200

Allow stack to grow up to address space limit

The old overflow check may have been a bit subtle and I suppose Helge
missed its importance.

if (!address)
return -ENOMEM;

Functionally my check is identical to the old one.  I just hope the
alternative form (and comment!) make it harder to miss and break things
in a future patch.

Signed-off-by: Joern Engel 
---
 mm/mmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index a5e3dcd75e79..7366f62c31f4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2232,7 +2232,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 
/* Guard against exceeding limits of the address space. */
address &= PAGE_MASK;
-   if (address >= TASK_SIZE)
+   /* second check is for integer overflow */
+   if (address >= TASK_SIZE || address + PAGE_SIZE < address)
return -ENOMEM;
address += PAGE_SIZE;
 
-- 
2.1.4



[Regression] Bonding no longer support tun-interfaces

2016-08-08 Thread Jörn Engel
This has been reported (and ignored) before:
http://lkml.iu.edu/hypermail/linux/kernel/1407.2/03790.html
https://bugzilla.kernel.org/show_bug.cgi?id=89161

Regression was introduced by:

commit 00503b6f702e (refs/bisect/bad)
Author: dingtianhong 
Date:   Sat Jan 25 13:00:29 2014 +0800

bonding: fail_over_mac should only affect AB mode at enslave and removal processing

According to bonding.txt, the fail_over_mac should only affect active-backup
mode, but I found that the fail_over_mac could be set to active or follow in
all modes, this will cause new slave could not be set to bond's MAC address
at enslave processing and restore its own MAC address at removal processing.

The correct way to fix the problem is that we should not add restrictions
when setting options, just need to modify the bond enslave and removal
processing to check the mode in addition to fail_over_mac when setting a
slave's MAC during enslavement. The change active slave processing already
only calls the fail_over_mac function when in active-backup mode.

Thanks for Jay's suggestion.

The patch also modify the pr_warning() to pr_warn().

Cc: Jay Vosburgh 
Cc: Veaceslav Falico 
Cc: Andy Gospodarek 
Signed-off-by: Ding Tianhong 
Signed-off-by: David S. Miller 

Since I never needed bonding or tun-interfaces before, I come late to
the party.  Some 6k lines have changed in the bonding driver since the
regression got in two years ago.  So a simple revert is unlikely to lead
to happiness.

But I absolutely need that functionality and would rather run a 3.13
kernel than live with the regression.  dingtianhong, any suggestions?

Jörn

--
It is a cliché that most clichés are true, but then, like most clichés,
that cliché is untrue.
-- Stephen Fry


Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

2015-08-27 Thread Jörn Engel
On Thu, Aug 27, 2015 at 08:48:18AM +0200, Michal Hocko wrote:
> 
> > On x86, HUGE_MAX_HSTATE == 2.  I don't consider that to be expensive.
> > 
> > If you are concerned about the memory allocation of struct hugetlb_usage, 
> > it could easily be embedded directly in struct mm_struct.
> 
> Yes I am concerned about that and
> 9 files changed, 112 insertions(+), 1 deletion(-)
> for something that is even not clear to be really required. And I still
> haven't heard any strong usecase to justify it.
> 
> Can we go with the single and much simpler cumulative number first and
> only add the break down list if it is _really_ required? We can even
> document that the future version of /proc//status might add an
> additional information to prepare all the parsers to be more careful.

I don't care much which way we decide.  But I find your reasoning a bit
worrying.  If someone asks for a by-size breakup of hugepages in a few
years, you might have existing binaries that depend on the _absence_ of
those extra characters on the line.

Compare:
  HugetlbPages:  18432 kB
  HugetlbPages:1069056 kB (1*1048576kB 10*2048kB)

Once someone has written a script that greps for 'HugetlbPages:.*kB$',
you have lost the option of adding anything else to the line.  You have
created yet another ABI compatibility headache today in order to save
112 lines of code.

That may be a worthwhile tradeoff, I don't know.  But at least I realize
there is a cost, while you seem to ignore that component.  There is
value in not painting yourself into a corner.
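
For what it's worth, a parser can survive such an extension by anchoring on
the prefix and the first number instead of the end of the line (illustrative
userspace sketch, not part of the patch):

#include <stdio.h>

static long hugetlb_kb(const char *status_path)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen(status_path, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* Matches with or without a per-size breakdown appended. */
		if (sscanf(line, "HugetlbPages: %ld kB", &kb) == 1)
			break;
	}
	fclose(f);
	return kb;
}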

Jörn

--
A quarrel is quickly settled when deserted by one party; there is
no battle unless there be two.
-- Seneca


Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

2015-08-21 Thread Jörn Engel
On Fri, Aug 21, 2015 at 08:32:33AM +0200, Michal Hocko wrote:
> 
> Both mmotm and linus tree have
> REG("smaps",  S_IRUGO, proc_pid_smaps_operations),
> 
> and opening the file requires PTRACE_MODE_READ. So I do not see any
> requirement for root here. Or did you mean that you need root to examine
> all processes? That would be true but I am wondering why would be a regular
> user interested in this break out numbers. Hugetlb management sounds
> pretty much like an administrative or very specialized thing.
> 
> From my understanding of the discussion there is no usecase to have this
> information world readable. Is this correct?

Well, tools like top currently display rss.  Once we have some
interface, I would like a version of top that displays the true rss
including hugepages (hrss maybe?).

If we make such a tool impossible today, someone will complain about it
in the future and we created a new mess for ourselves.  I think it is
trouble enough to deal with the old one.

Jörn

--
Denying any reality for any laudable political goal is a bad strategy.
When the facts come out, the discovery of the facts will undermine the
laudable political goals.
-- Jared Diamond


Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

2015-08-21 Thread Jörn Engel
On Fri, Aug 21, 2015 at 08:53:21AM +0200, Michal Hocko wrote:
> On Thu 20-08-15 23:34:51, Naoya Horiguchi wrote:
> [...]
> > > Reading a single file is, of course, easier but is it really worth the
> > > additional code? I haven't really looked at the patch so I might be
> > > missing something but what would be an advantage over reading
> > > /proc/<pid>/smaps and extracting the information from there?
> > 
> > My first idea was just "users should feel it useful", but permission as 
> > David
> > commented sounds a good technical reason to me.
> 
> 9 files changed, 112 insertions(+), 1 deletion(-)
> 
> is quite a lot especially when it touches hot paths like fork so it
> better should have a good usecase. I have already asked in the other
> email but is actually anybody requesting this? Nice to have is not
> a good justification IMO.

I need some way to judge the real rss of a process, including huge
pages.  No strong opinion on implementation details, but something is
clearly needed.

If you have processes with 99% huge pages, you are currently reduced to
guesswork.

Jörn

--
Journalism is printing what someone else does not want printed;
everything else is public relations.
-- George Orwell


Re: [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps

2015-08-13 Thread Jörn Engel
On Wed, Aug 12, 2015 at 01:25:59PM -0700, David Rientjes wrote:
> On Wed, 12 Aug 2015, Naoya Horiguchi wrote:
> 
> > Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> > is inconvenient when we want to know per-task or per-vma base hugetlb usage.
> > To solve this, this patch adds a new line for hugetlb usage like below:
> > 
> >   Size:  20480 kB
> >   Rss:   0 kB
> >   Pss:   0 kB
> >   Shared_Clean:  0 kB
> >   Shared_Dirty:  0 kB
> >   Private_Clean: 0 kB
> >   Private_Dirty: 0 kB
> >   Referenced:0 kB
> >   Anonymous: 0 kB
> >   AnonHugePages: 0 kB
> >   HugetlbPages:  18432 kB
> >   Swap:  0 kB
> >   KernelPageSize: 2048 kB
> >   MMUPageSize:2048 kB
> >   Locked:0 kB
> >   VmFlags: rd wr mr mw me de ht
> > 
> > Signed-off-by: Naoya Horiguchi 
> 
> Acked-by: David Rientjes 

Acked-by: Joern Engel 

Jörn

--
One of my most productive days was throwing away 1000 lines of code.
-- Ken Thompson.


Re: [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

2015-08-13 Thread Jörn Engel
On Thu, Aug 13, 2015 at 12:45:33AM +0000, Naoya Horiguchi wrote:
> From: Naoya Horiguchi 
> Subject: [PATCH] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
> 
> Currently there's no easy way to get per-process usage of hugetlb pages, which
> is inconvenient because userspace applications which use hugetlb typically 
> want
> to control their processes on the basis of how much memory (including hugetlb)
> they use. So this patch simply provides easy access to the info via
> /proc/PID/status.
> 
> With this patch, for example, /proc/PID/status shows a line like this:
> 
>   HugetlbPages:  20480 kB (10*2048kB)
> 
> If your system supports and enables multiple hugepage sizes, the line looks
> like this:
> 
>   HugetlbPages:1069056 kB (1*1048576kB 10*2048kB)
> 
> , so you can easily know how many hugepages in which pagesize are used by a
> process.
> 
> Signed-off-by: Naoya Horiguchi 

Acked-by: Joern Engel 

Jörn

--
Computer system analysis is like child-rearing; you can do grievous damage,
but you cannot ensure success.
-- Tom DeMarco


Re: [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

2015-08-13 Thread Jörn Engel
On Wed, Aug 12, 2015 at 01:30:27PM -0700, David Rientjes wrote:
> 
> I'd be interested in the comments of others, though, to see if there is 
> any reservation about the hstate size breakdown.  It may actually find no 
> current customer who is interested in parsing it.  (If we keep it, I would 
> suggest the 'x' change to '*' similar to per-order breakdowns in 
> show_mem()).  It may also be possible to add it later if a definitive 
> usecase is presented.

I have no interest in parsing the size breakdown today.  I might change
my mind tomorrow and having the extra information hurts very little, so
I won't argue against it either.

> But overall I'm very happy with the new addition and think it's a good 
> solution to the problem.

Agreed.  Thank you!

Jörn

--
One of the painful things about our time is that those who feel certainty
are stupid, and those with any imagination and understanding are filled
with doubt and indecision.
-- Bertrand Russell


Re: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)

2015-08-04 Thread Jörn Engel
On Tue, Aug 04, 2015 at 05:13:39AM +0000, Naoya Horiguchi wrote:
> On Tue, Aug 04, 2015 at 02:55:30AM +0000, Naoya Horiguchi wrote:
> > 
> > One possible way to get hugetlb metric in per-task basis is to walk page
> > table via /proc/pid/pagemap, and counting page flags for each mapped page
> > (we can easily do this with tools/vm/page-types.c like "page-types -p PID
> > -b huge"). This is obviously slower than just storing the counter as
> > in-kernel data and just exporting it, but might be useful in some situation.

Maybe.  The current situation is a mess and I don't know the best way
out of it yet.

> BTW, currently smaps doesn't report any meaningful info for vma(VM_HUGETLB).
> I wrote the following patch, which hopefully is helpful for your purpose.
> 
> Thanks,
> Naoya Horiguchi
> 
> ---
> From: Naoya Horiguchi 
> Subject: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)
> 
> Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
> inconvenient when we want to know per-task or per-vma base hugetlb usage.
> This patch enables these fields by introducing smaps_hugetlb_range().
> 
> before patch:
> 
>   Size:  20480 kB
>   Rss:   0 kB
>   Pss:   0 kB
>   Shared_Clean:  0 kB
>   Shared_Dirty:  0 kB
>   Private_Clean: 0 kB
>   Private_Dirty: 0 kB
>   Referenced:0 kB
>   Anonymous: 0 kB
>   AnonHugePages: 0 kB
>   Swap:  0 kB
>   KernelPageSize: 2048 kB
>   MMUPageSize:2048 kB
>   Locked:0 kB
>   VmFlags: rd wr mr mw me de ht
> 
> after patch:
> 
>   Size:  20480 kB
>   Rss:   18432 kB
>   Pss:   18432 kB
>   Shared_Clean:  0 kB
>   Shared_Dirty:  0 kB
>   Private_Clean: 0 kB
>   Private_Dirty: 18432 kB
>   Referenced:18432 kB
>   Anonymous: 18432 kB
>   AnonHugePages: 0 kB
>   Swap:  0 kB
>   KernelPageSize: 2048 kB
>   MMUPageSize:2048 kB
>   Locked:0 kB
>   VmFlags: rd wr mr mw me de ht

Nice!

> Signed-off-by: Naoya Horiguchi 
> ---
> >  fs/proc/task_mmu.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index ca1e091881d4..c7218603306d 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -610,12 +610,39 @@ static void show_smap_vma_flags(struct seq_file *m, 
> struct vm_area_struct *vma)
>   seq_putc(m, '\n');
>  }
>  
> +#ifdef CONFIG_HUGETLB_PAGE
> +static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> +  unsigned long addr, unsigned long end,
> +  struct mm_walk *walk)
> +{
> + struct mem_size_stats *mss = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> + struct page *page = NULL;
> +
> + if (pte_present(*pte)) {
> + page = vm_normal_page(vma, addr, *pte);
> + } else if (is_swap_pte(*pte)) {
> + swp_entry_t swpent = pte_to_swp_entry(*pte);
> +
> + if (is_migration_entry(swpent))
> + page = migration_entry_to_page(swpent);
> + }
> + if (page)
> + smaps_account(mss, page, huge_page_size(hstate_vma(vma)),
> +   pte_young(*pte), pte_dirty(*pte));
> + return 0;
> +}
> +#endif /* HUGETLB_PAGE */
> +
>  static int show_smap(struct seq_file *m, void *v, int is_pid)
>  {
>   struct vm_area_struct *vma = v;
>   struct mem_size_stats mss;
>   struct mm_walk smaps_walk = {
>   .pmd_entry = smaps_pte_range,
> +#ifdef CONFIG_HUGETLB_PAGE
> + .hugetlb_entry = smaps_hugetlb_range,
> +#endif

Not too fond of the #ifdef.  But I won't blame you, as there already is
an example of the same and - worse - a contradicting example that
unconditionally assigns and moves the #ifdef elsewhere.

Hugetlb is the unloved stepchild with 13 years of neglect and
half-measures.  It shows.

Patch looks good to me.

Acked-by: Jörn Engel 

Jörn

--
Functionality is an asset, but code is a liability.
--Ted Dziuba


Re: hugetlb pages not accounted for in rss

2015-07-30 Thread Jörn Engel
On Wed, Jul 29, 2015 at 04:20:59PM -0700, Mike Kravetz wrote:
> >
> >Since the hugetlb pool is a global resource, it would also be helpful to
> >determine if a process is mapping more than expected.  You can't do that
> >just by adding a huge rss metric, however: if you have 2MB and 1GB
> >hugepages configured you wouldn't know if a process was mapping 512 2MB
> >hugepages or 1 1GB hugepage.

Fair, although I believe 1GB hugepages are overrated.  If you assume
that per-page overhead is independent of page size (not quite true, but
close enough), going from 1% small pages to 0.8% small pages will
improve performance as much as going from 99% 2MB pages to 99% 1GB
pages.
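
Back-of-the-envelope, assuming the per-byte management overhead of a
2MB mapping is roughly 1/512 that of a 4K mapping (and 1/262144 for a
1GB mapping):
	 1% 4K + 99% 2MB:    0.01  + 0.99/512    ~= 0.0119
	 1% 4K + 99% 1GB:    0.01  + 0.99/262144 ~= 0.0100
	0.8% 4K + 99.2% 2MB: 0.008 + 0.992/512   ~= 0.0099
Shrinking the 4K share from 1% to 0.8% buys about as much as moving
every 2MB page to 1GB.  A toy model, obviously - real TLB behaviour is
messier.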

> >That's the purpose of hugetlb_cgroup, after all, and it supports usage
> >counters for all hstates.  The test could be converted to use that to
> >measure usage if configured in the kernel.
> >
> >Beyond that, I'm not sure how a per-hstate rss metric would be exported to
> >userspace in a clean way and other ways of obtaining the same data are
> >possible with hugetlb_cgroup.  I'm not sure how successful you'd be in
> >arguing that we need separate rss counters for it.
> 
> If I want to track hugetlb usage on a per-task basis, do I then need to
> create one cgroup per task?
> 
> For example, suppose I have many tasks using hugetlb and the global pool
> is getting low on free pages.  It might be useful to know which tasks are
> using hugetlb pages, and how many they are using.
> 
> I don't actually have this need (I think), but it appears to be what
> Jörn is asking for.

Maybe some background is useful.  I would absolutely love to use
transparent hugepages.  They are absolutely perfect in every respect,
except for performance.  With transparent hugepages we get higher
latencies.  Small pages are unacceptable, so we are forced to use
non-transparent hugepages.

The part of our system that uses small pages is pretty much constant,
while total system memory follows Moore's law.  When possible we even
try to shrink that part.  Hugepages already dominate today and things
will get worse.

But otherwise we have all the problems that others also have.  There are
memory leaks and we would like to know how much memory each process
actually uses.  Most people use rss, while we have nothing good.  And I
am not sure if cgroup is the correct answer for essentially fixing a
regression introduced in 2002.

Jörn

--
You cannot suppose that Moliere ever troubled himself to be original in the
matter of ideas. You cannot suppose that the stories he tells in his plays
have never been told before. They were culled, as you very well know.
-- Andre-Louis Moreau in Scarabouche


Re: [PATCH] mm: add resched points to remap_pmd_range/ioremap_pmd_range

2015-07-30 Thread Jörn Engel
On Thu, Jul 30, 2015 at 05:22:55PM +0200, Mike Galbraith wrote:
> 
> I piddled about with the thought that it might be nice to be able to
> sprinkle cond_resched() about to cut rt latencies without wrecking
> normal load throughput, cobbled together a cond_resched_rt().
> 
> On my little box that was a waste of time, as the biggest hits are block
> softirq and free_hot_cold_page_list().

Block softirq is one of our problems as well.  It is a bit of a joke
that __do_softirq() moves work to ksoftirqd after 2ms, but block softirq
can take several 100ms in bad cases.

We could give individual softirqs a time budget.  If they exceed the
budget they should complete, but reassert themselves.  Not sure about
the rest, but that would be pretty simple to implement for block
softirq.
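
For concreteness, an untested sketch against 3.10-era
block/blk-softirq.c - the 2-jiffy budget and the splice-back list
handling are placeholders, not a tested patch:

static void blk_done_softirq(struct softirq_action *h)
{
	struct list_head *cpu_list, local_list;
	unsigned long deadline = jiffies + 2;	/* ~2ms budget at HZ=1000 */

	local_irq_disable();
	cpu_list = this_cpu_ptr(&blk_cpu_done);
	list_replace_init(cpu_list, &local_list);
	local_irq_enable();

	while (!list_empty(&local_list)) {
		struct request *rq = list_entry(local_list.next,
						struct request, csd.list);

		list_del_init(&rq->csd.list);
		rq->q->softirq_done_fn(rq);

		/* Budget exceeded: hand the rest back and reassert. */
		if (time_after(jiffies, deadline) &&
		    !list_empty(&local_list)) {
			local_irq_disable();
			list_splice_tail(&local_list,
					 this_cpu_ptr(&blk_cpu_done));
			raise_softirq_irqoff(BLOCK_SOFTIRQ);
			local_irq_enable();
			break;
		}
	}
}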

Jörn

--
Happiness isn't having what you want, it's wanting what you have.
-- unknown


Re: hugetlb pages not accounted for in rss

2015-07-28 Thread Jörn Engel
On Tue, Jul 28, 2015 at 04:30:19PM -0700, David Rientjes wrote:
> 
> It's not only the oom killer, I don't believe hugeltb pages are accounted 
> to the "rss" in memcg.  They use the hugetlb_cgroup for that.  Starting to 
> account for them in existing memcg deployments would cause them to hit 
> their memory limits much earlier.  The "rss_huge" field in memcg only 
> represents transparent hugepages.
> 
> I agree with your comment that having done this when hugetlbfs was 
> introduced would have been optimal.
> 
> It's always difficult to add a new class of memory to an existing metric 
> ("new" here because it's currently unaccounted).
> 
> If we can add yet another process metric to track hugetlbfs memory mapped, 
> then the test could be converted to use that.  I'm not sure if the 
> jusitifcation would be strong enough, but you could try.

Well, we definitely need something.  Having a 100GB process show 3GB of
rss is not very useful.  How would we notice a memory leak if it only
affects hugepages, for example?

Jörn

--
The object-oriented version of 'Spaghetti code' is, of course, 'Lasagna code'.
(Too many layers).
-- Roberto Waltman.


Re: hugetlb pages not accounted for in rss

2015-07-28 Thread Jörn Engel
On Tue, Jul 28, 2015 at 03:15:17PM -0700, David Rientjes wrote:
> 
> Starting to account hugetlb pages in rss may lead to breakage in userspace 
> and I would agree with your earlier suggestion that just removing any test 
> for rss would be appropriate.

What would you propose for me then?  I have 80% RAM or more in reserved
hugepages.  OOM-killer is not a concern, as it panics the system - the
alternatives were almost universally silly and we didn't want to deal
with system in unpredictable states.  But knowing how much memory is
used by which process is a concern.  And if you only tell me about the
small (and continuously shrinking) portion, I essentially fly blind.

That is not a case of "may lead to breakage", it _is_ broken.

Ideally we would have fixed this in 2002 when hugetlbfs was introduced.
By now we might have to introduce a new field, rss_including_hugepages
or whatever.  Then we have to update tools like top etc. to use the new
field when appropriate.  No fun, but might be necessary.

If we can get away with including hugepages in rss and fixing the OOM
killer to be less silly, I would strongly prefer that.  But I don't know
how much of a mess we are already in.

Jörn

--
Time? What's that? Time is only worth what you do with it.
-- Theo de Raadt


Re: hugetlb pages not accounted for in rss

2015-07-28 Thread Jörn Engel
On Mon, Jul 27, 2015 at 04:26:47PM -0700, Mike Kravetz wrote:
> I started looking at the hugetlb self tests.  The test hugetlbfstest
> expects hugetlb pages to be accounted for in rss.  However, there is
> no code in the kernel to do this accounting.
> 
> It looks like there was an effort to add the accounting back in 2013.
> The test program made it into tree, but the accounting code did not.

My apologies.  Upstream work always gets axed first when I run out of
time - which happens more often than not.

> The easiest way to resolve this issue would be to remove the test and
> perhaps document that hugetlb pages are not accounted for in rss.
> However, it does seem like a big oversight that hugetlb pages are not
> accounted for in rss.  From a quick scan of the code it appears THP
> pages are properly accounted for.
> 
> Thoughts?

Unsurprisingly I agree that hugepages should count towards rss.  Keeping
the test in keeps us honest.  Actually fixing the issue would make us
honest and correct.

Increasingly we have tiny processes (by rss) that actually consume large
fractions of total memory.  Makes rss somewhat useless as a measure of
anything.

Jörn

--
Consensus is no proof!
-- John Naisbitt


Re: [PATCH] mm: add resched points to remap_pmd_range/ioremap_pmd_range

2015-07-28 Thread Jörn Engel
On Tue, Jul 28, 2015 at 03:32:55PM +0200, Michal Hocko wrote:
> > 
> > We have kernel preemption disabled.  A lower-priority task in a system
> > call will block higher-priority tasks.
> 
> This is an inherent problem of !PREEMPT, though. There are many
> loops which can take quite some time but we do not want to sprinkle
> cond_resched all over the kernel. On the other hand these io/remap resp.
> vunmap page table walks do not have any cond_resched points AFAICS so we
> can at least mimic zap_pmd_range which does cond_resched.

Even for !PREEMPT we don't want infinite scheduler latencies.  Real
question is how much we are willing to accept and at what point we
should start sprinkling cond_resched.  I would pick 100ms, but that is
just a personal choice.  If we decide on 200ms or 500ms, I can live with
that too.

But whatever value we pick, I suspect these resched points need to go in
eventually.  As memory sizes grow, people will also start mapping bigger
regions and the scheduler latency will eventually exceed whatever value
we picked.

Jörn

--
Fools ignore complexity.  Pragmatists suffer it.
Some can avoid it.  Geniuses remove it.
-- Perlis's Programming Proverb #58, SIGPLAN Notices, Sept.  1982


Re: [PATCH] mm: add resched points to remap_pmd_range/ioremap_pmd_range

2015-07-27 Thread Jörn Engel
On Mon, Jul 27, 2015 at 09:08:42AM +0200, Michal Hocko wrote:
> On Fri 24-07-15 09:56:27, Jörn Engel wrote:
> > On Fri, Jul 24, 2015 at 09:04:21AM +0200, Michal Hocko wrote:
> > > On Thu 23-07-15 14:54:33, Spencer Baugh wrote:
> > > > From: Joern Engel 
> > > > 
> > > > Mapping large memory spaces can be slow and prevent high-priority
> > > > realtime threads from preempting lower-priority threads for a long time.
> > > 
> > > How can a lower priority task block the high priority one? Do you have
> > > preemption disabled?
> > 
> > Yes.

We have kernel preemption disabled.  A lower-priority task in a system
call will block higher-priority tasks.

Jörn

--
After Iraq, if a PM said there are trees in the rainforest you’d need
to send for proof.
-- Alex Thompson


Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

2015-07-24 Thread Jörn Engel
On Fri, Jul 24, 2015 at 12:49:14PM -0700, David Rientjes wrote:
> 
> I don't see the cond_resched() you propose to add, but the need for it is 
> obvious with a large user-written nr_hugepages in the above loop.
> 
> The suggestion is to check the conditional, reschedule if needed (and if 
> so, recheck the conditional), and then allocate.
> 
> Your third option looks fine and the best place to do the cond_resched().  
> I was looking at your second option when I responded and compared it to 
> the first.  We don't want to do cond_resched() immediately before or after 
> the allocation, the net result is the same: we may be pointlessly 
> allocating the hugepage and each hugepage allocation can be very 
> heavyweight.
> 
> So I agree with your third option from the previous email.

All right.  We are talking about the same thing now.  But I previously
argued that the pointless allocation will a) not impact correctness and
b) be so rare as to not impact performance.  The problem with the third
option is that it adds a bit of constant overhead all the time to
compensate for not doing the pointless allocation.

On my systems at least, the pointless allocation will happen, on
average, less than once per boot.  Unless my systems are vastly
unrepresentative, the third option doesn't look appealing to me.

> You may also want to include the actual text of the warning from the 
> kernel log in your commit message.  When people encounter this, they will 
> probably grep in the kernel logs for some keywords to see if it was 
> already fixed and I fear your current commit message may allow it to be 
> missed.

Ack.

I should still have those warnings in logfiles somewhere and can hunt
them down.

Jörn

--
Act only according to that maxim whereby you can, at the same time,
will that it should become a universal law.
-- Immanuel Kant


Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

2015-07-24 Thread Jörn Engel
On Fri, Jul 24, 2015 at 08:59:59AM +0200, Michal Hocko wrote:
> On Thu 23-07-15 14:54:31, Spencer Baugh wrote:
> > From: Joern Engel 
> > 
> > ~150ms scheduler latency for both observed in the wild.
> 
> This is way too vague. Could you describe your problem somehow more,
> please?
> There are schduling points in the page allocator (when it triggers the
> reclaim), why are those not sufficient? Or do you manage to allocate
> many hugetlb pages without performing the reclaim and that leads to
> soft lockups?

We don't use transparent hugepages - they cause too much latency.
Instead we reserve somewhere around 3/4 or so of physical memory for
hugepages.  "sysctl -w vm.nr_hugepages=10" or something similar in a
startup script.

Since it is early in boot we don't go through page reclaim.

Jörn

--
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein


Re: [PATCH] mm: add resched points to remap_pmd_range/ioremap_pmd_range

2015-07-24 Thread Jörn Engel
On Thu, Jul 23, 2015 at 05:32:03PM -0600, Toshi Kani wrote:
> On Thu, 2015-07-23 at 14:54 -0700, Spencer Baugh wrote:
> > From: Joern Engel 
> > 
> > Mapping large memory spaces can be slow and prevent high-priority
> > realtime threads from preempting lower-priority threads for a long time.
> 
> Yes, and one of the goals of large page ioremap support is to address such
> problem.

Nice!  Once we upgrade we should retest this one then.

> > [ cut here ]
> > WARNING: at arch/x86/kernel/irq.c:182 do_IRQ+0x126/0x140()
> > Thread not rescheduled for 95 jiffies
> > CPU: 14 PID: 6684 Comm: foo Tainted: GW  O 3.10.59+
> >  0009 883f7fbc3ee0 8163a12c 883f7fbc3f18
> >  8103f131 887f48275ac0 002f 007c
> >   7fadd1e0 883f7fbc3f78 8103f19c
> > Call Trace:
> >  <IRQ>  [8163a12c] dump_stack+0x19/0x1b
> >  [8103f131] warn_slowpath_common+0x61/0x80
> >  [8103f19c] warn_slowpath_fmt+0x4c/0x50
> >  [810bd917] ? rcu_irq_exit+0x77/0xc0
> >  [8164a556] do_IRQ+0x126/0x140
> >  [816407ef] common_interrupt+0x6f/0x6f
> >  <EOI>  [81640483] ? _raw_spin_lock+0x13/0x30
> >  [8111b621] __pte_alloc+0x31/0xc0
> >  [8111feac] remap_pfn_range+0x45c/0x470
> 
> remap_pfn_range() does not have large page mappings support yet.  So, yes,
> this can still take a long time at this point.  We can extend large page
> support for this interface if necessary.

A cond_resched() is enough to solve the latency impact.  But I suspect
large pages will perform better as well, so having that support would be
appreciated.
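
For reference, the change itself is a one-liner in the pmd loop -
roughly the following, against 3.10 (the posted patch may differ in
detail):

	do {
		next = pmd_addr_end(addr, end);
		if (remap_pte_range(mm, pmd, addr, next,
				pfn + (addr >> PAGE_SHIFT), prot))
			return -ENOMEM;
+		cond_resched();
	} while (pmd++, addr = next, addr != end);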

> >  [a007f1f8] vfio_pci_mmap+0x148/0x210 [vfio_pci]
> >  [a0025173] vfio_device_fops_mmap+0x23/0x30 [vfio]
> >  [81124ed8] mmap_region+0x3d8/0x5e0
> >  [811253e5] do_mmap_pgoff+0x305/0x3c0
> >  [8126f3f3] ? call_rwsem_down_write_failed+0x13/0x20
> >  [8677] vm_mmap_pgoff+0x67/0xa0
> >  [811237e2] SyS_mmap_pgoff+0x272/0x2e0
> >  [810067e2] SyS_mmap+0x22/0x30
> >  [81648c59] system_call_fastpath+0x16/0x1b
> > ---[ end trace 6b0a8d2341444bde ]---

Jörn

--
A defeated army first battles and then seeks victory.
-- Sun Tzu


Re: [PATCH] mm: add resched points to remap_pmd_range/ioremap_pmd_range

2015-07-24 Thread Jörn Engel
On Fri, Jul 24, 2015 at 09:04:21AM +0200, Michal Hocko wrote:
> On Thu 23-07-15 14:54:33, Spencer Baugh wrote:
> > From: Joern Engel 
> > 
> > Mapping large memory spaces can be slow and prevent high-priority
> > realtime threads from preempting lower-priority threads for a long time.
> 
> How can a lower priority task block the high priority one? Do you have
> preemption disabled?

Yes.

Jörn

--
If you're willing to restrict the flexibility of your approach,
you can almost always do something better.
-- John Carmack


Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

2015-07-23 Thread Jörn Engel
On Thu, Jul 23, 2015 at 03:54:43PM -0700, David Rientjes wrote:
> On Thu, 23 Jul 2015, Jörn Engel wrote:
> 
> > > This is wrong, you'd want to do any cond_resched() before the page 
> > > allocation to avoid racing with an update to h->nr_huge_pages or 
> > > h->surplus_huge_pages while hugetlb_lock was dropped that would result in 
> > > the page having been uselessly allocated.
> > 
> > There are three options.  Either
> > /* some allocation */
> > cond_resched();
> > or
> > cond_resched();
> > /* some allocation */
> > or
> > if (cond_resched()) {
> > spin_lock(&hugetlb_lock);
> > continue;
> > }
> > /* some allocation */
> > 
> > I think you want the second option instead of the first.  That way we
> > have a little less memory allocation for the time we are scheduled out.
> > Sure, we can do that.  It probably doesn't make a big difference either
> > way, but why not.
> > 
> 
> The loop is dropping the lock simply to do the allocation and it needs to 
> compare with the user-written number of hugepages to allocate.

And at this point the existing code is racy.  Page allocation might
block for minutes trying to free some memory.  A cond_resched doesn't
change that - it only increases the odds of hitting the race window.

> What we don't want is to allocate, reschedule, and check if we really 
> needed to allocate.  That's what your patch does because it races with 
> persistent_huge_page().  It's probably the worst place to do it.
> 
> Rather, what you want to do is check if you need to allocate, reschedule 
> if needed (and if so, recheck), and then allocate.
> 
> > If you are asking for the third option, I would rather avoid that.  It
> > makes the code more complex and doesn't change the fact that we have a
> > race and better be able to handle the race.  The code size growth will
> > likely cost us more performance than we would ever gain.  nr_huge_pages
> > tends to get updated once per system boot.
> 
> Your third option is nonsensical, you didn't save the state of whether you 
> locked the lock so you can't reliably unlock it, and you cannot hold a 
> spinlock while allocating in this context.

Are we looking at the same code?  Mine looks like this:
	while (count > persistent_huge_pages(h)) {
		/*
		 * If this allocation races such that we no longer need the
		 * page, free_huge_page will handle it by freeing the page
		 * and reducing the surplus.
		 */
		spin_unlock(&hugetlb_lock);
		if (hstate_is_gigantic(h))
			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
		else
			ret = alloc_fresh_huge_page(h, nodes_allowed);
		spin_lock(&hugetlb_lock);
		if (!ret)
			goto out;

		/* Bail for signals. Probably ctrl-c from user */
		if (signal_pending(current))
			goto out;
	}

What state is there to save?  We just called spin_unlock, we did a
schedule and if we want to continue without doing page allocation we
better take the lock again.  Or do you want to go even more complex and
check for signals as well?

The case you are concerned about is rare.  It is so rare that it doesn't
matter from a performance point of view, only for correctness.  And if
we hit the rare case, the worst harm would be an unnecessary allocation
that we return back to the system.  How much complexity do you think it
is worth to avoid this allocation?  How much runtime will the bigger
text size cost you in the common cases?

What matters to me is the scheduler latency.  That is real and happens
reliably once per boot.

Jörn

--
Chance favors only the prepared mind.
-- Louis Pasteur


Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

2015-07-23 Thread Jörn Engel
On Thu, Jul 23, 2015 at 03:08:58PM -0700, David Rientjes wrote:
> On Thu, 23 Jul 2015, Spencer Baugh wrote:
> > From: Joern Engel 
> > 
> > ~150ms scheduler latency for both observed in the wild.
> > 
> > Signed-off-by: Joern Engel 
> > Signed-off-by: Spencer Baugh 
> > ---
> >  mm/hugetlb.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a8c3087..2eb6919 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1836,6 +1836,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> >  			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
> >  		else
> >  			ret = alloc_fresh_huge_page(h, nodes_allowed);
> > +		cond_resched();
> >  		spin_lock(&hugetlb_lock);
> >  		if (!ret)
> >  			goto out;
> 
> This is wrong, you'd want to do any cond_resched() before the page 
> allocation to avoid racing with an update to h->nr_huge_pages or 
> h->surplus_huge_pages while hugetlb_lock was dropped that would result in 
> the page having been uselessly allocated.

There are three options.  Either
/* some allocation */
cond_resched();
or
cond_resched();
/* some allocation */
or
if (cond_resched()) {
spin_lock(&hugetlb_lock);
continue;
}
/* some allocation */

I think you want the second option instead of the first.  That way we
have a little less memory allocation for the time we are scheduled out.
Sure, we can do that.  It probably doesn't make a big difference either
way, but why not.

If you are asking for the third option, I would rather avoid that.  It
makes the code more complex and doesn't change the fact that we have a
race and better be able to handle the race.  The code size growth will
likely cost us more performance than we would ever gain.  nr_huge_pages
tends to get updated once per system boot.

> > @@ -3521,6 +3522,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >  			spin_unlock(ptl);
> >  			ret = hugetlb_fault(mm, vma, vaddr,
> >  				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
> > +			cond_resched();
> >  			if (!(ret & VM_FAULT_ERROR))
> >  				continue;
> >  
> 
> This is almost certainly the wrong placement as well since it's inserted 
> inside a conditional inside a while loop and there's no reason to 
> hugetlb_fault(), schedule, and then check the return value.  You need to 
> insert your cond_resched()'s in legitimate places.

I assume you want the second option here as well.  Am I right?

Jörn

--
Sometimes it pays to stay in bed on Monday, rather than spending the rest
of the week debugging Monday's code.
-- Christopher Thompson


Re: round_up integer underflow

2015-07-23 Thread Jörn Engel
On Thu, Jul 23, 2015 at 11:02:55AM -0700, Jörn Engel wrote:
> Spencer spotted something nasty in the round_up macro.  We were
> wondering why round_up() worked differently from ALIGN.  The only real
> difference between the two patterns is overflow behaviour.  And both
> versions are buggy when used for signed integer types, round_up will
> underflow on INT_MIN, ALIGN will overflow on INT_MAX.  Since signed
> integer under/overflows are undefined, we might have subtle bugs lurking
> in the kernel.
> 
> This example program produces a warning when compiling with gcc -O2 or
> higher.  Clang doesn't warn.  Compiled code behaves correctly with both
> compilers, but that is largely luck and the same compilers may create
> wrong behaviour if the surrounding code changes.
> 
> #include <stdio.h>
> #include <limits.h>
> 
> #define __round_mask(x, y) ((__typeof__(x))((y)-1))
> #define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
> #define round_down(x, y) ((x) & ~__round_mask(x, y))
> 
> int main(void)
> {
> 	int i, r = 8;
> 
> 	for (i = INT_MIN; i; i++) {
> 		printf("%2x: %2x %2x\n", i, round_down(i, r), round_up(i, r));
> 	}
> 	return 0;
> }
> 
> I don't have a good answer yet.  We could make round_up check for
> negative numbers, but I would prefer unconditional code that optimizes
> down to nothing.  We could rewrite it in assembly, once for each
> architecture.
> 
> Does anyone have better ideas?

Btw, it would be awesome if something like the following would work in
gcc:
#define __round_mask(x, y) ((__typeof__(x))((y)-1))
#define __round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
#define round_down(x, y) ((x) & ~__round_mask(x, y))
#define round_up(x, y) (__typeof__(x)(__round_up((unsigned __typeof__(x)(x)), (y))))

I.e. cast x to the matching unsigned type where overflows are
well-defined, do the rounding, then cast the result back to the original
type.
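
Until something like that works, a per-type helper written by hand can
do the same thing.  A minimal sketch for int - round_up_int() is a
made-up name, not a kernel interface - relying on the fact that the
unsigned-to-int conversion on the way back is merely
implementation-defined, not undefined:

#include <limits.h>
#include <stdio.h>

/* Do the arithmetic in the unsigned counterpart, where wraparound is
 * well-defined, then convert back. */
static inline int round_up_int(int x, int r)
{
	unsigned int ux = (unsigned int)x;

	return (int)(((ux - 1) | ((unsigned int)r - 1)) + 1);
}

int main(void)
{
	/* INT_MIN is a multiple of 8, so it rounds to itself without UB. */
	printf("%d %d\n", round_up_int(INT_MIN, 8), round_up_int(13, 8));
	return 0;
}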

Jörn

--
Rules of Optimization:
Rule 1: Don't do it.
Rule 2 (for experts only): Don't do it yet.
-- M.A. Jackson


round_up integer underflow

2015-07-23 Thread Jörn Engel
Spencer spotted something nasty in the round_up macro.  We were
wondering why round_up() worked differently from ALIGN.  The only real
difference between the two patterns is overflow behaviour.  And both
versions are buggy when used for signed integer types, round_up will
underflow on INT_MIN, ALIGN will overflow on INT_MAX.  Since signed
integer under/overflows are undefined, we might have subtle bugs lurking
in the kernel.

This example program produces a warning when compiling with gcc -O2 or
higher.  Clang doesn't warn.  Compiled code behaves correctly with both
compilers, but that is largely luck and the same compilers may create
wrong behaviour if the surrounding code changes.

#include <limits.h>
#include <stdio.h>

#define __round_mask(x, y) ((__typeof__(x))((y)-1))
#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
#define round_down(x, y) ((x) & ~__round_mask(x, y))

int main(void)
{
	int i, r = 8;

	for (i = INT_MIN; i; i++) {
		printf("%2x: %2x %2x\n", i, round_down(i, r), round_up(i, r));
	}
	return 0;
}

I don't have a good answer yet.  We could make round_up check for
negative numbers, but I would prefer unconditional code that optimizes
down to nothing.  We could rewrite it in assembly, once for each
architecture.

Does anyone have better ideas?

Jörn

--
People really ought to be forced to read their code aloud over the phone.
That would rapidly improve the choice of identifiers.
-- Al Viro


Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

2015-07-23 Thread Jörn Engel
On Thu, Jul 23, 2015 at 03:54:43PM -0700, David Rientjes wrote:
> On Thu, 23 Jul 2015, Jörn Engel wrote:
> 
> > > This is wrong, you'd want to do any cond_resched() before the page 
> > > allocation to avoid racing with an update to h->nr_huge_pages or 
> > > h->surplus_huge_pages while hugetlb_lock was dropped that would result in 
> > > the page having been uselessly allocated.
> > 
> > There are three options.  Either
> > /* some allocation */
> > cond_resched();
> > or
> > cond_resched();
> > /* some allocation */
> > or
> > if (cond_resched()) {
> > spin_lock(&hugetlb_lock);
> > continue;
> > }
> > /* some allocation */
> > 
> > I think you want the second option instead of the first.  That way we
> > have a little less memory allocation for the time we are scheduled out.
> > Sure, we can do that.  It probably doesn't make a big difference either
> > way, but why not.
> > 
> 
> The loop is dropping the lock simply to do the allocation and it needs to 
> compare with the user-written number of hugepages to allocate.

And at this point the existing code is racy.  Page allocation might
block for minutes trying to free some memory.  A cond_resched doesn't
change that - it only increases the odds of hitting the race window.

> What we don't want is to allocate, reschedule, and check if we really 
> needed to allocate.  That's what your patch does because it races with 
> persistent_huge_page().  It's probably the worst place to do it.
> 
> Rather, what you want to do is check if you need to allocate, reschedule 
> if needed (and if so, recheck), and then allocate.
> 
> > If you are asking for the third option, I would rather avoid that.  It
> > makes the code more complex and doesn't change the fact that we have a
> > race and better be able to handle the race.  The code size growth will
> > likely cost us more performance than we would ever gain.  nr_huge_pages
> > tends to get updated once per system boot.
> 
> Your third option is nonsensical, you didn't save the state of whether you 
> locked the lock so you can't reliably unlock it, and you cannot hold a 
> spinlock while allocating in this context.

Are we looking at the same code?  Mine looks like this:
	while (count > persistent_huge_pages(h)) {
		/*
		 * If this allocation races such that we no longer need the
		 * page, free_huge_page will handle it by freeing the page
		 * and reducing the surplus.
		 */
		spin_unlock(&hugetlb_lock);
		if (hstate_is_gigantic(h))
			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
		else
			ret = alloc_fresh_huge_page(h, nodes_allowed);
		spin_lock(&hugetlb_lock);
		if (!ret)
			goto out;

		/* Bail for signals. Probably ctrl-c from user */
		if (signal_pending(current))
			goto out;
	}

What state is there to save?  We just called spin_unlock, we did a
schedule and if we want to continue without doing page allocation we
better take the lock again.  Or do you want to go even more complex and
check for signals as well?

The case you are concerned about is rare.  It is so rare that it doesn't
matter from a performance point of view, only for correctness.  And if
we hit the rare case, the worst harm would be an unnecessary allocation
that we return back to the system.  How much complexity do you think it
is worth to avoid this allocation?  How much runtime will the bigger
text size cost you in the common cases?

What matters to me is the scheduler latency.  That is real and happens
reliably once per boot.

Jörn

--
Chance favors only the prepared mind.
-- Louis Pasteur


Re: [PATCH] soft lockup: kill realtime threads before panic

2015-07-22 Thread Jörn Engel
On Wed, Jul 22, 2015 at 03:54:36PM -0700, Andrew Morton wrote:
> On Tue, 21 Jul 2015 15:07:57 -0700 Spencer Baugh  wrote:
> 
> > From: Joern Engel 
> > 
> > We have observed cases where the soft lockup detector triggered, but no
> > kernel bug existed.  Instead we had a buggy realtime thread that
> > monopolized a cpu.  So let's kill the responsible party and not panic
> > the entire system.
> > 
> > ...
> >
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -428,7 +428,10 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
> > hrtimer *hrtimer)
> > }
> >  
> > add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
> > -   if (softlockup_panic)
> > +   if (rt_prio(current->prio)) {
> > +   pr_emerg("killing realtime thread\n");
> > +   send_sig(SIGILL, current, 0);
> 
> Why choose SIGILL?

It is a random signal that happens to generate a stacktrace in
userspace.
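
By default that stacktrace comes from the core dump.  If the process
wants one without dumping core, a handler along these lines works - a
debug-quality sketch only, since backtrace_symbols_fd() is not strictly
async-signal-safe:

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void dump_backtrace(int sig)
{
	void *frames[32];
	int n = backtrace(frames, 32);

	backtrace_symbols_fd(frames, n, STDERR_FILENO);
	_exit(128 + sig);	/* die like the default, minus the core */
}

/* early in main() or thread setup: signal(SIGILL, dump_backtrace); */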

> > +   } else if (softlockup_panic)
> > panic("softlockup: hung tasks");
> > __this_cpu_write(soft_watchdog_warn, true);
> 
> But what about a non-buggy realtime thread which happens to
> occasionally spend 15 seconds doing stuff?
> 
> Old behaviour: kernel blurts a softlockup message, everything keeps running.
> 
> New behaviour: thread gets killed, plane crashes.
> 
> 
> Possibly a better approach would be to only kill the thread if
> softlockup_panic was set, because the system is going down anyway.
> 
> Also, perhaps some users would prefer that the kernel simply suppress
> the softlockup warning in this situation, rather than killing stuff!
> 
> 
> 
> 
> Really, what you're trying to implement here is a watchdog for runaway
> realtime threads.  And that sounds a worthy project but it's a rather
> separate thing from the softlockup detector.  A realtime thread
> watchdog feature might have things as
> 
> - timeout duration separately configurable from softlockup
> 
> - enabled independently from sotflockup: people might want one and
>   not the other.
> 
> - configurable signal, perhaps?
> 
> Now, the *implementation* of the realtime thread watchdog may well
> share code with the softlockup detector.  But from a
> conceptual/configuration/documentation point of view, it's a separate
> thing, no?

Agreed.  We needed this patch exactly once and it is a rather quick hack
that yielded the necessary results.  Realtime threads were well-behaved
since and the patch has seen zero polish as a result.

I think it is better to drop the patch for now.  If someone else keeps
running into the same issue, it might be a starting point for a better
implementation.  They will find it in list archives.

Jörn

--
I can say that I spend most of my time fixing bugs even if I have lots
of new features to implement in mind, but I give bugs more priority.
-- Andrea Arcangeli, 2000


Re: [PATCH] soft lockup: kill realtime threads before panic

2015-07-22 Thread Jörn Engel
On Wed, Jul 22, 2015 at 09:35:28AM +0200, Mike Galbraith wrote:
> On Tue, 2015-07-21 at 23:33 -0700, Jörn Engel wrote:
> 
> > One could argue that killing the realtime thread is even better than
> > panic, as things can restart with a blank slate even faster.  But the
> > real benefit is that we get better debug data for the failing component.
> > If we had a kernel bug, the backtrace would usually be sufficient to
> > point fingers.  With a bonkers realtime thread, not so much.
> 
> If userspace wants a watchdog, it should train a userspace dog, not turn
> the kernel watchdog into a userspace attack dog.

Fair point.  Let's drop this patch then.
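
For anyone who wants to train that userspace dog: a minimal sketch
would be a watcher thread at a strictly higher SCHED_FIFO priority (or
pinned to a spare cpu) that checks a heartbeat the monitored RT thread
is supposed to bump.  All names below are illustrative, nothing more:

#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <unistd.h>

static atomic_ulong heartbeat;	/* bumped by the monitored RT thread */
static pthread_t victim;	/* the RT thread being watched */

static void *watchdog(void *arg)
{
	unsigned long last = 0, now;

	(void)arg;
	for (;;) {
		sleep(15);
		now = atomic_load(&heartbeat);
		if (now == last)	/* no progress in 15s */
			pthread_kill(victim, SIGILL);
		last = now;
	}
	return NULL;
}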

Jörn

--
Money can buy bandwidth, but latency is forever.
-- John R. Mashey


Re: [PATCH] soft lockup: kill realtime threads before panic

2015-07-22 Thread Jörn Engel
On Wed, Jul 22, 2015 at 07:41:48AM +0200, Mike Galbraith wrote:
> On Tue, 2015-07-21 at 22:18 -0700, Jörn Engel wrote:
> > 
> > Not sure if this patch is something for mainline, but those two
> > alternatives have problems of their own.  Not panicking on lockups can
> > leave a system disabled until some human come around.  In many cases
> > that human will do no better than power-cycle.  A panic reduces the
> > downtime.
> 
> If a realtime task goes bonkers, the realtime game is over, you're down.

Agreed.  But a reboot will often solve the issue.  So the automatic
panic will repair the system within minutes, while no panic will leave
the system broken for days, depending on human response time.  Automatic
panic is a great way to minimize downtime - or vulnerable time if you
have HA.

One could argue that killing the realtime thread is even better than
panic, as things can restart with a blank slate even faster.  But the
real benefit is that we get better debug data for the failing component.
If we had a kernel bug, the backtrace would usually be sufficient to
point fingers.  With a bonkers realtime thread, not so much.

Anyway, this patch has been useful to us once.  If someone deems it
merge-worthy, great.  If not, I won't lose any sleep either.

Jörn

--
The key to performance is elegance, not battalions of special cases.
-- Jon Bentley and Doug McIlroy


Re: [PATCH] soft lockup: kill realtime threads before panic

2015-07-21 Thread Jörn Engel
On Wed, Jul 22, 2015 at 06:36:30AM +0200, Mike Galbraith wrote:
> On Tue, 2015-07-21 at 15:07 -0700, Spencer Baugh wrote:
> 
> > We have observed cases where the soft lockup detector triggered, but no
> > kernel bug existed.  Instead we had a buggy realtime thread that
> > monopolized a cpu.  So let's kill the responsible party and not panic
> > the entire system.
> 
> If you don't tell the kernel to panic, it won't, and if you don't remove
> its leash (the throttle), your not so tame rt beast won't maul you.

Not sure if this patch is something for mainline, but those two
alternatives have problems of their own.  Not panicking on lockups can
leave a system disabled until some human comes around.  In many cases
that human will do no better than power-cycle.  A panic reduces the
downtime.

And the realtime throttling gives non-realtime threads some minimum
runtime, but does nothing to help low-priority realtime threads.  It
also introduces latencies, often when workloads are high and you would
like any available cpu to get through that rough spot.
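
For reference, that throttle is the sched_rt_period_us/sched_rt_runtime_us
pair; by default RT tasks get at most 950ms out of every 1s.  A trivial
way to inspect the knobs:

#include <stdio.h>

int main(void)
{
	const char *knobs[] = {
		"/proc/sys/kernel/sched_rt_period_us",
		"/proc/sys/kernel/sched_rt_runtime_us",
	};
	char buf[64];
	int i;

	for (i = 0; i < 2; i++) {
		FILE *f = fopen(knobs[i], "r");

		if (f && fgets(buf, sizeof(buf), f))
			printf("%s: %s", knobs[i], buf);	/* runtime -1 = throttle off */
		if (f)
			fclose(f);
	}
	return 0;
}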

I don't think we have a good answer to this problem in the mainline
kernel yet.

Jörn

--
To announce that there must be no criticism of the President, or that we
are to stand by the President, right or wrong, is not only unpatriotic
and servile, but is morally treasonable to the American public.
-- Theodore Roosevelt, Kansas City Star, 1918


[PATCH] random: initialize pools faster

2015-07-13 Thread Jörn Engel
add_interrupt_randomness() can cause significant cpu overhead on
interrupt-heavy workloads.  We try to limit that overhead by bailing out
quickly once we have sampled a few bits of entropy this second.  If
there is enough entropy around it doesn't hurt to waste the excess.

However, we also waste entropy early in boot when we haven't even
initialized the pools yet.  With this patch we initialize the pools in
1-2s while it takes 10-20s without this patch.  Actual numbers depend on
hardware and fluctuate from boot to boot, but in all cases I have tested
there is a clear improvement.

Signed-off-by: Joern Engel 
---
 drivers/char/random.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 9cd6968e2f92..514f67a98b88 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -898,7 +898,8 @@ void add_interrupt_randomness(int irq, int irq_flags)
add_interrupt_bench(cycles);
 
if ((fast_pool->count < 64) &&
-   !time_after(now, fast_pool->last + HZ))
+   !time_after(now, fast_pool->last + HZ) &&
+   nonblocking_pool.initialized)
return;
 
	r = nonblocking_pool.initialized ? &input_pool : &nonblocking_pool;
-- 
2.1.4
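
As an aside, one way to watch the effect from userspace is to poll
getrandom(2) with GRND_NONBLOCK until it stops returning EAGAIN - that
marks the nonblocking pool becoming initialized.  A rough sketch,
assuming a >=3.17 kernel whose headers define SYS_getrandom, and using
raw syscall() since glibc had no wrapper at the time:

#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef GRND_NONBLOCK
#define GRND_NONBLOCK 0x0001
#endif

int main(void)
{
	char buf[16];
	struct timespec ts = { 0, 10 * 1000 * 1000 };	/* 10ms */
	double up = 0.0;
	FILE *f;

	while (syscall(SYS_getrandom, buf, sizeof(buf), GRND_NONBLOCK) < 0 &&
	       errno == EAGAIN)
		nanosleep(&ts, NULL);	/* pool not initialized yet */

	f = fopen("/proc/uptime", "r");
	if (f) {
		if (fscanf(f, "%lf", &up) != 1)
			up = 0.0;
		fclose(f);
	}
	printf("pool initialized by uptime %.2fs\n", up);
	return 0;
}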



Re: RFC: revert 43fa5460fe60

2015-02-24 Thread Jörn Engel
On Tue, Feb 24, 2015 at 10:33:44AM -0500, Steven Rostedt wrote:
> On Tue, 24 Feb 2015 09:55:09 -0500
> Jörn Engel  wrote:
> 
> > I came across a silly problem that tempted me to revert 43fa5460fe60.
> > We had a high-priority realtime thread woken, TIF_NEED_RESCHED was set
> > for the running thread and the realtime thread didn't run for >2s.
> > Problem was a system call that mapped a ton of device memory and never
> > hit a cond_resched() point.  Obvious solution is to fix the long-running
> > system call.
> 
> I'm assuming that you are running a non PREEMPT kernel
> (PREEMPT_VOLUNTARY doesn't count). If that's the case, then it is very
> much likely that you will hit long latencies. Regardless of this change
> or not (you will continue to need to whack-a-mole here).

Correct.

> > Applying that solution quickly turns into a game of whack-a-mole.  Not
> > the worst game in the world and all those moles surely deserve a solid
> > hit on the head.  But what is annoying in my case is that I had plenty
> > of idle cpus during the entire time and the high-priority thread was
> > allowed to run anywhere.  So if the thread had been moved to a different
> > runqueue immediately there would have been zero delay.  Sure, the cache
> > is semi-cold or the migration may even be cross-package.  That is a
> > tradeoff we are willing to make and we set the cpumask explicitly that
> > way.  We want this thread to run quickly, anywhere.
> > 
> > So we could check for currently idle cpus when waking realtime threads.
> > If there are any, immediately move the woken thread over.  Maybe have a
> > check whether the running thread on the current cpu is in a syscall and
> > retain current behaviour if not.
> > 
> > Now this is not quite the same as reverting 43fa5460fe60 and I would
> > like to verify the idea before I spend time on a patch you would never
> > consider merging anyway.
> 
> The thing is, the patch you want to revert helped a lot for
> CONFIG_PREEMPT kernels. It doesn't make sense to gut a change for a non
> CONFIG_PREEMPT kernel as that already suffers huge latencies.
> 
> This is considering that you are indeed not running a CONFIG_PREEMPT
> kernel, as you stated that it doesn't preempt until it hits a
> cond_resched().

Well, reverting was my first instinct, but for different reasons I think
it is wrong.  Simply reverting can result in the high priority thread
moving from one cpu with a running process to a different cpu with a
running process.  In both cases you may trip over a mole, so nothing
much is gained.

But if you know that the destination cpu is idle, you can avoid any
moles, give or take a small race window maybe.  The moles are still
present and you still need some debug tool to detect them and fix them
over time.  But as cpus increase over time, your chances of getting
lucky in spite of bad kernel code also increase.

Is that a worthwhile approach, at least for non PREEMPT?

Jörn

--
Schrödinger's cat is not dead.
-- Illiad


Re: RFC: revert 43fa5460fe60

2015-02-23 Thread Jörn Engel
Looks like Gregory Haskins' email bounces.  Should have figured as much.

Jörn

--
The only real mistake is the one from which we learn nothing.
-- John Powell


RFC: revert 43fa5460fe60

2015-02-23 Thread Jörn Engel
Hello Steven!

I came across a silly problem that tempted me to revert 43fa5460fe60.
We had a high-priority realtime thread woken, TIF_NEED_RESCHED was set
for the running thread and the realtime thread didn't run for >2s.
Problem was a system call that mapped a ton of device memory and never
hit a cond_resched() point.  Obvious solution is to fix the long-running
system call.

Applying that solution quickly turns into a game of whack-a-mole.  Not
the worst game in the world and all those moles surely deserve a solid
hit on the head.  But what is annoying in my case is that I had plenty
of idle cpus during the entire time and the high-priority thread was
allowed to run anywhere.  So if the thread had been moved to a different
runqueue immediately there would have been zero delay.  Sure, the cache
is semi-cold or the migration may even be cross-package.  That is a
tradeoff we are willing to make and we set the cpumask explicitly that
way.  We want this thread to run quickly, anywhere.

So we could check for currently idle cpus when waking realtime threads.
If there are any, immediately move the woken thread over.  Maybe have a
check whether the running thread on the current cpu is in a syscall and
retain current behaviour if not.
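
A rough sketch of that check, against a ~3.x select_task_rq_rt() -
purely illustrative and untested, and it ignores cpus going busy in the
window:

static int pick_idle_cpu_for_rt(struct task_struct *p, int prev_cpu)
{
	int cpu;

	for_each_cpu(cpu, tsk_cpus_allowed(p)) {
		if (idle_cpu(cpu))
			return cpu;	/* wake onto an idle cpu right away */
	}
	return prev_cpu;		/* otherwise keep current behaviour */
}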

Now this is not quite the same as reverting 43fa5460fe60 and I would
like to verify the idea before I spend time on a patch you would never
consider merging anyway.

Jörn

--
As more agents are added, systems become more reliable in the total-effort
case, but less reliable in the weakest-link case.  What are the implications?
Well, software companies should hire more software testers and fewer (but
more competent) programmers.
-- Ross Anderson


Re: [PATCH] printk: Print cpu number along with time

2014-09-09 Thread Jörn Engel
On Wed, 4 June 2014 19:15:06 -0400, Jörn Engel wrote:
> On Mon, 28 April 2014 17:22:19 -0700, Andrew Morton wrote:
> > On Mon, 28 Apr 2014 19:40:39 -0400 Jörn Engel  wrote:
> > > On Thu, 24 April 2014 15:40:24 -0400, Jörn Engel wrote:
> > > > On Wed, 23 April 2014 20:52:47 -0400, Jörn Engel wrote:
> > > > > 
> > > > > I use the patch below for some time now.  While it doesn't avoid the
> > > > > log pollution in the first place, it lessens the impact somewhat.
> > > > 
> > > > Added a config option and ported it to current -linus.  Andrew, would
> > > > you take this patch?
> > > 
> > > Andrew?  Did you dislike this patch for some reason or just miss it in
> > > the thread?
> > 
> > Neither ;) I try to respond in some way to all patches unless I think it's
> > clear to the originator why I took no action.  And I think I'm pretty good at
> > not losing stuff.
> > 
> > otoh, it has only been four days, three of which I spent offline...
> 
> Ping.

Ping.

Jörn

--
Fancy algorithms are slow when n is small, and n is usually small.
Fancy algorithms have big constants. Until you know that n is
frequently going to be big, don't get fancy.
-- Rob Pike


Re: [PATCH] block2mtd: mtd: Present block2mtd timely on boot time

2014-09-09 Thread Jörn Engel
On Mon, 8 September 2014 16:04:40 -0400, Rodrigo Freire wrote:
> 
> From: Felix Fietkau 
> 
> block2mtd: Ensure that block2mtd is presented in a timely fashion 
> 
> Currently, a block MTD device is not presented to the system on time, in 
> order to start mounting the filesystems. This patch ensures that block2mtd 
> is presented at the right time, so filesystems can be mounted on boot time.
> This issue was seen on BCM2708 (Raspberry Pi) systems when mounting JFFS2 
> block2mtd filesystems.
> This patchset also adds a MTD device name and a timeout option to the driver.

Looks fine once the comments below are addressed.

> Original patchset:
> https://dev.openwrt.org/browser/trunk/target/linux/generic/patches-3.12/440-block2mtd_init.patch?rev=40444
> https://dev.openwrt.org/browser/trunk/target/linux/generic/patches-3.12/441-block2mtd_probe.patch?rev=40444
> 
> Signed-off-by: Felix Fietkau 
> Signed-off-by: Rodrigo Freire  
> 
> --- a/drivers/mtd/devices/block2mtd.c 2014-09-05 11:13:39.143698413 -0300
> +++ b/drivers/mtd/devices/block2mtd.c 2014-09-05 17:50:28.107366433 -0300
> @@ -9,7 +9,15 @@
>  
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>  
> +/*
> +* When the first attempt at device initialization fails, we may need to
> +* wait a little bit and retry. This timeout, by default 3 seconds, gives
> +* device time to start up. Required on BCM2708 and a few other chipsets.
> +*/
> +#define MTD_DEFAULT_TIMEOUT  3
> +
>  #include <linux/module.h>
> +#include <linux/delay.h>
>  #include <linux/fs.h>
>  #include <linux/blkdev.h>
>  #include <linux/bio.h>
> @@ -17,6 +25,7 @@
>  #include <linux/list.h>
>  #include <linux/init.h>
>  #include <linux/mtd/mtd.h>
> +#include <linux/mtd/partitions.h>
>  #include <linux/mutex.h>
>  #include <linux/mount.h>
>  #include <linux/slab.h>
> @@ -209,12 +218,14 @@ static void block2mtd_free_device(struct
>  }
>  
>  
> -static struct block2mtd_dev *add_device(char *devname, int erase_size)
> +static struct block2mtd_dev *add_device(char *devname, int erase_size, const 
> char *mtdname, int timeout)
>  {
>   const fmode_t mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
> - struct block_device *bdev;
> + struct block_device *bdev = ERR_PTR(-ENODEV);
>   struct block2mtd_dev *dev;
> + struct mtd_partition *part;
>   char *name;
> + int i;
>  
>   if (!devname)
>   return NULL;
> @@ -225,15 +236,28 @@ static struct block2mtd_dev *add_device(
>  
>   /* Get a handle on the device */
>   bdev = blkdev_get_by_path(devname, mode, dev);
> -#ifndef MODULE
> - if (IS_ERR(bdev)) {
>  
> - /* We might not have rootfs mounted at this point. Try
> -to resolve the device name by other means. */
> -
> - dev_t devt = name_to_dev_t(devname);
> - if (devt)
> - bdev = blkdev_get_by_dev(devt, mode, dev);
> +#ifndef MODULE
> +/*
> +* We might not have the root device mounted at this point.
> +* Try to resolve the device name by other means.
> +*/
> + for (i = 0; IS_ERR(bdev) && i <= timeout; i++) {
> + dev_t devt;
> +
> + if (i)
> + /*
> +  * Calling wait_for_device_probe in the first loop 
> +  * was not enough, sleep for a bit in subsequent
> +  * go-arounds.
> + */
> + msleep(1000);
> + wait_for_device_probe();
> +
> + devt = name_to_dev_t(devname);
> + if (!devt)
> + continue;
> + bdev = blkdev_get_by_dev(devt, mode, dev);
>   }
>  #endif
>  
> @@ -257,13 +281,15 @@ static struct block2mtd_dev *add_device(
>  
>   /* Setup the MTD structure */
>   /* make the name contain the block device in */
> - name = kasprintf(GFP_KERNEL, "block2mtd: %s", devname);
> + if (!mtdname)
> + mtdname = devname;
> + name = kmalloc(strlen(mtdname) + 1, GFP_KERNEL);
>   if (!name)
>   goto err_destroy_mutex;
>  
> + strcpy(name, mtdname);

kstrdup.
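
i.e. something like this (untested):

	name = kstrdup(mtdname, GFP_KERNEL);
	if (!name)
		goto err_destroy_mutex;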

And see below for the ABI change.

>   dev->mtd.name = name;
> -
> - dev->mtd.size = dev->blkdev->bd_inode->i_size & PAGE_MASK;
> + dev->mtd.size = dev->blkdev->bd_inode->i_size & PAGE_MASK & 
> ~(erase_size - 1);

PAGE_MASK is no longer needed with the new term.  Or does anyone
seriously want to support erase_size < PAGE_SIZE?
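
Something along these lines should be enough (sketch, untested):

	dev->mtd.size = dev->blkdev->bd_inode->i_size & ~(loff_t)(erase_size - 1);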

>   dev->mtd.erasesize = erase_size;
>   dev->mtd.writesize = 1;
>   dev->mtd.writebufsize = PAGE_SIZE;
> @@ -276,15 +302,19 @@ static struct block2mtd_dev *add_device(
>   dev->mtd.priv = dev;
>   dev->mtd.owner = THIS_MODULE;
>  
> - if (mtd_device_register(&dev->mtd, NULL, 0)) {
> + part = kzalloc(sizeof(struct mtd_partition), GFP_KERNEL);
> + part->name = name;
> + part->offset = 0;
> + part->size = dev->mtd.size;
> + if (mtd_device_register(&dev->mtd, part, 1)) {
>   /* Device didn't get added, so free the entry */
>   goto err_destroy_mutex;
>   }
> +
> list_add(&dev->list, &blkmtd_device_list);
>   pr_info("mtd%d: [%s] erase_size = %dKiB [%d]\n",
>   


Re: [PATCH] Fix race in get_request()

2014-08-08 Thread Jörn Engel
On Fri, 8 August 2014 08:28:47 -0600, Jens Axboe wrote:
> On 08/08/2014 08:24 AM, Jens Axboe wrote:
> > On 08/07/2014 06:54 PM, Jörn Engel wrote:
> >>
> >> If __get_request() returns NULL, get_request will call
> >> prepare_to_wait_exclusive() followed by io_schedule().  Not rechecking
> >> the sleep condition after prepare_to_wait_exclusive() leaves a race
> >> where the condition changes before prepare_to_wait_exclusive(), but
> >> not after and accordingly this thread never gets woken up.
> >>
> >> The race must be exceedingly hard to hit, otherwise I cannot explain how
> >> such a classic race could outlive the last millennium.
> > 
> > I think that is a genuine bug, it's just extremely hard to hit in real
> > life. It has probably only potentially ever triggered in the cases where
> > we are so out of memory that a blocking ~300b alloc fails, and Linux
> > generally shits itself pretty hard when it gets to that stage anyway...
> > And for the bug to be critical, you'd need this to happen for a device
> > that otherwise has no IO pending, since you'd get woken up by the next
> > completed request anyway.
> 
> Actually, this can't trigger for an empty queue, since the mempool holds
> a few requests. So it should never result in a softlock, we will make
> progress. Given that we also still hold the queue spinlock (that will be
> held for a free as well), we should not be able to get a free of a
> request until the prepare_to_wait() has been done. So not sure there is
> an actual bug there, but I agree the code looks confusing that way.

The race I was afraid of is not on an empty queue, but where all IO
gets completed between the __get_request() and
prepare_to_wait_exclusive().  That may well be hundreds of requests,
so under normal circumstances they should take far longer than the
race window to complete.  So something has to enlarge the window.

With the spin_lock_irq held, that removes interrupts and a preemptible
kernel from the candidate list.  Now we would need hardware support,
either some kind of unfair SMT cpus or a hypervisor that doesn't
schedule the victim cpu's thread for a long time and at just the right
time.

Maybe the hypervisor case is at least theoretically possible.  But I
would also be happy with leaving the code as-is and at most adding a
comment for the next person.
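
For reference, the canonical way to close this class of race is to
recheck the sleep condition after prepare_to_wait_exclusive() and only
schedule if it still holds.  A minimal sketch, not the actual blk-core
code; wq and request_available() are placeholders:

	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait_exclusive(&wq, &wait, TASK_UNINTERRUPTIBLE);
		if (request_available())	/* recheck once we are queued */
			break;
		io_schedule();	/* wakeups after the recheck cannot be lost */
	}
	finish_wait(&wq, &wait);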

Jörn

--
The grand essentials of happiness are: something to do, something to
love, and something to hope for.
-- Allan K. Chalmers


[PATCH] Fix race in get_request()

2014-08-07 Thread Jörn Engel
Hello Jens!

I came across the below while investigating some other problem.
Something here doesn't seem right.  This looks like an obvious bug and
something roughly along the lines of my patch would fix it.  But I
must be in the wrong decade to find such a bug in the block layer.

Is this for real?  Or if not, what am I missing?

Jörn

--

If __get_request() returns NULL, get_request will call
prepare_to_wait_exclusive() followed by io_schedule().  Not rechecking
the sleep condition after prepare_to_wait_exclusive() leaves a race
where the condition changes before prepare_to_wait_exclusive(), but
not after and accordingly this thread never gets woken up.

The race must be exceedingly hard to hit, otherwise I cannot explain how
such a classic race could outlive the last millennium.

Signed-off-by: Joern Engel <jo...@logfs.org>
---
 block/blk-core.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3275353957f0..00aa6c7abe5a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1068,6 +1068,11 @@ retry:
 
trace_block_sleeprq(q, bio, rw_flags & 1);
 
+   rq = __get_request(rl, rw_flags, bio, gfp_mask);
+   if (rq) {
+   finish_wait(&rl->wait[is_sync], &wait);
+   return rq;
+   }
spin_unlock_irq(q->queue_lock);
io_schedule();
 
-- 
2.0.0.rc0.1.g7b2ba98



Re: [PATCH] Check for Null return from logfs_readpage_nolock in btree_write_block

2014-06-16 Thread Jörn Engel
On Mon, 16 June 2014 20:46:32 -0400, Nick Krause wrote:
> 
> You are right about the compile errors it compiles but with warning.
> I am curious since the function is void how do you want me to
> give the error back to logfs and in turn the other subsystems
> of the kernel.

My first question would be: What is the problem?  If there is no
problem, don't try to fix it.  If there is a problem, your fix should
contain a decent description of the problem as part of the commit
message.  Even if the patch itself ends up being bad, having a good
problem description with a first stab at a fix serves as a useful bug
report.

Jörn

--
Was there any time in human history where democratic forms continued
when the citizens could not credibly defend their rights?
-- Daniel Suarez


Re: [PATCH] random: mix all saved registers into entropy pool

2014-06-12 Thread Jörn Engel
On Wed, 11 June 2014 11:27:42 -0400, Theodore Ts'o wrote:
> On Tue, Jun 10, 2014 at 08:10:09PM -0400, Jörn Engel wrote:
> > > I'm also concerned about how much overhead this might eat up.  I've
> > > already had someone who was benchmarking a high performance storage
> > > array where the increased interrupt latency before adding something
> > > like this was something he noticed, and kvetched to me about.  The
> > > pt_regs structure is going to be larger than the fast_pool (which is
> > > only 16 bytes), and we're going to be doing it once every jiffy, which
> > > means between 100 and 1000 times per second.
> > 
> > Would that someone be me? ;)
> > 
> > The reason I prefer doing it once every jiffy is that it doesn't
> > change with interrupt load.  You get 10k interrupts per second?
> > You pay once per jiffy.  100k interrupts per second?  Pay once per
> > jiffy.
> 
> No, actually, it was someone who was worried about tail latency.  So
> even if the average overhead was small, if once a jiffy we had a much
> larger time spent in the interrupt handler, that's something that
> would a big concern for someone who was worried about big storage
> array performance.
> 
> I'd be happier if we used fast_mix() and mixed the registers into the
> fast pool.  That's much lighter weight than using mix_pool_bytes(),
> which involves many more memory accesses, MUCH higher probability of
> cache line bouncing between CPU's, etc.
> 
> And there has been some proposals that I've been looking at to make
> fast_mix to be use even less overhead, so if we use fast_mix, any
> changes we make to improve that will help here too.

That makes sense to me.  It would require replacing the current fast_mix()
with something that doesn't assume __u32 input[4] as the only
possible parameter.  In a previous incarnation of my patch I was using
a single round of siphash to condense the registers and added that to
our fast_pool.

While siphash is fairly fast, doing that for every interrupt seemed
too much for me, hence the current ratelimit approach.  But if people
care more about maximum latency than amortized cost, we can combine
siphash (or something similar) with the ratelimit.
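
To show the shape of that earlier incarnation, here is a from-memory
sketch, not code from any posted version of the patch; condense_regs()
is a made-up name and rol64() comes from linux/bitops.h.  One SipHash
round per register word would look roughly like this:

	#define SIPROUND(v0, v1, v2, v3) do { \
		v0 += v1; v1 = rol64(v1, 13); v1 ^= v0; v0 = rol64(v0, 32); \
		v2 += v3; v3 = rol64(v3, 16); v3 ^= v2; \
		v0 += v3; v3 = rol64(v3, 21); v3 ^= v0; \
		v2 += v1; v1 = rol64(v1, 17); v1 ^= v2; v2 = rol64(v2, 32); \
	} while (0)

	/* Condense all saved registers into a single 64-bit value that
	 * can then be mixed into the fast_pool. */
	static u64 condense_regs(const struct pt_regs *regs)
	{
		const unsigned long *r = (const unsigned long *)regs;
		u64 v0 = 0x736f6d6570736575ULL, v1 = 0x646f72616e646f6dULL;
		u64 v2 = 0x6c7967656e657261ULL, v3 = 0x7465646279746573ULL;
		unsigned int i;

		for (i = 0; i < sizeof(*regs) / sizeof(*r); i++) {
			v3 ^= r[i];
			SIPROUND(v0, v1, v2, v3);
			v0 ^= r[i];
		}
		return v0 ^ v1 ^ v2 ^ v3;
	}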

At any rate, here is a slightly updated patch that is independent of
CONFIG_HZ.

Jörn

--
"Error protection by error detection and correction."
-- from a university class

From 867d62e0165723f9d8dc568d0cb9d8898948ddaa Mon Sep 17 00:00:00 2001
From: Joern Engel <jo...@logfs.org>
Date: Sat, 15 Feb 2014 16:29:28 -0800
Subject: [PATCH] random: mix all saved registers into entropy pool

The single biggest entropy source is a high-resolution timer running
asynchronous to the triggering event.  That leaves systems without a
useful get_cycles() implementation with significantly less entropy.
Significant means orders of magnitude, not a few percent.

An alternate high-resolution timer is the register content at the time
of an interrupt.  While not monotonic, it is guaranteed to change every
few clock cycles and very unlikely to repeat the same pattern.  Not
useful for interpretation as time, but we only care about some bits
of the "clock" flipping in an unpredictable fashion.

Experimentation shows this to be an excellent entropy source.  Doing 1000
boottests with kvm and dumping a hash of the registers for the first
1024 interrupts each, >40% of all hashes were unique and >80% of all
hashes occurred less than once 1% of the time.

Even assuming that only unique register hashes yield any entropy at all
and only one bit each, we would gather >400 bits of entropy early in
boot and another 40 bits/s from the timer interrupt during runtime.  I
would feel confident with this amount of entropy in the absence of any
other source.

Repeating the test on real hardware, albeit with fewer repetitions,
yields roughly the same results, slightly above 40% unique hashes.

Signed-off-by: Joern Engel <jo...@logfs.org>
---
 drivers/char/random.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 65 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 429b75bb60e8..8e9f6274cf76 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -553,6 +553,8 @@ static void mix_pool_bytes(struct entropy_store *r, const void *in,
 
 struct fast_pool {
__u32   pool[4];
+   u32 last_shift;
+   u32 regs_count;
unsigned long   last;
unsigned short  count;
unsigned char   rotate;
@@ -835,6 +837,64 @@ EXPORT_SYMBOL_GPL(add_input_randomness);
 
 static DEFINE_PER_CPU(struct fast_pool, irq_randomness);
 
+/*
+ * Ratelimit to a steady state of about 50Hz.  A naïve approach would be
+ * to return 1 every time jiffies changes.  But we want to avoid being
+ * closely coupled to the timer interrupt.  So instead we increment a
+ * counter on every call and shift it right every time jiffies increments
+ * by 20ms.
+ * 20ms or 50Hz is chosen to work well for all options of CONFIG_HZ.
+ * If the counter is a power of two, return false

Re: [PATCH] random: mix all saved registers into entropy pool

2014-06-12 Thread Jörn Engel
On Tue, 10 June 2014 20:10:09 -0400, Jörn Engel wrote:
> On Tue, 10 June 2014 12:14:01 -0400, Theodore Ts'o wrote:
> > On Mon, May 19, 2014 at 05:17:19PM -0400, Jörn Engel wrote:
> > > +/*
> > > + * Ratelimit to a steady state of about once per jiffy.  A naïve approach
> > > + * would be to return 1 every time jiffies changes.  But we want to avoid
> > > + * being closely coupled to the timer interrupt.  So instead we increment
> > > + * a counter on every call and shift it right every time jiffies changes.
> > > + * If the counter is a power of two, return false;
> > > + *
> > > + * Effect is that some time after a jiffies change and cutting the counter
> > > + * in half we reach another power of two and return false.  But the
> > > + * likelihood of this happening is about the same at any time within a
> > > + * jiffies interval.
> > > + */
> > > +static inline int ratelimited(struct fast_pool *p)
> > > +{
> > > + int ret = !(p->regs_count == 0 || is_power_of_2(p->regs_count));
> > > +
> > > + p->regs_count++;
> > > + if (p->last_shift != (u32)jiffies) {
> > > + p->regs_count >>= min(31u, (u32)jiffies - p->last_shift);
> > > + p->last_shift = (u32)jiffies;
> > > + }
> > > + return ret;
> > > +}
> > 
> > I wasn't convinced this would do the right thing, so I wrote a quick
> > test program where the main loop was basically this:
> > 
> > for (i=0; i < 1024; i++) {
> > jiffies = i >> 2;
> > r = ratelimited(jiffies);
> > printf("%5u %5u %5u %d\n", i, jiffies, regs_count, r);
> > }
> > 
> > ... which basically simulated a very simple scheme where there were
> > four interrupts for each clock tick.  In the steady state ratelimited
> > returns true 75% of the time.  If this was as documented, we would
> > expect it to return true 25% of the time.  So I don't think this is
> > working quite right:
> > 
> >20 5 3 1
> >21 5 4 1
> >22 5 5 0
> >23 5 6 1
> >24 6 3 1
> >25 6 4 1
> >26 6 5 0
> >27 6 6 1
> >28 7 3 1
> >29 7 4 1
> >30 7 5 0
> >31 7 6 1
> >32 8 3 1
> 
> Doh!  Removing the "!" from "!(p->regs_count..." will fix that.  Shows
> I spent 100x more time testing the bootup entropy collection.

On second look, the function was correct.  If ratelimited() returns
true, the interrupt is ratelimited, i.e. ignored.  So correct
behaviour is to return false 25% of the time, just like it does.

You confused me for a moment!
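
For anyone who wants to check the arithmetic, here is a self-contained
userspace rebuild of Ted's test, reconstructed under the assumption
that jiffies, last_shift and regs_count behave exactly as in the patch:

#include <stdio.h>
#include <stdint.h>

static uint32_t jiffies, last_shift, regs_count;

static int is_power_of_2(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Same logic as the patch: return 1 when the interrupt should be
 * ignored, 0 when it should be sampled. */
static int ratelimited(void)
{
	int ret = !(regs_count == 0 || is_power_of_2(regs_count));

	regs_count++;
	if (last_shift != jiffies) {
		uint32_t shift = jiffies - last_shift;

		regs_count >>= shift < 31 ? shift : 31;
		last_shift = jiffies;
	}
	return ret;
}

int main(void)
{
	unsigned int i;

	for (i = 0; i < 1024; i++) {
		int r;

		jiffies = i >> 2;	/* four interrupts per clock tick */
		r = ratelimited();
		printf("%5u %5u %5u %d\n", i, jiffies, regs_count, r);
	}
	return 0;
}

In the steady state it prints r == 0 for one call in four, i.e. exactly
one interrupt per jiffy gets sampled, matching the trace above.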

Jörn

--
There are two ways of constructing a software design: one way is to make
it so simple that there are obviously no deficiencies, and the other is
to make it so complicated that there are no obvious deficiencies.
-- C. A. R. Hoare

Re: [PATCH] random: mix all saved registers into entropy pool

2014-06-10 Thread Jörn Engel
On Tue, 10 June 2014 12:14:01 -0400, Theodore Ts'o wrote:
> On Mon, May 19, 2014 at 05:17:19PM -0400, Jörn Engel wrote:
> > +/*
> > + * Ratelimit to a steady state of about once per jiffy.  A naïve approach
> > + * would be to return 1 every time jiffies changes.  But we want to avoid
> > + * being closely coupled to the timer interrupt.  So instead we increment
> > + * a counter on every call and shift it right every time jiffies changes.
> > + * If the counter is a power of two, return false;
> > + *
> > + * Effect is that some time after a jiffies change and cutting the counter
> > + * in half we reach another power of two and return false.  But the
> > + * likelihood of this happening is about the same at any time within a
> > + * jiffies interval.
> > + */
> > +static inline int ratelimited(struct fast_pool *p)
> > +{
> > +   int ret = !(p->regs_count == 0 || is_power_of_2(p->regs_count));
> > +
> > +   p->regs_count++;
> > +   if (p->last_shift != (u32)jiffies) {
> > +   p->regs_count >>= min(31u, (u32)jiffies - p->last_shift);
> > +   p->last_shift = (u32)jiffies;
> > +   }
> > +   return ret;
> > +}
> 
> I wasn't convinced this would do the right thing, so I wrote a quick
> test program where the main loop was basically this:
> 
>   for (i=0; i < 1024; i++) {
>   jiffies = i >> 2;
>   r = ratelimited(jiffies);
>   printf("%5u %5u %5u %d\n", i, jiffies, regs_count, r);
>   }
> 
> ... which basically simulated a very simple scheme where there were
> four interrupts for each clock tick.  In the steady state ratelimited
> returns true 75% of the time.  If this was as documented, we would
> expect it to return true 25% of the time.  So I don't think this is
> working quite right:
> 
>20 5 3 1
>21 5 4 1
>22 5 5 0
>23 5 6 1
>24 6 3 1
>25 6 4 1
>26 6 5 0
>27 6 6 1
>28 7 3 1
>29 7 4 1
>30 7 5 0
>31 7 6 1
>32 8 3 1

Doh!  Removing the "!" from "!(p->regs_count..." will fix that.  Shows
I spent 100x more time testing the bootup entropy collection.

> > +static void mix_regs(struct pt_regs *regs, struct fast_pool *fast_pool)
> > +{
> > +   struct entropy_store*r;
> > +   /* Two variables avoid decrementing by two without using atomics */
> > +   static int boot_count = BOOT_IRQS;
> > +   int in_boot = boot_count;
> > +
> > +   if (in_boot) {
> > +   boot_count = in_boot - 1;
> > +   } else if (ratelimited(fast_pool))
> > +   return;
> > +
> > +   /* During bootup we alternately feed both pools */
> > +   r = (in_boot & 1) ? &nonblocking_pool : &input_pool;
> > +   __mix_pool_bytes(r, regs, sizeof(*regs), NULL);
> > +}
> 
> I'm also concerned about how much overhead this might eat up.  I've
> already had someone who was benchmarking a high performance storage
> array where the increased interrupt latency before adding something
> like this was something he noticed, and kvetched to me about.  The
> pt_regs structure is going to be larger than the fast_pool (which is
> only 16 bytes), and we're going to be doing it once every jiffy, which
> means between 100 and 1000 times per second.

Would that someone be me? ;)

The reason I prefer doing it once every jiffy is that it doesn't
change with interrupt load.  You get 10k interrupts per second?
You pay once per jiffy.  100k interrupts per second?  Pay once per
jiffy.

I dislike the fact that HZ is anywhere between 100 and 1000.  We could
sample every time jiffies has advanced by HZ/50, which gives a stable
sample rate of 50Hz for every currently possible config.
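
A minimal sketch of that check; should_sample() is my name for it, not
code from the patch:

	/* Accept at most one sample per HZ/50 jiffies, i.e. a steady
	 * 50Hz whether CONFIG_HZ is 100, 250, 300 or 1000. */
	static inline bool should_sample(unsigned long *last_sample)
	{
		if (time_before(jiffies, *last_sample + HZ / 50))
			return false;
		*last_sample = jiffies;
		return true;
	}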

> If we do this only during boot up, I'd be much less concerned, but
> doing it all the time is making me a bit nervous about the overhead.
> 
> If we knew which parts of the pt_regs were actually the "best" in
> terms of entropy, we could xor that into the input[] array in
> add_interrupt_randomness(), perhaps?  
> 
> 
> > -   input[0] = cycles ^ j_high ^ irq;
> > -   input[1] = now ^ c_high;
> > +   input[0] ^= cycles ^ j_high ^ irq;
> > +   input[1] ^= now ^ c_high;
> 
> This is an unrelated change, so if you want to do this, let's discuss
> this on a separate commit.  We used to have something like this, but
> it causes static code checkers to complain about our using stack
> garbage, and people argued convincingly that it really did add as much
> unpredictability, especially when correlated with the ip field.

Will remove.

> (And yes, I know that instruction_pointer(regs) doesn't work on all
> platforms, and _RET_IP_ becomes largely static --- but this is why I'd
> much rather have each architecture tell me which parts of regs were
> actually the best, and use that as the initialized portions of the
> input[] array.)

What I like about my approach is quite the opposite.  The only thing
an architecture maintainer can mess up is not saving registers on
interrupts.  And if they mess that one up, they are very likely to
notice.

Jörn

--
Victory in war is not repetitious.
-- Sun Tzu


Re: [PATCH] printk: Print cpu number along with time

2014-06-04 Thread Jörn Engel
On Wed, 4 June 2014 16:28:32 -0700, Andrew Morton wrote:
> 
> Sorry, it seems I'm better at losing stuff than I thought I was.

The trouble with forgetting stuff is that few people remember ever
doing so.  But they do remember other people forgetting stuff.

> I've restored the patch to the input queue.

Thanks!

Jörn

--
Maintenance in other professions and of other articles is concerned with
the return of the item to its original state; in Software, maintenance
is concerned with moving an item away from its original state.
-- Les Belady


Re: [PATCH] random: mix all saved registers into entropy pool

2014-06-04 Thread Jörn Engel
Ping.

Jörn

--
Computer system analysis is like child-rearing; you can do grievous damage,
but you cannot ensure success.
-- Tom DeMarco


Re: [PATCH] printk: Print cpu number along with time

2014-06-04 Thread Jörn Engel
On Mon, 28 April 2014 17:22:19 -0700, Andrew Morton wrote:
> On Mon, 28 Apr 2014 19:40:39 -0400 Jörn Engel <jo...@logfs.org> wrote:
> > On Thu, 24 April 2014 15:40:24 -0400, Jörn Engel wrote:
> > > On Wed, 23 April 2014 20:52:47 -0400, Jörn Engel wrote:
> > > > 
> > > > I use the patch below for some time now.  While it doesn't avoid the
> > > > log pollution in the first place, it lessens the impact somewhat.
> > > 
> > > Added a config option and ported it to current -linus.  Andrew, would
> > > you take this patch?
> > 
> > Andrew?  Did you dislike this patch for some reason or just miss it in
> > the thread?
> 
> Neither ;) I try to respond in some way to all patches unless I think it's clear
> to the originator why I took no action.  And I think I'm pretty good at
> not losing stuff.
> 
> otoh, it has only been four days, three of which I spent offline...

Ping.  It is closer to 40 days now.  Can I fault this patch back into
your memory?

Jörn

--
Good warriors cause others to come to them and do not go to others.
-- Sun Tzu