Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-13 Thread Steven Rostedt
On Tue, 13 Aug 2013 07:46:46 -0700
"H. Peter Anvin"  wrote:

> > On Mon, Aug 12, 2013 at 10:47:37AM -0700, H. Peter Anvin wrote:
> >> Since we really doesn't want to...
> 
> Ow.  Can't believe I wrote that.
> 

All your base are belong to us!

-- Steve


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-13 Thread H. Peter Anvin
> On Mon, Aug 12, 2013 at 10:47:37AM -0700, H. Peter Anvin wrote:
>> Since we really doesn't want to...

Ow.  Can't believe I wrote that.

-hpa



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-13 Thread Peter Zijlstra
On Mon, Aug 12, 2013 at 10:47:37AM -0700, H. Peter Anvin wrote:
> On 08/12/2013 09:09 AM, Peter Zijlstra wrote:
> >>
> >> On the majority of architectures, including x86, you cannot simply copy
> >> a piece of code elsewhere and have it still work.
> > 
> > I thought we used -fPIC which would allow just that.
> > 
> 
> Doubly wrong.  The kernel is not compiled with -fPIC, nor does -fPIC
> allow this kind of movement for code that contains intramodule
> references (that is *all* references in the kernel).  Since we really
> doesn't want to burden the kernel with a GOT and a PLT, that is life.

OK. never mind then..


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-12 Thread H. Peter Anvin
On 08/12/2013 09:09 AM, Peter Zijlstra wrote:
>>
>> On the majority of architectures, including x86, you cannot simply copy
>> a piece of code elsewhere and have it still work.
> 
> I thought we used -fPIC which would allow just that.
> 

Doubly wrong.  The kernel is not compiled with -fPIC, nor does -fPIC
allow this kind of movement for code that contains intramodule
references (that is *all* references in the kernel).  Since we really
doesn't want to burden the kernel with a GOT and a PLT, that is life.
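
For illustration, a minimal sketch (hypothetical caller/callee, not kernel
code) of why position-dependent code cannot simply be relocated: near calls
and jumps are encoded as displacements relative to the instruction pointer,
so the same bytes resolve to a different target once they sit at another
address.

extern void callee(void);

void caller(void)
{
	/*
	 * Compiles to "call rel32", i.e. e8 <callee - next_insn>.  Copying
	 * caller()'s bytes elsewhere keeps that displacement, so the copied
	 * call lands in the wrong place; fixing it up means finding every
	 * such reference, which is the disassembler/JIT work referred to
	 * here.
	 */
	callee();
}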

>> You end up doing a
>> bunch of the work that a JIT would do anyway, and would end up with
>> considerably higher complexity and worse results than a true JIT.  
> 
> Well, less complexity but worse result, yes. We'd only poke the specific
> static_branch sites with either NOPs or the (relative) jump target for
> each of these branches. Then copy the result.

Once again, you can't "copy the result".  You end up with a full
disassembler.

-hpa



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-12 Thread Peter Zijlstra
On Mon, Aug 12, 2013 at 09:02:02AM -0700, Andi Kleen wrote:
> "H. Peter Anvin"  writes:
> 
> > However, I would really like to
> > understand what the value is.
> 
> Probably very little. When I last looked at it, the main overhead in
> perf currently seems to be backtraces and the ring buffer, not this
> code.

backtraces do indeed blow and make pretty much everything else
irrelevant, but when not using them the branch forest was significant
when I last looked at it.


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-12 Thread Peter Zijlstra
On Mon, Aug 12, 2013 at 07:56:10AM -0700, H. Peter Anvin wrote:
> On 08/12/2013 02:17 AM, Peter Zijlstra wrote:
> > 
> > I've been wanting to 'abuse' static_key/asm-goto to sort-of JIT
> > if-forest functions like perf_prepare_sample() and perf_output_sample().
> > 
> > They are of the form:
> > 
> > void func(obj, args..)
> > {
> > unsigned long f = ...;
> > 
> > if (f & F1)
> > do_f1();
> > 
> > if (f & F2)
> > do_f2();
> > 
> > ...
> > 
> > if (f & FN)
> > do_fn();
> > }
> > 
> 
> Am I reading this right that f can be a combination of any of these?

Correct.

> > Where f is constant for the entire lifetime of the particular object.
> > 
> > So I was thinking of having these functions use static_key/asm-goto;
> > then write the proper static key values unsafe so as to avoid all
> > trickery (as these functions would never actually be used) and copy the
> > end result into object private memory. The object will then use indirect
> > calls into these functions.
> 
> I'm really not following what you are proposing here, especially not
> "copy the end result into object private memory."
> 
> With asm goto you end up with at minimum a jump or NOP for each of these
> function entries, whereas an actual JIT can elide that as well.
> 
> On the majority of architectures, including x86, you cannot simply copy
> a piece of code elsewhere and have it still work.

I thought we used -fPIC which would allow just that.

> You end up doing a
> bunch of the work that a JIT would do anyway, and would end up with
> considerably higher complexity and worse results than a true JIT.  

Well, less complexity but worse result, yes. We'd only poke the specific
static_branch sites with either NOPs or the (relative) jump target for
each of these branches. Then copy the result.

> You
> also say "the object will then use indirect calls into these
> functions"... you mean the JIT or pseudo-JIT generated functions, or the
> calls inside them?

The calls to these pseudo-JIT generated functions.

> > I suppose the question is, do people strenuously object to creativity
> > like that and or is there something GCC can do to make this
> > easier/better still?
> 
> I think it would be much easier to just write a minimal JIT for this,
> even though it is per architecture.  However, I would really like to
> understand what the value is.

Removing a lot of the conditionals from the sample path. Depending on
the configuration these can be quite expensive.


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-12 Thread Andi Kleen
"H. Peter Anvin"  writes:

> However, I would really like to
> understand what the value is.

Probably very little. When I last looked at it, the main overhead in
perf currently seems to be backtraces and the ring buffer, not this
code.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-12 Thread H. Peter Anvin
On 08/12/2013 02:17 AM, Peter Zijlstra wrote:
> 
> I've been wanting to 'abuse' static_key/asm-goto to sort-of JIT
> if-forest functions like perf_prepare_sample() and perf_output_sample().
> 
> They are of the form:
> 
> void func(obj, args..)
> {
>   unsigned long f = ...;
> 
>   if (f & F1)
>   do_f1();
> 
>   if (f & F2)
>   do_f2();
> 
>   ...
> 
>   if (f & FN)
>   do_fn();
> }
> 

Am I reading this right that f can be a combination of any of these?

> Where f is constant for the entire lifetime of the particular object.
> 
> So I was thinking of having these functions use static_key/asm-goto;
> then write the proper static key values unsafe so as to avoid all
> trickery (as these functions would never actually be used) and copy the
> end result into object private memory. The object will then use indirect
> calls into these functions.

I'm really not following what you are proposing here, especially not
"copy the end result into object private memory."

With asm goto you end up with at minimum a jump or NOP for each of these
function entries, whereas an actual JIT can elide that as well.

On the majority of architectures, including x86, you cannot simply copy
a piece of code elsewhere and have it still work.  You end up doing a
bunch of the work that a JIT would do anyway, and would end up with
considerably higher complexity and worse results than a true JIT.  You
also say "the object will then use indirect calls into these
functions"... you mean the JIT or pseudo-JIT generated functions, or the
calls inside them?

> I suppose the question is, do people strenuously object to creativity
> like that and or is there something GCC can do to make this
> easier/better still?

I think it would be much easier to just write a minimal JIT for this,
even though it is per architecture.  However, I would really like to
understand what the value is.

-hpa



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-12 Thread Peter Zijlstra
On Mon, Aug 05, 2013 at 12:55:15PM -0400, Steven Rostedt wrote:
> [ sent to both Linux kernel mailing list and to gcc list ]
> 

Let me hijack this thread for something related...

I've been wanting to 'abuse' static_key/asm-goto to sort-of JIT
if-forest functions like perf_prepare_sample() and perf_output_sample().

They are of the form:

void func(obj, args..)
{
unsigned long f = ...;

if (f & F1)
do_f1();

if (f & F2)
do_f2();

...

if (f & FN)
do_fn();
}

Where f is constant for the entire lifetime of the particular object.

So I was thinking of having these functions use static_key/asm-goto;
then write the proper static key values unsafe so as to avoid all
trickery (as these functions would never actually be used) and copy the
end result into object private memory. The object will then use indirect
calls into these functions.
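
As a minimal sketch (hypothetical key names, not the actual perf code), the
same if-forest with every flag test turned into a static-key site, i.e. the
form the patch-then-copy idea above would operate on:

#include <linux/jump_label.h>

extern void do_f1(void);
extern void do_f2(void);

static struct static_key key_f1 = STATIC_KEY_INIT_FALSE;
static struct static_key key_f2 = STATIC_KEY_INIT_FALSE;

void func(void *obj)
{
	/* Each test is a patchable nop-or-jmp site instead of a
	 * load + test + conditional branch on obj's flag word. */
	if (static_key_false(&key_f1))
		do_f1();

	if (static_key_false(&key_f2))
		do_f2();

	/* ... one key per flag, up to FN ... */
}

The per-object step would then patch each of those sites to a nop or an
unconditional jmp before copying the function body.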

The advantage of using something like this is that it would work for all
architectures that now support the asm-goto feature. For arch/gcc
combinations that do not we'd simply revert to the current state of
affairs.

I suppose the question is, do people strenuously object to creativity
like that and or is there something GCC can do to make this
easier/better still?


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-07 Thread Mathieu Desnoyers
* Steven Rostedt (rost...@goodmis.org) wrote:
> On Wed, 2013-08-07 at 12:03 -0400, Mathieu Desnoyers wrote:
> 
> > You might want to try creating a global array of counters (accessible
> > both from C for printout and assembly for update).
> > 
> > Index the array from assembly using:   (2f - 1f)
> > 
> > 1:
> > jmp ...;
> > 2:
> > 
> > And put an atomic increment of the counter. This increment instruction
> > should be located prior to the jmp for obvious reasons.
> > 
> > You'll end up with the sums you're looking for at indexes 2 and 5 of the
> > array.
> 
> After I post the patches, feel free to knock yourself out.

I just need the calculation, not the entire patchset. For this purpose:

Based on top of 3.10.5:

---
 arch/x86/include/asm/jump_label.h |   15 ++-
 include/linux/jump_label.h        |    3 +++
 kernel/jump_label.c               |   12 
 3 files changed, 29 insertions(+), 1 deletion(-)

Index: linux/arch/x86/include/asm/jump_label.h
===
--- linux.orig/arch/x86/include/asm/jump_label.h
+++ linux/arch/x86/include/asm/jump_label.h
@@ -15,9 +15,20 @@ static __always_inline bool arch_static_
 {
asm goto("1:"
STATIC_KEY_INITIAL_NOP
+#ifdef CONFIG_X86_64
+   "lock; incq 4f \n\t"
+#else
+   "lock; incl 4f \n\t"
+#endif
+   "jmp 3f \n\t"
+   "2:"
+   "jmp %l[l_yes] \n\t"
+   "3:"
".pushsection __jump_table,  \"aw\" \n\t"
_ASM_ALIGN "\n\t"
-   _ASM_PTR "1b, %l[l_yes], %c0 \n\t"
+   _ASM_PTR "1b, %l[l_yes], %c0, (3b - 2b) \n\t"
+   "4:"/* nr_hit */
+   _ASM_PTR "0 \n\t"
".popsection \n\t"
: :  "i" (key) : : l_yes);
return false;
@@ -37,6 +48,8 @@ struct jump_entry {
jump_label_t code;
jump_label_t target;
jump_label_t key;
+   jump_label_t jmp_insn_len;
+   jump_label_t nr_hit;
 };
 
 #endif
Index: linux/include/linux/jump_label.h
===
--- linux.orig/include/linux/jump_label.h
+++ linux/include/linux/jump_label.h
@@ -208,4 +208,7 @@ static inline bool static_key_enabled(st
 	return (atomic_read(&key->enabled) > 0);
 }
 
+struct jump_entry *get_jump_label_start(void);
+struct jump_entry *get_jump_label_stop(void);
+
 #endif /* _LINUX_JUMP_LABEL_H */
Index: linux/kernel/jump_label.c
===
--- linux.orig/kernel/jump_label.c
+++ linux/kernel/jump_label.c
@@ -16,6 +16,18 @@
 
 #ifdef HAVE_JUMP_LABEL
 
+struct jump_entry *get_jump_label_start(void)
+{
+   return __start___jump_table;
+}
+EXPORT_SYMBOL_GPL(get_jump_label_start);
+
+struct jump_entry *get_jump_label_stop(void)
+{
+   return __stop___jump_table;
+}
+EXPORT_SYMBOL_GPL(get_jump_label_stop);
+
 /* mutex to protect coming/going of the the jump_label table */
 static DEFINE_MUTEX(jump_label_mutex);
 
-
test.c:

/*
 * Copyright 2013 - Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
 *
 * GPLv2 license.
 */

#include <linux/module.h>
#include <linux/mm.h>
#include <linux/kernel.h>
#include <linux/jump_label.h>
#include <linux/kallsyms.h>

void print_static_jumps(void)
{
struct jump_entry *iter_start = get_jump_label_start();
struct jump_entry *iter_stop = get_jump_label_stop();
struct jump_entry *iter;

for (iter = iter_start; iter < iter_stop; iter++) {
char symbol[KSYM_SYMBOL_LEN] = "";

if (sprint_symbol(symbol, iter->code) == 0) {
WARN_ON_ONCE(1);
}
printk("Jump label: addr: %lx symbol: %s ilen: %lu hits: %lu\n",
(unsigned long) iter->code, symbol,
(unsigned long) iter->jmp_insn_len,
(unsigned long) iter->nr_hit);
}
}

int initfct(void)
{
print_static_jumps();

return -EPERM;
}

module_init(initfct);

MODULE_LICENSE("GPL");

--

Results sorted by reverse number of hits, after boot + starting firefox,
200s after boot:

Jump label: addr: 810d9805 symbol: balance_dirty_pages_ratelimited+0x425/0x9c0 ilen: 5 hits: 814700
Jump label: addr: 81138135 symbol: writeback_sb_inodes+0x195/0x4a0 ilen: 5 hits: 752021
Jump label: addr: 8103695e symbol: call_console_drivers.constprop.13+0xe/0x140 ilen: 5 hits: 726153
Jump label: addr: 8103ce11 symbol: __do_softirq+0xe1/0x310 ilen: 5 hits: 724803
Jump label: addr: 810e07a6 symbol: shrink_inactive_list+0x2e6/0x420 ilen: 2 hits: 328701
Jump label: addr: 810dfdad symbol: shrink_page_list+0x56d/0x8c0 ilen: 5 hits: 315157
Jump label: addr: 810e9510 symbol: congestion_wait+0xb0/0x170 ilen: 2 hits: 241653
Jump label: addr: 810deb0f symbol: shrink_slab+0x1df/0x390 ilen: 5 hits: 231215
Jump label: addr: 810d9bd7 symbol: balance_dirty_pages_ratelimited+0x7f7/0x9c0 ilen: 5 hits: 166859

Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-07 Thread Steven Rostedt
On Wed, 2013-08-07 at 12:03 -0400, Mathieu Desnoyers wrote:

> You might want to try creating a global array of counters (accessible
> both from C for printout and assembly for update).
> 
> Index the array from assembly using:   (2f - 1f)
> 
> 1:
> jmp ...;
> 2:
> 
> And put an atomic increment of the counter. This increment instruction
> should be located prior to the jmp for obvious reasons.
> 
> You'll end up with the sums you're looking for at indexes 2 and 5 of the
> array.

After I post the patches, feel free to knock yourself out.

-- Steve




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-07 Thread Mathieu Desnoyers
* Steven Rostedt (rost...@goodmis.org) wrote:
> On Wed, 2013-08-07 at 07:06 +0200, Ondřej Bílka wrote:
> 
> > Add short_counter and long_counter, and increment the corresponding
> > counter before each jump. That way we will know how many short/long
> > jumps were taken.
> 
> That's not trivial at all. The jump is a single location (in an asm
> goto() statement) that happens to be inlined throughout the kernel. The
> assembler decides if it will be a short or long jump. How do you add a
> counter to count the difference?

You might want to try creating a global array of counters (accessible
both from C for printout and assembly for update).

Index the array from assembly using:   (2f - 1f)

1:
jmp ...;
2:

And put an atomic increment of the counter. This increment instruction
should be located prior to the jmp for obvious reasons.

You'll end up with the sums you're looking for at indexes 2 and 5 of the
array.

Thanks,

Mathieu

> 
> The output I gave is from the boot up code that converts the jmp back to
> a nop (or in this case, the default nop to the ideal nop). It knows the
> size by reading the op code. This is a static analysis, not a running
> one. It's no trivial task to have a counter for each jump.
> 
> There is a way though. If we enable all the jumps (all tracepoints, and
> other users of jumplabel), record the trace and then compare the trace
> to the output that shows which ones were short jumps, and all others are
> long jumps.
> 
> I'll post the patches soon and you can have fun doing the compare :-)
> 
> Actually, I'm working on the 4 patches of the series that is more about
> clean ups and safety checks than the jmp conversion. That is not
> controversial, and I'll be posting them for 3.12 soon.
> 
> After that, I'll post the updated patches that have the conversion as
> well as the counter, for RFC and for others to play with.
> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-07 Thread Steven Rostedt
On Wed, 2013-08-07 at 07:06 +0200, Ondřej Bílka wrote:

> Add short_counter and long_counter, and increment the corresponding
> counter before each jump. That way we will know how many short/long
> jumps were taken.

That's not trivial at all. The jump is a single location (in an asm
goto() statement) that happens to be inlined throughout the kernel. The
assembler decides if it will be a short or long jump. How do you add a
counter to count the difference?

The output I gave is from the boot up code that converts the jmp back to
a nop (or in this case, the default nop to the ideal nop). It knows the
size by reading the op code. This is a static analysis, not a running
one. It's no trivial task to have a counter for each jump.
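
A sketch of that size check (hypothetical helper, not the actual boot code):
the first opcode byte of the emitted instruction distinguishes the two
encodings.

/* 0xeb = "jmp rel8" (2 bytes), 0xe9 = "jmp rel32" (5 bytes) */
static int jump_insn_size(const unsigned char *insn)
{
	if (insn[0] == 0xeb)
		return 2;
	if (insn[0] == 0xe9)
		return 5;
	return -1;	/* unexpected opcode, e.g. a prefix byte */
}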

There is a way though. If we enable all the jumps (all tracepoints, and
other users of jumplabel), record the trace and then compare the trace
to the output that shows which ones were short jumps, and all others are
long jumps.

I'll post the patches soon and you can have fun doing the compare :-)

Actually, I'm working on the 4 patches of the series that is more about
clean ups and safety checks than the jmp conversion. That is not
controversial, and I'll be posting them for 3.12 soon.

After that, I'll post the updated patches that have the conversion as
well as the counter, for RFC and for others to play with.

-- Steve



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Ondřej Bílka
On Tue, Aug 06, 2013 at 08:56:00PM -0400, Steven Rostedt wrote:
> On Tue, 2013-08-06 at 20:45 -0400, Steven Rostedt wrote:
> 
> > [3.387362] short jumps: 106
> > [3.390277]  long jumps: 330
> > 
> > Thus, approximately 25%. Not bad.
> 
> Also, where these happen to be is probably even more important than how
> many. If all the short jumps happen in slow paths, it's rather
> pointless. But they seem to be in some rather hot paths. I had it print
> out where it placed the short jumps too:
> 
 
> The kmem_cache_* and the try_to_wake_up* are the hot paths that caught
> my eye.
> 
> But still, is this worth it?
>
Add short_counter and long_counter, and increment the corresponding
counter before each jump. That way we will know how many short/long
jumps were taken.
> -- Steve
> 



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Steven Rostedt
On Tue, 2013-08-06 at 20:45 -0400, Steven Rostedt wrote:

> [3.387362] short jumps: 106
> [3.390277]  long jumps: 330
> 
> Thus, approximately 25%. Not bad.

Also, where these happen to be is probably even more important than how
many. If all the short jumps happen in slow paths, it's rather
pointless. But they seem to be in some rather hot paths. I had it print
out where it placed the short jumps too:

[0.00] short jump at: place_entity+0x53/0x87 8106e139
[0.00] short jump at: place_entity+0x17/0x87 8106e0fd
[0.00] short jump at: check_preempt_wakeup+0x11c/0x16e 8106f92b
[0.00] short jump at: can_migrate_task+0xc6/0x15d 8106e72e
[0.00] short jump at: update_group_power+0x72/0x1df 81070394
[0.00] short jump at: update_group_power+0xaf/0x1df 810703d1
[0.00] short jump at: hrtick_enabled+0x4/0x35 8106de51
[0.00] short jump at: task_tick_fair+0x5c/0xf9 81070102
[0.00] short jump at: source_load+0x27/0x40 8106da7c
[0.00] short jump at: target_load+0x27/0x40 8106dabc
[0.00] short jump at: try_to_wake_up+0x127/0x1e2 8106b1d4
[0.00] short jump at: build_sched_domains+0x219/0x90b 8106bc24
[0.00] short jump at: smp_trace_call_function_single_interrupt+0x79/0x112 8102616f
[0.00] short jump at: smp_trace_call_function_interrupt+0x7a/0x111 81026038
[0.00] short jump at: smp_trace_error_interrupt+0x72/0x109 81028c9e
[0.00] short jump at: smp_trace_spurious_interrupt+0x71/0x107 81028b77
[0.00] short jump at: smp_trace_reschedule_interrupt+0x7a/0x110 81025f01
[0.00] short jump at: __raise_softirq_irqoff+0xf/0x90 810406e0
[0.00] short jump at: it_real_fn+0x17/0xb2 8103ed85
[0.00] short jump at: trace_itimer_state+0x13/0x97 8103e9ff
[0.00] short jump at: debug_deactivate+0xa/0x7a 8106014d
[0.00] short jump at: debug_activate+0x10/0x86 810478c7
[0.00] short jump at: __send_signal+0x233/0x268 8104a6bb
[0.00] short jump at: send_sigqueue+0x103/0x148 8104bbbf
[0.00] short jump at: trace_workqueue_activate_work+0xa/0x7a 81053deb
[0.00] short jump at: _rcu_barrier_trace+0x31/0xbc 810b8f81
[0.00] short jump at: trace_rcu_dyntick+0x14/0x8f 810ba3a2
[0.00] short jump at: rcu_implicit_dynticks_qs+0x95/0xc4 810ba35f
[0.00] short jump at: rcu_implicit_dynticks_qs+0x47/0xc4 810ba311
[0.00] short jump at: trace_rcu_future_gp.isra.38+0x46/0xe9 810b91e8
[0.00] short jump at: trace_rcu_grace_period+0x14/0x8f 810b90d3
[0.00] short jump at: trace_rcu_utilization+0xa/0x7a 810b9a6b
[0.00] short jump at: update_curr+0x89/0x14f 8106f4c9
[0.00] short jump at: update_stats_wait_end+0x5a/0xda 8106f203
[0.00] short jump at: delayed_put_task_struct+0x1b/0x95 8103c798
[0.00] short jump at: trace_module_get+0x10/0x86 81096b44
[0.00] short jump at: pm_qos_update_flags+0xc5/0x149 81076fa0
[0.00] short jump at: pm_qos_update_request+0x51/0xf3 81076b1e
[0.00] short jump at: pm_qos_add_request+0xb7/0x14e 81076db9
[0.00] short jump at: wakeup_source_report_event+0x7b/0xfc 81323045
[0.00] short jump at: trace_rpm_return_int+0x14/0x8f 81323d3d
[0.00] short jump at: __activate_page+0xdd/0x183 810f8a1d
[0.00] short jump at: __pagevec_lru_add_fn+0x139/0x1c4 810f88b5
[0.00] short jump at: shrink_inactive_list+0x364/0x400 810fcee8
[0.00] short jump at: isolate_lru_pages.isra.57+0xb6/0x14a 810fbafb
[0.00] short jump at: wakeup_kswapd+0xaf/0x14a 810fbd20
[0.00] short jump at: free_hot_cold_page_list+0x2a/0xca 810f3d1e
[0.00] short jump at: kmem_cache_free+0x74/0xee 81129f9a
[0.00] short jump at: kmem_cache_alloc_node+0xe6/0x17b 8112afb1
[0.00] short jump at: kmem_cache_alloc_node_trace+0xe1/0x176 8112b615
[0.00] short jump at: kmem_cache_alloc+0xd8/0x168 8112c1fe
[0.00] short jump at: trace_kmalloc+0x21/0xac 8112aa7e
[0.00] short jump at: wait_iff_congested+0xdc/0x158 81105ee3
[0.00] short jump at: congestion_wait+0xa6/0x122 81106005
[0.00] short jump at: global_dirty_limits+0xd7/0x151 810f5f74
[0.00] short jump at: queue_io+0x165/0x1e6 811568ec
[0.00] short jump at: bdi_register+0xe9/0x161 81106329
[0.00] short jump at: bdi_start_background_writeback+0xf/0x9c 

Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Steven Rostedt
On Tue, 2013-08-06 at 16:43 -0400, Steven Rostedt wrote:
> On Tue, 2013-08-06 at 16:33 -0400, Mathieu Desnoyers wrote:
> 
> > Steve, perhaps you could add a mode to your binary rewriting program
> > that counts the number of 2-byte vs 5-byte jumps found, and if possible
> > get a breakdown of those per subsystem ?
> 
> I actually started doing that, as I was curious as to how many were being
> changed as well.

I didn't add it to the update program as that runs on each individual
object (needs to handle modules). But I put in the start up code a
counter to see what types were converted:

[3.387362] short jumps: 106
[3.390277]  long jumps: 330

Thus, approximately 25%. Not bad.

-- Steve




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Steven Rostedt
On Tue, 2013-08-06 at 16:33 -0400, Mathieu Desnoyers wrote:

> Steve, perhaps you could add a mode to your binary rewriting program
> that counts the number of 2-byte vs 5-byte jumps found, and if possible
> get a breakdown of those per subsystem ?

I actually started doing that, as I was curious as to how many were being
changed as well.

Note, this is low on my priority list, so I work on it as I get time.

-- Steve






Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Mathieu Desnoyers
* Steven Rostedt (rost...@goodmis.org) wrote:
> On Tue, 2013-08-06 at 10:48 -0700, Linus Torvalds wrote:
> 
> > So I wonder if this is a "ok, let's not bother, it's not worth the
> > pain" issue. 128 bytes of offset is very small, so there probably
> > aren't all that many cases that would use it.
> 
> OK, I'll forward port the original patches for the hell of it anyway,
> and post it as an RFC. Let people play with it if they want, and if it
> seems like it would benefit the kernel perhaps we can reconsider.
> 
> It shouldn't be too hard to do the forward port, and if we don't ever
> take it, it would be a fun exercise regardless ;-)
> 
> Actually, the first three patches should be added as they are clean ups
> and safety checks. Nothing to do with the actual 2-5 byte jumps. They
> were lost due to their association with the complex patches. :-/

Steve, perhaps you could add a mode to your binary rewriting program
that counts the number of 2-byte vs 5-byte jumps found, and if possible
get a breakdown of those per subsystem? It might help us get a
clearer picture of how many important sites, insn cache-wise, are being
shrunk by this approach.
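
A rough sketch of such a counting mode (addr_to_text() is an assumed helper
mapping a recorded code address to the instruction bytes being scanned;
struct jump_entry is the __jump_table record shown elsewhere in this
thread): tally the opcode-determined jump length at every site.

static void count_jump_sizes(struct jump_entry *table,
			     struct jump_entry *table_end,
			     size_t *n_short, size_t *n_long)
{
	struct jump_entry *e;

	for (e = table; e < table_end; e++) {
		const unsigned char *insn = addr_to_text(e->code);

		if (insn[0] == 0xeb)		/* 2-byte "jmp rel8" */
			(*n_short)++;
		else if (insn[0] == 0xe9)	/* 5-byte "jmp rel32" */
			(*n_long)++;
	}
}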

Thanks,

Mathieu

> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Steven Rostedt
On Tue, 2013-08-06 at 10:48 -0700, Linus Torvalds wrote:

> So I wonder if this is a "ok, let's not bother, it's not worth the
> pain" issue. 128 bytes of offset is very small, so there probably
> aren't all that many cases that would use it.

OK, I'll forward port the original patches for the hell of it anyway,
and post it as an RFC. Let people play with it if they want, and if it
seems like it would benefit the kernel perhaps we can reconsider.

It shouldn't be too hard to do the forward port, and if we don't ever
take it, it would be a fun exercise regardless ;-)

Actually, the first three patches should be added as they are clean ups
and safety checks. Nothing to do with the actual 2-5 byte jumps. They
were lost due to their association with the complex patches. :-/

-- Steve




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Linus Torvalds
On Tue, Aug 6, 2013 at 7:19 AM, Steven Rostedt  wrote:
>
> After playing with the patches again, I now understand why I did that.
> It wasn't just for optimization.

[explanation snipped]

> Anyway, if you feel that update_jump_label is too complex, I can go the
> "update at early boot" route and see how that goes.

Ugh. I'd love to see short jumps, but I do dislike binary rewriting,
and doing it at early boot seems really quite scary too.

So I wonder if this is a "ok, let's not bother, it's not worth the
pain" issue. 128 bytes of offset is very small, so there probably
aren't all that many cases that would use it.

 Linus


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread H. Peter Anvin
On 08/06/2013 09:26 AM, Steven Rostedt wrote:
>>
>> No, but if we ever end up doing MPX in the kernel, for example, we would
>> have to put an MPX prefix on the jmp.
> 
> Well then we just have to update the rest of the jump label code :-)
> 

For MPX in the kernel, this would be a small part of the work...!

-hpa



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Steven Rostedt
On Tue, 2013-08-06 at 09:19 -0700, H. Peter Anvin wrote:
> On 08/06/2013 09:15 AM, Steven Rostedt wrote:
> > On Mon, 2013-08-05 at 14:43 -0700, H. Peter Anvin wrote:
> > 
> >> For unconditional jmp that should be pretty safe barring any fundamental
> >> changes to the instruction set, in which case we can enable it as
> >> needed, but for extra robustness it probably should skip prefix bytes.
> > 
> > Would the assembler add prefix bytes to:
> > 
> > jmp 1f
> > 
> 
> No, but if we ever end up doing MPX in the kernel, for example, we would
> have to put an MPX prefix on the jmp.

Well then we just have to update the rest of the jump label code :-)

-- Steve




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread H. Peter Anvin
On 08/06/2013 09:15 AM, Steven Rostedt wrote:
> On Mon, 2013-08-05 at 14:43 -0700, H. Peter Anvin wrote:
> 
>> For unconditional jmp that should be pretty safe barring any fundamental
>> changes to the instruction set, in which case we can enable it as
>> needed, but for extra robustness it probably should skip prefix bytes.
> 
> Would the assembler add prefix bytes to:
> 
>   jmp 1f
> 

No, but if we ever end up doing MPX in the kernel, for example, we would
have to put an MPX prefix on the jmp.

-hpa




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Steven Rostedt
On Mon, 2013-08-05 at 14:43 -0700, H. Peter Anvin wrote:

> For unconditional jmp that should be pretty safe barring any fundamental
> changes to the instruction set, in which case we can enable it as
> needed, but for extra robustness it probably should skip prefix bytes.

Would the assembler add prefix bytes to:

jmp 1f

1:

??

-- Steve




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Steven Rostedt
On Mon, 2013-08-05 at 11:49 -0700, Linus Torvalds wrote:

> Ugh. Why the crazy update_jump_label script stuff?

After playing with the patches again, I now understand why I did that.
It wasn't just for optimization.

Currently the way jump labels work is that we use asm goto() and place a
5 byte nop in the assembly, with some labels. The location of the nop is
stored in the __jump_table section.

In order to use either 2 or 5 byte jumps, I had to put in the actual
jump and let the assembler place the correct op code in. This changes
the default switch for jump labels. Instead of being default off, it is
now default on. To handle this, I had to convert all the jumps back to
nops before the kernel runs. This was done at compile time with the
update_jump_label script/program.
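
A minimal sketch of that conversion step (not the actual update_jump_label
tool): each site's assembler-chosen jmp is overwritten with a same-length
nop, so the keys still start out disabled.

#include <string.h>

static const unsigned char nop2[2] = { 0x66, 0x90 };
static const unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

static void nop_out_site(unsigned char *insn)
{
	if (insn[0] == 0xeb)			/* 2-byte "jmp rel8" */
		memcpy(insn, nop2, sizeof(nop2));
	else if (insn[0] == 0xe9)		/* 5-byte "jmp rel32" */
		memcpy(insn, nop5, sizeof(nop5));
}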

Now, we can just do the update in early boot, but is this the best way?
This means that the update must happen before any jump label is used.
This may not be an issue, but as jump labels can be used for anything
(not just tracing), it may be hard to know when the first instance is
actually used.

Also, if there is any issue with the op codes as Mathieu has been
pointing out, it would only be caught at run time (boot time).


The update_jump_label program isn't really that complex. Yes it parses
the elf tables, but that's rather standard and that method is used by
ftrace with the mcount locations (instead of that nasty daemon). It
finds the __jump_table section and runs down the list of locations just
like the boot up code does, and modifies the jumps to nops. If the
compiler does something strange, it would be caught at compile time not
boot time.

Anyway, if you feel that update_jump_label is too complex, I can go the
"update at early boot" route and see how that goes.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Steven Rostedt
On Tue, 2013-08-06 at 09:19 -0700, H. Peter Anvin wrote:
 On 08/06/2013 09:15 AM, Steven Rostedt wrote:
  On Mon, 2013-08-05 at 14:43 -0700, H. Peter Anvin wrote:
  
  For unconditional jmp that should be pretty safe barring any fundamental
  changes to the instruction set, in which case we can enable it as
  needed, but for extra robustness it probably should skip prefix bytes.
  
  Would the assembler add prefix bytes to:
  
  jmp 1f
  
 
 No, but if we ever end up doing MPX in the kernel, for example, we would
 have to put an MPX prefix on the jmp.

Well then we just have to update the rest of the jump label code :-)

-- Steve


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread H. Peter Anvin
On 08/06/2013 09:26 AM, Steven Rostedt wrote:

 No, but if we ever end up doing MPX in the kernel, for example, we would
 have to put an MPX prefix on the jmp.
 
 Well then we just have to update the rest of the jump label code :-)
 

For MPX in the kernel, this would be a small part of the work...!

-hpa

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Linus Torvalds
On Tue, Aug 6, 2013 at 7:19 AM, Steven Rostedt rost...@goodmis.org wrote:

 After playing with the patches again, I now understand why I did that.
 It wasn't just for optimization.

[explanation snipped]

 Anyway, if you feel that update_jump_label is too complex, I can go the
 update at early boot route and see how that goes.

Ugh. I'd love to see short jumps, but I do dislike binary rewriting,
and doing it at early boot seems really quite scary too.

So I wonder if this is a "ok, let's not bother, it's not worth the
pain" issue. 128 bytes of offset is very small, so there probably
aren't all that many cases that would use it.
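
For reference, a "jmp rel8" only reaches targets within -128..+127 bytes of
the following instruction. The encodings in question, sketched here as C
byte arrays purely for illustration (not the kernel's actual nop tables):

	static const unsigned char jmp_rel8[]  = { 0xeb, 0x00 };			/* 2-byte jmp */
	static const unsigned char jmp_rel32[] = { 0xe9, 0x00, 0x00, 0x00, 0x00 };	/* 5-byte jmp */
	static const unsigned char nop2[]      = { 0x66, 0x90 };			/* 2-byte nop */
	static const unsigned char nop5[]      = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };	/* 5-byte nop */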

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Steven Rostedt
On Tue, 2013-08-06 at 10:48 -0700, Linus Torvalds wrote:

 So I wonder if this is a ok, let's not bother, it's not worth the
 pain issue. 128 bytes of offset is very small, so there probably
 aren't all that many cases that would use it.

OK, I'll forward port the original patches for the hell of it anyway,
and post it as an RFC. Let people play with it if they want, and if it
seems like it would benefit the kernel perhaps we can reconsider.

It shouldn't be too hard to do the forward port, and if we don't ever
take it, it would be a fun exercise regardless ;-)

Actually, the first three patches should be added as they are clean ups
and safety checks. Nothing to do with the actual 2-5 byte jumps. They
were lost due to their association with the complex patches. :-/

-- Steve


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Mathieu Desnoyers
* Steven Rostedt (rost...@goodmis.org) wrote:
 On Tue, 2013-08-06 at 10:48 -0700, Linus Torvalds wrote:
 
  So I wonder if this is a ok, let's not bother, it's not worth the
  pain issue. 128 bytes of offset is very small, so there probably
  aren't all that many cases that would use it.
 
 OK, I'll forward port the original patches for the hell of it anyway,
 and post it as an RFC. Let people play with it if they want, and if it
 seems like it would benefit the kernel perhaps we can reconsider.
 
 It shouldn't be too hard to do the forward port, and if we don't ever
 take it, it would be a fun exercise regardless ;-)
 
 Actually, the first three patches should be added as they are clean ups
 and safety checks. Nothing to do with the actual 2-5 byte jumps. They
 were lost due to their association with the complex patches. :-/

Steve, perhaps you could add a mode to your binary rewriting program
that counts the number of 2-byte vs 5-byte jumps found, and if possible
get a breakdown of those per subsystem? It might help us get a
clearer picture of how many important sites, insn cache-wise, are being
shrunk by this approach.

Thanks,

Mathieu

 
 -- Steve
 
 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Steven Rostedt
On Tue, 2013-08-06 at 16:33 -0400, Mathieu Desnoyers wrote:

 Steve, perhaps you could add a mode to your binary rewriting program
 that counts the number of 2-byte vs 5-byte jumps found, and if possible
 get a breakdown of those per subsystem ?

I actually started doing that, as I was curious to how many were being
changed as well.

Note, this is low on my priority list, so I work on it as I get time.

-- Steve




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Steven Rostedt
On Tue, 2013-08-06 at 16:43 -0400, Steven Rostedt wrote:
 On Tue, 2013-08-06 at 16:33 -0400, Mathieu Desnoyers wrote:
 
  Steve, perhaps you could add a mode to your binary rewriting program
  that counts the number of 2-byte vs 5-byte jumps found, and if possible
  get a breakdown of those per subsystem ?
 
 I actually started doing that, as I was curious to how many were being
 changed as well.

I didn't add it to the update program as that runs on each individual
object (needs to handle modules). But I put in the start up code a
counter to see what types were converted:

[3.387362] short jumps: 106
[3.390277]  long jumps: 330

Thus, approximately 25%. Not bad.
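
The counter itself is nothing fancy - roughly this shape, keyed off the same
opcodes the conversion code already dispatches on (a sketch, not the actual
diff):

	static unsigned long nr_short, nr_long;

	/* called on each __jump_table site before it gets converted */
	static void count_jump(const unsigned char *op)
	{
		if (op[0] == 0xeb)		/* 2-byte jmp rel8 */
			nr_short++;
		else if (op[0] == 0xe9)		/* 5-byte jmp rel32 */
			nr_long++;
	}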

-- Steve


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Steven Rostedt
On Tue, 2013-08-06 at 20:45 -0400, Steven Rostedt wrote:

 [3.387362] short jumps: 106
 [3.390277]  long jumps: 330
 
 Thus, approximately 25%. Not bad.

Also, where these happen to be is probably even more important than how
many. If all the short jumps happen in slow paths, it's rather
pointless. But they seem to be in some rather hot paths. I had it print
out where it placed the short jumps too:

[0.00] short jump at: place_entity+0x53/0x87 8106e139
[0.00] short jump at: place_entity+0x17/0x87 8106e0fd
[0.00] short jump at: check_preempt_wakeup+0x11c/0x16e 8106f92b
[0.00] short jump at: can_migrate_task+0xc6/0x15d 8106e72e
[0.00] short jump at: update_group_power+0x72/0x1df 81070394
[0.00] short jump at: update_group_power+0xaf/0x1df 810703d1
[0.00] short jump at: hrtick_enabled+0x4/0x35 8106de51
[0.00] short jump at: task_tick_fair+0x5c/0xf9 81070102
[0.00] short jump at: source_load+0x27/0x40 8106da7c
[0.00] short jump at: target_load+0x27/0x40 8106dabc
[0.00] short jump at: try_to_wake_up+0x127/0x1e2 8106b1d4
[0.00] short jump at: build_sched_domains+0x219/0x90b 8106bc24
[0.00] short jump at: smp_trace_call_function_single_interrupt+0x79/0x112 8102616f
[0.00] short jump at: smp_trace_call_function_interrupt+0x7a/0x111 81026038
[0.00] short jump at: smp_trace_error_interrupt+0x72/0x109 81028c9e
[0.00] short jump at: smp_trace_spurious_interrupt+0x71/0x107 81028b77
[0.00] short jump at: smp_trace_reschedule_interrupt+0x7a/0x110 81025f01
[0.00] short jump at: __raise_softirq_irqoff+0xf/0x90 810406e0
[0.00] short jump at: it_real_fn+0x17/0xb2 8103ed85
[0.00] short jump at: trace_itimer_state+0x13/0x97 8103e9ff
[0.00] short jump at: debug_deactivate+0xa/0x7a 8106014d
[0.00] short jump at: debug_activate+0x10/0x86 810478c7
[0.00] short jump at: __send_signal+0x233/0x268 8104a6bb
[0.00] short jump at: send_sigqueue+0x103/0x148 8104bbbf
[0.00] short jump at: trace_workqueue_activate_work+0xa/0x7a 81053deb
[0.00] short jump at: _rcu_barrier_trace+0x31/0xbc 810b8f81
[0.00] short jump at: trace_rcu_dyntick+0x14/0x8f 810ba3a2
[0.00] short jump at: rcu_implicit_dynticks_qs+0x95/0xc4 810ba35f
[0.00] short jump at: rcu_implicit_dynticks_qs+0x47/0xc4 810ba311
[0.00] short jump at: trace_rcu_future_gp.isra.38+0x46/0xe9 810b91e8
[0.00] short jump at: trace_rcu_grace_period+0x14/0x8f 810b90d3
[0.00] short jump at: trace_rcu_utilization+0xa/0x7a 810b9a6b
[0.00] short jump at: update_curr+0x89/0x14f 8106f4c9
[0.00] short jump at: update_stats_wait_end+0x5a/0xda 8106f203
[0.00] short jump at: delayed_put_task_struct+0x1b/0x95 8103c798
[0.00] short jump at: trace_module_get+0x10/0x86 81096b44
[0.00] short jump at: pm_qos_update_flags+0xc5/0x149 81076fa0
[0.00] short jump at: pm_qos_update_request+0x51/0xf3 81076b1e
[0.00] short jump at: pm_qos_add_request+0xb7/0x14e 81076db9
[0.00] short jump at: wakeup_source_report_event+0x7b/0xfc 81323045
[0.00] short jump at: trace_rpm_return_int+0x14/0x8f 81323d3d
[0.00] short jump at: __activate_page+0xdd/0x183 810f8a1d
[0.00] short jump at: __pagevec_lru_add_fn+0x139/0x1c4 810f88b5
[0.00] short jump at: shrink_inactive_list+0x364/0x400 810fcee8
[0.00] short jump at: isolate_lru_pages.isra.57+0xb6/0x14a 810fbafb
[0.00] short jump at: wakeup_kswapd+0xaf/0x14a 810fbd20
[0.00] short jump at: free_hot_cold_page_list+0x2a/0xca 810f3d1e
[0.00] short jump at: kmem_cache_free+0x74/0xee 81129f9a
[0.00] short jump at: kmem_cache_alloc_node+0xe6/0x17b 8112afb1
[0.00] short jump at: kmem_cache_alloc_node_trace+0xe1/0x176 8112b615
[0.00] short jump at: kmem_cache_alloc+0xd8/0x168 8112c1fe
[0.00] short jump at: trace_kmalloc+0x21/0xac 8112aa7e
[0.00] short jump at: wait_iff_congested+0xdc/0x158 81105ee3
[0.00] short jump at: congestion_wait+0xa6/0x122 81106005
[0.00] short jump at: global_dirty_limits+0xd7/0x151 810f5f74
[0.00] short jump at: queue_io+0x165/0x1e6 811568ec
[0.00] short jump at: bdi_register+0xe9/0x161 81106329
[0.00] short jump at: bdi_start_background_writeback+0xf/0x9c

Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Ondřej Bílka
On Tue, Aug 06, 2013 at 08:56:00PM -0400, Steven Rostedt wrote:
 On Tue, 2013-08-06 at 20:45 -0400, Steven Rostedt wrote:
 
  [3.387362] short jumps: 106
  [3.390277]  long jumps: 330
  
  Thus, approximately 25%. Not bad.
 
 Also, where these happen to be is probably even more important than how
 many. If all the short jumps happen in slow paths, it's rather
 pointless. But they seem to be in some rather hot paths. I had it print
 out where it placed the short jumps too:
 
 
 The kmem_cache_* and the try_to_wake_up* are the hot paths that caught
 my eye.
 
 But still, is this worth it?

Add short_counter and long_counter, and increment the appropriate counter
before each jump. That way we will know how many short/long jumps were taken.
 -- Steve
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 09:14 PM, Mathieu Desnoyers wrote:
>>
>> For unconditional jmp that should be pretty safe barring any fundamental
>> changes to the instruction set, in which case we can enable it as
>> needed, but for extra robustness it probably should skip prefix bytes.
> 
> On x86-32, some prefixes are actually meaningful. AFAIK, the 0x66 prefix
> is used for:
> 
> E9 cw   jmp rel16   relative jump, only in 32-bit
> 
> Other prefixes can probably be safely skipped.
> 

Yes.  Some of them are used as hints or for MPX.
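
Skipping them is mechanical enough - something along these lines (a sketch
only, not tested; and as you note, 0x66 actually changes the jmp encoding on
32-bit, so it would want special handling rather than a blind skip):

	static const unsigned char *skip_prefixes(const unsigned char *op)
	{
		for (;;) {
			switch (*op) {
			case 0x26: case 0x2e: case 0x36: case 0x3e:	/* segment overrides */
			case 0x64: case 0x65:				/* fs/gs overrides   */
			case 0x66:					/* operand size      */
			case 0x67:					/* address size      */
			case 0xf2: case 0xf3:				/* bnd/rep           */
				op++;
				continue;
			}
			return op;
		}
	}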

> Another question is whether anything prevents the assembler from
> generating a jump near (absolute indirect), or far jump. The code above
> seems to assume that we have either a short or near relative jump.

Absolutely something prevents!  It would be a very serious error for the
assembler to generate such instructions.

-hpa




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Mathieu Desnoyers
* H. Peter Anvin (h...@linux.intel.com) wrote:
> On 08/05/2013 02:28 PM, Mathieu Desnoyers wrote:
> > * Linus Torvalds (torva...@linux-foundation.org) wrote:
> >> On Mon, Aug 5, 2013 at 12:54 PM, Mathieu Desnoyers
> >>  wrote:
> >>>
> >>> I remember that choosing between 2 and 5 bytes nop in the asm goto was
> >>> tricky: it had something to do with the fact that gcc doesn't know the
> >>> exact size of each instructions until further down within compilation
> >>
> >> Oh, you can't do it in the compiler, no. But you don't need to. The
> >> assembler will pick the right version if you just do "jmp target".
> > 
> > Yep.
> > 
> > Another thing that bothers me with Steven's approach is that decoding
> > jumps generated by the compiler seems fragile IMHO.
> > 
> > x86 decoding proposed by https://lkml.org/lkml/2012/3/8/464 :
> > 
> > +static int make_nop_x86(void *map, size_t const offset)
> > +{
> > +   unsigned char *op;
> > +   unsigned char *nop;
> > +   int size;
> > +
> > +   /* Determine which type of jmp this is 2 byte or 5. */
> > +   op = map + offset;
> > +   switch (*op) {
> > +   case 0xeb: /* 2 byte */
> > +   size = 2;
> > +   nop = ideal_nop2_x86;
> > +   break;
> > +   case 0xe9: /* 5 byte */
> > +   size = 5;
> > +   nop = ideal_nop;
> > +   break;
> > +   default:
> > +   die(NULL, "Bad jump label section (bad op %x)\n", *op);
> > +   __builtin_unreachable();
> > +   }
> > 
> > My thought is that the code above does not cover all jump encodings that
> > can be generated by past, current and future x86 assemblers.
> > 
> 
> For unconditional jmp that should be pretty safe barring any fundamental
> changes to the instruction set, in which case we can enable it as
> needed, but for extra robustness it probably should skip prefix bytes.

On x86-32, some prefixes are actually meaningful. AFAIK, the 0x66 prefix
is used for:

E9 cw   jmp rel16   relative jump, only in 32-bit

Other prefixes can probably be safely skipped.

Another question is whether anything prevents the assembler from
generating a jump near (absolute indirect), or far jump. The code above
seems to assume that we have either a short or near relative jump.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Steven Rostedt
On Mon, 2013-08-05 at 22:26 -0400, Jason Baron wrote:

> I think if the 'cold' attribute on the default disabled static_key 
> branch moved the text completely out-of-line, it would satisfy your 
> requirement here?
> 
> If you like this approach, perhaps we can make something like this work 
> within gcc. As it's already supported, but doesn't quite go far enough 
> for our purposes.

It may not be too bad to use.

> 
> Also, if we go down this path, it means the 2-byte jump sequence is 
> probably not going to be too useful.

Don't count us out yet :-)


static inline bool arch_static_branch(struct static_key *key)
{
asm goto("1:"
[...]
: : "i" (key) : : l_yes);
return false;
l_yes:
goto __l_yes;
__l_yes: __attribute__((cold));
return false;
}

Or put that logic in the caller of arch_static_branch(). Basically, we
may be able to do a short jump to the place that will do a long jump to
the real work.

I'll have to play with this and see what gcc does with the output.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Jason Baron

On 08/05/2013 04:35 PM, Richard Henderson wrote:

On 08/05/2013 09:57 AM, Jason Baron wrote:

On 08/05/2013 03:40 PM, Marek Polacek wrote:

On Mon, Aug 05, 2013 at 11:34:55AM -0700, Linus Torvalds wrote:

On Mon, Aug 5, 2013 at 11:24 AM, Linus Torvalds
 wrote:

Ugh. I can see the attraction of your section thing for that case, I
just get the feeling that we should be able to do better somehow.

Hmm.. Quite frankly, Steven, for your use case I think you actually
want the C goto *labels* associated with a section. Which sounds like
it might be a cleaner syntax than making it about the basic block
anyway.

FWIW, we also support hot/cold attributes for labels, thus e.g.

if (bar ())
  goto A;
/* ... */
A: __attribute__((cold))
/* ... */

I don't know whether that might be useful for what you want or not though...

 Marek


It certainly would be.

That was how I wanted the 'static_key' stuff to work, but unfortunately the
last time I tried it, it didn't move the text out-of-line any further than it
was already doing. Would that be expected? The change for us, if it worked
would be quite simple. Something like:

It is expected.  One must use -freorder-blocks-and-partition, and use real
profile feedback to get blocks moved completely out-of-line.

Whether that's a sensible default or not is debatable.



Hi Steve,

I think if the 'cold' attribute on the default disabled static_key 
branch moved the text completely out-of-line, it would satisfy your 
requirement here?


If you like this approach, perhaps we can make something like this work 
within gcc. As it's already supported, but doesn't quite go far enough 
for our purposes.


Also, if we go down this path, it means the 2-byte jump sequence is 
probably not going to be too useful.


Thanks,

-Jason




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Mathieu Desnoyers
* Steven Rostedt (rost...@goodmis.org) wrote:
> On Mon, 2013-08-05 at 17:28 -0400, Mathieu Desnoyers wrote:
> 
[...]
> > My thought is that the code above does not cover all jump encodings that
> > can be generated by past, current and future x86 assemblers.
> > 
> > Another way around this issue might be to keep the instruction size
> > within a non-allocated section:
> > 
> > static __always_inline bool arch_static_branch(struct static_key *key)
> > {
> > asm goto("1:"
> > "jmp %l[l_yes]\n\t"
> > "2:"
> > 
> > ".pushsection __jump_table,  \"aw\" \n\t"
> > _ASM_ALIGN "\n\t"
> > _ASM_PTR "1b, %l[l_yes], %c0 \n\t"
> > ".popsection \n\t"
> > 
> > ".pushsection __jump_table_ilen \n\t"
> > _ASM_PTR "1b \n\t"  /* Address of the jmp */
> > ".byte 2b - 1b \n\t"/* Size of the jmp instruction */
> > ".popsection \n\t"
> > 
> > : :  "i" (key) : : l_yes);
> > return false;
> > l_yes:
> > return true;
> > }
> > 
> > And use (2b - 1b) to know what size of no-op should be used rather than
> > to rely on instruction decoding.
> > 
> > Thoughts ?
> > 
> 
> Then we need to add yet another table of information to the kernel that
> needs to hang around. This goes with another kernel-discuss request
> talking about kernel data bloat.

Perhaps this section could be simply removed by the post-link stage ?

Thanks,

Mathieu

> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 02:28 PM, Mathieu Desnoyers wrote:
> * Linus Torvalds (torva...@linux-foundation.org) wrote:
>> On Mon, Aug 5, 2013 at 12:54 PM, Mathieu Desnoyers
>>  wrote:
>>>
>>> I remember that choosing between 2 and 5 bytes nop in the asm goto was
>>> tricky: it had something to do with the fact that gcc doesn't know the
>>> exact size of each instructions until further down within compilation
>>
>> Oh, you can't do it in the compiler, no. But you don't need to. The
>> assembler will pick the right version if you just do "jmp target".
> 
> Yep.
> 
> Another thing that bothers me with Steven's approach is that decoding
> jumps generated by the compiler seems fragile IMHO.
> 
> x86 decoding proposed by https://lkml.org/lkml/2012/3/8/464 :
> 
> +static int make_nop_x86(void *map, size_t const offset)
> +{
> + unsigned char *op;
> + unsigned char *nop;
> + int size;
> +
> + /* Determine which type of jmp this is 2 byte or 5. */
> + op = map + offset;
> + switch (*op) {
> + case 0xeb: /* 2 byte */
> + size = 2;
> + nop = ideal_nop2_x86;
> + break;
> + case 0xe9: /* 5 byte */
> + size = 5;
> + nop = ideal_nop;
> + break;
> + default:
> + die(NULL, "Bad jump label section (bad op %x)\n", *op);
> + __builtin_unreachable();
> + }
> 
> My thought is that the code above does not cover all jump encodings that
> can be generated by past, current and future x86 assemblers.
> 

For unconditional jmp that should be pretty safe barring any fundamental
changes to the instruction set, in which case we can enable it as
needed, but for extra robustness it probably should skip prefix bytes.

-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Steven Rostedt
On Mon, 2013-08-05 at 17:28 -0400, Mathieu Desnoyers wrote:

> Another thing that bothers me with Steven's approach is that decoding
> jumps generated by the compiler seems fragile IMHO.

The encodings won't change. If they do, then old kernels will not run on
new hardware.

Now if it adds a third option to jmp, then we hit the "die" path and
know right away that it won't work anymore. Then we fix it properly.

> 
> x86 decoding proposed by https://lkml.org/lkml/2012/3/8/464 :
> 
> +static int make_nop_x86(void *map, size_t const offset)
> +{
> + unsigned char *op;
> + unsigned char *nop;
> + int size;
> +
> + /* Determine which type of jmp this is 2 byte or 5. */
> + op = map + offset;
> + switch (*op) {
> + case 0xeb: /* 2 byte */
> + size = 2;
> + nop = ideal_nop2_x86;
> + break;
> + case 0xe9: /* 5 byte */
> + size = 5;
> + nop = ideal_nop;
> + break;
> + default:
> + die(NULL, "Bad jump label section (bad op %x)\n", *op);
> + __builtin_unreachable();
> + }
> 
> My thought is that the code above does not cover all jump encodings that
> can be generated by past, current and future x86 assemblers.
> 
> Another way around this issue might be to keep the instruction size
> within a non-allocated section:
> 
> static __always_inline bool arch_static_branch(struct static_key *key)
> {
> asm goto("1:"
> "jmp %l[l_yes]\n\t"
> "2:"
> 
> ".pushsection __jump_table,  \"aw\" \n\t"
> _ASM_ALIGN "\n\t"
> _ASM_PTR "1b, %l[l_yes], %c0 \n\t"
> ".popsection \n\t"
> 
> ".pushsection __jump_table_ilen \n\t"
> _ASM_PTR "1b \n\t"  /* Address of the jmp */
> ".byte 2b - 1b \n\t"/* Size of the jmp instruction */
> ".popsection \n\t"
> 
> : :  "i" (key) : : l_yes);
> return false;
> l_yes:
> return true;
> }
> 
> And use (2b - 1b) to know what size of no-op should be used rather than
> to rely on instruction decoding.
> 
> Thoughts ?
> 

Then we need to add yet another table of information to the kernel that
needs to hang around. This goes with another kernel-discuss request
talking about kernel data bloat.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Mathieu Desnoyers
* Linus Torvalds (torva...@linux-foundation.org) wrote:
> On Mon, Aug 5, 2013 at 12:54 PM, Mathieu Desnoyers
>  wrote:
> >
> > I remember that choosing between 2 and 5 bytes nop in the asm goto was
> > tricky: it had something to do with the fact that gcc doesn't know the
> > exact size of each instructions until further down within compilation
> 
> Oh, you can't do it in the compiler, no. But you don't need to. The
> assembler will pick the right version if you just do "jmp target".

Yep.

Another thing that bothers me with Steven's approach is that decoding
jumps generated by the compiler seems fragile IMHO.

x86 decoding proposed by https://lkml.org/lkml/2012/3/8/464 :

+static int make_nop_x86(void *map, size_t const offset)
+{
+   unsigned char *op;
+   unsigned char *nop;
+   int size;
+
+   /* Determine which type of jmp this is 2 byte or 5. */
+   op = map + offset;
+   switch (*op) {
+   case 0xeb: /* 2 byte */
+   size = 2;
+   nop = ideal_nop2_x86;
+   break;
+   case 0xe9: /* 5 byte */
+   size = 5;
+   nop = ideal_nop;
+   break;
+   default:
+   die(NULL, "Bad jump label section (bad op %x)\n", *op);
+   __builtin_unreachable();
+   }

My thought is that the code above does not cover all jump encodings that
can be generated by past, current and future x86 assemblers.

Another way around this issue might be to keep the instruction size
within a non-allocated section:

static __always_inline bool arch_static_branch(struct static_key *key)
{
asm goto("1:"
"jmp %l[l_yes]\n\t"
"2:"

".pushsection __jump_table,  \"aw\" \n\t"
_ASM_ALIGN "\n\t"
_ASM_PTR "1b, %l[l_yes], %c0 \n\t"
".popsection \n\t"

".pushsection __jump_table_ilen \n\t"
_ASM_PTR "1b \n\t"  /* Address of the jmp */
".byte 2b - 1b \n\t"/* Size of the jmp instruction */
".popsection \n\t"

: :  "i" (key) : : l_yes);
return false;
l_yes:
return true;
}

And use (2b - 1b) to know what size of no-op should be used rather than
to rely on instruction decoding.
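
The consumer side of such a table would be tiny - e.g. something like this
in the tool (the struct just mirrors the _ASM_PTR + .byte pair above; the
names are only for illustration):

	struct jump_ilen {
		unsigned long addr;	/* _ASM_PTR 1b: address of the jmp */
		unsigned char len;	/* .byte 2b - 1b: 2 or 5           */
	} __attribute__((packed));

	static int nop_size_for(const struct jump_ilen *tab, unsigned long nr,
				unsigned long addr)
	{
		unsigned long i;

		for (i = 0; i < nr; i++)
			if (tab[i].addr == addr)
				return tab[i].len;
		return -1;		/* not a jump label site */
	}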

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Richard Henderson
On 08/05/2013 09:57 AM, Jason Baron wrote:
> On 08/05/2013 03:40 PM, Marek Polacek wrote:
>> On Mon, Aug 05, 2013 at 11:34:55AM -0700, Linus Torvalds wrote:
>>> On Mon, Aug 5, 2013 at 11:24 AM, Linus Torvalds
>>>  wrote:
 Ugh. I can see the attraction of your section thing for that case, I
 just get the feeling that we should be able to do better somehow.
>>> Hmm.. Quite frankly, Steven, for your use case I think you actually
>>> want the C goto *labels* associated with a section. Which sounds like
>>> it might be a cleaner syntax than making it about the basic block
>>> anyway.
>> FWIW, we also support hot/cold attributes for labels, thus e.g.
>>
>>if (bar ())
>>  goto A;
>>/* ... */
>> A: __attribute__((cold))
>>/* ... */
>>
>> I don't know whether that might be useful for what you want or not though...
>>
>> Marek
>>
> 
> It certainly would be.
> 
> That was how I wanted the 'static_key' stuff to work, but unfortunately the
> last time I tried it, it didn't move the text out-of-line any further than it
> was already doing. Would that be expected? The change for us, if it worked
> would be quite simple. Something like:

It is expected.  One must use -freorder-blocks-and-partition, and use real
profile feedback to get blocks moved completely out-of-line.

Whether that's a sensible default or not is debatable.


r~
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Jason Baron

On 08/05/2013 02:39 PM, Steven Rostedt wrote:

On Mon, 2013-08-05 at 11:20 -0700, Linus Torvalds wrote:


Of course, it would be good to optimize static_key_false() itself -
right now those static key jumps are always five bytes, and while they
get nopped out, it would still be nice if there was some way to have
just a two-byte nop (turning into a short branch) *if* we can reach
another jump that way..For small functions that would be lovely. Oh
well.

I had patches that did exactly this:

  https://lkml.org/lkml/2012/3/8/461

But it got dropped for some reason. I don't remember why. Maybe because
of the complexity?

-- Steve


Hi Steve,

I recall testing your patches and the text size increased unexpectedly. 
I believe I correctly accounted for changes to the text size *outside* 
of branch points. If you do re-visit the series that is one thing I'd 
like to double check/understand.


Thanks,

-Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Steven Rostedt
On Mon, 2013-08-05 at 12:57 -0700, Linus Torvalds wrote:
> On Mon, Aug 5, 2013 at 12:54 PM, Mathieu Desnoyers
>  wrote:
> >
> > I remember that choosing between 2 and 5 bytes nop in the asm goto was
> > tricky: it had something to do with the fact that gcc doesn't know the
> > exact size of each instructions until further down within compilation
> 
> Oh, you can't do it in the compiler, no. But you don't need to. The
> assembler will pick the right version if you just do "jmp target".

Right, and that's exactly what my patches did.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 12:54 PM, Mathieu Desnoyers
 wrote:
>
> I remember that choosing between 2 and 5 bytes nop in the asm goto was
> tricky: it had something to do with the fact that gcc doesn't know the
> exact size of each instructions until further down within compilation

Oh, you can't do it in the compiler, no. But you don't need to. The
assembler will pick the right version if you just do "jmp target".

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Jason Baron

On 08/05/2013 03:40 PM, Marek Polacek wrote:

On Mon, Aug 05, 2013 at 11:34:55AM -0700, Linus Torvalds wrote:

On Mon, Aug 5, 2013 at 11:24 AM, Linus Torvalds
 wrote:

Ugh. I can see the attraction of your section thing for that case, I
just get the feeling that we should be able to do better somehow.

Hmm.. Quite frankly, Steven, for your use case I think you actually
want the C goto *labels* associated with a section. Which sounds like
it might be a cleaner syntax than making it about the basic block
anyway.

FWIW, we also support hot/cold attributes for labels, thus e.g.

   if (bar ())
 goto A;
   /* ... */
A: __attribute__((cold))
   /* ... */

I don't know whether that might be useful for what you want or not though...

Marek



It certainly would be.

That was how I wanted the 'static_key' stuff to work, but 
unfortunately the last time I tried it, it didn't move the text 
out-of-line any further than it was already doing. Would that be 
expected? The change for us, if it worked would be quite simple. 
Something like:


--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -21,7 +21,7 @@ static __always_inline bool arch_static_branch(struct 
static_key *key)

".popsection \n\t"
: :  "i" (key) : : l_yes);
return false;
-l_yes:
+l_yes: __attribute__((cold))
return true;
 }




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 12:40 PM, Marek Polacek  wrote:
>
> FWIW, we also support hot/cold attributes for labels, thus e.g.
>
>   if (bar ())
> goto A;
>   /* ... */
> A: __attribute__((cold))
>   /* ... */
>
> I don't know whether that might be useful for what you want or not though...

Steve? That does sound like it might at least re-order the basic
blocks better for your cases. Worth checking out, no?

That said, I don't know what gcc actually does for that case. It may
be that it just ends up trying to transfer that "cold" information to
the conditional itself, which wouldn't work for our asm goto use. I
hope/assume it doesn't do that, though, since the "cold" attribute
would presumably also be useful for things like computed gotos etc -
so it really isn't about the _source_ of the branch, but about that
specific target, and the basic block re-ordering.

Anyway, the exact implementation details may make it more or less
useful for our special static key things. But it does sound like the
right thing to do for static keys.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Mathieu Desnoyers
* Linus Torvalds (torva...@linux-foundation.org) wrote:
[...]
> With two-byte jumps, you'd still get the I$ fragmentation (the
> argument generation and the call and the branch back would all be in
> the same code segment as the hot code), but that would be offset by
> the fact that at least the hot code itself could use a short jump when
> possible (ie a 2-byte nop rather than a 5-byte one).

I remember that choosing between 2- and 5-byte nops in the asm goto was
tricky: it had something to do with the fact that gcc doesn't know the
exact size of each instruction until further down within the compilation
phases on architectures with variable instruction size like x86. If we
have guarantees that the guessed size of each instruction is an upper
bound on the actual instruction size, this could probably work though.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Marek Polacek
On Mon, Aug 05, 2013 at 11:34:55AM -0700, Linus Torvalds wrote:
> On Mon, Aug 5, 2013 at 11:24 AM, Linus Torvalds
>  wrote:
> >
> > Ugh. I can see the attraction of your section thing for that case, I
> > just get the feeling that we should be able to do better somehow.
> 
> Hmm.. Quite frankly, Steven, for your use case I think you actually
> want the C goto *labels* associated with a section. Which sounds like
> it might be a cleaner syntax than making it about the basic block
> anyway.

FWIW, we also support hot/cold attributes for labels, thus e.g.

  if (bar ()) 
goto A;
  /* ... */
A: __attribute__((cold))
  /* ... */

I don't know whether that might be useful for what you want or not though...

Marek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Steven Rostedt
On Mon, 2013-08-05 at 11:49 -0700, Linus Torvalds wrote:
> On Mon, Aug 5, 2013 at 11:39 AM, Steven Rostedt  wrote:
> >
> > I had patches that did exactly this:
> >
> >  https://lkml.org/lkml/2012/3/8/461
> >
> > But it got dropped for some reason. I don't remember why. Maybe because
> > of the complexity?
> 
> Ugh. Why the crazy update_jump_label script stuff? I'd go "Eww" at
> that too, it looks crazy. The assembler already knows to make short
> 2-byte "jmp" instructions for near jumps, and you can just look at the
> opcode itself to determine size, why is all that other stuff required?

Hmm, I probably added that "optimization" in there because I was doing a
bunch of jump label work and just included it in. It's been over a year
since I've worked on this so I don't remember all the details. That
update_jump_label program may have just been to do the conversion of
nops at compile time and not during boot. It may not be needed. Also, it
was based on the record-mcount code that the function tracer uses, which
is also done at compile time, to get all the mcount locations.

> 
> IOW, 5/7 looks sane, but 4/7 makes me go "there's something wrong with
> that series".

I just quickly looked at the changes again. I think I can redo them and
send them again for 3.12. What do you think about keeping all but patch
4?

1 - Use a default nop at boot. I had help from hpa on this. Currently,
jump labels use a jmp instead of a nop on boot.

2 - On boot, the jump label nops (jump before patch 1) looks at the best
run time nop, and converts them. Since it is likely that the current
nop is already ideal, skip the conversion. Again, this is just a
boot up optimization.

3 - Add a test to see what we are converting. Adds safety checks like there
is in the function tracer, where if it updates a location, and does not
find what it expects to find, output a nasty bug.

< will skip patch 4 >

5 - Does what you want, with the 2 and 5 byte nops.

6 - When/if a failure does trigger, print out information about what went
wrong. Helps debugging splats caused by patch 3.

7 - needs to go before patch 3. As patch 3 can trigger if the default nop
is not the ideal nop for the box that is running.


If I take out patch 4, would that solution look fine for you? I can get
this ready for 3.12.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Xinliang David Li
On Mon, Aug 5, 2013 at 12:16 PM, Steven Rostedt  wrote:
> On Mon, 2013-08-05 at 12:04 -0700, Andi Kleen wrote:
>> Steven Rostedt  writes:
>>
>> Can't you just use -freorder-blocks-and-partition?
>
> Yeah, I'm familiar with this option.
>

This option works best with FDO.   FDOed linux kernel rocks :)

>>
>> This should already partition unlikely blocks into a
>> different section. Just a single one of course.
>>
>> FWIW the disadvantage is that multiple code sections tends
>> to break various older dwarf unwinders, as it needs
>> dwarf3 latest'n'greatest.
>
> If the option was so good, I would expect everyone would be using it ;-)
>

There were lots of problems with this option -- recently cleaned
up/fixed by Teresa in GCC trunk.

thanks,

David

>
> I'm mainly only concerned with the tracepoints. I'm asking to be able to
> do this with just the tracepoint code, and affect nobody else.
>
> -- Steve
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 12:04 PM, Andi Kleen  wrote:
> Steven Rostedt  writes:
>
> Can't you just use -freorder-blocks-and-partition?
>
> This should already partition unlikely blocks into a
> different section. Just a single one of course.

That's horrible. Not because of dwarf problems, but exactly because
unlikely code isn't necessarily *that* unlikely, and normal unlikely
code is reached with a small branch. Making it a whole different
section breaks both of those.

Maybe some "really_unlikely()" would make it ok.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Steven Rostedt
On Mon, 2013-08-05 at 12:04 -0700, Andi Kleen wrote:
> Steven Rostedt  writes:
> 
> Can't you just use -freorder-blocks-and-partition?

Yeah, I'm familiar with this option.

> 
> This should already partition unlikely blocks into a
> different section. Just a single one of course.
> 
> FWIW the disadvantage is that multiple code sections tends
> to break various older dwarf unwinders, as it needs
> dwarf3 latest'n'greatest.

If the option was so good, I would expect everyone would be using it ;-)


I'm mainly only concerned with the tracepoints. I'm asking to be able to
do this with just the tracepoint code, and affect nobody else.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Steven Rostedt
On Mon, 2013-08-05 at 11:51 -0700, H. Peter Anvin wrote:
> On 08/05/2013 11:49 AM, Steven Rostedt wrote:
> > On Mon, 2013-08-05 at 11:29 -0700, H. Peter Anvin wrote:
> > 
> >> Traps nest, that's why there is a stack.  (OK, so you don't want to take
> >> the same trap inside the trap handler, but that code should be very
> >> limited.)  The trap instruction just becomes very short, but rather
> >> slow, call-return.
> >>
> >> However, when you consider the cost you have to consider that the
> >> tracepoint is doing other work, so it may very well amortize out.
> > 
> > Also, how would you pass the parameters? Every tracepoint has its own
> > parameters to pass to it. How would a trap know where to get "prev"
> > and "next"?
> > 
> 
> How do you do that now?
> 
> You have to do an IP lookup to find out what you are doing.

??

You mean to do the enabling? Sure, but not after the code is enabled.
There's no lookup. It just calls functions directly.

> 
> (Note: I wonder how much the parameter generation costs the tracepoints.)

The same as doing a function call.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Andi Kleen
Steven Rostedt  writes:

Can't you just use -freorder-blocks-and-partition?

This should already partition unlikely blocks into a
different section. Just a single one of course.

FWIW the disadvantage is that multiple code sections tends
to break various older dwarf unwinders, as it needs
dwarf3 latest'n'greatest.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Steven Rostedt
On Mon, 2013-08-05 at 11:34 -0700, Linus Torvalds wrote:
> On Mon, Aug 5, 2013 at 11:24 AM, Linus Torvalds
>  wrote:
> >
> > Ugh. I can see the attraction of your section thing for that case, I
> > just get the feeling that we should be able to do better somehow.
> 
> Hmm.. Quite frankly, Steven, for your use case I think you actually
> want the C goto *labels* associated with a section. Which sounds like
> it might be a cleaner syntax than making it about the basic block
> anyway.

I would love to. But IIRC, the asm_goto() has some strict constraints.
We may be able to jump to a different section, but we have no way of
coming back. Not to mention, you must tell the asm goto() what label you
may be jumping to.

I don't know how safe something like this may be:


static inline trace_sched_switch(prev, next)
{
asm goto("jmp foo1\n" : : foo2);
 foo1:
return;

asm goto(".pushsection\n"
"section \".foo\"\n");
 foo2:
__trace_sched_switch(prev, next);
asm goto("jmp foo1"
".popsection\n" : : foo1);
}


The above looks too fragile for my taste. I'm afraid gcc will move stuff
out of those "asm goto" locations, and make things just fail. I can play
with this, but I don't like it.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 11:51 AM, H. Peter Anvin  wrote:
>>
>> Also, how would you pass the parameters? Every tracepoint has its own
>> parameters to pass to it. How would a trap know where to get "prev"
>> and "next"?
>
> How do you do that now?
>
> You have to do an IP lookup to find out what you are doing.

No, he just generates the code for the call and then uses a static_key
to jump to it. So normally it's all out-of-line, and the only thing in
the hot-path is that 5-byte nop (which gets turned into a 5-byte jump
when the tracing key is enabled)
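
Ie, the call site is conceptually just this (a simplified sketch, not the
literal tracepoint macros - "__trace_sched_switch" here stands in for the
out-of-line argument setup and probe call):

	static struct static_key key = STATIC_KEY_INIT_FALSE;

	static inline void trace_sched_switch(struct task_struct *prev,
					      struct task_struct *next)
	{
		if (static_key_false(&key))		/* 5-byte nop when disabled */
			__trace_sched_switch(prev, next);
	}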

Works fine, but the normally unused stubs end up mixing in the normal
code segment. Which I actually think is fine, but right now we don't
get the short-jump advantage from it (and there is likely some I$
disadvantage from just fragmentation of the code).

With two-byte jumps, you'd still get the I$ fragmentation (the
argument generation and the call and the branch back would all be in
the same code segment as the hot code), but that would be offset by
the fact that at least the hot code itself could use a short jump when
possible (ie a 2-byte nop rather than a 5-byte one).

Don't know which way it would go performance-wise. But it shouldn't
need gcc changes, it just needs the static key branch/nop rewriting to
be able to handle both sizes. I couldn't tell why Steven's series to
do that was so complex, though - I only glanced through the patches.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 11:49 AM, Steven Rostedt wrote:
> On Mon, 2013-08-05 at 11:29 -0700, H. Peter Anvin wrote:
> 
>> Traps nest, that's why there is a stack.  (OK, so you don't want to take
>> the same trap inside the trap handler, but that code should be very
>> limited.)  The trap instruction just becomes very short, but rather
>> slow, call-return.
>>
>> However, when you consider the cost you have to consider that the
>> tracepoint is doing other work, so it may very well amortize out.
> 
> Also, how would you pass the parameters? Every tracepoint has its own
> parameters to pass to it. How would a trap know where to get "prev"
> and "next"?
> 

How do you do that now?

You have to do an IP lookup to find out what you are doing.

(Note: I wonder how much the parameter generation costs the tracepoints.)

-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 11:39 AM, Steven Rostedt  wrote:
>
> I had patches that did exactly this:
>
>  https://lkml.org/lkml/2012/3/8/461
>
> But it got dropped for some reason. I don't remember why. Maybe because
> of the complexity?

Ugh. Why the crazy update_jump_label script stuff? I'd go "Eww" at
that too, it looks crazy. The assembler already knows to make short
2-byte "jmp" instructions for near jumps, and you can just look at the
opcode itself to determine size, why is all that other stuff required?

IOW, 5/7 looks sane, but 4/7 makes me go "there's something wrong with
that series".

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Steven Rostedt
On Mon, 2013-08-05 at 11:29 -0700, H. Peter Anvin wrote:

> Traps nest, that's why there is a stack.  (OK, so you don't want to take
> the same trap inside the trap handler, but that code should be very
> limited.)  The trap instruction just becomes very short, but rather
> slow, call-return.
> 
> However, when you consider the cost you have to consider that the
> tracepoint is doing other work, so it may very well amortize out.

Also, how would you pass the parameters? Every tracepoint has its own
parameters to pass to it. How would a trap know where to get "prev"
and "next"?

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 11:34 AM, Linus Torvalds wrote:
> On Mon, Aug 5, 2013 at 11:24 AM, Linus Torvalds
>  wrote:
>>
>> Ugh. I can see the attraction of your section thing for that case, I
>> just get the feeling that we should be able to do better somehow.
> 
> Hmm.. Quite frankly, Steven, for your use case I think you actually
> want the C goto *labels* associated with a section. Which sounds like
> it might be a cleaner syntax than making it about the basic block
> anyway.
> 

A label wouldn't have an endpoint, though...

-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Steven Rostedt
On Mon, 2013-08-05 at 11:20 -0700, Linus Torvalds wrote:

> Of course, it would be good to optimize static_key_false() itself -
> right now those static key jumps are always five bytes, and while they
> get nopped out, it would still be nice if there was some way to have
> just a two-byte nop (turning into a short branch) *if* we can reach
> another jump that way. For small functions that would be lovely. Oh
> well.

I had patches that did exactly this:

 https://lkml.org/lkml/2012/3/8/461

But it got dropped for some reason. I don't remember why. Maybe because
of the complexity?

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 11:24 AM, Linus Torvalds
 wrote:
>
> Ugh. I can see the attraction of your section thing for that case, I
> just get the feeling that we should be able to do better somehow.

Hmm.. Quite frankly, Steven, for your use case I think you actually
want the C goto *labels* associated with a section. Which sounds like
it might be a cleaner syntax than making it about the basic block
anyway.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 11:20 AM, Linus Torvalds wrote:
> 
> Of course, it would be good to optimize static_key_false() itself -
> right now those static key jumps are always five bytes, and while they
> get nopped out, it would still be nice if there was some way to have
> just a two-byte nop (turning into a short branch) *if* we can reach
> another jump that way. For small functions that would be lovely. Oh
> well.
> 

That would definitely require gcc support.  It would be useful, but
probably requires a lot of machinery.

-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 11:23 AM, Steven Rostedt wrote:
> On Mon, 2013-08-05 at 11:17 -0700, H. Peter Anvin wrote:
>> On 08/05/2013 10:55 AM, Steven Rostedt wrote:
>>>
>>> Well, as tracepoints are being added quite a bit in Linux, my concern is
>>> with the inlined functions that they bring. With jump labels they are
>>> disabled in a very unlikely way (the static_key_false() is a nop to skip
>>> the code, and is dynamically enabled to a jump).
>>>
>>
>> Have you considered using traps for tracepoints?  A trapping instruction
>> can be as small as a single byte.  The downside, of course, is that it
>> is extremely suppressed -- the trap is always expensive -- and you then
>> have to do a lookup to find the target based on the originating IP.
> 
> No, never considered it, nor would I. Those that use tracepoints, do use
> them extensively, and adding traps like this would probably cause
> heisenbugs and make tracepoints useless.
> 
> Not to mention, how would we add a tracepoint to a trap handler?
> 

Traps nest, that's why there is a stack.  (OK, so you don't want to take
the same trap inside the trap handler, but that code should be very
limited.)  The trap instruction just becomes very short, but rather
slow, call-return.

However, when you consider the cost you have to consider that the
tracepoint is doing other work, so it may very well amortize out.

-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 11:20 AM, Linus Torvalds
 wrote:
>
> The static_key_false() approach with minimal inlining sounds like a
> much better approach overall.

Sorry, I misunderstood your thing. That's actually what you want that
section thing for, because right now you cannot generate the argument
expansion otherwise.

Ugh. I can see the attraction of your section thing for that case, I
just get the feeling that we should be able to do better somehow.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Steven Rostedt
On Mon, 2013-08-05 at 11:17 -0700, H. Peter Anvin wrote:
> On 08/05/2013 10:55 AM, Steven Rostedt wrote:
> > 
> > Well, as tracepoints are being added quite a bit in Linux, my concern is
> > with the inlined functions that they bring. With jump labels they are
> > disabled in a very unlikely way (the static_key_false() is a nop to skip
> > the code, and is dynamically enabled to a jump).
> > 
> 
> Have you considered using traps for tracepoints?  A trapping instruction
> can be as small as a single byte.  The downside, of course, is that it
> is extremely suppressed -- the trap is always expensive -- and you then
> have to do a lookup to find the target based on the originating IP.

No, never considered it, nor would I. Those that use tracepoints, do use
them extensively, and adding traps like this would probably cause
heisenbugs and make tracepoints useless.

Not to mention, how would we add a tracepoint to a trap handler?

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 10:55 AM, Steven Rostedt  wrote:
>
> My main concern is with tracepoints. Which on 90% (or more) of systems
> running Linux, is completely off, and basically just dead code, until
> someone wants to see what's happening and enables them.

The static_key_false() approach with minimal inlining sounds like a
much better approach overall. Sure, it might add a call/ret, but it
adds it to just the unlikely tracepoint taken path.

Of course, it would be good to optimize static_key_false() itself -
right now those static key jumps are always five bytes, and while they
get nopped out, it would still be nice if there was some way to have
just a two-byte nop (turning into a short branch) *if* we can reach
another jump that way. For small functions that would be lovely. Oh
well.
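
For reference, the byte patterns in play here (a sketch, from memory):

	e9 xx xx xx xx		jmp rel32	- the 5-byte jump we emit today
	0f 1f 44 00 00		nopl		- the 5-byte nop it gets patched to
	eb xx			jmp rel8	- the 2-byte short branch we'd like
	66 90			2-byte nop	- what that would be nopped to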

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 10:55 AM, Steven Rostedt wrote:
> 
> Well, as tracepoints are being added quite a bit in Linux, my concern is
> with the inlined functions that they bring. With jump labels they are
> disabled in a very unlikely way (the static_key_false() is a nop to skip
> the code, and is dynamically enabled to a jump).
> 

Have you considered using traps for tracepoints?  A trapping instruction
can be as small as a single byte.  The downside, of course, is that it
is extremely suppressed -- the trap is always expensive -- and you then
have to do a lookup to find the target based on the originating IP.
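
Roughly (a completely untested sketch -- the lookup-table names are made
up), the call site shrinks to a one-byte int3 and the #BP handler does the
dispatch:

	/* In do_int3(), before treating it as an ordinary breakpoint: */
	struct trace_trap_entry *e = trace_trap_find(regs->ip - 1);

	if (e) {
		e->func(e->data);	/* run the tracepoint payload */
		return;			/* resume right after the int3 */
	}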

-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Steven Rostedt
On Mon, 2013-08-05 at 13:55 -0400, Steven Rostedt wrote:
>  The difference between this and the
> "section" hack I suggested, is that this would use a "call"/"ret" when
> enabled instead of a "jmp"/"jmp".

I wonder if this is what Kris Kross meant in their song?

/me goes back to work...

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Steven Rostedt
On Mon, 2013-08-05 at 10:12 -0700, Linus Torvalds wrote:
> On Mon, Aug 5, 2013 at 9:55 AM, Steven Rostedt  wrote:

> First off, we have very few things that are *so* unlikely that they
> never get executed. Putting things in a separate section would
> actually be really bad.

My main concern is with tracepoints. Which on 90% (or more) of systems
running Linux, is completely off, and basically just dead code, until
someone wants to see what's happening and enables them.

> 
> Secondly, you don't want a separate section anyway for any normal
> kernel code, since you want short jumps if possible (pretty much every
> single architecture out there has a concept of shorter jumps that are
> noticeably cheaper than long ones). You want the unlikely code to be
> out-of-line, but still *close*. Which is largely what gcc already does
> (except if you use "-Os", which disables all the basic block movement
> and thus makes "likely/unlikely" pointless to begin with).
> 
> There are some situations where you'd want extremely unlikely code to
> really be elsewhere, but they are rare as hell, and mostly in user
> code where you might try to avoid demand-loading such code entirely.

Well, as tracepoints are being added quite a bit in Linux, my concern is
with the inlined functions that they bring. With jump labels they are
disabled in a very unlikely way (the static_key_false() is a nop to skip
the code, and is dynamically enabled to a jump).

I did a make kernel/sched/core.i to get what we have in the current
sched_switch code:

static inline __attribute__((no_instrument_function)) void
trace_sched_switch (struct task_struct *prev, struct task_struct *next) {
	if (static_key_false(& __tracepoint_sched_switch .key)) do {
		struct tracepoint_func *it_func_ptr;
		void *it_func;
		void *__data;
		rcu_read_lock_sched_notrace();
		it_func_ptr = ({
			typeof(*((&__tracepoint_sched_switch)->funcs)) *_p1 =
				(typeof(*((&__tracepoint_sched_switch)->funcs)) *)
				(*(volatile typeof(((&__tracepoint_sched_switch)->funcs)) *)
				 &(((&__tracepoint_sched_switch)->funcs)));
			do {
				static bool __attribute__
					((__section__(".data.unlikely"))) __warned;
				if (debug_lockdep_rcu_enabled() && !__warned &&
				    !(rcu_read_lock_sched_held() || (0))) {
					__warned = true;
					lockdep_rcu_suspicious( , 153 ,
						"suspicious rcu_dereference_check()" " usage");
				}
			} while (0);
			((typeof(*((&__tracepoint_sched_switch)->funcs)) *)(_p1));
		});
		if (it_func_ptr) {
			do {
				it_func = (it_func_ptr)->func;
				__data = (it_func_ptr)->data;
				((void(*)(void *__data, struct task_struct *prev,
					  struct task_struct *next))(it_func))(__data, prev, next);
			} while ((++it_func_ptr)->func);
		}
		rcu_read_unlock_sched_notrace();
	} while (0);
}

I massaged it to look more readable. This is inlined right at the
beginning of the prepare_task_switch(). Now, most of this code should be
moved to the end of the function by gcc (well, as you stated -Os may not
play nice here). And perhaps its not that bad of an issue. That is, how
much of the icache does this actually take up? Maybe we are lucky and it
sits outside the icache of the hot path.

I still need to start running a bunch of benchmarks to see how much
overhead these tracepoints cause. Herbert Xu brought up the concern
about various latencies in the kernel, including tracing, in his ATTEND
request on the kernel-discuss mailing list.



> 
> So give up on sections. They are a bad idea for anything except the
> things we already use them for. Sure, you can try to fix the problems
> with sections with link-time optimization work and a *lot* of small
> individual sections (the way per-function sections work already), but
> that's basically just undoing the stupidity of using sections to begin
> with.

OK, this was just a suggestion. Perhaps my original patch that just
moves this code into a real function where the trace_sched_switch() only
contains the jump_label and a call to another function that does all the
work when enabled, is still a better idea. That is, if benchmarks prove
that it's worth it.

Instead of the above, my patches would make the code into:

static inline __attribute__((no_instrument_function)) void
trace_sched_switch (struct task_struct *prev, struct task_struct *next)
{
	if (static_key_false(& __tracepoint_sched_switch .key))
		__trace_sched_switch(prev, next);
}

Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Steven Rostedt
On Mon, 2013-08-05 at 10:02 -0700, H. Peter Anvin wrote:

> > if (x) __attibute__((section(".foo"))) {
> > /* do something */
> > }
> > 
> 
> One concern I have is how this kind of code would work when embedded
> inside a function which already has a section attribute.  This could
> easily cause really weird bugs when someone "optimizes" an inline or
> macro and breaks a single call site...

I would say that it overrides the section it is embedded in. Basically
like a .pushsection and .popsection would work.
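
I.e. the same effect you get at the assembler level today with something
like (just a sketch):

	.pushsection .foo, "ax"
	/* block body assembled into .foo */
	.popsection
	/* back in whatever section the enclosing function was in */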

What bugs do you think would happen? Sure, using this in an .init section
would leave the code sitting around after boot up. I'm sure modules could
handle this properly. What other uses of the section attribute are there
for code? I'm aware of locks and sched using it, but that's more for
debugging purposes, and even there the worst thing I see is that a debug
report won't say that the code is in the section.

We do a lot of tricks with sections in the Linux kernel, so I too share
your concern. But even with that, if we audit all use cases, we may
still be able to safely do this. This is why I'm asking for comments :-)

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 10:12 AM, Linus Torvalds
 wrote:
>
> Secondly, you don't want a separate section anyway for any normal
> kernel code, since you want short jumps if possible

Just to clarify: the short jump is important regardless of how
unlikely the code you're jumping is, since even if you'd be jumping to
very unlikely ("never executed") code, the branch to that code is
itself in the hot path.

And the difference between a two-byte short jump to the end of a short
function, and a five-byte long jump (to pick the x86 case) is quite
noticeable.

Other cases do long jumps by jumping to a thunk, and so the "hot case"
is unaffected, but at least one common architecture very much sees the
difference in the likely code.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 9:55 AM, Steven Rostedt  wrote:
>
> Almost a full year ago, Mathieu suggested something like:
>
> if (unlikely(x)) __attribute__((section(".unlikely"))) {
> ...
> } else __attribute__((section(".likely"))) {
> ...
> }

It's almost certainly a horrible idea.

First off, we have very few things that are *so* unlikely that they
never get executed. Putting things in a separate section would
actually be really bad.

Secondly, you don't want a separate section anyway for any normal
kernel code, since you want short jumps if possible (pretty much every
single architecture out there has a concept of shorter jumps that are
noticeably cheaper than long ones). You want the unlikely code to be
out-of-line, but still *close*. Which is largely what gcc already does
(except if you use "-Os", which disables all the basic block movement
and thus makes "likely/unlikely" pointless to begin with).

There are some situations where you'd want extremely unlikely code to
really be elsewhere, but they are rare as hell, and mostly in user
code where you might try to avoid demand-loading such code entirely.

So give up on sections. They are a bad idea for anything except the
things we already use them for. Sure, you can try to fix the problems
with sections with link-time optimization work and a *lot* of small
individual sections (the way per-function sections work already), but
that's basically just undoing the stupidity of using sections to begin
with.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 09:55 AM, Steven Rostedt wrote:
> 
> Almost a full year ago, Mathieu suggested something like:
> 
> if (unlikely(x)) __attribute__((section(".unlikely"))) {
> ...
> } else __attribute__((section(".likely"))) {
> ...
> }
> 
> https://lkml.org/lkml/2012/8/9/658
> 
> Which got me thinking. How hard would it be to set a block in its own
> section. Like what Mathieu suggested, but it doesn't have to be
> ".unlikely".
> 
> if (x) __attibute__((section(".foo"))) {
>   /* do something */
> }
> 

One concern I have is how this kind of code would work when embedded
inside a function which already has a section attribute.  This could
easily cause really weird bugs when someone "optimizes" an inline or
macro and breaks a single call site...

-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

