Re: [cfe-dev] RFC: Support x86 interrupt and exception handlers

2015-09-22 Thread H. Peter Anvin
On 09/22/15 04:52, David Chisnall wrote:
> On 22 Sep 2015, at 12:47, H.J. Lu  wrote:
>>
>> since __builtin_exception_error () is the same as
>> __builtin_return_address (0) and __builtin_interrupt_data () is
>> address of __builtin_exception_error () + size of register.
> 
> Except that they’re *not*.  __builtin_return_address(0) is guaranteed to be 
> the same for the duration of the function.  __builtin_exception_error() needs 
> to either:
> 
> 1) Fetch the values early with interrupts disabled, store them in a
> well-known location, and load them from this place when the intrinsic
> is called, or
> 
> 2) Force any function that calls the intrinsic (and wants a
> meaningful result) to run with interrupts disabled, which is
> something that the compiler can’t verify without knowing the full
> chain of code from the interrupt handler to the current point (and
> therefore prone to error).
> 
> It is trivial to write a little bit of inline assembly that reads
> these values from the CPU and expose that for C code.  There is a
> good reason why no one does this.
> 

This is why it makes no sense for the intrinsics to be callable from
anywhere except inside the interrupt handler.  It is really nothing
other than a way to pass arguments -- whether or not it is simpler for
the compilers to implement than supporting a different function
signature is beyond my scope of expertise.

-hpa



Re: [cfe-dev] RFC: Support x86 interrupt and exception handlers

2015-09-22 Thread H. Peter Anvin
On 09/22/15 04:44, David Chisnall wrote:
> On 22 Sep 2015, at 12:39, H.J. Lu via cfe-dev  wrote:
>>
>> The center piece of my proposal is not to change how parameters
>> are passed in compiler.  As for user experience, the feedbacks on
>> my proposal from our users are very positive.
> 
> Implementing the intrinsics for getting the current interrupt
> requires a lot of support code for it to actually be useful.  For it
> to be useful, you are requiring all of the C code to be run with
> interrupts disabled (and even that doesn’t work if you get a NMI in
> the middle).  Most implementations use a small amount of assembly to
> capture the interrupt cause and the register state on entry to the
> handler, then reenable interrupts while the C code runs.  This means
> that any interrupts (e.g. page faults, illegal instruction traps,
> whatever) that happen while the C code is running do not mask the
> values.  Accessing these values from *existing* C code is simply a
> matter of loading a field from a structure.
> 
> I’m really unconvinced by something that something with such a narrow
> use case (and one that encourages writing bad code) belongs in the
> compiler.
> 

You seem to not understand how x86 works, nor have noted how this is
nearly universally supported by various architectures; x86 is the
exception here.

x86 stores its interrupt state on the stack, not in a register which can
be clobbered.  Also, a lot of your assertions about "most
implementations" only apply to full-scale operating systems.
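
As a rough illustration of the stack layout under discussion (a sketch only, not the interface proposed in the RFC): in 64-bit mode the CPU pushes SS, RSP, RFLAGS, CS and RIP on interrupt entry, and some exceptions additionally push an error code below them. The struct and function names are made up for this example; 32-bit mode differs in that SS:ESP is only pushed on a privilege change.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the field order mirrors what a 64-bit x86 CPU pushes. */
struct interrupt_frame {
    uintptr_t ip;
    uintptr_t cs;
    uintptr_t flags;
    uintptr_t sp;
    uintptr_t ss;
};

/* Exceptions that carry an error code have it at the lowest address. */
struct exception_frame {
    uintptr_t error_code;
    struct interrupt_frame frame;
};

/* Given the address of the error-code slot, the interrupt data sits one
   register-sized slot above it, which is the relationship H.J. describes. */
static const struct interrupt_frame *
frame_from_error_slot(const uintptr_t *error_slot)
{
    return (const struct interrupt_frame *)(error_slot + 1);
}
```

Because the state lives in the handler's own stack frame, a nested interrupt pushes a new frame rather than overwriting this one.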

-hpa




Re: [cfe-dev] RFC: Support x86 interrupt and exception handlers

2015-09-22 Thread H. Peter Anvin
On 09/22/15 01:41, David Chisnall wrote:
> On 21 Sep 2015, at 21:45, H.J. Lu via cfe-dev  wrote:
>>
>> The main purpose of x86 interrupt attribute is to allow programmers
>> to write x86 interrupt/exception handlers in C WITHOUT assembly
>> stubs to avoid extra branch from assembly stubs to C functions.  I
>> want to keep the number of new intrinsics to minimum without sacrificing
>> handler performance. I leave faking error code in interrupt handler to
>> the programmer.
> 
> The assembly stubs have to come from somewhere.  You either put them
> in an assembly file (most people doing embedded x86 stuff steal the
> ones from NetBSD), or you put them in the compiler where they can be
> inlined.  In terms of user interface, there’s not much difference in
> complexity.  Having written this kind of code in the past, I can
> honestly say that using the assembly stubs was the least difficult
> part of getting them right.  In terms of compiler complexity, there’s
> a big difference: in one case the compiler contains nothing, in the
> other it contains something special for a single use case.  In terms
> of performance, the compiler version has the potential to be faster,
> but if we’re going to pay for the complexity then I think that we’d
> need to see some strong evidence that someone else is getting a
> noticeable benefit.
> 

It is worth noting that most architectures have this support for a reason.

-hpa



Re: gcc feature request / RFC: extra clobbered regs

2015-07-02 Thread H. Peter Anvin
On 07/01/2015 10:43 AM, Jakub Jelinek wrote:
 On Wed, Jul 01, 2015 at 01:35:16PM -0400, Vladimir Makarov wrote:
 Actually it raises a question for me.  If we describe that a function
 clobbers more than the calling convention and then use it as a value (assigning
 it to a variable or passing it as an argument), losing track of it, and then
 call it, how can RA know what the call actually clobbers?  So for
 functions with the attributes we should prohibit using them as values, or make
 the attributes part of the function type, or at least say it is unsafe.
 So now I see this as a *bigger problem* with this extension.  Although I
 guess it already exists as we have description of different ABI as an
 extension.
 
 Unfortunately target attribute is function decl attribute rather than
 function type.  And having more attributes affect switchable targets will be
 non-fun.
 

How on Earth does that work with existing switchable ABIs?  Keep in mind
that we already support multiple ABIs...

-hpa




Re: gcc feature request / RFC: extra clobbered regs

2015-06-30 Thread H. Peter Anvin
On 06/30/2015 02:48 PM, Andy Lutomirski wrote:
 On Tue, Jun 30, 2015 at 2:41 PM, H. Peter Anvin h...@zytor.com wrote:
 On 06/30/2015 02:37 PM, Jakub Jelinek wrote:
 I'd say the most natural API for this would be to allow
 f{fixed,call-{used,saved}}-REG in target attribute.

 Either that or

 __attribute__((fixed(rbp,rcx),used(rax,rbx),saved(r11)))

 ... just to be shorter.  Either way, I would consider this to be
 desirable -- I have myself used this to good effect in a past life
 (*cough* Transmeta *cough*) -- but not a high priority feature.
 
 I think I mean the per-function equivalent of -fcall-used-reg, so
 hpa's used suggestion would do the trick.
 
 I guess that clobbering the frame pointer is a non-starter, but five
 out of six isn't so bad.  It would be nice to error out instead of
 producing disastrous results, though, if another bad reg is chosen.
 (Presumably the PIC register on PIC builds would be an example of
 that.)
 

Clobbering the frame pointer is perfectly fine, as is the PIC register.
 However, gcc might need to handle them as fixed rather than clobbered.

-hpa




Re: gcc feature request / RFC: extra clobbered regs

2015-06-30 Thread H. Peter Anvin
On 06/30/2015 02:37 PM, Jakub Jelinek wrote:
 I'd say the most natural API for this would be to allow
 f{fixed,call-{used,saved}}-REG in target attribute.

Either that or

__attribute__((fixed(rbp,rcx),used(rax,rbx),saved(r11)))

... just to be shorter.  Either way, I would consider this to be
desirable -- I have myself used this to good effect in a past life
(*cough* Transmeta *cough*) -- but not a high priority feature.

-hpa




Re: gcc feature request / RFC: extra clobbered regs

2015-06-30 Thread H. Peter Anvin
On 06/30/2015 02:22 PM, Andy Lutomirski wrote:
 Hi all-
 
 I'm working on a massive set of cleanups to Linux's syscall handling.
 We currently have a nasty optimization in which we don't save rbx,
 rbp, r12, r13, r14, and r15 on x86_64 before calling C functions.
 This works, but it makes the code a huge mess.  I'd rather save all
 regs in asm and then call C code.
 
 Unfortunately, this will add five cycles (on SNB) to one of the
 hottest paths in the kernel.  To counteract it, I have a gcc feature
 request that might not be all that crazy.  When writing C functions
 intended to be called from asm, what if we could do:
 
 __attribute__((extra_clobber(rbx, rbp, r12, r13, r14,
 r15))) void func(void);
 
 This will save enough pushes and pops that it could easily give us our
 five cycles back and then some.  It's also easy to be compatible with
 old GCC versions -- we could just omit the attribute, since preserving
 a register is always safe.
 
 Thoughts?  Is this totally crazy?  Is it easy to implement?
 
 (I'm not necessarily suggesting that we do this for the syscall bodies
 themselves.  I want to do it for the entry and exit helpers, so we'd
 still lose the five cycles in the full fast-path case, but we'd do
 better in the slower paths, and the slower paths are becoming
 increasingly important in real workloads.)
 

Some gcc targets have done this in the past.  There are command-line
options to do that, but using attributes you have to handle cross-ABI
compilation.

However, I don't see this being done in the upstream gcc.

Keep in mind the runway that we'll need, though.
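
The fallback Andy mentions (omit the attribute on compilers that lack it, since preserving a register is always safe) could be sketched as below. `extra_clobber` is the attribute proposed in this thread and is hypothetical, as are the `EXTRA_CLOBBER` macro and `entry_helper`; `__has_attribute` is the feature test available in GCC 5 and later and in Clang.

```c
/* extra_clobber is hypothetical: it is the attribute requested in this thread. */
#if defined(__has_attribute)
# if __has_attribute(extra_clobber)
#  define EXTRA_CLOBBER(...) __attribute__((extra_clobber(__VA_ARGS__)))
# endif
#endif
#ifndef EXTRA_CLOBBER
# define EXTRA_CLOBBER(...) /* no attribute: the registers are simply preserved */
#endif

/* Hypothetical entry helper intended to be called from asm. */
EXTRA_CLOBBER(rbx, rbp, r12, r13, r14, r15)
long entry_helper(long nr)
{
    return nr + 1;
}
```

On any compiler without the attribute the macro expands to nothing, so the function is compiled with the normal calling convention.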

-hpa




Re: gcc feature request / RFC: extra clobbered regs

2015-06-30 Thread H. Peter Anvin
On 06/30/2015 02:55 PM, Andy Lutomirski wrote:
 On Tue, Jun 30, 2015 at 2:52 PM, H. Peter Anvin h...@zytor.com wrote:
 On 06/30/2015 02:48 PM, Andy Lutomirski wrote:
 On Tue, Jun 30, 2015 at 2:41 PM, H. Peter Anvin h...@zytor.com wrote:
 On 06/30/2015 02:37 PM, Jakub Jelinek wrote:
 I'd say the most natural API for this would be to allow
 f{fixed,call-{used,saved}}-REG in target attribute.

 Either that or

 __attribute__((fixed(rbp,rcx),used(rax,rbx),saved(r11)))

 ... just to be shorter.  Either way, I would consider this to be
 desirable -- I have myself used this to good effect in a past life
 (*cough* Transmeta *cough*) -- but not a high priority feature.

 I think I mean the per-function equivalent of -fcall-used-reg, so
 hpa's used suggestion would do the trick.

 I guess that clobbering the frame pointer is a non-starter, but five
 out of six isn't so bad.  It would be nice to error out instead of
 producing disastrous results, though, if another bad reg is chosen.
 (Presumably the PIC register on PIC builds would be an example of
 that.)


 Clobbering the frame pointer is perfectly fine, as is the PIC register.
  However, gcc might need to handle them as fixed rather than clobbered.
 
 Hmm.  True, I guess, although I wouldn't necessarily expect gcc to be
 able to generate code to call a function like that.
 

No, but you need to be able to call other functions, or you just push
the issue down one level.

-hpa




Re: Is it safe to use _Bool as asm statement outputs on x86?

2015-06-03 Thread H. Peter Anvin
On 06/02/2015 06:02 PM, Richard Henderson wrote:
 On 06/02/2015 04:46 PM, H. Peter Anvin wrote:
 For the x86 backend explicitly, is doing something like:

 _Bool x;

 asm("blah ; setc %0" : "=qm" (x));

 ... guaranteed to be safe for older versions of gcc?
 
 I believe so, for the restricted set of conditions I expect you're asking.
 In particular:
 
  (1) Linux has always defined _Bool as a byte (indeed, afaik only Darwin
  has ever done otherwise).
 
  (2) You must really produce 0/1 from the asm; the compiler doesn't re-do
  the canonicalization afterward, and afaik we do rely on that in the
  optimizers.  But certainly that's true for any version of GCC.
 

That is all good as far as I am concerned.
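
A minimal sketch of the pattern in question, assuming GCC-style inline asm on x86 (the function name `add_carries` is invented for the example, and the portable branch is only there so the code builds elsewhere). The asm itself produces a strict 0/1 via `setc`, which is condition (2) in Richard's reply.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if a + b overflows 32 bits. */
static inline bool add_carries(uint32_t a, uint32_t b, uint32_t *sum)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    bool carry;
    uint32_t s = a;

    /* addl sets CF on unsigned overflow; setc stores exactly 0 or 1. */
    __asm__("addl %2, %0\n\t"
            "setc %1"
            : "+r" (s), "=qm" (carry)
            : "g" (b)
            : "cc");
    *sum = s;
    return carry;
#else
    /* Portable fallback for non-x86 or non-GNU compilers. */
    *sum = a + b;
    return *sum < a;
#endif
}
```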

-hpa




Re: Is it safe to use _Bool as asm statement outputs on x86?

2015-06-03 Thread H. Peter Anvin
On 06/02/2015 11:23 PM, H.J. Lu wrote:
 Trampoline code for nested functions puts code on the stack.

This is an issue for the CS ≠ DS case, as opposed to _Bool, I assume.
In other words, we are okay if and only if we can run with an NX stack?

-hpa



Is it safe to use _Bool as asm statement outputs on x86?

2015-06-02 Thread H. Peter Anvin
For the x86 backend explicitly, is doing something like:

_Bool x;

asm("blah ; setc %0" : "=qm" (x));

... guaranteed to be safe for older versions of gcc?

-hpa


Re: [PATCH v2 6/6] i386: Implement asm flag outputs

2015-05-20 Thread H. Peter Anvin
Well, these kinds of asm are inherently target specific, but I did already ask 
for a cpp symbol to indicate this facility is available.

On May 20, 2015 9:21:07 AM PDT, Jeff Law l...@redhat.com wrote:
On 05/15/2015 09:37 AM, Richard Henderson wrote:
 Version 2 includes proper test cases and documentation.
 Hopefully the documentation even makes sense.  Suggestions
 and improvements there gratefully appreciated.


 r~
 ---
   gcc/config/i386/constraints.md             |   5 ++
   gcc/config/i386/i386.c                     | 137 +++--
   gcc/doc/extend.texi                        |  76
   gcc/testsuite/gcc.target/i386/asm-flag-0.c |  15
   gcc/testsuite/gcc.target/i386/asm-flag-1.c |  18
   gcc/testsuite/gcc.target/i386/asm-flag-2.c |  16
   gcc/testsuite/gcc.target/i386/asm-flag-3.c |  22 +
   gcc/testsuite/gcc.target/i386/asm-flag-4.c |  20 +
   gcc/testsuite/gcc.target/i386/asm-flag-5.c |  19
   9 files changed, 321 insertions(+), 7 deletions(-)
   create mode 100644 gcc/testsuite/gcc.target/i386/asm-flag-0.c
   create mode 100644 gcc/testsuite/gcc.target/i386/asm-flag-1.c
   create mode 100644 gcc/testsuite/gcc.target/i386/asm-flag-2.c
   create mode 100644 gcc/testsuite/gcc.target/i386/asm-flag-3.c
   create mode 100644 gcc/testsuite/gcc.target/i386/asm-flag-4.c
   create mode 100644 gcc/testsuite/gcc.target/i386/asm-flag-5.c
It all seems to make sense.  Obviously you'll need a ChangeLog and the 
usual testing before committing.

I won't stress much if this needs a bit of further tweaking as the 
kernel folks start to exploit the capability and we find weaknesses in 
the implementation.

What I don't see is any way to know if the target supports asm flag
outputs.  Are we expecting the kernel folks to do some kind of test
then enable/disable based on the result?

I'm going to assume the mapping of the constraints to the actual modes 
and codes is correct.


Jeff

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


Re: [RFC 0/6] Flags outputs for asms

2015-05-08 Thread H. Peter Anvin
On 05/08/2015 08:54 AM, Richard Henderson wrote:
 
 Anyway, I'll look into whether the branch around alpha can be optimized, but
 I'd be shocked if I'd be able to do anything about the branch around beta.
 True, there's nothing in between that will clobber the flags so it would be an
 excellent improvement, but combine doesn't work across basic blocks and
 changing that would be a major task.
 

Either way... optimization is something that can be done gradually.
Once we start using the feature we can figure out where it makes sense
to do further optimizations.

-hpa




Re: [PATCH 6/6] i386: Implement asm flag outputs

2015-05-08 Thread H. Peter Anvin
On 05/07/2015 02:39 PM, Richard Henderson wrote:
 All "jcc" mnemonics implemented as "=@cc<cond>"
 to make it easy for someone reading the manual
 to figure out what condition is desired.

One request: would it be possible to get a cpp symbol for this (e.g.
__GCC_X86_INLINE_ASM_CC__) so we don't have to do explicit gcc version
checks?
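
A sketch of how such a symbol gets used. Released GCC versions that support flag outputs define `__GCC_ASM_FLAG_OUTPUTS__` (not the name suggested above), so the example tests for that; the function name `sub_borrows` and the plain C fallback are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical helper: true if a - b would borrow, i.e. a < b unsigned. */
static inline bool sub_borrows(unsigned a, unsigned b)
{
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && \
    (defined(__x86_64__) || defined(__i386__))
    bool cf;

    /* AT&T operand order: "cmpl b, a" computes a - b and sets CF if a < b. */
    __asm__("cmpl %2, %1"
            : "=@ccc" (cf)
            : "r" (a), "g" (b));
    return cf;
#else
    /* Compiler without flag outputs, or not x86: plain C comparison. */
    return a < b;
#endif
}
```

With the symbol available, no explicit compiler version check is needed to choose between the two variants.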

-hpa




Re: [RFC 0/6] Flags outputs for asms

2015-05-07 Thread H. Peter Anvin
On 05/07/2015 02:38 PM, Richard Henderson wrote:
 Here's a prototype for i386 only, which stands up to light testing.
 I'd rather post this tonight rather than wait until tomorrow when I
 can write more proper dejagnu tests.
 
 I've tested the intermediate patches via config-list.mk, so despite
 mucking around with vec.h vs target.h, all targets still compile.
 
 That said, quite a bit of cleanup in expand_asm_stmt was required
 in order to make the target hook not be completely unintelligible,
 so despite full regression testing on x86_64 and ppc64, I could
 well have broken something.

Hi!  I took this for a spin, and:

a) It seems to work correctly; haven't been able to break it yet.
b) It seems very easy to provoke it into producing pretty bad code.

(b) is obviously not at all unexpected, this is working impressively
well for a first RFC.  Here is a piece of test code I used for this.

I'm very impressed to see this happen so quickly, thank you!

-hpa







extern void alpha(void);
extern void beta(void);

/* This case really should produce good code in both cases */

void good1(int x, int y)
{
  _Bool pf;

  asm("cmpl %2,%1"
      : "=@ccp" (pf)
      : "r" (x), "g" (y));

  if (pf)
    beta();
}

void bad1(int x, int y)
{
  _Bool le, pf;

  asm("cmpl %3,%2"
      : "=@ccle" (le), "=@ccp" (pf)
      : "r" (x), "g" (y));

  if (le)
    alpha();
  else if (pf)
    beta();
}

/* This case really is too much to ask... */

_Bool good2(int x, int y)
{
  _Bool le;

  asm("cmpl %2,%1"
      : "=@ccle" (le)
      : "r" (x), "g" (y));

  return le;
}

_Bool bad2(int x, int y)
{
  _Bool zf, of, sf;

  asm("cmpl %4,%3"
      : "=@ccz" (zf), "=@cco" (of), "=@ccs" (sf)
      : "r" (x), "g" (y));

  return zf | (sf ^ of);
}

/* One should expect this shouldn't produce *worse* code than the above... */

int good3(int x, int y, int a, int b)
{
  _Bool le;

  asm("cmpl %2,%1"
      : "=@ccle" (le)
      : "r" (x), "g" (y));

  return le ? b : a;
}

int bad3(int x, int y, int a, int b)
{
  _Bool zf, of, sf;

  asm("cmpl %4,%3"
      : "=@ccz" (zf), "=@cco" (of), "=@ccs" (sf)
      : "r" (x), "g" (y));

  return zf | (sf ^ of) ? b : a;
}


Re: [RFC 0/6] Flags outputs for asms

2015-05-07 Thread H. Peter Anvin
This is a separate issue which really shouldn't have anything to do with
this, but is there a specific reason why:

void good1(int x, int y)
{
  _Bool pf;

  asm("cmpl %2,%1"
      : "=@ccp" (pf)
      : "r" (x), "g" (y));

  if (pf)
    beta();
}

... ends up generating a jump to a jump?

 <good1>:
    0:   39 f7                   cmp    %esi,%edi
    2:   7a 0c                   jp     10 <good1+0x10>
    4:   f3 c3                   repz retq
    6:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    d:   00 00 00
   10:   e9 00 00 00 00          jmpq   15 <good1+0x15>
                         11: R_X86_64_PC32       beta-0x4
   15:   66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
   1c:   00 00 00 00



Re: [RFC] Design for flag bit outputs from asms

2015-05-04 Thread H. Peter Anvin
On 05/04/2015 01:35 PM, Linus Torvalds wrote:
 On Mon, May 4, 2015 at 1:14 PM, H. Peter Anvin h...@zytor.com wrote:

 I would argue that for x86 what you actually want is to model the
 *conditions* that are available on the flags, not the flags themselves.
 
 Yes. Otherwise it would be a nightmare to try to describe simple
 conditions like "le", which is a rather complicated combination of three
 of the actual flag bits:
 
 ((SF ^^ OF) || ZF) = 1
 
 which would just be ridiculously painful for (a) the user to describe
 and (b) for the compiler to recognize once described.
 
 Now, I do admit that most of the cases where you'd use inline asm with
 condition codes would probably fall into just simple test ZF or CF.
 But I could certainly imagine other cases.
 

Yes, although once again I'm more than happy to let gcc do the boolean
optimizations if it already has logic to do so (which it might have/want
for its own reasons.)

-hpa




Re: [RFC] Design for flag bit outputs from asms

2015-05-04 Thread H. Peter Anvin
On 05/04/2015 12:33 PM, Richard Henderson wrote:
 
 (0) The C level output variable should be an integral type, from bool on up.
 
 The flags are a scarce resource, easily clobbered.  We cannot allow user code
 to keep data in the flags.  While x86 does have lahf/sahf, they don't exactly
 perform well.  And other targets like arm don't even have that bad option.
 
 Therefore, the language level semantics are that the output is a boolean store
 into the variable with a condition specified by a magic constraint.
 
 That said, just like the compiler should be able to optimize
 
 void bar(int y)
 {
   int x = (y <= 0);
   if (x) foo();
 }
 
 such that we only use a single compare against y, the expectation is that
 within a similarly constrained context the compiler will not require two tests
 for these boolean outputs.
 
 Therefore:
 
 (1) Each target defines a set of constraint strings,
 
E.g. for x86, wherein we're almost out of constraint letters,
 
  ja   aux carry flag
  jc   carry flag
  jo   overflow flag
  jp   parity flag
  js   sign flag
  jz   zero flag
 

I would argue that for x86 what you actually want is to model the
*conditions* that are available on the flags, not the flags themselves.
 There are 16 such conditions, 8 if we discard the inversions.

It is notable that the auxiliary carry flag has no Jcc/SETcc/CMOVcc
instructions; it is only ever consumed by the DAA/DAS instructions which
makes it pointless to try to model it in a compiler any more than, say, IF.

 (2) A new target hook post-processes the asm_insn, looking for the
 new constraint strings.  The hook expands the condition prescribed
 by the string, adjusting the asm_insn as required.
 
   E.g.
 
 bool x, y, z;
 asm ("xyzzy" : "=jc"(x), "=jp"(y), "=jo"(z) : : );

Other than that, this is exactly what would be wonderful to see.

-hpa



Re: [RFC] Design for flag bit outputs from asms

2015-05-04 Thread H. Peter Anvin
On 05/04/2015 01:57 PM, Richard Henderson wrote:
 
 Sure.
 
 I'd be more inclined to support these compound conditionals directly, rather
 than try to get the compiler to recognize them after the fact.
 
 Indeed, I believe we have a near complete set of them in the x86 backend
 already.  It'd just be a matter of selecting the spellings for the 
 constraints.
 

Whichever works for you.

The full set of conditions, mnemonics, and a bitmask with the bits in
the order from MSB to LSB (OF,SF,ZF,PF,CF) which is probably the sanest
way to model these for the purpose of boolean optimization.

Opcode  Mnemonics   Condition            Bitmask
0       o           OF                   0xffff0000
1       no          !OF                  0x0000ffff
2       b/c/nae     CF                   0xaaaaaaaa
3       ae/nb/nc    !CF                  0x55555555
4       e/z         ZF                   0xf0f0f0f0
5       ne/nz       !ZF                  0x0f0f0f0f
6       na          CF || ZF             0xfafafafa
7       a           !CF && !ZF           0x05050505
8       s           SF                   0xff00ff00
9       ns          !SF                  0x00ff00ff
A       p/pe        PF                   0xcccccccc
B       np/po       !PF                  0x33333333
C       l/nge       SF != OF             0x00ffff00
D       ge/nl       SF == OF             0xff0000ff
E       le/ng       ZF || (SF != OF)     0xf0fffff0
F       g/nle       !ZF && (SF == OF)    0x0f00000f
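
To make the bitmask encoding concrete, here is a small sketch that derives a mask by brute force: bit i of the mask is set when the condition holds for the flag combination whose 5-bit index is i, with the index bits ordered OF,SF,ZF,PF,CF from MSB to LSB as stated above. The helper names are invented for the example.

```c
#include <stdint.h>

/* Position of each flag within the 5-bit index (CF is the LSB, OF the MSB). */
enum { CF = 1u << 0, PF = 1u << 1, ZF = 1u << 2, SF = 1u << 3, OF = 1u << 4 };

typedef int (*cond_fn)(unsigned flags);

static int cond_z(unsigned f)  { return (f & ZF) != 0; }
static int cond_l(unsigned f)  { return !(f & SF) != !(f & OF); }   /* SF != OF */
static int cond_le(unsigned f) { return cond_z(f) || cond_l(f); }
static int cond_a(unsigned f)  { return !(f & CF) && !(f & ZF); }

/* Build the 32-entry truth table of a condition as a 32-bit mask. */
static uint32_t cond_mask(cond_fn cond)
{
    uint32_t mask = 0;

    for (unsigned flags = 0; flags < 32; flags++)
        if (cond(flags))
            mask |= UINT32_C(1) << flags;
    return mask;
}
```

In this form, boolean combinations of conditions become bitwise operations on the masks: the mask for "le" is the OR of the masks for "z" and "l", and an inverted condition is the bitwise complement.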

-hpa



Re: [PATCH] X86-64: Add -mskip-rax-setup

2014-12-18 Thread H. Peter Anvin
On 12/18/2014 06:12 AM, Uros Bizjak wrote:

 # temporary until string.h is fixed
 KBUILD_CFLAGS += -ffreestanding

 Yes, it looks to me that new option is the way to go.

 Is this an OK?
 
 In principle, I'm OK with the patch approach, but let's wait for
 eventual comments from Linux people.
 

Acked-by: H. Peter Anvin h...@linux.intel.com

H.J. already coordinated with us; we are more than happy with this approach.

Thank you!

-hpa




Re: [PATCH] X86-64: Add -mskip-rax-setup

2014-12-18 Thread H. Peter Anvin
On 12/18/2014 09:43 AM, H.J. Lu wrote:
 
 Peter, please feel free to use my kernel patch or create a different
 one.
 

Great, thanks!

-hpa




Re: [PATCH] X86-64: Add -mskip-rax-setup

2014-12-18 Thread H. Peter Anvin
On 12/18/2014 10:37 AM, Rasmus Villemoes wrote:
 
 Minor thing: If it's not too late, I'd appreciate a 'Suggested-by' or
 similar mention in the kernel change log.
 

I think we can get that.

-hpa




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-13 Thread H. Peter Anvin
 On Mon, Aug 12, 2013 at 10:47:37AM -0700, H. Peter Anvin wrote:
 Since we really doesn't want to...

Ow.  Can't believe I wrote that.

-hpa



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-12 Thread H. Peter Anvin
On 08/12/2013 02:17 AM, Peter Zijlstra wrote:
 
 I've been wanting to 'abuse' static_key/asm-goto to sort-of JIT
 if-forest functions like perf_prepare_sample() and perf_output_sample().
 
 They are of the form:
 
 void func(obj, args..)
 {
   unsigned long f = ...;
 
   if (f & F1)
           do_f1();
 
   if (f & F2)
           do_f2();
 
   ...
 
   if (f & FN)
           do_fn();
 }
 

Am I reading this right that f can be a combination of any of these?

 Where f is constant for the entire lifetime of the particular object.
 
 So I was thinking of having these functions use static_key/asm-goto;
 then write the proper static key values unsafe so as to avoid all
 trickery (as these functions would never actually be used) and copy the
 end result into object private memory. The object will then use indirect
 calls into these functions.

I'm really not following what you are proposing here, especially not
"copy the end result into object private memory."

With asm goto you end up with at minimum a jump or NOP for each of these
function entries, whereas an actual JIT can elide that as well.

On the majority of architectures, including x86, you cannot simply copy
a piece of code elsewhere and have it still work.  You end up doing a
bunch of the work that a JIT would do anyway, and would end up with
considerably higher complexity and worse results than a true JIT.  You
also say "the object will then use indirect calls into these
functions"... you mean the JIT or pseudo-JIT generated functions, or the
calls inside them?

 I suppose the question is, do people strenuously object to creativity
 like that and or is there something GCC can do to make this
 easier/better still?

I think it would be much easier to just write a minimal JIT for this,
even though it is per architecture.  However, I would really like to
understand what the value is.

-hpa



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-12 Thread H. Peter Anvin
On 08/12/2013 09:09 AM, Peter Zijlstra wrote:

 On the majority of architectures, including x86, you cannot simply copy
 a piece of code elsewhere and have it still work.
 
 I thought we used -fPIC which would allow just that.
 

Doubly wrong.  The kernel is not compiled with -fPIC, nor does -fPIC
allow this kind of movement for code that contains intramodule
references (that is *all* references in the kernel).  Since we really
doesn't want to burden the kernel with a GOT and a PLT, that is life.

 You end up doing a
 bunch of the work that a JIT would do anyway, and would end up with
 considerably higher complexity and worse results than a true JIT.  
 
 Well, less complexity but worse result, yes. We'd only poke the specific
 static_branch sites with either NOPs or the (relative) jump target for
 each of these branches. Then copy the result.

Once again, you can't copy the result.  You end up with a full
disassembler.
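
A small sketch of why copied code breaks: a near `jmp rel32` (opcode 0xe9) encodes its target relative to the end of the instruction, so the same bytes at a different address jump somewhere else. The function name is made up for the example.

```c
#include <stdint.h>

/* Decode the target of a 5-byte "jmp rel32" (opcode 0xe9) placed at insn_addr. */
static uint64_t jmp_rel32_target(const unsigned char insn[5], uint64_t insn_addr)
{
    /* The displacement is a little-endian signed 32-bit value. */
    uint32_t raw = (uint32_t)insn[1]
                 | (uint32_t)insn[2] << 8
                 | (uint32_t)insn[3] << 16
                 | (uint32_t)insn[4] << 24;
    int32_t rel = (int32_t)raw;

    /* The target is relative to the address of the *next* instruction. */
    return insn_addr + 5 + (uint64_t)(int64_t)rel;
}
```

Moving the instruction from 0x1000 to 0x2000 moves its target by the same 0x1000, so every such displacement (and every call or RIP-relative reference) that leaves the copied region has to be found and fixed up, which is the disassembler work mentioned above.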

-hpa



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread H. Peter Anvin
On 08/06/2013 09:15 AM, Steven Rostedt wrote:
 On Mon, 2013-08-05 at 14:43 -0700, H. Peter Anvin wrote:
 
 For unconditional jmp that should be pretty safe barring any fundamental
 changes to the instruction set, in which case we can enable it as
 needed, but for extra robustness it probably should skip prefix bytes.
 
 Would the assembler add prefix bytes to:
 
   jmp 1f
 

No, but if we ever end up doing MPX in the kernel, for example, we would
have to put an MPX prefix on the jmp.

-hpa




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread H. Peter Anvin
On 08/06/2013 09:26 AM, Steven Rostedt wrote:

 No, but if we ever end up doing MPX in the kernel, for example, we would
 have to put an MPX prefix on the jmp.
 
 Well then we just have to update the rest of the jump label code :-)
 

For MPX in the kernel, this would be a small part of the work...!

-hpa



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 09:55 AM, Steven Rostedt wrote:
 
 Almost a full year ago, Mathieu suggested something like:
 
 if (unlikely(x)) __attribute__((section(".unlikely"))) {
 ...
 } else __attribute__((section(".likely"))) {
 ...
 }
 
 https://lkml.org/lkml/2012/8/9/658
 
 Which got me thinking. How hard would it be to set a block in its own
 section. Like what Mathieu suggested, but it doesn't have to be
 .unlikely.
 
 if (x) __attribute__((section(".foo"))) {
   /* do something */
 }
 

One concern I have is how this kind of code would work when embedded
inside a function which already has a section attribute.  This could
easily cause really weird bugs when someone optimizes an inline or
macro and breaks a single call site...
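
For comparison, a sketch of what can be done today at function granularity rather than block granularity: the rare path is split out by hand into a `cold`, `noinline` function, which GCC documents as being placed in a separate subsection of the text section. The macro and function names are assumptions made for the example; this is not the block-level syntax proposed in the thread.

```c
#if defined(__GNUC__)
# define unlikely(x)   __builtin_expect(!!(x), 0)
# define COLD_NOINLINE __attribute__((cold, noinline))
#else
# define unlikely(x)   (x)
# define COLD_NOINLINE
#endif

/* Hypothetical slow path, kept out of line so it stays away from hot code. */
static COLD_NOINLINE int handle_rare_case(int x)
{
    return -x;
}

int process(int x)
{
    if (unlikely(x < 0))
        return handle_rare_case(x);
    return x + 1;
}
```

Since the placement is attached to a whole function, the nesting question raised above does not arise here, at the cost of an explicit call.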

-hpa




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 10:55 AM, Steven Rostedt wrote:
 
 Well, as tracepoints are being added quite a bit in Linux, my concern is
 with the inlined functions that they bring. With jump labels they are
 disabled in a very unlikely way (the static_key_false() is a nop to skip
 the code, and is dynamically enabled to a jump).
 

Have you considered using traps for tracepoints?  A trapping instruction
can be as small as a single byte.  The downside, of course, is that it
is extremely suppressed -- the trap is always expensive -- and you then
have to do a lookup to find the target based on the originating IP.
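
The lookup step could look roughly like the following, assuming a table of trap sites sorted by address. The table layout and all names are hypothetical; nothing like this is proposed as actual kernel code in the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record: one entry per trapping instruction. */
struct trap_site {
    uintptr_t ip;            /* address of the trap instruction */
    int       tracepoint_id; /* which tracepoint it belongs to */
};

static int cmp_site(const void *key, const void *elem)
{
    uintptr_t ip = *(const uintptr_t *)key;
    const struct trap_site *site = elem;

    return (ip > site->ip) - (ip < site->ip);
}

/* Binary search in a table sorted by ip; returns -1 if ip is not a trap site. */
static int lookup_tracepoint(const struct trap_site *table, size_t n,
                             uintptr_t ip)
{
    const struct trap_site *site =
        bsearch(&ip, table, n, sizeof *table, cmp_site);

    return site ? site->tracepoint_id : -1;
}
```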

-hpa




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 11:23 AM, Steven Rostedt wrote:
 On Mon, 2013-08-05 at 11:17 -0700, H. Peter Anvin wrote:
 On 08/05/2013 10:55 AM, Steven Rostedt wrote:

 Well, as tracepoints are being added quite a bit in Linux, my concern is
 with the inlined functions that they bring. With jump labels they are
 disabled in a very unlikely way (the static_key_false() is a nop to skip
 the code, and is dynamically enabled to a jump).


 Have you considered using traps for tracepoints?  A trapping instruction
 can be as small as a single byte.  The downside, of course, is that it
 is extremely suppressed -- the trap is always expensive -- and you then
 have to do a lookup to find the target based on the originating IP.
 
 No, never considered it, nor would I. Those that use tracepoints, do use
 them extensively, and adding traps like this would probably cause
 heisenbugs and make tracepoints useless.
 
 Not to mention, how would we add a tracepoint to a trap handler?
 

Traps nest, that's why there is a stack.  (OK, so you don't want to take
the same trap inside the trap handler, but that code should be very
limited.)  The trap instruction just becomes very short, but rather
slow, call-return.

However, when you consider the cost you have to consider that the
tracepoint is doing other work, so it may very well amortize out.

-hpa




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 11:20 AM, Linus Torvalds wrote:
 
 Of course, it would be good to optimize static_key_false() itself -
 right now those static key jumps are always five bytes, and while they
 get nopped out, it would still be nice if there was some way to have
 just a two-byte nop (turning into a short branch) *if* we can reach
 another jump that way..For small functions that would be lovely. Oh
 well.
 

That would definitely require gcc support.  It would be useful, but
probably requires a lot of machinery.

-hpa




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 11:34 AM, Linus Torvalds wrote:
 On Mon, Aug 5, 2013 at 11:24 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:

 Ugh. I can see the attraction of your section thing for that case, I
 just get the feeling that we should be able to do better somehow.
 
 Hmm.. Quite frankly, Steven, for your use case I think you actually
 want the C goto *labels* associated with a section. Which sounds like
 it might be a cleaner syntax than making it about the basic block
 anyway.
 

A label wouldn't have an endpoint, though...

-hpa




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 11:49 AM, Steven Rostedt wrote:
 On Mon, 2013-08-05 at 11:29 -0700, H. Peter Anvin wrote:
 
 Traps nest, that's why there is a stack.  (OK, so you don't want to take
 the same trap inside the trap handler, but that code should be very
 limited.)  The trap instruction just becomes very short, but rather
 slow, call-return.

 However, when you consider the cost you have to consider that the
 tracepoint is doing other work, so it may very well amortize out.
 
 Also, how would you pass the parameters? Every tracepoint has its own
 parameters to pass to it. How would a trap know where to get prev
 and next?
 

How do you do that now?

You have to do an IP lookup to find out what you are doing.

(Note: I wonder how much the parameter generation costs the tracepoints.)

-hpa




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 02:28 PM, Mathieu Desnoyers wrote:
 * Linus Torvalds (torva...@linux-foundation.org) wrote:
 On Mon, Aug 5, 2013 at 12:54 PM, Mathieu Desnoyers
 mathieu.desnoy...@efficios.com wrote:

 I remember that choosing between 2 and 5 bytes nop in the asm goto was
 tricky: it had something to do with the fact that gcc doesn't know the
 exact size of each instruction until further down within compilation

 Oh, you can't do it in the compiler, no. But you don't need to. The
 assembler will pick the right version if you just do jmp target.
 
 Yep.
 
 Another thing that bothers me with Steven's approach is that decoding
 jumps generated by the compiler seems fragile IMHO.
 
 x86 decoding proposed by https://lkml.org/lkml/2012/3/8/464 :
 
 +static int make_nop_x86(void *map, size_t const offset)
 +{
 +        unsigned char *op;
 +        unsigned char *nop;
 +        int size;
 +
 +        /* Determine which type of jmp this is 2 byte or 5. */
 +        op = map + offset;
 +        switch (*op) {
 +        case 0xeb: /* 2 byte */
 +                size = 2;
 +                nop = ideal_nop2_x86;
 +                break;
 +        case 0xe9: /* 5 byte */
 +                size = 5;
 +                nop = ideal_nop;
 +                break;
 +        default:
 +                die(NULL, "Bad jump label section (bad op %x)\n", *op);
 +                __builtin_unreachable();
 +        }
 
 My thought is that the code above does not cover all jump encodings that
 can be generated by past, current and future x86 assemblers.
 

For unconditional jmp that should be pretty safe barring any fundamental
changes to the instruction set, in which case we can enable it as
needed, but for extra robustness it probably should skip prefix bytes.
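
A minimal sketch of what "skip prefix bytes" could look like, under stated assumptions: the prefix list and the function names are illustrative, not taken from the patch quoted above.  0x66 is deliberately not skipped because it changes the displacement width in 32-bit code, and 0x67 is left out simply to stay conservative.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Prefix bytes that do not change the length of a relative jmp:
 * segment overrides (0x2e and 0x3e double as branch hints) and the
 * 0xf2/0xf3 repeat prefixes.  0x66 and 0x67 are NOT in this list.
 */
static int is_skippable_prefix(unsigned char b)
{
    switch (b) {
    case 0x26: case 0x2e: case 0x36: case 0x3e:
    case 0x64: case 0x65: case 0xf2: case 0xf3:
        return 1;
    default:
        return 0;
    }
}

/*
 * Return the total length (prefixes included) of the unconditional
 * relative jmp starting at insn[0], or -1 if it is not recognised
 * or does not fit in the 'avail' bytes that may be read.
 */
static int jmp_insn_len(const unsigned char *insn, size_t avail)
{
    size_t i = 0;

    assert(insn != NULL);
    while (i < avail && is_skippable_prefix(insn[i]))
        i++;
    if (i == avail)
        return -1;
    if (insn[i] == 0xeb)            /* jmp rel8 */
        return (i + 2 <= avail) ? (int)(i + 2) : -1;
    if (insn[i] == 0xe9)            /* jmp rel32 */
        return (i + 5 <= avail) ? (int)(i + 5) : -1;
    return -1;                      /* includes 0x66-prefixed forms */
}
```

Anything the sketch does not recognise is reported as an error rather than guessed at, which matches the die() in the quoted patch.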

-hpa



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 09:14 PM, Mathieu Desnoyers wrote:

 For unconditional jmp that should be pretty safe barring any fundamental
 changes to the instruction set, in which case we can enable it as
 needed, but for extra robustness it probably should skip prefix bytes.
 
 On x86-32, some prefixes are actually meaningful. AFAIK, the 0x66 prefix
 is used for:
 
 E9 cw   jmp rel16   relative jump, only in 32-bit
 
 Other prefixes can probably be safely skipped.
 

Yes.  Some of them are used as hints or for MPX.

 Another question is whether anything prevents the assembler from
 generating a jump near (absolute indirect), or far jump. The code above
 seems to assume that we have either a short or near relative jump.

Absolutely something prevents!  It would be a very serious error for the
assembler to generate such instructions.

-hpa






Re: Deprecate i386 for GCC 4.8?

2013-01-01 Thread H. Peter Anvin

On 12/12/2012 01:07 PM, David Brown wrote:


I believe it has been a very long time since any manufacturers made a
pure 386 chip.



I believe embedded 386 production ceased in 2007.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [x86-64 psABI] RFC: Extend x86-64 psABI to support x32

2012-06-28 Thread H. Peter Anvin

On 06/28/2012 02:03 PM, Mark Butler wrote:

On Tuesday, June 26, 2012 1:53:01 PM UTC-6, H. Peter Anvin wrote:

It's worth noting that there are *no* Linux platforms that are not
ILP32
or LP64, so adding a third memory model is likely to cause even more
problems...


Care to comment on what sort of things would be likely to cause a large
number of problems porting to an L64P32 model?  I understand that L32P64
(as in Windows 64 bit) causes lots of problems, because there is a lot
of code that assumes that a pointer can be converted to a long and back.
  That would not be a problem with L64P32 however, because there
pointers would be smaller than longs rather than larger.


Every time you introduce a new model you will have problems, but in 
Linux it is a strong assumption that sizeof(long) == sizeof(void *).
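
To make the assumption concrete, here is a small sketch of the kind of code that depends on it.  The helper names are invented for illustration; nothing here is a kernel interface.

```c
#include <stdint.h>

/* Nonzero when a pointer survives a trip through 'unsigned long'. */
static int survives_long_roundtrip(const void *p)
{
    unsigned long as_long = (unsigned long)(uintptr_t)p;

    return (const void *)(uintptr_t)as_long == p;
}

/* Nonzero on the two models Linux actually uses (ILP32 and LP64). */
static int long_matches_pointer(void)
{
    return sizeof(long) == sizeof(void *);
}
```

Under L64P32 the round trip would still work, as Mark points out, but long_matches_pointer() would return 0, and code that sizes or aligns things on that equality would need auditing.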


-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.





Re: [x86-64 psABI] RFC: Extend x86-64 psABI to support x32

2012-06-26 Thread H. Peter Anvin
On 06/26/2012 12:47 PM, H.J. Lu wrote:

 May I ask why the decision was made to use ILP32 instead of L64P32?   The
 latter would seem to avoid lots of porting problems in particular.  And if
 porting difficulties are the major complaint about x32, is it really too
 late to switch?  Thanks - mdb
 
 x32 is designed to replace ia32 where long is 32-bit, not x86-64.
 

It's worth noting that there are *no* Linux platforms that are not ILP32
or LP64, so adding a third memory model is likely to cause even more
problems...

-hpa



Re: [x86-64 psABI] RFC: Extend x86-64 psABI to support x32

2012-05-14 Thread H. Peter Anvin
On 05/14/2012 10:31 AM, H.J. Lu wrote:
 Hi,
 
 Support for the x32 psABI:
 
 http://sites.google.com/site/x32abi/
 
 is added in Linux kernel 3.4-rc1.  X32 uses the ILP32 model for x86-64
 instruction set with size of long and pointers == 4 bytes.  X32 is
 already supported in GCC 4.7.0 and binutils 2.22.  I am now working
 to integrate x32 support into GLIBC 2.16 and GDB 7.5.  Here is a
 patch to extend x86-64 psABI for x32.  Any comments?
 

As a minor nitpick, I have always used x32 with a lower case x.  The
capital X32 looks odd to me.

-hpa



Re: Add __ILP32 and __ILP32__ for X32 programming model

2012-04-13 Thread H. Peter Anvin
On 04/13/2012 09:18 AM, H.J. Lu wrote:
 Hi,
 
 We need a reliable way to tell if we are compiling for x32 through
 pre-defined preprocessor symbol.  __LP64/__LP64__ aren't
 specified by x86-64 psABI, although they have been added to
 GCC 3.3.  They can't be counted on to detect x32 since not x86-64
 GCC 3.3.  They can't be counted on to detect x32 since not all x86-64
 compilers define them.   I updated x32 psABI:
 
 https://sites.google.com/site/x32abi/documents
 
 to define __ILP32 and __ILP32__ for X32 programming model.  I
 will submit a patch for GCC trunk and 4.7 branch.
 

Can we add __LP64__ to the psABI too?
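
For illustration, a sketch of how user code might test these macros once they can be relied upon.  The function name is made up, and a compiler that defines neither macro simply falls through to "unknown".

```c
/*
 * Name the data model from the compiler's predefined macros.
 * __ILP32__ and __LP64__ are the macros named in this thread.
 */
static const char *data_model(void)
{
#if defined(__x86_64__) && defined(__ILP32__)
    return "x32";        /* x86-64 instructions, 32-bit long and pointers */
#elif defined(__ILP32__)
    return "ILP32";
#elif defined(__LP64__)
    return "LP64";
#else
    return "unknown";
#endif
}
```

The "unknown" branch is exactly the case the request is about: without a psABI guarantee, a conforming compiler is free to land there.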

-hpa




Re: [x32] PATCH: Remove ix86_promote_function_mode

2011-06-20 Thread H. Peter Anvin
On 06/20/2011 07:01 AM, H.J. Lu wrote:
 On Mon, Jun 20, 2011 at 6:53 AM, Bernd Schmidt ber...@codesourcery.com 
 wrote:
 On 06/20/2011 03:51 PM, H.J. Lu wrote:
 Promote pointers to Pmode when passing/returning in registers is
 a security concern.

No.  Promoting *NON*-pointers (or rather, requiring non-pointers to
have already been zero-extended) is a security concern.  I thought I'd
made that point clear already.  This is a hideously critical distinction.

 Peter, do you think it is safe to assume upper 32bits are zero in
 user space for x32? Kernel isn't a problem since pointer is 64bit
 in kernel and we don't pass pointers on stack to kernel.

As I have already stated, if we *cannot* require pointers to be
zero-extended on entry to the kernel, we're going to have to have
special entry points for all the x32 system calls except the ones that
don't take pointers.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [x32] PATCH: Remove ix86_promote_function_mode

2011-06-20 Thread H. Peter Anvin
On 06/20/2011 07:43 AM, H.J. Lu wrote:
 On Mon, Jun 20, 2011 at 7:33 AM, H. Peter Anvin h...@zytor.com wrote:
 On 06/20/2011 07:01 AM, H.J. Lu wrote:
 On Mon, Jun 20, 2011 at 6:53 AM, Bernd Schmidt ber...@codesourcery.com 
 wrote:
 On 06/20/2011 03:51 PM, H.J. Lu wrote:
 Promote pointers to Pmode when passing/returning in registers is
 a security concern.

 No.  Promoting *NON*-pointers (or rather, requiring non-pointers to
 have already been zero-extended) is a security concern.  I thought I'd
 made that point clear already.  This is a hideously critical distinction.
 
 I can promote pointers then.
 

Yes.  The issue comes when promoting non-pointers, like unsigned int.
 Pointers are the opposite because pointers in the kernel are 64 bits
and we'd like them to be pre-promoted.

 Peter, do you think it is safe to assume upper 32bits are zero in
 user space for x32? Kernel isn't a problem since pointer is 64bit
 in kernel and we don't pass pointers on stack to kernel.

 As I have already stated, if we *cannot* require pointers to be
 zero-extended on entry to the kernel, we're going to have to have
 special entry points for all the x32 system calls except the ones that
 don't take pointers.
 
 32bit pointers are zero-extended to 64bit when passing in registers to
 kernel.

Excellent, sounds like we have converged.

I saw you posted something about pointers on the stack, but it sounds
like you already realized that we don't pass any stack arguments
whatsoever to the kernel.

-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [x32] PATCH: Remove ix86_promote_function_mode

2011-06-20 Thread H. Peter Anvin
On 06/20/2011 03:49 PM, Richard Henderson wrote:

 As I have already stated, if we *cannot* require pointers to be
 zero-extended on entry to the kernel, we're going to have to have
 special entry points for all the x32 system calls except the ones that
 don't take pointers.
 
 If it's a security concern, surely you have to do it in the kernel
 anyway, lest someone call into the kernel via their own assembly
 rather than something controlled by the compiler...
 

That was the point... right now we rely on the ABI to not have any
invalid representations (except, as far as I know, on s390).  This means
any arbitrary register image presented to the kernel will be a set of
valid C objects; we then accept or reject them as being semantically
valid using normal C code in the kernel.

The issue occurs when the kernel can be entered with something in the
register that is invalid according to the calling convention, and not
have it rejected.

The current x86-64 ABI rules, for example, imply that if
%rdi = 0x3fb8c9119537d37d and the type of the first argument is
uint32_t, that is a valid argument with the value 0x9537d37d.  The extra
upper bits are ignored, and so no security issue arises.
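
The arithmetic in that example can be checked with a one-line model of the callee reading only the low half of the register.  The function name is illustrative.

```c
#include <stdint.h>

/*
 * Model of "the upper bits are ignored": the callee reads only the
 * low 32 bits of the 64-bit register image it was handed.
 */
static uint32_t arg_as_u32(uint64_t reg_image)
{
    return (uint32_t)reg_image;
}
```

With the register image from the text, arg_as_u32(0x3fb8c9119537d37d) yields 0x9537d37d.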

The issue with requiring the upper bits to be normalized occurs with
code like:

static const long foo_table[10] = { ... };

long sys_foo(unsigned int bar)
{
        if (bar >= 10)
                return -EINVAL;

        return foo_table[bar];
}


If the upper bits are required to be zero, gcc could validly translate
that to:

sys_foo:
        cmpl    $10, %edi
        jae     .L1

        movq    foo_table(,%rdi,8), %rax
        retq
.L1:
        movq    $-EINVAL, %rax
        retq

Enter this function with a non-normalized %rdi and you have a security
hole even though the C is perfectly fine.
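
The difference between the two conventions can be modelled in plain C by returning the index the table load would end up using.  This is only a simulation with invented names; it performs no actual table access.

```c
#include <stdint.h>

#define FOO_ENTRIES 10u

/*
 * Index the lookup would use if the compiler may assume the upper
 * 32 bits are already zero: the check looks at the low half (the
 * cmpl on %edi) but the load uses the whole register (%rdi).
 * Returns UINT64_MAX when the check rejects the argument.
 */
static uint64_t index_if_caller_trusted(uint64_t rdi)
{
    if ((uint32_t)rdi >= FOO_ENTRIES)
        return UINT64_MAX;
    return rdi;                     /* upper bits leak into the index */
}

/*
 * Index used when the callee extends for itself: both the check and
 * the load see only the low 32 bits.
 */
static uint64_t index_if_callee_extends(uint64_t rdi)
{
    uint32_t bar = (uint32_t)rdi;

    if (bar >= FOO_ENTRIES)
        return UINT64_MAX;
    return bar;
}
```

A register image such as 0x3fb8c91100000003 passes the check in both versions, but only the second one ends up indexing entry 3.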

-hpa





Re: X32 project status update

2011-05-21 Thread H. Peter Anvin
On 05/21/2011 09:27 AM, H.J. Lu wrote:
 On Sat, May 21, 2011 at 8:34 AM, H.J. Lu hjl.to...@gmail.com wrote:
 On Sat, May 21, 2011 at 8:27 AM, Arnd Bergmann a...@arndb.de wrote:
 On Saturday 21 May 2011 17:01:33 H.J. Lu wrote:
 This is the x32 project status update:

 https://sites.google.com/site/x32abi/


 I've had another look at the kernel patch. It basically
 looks all good, but the system call table appears to
 diverge from the x86_64 list for no (documented) reason,
 in the calls above 302. Is that intentional?

 I can see why you might want to keep the numbers identical,
 but if they are already different, why not use the generic
 system call table from asm-generic/unistd.h for the new
 ABI?

 We can sort it out when we start merging x32 kernel changes.

 
 Peter, is it possible to use the single syscall table for
 both x86-64 and x32 system calls? Out of 300+ system
 calls, only 84 are different for x86-64 and x32.  That
 is additional 8*84 == 672 bytes in syscall table.
 

Sort of... remember we talked about merging system calls at the tail
end?  The problem with that is that some system calls (like read()!)
actually are different system calls in very subtle situations, due to
abuse in some subsystems of the is_compat() construct.  I think that may
mean we have to have an unambiguous flag after all...

Now, perhaps we can use a high bit for that and mask it before dispatch,
then we don't need the additional table.  A bit of a hack, but it should
work.
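
A sketch of the "high bit, masked before dispatch" idea.  The bit value and the table size below are placeholders chosen for this example; the message does not pick either.

```c
/* Hypothetical flag bit; the thread does not pick a value. */
#define ABI_FLAG_BIT_SKETCH 0x40000000u
#define NR_SYSCALLS_SKETCH  512u   /* illustrative table size */

struct decoded_call {
    unsigned int nr;      /* index into the single shared table */
    int from_x32;         /* 1 if the caller set the flag bit */
};

/* Split the raw syscall number into table index and ABI flag. */
static struct decoded_call decode_call(unsigned int raw_nr)
{
    struct decoded_call d;

    d.from_x32 = (raw_nr & ABI_FLAG_BIT_SKETCH) != 0;
    d.nr = raw_nr & ~ABI_FLAG_BIT_SKETCH;
    return d;
}

/* Range check performed after masking, before indexing the table. */
static int call_in_range(struct decoded_call d)
{
    return d.nr < NR_SYSCALLS_SKETCH;
}
```

The flag survives as from_x32, so a handler that behaves differently per ABI can still tell the callers apart without a second table.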

-hpa


Re: x32 psABI draft version 0.2

2011-02-17 Thread H. Peter Anvin
On 02/17/2011 10:06 AM, Jakub Jelinek wrote:
 On Thu, Feb 17, 2011 at 04:44:53PM +0100, Jan Hubicka wrote:
 According to Mozilla folks however REL+RELA scheme used by EABI leads
 to significantly smaller libxul.so size

 According to http://glandium.org/blog/?p=1177 the difference is about 4-5MB
 (out of approximately 20-30MB shared lib)

 This is orthogonal to x32 psABI.

 Understood.  I am just pointing out that x86-64 Mozilla suffers from startup
 problems (extra 5MB of disk read needed) compared to both x86 and ARM EABI
 because x86-64 ABI is RELA only. If x86-64 ABI was REL+RELA like EABI is, we
 would not have this problem here.
 
 libxul.so has  20 relocs, so 5MB is total size of .rela section in
 64-bit ELF, you don't magically save those 5MB by using REL.  You save
 just 1.5MB.  And for x32 we'd be talking about 2.5MB for RELA vs. 1.6MB for
 REL.  There might be better ways how to get the numbers down.
 

The size is, of course, half of that for the x32 ABI in the first place.

-hpa



Re: x32 psABI draft version 0.2

2011-02-17 Thread H. Peter Anvin
On 02/17/2011 02:49 PM, Jan Hubicka wrote:
 On Thu, Feb 17, 2011 at 04:44:53PM +0100, Jan Hubicka wrote:
 According to Mozilla folks however REL+RELA scheme used by EABI leads
 to significantly smaller libxul.so size

 According to http://glandium.org/blog/?p=1177 the difference is about 
 4-5MB
 (out of approximately 20-30MB shared lib)

 This is orthogonal to x32 psABI.

 Understood.  I am just pointing out that x86-64 Mozilla suffers from startup
 problems (extra 5MB of disk read needed) compared to both x86 and ARM EABI
 because x86-64 ABI is RELA only. If x86-64 ABI was REL+RELA like EABI is, we
 would not have this problem here.

 libxul.so has  20 relocs, so 5MB is total size of .rela section in
 64-bit ELF, you don't magically save those 5MB by using REL.  You save
 just 1.5MB.  And for x32 we'd be talking about 2.5MB for RELA vs. 1.6MB for
 
 The blog claims
 Architecture  libxul.so size  relocations size  %
 x86           21,869,684      1,884,864         8.61%
 x86-64        29,629,040      5,751,984         19.41%
 
 The REL encoding also grows twice for 64bit target?
 

REL would be twice the size for a 64-bit target (which x32 is not, from
an ELF point of view).  Keep in mind that REL cannot do error handling
very well, especially not on a 64-bit platform.

Elf32_Rel:   8 bytes
Elf32_Rela: 12 bytes
Elf64_Rel:  16 bytes
Elf64_Rela: 24 bytes

So 1,884,864 to 5,751,984 indicates a (very) small increase in
relocation count; the exactly equivalent numbers would be:

Elf32_Rel:  1,884,864 bytes
Elf32_Rela: 2,827,296 bytes
Elf64_Rel:  3,769,728 bytes
Elf64_Rela: 5,654,592 bytes
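
The figures above can be reproduced with a few lines of C.  The entry sizes are the ones listed in this message; the enum and function names are made up for the sketch.

```c
/* Entry sizes in bytes, in the order listed above. */
enum reloc_kind { ELF32_REL, ELF32_RELA, ELF64_REL, ELF64_RELA };

static const unsigned long reloc_entry_size[] = { 8, 12, 16, 24 };

/* Bytes taken by 'count' relocation entries of the given kind. */
static unsigned long reloc_bytes(unsigned long count, enum reloc_kind kind)
{
    return count * reloc_entry_size[kind];
}

/* Number of entries in a table of 'bytes' bytes of the given kind. */
static unsigned long reloc_count(unsigned long bytes, enum reloc_kind kind)
{
    return bytes / reloc_entry_size[kind];
}
```

Read this way, the blog's x86 figure is 235,608 entries and its x86-64 figure is 239,666 entries, a difference of 4,058 relocations or roughly 1.7%.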

-hpa


Re: x32 psABI draft version 0.2

2011-02-16 Thread H. Peter Anvin
On 02/16/2011 11:22 AM, H.J. Lu wrote:
 Hi,
 
 I updated  x32 psABI draft to version 0.2 to change x32 library path
 from lib32 to libx32 since lib32 is used for ia32 libraries on Debian,
 Ubuntu and other derivative distributions. The new x32 psABI is
 available from:
 
 https://sites.google.com/site/x32abi/home
 

I'm wondering if we should define a section header flag (sh_flags)
and/or an ELF header flag (e_flags) for x32 for the people unhappy about
keying it to the ELF class...

-hpa



Re: X32 psABI status

2011-02-13 Thread H. Peter Anvin
On 02/13/2011 01:10 PM, H.J. Lu wrote:

 1. Kernel interface with syscall is close to be finalized.


I don't think calling it finalized is accurate... it is more
accurately described as prototyped.

 Really? I haven't seen this being posted for review yet ;-)

 The basic concept looks entirely reasonable to me, but I'm
 curious what drove the decision to start out with the x86_64
 system calls instead of the generic ones.

 Since tile was merged, we now have support for compat syscalls
 in the generic syscall ABI. I would have assumed that it
 was possible to just use those if you decide to do a new
 ABI in the first place.
 
 The other option that would have appeared natural to me is
 to just use the existing 32 bit compat ABI with the few
 necessary changes done based on the personality.

The actual idea is to use the i386 compat ABI for memory layout, but
with a 64-bit register convention.  That means that system calls that
don't make references to memory structures can simply use the 64-bit
system calls, otherwise we're planning to reuse the i386 compat system
calls, but invoke them via the syscall instruction (which requires a new
system call table) and to pass 64-bit arguments in single registers.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: X32 psABI status

2011-02-13 Thread H. Peter Anvin
On 02/13/2011 01:28 PM, H.J. Lu wrote:
 
 That is currently implemented on hjl/x32 branch.
 
 I also added
 
 __NR_sigaction
 __NR_sigpending
 __NR_sigprocmask
 __NR_sigsuspend
 
 to help the Bionic C library.
 

That seems a little redundant... even on the i386 front we want people
to use the rt_sig* system calls.  As a porting aid I can see it, but we
should avoid deprecated system calls in the final version.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: X32 psABI status

2011-02-13 Thread H. Peter Anvin
On 02/13/2011 01:16 PM, H. Peter Anvin wrote:
 
 The actual idea is to use the i386 compat ABI for memory layout, but
 with a 64-bit register convention.  That means that system calls that
 don't make references to memory structures can simply use the 64-bit
 system calls, otherwise we're planning to reuse the i386 compat system
 calls, but invoke them via the syscall instruction (which requires a new
 system call table) and to pass 64-bit arguments in single registers.
 

Oh, and as to why not copy the i386 system call list straight off... we
don't really want to add a new ABI with crap like sys_socketcall.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: X32 psABI status

2011-02-13 Thread H. Peter Anvin
On 02/13/2011 02:28 PM, Arnd Bergmann wrote:
 On Sunday 13 February 2011, H. Peter Anvin wrote:
 The actual idea is to use the i386 compat ABI for memory layout, but
 with a 64-bit register convention.  That means that system calls that
 don't make references to memory structures can simply use the 64-bit
 system calls, otherwise we're planning to reuse the i386 compat system
 calls, but invoke them via the syscall instruction (which requires a new
 system call table) and to pass 64-bit arguments in single registers.
 
 As far as I know, any task can already call both the 32 and 64 bit syscall
 entry points on x86. Is there anything you can't do just as well by
 using a combination of the two methods, without introducing a third one?

We prototyped using the int $0x80 system call entry point.  However,
there are two disadvantages:

a. the int $0x80 instruction is much slower than syscall.  An actual
   i386 process can use the syscall instruction which is disambiguated
   by the CPU based on mode, but an x32 process is in the same CPU mode
   as a normal 64-bit process.
b. 64-bit arguments have to be split between two registers for the
   i386 entry points, requiring user-space stubs.

All in all, the cost of an extra system call table is quite modest.  The
cost of an entire different ABI layer (supporting a new memory layout)
would be enormous, a.k.a. not worth it, which is why the memory layout
of kernel objects needs to be compatible with i386.
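
Point (b) is the kind of glue a user-space stub would need.  A minimal sketch, with invented names, of splitting a 64-bit argument into two 32-bit register values and putting it back together:

```c
#include <stdint.h>

/* The two 32-bit register values a 64-bit argument is split into. */
struct split_arg {
    uint32_t lo;
    uint32_t hi;
};

/* What a user-space stub would do before an i386-style entry. */
static struct split_arg split_u64(uint64_t value)
{
    struct split_arg s;

    s.lo = (uint32_t)value;
    s.hi = (uint32_t)(value >> 32);
    return s;
}

/* What the receiving side would do to rebuild the argument. */
static uint64_t join_u64(struct split_arg s)
{
    return ((uint64_t)s.hi << 32) | s.lo;
}
```

With a 64-bit register convention neither step is needed, since the value travels in one register.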

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: X32 psABI status

2011-02-13 Thread H. Peter Anvin

On 02/13/2011 01:33 PM, Alan Cox wrote:


Who actually needs this new extra API - whats the justification for
everyone having more crud dumping their kernels, more syscall paths
(which are one of the most security critical areas) and the like.

What are the benchmark numbers to justify this versus just using the
existing kernel interfaces ?



That's what the prototype is meant to show.

-hpa


Re: X32 psABI status

2011-02-13 Thread H. Peter Anvin

On 02/13/2011 03:39 PM, Alan Cox wrote:

a. the int $0x80 instruction is much slower than syscall.  An actual
i386 process can use the syscall instruction which is disambiguated
by the CPU based on mode, but an x32 process is in the same CPU mode
as a normal 64-bit process.


So set a flag, whoopee


That's what we're doing, functionally.


b. 64-bit arguments have to be split between two registers for the
i386 entry points, requiring user-space stubs.


Diddums. Given you've yet to explain why everyone desperately needs this
extra interface why do we care ?


All in all, the cost of an extra system call table is quite modest.


And the cost of not doing it is a gloriously wonderful zero. You've still
not explained the justification or what large number of apps are going to
use it.

It's a simple question - why do we care, why do we want the overhead and
the hassle, what do users get in return ?


The target applications are an embedded (closed or mostly closed) 
environment, and the question is if the performance gain is worth it. 
It is an open question at this stage and we'll see what the numbers look 
like and, if it turns out to be worthwhile, what exactly the final 
implementation will look like.


-hpa


Re: X32 psABI status

2011-02-12 Thread H. Peter Anvin
On 02/12/2011 01:10 PM, Florian Weimer wrote:
 Why is the ia32 compatibility kernel interface used?

Because there is no way in hell we're designing in a second
compatibility ABI in the kernel (and it has to be a compatibility ABI,
because of the pointer size difference.)
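
To show why the pointer size alone forces a compatibility layer, here is a sketch of one structure in both layouts.  It is modelled loosely on an iovec; the struct and function names are invented for this example and are not kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Native 64-bit kernel view. */
struct native_iovec_sketch {
    void  *base;
    size_t len;
};

/* What a process with 4-byte pointers and 4-byte long lays out. */
struct compat_iovec_sketch {
    uint32_t base;   /* user pointer stored as 32 bits */
    uint32_t len;
};

/* Widen the 32-bit layout into the native one, zero-extending both. */
static struct native_iovec_sketch
widen_iovec(struct compat_iovec_sketch c)
{
    struct native_iovec_sketch n;

    n.base = (void *)(uintptr_t)c.base;
    n.len  = c.len;
    return n;
}
```

Every structure passed by reference that contains a pointer or a long needs this kind of translation, which is why reusing the existing ia32 layer is so much cheaper than adding another one.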

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: RFC: Add 32bit x86-64 support to binutils

2011-01-05 Thread H. Peter Anvin
On 01/04/2011 11:46 PM, Jan Beulich wrote:

 Oh god, please, no.

 I have to say I highly question Jan's statement in the first
 place.  Crossing 32- and 64-bit ELF like that sounds like a kernel
 security hole waiting to happen.
 
 A particular OS/kernel has the freedom to not implement support for
 other than the default format. But having the ABI disallow it
 altogether certainly isn't the right choice. And yes, we had been
 allowing cross-bitness ELF in an experimental (long canceled) OS
 of ours.
 
 Yeah, and there are other targets where the elf class determines ABI
 too (e.g. EM_S390 is used for both 31-bit and 64-bit binaries and
 the ELF class determines which).
 
 So the usual thing is going to happen - someone made a mistake (I'm
 convinced the ELF class was never meant to affect anything but the
 file format), and this gets taken as an excuse to let the mistake
 spread.
 

I don't think it's all that unreasonable to say the ELF class affects
the ABI.  After all, there are lots of things about the ABI that are
related to the ELF class -- the format of the GOT and PLT, for one thing.

-hpa


Re: RFC: Add 32bit x86-64 support to binutils

2011-01-04 Thread H. Peter Anvin
On 01/04/2011 09:56 AM, H.J. Lu wrote:

 I think it is a gross misconception to tie the ABI to the ELF class of
 an object. Specifying the ABI should imo be done via e_flags or
 one of the unused bytes of e_ident, and in all reality the ELF class
 should *only* affect the file layout (and 64-bit should never have
 forbidden to use 32-bit ELF containers; similarly 64-bit ELF objects
 may have uses for 32-bit architectures/ABIs, e.g. when debug
 information exceeds the 4G boundary).
 
 I agree with you in principle. But I think it should be done via
 a new attribute section, similar to ARM.
 

Oh god, please, no.

I have to say I highly question Jan's statement in the first
place.  Crossing 32- and 64-bit ELF like that sounds like a kernel
security hole waiting to happen.

-hpa



Re: RFC: Add 32bit x86-64 support to binutils

2010-12-31 Thread H. Peter Anvin
On 12/31/2010 02:03 AM, Jakub Jelinek wrote:
 On Thu, Dec 30, 2010 at 01:42:05PM -0800, H. Peter Anvin wrote:
 On 12/30/2010 11:57 AM, Jakub Jelinek wrote:

 Would be nice if LFS would be mandatory on the new ABI, thus
 off_t being 64bits.

 And avoid ambiguous cases that x86-64 ABI has, e.g. whether
 caller or callee is responsible for sign/zero extension of arguments, to
 avoid the need to sign/zero extend twice, etc.


 Ehwhat?  x86-64 is completely unambiguous on that point; the i386 one is
 not.
 
 It is not, sadly, see http://gcc.gnu.org/PR46942
 From what I can see the psABI doesn't talk about it, GCC usually sign/zero
 extends on both sides (exception is 32-bit arguments into 64-bit isn't
 apparently sign/zero extended on the caller side when doing tail calls),
 from what I gathered LLVM expects the caller to sign/zero extend (which is
 incompatible with GCC tail calls then), not sure about ICC, and kernel
 probably expects for security reasons that the callee sign/zero extends.
 

This is weird... we had long discussions about this when the psABI was
originally written, and the decision was that any bits outside the
fundamental type was undefined -- callee extends (caller in the case of
a return value.)  Yet somehow that (and several other discussions) seem
to either never have made it into the document or otherwise have
disappeared somewhere in the process.

There seem to have been problems with closing the loop on a number of
things, and in some cases the compiler writers have gone off and
implemented something completely different from the written document,
yet failed to get the documentation updated to match reality (it took
many years until the definition of _Bool matched what the compilers
actually implemented.)

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 10:59 AM, H.J. Lu wrote:
 
 (If you could arrange for the syscall ABI always to be the same as the
 existing 64-bit ABI, rather than needing to handle three different syscall
 ABIs in the kernel, that might be one solution, but it could have its own
 complexities in ensuring that none of the types whose layout forms part of
 the kernel/userspace interface have layout differing between n32 and the
 existing ABI; without any action, structures would tend to get layout
 similar to that of the existing 32-bit ABI, though quite possibly not the
 same depending on alignment peculiarities - I'm guessing that the new ABI
 will use natural alignment - while long long arguments would tend to be
 passed in a single register, resulting in the complicated hybrid syscall
 ABI present on MIPS.  If you do have an all-new syscall ABI rather than
 sharing the existing 64-bit one, I imagine it would need to follow the
 cut-down set of syscalls for new ports, so involving the issue of how to
 build glibc for that set of syscalls discussed three months ago in the
 Tilera context.)

 
 You are right.  Add ILP32 support to Linux kernel may be tricky.
 We did some experiment to use IA32 syscall interface for ILP32:
 

The current plan is to simply use the 32-bit kernel ABI more or less
unmodified, although probably with a different entry point using syscall
rather than int 0x80 for performance.  In order for the ABI to map 1:1,
there needs to be a few concessions:

a) 64-bit arguments will need to be split in user space.
b) The Linux kernel-exported __u64 type will need to be declared
   __attribute__((aligned(4))).  This will only affect a handful of
   structures in practice since implicit padding is frowned upon.

(a) could also be fixed by a different syscall dispatch table, it's not
the hard part of this.  We definitely want to avoid adding a different
memory ABI; that's the part that hurts.
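
The effect of (b) on structure layout can be seen with a small sketch.  It relies on the GCC/Clang rule that an aligned attribute on a typedef may lower alignment; the type and struct names are invented here and are not the kernel's.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: a 64-bit integer with i386-style 4-byte alignment. */
typedef uint64_t u64_align4_sketch __attribute__((aligned(4)));

struct natural_layout {
    uint32_t tag;
    uint64_t value;            /* 8-byte aligned on x86-64 */
};

struct i386_style_layout {
    uint32_t tag;
    u64_align4_sketch value;   /* packs directly after 'tag' */
};

/* Offset of 'value' under each rule. */
static size_t value_offset_natural(void)
{
    return offsetof(struct natural_layout, value);
}

static size_t value_offset_i386_style(void)
{
    return offsetof(struct i386_style_layout, value);
}
```

Only structures with this kind of implicit padding change shape, which is why the message expects just a handful to be affected.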

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 11:53 AM, Richard Guenther wrote:
 
 Would be nice if LFS would be mandatory on the new ABI, thus
 off_t being 64bits.
 

Yes, although that's a higher-order thing.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 11:34 AM, David Daney wrote:
 
 My suggestion:  Since people already spend a great deal of effort
 maintaining the existing i386 compatible Linux syscall infrastructure,
 make your new 32-bit x86-64 Linux syscall ABI identical to the existing
 i386 syscall ABI.  This means that the psABI must use the same size and
 alignment rules for in-memory structures as the i386 does.
 

No, it doesn't.  It just means it needs to do so *for the types used by
the kernel*.  The kernel uses types like __u64, which would indeed have
to be declared aligned(4).

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
I believe it covers all cases *relevant for this particular situation* (unlike, 
say, MIPS) and that any deviation is a bug which can and should be fixed.

David Daney dda...@caviumnetworks.com wrote:

On 12/30/2010 12:12 PM, H. Peter Anvin wrote:
 On 12/30/2010 11:34 AM, David Daney wrote:

 My suggestion:  Since people already spend a great deal of effort
 maintaining the existing i386 compatible Linux syscall infrastructure,
 make your new 32-bit x86-64 Linux syscall ABI identical to the existing
 i386 syscall ABI.  This means that the psABI must use the same size and
 alignment rules for in-memory structures as the i386 does.


 No, it doesn't.  It just means it needs to do so *for the types used by
 the kernel*.  The kernel uses types like __u64, which would indeed have
 to be declared aligned(4).


Some legacy interfaces don't use fixed width types.  There almost 
certainly are some ioctls that don't use your fancy __u64.

Then there are things like ppoll() that take a pointer to:

struct timespec {
        long    tv_sec;         /* seconds */
        long    tv_nsec;        /* nanoseconds */
};

There are no fields in there that are controlled by __u64 either. 
Admittedly this case might not differ between the two 32-bit ABIs, but 
it shows that __u64/__u32 are not universally used in the Linux syscall
ABIs.

If you are happy with potential memory layout differences between the 
two 32-bit ABIs, then don't specify that they are the same.  But don't 
claim that use of __u64/__u32 covers all cases.

David Daney

-- 
Sent from my mobile phone.  Please pardon any lack of formatting.


Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
We do have a slightly more extensive patch already implemented.

Robert Millan r...@gnu.org wrote:

Hi folks,

I had this unsubmitted patch in my local filesystem.  It makes Linux
detect ELF32 AMD64 binaries and sets a flag to restrict them to
32-bit address space.

It's not rocket science but can save you some work in case you
haven't implemented this already.

Best regards

-- 
Robert Millan

-- 
Sent from my mobile phone.  Please pardon any lack of formatting.


Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 12:39 PM, David Daney wrote:
 
 Really I don't care one way or the other.  The necessity of syscall
 wrappers is actually probably beneficial to me.  It will create a
 greater future employment demand for people with the necessary skills to
 write them.
 

Or perhaps automatic generation will actually get implemented.  I wrote
an automatic syscall wrapper generator for klibc; one of the best design
decisions I made for that project.
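
To illustrate what "automatic generation" can mean, here is a small sketch
that stamps out wrappers from a table using the preprocessor.  It is not how
klibc does it; the table format and the my_getpid/my_getppid names are
assumptions made for this example, and it assumes a Linux libc that provides
syscall().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for the syscall() declaration in glibc */
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* One row per wrapper: return type, wrapper name, syscall number.
   Only zero-argument calls, to keep the sketch short. */
#define SYSCALL_TABLE(X)              \
    X(pid_t, my_getpid,  SYS_getpid)  \
    X(pid_t, my_getppid, SYS_getppid)

/* Expands one table row into a wrapper function. */
#define DEFINE_WRAPPER(ret, name, nr) \
    static ret name(void) { return (ret)syscall(nr); }

SYSCALL_TABLE(DEFINE_WRAPPER)

/* Returns nonzero when the generated wrappers agree with the libc ones. */
static int wrappers_match_libc(void)
{
    return my_getpid() == getpid() && my_getppid() == getppid();
}
```

Adding a system call then means adding a row to the table rather than
writing another wrapper by hand.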

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 11:57 AM, Jakub Jelinek wrote:

 Would be nice if LFS would be mandatory on the new ABI, thus
 off_t being 64bits.
 
 And avoid ambiguous cases that x86-64 ABI has, e.g. whether
 caller or callee is responsible for sign/zero extension of arguments, to
 avoid the need to sign/zero extend twice, etc.
 

Ehwhat?  x86-64 is completely unambiguous on that point; the i386 one is
not.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 02:18 PM, Robert Millan wrote:
 2010/12/30 H.J. Lu hjl.to...@gmail.com:
 I also have a patch for gcc 4.4 which works on simple codes.

 H.J.
 On Thu, Dec 30, 2010 at 1:31 PM, H. Peter Anvin h...@zytor.com wrote:
 We do have a slightly more extensive patch already implemented.
 
 Could you make those patches available somewhere?  It'd be
 interesting to play with them.
 
 Btw, I recommend against 8-byte longs.  In the tests I did in
 2009, I recall glibc source was extremely unhappy due to
 sizeof(long)==sizeof(void *) assumptions.
 

Yes, it's ILP32.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 02:21 PM, Robert Millan wrote:
 2010/12/30 Richard Guenther richard.guent...@gmail.com:
 Would be nice if LFS would be mandatory on the new ABI, thus
 off_t being 64bits.
 
 Please do also consider time_t.
 

Changing the kernel-facing time_t might completely wreck the reuse of
the i386 kernel ABI; I'm not sure.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 01:08 PM, Robert Millan wrote:
 Hi folks,
 
 I had this unsubmitted patch in my local filesystem.  It makes Linux
 detect ELF32 AMD64 binaries and sets a flag to restrict them to
 32-bit address space.
 
 It's not rocket science but can save you some work in case you
 haven't implemented this already.
 

I have pushed my old kernel patches to a git tree at:

git://git.kernel.org//pub/scm/linux/kernel/git/hpa/linux-2.6-ilp32.git

They are currently based on 2.6.31 since that was the released version
when I first did this work; they are not intended to be mergeable but
rather as a prototype.

Note that we have no intention of supporting this ABI for the kernel
itself.  The kernel will be a normal x86-64 kernel.

-hpa



Re: ld -r on mixed IR/non-IR objects (

2010-12-08 Thread H. Peter Anvin
On 12/08/2010 01:19 AM, Andi Kleen wrote:

 Quite possibly a better way to deal with that is to provide a mechanism
 for encapsulating arbitrary binary code objects inside the LTO IR.
 
 Then you would need to teach your assembler and everything
 else that may generate ELF objects to generate this magic object. But why
 not just ELF directly? that is what it is after all.
 

No.  You just need to teach the linker to generate it when you're doing
a ld -r on mixed objects.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: ld -r on mixed IR/non-IR objects (

2010-12-08 Thread H. Peter Anvin
On 12/08/2010 01:19 AM, Andi Kleen wrote:
 
 To be honest I don't really see the point of all this complexity you
 guys are proposing just to save fat LTO. Fat LTO is always a bad idea
 because it's slow and  does lots of redundant work. If LTO is to become
 a more wide spread mode it has to go simply because of the poor
 performance.
 

As someone who encountered slim LTO on Unix 17 years ago (on MIPS) I can
promise you that unless fat LTO is supported, there will never be a
successful transition.  The amount of work to deal with the make
environment every time simply made it not worth it.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: ld -r on mixed IR/non-IR objects (

2010-12-07 Thread H. Peter Anvin
On 12/07/2010 04:20 PM, Andi Kleen wrote:
 
 The only problem left is mixing of lto and non lto objects. this right
 now is not handled. IMHO still the best way to handle it is to use
 slim lto and then simply separate link the left overs after deleting
 the LTO objects. This can be actually done with objcopy (with some
 limitations), doesn't even need linker support.
 

Quite possibly a better way to deal with that is to provide a mechanism
for encapsulating arbitrary binary code objects inside the LTO IR.

-hpa


Re: ld -r on mixed IR/non-IR objects (

2010-12-07 Thread H. Peter Anvin
On 12/07/2010 03:58 PM, Dave Korn wrote:
 On 07/12/2010 23:15, Cary Coutant wrote:
 
   ○ Object-only section:
   § Section name won't be generated by any tools, something like
 .objectonly\004.
   § Contains non-IR object file.
   § Input is discarded after link.

 Please -- use a special section type, not a magic name.
 
   We're still gonna have to use a magic name on non-ELF platforms.
 

Yes, but it probably should still be a special section type on ELF.

-hpa


Re: ld -r on mixed IR/non-IR objects (

2010-12-06 Thread H. Peter Anvin
On 12/06/2010 02:30 PM, H.J. Lu wrote:
 Hi,
 
 ld -r doesn't work with mixed IR/non-IR objects:
 
 http://www.sourceware.org/bugzilla/show_bug.cgi?id=12291
 
 Some compilers support it. Should it be supported?
 

As we discussed in person, I think it would be user friendly to support
it, otherwise you'll break any build which uses ld -r and includes
assembly objects.

-hpa


Re: Bug in x86-64 psABI or in gcc?

2009-12-09 Thread H. Peter Anvin
On 12/09/2009 06:44 AM, H.J. Lu wrote:
 
 Aren't bits in the _Bool byte ofbar specified by the psABI or the C
 language standard already?
 

The psABI, yes.  They are obviously not defined by the C language standard.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: Bug in x86-64 psABI or in gcc?

2009-12-09 Thread H. Peter Anvin
On 12/09/2009 06:56 AM, Michael Matz wrote:

 Aren't bits in the _Bool byte ofbar specified by the psABI
 
 Right now they are specified in the psABI, you suggested to remove that 
 specification.
 

The intent of H.J.'s proposal is to require bits 7:1 == 0 in all cases
(and higher bits as don't cares, the same way a char is passed), as
opposed to the current text which requires 63:1 == 0 when passed as
registers or on the stack (and 7:1 == 0 when stored in a memory
object.)  Furthermore, the current psABI text is inconsistent for
arguments and return values; this is a bug in the wordsmithing of the
text rather than intentional, if I remember the original discussions
correctly.
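
A small C model of the two readings, as a sketch only: a 64-bit integer
stands in for the argument register, and the function names are invented
for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Callee under the proposed rule: bits 7:1 are zero, bits 63:8 are
   don't-cares, so only the low byte may be examined. */
static bool bool_from_register(uint64_t reg)
{
    uint8_t low_byte = (uint8_t)reg;   /* discard the don't-care bits */
    assert(low_byte <= 1);             /* caller must keep bits 7:1 clear */
    return low_byte != 0;
}

/* Callee relying on the current text (bits 63:1 == 0): it may test the
   whole register, which misreads a value with garbage in the upper bits. */
static bool bool_from_register_strict(uint64_t reg)
{
    return reg != 0;
}
```

With 0xdeadbeef00000000 in the register, the first function reports false
and the second reports true, which is the mismatch between a caller that
treats _Bool like char and a callee that trusts the current text.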

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: Bug in x86-64 psABI or in gcc?

2009-12-08 Thread H. Peter Anvin
On 12/08/2009 10:24 AM, H.J. Lu wrote:
 
 We should just drop
 
 ---
 When a value of type _Bool is passed in a register or on the stack,
 the upper 63 bits of the eightbyte shall be zero.
 ---
 
 from psABI. Since _Bool has one byte in size with values of 0 and 1.
 Compilers have to clear upper 7 bits in one byte.
 

What about the Solaris compiler?  It's probably the only other
significant user of the x86-64 ABI (the Qlogic and LLVM compilers I
presume will follow gcc.)

-hpa



Re: Bug in x86-64 psABI or in gcc?

2009-12-07 Thread H. Peter Anvin
On 12/07/2009 10:33 AM, H.J. Lu wrote:
 Hi,
 
 
 x86-64 psABI says _Bool has 1 byte and aligned at 1 byte. It also says:
 
 ---
 When a value of type _Bool is passed in a register or on the stack,
 the upper 63 bits of the eightbyte shall be zero.
 ---
 
 However, gcc treats _Bool as char:
 
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42324
 
 Given that gcc never zeros the upper 63 bits in register nor
 on stack, should we update x86-64 psABI to reflect what gcc
 really does?
 

Keep in mind it's not just gcc but at least also icc, the Solaris
compiler, and the Qlogic compiler... possibly others.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: Bug in x86-64 psABI or in gcc?

2009-12-07 Thread H. Peter Anvin
On 12/07/2009 10:33 AM, H.J. Lu wrote:
 Hi,
 
 
 x86-64 psABI says _Bool has 1 byte and aligned at 1 byte. It also says:
 
 ---
 When a value of type _Bool is passed in a register or on the stack,
 the upper 63 bits of the eightbyte shall be zero.
 ---
 
 However, gcc treats _Bool as char:
 
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42324
 
 Given that gcc never zeros the upper 63 bits in register nor
 on stack, should we update x86-64 psABI to reflect what gcc
 really does?
 

Another thing to check is to see if there are failure scenarios where
gcc 3.4.6 expects a fully zeroed register (it produces them, I don't
know if it expects to consume them that way, too.)

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue

2009-11-25 Thread H. Peter Anvin
On 11/24/2009 09:30 AM, Steven Rostedt wrote:
 
 For other archs, Linus showed some examples:
 
 http://lkml.org/lkml/2009/11/19/349
 

Yes; the key here is that the ABI-defined entry state is readily
mappable onto the state on entry to the __fentry__ function.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue

2009-11-25 Thread H. Peter Anvin
On 11/25/2009 08:44 AM, Jakub Jelinek wrote:
 
 If you compile kernels 90%+ people out there run with -p on i?86/x86_64,
 then certainly coming up with a new gcc switch and new profiling ABI is
 desirable.  -p on i?86/x86_64 e.g. forces -fno-omit-frame-pointer, which
 makes code on these register starved arches significantly worse.
 Making GCC output profiling call before prologue instead of after prologue
 is a 4 liner in generic code and a few lines in target specific code.
 The important thing is that we shouldn't have 100 different profiling ABIs,
 so it is desirable to agree on something that will be generally useful not
 just for the kernel, but perhaps for other purposes.
 

There is really just one that makes sense, which is providing the
ABI-defined entry state, which means intercepting at the point of entry.

Anything else is/was a mistake.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue

2009-11-24 Thread H. Peter Anvin
On 11/24/2009 07:46 AM, Andrew Haley wrote:

 Yes, a lot.  The difference is that -maccumulate-outgoing-args allocates
 space for arguments of the callee with most arguments in the prologue, using
 subtraction from sp, then to pass arguments uses movl XXX, 4(%esp) etc.
 and the stack pointer doesn't usually change within the function (except for
 alloca/VLAs).
 With -mno-accumulate-outgoing-args args are pushed using push instructions
 and stack pointer is constantly changing.
 
 Alright.  So, it is possible in theory for gcc to generate code that
 only uses -maccumulate-outgoing-args when it needs to realign SP.
 And, therefore, we could have a nice option for the kernel: one with
 (mostly) good code density and never generates the bizarre code
 sequence in the prologue.
 

If we're changing gcc anyway, then let's add the option of intercepting
the function at the point where the machine state is well-defined by
ABI, which is before the function stack frame is set up.

-maccumulate-outgoing-args sounds like it would be painful on x86 (not
using its cheap push/pop instructions), but I guess since it's only when
tracing it's less of an issue.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue

2009-11-24 Thread H. Peter Anvin
On 11/24/2009 09:12 AM, Andrew Haley wrote:

 If we're changing gcc anyway, then let's add the option of intercepting
 the function at the point where the machine state is well-defined by
 ABI, which is before the function stack frame is set up.
 
 Hmm.  On the x86 I suppose we could just inject a naked call instruction,
 but not all arches allow us to call anything before we've saved the return
 address.  Or are you talking x86 only?
 

For x86, we should use a naked call.

For architectures where that is not possible, we should use a minimal
sequence such that the ABI state at the invocation point is 100% derivable.

On MIPS, for example, we could use a sequence such as:

mov at, ra
jal __fentry__

It would be up to __fentry__ to save the value in at and to restore it
back into ra before resuming, meaning that __fentry__ has a nonstandard
calling convention.

-hpa


Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue

2009-11-20 Thread H. Peter Anvin
On 11/20/2009 09:00 AM, Steven Rostedt wrote:
 Ingo, Thomas and Linus,
 
 I know Thomas did a patch to force the -mtune=generic, but just in case
 gcc decides to do something crazy again, this patch will catch it.
 
 Should we try to get this in now?
 

Sounds like a very good idea to me.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue

2009-11-20 Thread H. Peter Anvin
On 11/20/2009 11:46 AM, Steven Rostedt wrote:
 
 Yes a gcc test suite will help new instances of gcc. But we need to
 worry about the instances of gcc that people have on their desktops now.
 This test case will catch the discrepancy between gcc and the function
 graph tracer. I'm not 100% convinced that just adding -mtune=generic will
 help in all cases. If we miss another instance, then the function graph
 tracer may crash someone's kernel.
 

Furthermore, for future gcc instances what we really want is the early
interception support anyway.

-hpa


Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On 11/19/2009 07:37 AM, Thomas Gleixner wrote:
 
 modified function start on a handful of functions only seen with gcc
 4.4.x on x86 32 bit:
 
   push   %edi
   lea    0x8(%esp),%edi
   and    $0xfffffff0,%esp
   pushl  -0x4(%edi)
   push   %ebp
   mov    %esp,%ebp
   ...
   call   mcount
 

The real question is why we're aligning the stack in the kernel.  It is
probably not what we want -- we don't use SSE for anything but a handful
of special cases in the kernel, and we don't want the overhead.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On 11/19/2009 07:44 AM, Andrew Haley wrote:
 
 We're aligning the stack properly, as per the ABI requirements.  Can't
 you just fix the tracer?
 

Per the ABI requirements?  We're talking 32 bits, here.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On 11/19/2009 08:02 AM, Steven Rostedt wrote:
 On Thu, 2009-11-19 at 15:44 +, Andrew Haley wrote:
 Thomas Gleixner wrote:
 
 We're aligning the stack properly, as per the ABI requirements.  Can't
 you just fix the tracer?
 
 And how do we do that? The hooks that are in place have no idea of what
 happened before they were called?
 

Furthermore, it is nonsense -- ABI stack alignment on *32 bits* is 4
bytes, not 16.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On 11/19/2009 10:33 AM, Steven Rostedt wrote:
 
 It has to align the entire stack? Why not just the variable within the
 stack?
 

Because if the stack pointer isn't aligned, it won't know where it can
stuff the variable.  It has to pad *somewhere*, and since you may have
more than one such variable, the most efficient way -- and by far least
complex -- is for the compiler to align the stack when it sets up the
stack frame.
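
The arithmetic behind that one-time alignment can be sketched in C.  This is
an illustration only; align_down and padding_bytes are names chosen for this
example, and a 32-bit integer stands in for %esp.

```c
#include <assert.h>
#include <stdint.h>

/* Round a stack pointer value down to a power-of-two boundary, as an
   `and` with mask 0xfffffff0 does for a 16-byte boundary. */
static uint32_t align_down(uint32_t sp, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return sp & ~(alignment - 1);
}

/* Bytes of padding the alignment step leaves above the new stack pointer. */
static uint32_t padding_bytes(uint32_t sp, uint32_t alignment)
{
    return sp - align_down(sp, alignment);
}
```

Because the padding depends on the incoming stack pointer, it is only known
at run time; doing it once in the prologue lets every aligned variable sit
at a fixed offset from the aligned stack pointer.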

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On 11/19/2009 11:28 AM, Steven Rostedt wrote:
 
 Hehe, scratch register on i686 ;-)
 
 i686 has no extra regs. It just has:
 
 %eax, %ebx, %ecx, %edx - as the general purpose regs
 %esp - stack
 %ebp - frame pointer
 %edi, %esi - counter regs
 
 That's just 8 regs, and half of those are special.
 

For a modern ABI it is better described as:

%eax, %edx, %ecx    - argument/return/scratch registers
%ebx, %esi, %edi    - saved registers
%esp                - stack pointer
%ebp                - frame pointer (saved)

 Perhaps we could create another profiler? Instead of calling mcount,
 call a new function: __fentry__ or something. Have it activated with
 another switch. This could make the performance of the function tracer
 even better without all these exceptions.
 
   function:
   call __fentry__
   [...]
 

Calling the profiler immediately at the entry point is clearly the more
sane option.  It means the ABI is well-defined, stable, and independent
of what the actual function contents are.  It means that ABI isn't the
normal C ABI (the __fentry__ function would have to preserve all
registers), but that's fine...

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On i386, if we call __fentry__ immediately on entry the return address will be 
in 4(%esp), so I fail to see how you could not reliably have the return 
address.  Other arches would have different constraints, of course.
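
A toy model of the stack at that point, as a sketch: an array of 32-bit
slots stands in for the i386 stack, and every name here (toy_stack,
toy_call, toy_load) is invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16

/* Toy i386 stack: 32-bit slots, growing toward lower indices. */
struct toy_stack {
    uint32_t slot[STACK_WORDS];
    int esp;                        /* index of the top-of-stack slot */
};

/* Model of `call`: push the return address. */
static void toy_call(struct toy_stack *s, uint32_t return_address)
{
    assert(s->esp > 0);
    s->slot[--s->esp] = return_address;
}

/* Model of a load from N(%esp), for N a multiple of 4. */
static uint32_t toy_load(const struct toy_stack *s, int byte_offset)
{
    return s->slot[s->esp + byte_offset / 4];
}

/* The parent calls the function, and the function calls __fentry__ as its
   very first instruction.  Returns the word __fentry__ sees at 4(%esp). */
static uint32_t parent_return_seen_by_fentry(uint32_t ret_to_parent,
                                             uint32_t ret_to_function)
{
    struct toy_stack s = { .esp = STACK_WORDS };
    toy_call(&s, ret_to_parent);    /* parent:   call function   */
    toy_call(&s, ret_to_function);  /* function: call __fentry__ */
    return toy_load(&s, 4);
}
```

Nothing sits between the two pushed addresses because no prologue has run
yet, so the offset of the parent's return address does not depend on what
the function body does.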

Frederic Weisbecker fweis...@gmail.com wrote:

On Thu, Nov 19, 2009 at 03:05:41PM -0500, Steven Rostedt wrote:
 On Thu, 2009-11-19 at 20:46 +0100, Frederic Weisbecker wrote:
  On Thu, Nov 19, 2009 at 02:28:06PM -0500, Steven Rostedt wrote:
 
function:
call __fentry__
[...]
   

   -- Steve
  
  
  I would really like this. So that we can forget about other possible
  further surprises due to sophisticated function prologues being before
  the mcount call.
  
  And I guess that would fix it in every archs.
 
 Well, other archs use a register to store the return address. But it
 would also be easy to do (pseudo arch assembly):
 
  function:
  mov lr, (%sp)
  add 8, %sp
  blr __fentry__
  sub 8, %sp
  mov (%sp), lr
 
 
 That way the lr would have the current function, and the parent would
 still be at 8(%sp)
 


Yeah right, we need at least such very tiny prologue for
archs that store return addresses in a reg.

   
  
  That said, Linus had a good point about the fact there might be other uses
  of mcount even more tricky than what does the function graph tracer,
  outside the kernel, and those may depend on the strict ABI assumption
  that 4(ebp) is always the _real_ return address, and that through all
  the previous stack call. This is even a concern that extrapolates the
  single mcount case.
 
 As I am proposing a new call. This means that mcount stay as is for
 legacy reasons. Yes I know there exists the -finstrument-functions but
 that adds way too much bloat to the code. One single call to the
 profiler is all I want.


Sure, the purpose is not to change the existing -mcount thing.
What I meant is that we could have -mcount and -real-ra-before-fp
at the same time to guarantee fp + 4 is really what we want while
using -mcount.

The __fentry__ idea is more neat, but the guarantee of a real pointer
to the return address is still something that lacks.


  
  So I wonder that actually the real problem is the lack of something that
  could provide this guarantee. We may need a -real-ra-before-fp (yeah
  I suck in naming).
 
 Don't worry, so do the C compiler folks, I mean, come on mcount?


I guess it has been first created for the single purpose of counting
specific functions but then it has been used for wider, unpredicted uses :)


--
Sent from my mobile phone. Please excuse any lack of formatting.

Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
Hence a new unconstrained option...

Jeff Law l...@redhat.com wrote:

On 11/19/09 12:50, H. Peter Anvin wrote:

 Calling the profiler immediately at the entry point is clearly the more
 sane option.  It means the ABI is well-defined, stable, and independent
 of what the actual function contents are.  It means that ABI isn't the
 normal C ABI (the __fentry__ function would have to preserve all
 registers), but that's fine...

Note there are targets (even some old x86 variants) that required the 
profiling calls to occur after the prologue.  Unfortunately, nobody 
documented *why* that was the case.  Sigh.

Jeff

--
Sent from my mobile phone. Please excuse any lack of formatting.

Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On 11/19/2009 04:59 PM, Linus Torvalds wrote:
 
 [ Btw, looking at that, why are X86_L1_CACHE_BYTES and X86_L1_CACHE_SHIFT 
   totally unrelated numbers? Very confusing. ]
 

Yes, there is another thread to clean up that particular mess; it is
already in -tip:

http://git.kernel.org/tip/350f8f5631922c7848ec4b530c111cb8c2ff7caa

-hpa


Re: Storing 16bit values in upper part of 32bit registers

2009-11-05 Thread H. Peter Anvin
On 10/15/2009 08:56 AM, Richard Henderson wrote:
 On 10/15/2009 07:41 AM, Markus L wrote:
 However the IS is designed so that it is beneficial to store 16bit
 values in the high part of the registers (rNh) and also the calling
 conventions that we want to follow require 16bit values to be passed and
 returned in rNh.

 What would be the proper way make the compiler use the upper parts
 of these registers for the 16bit operands?
 
 This feature is going to be difficult, but not impossible, and unless 
 your ISA has some really odd features I won't vouch for the code quality.
 
 You say you want to canonically represent HImode in the high-part of the 
 register.  Additionally, you'll have to represent QImode in the 
 high-part (if not further in the high byte).
 
 You'll need to follow the mips port and define TRULY_NOOP_TRUNCATION and 
 the associated truncMN2 patterns.
 
 If you do all this, you won't have to do anything with FUNCTION_VALUE 
 etc at all.
 

Sorry for a *way* *late* reply to this, but wouldn't it also work to
model the register file as a set of 16-bit registers (since that's what
you really have -- individually addressable 16-bit registers) and
exclude SImode values from register pairs which are not aligned?  Then
one can simply prefer the high 16-bit registers to the low 16-bit
registers in the register priority sequence.
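
As a sketch of that idea in plain C rather than in a real GCC port: the
register numbering below (32-bit register rN made of 16-bit registers 2N and
2N+1, with 2N+1 as the high half) is an assumption for illustration, as are
the function names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical numbering: 32-bit register rN is the pair of 16-bit
   registers 2N (low half) and 2N+1 (high half, rNh). */
enum { NUM_REGS32 = 4, NUM_REGS16 = 2 * NUM_REGS32 };

/* A 32-bit (SImode) value may start only on the low half of a pair,
   so that it never straddles two different 32-bit registers. */
static bool simode_ok(int regno16)
{
    return regno16 % 2 == 0;
}

/* Fill `order` with an allocation preference list: all high halves
   first, then all low halves. */
static void fill_alloc_order(int order[NUM_REGS16])
{
    int n = 0;
    for (int r = 0; r < NUM_REGS32; r++)
        order[n++] = 2 * r + 1;     /* rNh */
    for (int r = 0; r < NUM_REGS32; r++)
        order[n++] = 2 * r;         /* low half */
    assert(n == NUM_REGS16);
}
```

In a port, the first function would correspond to the per-mode register
validity check and the second to the register allocation order.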

I'm assuming there is something wrong with this, but I'm kind of curious
as to what it would be.

-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

Jakub Jelinek wrote:

On Thu, Mar 06, 2008 at 09:44:05AM +0100, Andi Kleen wrote:

H. Peter Anvin [EMAIL PROTECTED] writes:


Richard Guenther wrote:

We didn't yet run into this issue and build openSUSE with 4.3 since more
than three months.


Well, how often do you take a trap inside an overlapping memmove()?

That was the state with older gcc, but with newer gcc it does not necessarily
reset the flag before the next function call.


If so, that's a much worse bug.


so e.g. if you have

memmove(...)
for (... very long loop  ) {
    /* no function calls */
    /* signals happen */
}

the signal could see the direction flag


memmove is supposed to (and does) do a cld insn after it finishes the
backward copying.


You can still take a signal inside memmove() itself, of course.

-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

H.J. Lu wrote:

I agree with it. There is no right or wrong here. Let's start from
scratch and figure out what is the best way to handle this, assuming we
are defining a new psABI.


No, I believe the right way to approach this is by applying the good 
old-fashioned principle from Ask Mr. Protocol:


Be liberal in what you receive, conservative in what you send

In other words:

a. Fix the kernel.  Already in progress.
b. Do *not* make gcc assume DF is clean for now.  Adding a
   switch would be a useful thing, since if nothing else it
   would benefit embedded environments.  We might assume
   DF is clean on 64 bits, since it appears it is rarely used
   anyway, and 64 bits is more important in the long run.
c. Once fixed kernels have been out long enough, we can
   flip the default of the switch, one platform at a time if
   need be (e.g. there may never be another SCO OpenServer.)
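
Until fixed kernels are widespread, user code can also be conservative on
its own.  The following is a minimal sketch, assuming GCC or Clang inline
assembly on x86; the handler, buffer and function names are invented for
this example, and the byte loop merely stands in for bzero() or memset().

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t handler_ran;
static volatile unsigned char scratch[32];

/* Clear DF explicitly instead of trusting the state the kernel delivered. */
static void clear_direction_flag(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile ("cld" ::: "cc");
#endif
}

static void on_signal(int signo)
{
    (void)signo;
    clear_direction_flag();         /* before any string operation */
    for (size_t i = 0; i < sizeof scratch; i++)
        scratch[i] = 0;
    handler_ran = 1;
}

/* Dirty the buffer, deliver one signal synchronously, and report whether
   the handler ran and zeroed the buffer. */
static int run_demo(void)
{
    for (size_t i = 0; i < sizeof scratch; i++)
        scratch[i] = 0xff;
    if (signal(SIGINT, on_signal) == SIG_ERR)
        return 0;
    raise(SIGINT);
    signal(SIGINT, SIG_DFL);
    return handler_ran == 1 && scratch[0] == 0 && scratch[31] == 0;
}
```

This is the manual equivalent of what a compiler switch for point (b) would
emit automatically at the start of any code that uses string instructions.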

-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

H.J. Lu wrote:

On Thu, Mar 6, 2008 at 8:23 AM, Jakub Jelinek [EMAIL PROTECTED] wrote:

On Thu, Mar 06, 2008 at 07:50:12AM -0800, H. Peter Anvin wrote:
  H.J. Lu wrote:
  I agree with it. There is no right or wrong here. Let's start from
  scratch and figure out what is the best way to handle this, assuming
  we are defining a new psABI.

 BTW, just tested icc and icc doesn't generate cld either (so it matches the
 new gcc behavior).
 char buf1[32], buf2[32];
 void bar (void);
 void foo (void)
 {
  __builtin_memset (buf1, 0, 32);
  bar ();
  __builtin_memset (buf2, 0, 32);
 }



Icc follows the psABI. If we are saying icc/gcc 4.3 need a fix, we'd
better define a new psABI first.



Not a fix, an (optional) workaround for a system bug.

-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

Robert Dewar wrote:

H.J. Lu wrote:

So that is the bug in the Linux kernel. Since fixing kernel is much easier
than providing a workaround in compilers, I think kernel should be fixed
and no need for icc/gcc fix.


Fixing a bug in the Linux kernel is not much easier. You are taking
a purely engineering viewpoint, but life is not like that. There are
lots of copies of Linux kernels around and in use. The issue is not
fixing the kernel per se, it is propagating that change to all
Linux kernels in use -- THAT'S another matter entirely, and is
far far more difficult than making sure that a kernel fix is
qualified and widely propagated.



Not really, it's just a matter of time.  Typical distro cycles are on 
the order of 3 years.


-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

H.J. Lu wrote:


 Not a fix, an (optional) workaround for a system bug.


So that is the bug in the Linux kernel. Since fixing kernel is much easier
than providing a workaround in compilers, I think kernel should be fixed
and no need for icc/gcc fix.



The problem is, you're going to have to be able to produce binaries 
compatible with old kernels for a *long* time to come.  Are you 
honestly saying you'll tell those people to use gcc 4.2 or earlier?  If 
so, I think most distros will have to freeze gcc for the next several years.


-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

Robert Dewar wrote:


Sounds good, but has almost nothing to do with the real world. I
remember back in Realia COBOL days, we had to carefully copy IBM
bugs in the IBM mainframe COBOL compiler. Doing things right and
fixing the bug would have been the right thing to do, but no one
would have used Realia COBOL :-)

Another story, the sad story of the intel chip (I think it was
the 80188) where Intel made use of Int 5, which was documented
as reserved. Unfortunately, Microsoft/IBM had used this for
print screen or some such. Intel was absolutely right that
their documentation was clear and it was wrong to have used
these interrupts .. but the result was a warehouse of unused
chips.


IBM used it for print screen (and other calls), because Microsoft 
cassette BASIC used all the non-reserved INT instructions as byte codes 
(they cut it down to *only* half the interrupt vectors in the disk version.)


We're still stuck with the consequences of that hack.

-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

Robert Dewar wrote:


Not really, it's just a matter of time.  Typical distro cycles are on 
the order of 3 years.


-hpa


again, in the real world, there are MANY projects that are nothing
like this interactive when it comes to moving to new versions of
operating systems.


This is true, but beyond a certain point projects generally accept that 
they have to monitor their toolchain dependencies.


-hpa


Re: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-05 Thread H. Peter Anvin

Aurelien Jarno wrote:

Hi all,

Since version 4.3, gcc changed its behaviour concerning the x86/x86-64 
ABI and the direction flag, that is it now assumes that the direction 
flag is cleared at the entry of a function and it doesn't clear once 
more if needed.


This causes some problems with the Linux kernel which does not clear
the direction flag when entering a signal handler. The small code below
(for x86-64) demonstrates that. 


If the signal handler is using code that needs the direction flag cleared
(for example bzero() or memset()), the code is incorrectly executed.

I guess this has to be fixed on the kernel side, but also gcc-4.3 could
revert back to the old behaviour, that is clearing the direction flag
when entering a routine that touches it until most people are running a
fixed kernel.



Linux should definitely follow the ABI.  This is a bug, and a pretty 
serious such.


-hpa

