On Wednesday, 17 December 2014 at 14:11:32 UTC, Foo wrote:
asm {
mov EAX, n;
mov [arr + 8], ESP;
sub [ESP], EAX;
mov [arr + 0], EAX;
}
but that does not work...
That wouldn't work even with malloc.... remember, an integer
more than one byte long, so your subtract is 1/4 the size it
needs to be! Also, since the stack grows downward, you're
storing the pointer to the end of the array instead of the
beginning of it.
NOTE: I've never actually done this before, so I'm figuring it
out as I go too. This might be buggy or otherwise mistaken at
points. (Personally, I prefer to use a static array sized to
the max thing I'll probably need that I slice instead of
alloca...)
Here's some code that runs successfully (in 32 bit!):
void vla(int n) {
int[] arr;
asm {
mov EAX, [n];
// the first word in an array is the length,
store that
mov [arr], EAX;
shl EAX, 2; // number of bytes == n * int.sizeof
sub ESP, EAX; // allocate the bytes
mov [arr + size_t.sizeof], ESP; // store the beginning of it
in the arr.ptr
}
import std.stdio;
writeln(arr.length);
writeln(arr.ptr);
// initialize the data...
foreach(i, ref a; arr)
a = i;
writeln(arr); // and print it back out
}
void main() {
vla(8);
}
This looks right.... but isn't, we changed the stack and didn't
put it back. That's usually a no-no. If we disassemble the
function, we can take a look at the end and see something scary:
8084ec6: e8 9d 6a 00 00 call 808b968
<_D3std5stdio15__T7writelnTAiZ7writelnFAiZv> // our final
writeln call
8084ecb: 5e pop esi // uh oh
8084ecc: 5b pop ebx
8084ecd: c9 leave
8084ece: c3 ret
Before the call to leave, which puts the stack back how it was
at the beginning of the function - which saves us from a random
EIP being restored upon the ret instruction - the compiler put
in a few pop instructions.
main() will have different values in esi and ebx than it
expects! Running it in the debugger shows these values changed
too:
before
(gdb) info registers
[...]
ebx 0xffffd4f4 -11020
[...]
esi 0x80916e8 134813416
after
ebx 0x1 1
esi 0x0 0
It popped the values of our array. According to the ABI: "EBX,
ESI, EDI, EBP must be preserved across function calls."
http://dlang.org/abi.html
They are pushed for a reason - the compiler assumes they remain
the same.
In this little test program, nothing went wrong because no more
code was run after vla returned. But, if we were using, say a
struct, it'd probably fault when it tried to access `this`.
It'd probably mess up other local variables too. No good!
So, we'll need to store and restore the stack pointer... can we
use the stack's push and pop instructions? Nope, we're changing
the stack! Our own pop would grab the wrong data too.
We could save it in a local variable. How do we restore it
though? scope(exit) won't work, it won't happen at the right
time and will corrupt the stack even worse.
Gotta do it ourselves - which means we can't do the alloca even
as a single mixin, since it needs code added before any return
point too!
(There might be other, better ways to do this... and indeed,
there is, as we'll see later on. I peeked at the druntime
source code and it does it differently. Continue reading...)
Here's code that we can verify in the debugger leaves
everything how it should be and doesn't crash:
void vla(int n) {
int[] arr;
void* saved_esp;
asm {
mov EAX, [n];
mov [arr], EAX;
shl EAX, 2; // number of bytes == n * int.sizeof
// NEW LINE
mov [saved_esp], ESP; // save it for later
sub ESP, EAX;
mov [arr + size_t.sizeof], ESP;
}
import std.stdio;
writeln(arr.length);
writeln(arr.ptr);
foreach(i, ref a; arr)
a = i;
writeln(arr);
// NEW LINE
asm { mov ESP, [saved_esp]; } // restore it before we return
}
Note that this still isn't quite right - the allocated size
should be aligned too. It works for the simple case of 8 ints
since that's coincidentally aligned, but if we were doing like
3 bytes, it would mess things up. Gotta be rounded up to a
multiple of 4 or 16 on some systems.
hmm, I'm looking at the alloca source and there's a touch of a
guard page on Windows too. Check out the file:
dmd2/src/druntime/src/rt/alloca.d, it is written in mostly
inline asm.
Note the comment though:
* This is a 'magic' function that needs help from the compiler
to
* work right, do not change its name, do not call it from
other compilers.
So, how does this compare with alloca? Let's make a really
simple example to compare and contrast with malloc to make the
asm more readable:
import core.stdc.stdlib;
void vla(int n) {
int[] arr;
arr = (cast(int*)alloca(n * int.sizeof))[0 .. n];
}
Program runs, let's see the code.
0805f3f0 <_D3vla3vlaFiZv>:
805f3f0: 55 push ebp
805f3f1: 8b ec mov ebp,esp
805f3f3: 83 ec 10 sub esp,0x10
805f3f6: c7 45 f0 10 00 00 00 mov DWORD PTR
[ebp-0x10],0x10
805f3fd: 89 45 fc mov DWORD PTR
[ebp-0x4],eax
805f400: c7 45 f4 00 00 00 00 mov DWORD PTR
[ebp-0xc],0x0
805f407: c7 45 f8 00 00 00 00 mov DWORD PTR
[ebp-0x8],0x0
805f40e: 8b 45 fc mov eax,DWORD PTR
[ebp-0x4]
805f411: 50 push eax
805f412: c1 e0 02 shl eax,0x2
805f415: 50 push eax
805f416: 8d 4d f0 lea ecx,[ebp-0x10]
805f419: e8 e2 01 00 00 call 805f600
<__alloca>
805f41e: 89 c1 mov ecx,eax
805f420: 83 c4 04 add esp,0x4
805f423: 58 pop eax
805f424: 89 45 f4 mov DWORD PTR
[ebp-0xc],eax
805f427: 89 4d f8 mov DWORD PTR
[ebp-0x8],ecx
805f42a: c9 leave
805f42b: c3 ret
Change alloca to malloc:
0805f3f0 <_D3vla3vlaFiZv>:
805f3f0: 55 push ebp
805f3f1: 8b ec mov ebp,esp
805f3f3: 83 ec 0c sub esp,0xc
805f3f6: 89 45 fc mov DWORD PTR
[ebp-0x4],eax
805f3f9: c7 45 f4 00 00 00 00 mov DWORD PTR
[ebp-0xc],0x0
805f400: c7 45 f8 00 00 00 00 mov DWORD PTR
[ebp-0x8],0x0
805f407: 8b 45 fc mov eax,DWORD PTR
[ebp-0x4]
805f40a: 50 push eax
805f40b: c1 e0 02 shl eax,0x2
805f40e: 50 push eax
805f40f: e8 0c fc ff ff call 805f020
<malloc@plt>
805f414: 89 c1 mov ecx,eax
805f416: 83 c4 04 add esp,0x4
805f419: 58 pop eax
805f41a: 89 45 f4 mov DWORD PTR
[ebp-0xc],eax
805f41d: 89 4d f8 mov DWORD PTR
[ebp-0x8],ecx
805f420: c9 leave
805f421: c3 ret
Differences?
We can see on line 3 that there's an extra word allocated for a
local variable with alloca. It is loaded with the size of the
local variables - 0x10. A pointer to that is passed to alloca.
If we go back to the druntime source code:
* ECX address of variable with # of bytes in locals
* This is adjusted upon return to reflect the
additional
* size of the stack frame.
It is used in that function:
// Copy down to [ESP] the temps on the stack.
// The number of temps is (EBP - ESP - locals).
// snip
sub ECX,[EDX] ; // ECX = number of temps (bytes) to
move.
add [EDX],ESI ; // adjust locals by nbytes
for next call to alloca()
// snip
rep ;
movsd ;
So, instead of restoring the stack pointer upon function return
like I did, this copies the relevant data that was pushed onto
the stack to the new location, so a subsequent pop will find
what it expects, then it adjusts the hidden local size variable
so next time, it can repeat the process. Cool - that's
something my solution wouldn't have done super easily (it
totally could, just don't overwrite that variable once it is
initialized).
I guess there is a better way than I had figured above :)
We can use that same trick the compiler did by declaring a
local variable and moving the magic __LOCAL_SIZE (see:
http://dlang.org/iasm.html ) value into it up front, then
calling alloca exactly as the C does. The implementation can be
the same as from druntime too.
That's why it is a magic function: it needs to put the stack
how it expects, somehow. My way was to add a store. The way
actually used in druntime is to store the size of the locals in
a hidden variable. Either way, if you do an iasm alloca
yourself, you'll have to account for it as well.
Otherwise, remember to store the right pointer and allocate the
right number of bytes and you've got it.