Re: VLA in Assembler

Foo via Digitalmars-d-learn Wed, 17 Dec 2014 08:25:45 -0800

On Wednesday, 17 December 2014 at 16:10:40 UTC, Adam D. Ruppe wrote:

On Wednesday, 17 December 2014 at 14:11:32 UTC, Foo wrote:
        asm {
                mov EAX, n;
                mov [arr + 8], ESP;
                sub [ESP], EAX;
                mov [arr + 0], EAX;
        }
but that does not work...
That wouldn't work even with malloc.... remember, an integer is more than one byte long, so your subtract is 1/4 the size it needs to be! Also, since the stack grows downward, you're storing the pointer to the end of the array instead of the beginning of it.
NOTE: I've never actually done this before, so I'm figuring it out as I go too. This might be buggy or otherwise mistaken at points. (Personally, I prefer to use a static array sized to the max thing I'll probably need that I slice instead of alloca...)
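For comparison, the static-array alternative mentioned in the note could look roughly like this. It is only a sketch: `maxElems` and `sumFirst` are made-up names, and the bound of 64 is an assumption you would replace with whatever maximum fits your use.

```d
enum maxElems = 64; // hypothetical upper bound, chosen only for illustration

// Fills the first n slots of a fixed-size stack buffer and returns their sum.
int sumFirst(size_t n)
{
    assert(n <= maxElems, "n exceeds the fixed buffer");

    int[maxElems] buffer;        // fixed-size array, lives in this stack frame
    int[] arr = buffer[0 .. n];  // slice of it: no allocation, just length + pointer

    foreach (i, ref a; arr)
        a = cast(int) i;         // 0, 1, 2, ...

    int total = 0;
    foreach (a; arr)
        total += a;
    return total;                // return a value, never the slice of a local buffer
}
```

No stack pointer games are needed here, at the price of always reserving `maxElems` ints in the frame. The rest of this post goes back to doing it by hand in inline asm.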
Here's some code that runs successfully (in 32 bit!):

void vla(int n) {
        int[] arr;

        asm {
                mov EAX, [n];
                // the first word in an array is the length, store that
                mov [arr], EAX;
                shl EAX, 2; // number of bytes == n * int.sizeof
                sub ESP, EAX; // allocate the bytes
                // store the beginning of it in the arr.ptr
                mov [arr + size_t.sizeof], ESP;
        }

        import std.stdio;
        writeln(arr.length);
        writeln(arr.ptr);

        // initialize the data...
        foreach(i, ref a; arr)
                a = i;

        writeln(arr); // and print it back out
}

void main() {
        vla(8);
}
This looks right.... but isn't, we changed the stack and didn't put it back. That's usually a no-no. If we disassemble the function, we can take a look at the end and see something scary:
 8084ec6:       e8 9d 6a 00 00          call   808b968 <_D3std5stdio15__T7writelnTAiZ7writelnFAiZv>  // our final writeln call
 8084ecb:       5e                      pop    esi  // uh oh
 8084ecc:       5b                      pop    ebx
 8084ecd:       c9                      leave
 8084ece:       c3                      ret
Before the call to leave, which puts the stack back how it was at the beginning of the function - which saves us from a random EIP being restored upon the ret instruction - the compiler put in a few pop instructions.
main() will have different values in esi and ebx than it expects! Running it in the debugger shows these values changed too:
before

(gdb) info registers
[...]
ebx            0xffffd4f4       -11020
[...]
esi            0x80916e8        134813416


after

ebx            0x1      1
esi            0x0      0
It popped the values of our array. According to the ABI: "EBX, ESI, EDI, EBP must be preserved across function calls." http://dlang.org/abi.html
They are pushed for a reason - the compiler assumes they remain the same.
In this little test program, nothing went wrong because no more code was run after vla returned. But, if we were using, say a struct, it'd probably fault when it tried to access `this`. It'd probably mess up other local variables too. No good!
So, we'll need to store and restore the stack pointer... can we use the stack's push and pop instructions? Nope, we're changing the stack! Our own pop would grab the wrong data too.
We could save it in a local variable. How do we restore it though? scope(exit) won't work, it won't happen at the right time and will corrupt the stack even worse.
Gotta do it ourselves - which means we can't do the alloca even as a single mixin, since it needs code added before any return point too!
(There might be other, better ways to do this... and indeed, there is, as we'll see later on. I peeked at the druntime source code and it does it differently. Continue reading...)
Here's code that we can verify in the debugger leaves everything how it should be and doesn't crash:
void vla(int n) {
        int[] arr;
        void* saved_esp;

        asm {
                mov EAX, [n];
                mov [arr], EAX;
                shl EAX, 2; // number of bytes == n * int.sizeof

                // NEW LINE
                mov [saved_esp], ESP; // save it for later

                sub ESP, EAX;
                mov [arr + size_t.sizeof], ESP;
        }

        import std.stdio;
        writeln(arr.length);
        writeln(arr.ptr);

        foreach(i, ref a; arr)
                a = i;

        writeln(arr);

        // NEW LINE
        asm { mov ESP, [saved_esp]; } // restore it before we return
}
Note that this still isn't quite right - the allocated size should be aligned too. It works for the simple case of 8 ints since that's coincidentally aligned, but if we were doing like 3 bytes, it would mess things up. Gotta be rounded up to a multiple of 4 or 16 on some systems.
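The rounding itself is simple arithmetic. A minimal sketch, assuming the alignment is a power of two (`alignUp` is a hypothetical helper name, not something from druntime):

```d
// Rounds nbytes up to the next multiple of alignment.
// Assumption: alignment is a nonzero power of two (4, 16, ...).
size_t alignUp(size_t nbytes, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0,
           "alignment must be a power of two");
    // Adding (alignment - 1) pushes any partial block over the next boundary;
    // masking off the low bits then snaps the result down onto that boundary.
    return (nbytes + alignment - 1) & ~(alignment - 1);
}
```

With this, a 3 byte request becomes `alignUp(3, 4) == 4`, or 16 with a 16-byte alignment, and that rounded value is what would be subtracted from ESP instead of the raw byte count.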
hmm, I'm looking at the alloca source and there's a touch of a guard page on Windows too. Check out the file: dmd2/src/druntime/src/rt/alloca.d, it is written in mostly inline asm.
Note the comment though:
 * This is a 'magic' function that needs help from the compiler to
 * work right, do not change its name, do not call it from other compilers.
So, how does this compare with alloca? Let's make a really simple example to compare and contrast with malloc to make the asm more readable:
import core.stdc.stdlib;

void vla(int n) {
        int[] arr;
        arr = (cast(int*)alloca(n * int.sizeof))[0 .. n];
}


Program runs, let's see the code.

0805f3f0 <_D3vla3vlaFiZv>:
 805f3f0:       55                      push   ebp
 805f3f1:       8b ec                   mov    ebp,esp
 805f3f3:       83 ec 10                sub    esp,0x10
 805f3f6:       c7 45 f0 10 00 00 00    mov    DWORD PTR [ebp-0x10],0x10
 805f3fd:       89 45 fc                mov    DWORD PTR [ebp-0x4],eax
 805f400:       c7 45 f4 00 00 00 00    mov    DWORD PTR [ebp-0xc],0x0
 805f407:       c7 45 f8 00 00 00 00    mov    DWORD PTR [ebp-0x8],0x0
 805f40e:       8b 45 fc                mov    eax,DWORD PTR [ebp-0x4]
 805f411:       50                      push   eax
 805f412:       c1 e0 02                shl    eax,0x2
 805f415:       50                      push   eax
 805f416:       8d 4d f0                lea    ecx,[ebp-0x10]
 805f419:       e8 e2 01 00 00          call   805f600 <__alloca>
 805f41e:       89 c1                   mov    ecx,eax
 805f420:       83 c4 04                add    esp,0x4
 805f423:       58                      pop    eax
 805f424:       89 45 f4                mov    DWORD PTR [ebp-0xc],eax
 805f427:       89 4d f8                mov    DWORD PTR [ebp-0x8],ecx
 805f42a:       c9                      leave
 805f42b:       c3                      ret


Change alloca to malloc:

0805f3f0 <_D3vla3vlaFiZv>:
 805f3f0:       55                      push   ebp
 805f3f1:       8b ec                   mov    ebp,esp
 805f3f3:       83 ec 0c                sub    esp,0xc
 805f3f6:       89 45 fc                mov    DWORD PTR [ebp-0x4],eax
 805f3f9:       c7 45 f4 00 00 00 00    mov    DWORD PTR [ebp-0xc],0x0
 805f400:       c7 45 f8 00 00 00 00    mov    DWORD PTR [ebp-0x8],0x0
 805f407:       8b 45 fc                mov    eax,DWORD PTR [ebp-0x4]
 805f40a:       50                      push   eax
 805f40b:       c1 e0 02                shl    eax,0x2
 805f40e:       50                      push   eax
 805f40f:       e8 0c fc ff ff          call   805f020 <malloc@plt>
 805f414:       89 c1                   mov    ecx,eax
 805f416:       83 c4 04                add    esp,0x4
 805f419:       58                      pop    eax
 805f41a:       89 45 f4                mov    DWORD PTR [ebp-0xc],eax
 805f41d:       89 4d f8                mov    DWORD PTR [ebp-0x8],ecx
 805f420:       c9                      leave
 805f421:       c3                      ret


Differences?
We can see on line 3 that there's an extra word allocated for a local variable with alloca. It is loaded with the size of the local variables - 0x10. A pointer to that is passed to alloca.
If we go back to the druntime source code:

 *      ECX     address of variable with # of bytes in locals
 *              This is adjusted upon return to reflect the additional
 *              size of the stack frame.


It is used in that function:
        // Copy down to [ESP] the temps on the stack.
        // The number of temps is (EBP - ESP - locals).
 // snip
        sub     ECX,[EDX]       ; // ECX = number of temps (bytes) to move.
        add     [EDX],ESI       ; // adjust locals by nbytes for next call to alloca()
 // snip
        rep                     ;
        movsd                   ;
So, instead of restoring the stack pointer upon function return like I did, this copies the relevant data that was pushed onto the stack to the new location, so a subsequent pop will find what it expects, then it adjusts the hidden local size variable so next time, it can repeat the process. Cool - that's something my solution wouldn't have done super easily (it totally could, just don't overwrite that variable once it is initialized).
I guess there is a better way than I had figured above :)
We can use that same trick the compiler did by declaring a local variable and moving the magic __LOCAL_SIZE (see: http://dlang.org/iasm.html ) value into it up front, then calling alloca exactly as the C does. The implementation can be the same as from druntime too.
That's why it is a magic function: it needs to put the stack how it expects, somehow. My way was to add a store. The way actually used in druntime is to store the size of the locals in a hidden variable. Either way, if you do an iasm alloca yourself, you'll have to account for it as well.
Otherwise, remember to store the right pointer and allocate the right number of bytes and you've got it.
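To double check those two points without touching the stack, here is a small sketch in plain D. `sliceWords` and `bytesFor` are hypothetical helper names; the layout it inspects (length word first, pointer second) is the one the asm above relies on.

```d
// Returns the two machine words that make up a D slice, in memory order.
size_t[2] sliceWords(int[] arr)
{
    // Reinterpret the slice variable itself (not the data it points to).
    auto words = cast(size_t*) &arr;

    size_t[2] result;
    result[0] = words[0]; // word at offset 0: the length
    result[1] = words[1]; // word at offset size_t.sizeof: the pointer
    return result;
}

// Number of bytes to reserve for n ints (the `shl EAX, 2` step when int is 4 bytes).
size_t bytesFor(size_t n)
{
    return n * int.sizeof;
}
```

For an `int[4] storage`, `sliceWords(storage[])` should give 4 in the first word and the address of `storage` in the second, and `bytesFor(8)` gives the 32 bytes used in the vla(8) example.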


That is an awesome explanation! :)
Thank you for your time, I will experiment with your code.
