Re: D on lm32-CPU: string argument on stack instead of register
On Tuesday, 4 August 2020 at 17:36:53 UTC, Michael Reese wrote:
> Thanks for suggesting! I tried, and the union works as well, i.e. the function arguments are passed in registers. But I noticed another thing about all workarounds so far: even if calls are inlined and arguments end up on the stack, the linker puts the code of the wrapper function in my final binary even if it is never explicitly called. So until I find a way to strip uncalled functions from the binary (not sure the linker can do it), the workarounds don't solve the size problem. But they still make the code run faster.

Try -ffunction-sections -Wl,--gc-sections. That should remove all unreferenced functions: -ffunction-sections writes every function into a separate section, and --gc-sections makes the linker remove all unreferenced sections.
Re: D on lm32-CPU: string argument on stack instead of register
On Saturday, 1 August 2020 at 23:08:38 UTC, Chad Joan wrote:
> Though if the compiler is allowed to split a single uint64_t into two registers, I would expect it to split struct/string into two registers as well. At least, the manual doesn't seem to explicitly mention higher-level constructs like structs. It does suggest a one-to-one relationship between arguments and registers (up to a point), but GCC seems to have decided otherwise for certain uint64_t's. (Looking at Table 3...) It even gives you two registers for a return value: enough for a string or an array. And if the backend/ABI weren't up for it, it would be theoretically possible to have the frontend lower strings (dynamic arrays) and small structs into their components before function calls and then also insert code on the other side to cast them back into their original form. I'm not sure if anyone would want to write it, though. o.O

Right, I think at some point one should fix the backend. C programs would also benefit from it when passing structs as arguments. However, in C it is more common to just pass pointers, and they go into registers. I guess this is why I never noticed before that struct passing is needlessly expensive.

> Getting from pointer-length to string might be pretty easy:
>
>     string foo = ptr[0 .. len];

Ah cool! I did know about array slicing, but wasn't aware that it works on pointers, too.

> It ended up being a little more complicated than I thought it would be. Hope I didn't ruin the fun. ;)
> https://pastebin.com/y6e9mxre

Thanks :) I'll have to look into that more closely. This is the kind of stuff that I hope to make use of in the future on the embedded CPU. But for now I cannot use it because I don't have phobos and druntime in my toolchain right now... just naked D.

> Also, that part where you mentioned a 64-bit integer being passed as a pair of registers made me start to wonder if unions could be (ab)used to juke the ABI:
> https://pastebin.com/eGfZN0SL

Thanks for suggesting! I tried, and the union works as well, i.e. the function arguments are passed in registers. But I noticed another thing about all workarounds so far: even if calls are inlined and arguments end up on the stack, the linker puts the code of the wrapper function in my final binary even if it is never explicitly called. So until I find a way to strip uncalled functions from the binary (not sure the linker can do it), the workarounds don't solve the size problem. But they still make the code run faster.
Re: D on lm32-CPU: string argument on stack instead of register
On Saturday, 1 August 2020 at 08:58:03 UTC, Michael Reese wrote:
> [...]
> So the compiler knows how to use r1 and r2 for arguments. I checked again in the lm32 manual (https://www.latticesemi.com/view_document?document_id=52077), and it says:
>
> "As illustrated in Table 3 on page 8, the first eight function arguments are passed in registers. Any remaining arguments are passed on the stack, as illustrated in Figure 12."
>
> So strings and structs should be passed on the stack and this seems to be more an issue of the gcc lm32 backend than a D issue.

Nice find! Though if the compiler is allowed to split a single uint64_t into two registers, I would expect it to split struct/string into two registers as well. At least, the manual doesn't seem to explicitly mention higher-level constructs like structs. It does suggest a one-to-one relationship between arguments and registers (up to a point), but GCC seems to have decided otherwise for certain uint64_t's. (Looking at Table 3...) It even gives you two registers for a return value: enough for a string or an array. And if the backend/ABI weren't up for it, it would be theoretically possible to have the frontend lower strings (dynamic arrays) and small structs into their components before function calls and then also insert code on the other side to cast them back into their original form. I'm not sure if anyone would want to write it, though. o.O

> But I just found a workaround using a wrapper function.
>
>     void write_to_host(in string msg) {
>         write_to_hostC(msg.ptr, msg.length);
>     }
>
> I checked the assembly code on the caller side, and the call write_to_host("Hello, World!\n") is inlined. There is only one call to write_to_hostC. This is still not nice, but I think I can live with that for now. Now I have to figure out how to make the cast back from a pointer-length pair into a string. I'm sure I read that somewhere before, but forgot it and was unable to find it now on a quick google search...

That's pretty clever. I like it.

Getting from pointer-length to string might be pretty easy:

    string foo = ptr[0 .. len];

D allows pointers to be indexed, like in C. But unlike C, D has slices, and pointers can be "sliced". The result of a slice operation, at least for primitive arrays and pointers, is always an array (the "dynamic" ptr+length kind). Hope that helps.

> And since this is D: is there maybe some CTFE magic that allows to create these wrappers automatically? Something like
>
>     fix_stack!write_to_host("Hello, World!\n");

It ended up being a little more complicated than I thought it would be. Hope I didn't ruin the fun. ;)
https://pastebin.com/y6e9mxre

Also, that part where you mentioned a 64-bit integer being passed as a pair of registers made me start to wonder if unions could be (ab)used to juke the ABI:
https://pastebin.com/eGfZN0SL

>> Good luck with your lm32/FPGA coding. That sounds like cool stuff!
>
> I'm doing this mainly to improve my understanding of how embedded processors work, and how to write linker scripts for a given environment. Although I do have actual hardware where I can see if everything runs in the real world, I mainly use simulations. The coolest thing in my opinion is that nowadays it can be done using only open source tools (mainly ghdl and verilator; the lm32 source code is open source, too). The complete system is simulated and you can look at every logic signal in the cpu or in the ram while the program executes.

Thanks for the insights; I've done just a little hobby electrical stuff here-and-there, and having some frame of reference for tool and component choice makes me feel good, even if I don't plan on buying any lm32s or FPGAs anytime soon :) Maybe I can Google some of that later and geek out at images of other people's debugging sessions or something. I'm curious how they manage the complexity that happens when circuits and massive swarms of logic gates do their, uh, complexity thing. o.O
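
The pointer slicing described in this post can be sketched as a small self-contained program. The helper name fromPtrLen is made up for illustration and does not come from the thread or the pastebins; the sketch only assumes that ptr really points at len valid characters, which the compiler cannot check.

```d
// Rebuild a D slice from a C-style pointer/length pair.
// fromPtrLen is a hypothetical name chosen for this sketch.
string fromPtrLen(immutable(char)* ptr, size_t len)
{
    // Slicing a pointer yields a dynamic array (ptr + length).
    // No bounds check is possible here: the caller must guarantee
    // that ptr points at len valid characters.
    return ptr[0 .. len];
}

void main()
{
    string original = "Hello, World!\n";

    // Split the string into its two components, then put it back together.
    string rebuilt = fromPtrLen(original.ptr, original.length);

    assert(rebuilt == original);          // same contents
    assert(rebuilt.ptr is original.ptr);  // same memory, nothing was copied
}
```

The slice expression itself needs no library support, so it should also be usable in a setup without phobos and druntime, although that was not tried here.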
Re: D on lm32-CPU: string argument on stack instead of register
On Friday, 31 July 2020 at 15:13:29 UTC, Chad Joan wrote:
> THAT SAID, I think there are things to try and I hope we can get you what you want. If you're willing to entertain more experimentation, here are my thoughts:

Thanks a lot for the suggestions and explanations. I tried them all, but the only time I got different assembly code was your suggestion (3), which produced the following.

> (3) Try a different type of while-loop in the D-style version:
>
>     // idiomatic D version
>     void write_to_host(in string msg) {
>         // a fixed address to get bytes to the host via usb
>         char *usb_slave = cast(char*)BaseAdr.ft232_slave;
>         size_t i = 0;
>         while(i < msg.length) {
>             *usb_slave = msg[i++];
>         }
>     }

    _D10firmware_d13write_to_hostFAyaZv:
        addi sp, sp, -8
        addi r4, r0, 4096
        sw (sp+8), r2
        sw (sp+4), r1
        addi r2, r0, 0
    .L3:
        or r5, r2, r0
        be r2,r1,.L1
        lw r3, (sp+8)
        addi r2, r2, 1
        add r3, r3, r5
        lbu r3, (r3+0)
        sb (r4+0), r3
        bi .L3
    .L1:
        addi sp, sp, 8
        b ra

> At any rate, I don't think your code is larger or less efficient due to utf-8 decoding, because I don't see the utf-8 decoding.

Agreed, I think there's no code for autodecoding being generated.

I did some more experiments: I tried putting pointer and length into a struct and passing that to the function. Same result, the argument ended up on the stack. Then I wrote the function in C and compiled it with the C compiler of lm32-elf-gcc. It also puts the 64-bit POD structure on the stack. It seems the only arguments passed in registers are primitive data types. However, if I pass a uint64_t argument, it is passed in registers r1 and r2. So the compiler knows how to use r1 and r2 for arguments. I checked again in the lm32 manual (https://www.latticesemi.com/view_document?document_id=52077), and it says:

"As illustrated in Table 3 on page 8, the first eight function arguments are passed in registers. Any remaining arguments are passed on the stack, as illustrated in Figure 12."

So strings and structs should be passed on the stack and this seems to be more an issue of the gcc lm32 backend than a D issue.

But I just found a workaround using a wrapper function.

    void write_to_host(in string msg) {
        write_to_hostC(msg.ptr, msg.length);
    }

I checked the assembly code on the caller side, and the call write_to_host("Hello, World!\n") is inlined. There is only one call to write_to_hostC. This is still not nice, but I think I can live with that for now. Now I have to figure out how to make the cast back from a pointer-length pair into a string. I'm sure I read that somewhere before, but forgot it and was unable to find it now on a quick google search...

And since this is D: is there maybe some CTFE magic that allows to create these wrappers automatically? Something like

    fix_stack!write_to_host("Hello, World!\n");

> Good luck with your lm32/FPGA coding. That sounds like cool stuff!

I'm doing this mainly to improve my understanding of how embedded processors work, and how to write linker scripts for a given environment. Although I do have actual hardware where I can see if everything runs in the real world, I mainly use simulations. The coolest thing in my opinion is that nowadays it can be done using only open source tools (mainly ghdl and verilator; the lm32 source code is open source, too). The complete system is simulated and you can look at every logic signal in the cpu or in the ram while the program executes.
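
One possible shape for such a generated wrapper is sketched below. It is not the code from the pastebins in this thread, and it goes in the simpler direction: it takes an existing pointer/length function and presents it with a slice parameter. The names fixStack and sumBytesC are hypothetical; sumBytesC stands in for write_to_hostC and sums the bytes only so that the result can be checked on a host machine.

```d
// Stand-in for the thread's write_to_hostC: takes pointer and length as
// two scalar arguments. It sums the bytes so the result is checkable.
uint sumBytesC(const(char)* ptr, size_t len)
{
    uint sum = 0;
    foreach (i; 0 .. len)
        sum += ptr[i];
    return sum;
}

// Hypothetical generic wrapper: accepts a slice and forwards its two
// components to any function fn that is callable as fn(ptr, length).
auto fixStack(alias fn)(const(char)[] msg)
{
    return fn(msg.ptr, msg.length);
}

void main()
{
    // 'A' is 65 and 'B' is 66, so the sum is 131.
    assert(fixStack!sumBytesC("AB") == 131);
}
```

As with the hand-written wrapper, this only helps if the compiler inlines fixStack at the call site; the wrapper itself still receives a slice, so a non-inlined call would presumably hit the same stack-passing behaviour.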
Re: D on lm32-CPU: string argument on stack instead of register
On Friday, 31 July 2020 at 10:22:20 UTC, Michael Reese wrote:
> My question: Is there a way I can tell the D compiler to use registers instead of stack for string arguments, or any other trick to reduce code size while maintaining an idiomatic D code style?

A D string is a "slice", which is a struct (pointer + length). Depending on the function call ABI, structs are passed in registers or on the stack. On x86, the D calling convention is to put small POD structs in registers, similar to the C++ calling convention.

I don't know whether GDC has attributes to change the calling convention of functions (besides extern(C/C++/Windows/etc.)), but that's where you'd need to look. Otherwise, file a bug with GDC. Slice arguments are common enough to require enregistering them in the D calling convention on your lm32 platform.

-Johan
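
To make the "slice is a struct" point concrete, here is a minimal sketch that checks the size of a string and reinterprets it as a two-field struct. SliceLayout is a hypothetical name introduced for this example. The field order (length first, then pointer) is an assumption based on the D ABI description of dynamic arrays, and the reinterpreting cast is for illustration only, not something to rely on in real code.

```d
// Hypothetical mirror of the slice layout: length first, then pointer.
struct SliceLayout
{
    size_t length;
    const(char)* ptr;
}

// A slice occupies exactly two machine words, like the struct above.
static assert(string.sizeof == 2 * size_t.sizeof);
static assert(string.sizeof == SliceLayout.sizeof);

void main()
{
    string s = "Hello";

    // Reinterpret the slice as the two-field struct (unsafe, illustration only).
    SliceLayout raw = *cast(SliceLayout*) &s;

    assert(raw.length == s.length);
    assert(raw.ptr is s.ptr);
}
```

On a 32-bit target such as lm32 this is an 8-byte aggregate, the same size as the uint64_t mentioned elsewhere in the thread.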
Re: D on lm32-CPU: string argument on stack instead of register
On Friday, 31 July 2020 at 10:22:20 UTC, Michael Reese wrote:
> Hi all,
>
> at work we put embedded lm32 soft-core CPUs in FPGAs and write the firmware in C. At home I enjoy writing small projects in D from time to time, but I don't consider myself a D expert.
>
> Now, I'm trying to run some toy examples in D on the lm32 cpu. I'm using a recent gcc-elf-lm32. I succeeded in compiling and running some code and it works fine.
>
> But I noticed that when calling a function with a string argument, the string is not stored in registers, but on the stack. Consider a simple function (below) that writes bytes to a peripheral (that forwards the data to the host computer via USB). I have two versions, an idiomatic D one, and another version where pointer and length are two distinct function parameters. I also show the generated assembly code. The string version is 4 instructions longer, just because of the stack manipulation. In addition, it is also slower because it needs to access the RAM, and it needs more stack space.
>
> My question: Is there a way I can tell the D compiler to use registers instead of stack for string arguments, or any other trick to reduce code size while maintaining an idiomatic D code style?
>
> Best regards
> Michael
>
>     // idiomatic D version
>     void write_to_host(in string msg) {
>         // a fixed address to get bytes to the host via usb
>         char *usb_slave = cast(char*)BaseAdr.ft232_slave;
>         foreach(ch; msg) {
>             *usb_slave = ch;
>         }
>     }
>
>     // resulting assembly code (compiled with -Os) 12 instructions
>     _D10firmware_d13write_to_hostFxAyaZv:
>         addi sp, sp, -8
>         addi r3, r0, 4096
>         sw (sp+4), r1
>         sw (sp+8), r2
>         add r1, r2, r1
>     .L3:
>         be r2,r1,.L1
>         lbu r4, (r2+0)
>         addi r2, r2, 1
>         sb (r3+0), r4
>         bi .L3
>     .L1:
>         addi sp, sp, 8
>         b ra
>
>     // C-like version
>     void write_to_hostC(const char *msg, int len) {
>         char *ptr = cast(char*)msg;
>         char *usb_slave = cast(char*)BaseAdr.ft232_slave;
>         while (len--) {
>             *usb_slave = *ptr++;
>         }
>     }
>
>     // resulting assembly code (compiled with -Os) 8 instructions
>     _D10firmware_d14write_to_hostCFxPaiZv:
>         add r2, r1, r2
>         addi r3, r0, 4096
>     .L7:
>         be r1,r2,.L5
>         lbu r4, (r1+0)
>         addi r1, r1, 1
>         sb (r3+0), r4
>         bi .L7
>     .L5:
>         b ra

Hi Michael!

Last time I checked, D doesn't have any specific type attributes or special ways to force variables to enregister. But I could be poorly informed. Maybe there are GDC-specific hints or something. I hope that if anyone else knows better, they will toss in an answer.

THAT SAID, I think there are things to try and I hope we can get you what you want. If you're willing to entertain more experimentation, here are my thoughts:

---

(1) Try writing "in string" as "in const(char)[]" instead:

    // idiomatic D version
    void write_to_host(in const(char)[] msg) {
        // a fixed address to get bytes to the host via usb
        char *usb_slave = cast(char*)BaseAdr.ft232_slave;
        foreach(ch; msg) {
            *usb_slave = ch;
        }
    }

Explanation: The "string" type is an alias for "immutable(char)[]". In D, "immutable" is a stronger guarantee than "const". The "const" modifier, like in C, tells the compiler that this function shall not modify the data referenced by this pointer/array/whatever.
The "immutable" modifier is a bit different, as it says that NO ONE will modify the data referenced by this pointer/array/whatever, including other functions that may or may not be concurrently executing alongside the one you're in. So "const" constrains the callee, while "immutable" constrains both the callee AND the caller. This makes it more useful for some multithreaded code, because if you can accept the potential inefficiency of needing to do more copying of data (if you can't modify, usually you must copy instead), then you can have more deterministic behavior and sometimes even much better total efficiency by way of parallelization.

This might not be a guarantee you care about though, at which point you can just toss it out completely and see if the compiler generates better code now that it sees the same type qualifier as in the other example.

I'd actually be surprised if using "immutable" causes /less/ efficient code in this case, because it should be even /safer/ to use the argument as-is. But it IS a difference between the two examples, and one that might not be benefiting your cause (though that's totally up to you).

---

(2) Try keeping the string argument, but make the function more closely identical in semantics:

    // idiomatic D version
    void write_to_host(string msg) {
        // a fixed address to get bytes to the host via usb
        char *usb_