Re: Preprocessor for assembler macros?
> gcc -S tmp.S for some reason prints to stdout, so gcc -S tmp.S > tmp.s > is what you need Thank you very much, I'll take a look. Regards, Phil -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)!
Preprocessor for assembler macros?
Hello everybody, I already asked that on gcc-help@ but got no answer, so I'm trying again here. I'm looking for a way to get inbetween the assembler macro processor and the assembler. I'd like to get the assembler sources mostly as-is, but with the macros used therein already expanded. I've already taken a look at the "-a" command line option, but this does assembling as well - so I'd have to filter the real assembler from there, and try this way. Is there something easier? Thank you for all answers. Regards, Phil -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)!
[ANN]: Redundancy remover
Hello everybody, the idea I presented last year [1], and which I said in January that I thought how to realize [2], has come true. I'd like to show you a tool that removes a bit of redundancy off your binaries, without needing to change the sources, by identifying repeated code blocks, and substituting them in the assembler listing by jumps to a common block. This is certainly not finished; depending on the AI that's put in here it could even write completely new functions, that combine small differences in the callers by using parameters. I have a prototype, which can prove the principle. I'm just inserting the README here; before more people join in hacking that, I'd like to ask whether I should open a new project for this at some OSS portal, or whether that should live in the GCC tree. Depending on that answer I'd put the code somewhere, and post the link here; I'd like to avoid having people sending me patches, because they're certain to conflict sooner or later, and I'm already a bit short on time anyway. Regards, Phil 1: http://gcc.gnu.org/ml/gcc/2008-03/msg00410.html 2: http://marc.info/?l=gcc&m=123272618027709&w=2 -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)! -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)!Redundancy Remover == (C)opyright 2009 by Ph. Marek, philipp <%40> marek.priv.at Released under the GPLv3. What is this? = RR is a set of scripts to find duplicated code in binary files (like executables and shared objects), and to eliminate them by translating the assembler output of gcc before it's being converted to object code. And this works? === Yes, it does. See test results below. I chose to test FSVS because - it's my pet project, and so I know exactly how to compile it; - it has extensive tests, which I can use to verify the translation; - it's of a representative size. A bigger (partly done) test is with libgcj9; see below, too. How does this work? === Please see the POD documentation in the rr-translate perl script. Note about the savings == The values given by the script are just predictions for perfect translations, which won't be really seen. FSVS results The compilation took a bit more than twice as long; without translation 14.8 sec to 35.2 sec. The prediction was 2424 bytes, the real size difference 2128 bytes (0.8%): size: textdata bss dec hex filename 2598247960 11136 278920 44188 fsvs 2576967960 11136 276792 43938 fsvs.translated Linux kernel results I ran this on my custom linux kernel: text data bss dec hex filename 4651138 2149868 6551208 13352214 cbbd16 /usr/src/linux/vmlinux and got a prediction of about 37kB: max savings: 159513 real savings approximate 37917 bytes. Writing files to /tmp. 645 files written. In real life a bit less is to be expected; but 32kB should be possible. libgcj9 results === As I didn't have the sources available (at this moment), I only looked at the analyse results. I took /usr/lib/libgcj.so.90 from debian: libgcj9-0 4.3.3-3 Java runtime library for use with gcj -rw-r--r-- 1 root root 46903400 31. Jän 05:23 /usr/lib/libgcj.so.90 which, I believe, is one of the best cases, because it includes a high number of functions, where tail packing (;-) can be done. Analysing takes about 500MB RAM, and runs on a 800MHz machine for real 1m36.873s user 1m36.426s sys 0m3.636s The (unbelieveable) output is (surely wrong; but a straight compile of this library would give real hard numbers, so I stopped my bug hunting) max savings: 736313 real savings approximate 1561073 bytes. As an example of an output dump: # Redundancy Remover generated file. # This code part was originally seen at 0x016a9750 # got 129 hits in /usr/lib/libgcj.so.90, and was 158 bytes long. .globl RRGh7K3nv6Ocjw48SB5SACg .type RRGh7K3nv6Ocjw48SB5SACg, @function mov $0x4,%edi callq _Jv_ThrowBadArrayIndex nopw 0x0(%rax,%rax,1) mov $0x5,%edi callq _Jv_ThrowBadArrayIndex nopw 0x0(%rax,%rax,1) mov $0x6,%edi callq _Jv_ThrowBadArrayIndex nopw 0x0(%rax,%rax,1) mov $0x7,%edi callq _Jv_ThrowBadArrayIndex nopw 0x0(%rax,%rax,1) mov $0x8,%edi callq _Jv_ThrowBadArrayIndex nopw 0x0(%rax,%rax,1) mov $0x9,%edi callq _Jv_ThrowBadArrayIndex nopw 0x0(%rax,%rax,1) mov $0xa,%edi callq _Jv_ThrowBadArrayIndex nopw 0x0(%rax,%rax,1) mov $0xb,%edi callq _Jv_ThrowBadArrayIndex mov $0xc,%edi callq _Jv_ThrowBadArrayIndex mov $0xd,%edi callq _Jv_ThrowBadArrayIndex xchg %ax,%ax sub $0x8,%rsp callq _ZN4java4util18ListResourceBundleC1Ev add $0x8,%rsp retq
Re: RFC: Idea for code size reduction
Hello everybody, On Friday 07 March 2008 Philipp Marek wrote: > Here you are. > > > code_overlap.pl - disassembles a binary, and outputs a list > (address, name, instruction, bytes) to STDOUT. > > bytes_saved.pl - takes such a list, and tries to estimate > the amount of bytes that could be saved. > > > I normally use them this way: > #!/bin/sh > perl code_overlap.pl "$@" > overlap.txt > perl bytes_saved.pl has anyone acted on that already? I've got some more ideas, which I'd like to test (and implement) here. Regards, Phil -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)!
Re: Idea for code size reduction
On Friday 07 March 2008 Ian Lance Taylor wrote: > "Philipp Marek" <[EMAIL PROTECTED]> writes: > >> Shouldn't this be done in the linker instead? > > > > Well, can the linker change the instruction sequences? Ie. put a JMP > > instead of other code? > > Sure. The linker can do whatever it likes. The usual problem is that > by the time the linker sees the code, it no longer knows about PC > relative references within the same section. So if you have > > call foo > ... > sequence to remove > ... > foo: > > and the call is PC-relative, then when the linker removes the sequence > of instructions it will not know that it needs to adjust the call to > foo. Right. > What you are describing is a type of linker relaxation, and the same > issues arise there. The usual solution is to force the assembler to > emit all PC-relative relocations, so that the linker knows about them > and is able to adjust them. But what we'd need above is to have the compiler make a reference to [removed_sequence + offset] instead, and that could be done by the linker again. So when the compiler removes a piece of code by a "jump ", it can use offsets in there as usual. > Linker relaxation is used routinely in special sections, such as the > .eh_frame section used to hold stack unwind information for throwing > exceptions. It is also used for code sequences on some processors, > but normally not on the i386/x86_64. Thank you very much for this information! Another idea: if we find out that such sequences are *very* common (eg. by comparing many different binaries and libraries), libc (or a libcommon or something like that) could have a set of these "abbreviations" - eg. indexed by symbol, which is defined as MD5_hex(removed section). -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)!
Re: RFC: Idea for code size reduction
BTW - It gets much better: $ ls -la /usr/lib/libgcj.so.90.0.0 -rw-r--r-- 1 root root 32844984 3. Feb 16:03 /usr/lib/libgcj.so.90.0.0 $ ./run /usr/lib/libgcj.so.90.0.0 Approx. 470045 bytes saved. That's 1.4% :-) If my script isn't buggy, that is -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)!
Re: RFC: Idea for code size reduction
Hello Jakub! >> When wouldn't that possible? My script currently splits on an >> instruction-level -- although I would see no problem that some branch >> jumps into a "half" opcode of another branch, if the byte sequence >> matches. > > Consider: > : >0: b8 a4 00 00 00 mov$0xa4,%eax >5: ba fc 04 00 00 mov$0x4fc,%edx >a: f7 e2 mul%edx >c: 05 d2 04 00 00 add$0x4d2,%eax > 11: c3 ret > ... > > 00020012 : >20012: 39 d2 cmp%edx,%edx >20014: 75 07 jne2001d >20016: ba fc 04 00 00 mov$0x4fc,%edx >2001b: f7 e2 mul%edx >2001d: 05 d2 04 00 00 add$0x4d2,%eax >20022: c3 ret That's not an example of jumping into the *middle* of an instruction. I'd think about something like (please ignore the wrong addresses, just copy/paste): 5: ba fc 04 00 00 mov$0x4fc,%edx ---> a: f7 e2 mul%edx c: 05 d2 04 00 00 add$0x4d2,%eax 11: c3 ret ... 20016: ba 00 ba f7 e2 mov$0xe2f7ba00,%edx 2001b: f7 e2 mul%edx 2001d: 05 d2 04 00 00 add$0x4d2,%eax 20022: c3 ret Now we could place a jump at 0x4, which goes to 0x20018 - and should work, shouldn't it? > If you merge the mov/mul/add/ret sequences by replacing the foo tail > sequence with jmp bar+5, then the jne will branch to wrong place, or > if you try to adjust it, it is too far to reach the target. Yes. But see below. >> > but even jmp argument is relative, not absolute. >> That's why I take only jumps with 32bit arguments - these are absolute. > > No, they are relative. > > : >0: e9 05 00 02 00 jmp2000a >5: e9 00 00 02 00 jmp2000a > ... > > 0002000a : >2000a: 90 nop > > See how they are encoded. Yes, ok. But how does the compiler encode jumps? Via symbols, and the correct destination address gets put there by the linker. If there are (sizeof(void*)) operand bytes, there's enough space for *any* jump. (Is there some way for the linker to compress small jumps later?) > BTW, have you tried to compile the whole kernel with --combine, or at > least > e.g. each kernel directory with --combine? I guess that will give you > bigger savings than 30K. Also, stop defining inline to inline > __attribute__((always_inline)), I think Ingo also added such patch > recently > and it saved 120K. No ... but these are independent changes. I'm currently compiling with --combine. Regards, Phil -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)!
Re: RFC: Idea for code size reduction
Hello Jakub! > You need to be very careful with it, as if there are any jumps > into the middle of the to be abstracted tail sequences, you can't > abstract them or would need to adjust also the jumps into them (if > possible, which not always is). When wouldn't that possible? My script currently splits on an instruction-level -- although I would see no problem that some branch jumps into a "half" opcode of another branch, if the byte sequence matches. As I use the disassembly to start from they *have* to converge later - else I wouldn't have the same end point! So I think it's just a matter of defining another symbol at that address, and use that to jump into. > Also the tail sequence shouldn't contain any PC relative > references (say you can't merge this was > movl (.+0xabcd)(%rip), %eax > ret > which is byte identical, yet would reference different variables), That's right ... I didn't think of that. The script could "forget" the current byte sequences if it encounters %rip relative adressing. Note: I only know a bit about i686 and amd64 - for other processors that would have to be different, of course. > but even jmp argument is relative, not absolute. That's why I take only jumps with 32bit arguments - these are absolute. > You can't do this kind of optimization in the linker, as that's too > late, many relocations are already relocated and lost during assembly > and you need to understand the instructions anyway. > Doing it in GCC would be terribly costly, if that would mean > expanding everything into RTL, running all RTL passes and instead of > emitting them, remember the pre-final RTL sequence for each emitted > function, then do an IPA pass over all the sequences. > Perhaps doing that in assembler, combined with --combine so that > assembler can see everything together... Yes ... I think doing a second compile pass might be the easiest way, and not much slower than other solutions. (We could always remember which object files could be optimized, and only recompile *those*. After all, it's just an additional optimization.) Regards, Phil -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)!
RE: Idea for code size reduction
Hello Dave! > One achitectural problem here is that GCC doesn't emit bytes. It emits > ASCII text, in the form of assembly instructions, and it's > not always easy to predict how they'll look by the time they've been > through the assembler and then had relocs applied by the > linker. (Indeed, to know what particular bytes you'd end up with you'd > need to do everything the assembler and linker does, which > is why we have an assembler and linker - the stuff GCC emits isn't > directly executable...) I know that. So maybe my current way (compile -> find duplicates -> define duplicates -> reuse them in fresh compile) is the easiest way to do that. > Shouldn't this be done in the linker instead? Well, can the linker change the instruction sequences? Ie. put a JMP instead of other code? Regards, Phil -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)!
Re: RFC: Idea for code size reduction
Hello Michael! > Can I test your script in my embedded system? > Can you send to me? Here you are. code_overlap.pl - disassembles a binary, and outputs a list (address, name, instruction, bytes) to STDOUT. bytes_saved.pl - takes such a list, and tries to estimate the amount of bytes that could be saved. I normally use them this way: #!/bin/sh perl code_overlap.pl "$@" > overlap.txt perl bytes_saved.pl Which just outputs the number of bytes. Hope that helps ... please let my know whether that works for you. Regards, Phil -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)! bytes_saved.pl Description: Perl program code_overlap.pl Description: Perl program
Re: RFC: Idea for code size reduction
Hello Richard! [ I took linux-tiny out - it's moderated, and I don't want to spam them. ] > Sounds like what -frtl-abstract-sequences is trying to do. Yes, thank you. I didn't know that; that should be close. *But*: I think it doesn't work. $ size vmlinux-as vmlinux-Os textdata bss dec hex filename 4797637 446554 606208 5850399 59451f vmlinux-as 4802166 446554 606208 5854928 5956d0 vmlinux-Os If I don't err there's only 4500 bytes difference ... which is *a lot* different to my 30kB. That doesn't mean that my scripts are correct - they might simply be buggy ... but I verified some by hand. Well, see the other message for the scripts. Thank you for your answer! Regards, Phil -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)!
RFC: Idea for code size reduction
Hello everybody, I have a feature request. I'd like to (manually) define some byte blocks, eg. as functions with an identifier. Then, if GCC would emit exactly these bytes, it puts a JMP identifier there instead. This would help by sharing many identical (text/code) byte sequences, to make a smaller image - like -fmerge-constants does for data - and therefor helps for cache locality, too. I wrote some perl scripts to test this. I took a "alldefconfig" i686 kernel, let objdump disassemble it, and on "iret", "ret", "ljmp" or "jmp" with a 4 byte address I store the last few bytes. Another script goes through these block-ends, and estimates the number of bytes saved, if identical sequences get changed into a 5 byte opcode (jump with 32bit address). The original kernel has $ size vmlinux textdata bss dec hex filename 4802166 446554 606208 5854928 5956d0 vmlinux and my scripts emit space savings of about 30kB. Now that's not that much, but for embedded systems it means another userspace binary more. [ If anyone's interested, I can post my scripts here ... they're not that large. ] And, of course, the same can be used for other binaries too - which could save other space as well. (blender-bin, text size 8664025, gives 97KB savings) I'd imagine the use that such: - Built the binary - Run my scripts, to find some common code sequences - Give them to gcc - Rebuild, using the optimizations I don't know whether some new optimize switch could be used, to do all that internally in the compiler and linker. What do you think about that? [ Please keep at least *me* cc'ed - don't know whether that's that interesting for linux-tiny. ] Regards, Phil -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)!
Dynamic macro expansion through a pipe?
Hello everybody, I'm looking for a nice solution. One of my programs has a lot of strings in it; about 20% of binary size, or something like that. Now most of them are rarely used - error messages like "Cannot write to file", "No space left while allocating memory", etc. I tried to compress just the help texts in the binary - and saved 11k without any other modification. Now I'd like to wrap that in some magic macro, so that it looks like that: char *text=C("text"); Then there could be #if USE_COMPRESSED_TEXT #define C(x) ... #else #define C(x) x #endif but that's where I'm stuck. It's easy to write a function that takes a pointer to some compressed space and returns a pointer to the uncompressed data (re-entry is no problem). I don't know how to get the transformation while compiling. - I could try #define C(x) uncompress(buffer##__LINE__) and write some script that looks for the C() calls and generates a C file with the buffer constants - which means parsing (some of) C. - Or these texts get put in some special segment, that gets compressed into another before linking, and the original segment gets removed. - Or, what I'd like best: Is there some way to tell GCC "whenever this macro is used, call program X, and substitute the macro value with the output"? That would be the easiest way for me. Now you might say that it isn't that interesting - just use a compressed filesystem, or whatever. But these are all different tradeoffs; I'd just like to reduce the storage (and typical memory) footprint of my application by storing rarely used data compressed. I'd imagine that other projects would use that too, if it's easily available - and then the space needed for a small installation would decrease by some non-trivial amount. And, not at least, allowing dynamic macro expansion would allow some neat hacks :-) Please keep me CC:ed; thank you for all ideas, tips and other feedback. Regards, Phil -- Versioning your /etc, /home or even your whole installation? Try fsvs (fsvs.tigris.org)!