Re: Preprocessor for assembler macros?

2009-03-09 Thread Philipp Marek

> gcc -S tmp.S for some reason prints to stdout, so gcc -S tmp.S > tmp.s
> is what you need
Thank you very much, I'll take a look.


Regards,

Phil

-- 
Versioning your /etc, /home or even your whole installation?
 Try fsvs (fsvs.tigris.org)!


Preprocessor for assembler macros?

2009-03-08 Thread Philipp Marek
Hello everybody,

I already asked that on gcc-help@ but got no answer, so I'm trying again here.


I'm looking for a way to get in between the assembler macro processor and the
assembler.
I'd like to get the assembler sources mostly as-is, but with the macros used 
therein already expanded.

I've already taken a look at the "-a" command line option, but this does the
assembling as well - so I'd have to filter the actual assembler source back out
of the listing and go that way.
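The listing route would look roughly like this - an untested sketch, and I may
be misremembering the exact "-a" sub-options; the listing columns (line
numbers, addresses, opcode bytes) would still have to be stripped afterwards:

  # 'l' = assembly listing, 'm' = include macro expansions;
  # discard the object code, keep only the listing
  as -alm=expanded.lst -o /dev/null input.s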

Is there something easier?


Thank you for all answers.


Regards,

Phil

-- 
Versioning your /etc, /home or even your whole installation?
 Try fsvs (fsvs.tigris.org)!


[ANN]: Redundancy remover

2009-02-27 Thread Philipp Marek

Hello everybody,

the idea I presented last year [1], which I said in January I had figured out
how to realize [2], has come true.


I'd like to show you a tool that removes a bit of redundancy from your
binaries, without needing to change the sources, by identifying repeated code
blocks and substituting them in the assembler listing with jumps to a common
block.

This is certainly not finished; depending on the amount of AI that's put in
here it could even write completely new functions that absorb small
differences between the callers by using parameters.
I have a prototype which proves the principle.


I'm just inserting the README here; before more people join in hacking on
this, I'd like to ask whether I should open a new project for it at some OSS
portal, or whether it should live in the GCC tree.

Depending on that answer I'd put the code somewhere and post the link here;
I'd like to avoid having people send me patches, because they're certain to
conflict sooner or later, and I'm already a bit short on time anyway.


Regards,

Phil

1: http://gcc.gnu.org/ml/gcc/2008-03/msg00410.html
2: http://marc.info/?l=gcc&m=123272618027709&w=2

-- 
Versioning your /etc, /home or even your whole installation?
 Try fsvs (fsvs.tigris.org)!


Redundancy Remover
==================

(C)opyright 2009 by Ph. Marek, philipp <%40> marek.priv.at

Released under the GPLv3.


What is this?
=============

RR is a set of scripts to find duplicated code in binary files (like
executables and shared objects), and to eliminate it by translating the
assembler output of gcc before it is converted to object code.
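
A minimal sketch of where the translation step would sit in a build - the
rr-translate invocation and options shown here are illustrative only; see its
POD documentation for the real interface:

  # stop gcc after it has produced the assembler output
  gcc -O2 -S foo.c -o foo.s
  # replace duplicated blocks with jumps to a common block
  perl rr-translate foo.s > foo.translated.s
  # assemble the translated source as usual
  as -o foo.o foo.translated.s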


And this works?
===============

Yes, it does. See test results below.

I chose to test FSVS because
- it's my pet project, and so I know exactly how to compile it;
- it has extensive tests, which I can use to verify the translation;
- it's of a representative size.

A bigger (partly done) test is with libgcj9; see below, too.


How does this work?
===================

Please see the POD documentation in the rr-translate perl script.
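
Assuming the script lies in the current directory, that documentation can be
read with:

  perldoc ./rr-translate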


Note about the savings
======================

The values given by the script are just predictions for a perfect
translation, which won't be achieved in practice.


FSVS results
============

The compilation took a bit more than twice as long: from 14.8 sec without
translation to 35.2 sec with it.

The prediction was 2424 bytes, the real size difference 2128 bytes (0.8%):
  size:
     text    data     bss     dec     hex filename
   259824    7960   11136  278920   44188 fsvs
   257696    7960   11136  276792   43938 fsvs.translated


Linux kernel results
====================

I ran this on my custom linux kernel:
     text     data      bss       dec     hex filename
  4651138  2149868  6551208  13352214  cbbd16 /usr/src/linux/vmlinux

and got a prediction of about 37kB:
  max savings: 159513
  real savings approximate 37917 bytes.
  Writing files to /tmp.
  645 files written.

In practice a bit less is to be expected, but 32kB should be possible.


libgcj9 results
===============

As I didn't have the sources available at the moment, I only looked at the
analysis results.

I took /usr/lib/libgcj.so.90 from debian:
  libgcj9-0 4.3.3-3   Java runtime library for use with gcj
  -rw-r--r-- 1 root root 46903400 31. Jän 05:23 /usr/lib/libgcj.so.90

which is, I believe, one of the best cases, because it includes a high number
of functions where tail packing ;-) can be done.

Analysing takes about 500MB RAM, and runs on an 800MHz machine for
  real  1m36.873s
  user  1m36.426s
  sys   0m3.636s

The (unbelievable) output is (surely wrong, but a straight compile of this
library would give real hard numbers, so I stopped my bug hunting):
  max savings: 736313
  real savings approximate 1561073 bytes.


As an example of an output dump:

  # Redundancy Remover generated file.
  # This code part was originally seen at 0x016a9750
  # got 129 hits in /usr/lib/libgcj.so.90, and was 158 bytes long.
  
  .globl RRGh7K3nv6Ocjw48SB5SACg
  .type RRGh7K3nv6Ocjw48SB5SACg, @function
  
mov $0x4,%edi
callq _Jv_ThrowBadArrayIndex
nopw 0x0(%rax,%rax,1)
mov $0x5,%edi
callq _Jv_ThrowBadArrayIndex
nopw 0x0(%rax,%rax,1)
mov $0x6,%edi
callq _Jv_ThrowBadArrayIndex
nopw 0x0(%rax,%rax,1)
mov $0x7,%edi
callq _Jv_ThrowBadArrayIndex
nopw 0x0(%rax,%rax,1)
mov $0x8,%edi
callq _Jv_ThrowBadArrayIndex
nopw 0x0(%rax,%rax,1)
mov $0x9,%edi
callq _Jv_ThrowBadArrayIndex
nopw 0x0(%rax,%rax,1)
mov $0xa,%edi
callq _Jv_ThrowBadArrayIndex
nopw 0x0(%rax,%rax,1)
mov $0xb,%edi
callq _Jv_ThrowBadArrayIndex
mov $0xc,%edi
callq _Jv_ThrowBadArrayIndex
mov $0xd,%edi
callq _Jv_ThrowBadArrayIndex
xchg %ax,%ax
sub $0x8,%rsp
callq _ZN4java4util18ListResourceBundleC1Ev
add $0x8,%rsp
retq 
  

Re: RFC: Idea for code size reduction

2009-01-23 Thread Philipp Marek
Hello everybody,

On Friday 07 March 2008 Philipp Marek wrote:
> Here you are.
>
>
> code_overlap.pl - disassembles a binary, and outputs a list
>   (address, name, instruction, bytes) to STDOUT.
>
> bytes_saved.pl - takes such a list, and tries to estimate
>   the amount of bytes that could be saved.
>
>
> I normally use them this way:
>   #!/bin/sh
>   perl code_overlap.pl "$@" > overlap.txt
>   perl bytes_saved.pl
Has anyone acted on that already?

I've got some more ideas, which I'd like to test (and implement) here.


Regards,

Phil

-- 
Versioning your /etc, /home or even your whole installation?
 Try fsvs (fsvs.tigris.org)!


Re: Idea for code size reduction

2008-03-07 Thread Philipp Marek
On Friday 07 March 2008 Ian Lance Taylor wrote:
> "Philipp Marek" <[EMAIL PROTECTED]> writes:
> >>   Shouldn't this be done in the linker instead?
> >
> > Well, can the linker change the instruction sequences? Ie. put a JMP
> > instead of other code?
>
> Sure.  The linker can do whatever it likes.  The usual problem is that
> by the time the linker sees the code, it no longer knows about PC
> relative references within the same section.  So if you have
>
> call foo
> ...
> sequence to remove
> ...
> foo:
>
> and the call is PC-relative, then when the linker removes the sequence
> of instructions it will not know that it needs to adjust the call to
> foo.
Right.

> What you are describing is a type of linker relaxation, and the same
> issues arise there.  The usual solution is to force the assembler to
> emit all PC-relative relocations, so that the linker knows about them
> and is able to adjust them.
But what we'd need above is to have the compiler emit a reference to
[removed_sequence + offset] instead, and that could then be resolved by the
linker as usual.

So when the compiler removes a piece of code by replacing it with a jump, it
can still use offsets into that sequence as usual.

> Linker relaxation is used routinely in special sections, such as the
> .eh_frame section used to hold stack unwind information for throwing
> exceptions.  It is also used for code sequences on some processors,
> but normally not on the i386/x86_64.
Thank you very much for this information!


Another idea: if we find out that such sequences are *very* common (e.g. by
comparing many different binaries and libraries), libc (or a libcommon or
something like that) could provide a set of these "abbreviations" - e.g.
indexed by a symbol defined as MD5_hex(removed section).
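
Purely as an illustration of that naming scheme - the "abbrev_" prefix is made
up for this sketch, and the removed sequence is assumed to be available as a
raw byte file:

  # symbol name derived from the MD5 of the removed byte sequence
  SYM="abbrev_$(md5sum < removed_sequence.bin | cut -d' ' -f1)"
  echo "$SYM"    # abbrev_ followed by 32 hex digits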


-- 
Versioning your /etc, /home or even your whole installation?
 Try fsvs (fsvs.tigris.org)!


Re: RFC: Idea for code size reduction

2008-03-07 Thread Philipp Marek
BTW - It gets much better:

  $ ls -la /usr/lib/libgcj.so.90.0.0
  -rw-r--r-- 1 root root 32844984  3. Feb 16:03 /usr/lib/libgcj.so.90.0.0

  $ ./run /usr/lib/libgcj.so.90.0.0
  Approx. 470045 bytes saved.

That's 1.4% :-)
If my script isn't buggy, that is 


-- 
Versioning your /etc, /home or even your whole installation?
 Try fsvs (fsvs.tigris.org)!


Re: RFC: Idea for code size reduction

2008-03-07 Thread Philipp Marek
Hello Jakub!

>> When wouldn't that possible? My script currently splits on an
>> instruction-level -- although I would see no problem that some branch
>> jumps into a "half" opcode of another branch, if the byte sequence
>> matches.
>
> Consider:
> 00000000 <foo>:
>        0:   b8 a4 00 00 00  mov    $0xa4,%eax
>        5:   ba fc 04 00 00  mov    $0x4fc,%edx
>        a:   f7 e2           mul    %edx
>        c:   05 d2 04 00 00  add    $0x4d2,%eax
>       11:   c3              ret
> ...
>
> 00020012 <bar>:
>    20012:   39 d2           cmp    %edx,%edx
>    20014:   75 07           jne    2001d <bar+0xb>
>    20016:   ba fc 04 00 00  mov    $0x4fc,%edx
>    2001b:   f7 e2           mul    %edx
>    2001d:   05 d2 04 00 00  add    $0x4d2,%eax
>    20022:   c3              ret
That's not an example of jumping into the *middle* of an instruction.

I was thinking about something like this (please ignore the wrong addresses,
it's just copy/paste):

       5:   ba fc 04 00 00  mov    $0x4fc,%edx
--->   a:   f7 e2           mul    %edx
       c:   05 d2 04 00 00  add    $0x4d2,%eax
      11:   c3              ret
...

   20016:   ba 00 ba f7 e2  mov    $0xe2f7ba00,%edx
   2001b:   f7 e2           mul    %edx
   2001d:   05 d2 04 00 00  add    $0x4d2,%eax
   20022:   c3              ret
Now we could place a jump at 0x4, which goes to 0x20018 - and should work,
shouldn't it?


> If you merge the mov/mul/add/ret sequences by replacing the foo tail
> sequence with jmp bar+5, then the jne will branch to wrong place, or
> if you try to adjust it, it is too far to reach the target.
Yes. But see below.

>> > but even jmp argument is relative, not absolute.
>> That's why I take only jumps with 32bit arguments - these are absolute.
>
> No, they are relative.
>
>  :
>        0:   e9 05 00 02 00  jmp    2000a
>        5:   e9 00 00 02 00  jmp    2000a
> ...
>
> 0002000a :
>    2000a:   90              nop
>
> See how they are encoded.
Yes, ok.
But how does the compiler encode jumps? Via symbols, and the correct
destination address gets put there by the linker. If there are
(sizeof(void*)) operand bytes, there's enough space for *any* jump.

(Is there some way for the linker to compress small jumps later?)

> BTW, have you tried to compile the whole kernel with --combine, or at
> least
> e.g. each kernel directory with --combine?  I guess that will give you
> bigger savings than 30K.  Also, stop defining inline to inline
> __attribute__((always_inline)), I think Ingo also added such patch
> recently
> and it saved 120K.
No ... but these are independent changes.

I'm currently compiling with --combine.


Regards,

Phil


-- 
Versioning your /etc, /home or even your whole installation?
 Try fsvs (fsvs.tigris.org)!


Re: RFC: Idea for code size reduction

2008-03-07 Thread Philipp Marek
Hello Jakub!

> You need to be very careful with it, as if there are any jumps
> into the middle of the to be abstracted tail sequences, you can't
> abstract them or would need to adjust also the jumps into them (if
> possible, which not always is).
When wouldn't that be possible? My script currently splits at the
instruction level -- although I'd see no problem with a branch jumping into
a "half" opcode of another branch, as long as the byte sequence matches.

As I start from the disassembly, the sequences *have* to converge later -
otherwise I wouldn't have found the same end point!

So I think it's just a matter of defining another symbol at that address,
and using that as the jump target.

> Also the tail sequence shouldn't contain any PC relative
> references (say you can't merge this:
> movl (.+0xabcd)(%rip), %eax
> ret
> which is byte identical, yet would reference different variables),
That's right ... I didn't think of that.
The script could "forget" the current byte sequences if it encounters %rip
relative addressing.

Note: I only know a bit about i686 and amd64 - for other processors that
would have to be different, of course.

> but even jmp argument is relative, not absolute.
That's why I take only jumps with 32bit arguments - these are absolute.

> You can't do this kind of optimization in the linker, as that's too
> late, many relocations are already relocated and lost during assembly
> and you need to understand the instructions anyway.
> Doing it in GCC would be terribly costly, if that would mean
> expanding everything into RTL, running all RTL passes and instead of
> emitting them, remember the pre-final RTL sequence for each emitted
> function, then do an IPA pass over all the sequences.
> Perhaps doing that in assembler, combined with --combine so that
> assembler can see everything together...
Yes ... I think doing a second compile pass might be the easiest way, and
not much slower than other solutions.
  (We could always remember which object files could be optimized, and
  only recompile *those*. After all, it's just an additional
  optimization.)


Regards,

Phil


-- 
Versioning your /etc, /home or even your whole installation?
 Try fsvs (fsvs.tigris.org)!


RE: Idea for code size reduction

2008-03-07 Thread Philipp Marek
Hello Dave!

> One architectural problem here is that GCC doesn't emit bytes. It emits
> ASCII text, in the form of assembly instructions, and it's
> not always easy to predict how they'll look by the time they've been
> through the assembler and then had relocs applied by the
> linker.  (Indeed, to know what particular bytes you'd end up with you'd
> need to do everything the assembler and linker does, which
> is why we have an assembler and linker - the stuff GCC emits isn't
> directly executable...)
I know that.
So maybe my current way (compile -> find duplicates -> define duplicates
-> reuse them in a fresh compile) is the easiest way to do that.

>   Shouldn't this be done in the linker instead?
Well, can the linker change the instruction sequences? I.e. put a JMP in
place of other code?


Regards,

Phil


-- 
Versioning your /etc, /home or even your whole installation?
 Try fsvs (fsvs.tigris.org)!


Re: RFC: Idea for code size reduction

2008-03-07 Thread Philipp Marek
Hello Michael!


> Can I test your script in my embedded system?
> Can you send to me?
Here you are.


code_overlap.pl - disassembles a binary, and outputs a list
  (address, name, instruction, bytes) to STDOUT.

bytes_saved.pl - takes such a list, and tries to estimate
  the amount of bytes that could be saved.


I normally use them this way:
  #!/bin/sh
  perl code_overlap.pl "$@" > overlap.txt
  perl bytes_saved.pl

Which just outputs the number of bytes.


Hope that helps ... please let me know whether it works for you.


Regards,

Phil


-- 
Versioning your /etc, /home or even your whole installation?
 Try fsvs (fsvs.tigris.org)!

bytes_saved.pl
Description: Perl program


code_overlap.pl
Description: Perl program


Re: RFC: Idea for code size reduction

2008-03-07 Thread Philipp Marek
Hello Richard!

[ I took linux-tiny out - it's moderated, and I don't want to spam them. ]

> Sounds like what -frtl-abstract-sequences is trying to do.
Yes, thank you. I didn't know that; that should be close.

*But*: I think it doesn't work.

  $ size vmlinux-as vmlinux-Os
     text    data     bss      dec     hex filename
  4797637  446554  606208  5850399  59451f vmlinux-as
  4802166  446554  606208  5854928  5956d0 vmlinux-Os

If I'm not mistaken that's only about a 4500 byte difference ... which is *a
lot* less than my 30kB.
That doesn't mean that my scripts are correct - they might simply be buggy
... but I verified some of the sequences by hand.

Well, see the other message for the scripts.

Thank you for your answer!


Regards,

Phil

-- 
Versioning your /etc, /home or even your whole installation?
 Try fsvs (fsvs.tigris.org)!


RFC: Idea for code size reduction

2008-03-07 Thread Philipp Marek
Hello everybody,

I have a feature request.


I'd like to (manually) define some byte blocks, e.g. as functions with an
identifier.  Then, whenever GCC would emit exactly these bytes, it would put
a JMP to that identifier there instead.

This would help by sharing many identical (text/code) byte sequences, to
make a smaller image - like -fmerge-constants does for data - and therefore
helps cache locality, too.


I wrote some perl scripts to test this.  I took an "alldefconfig" i686
kernel, let objdump disassemble it, and at every "iret", "ret", "ljmp" or
"jmp" with a 4 byte address I store the preceding few bytes.  Another script
goes through these block ends, and estimates the number of bytes saved if
identical sequences get changed into a 5 byte opcode (a jump with a 32bit
address).
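
The disassembly itself is nothing special; the scripts do this step
themselves, but by hand it would just be the usual call (exact flags don't
matter much):

  objdump -d vmlinux > vmlinux.dis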


The original kernel has
  $ size vmlinux
     text    data     bss      dec     hex filename
  4802166  446554  606208  5854928  5956d0 vmlinux

and my scripts predict space savings of about 30kB.  Now that's not
that much, but for embedded systems it means room for another userspace
binary.
  [ If anyone's interested, I can post my scripts here ... they're not
that large. ]

And, of course, the same can be used for other binaries too - which could
save space there as well.  (blender-bin, text size 8664025, gives 97KB of
savings.)


I'd imagine the usage roughly like this (a sketch of the first two steps
follows the list):
- Build the binary
- Run my scripts, to find some common code sequences
- Give them to gcc
- Rebuild, using the optimizations
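
The first two steps are just my scripts as described elsewhere in this
thread; steps 3 and 4 would need the proposed gcc support, so there is no
real command line for them yet (file names here are only placeholders):

  # 1. build the binary as usual
  gcc -Os -o myprog ...
  # 2. estimate the common code sequences
  perl code_overlap.pl myprog > overlap.txt
  perl bytes_saved.pl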

I don't know whether some new optimize switch could be used, to do all
that internally in the compiler and linker.


What do you think about that?
[ Please keep at least *me* cc'ed - don't know whether that's that
  interesting for linux-tiny. ]


Regards,

Phil


-- 
Versioning your /etc, /home or even your whole installation?
 Try fsvs (fsvs.tigris.org)!


Dynamic macro expansion through a pipe?

2008-02-20 Thread Philipp Marek
Hello everybody,

I'm looking for a nice solution.


One of my programs has a lot of strings in it; about 20% of binary size, or
something like that.  Now most of them are rarely used - error messages
like "Cannot write to file", "No space left while allocating memory", etc.

I tried to compress just the help texts in the binary - and saved 11k
without any other modification.

Now I'd like to wrap that in some magic macro, so that it looks like this:

char *text=C("text");

Then there could be
#if USE_COMPRESSED_TEXT
#define C(x) ...
#else
#define C(x) x
#endif

but that's where I'm stuck.

It's easy to write a function that takes a pointer to some compressed space
and returns a pointer to the uncompressed data (re-entry is no problem).

I don't know how to get the transformation while compiling.
- I could try #define C(x) uncompress(buffer##__LINE__) and write some
  script that looks for the C() calls and generates a C file with the
  buffer constants - which means parsing (some of) C; a rough sketch
  follows this list.
- Or these texts get put in some special segment, that gets compressed into
  another before linking, and the original segment gets removed.
- Or, what I'd like best: Is there some way to tell GCC "whenever this
  macro is used, call program X, and substitute the macro value with the
  output"? That would be the easiest way for me.


Now you might say that it isn't that interesting - just use a compressed
filesystem, or whatever.  But these are all different tradeoffs; I'd just
like to reduce the storage (and typical memory) footprint of my application
by storing rarely used data compressed.  I'd imagine that other projects
would use that too, if it's easily available - and then the space needed
for a small installation would decrease by some non-trivial amount.

And, not least, allowing dynamic macro expansion would allow some neat
hacks :-)


Please keep me CC:ed; thank you for all ideas, tips and other feedback.


Regards,

Phil



-- 
Versioning your /etc, /home or even your whole installation?
 Try fsvs (fsvs.tigris.org)!