Re: Skipping assembler when producing slim LTO files
On Wed, Sep 24, 2014 at 7:46 AM, Jan Hubicka hubi...@ucw.cz wrote:
> Hi,
> This patch is something I was playing around with, with the assistance of Ian Taylor. It seems I need a bit more help though :)
>
> It adds support for direct output of slim LTO files to the compiler binary. It works as a proof of concept, but there are two key parts missing:
> 1) Extension of libiberty's simple file to handle outputting symbols into COMMON. This is needed to output __gnu_lto_v1 and __gnu_lto_slim. Search for TODO in the patch below.
> 2) Support in the driver to properly execute the *1 binary.
>
> I also disabled outputting the ident directive, but I think that one may not be necessary because the files are identified by the gnu_lto_v1 symbols. We could add it later.
>
> Currently the path bypassing the asm stage can be tested as follows:
>
> jan@linux-ujxe:~/trunk/build/gcc> cat a.c
> main ()
> {
>   printf ("Hello world\n");
> }
> jan@linux-ujxe:~/trunk/build/gcc> ./xgcc -B ./ -O3 a.c -flto -S -fbypass-asm=crtbegin.o -o a.o
> jan@linux-ujxe:~/trunk/build/gcc> ./xgcc -B ./ -O2 a.o -flto
> jan@linux-ujxe:~/trunk/build/gcc> ./a.out
> Hello world
>
> The implementation is pretty straightforward except for -fbypass-asm requiring one existing OBJ file to fetch the target's file attributes from. This is definitely not optimal, but libiberty currently can't build output files from scratch. As Ian suggested, I plan to simply arrange for the driver to pass crtbegin around, at least to start with. We may want to bypass this later and store proper attributes in the binary.
>
> Ian, would you be so kind as to implement the ability to output those two symbols into lto-object-simple? I think we can start with ELF-only support.
>
> The large chunk just moves lto-object around with very small changes in it, so the patch is fairly easy.
>
> I did just a quick benchmark with an unoptimized cc1 binary compiling the file above. For 1000 invocations with bypass I get:
>
> real    0m14.186s
> user    0m10.957s
> sys     0m2.424s
>
> While the default path gets:
>
> real    0m21.913s
> user    0m13.856s
> sys     0m5.705s
>
> With the OpenSUSE 13.1 default GCC 4.8.3 build:
>
> real    0m15.160s
> user    0m8.481s
> sys     0m5.159s
>
> (the difference here is most likely the optimized vs. unoptimized binary; perf shows contains_struct_check near the top, so startup overhead still dominates)
>
> And with clang-3.4:
>
> real    0m30.097s
> user    0m22.012s
> sys     0m6.649s
>
> That is a fairly nice speedup IMO. With an optimized build the difference should be more visible because cc1 startup issues will become less important.
>
> I definitely see ASM file overhead as a measurable issue with real-world benchmarks (libreoffice build). Clearly we produce several GBs of object files going through a crappy and bloated text encoding just for the sake of doing it.

Shouldn't -fbypass-asm be simply mangled by the driver? That is, the user simply specifies -fbypass-asm and via spec magic the driver substitutes this with -fbypass-asm=crtbegin.o? That way at least the user interface should be stable (as we're supposedly removing the requirement for that existing object file at some point).

Btw, with early debug info we also need to store DWARF somewhere. Either we drop the support for fat LTO objects and thus can store the DWARF alongside the GIMPLE IL and simply link with these files at the end, or we need to support a separate set of files to store the DWARF. If we need separate files, then why not store the GIMPLE IL data in separate objects in the first place and output a reference to it into the main object file? That way we don't need any special attributes: the linker plugin simply opens the main object file, extracts the reference to the IL file and passes that along.

Btw, the patch is very hard to read as it moves (and modifies?) files at the same time. What are these magic file attributes we need?

Thanks,
Richard.
Re: Skipping assembler when producing slim LTO files
Richard Biener richard.guent...@gmail.com writes:
> Btw, the patch is very hard to read as it moves (and modifies?) files at the same time. What's this magic file attributes we need?

The file attributes issue is the ELF machine number, class, OSABI, flags, and endianness. When generating an ELF file it has to have this information, and it has to match the objects generated by the assembler. If it doesn't, the linker won't accept it and pass it to the plugin as we require.

We could of course build a large table of those numbers and keep it updated for each target. But it's simpler to extract the numbers from an existing object file that we know must be valid.

Ian
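To make the list of attributes concrete, here is a minimal sketch of pulling them out of the header of an existing object file. It is not libiberty's simple-object code; the struct and function names are hypothetical, and the byte offsets are the standard ELF header layout (class, data encoding and OSABI in e_ident, e_machine at offset 18, e_flags at offset 36 for ELF32 and 48 for ELF64).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The attributes named above, as stored in an ELF file header.
   (Hypothetical struct for illustration only.)  */
struct elf_attrs {
    uint8_t  elf_class;  /* e_ident[EI_CLASS]: 1 = 32-bit, 2 = 64-bit */
    uint8_t  data;       /* e_ident[EI_DATA]: 1 = little, 2 = big endian */
    uint8_t  osabi;      /* e_ident[EI_OSABI] */
    uint16_t machine;    /* e_machine */
    uint32_t flags;      /* e_flags */
};

/* Read an N-byte unsigned integer in the file's own byte order.  */
static uint32_t read_uint(const unsigned char *p, size_t n, int big_endian)
{
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++) {
        size_t idx = big_endian ? i : n - 1 - i;
        v = (v << 8) | p[idx];
    }
    return v;
}

/* Fill OUT from the first LEN bytes of an object file.
   Returns 0 on success, -1 if BUF does not look like an ELF header.  */
int elf_attrs_from_header(const unsigned char *buf, size_t len,
                          struct elf_attrs *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 16 || memcmp(buf, "\177ELF", 4) != 0)
        return -1;
    out->elf_class = buf[4];
    out->data = buf[5];
    out->osabi = buf[7];
    if ((out->elf_class != 1 && out->elf_class != 2)
        || (out->data != 1 && out->data != 2))
        return -1;
    /* e_flags sits after three address-sized fields, so its offset
       depends on the class.  */
    size_t flags_off = (out->elf_class == 2) ? 48 : 36;
    if (len < flags_off + 4)
        return -1;
    int big = (out->data == 2);
    out->machine = (uint16_t)read_uint(buf + 18, 2, big);
    out->flags = read_uint(buf + flags_off, 4, big);
    return 0;
}
```

Reading the first 64 bytes of something like crtbegin.o and passing them to elf_attrs_from_header would yield values that, by construction, match what the assembler produces for that target.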
Re: Skipping assembler when producing slim LTO files
On Wed, Sep 24, 2014 at 2:40 PM, Ian Lance Taylor i...@airs.com wrote:
> Richard Biener richard.guent...@gmail.com writes:
> > Btw, the patch is very hard to read as it moves (and modifies?) files at the same time. What's this magic file attributes we need?
>
> The file attributes issue is the ELF machine number, class, OSABI, flags, and endianness. When generating an ELF file it has to have this information, and it has to match the objects generated by the assembler. If it doesn't, the linker won't accept it and pass it to the plugin as we require.
>
> We could of course build a large table of those numbers and keep it updated for each target. But it's simpler to extract the numbers from an existing object file that we know must be valid.

I see. Thanks for the explanation.

Richard.

> Ian
Re: Skipping assembler when producing slim LTO files
> Shouldn't -fbypass-asm be simply mangled by the driver? That is, the user simply specifies -fbypass-asm and via spec magic the driver substitutes this with -fbypass-asm=crtbegin.o? That way at least the user interface should be stable (as we're supposedly removing the requirement for that existing object file at some point).

The idea is to make -fbypass-asm internal and never exposed to the user. That is, default to it with slim LTO unless the user asks for assembler via -S.

> Btw, with early debug info we also need to store dwarf somewhere. Either we drop the support for fat LTO objects and thus can store

I think fat LTO files are useful for LIPO, which will hopefully hit mainline one day, and for other tricks, so I think we want to keep them. Hopefully pickling DWARF so that the two of them can coexist won't be that hard.

> the dwarf alongside the GIMPLE IL and simply link with these files at the end or we need to support a separate set of files to store the DWARF. If we need separate files then why not store the GIMPLE IL data into separate objects in the first place and output a reference to it into the main object file? That way we don't need any special attributes - the linker plugin simply opens the main object file, extracts the reference to the IL file and passes that along.

I do not much like the idea of separate files, as make clean will not be happy. Having everything in one file seems to make sense. The attributes are needed to make the file acceptable to the linker/archiver.

> Btw, the patch is very hard to read as it moves (and modifies?) files

Basically no modifications there (I believe I did try to set attributes there and then reverted the change). I will send an explicit diff for that file.

> at the same time. What's this magic file attributes we need?

What type of ELF you produce (32bit/64bit etc.)

Honza

> Thanks,
> Richard.
> > Honza
> >
> > Index: Makefile.in
> > ===================================================================
> > --- Makefile.in (revision 215518)
> > +++ Makefile.in (working copy)
> > @@ -1300,6 +1300,7 @@
> >   lto-section-out.o \
> >   lto-opts.o \
> >   lto-compress.o \
> > + lto-object.o \
> >   mcf.o \
> >   mode-switching.o \
> >   modulo-sched.o \
> > Index: common.opt
> > ===================================================================
> > --- common.opt (revision 215518)
> > +++ common.opt (working copy)
> > @@ -923,6 +923,9 @@
> >  Common Report Var(flag_btr_bb_exclusive) Optimization
> >  Restrict target load migration not to re-use registers in any basic block
> >
> > +fbypass-asm=
> > +Common Joined Var(flag_bypass_asm)
> > +
> >  fcall-saved-
> >  Common Joined RejectNegative Var(common_deferred_options) Defer
> >  -fcall-saved-<register> Mark <register> as being preserved across functions
> > Index: langhooks.c
> > ===================================================================
> > --- langhooks.c (revision 215518)
> > +++ langhooks.c (working copy)
> > @@ -40,6 +40,10 @@
> >  #include "cgraph.h"
> >  #include "timevar.h"
> >  #include "output.h"
> > +#include "tree-ssa-alias.h"
> > +#include "gimple-expr.h"
> > +#include "gimple.h"
> > +#include "lto-streamer.h"
> >
> >  /* Do nothing; in many cases the default hook.  */
> >
> > @@ -653,6 +657,19 @@
> >  {
> >    section *section;
> >
> > +  if (flag_bypass_asm)
> > +    {
> > +      static int initialized = false;
> > +      if (!initialized)
> > +        {
> > +          gcc_assert (asm_out_file == NULL);
> > +          lto_set_current_out_file (lto_obj_file_open (asm_file_name, true));
> > +          initialized = true;
> > +        }
> > +      lto_obj_begin_section (name);
> > +      return;
> > +    }
> > +
> >    /* Save the old section so we can restore it in lto_end_asm_section.  */
> >    gcc_assert (!saved_section);
> >    saved_section = in_section;
> > @@ -669,8 +686,13 @@
> >     implementation just calls assemble_string.  */
> >
> >  void
> > -lhd_append_data (const void *data, size_t len, void *)
> > +lhd_append_data (const void *data, size_t len, void *v)
> >  {
> > +  if (flag_bypass_asm)
> > +    {
> > +      lto_obj_append_data (data, len, v);
> > +      return;
> > +    }
> >    if (data)
> >      assemble_string ((const char *)data, len);
> >  }
> > @@ -683,6 +705,11 @@
> >  void
> >  lhd_end_section (void)
> >  {
> > +  if (flag_bypass_asm)
> > +    {
> > +      lto_obj_end_section ();
> > +      return;
> > +    }
> >    if (saved_section)
> >      {
> >        switch_to_section (saved_section);
> > Index: lto/Make-lang.in
> > ===================================================================
> > --- lto/Make-lang.in (revision 215518)
> > +++ lto/Make-lang.in (working copy)
> > @@ -22,7 +22,7 @@
> >  # The name of the LTO compiler.
> >  LTO_EXE = lto1$(exeext)
> >  # The LTO-specific object files inclued in $(LTO_EXE).
> > -LTO_OBJS = lto/lto-lang.o lto/lto.o lto/lto-object.o attribs.o lto/lto-partition.o lto/lto-symtab.o
> > +LTO_OBJS = lto/lto-lang.o lto/lto.o attribs.o lto/lto-partition.o lto/lto-symtab.o
> >  lto_OBJS = $(LTO_OBJS)
> >
> >  # Rules
> > Index: lto/lto-object.c
Re: Skipping assembler when producing slim LTO files
Jan Hubicka hubi...@ucw.cz writes:

Nice patch.

> The implementation is pretty straightforward except for -fbypass-asm requiring one existing OBJ file to fetch the target's file attributes from. This is definitely not optimal, but libiberty currently can't build output files from scratch. As Ian suggested, I plan to simply arrange for the driver to pass crtbegin around, at least to start with. We may want to bypass this later and store proper attributes in the binary.

I wonder how hard it would be to fix simple-object to be able to create files from scratch. From a quick look it would be mostly adding the right values into the header? That would need some defines per target.

-Andi
--
a...@linux.intel.com -- Speaking for myself only
Re: Skipping assembler when producing slim LTO files
On Wed, Sep 24, 2014 at 7:47 AM, Andi Kleen a...@firstfloor.org wrote:
> I wonder how hard it would be to fix simple-object to be able to create from scratch. From a quick look it would be mostly adding the right values into the header? That would need some defines per target.

It could be done, of course. It would mean maintaining a new set of tables and updating them for each target. The specific table to use would depend on the command line options. It turns into yet another data structure to update.

Ian
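For a sense of what such a table might look like, here is a rough sketch. Everything in it is hypothetical (the struct, the lookup key, the m32 field standing in for "the command line options"); only the numeric ELF constants are real. OSABI and e_flags are left out for brevity, and each would add further per-target and per-option variation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row per (target, option) combination.  Hypothetical layout.  */
struct target_elf_attrs {
    const char *target;       /* e.g. the CPU part of the target triple */
    int m32;                  /* 1 if a -m32 style option is in effect */
    unsigned char elf_class;  /* 1 = ELFCLASS32, 2 = ELFCLASS64 */
    unsigned char data;       /* 1 = ELFDATA2LSB, 2 = ELFDATA2MSB */
    unsigned short machine;   /* e_machine value */
};

static const struct target_elf_attrs target_table[] = {
    { "x86_64",  0, 2, 1, 62  },  /* EM_X86_64 */
    { "x86_64",  1, 1, 1, 3   },  /* EM_386, selected by -m32 */
    { "aarch64", 0, 2, 1, 183 },  /* EM_AARCH64 */
};

/* Return the matching row, or NULL if the table was never updated
   for this target/option pair.  */
const struct target_elf_attrs *lookup_target(const char *target, int m32)
{
    assert(target != NULL);
    for (size_t i = 0; i < sizeof target_table / sizeof target_table[0]; i++)
        if (strcmp(target_table[i].target, target) == 0
            && target_table[i].m32 == m32)
            return &target_table[i];
    return NULL;
}
```

The NULL return is the maintenance problem in miniature: every new target or ABI-changing option needs a new row, whereas copying the header of an existing object file needs none.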
Re: Skipping assembler when producing slim LTO files
> On Wed, Sep 24, 2014 at 7:47 AM, Andi Kleen a...@firstfloor.org wrote:
> > I wonder how hard it would be to fix simple-object to be able to create from scratch. From a quick look it would be mostly adding the right values into the header? That would need some defines per target.
>
> It could be done, of course. It would mean maintaining a new set of tables and updating them for each target. The specific table to use would depend on the command line options. It turns into yet another data structure to update.

Yep, I think the crtstuff hack is pretty good for now (well, under the assumption that I won't have too hard a time getting it working in the driver). I think the only real blocker is the lack of a simple-object API to create the two common symbols we need to make the object files compliant. I really hope Ian will help me on this, please ;)

Honza

> Ian
Re: Skipping assembler when producing slim LTO files
> > On Wed, Sep 24, 2014 at 7:47 AM, Andi Kleen a...@firstfloor.org wrote:
> > > I wonder how hard it would be to fix simple-object to be able to create from scratch. From a quick look it would be mostly adding the right values into the header? That would need some defines per target.
> >
> > It could be done, of course. It would mean maintaining a new set of tables and updating them for each target. The specific table to use would depend on the command line options. It turns into yet another data structure to update.
>
> Yep, i think the crtstuff hack is pretty good for now (well under assumption I won't have too hard time to get it working in the driver). I think the only real blocker is the lack of simple-object API to create the two common symbols we need to make the object files compliant. I really hope Ian will help me on this, please;)

Just for some data, I did a compile time comparison on LibreOffice
http://hubicka.blogspot.ca/2014/09/linktime-optimization-in-gcc-part-3.html
and Firefox
http://hubicka.blogspot.ca/2014/04/linktime-optimization-in-gcc-2-firefox.html

My general plan is to try to make LTO compile time faster than non-LTO and possibly clang's on my setup (i.e. with WHOPR parallelism). It is already faster than clang's LTO. Also, SPEC build times are now faster than non-LTO ones.

LibreOffice shows that GCC needs about twice as much system time. According to profiles, a good part is the ugly way we pass stuff down to the assembler, and the other part is memory use during the compilation stage.

I fixed most of the bottlenecks seen in GCC 4.9: inefficiencies in hashing for streaming, unnecessary initialization of the backend, the inliner and other stuff. Funnily enough, I benchmarked the LTO build with mainline and GCC 4.9, and the times are almost exactly the same on both Firefox and LibreOffice.

There are some slowdowns too: the speculative devirtualization issues I plan to fix today, extra streaming needed, and slowdowns in the C++ FE/preprocessor... I will benchmark the last two a bit more carefully ;) But this also means that non-LTO got slower in 5.0, so I am probably closer to reaching the goal.

Honza

> Honza
> > Ian
Re: Skipping assembler when producing slim LTO files
On Wed, Sep 24, 2014 at 6:32 PM, Jan Hubicka hubi...@ucw.cz wrote:
> Libreoffice shows that GCC needs about twice as much of system time. According to profiles, good part is the ugly way we pass stuff down to assembler and other part is memory use during the compilation stage.

Are you using -pipe? AFAIR this still isn't the default, even on GNU/Linux, but it is typically a lot faster than without.

Ciao!
Steven
Re: Skipping assembler when producing slim LTO files
> On Wed, Sep 24, 2014 at 6:32 PM, Jan Hubicka hubi...@ucw.cz wrote:
> > Libreoffice shows that GCC needs about twice as much of system time. According to profiles, good part is the ugly way we pass stuff down to assembler and other part is memory use during the compilation stage.
>
> Are you using -pipe? AFAIR this still isn't the default, even on GNU/Linux, but it is typically a lot faster than without.

I use libreoffice's default flags. Will check what they do. Given that -pipe has been around for many years and works well, what about making it the default to justify the GCC 5 release?

honza

> Ciao!
> Steven
Re: Skipping assembler when producing slim LTO files
On Wed, Sep 24, 2014 at 10:04 AM, Steven Bosscher stevenb@gmail.com wrote:
> On Wed, Sep 24, 2014 at 6:32 PM, Jan Hubicka hubi...@ucw.cz wrote:
> > Libreoffice shows that GCC needs about twice as much of system time. According to profiles, good part is the ugly way we pass stuff down to assembler and other part is memory use during the compilation stage.
>
> Are you using -pipe? AFAIR this still isn't the default, even on GNU/Linux, but it is typically a lot faster than without.

Is that true even when TMPDIR is on a ram disk? There's no obvious reason that it should be true in a parallel build. Using -pipe effectively constrains communication between the compiler and the assembler to work in PIPE_BUF blocks. Using TMPDIR introduces no such constraints, and in a big program a parallel build should obscure the fact that the compiler and assembler are serialized for each individual compilation unit.

Ian
Re: Skipping assembler when producing slim LTO files
> On Wed, Sep 24, 2014 at 10:04 AM, Steven Bosscher stevenb@gmail.com wrote:
> > On Wed, Sep 24, 2014 at 6:32 PM, Jan Hubicka hubi...@ucw.cz wrote:
> > > Libreoffice shows that GCC needs about twice as much of system time. According to profiles, good part is the ugly way we pass stuff down to assembler and other part is memory use during the compilation stage.
> >
> > Are you using -pipe? AFAIR this still isn't the default, even on GNU/Linux, but it is typically a lot faster than without.
>
> Is that true even when TMPDIR is on a ram disk? There's no obvious reason that it should be true in a parallel build. Using -pipe effectively constrains communication between the compiler and the assembler to work in PIPE_BUF blocks. Using TMPDIR introduces no such constraints, and in a big program a parallel build should obscure the fact that the compiler and assembler are serialized for each individual compilation unit.

Actually I mount /tmp as tmpfs, so this should not be an issue. Obviously for slim LTO we get more benefit from outputting binary data directly rather than spending time to printf and scanf it ;)

Honza
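As a rough illustration of the printf cost mentioned above, the sketch below renders raw bytes as an assembler `.ascii` directive. It is not GCC's assemble_string; the function name is hypothetical, and for simplicity it escapes every byte as three octal digits, which is the worst case (a real emitter can print printable characters as-is).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Render LEN raw bytes as one `.ascii "..."` line, escaping every
   byte as \ooo.  Returns the number of characters written (not
   counting the terminating NUL), or 0 if OUT is too small.  */
size_t encode_as_ascii_directive(const unsigned char *data, size_t len,
                                 char *out, size_t cap)
{
    static const char prefix[] = "\t.ascii \"";
    static const char suffix[] = "\"\n";

    assert(data != NULL && out != NULL);
    /* Each input byte becomes four output characters.  */
    size_t need = strlen(prefix) + 4 * len + strlen(suffix);
    if (cap < need + 1)
        return 0;

    char *p = out;
    p += sprintf(p, "%s", prefix);
    for (size_t i = 0; i < len; i++)
        p += sprintf(p, "\\%03o", (unsigned)data[i]);
    p += sprintf(p, "%s", suffix);
    return (size_t)(p - out);
}
```

In this worst case the text is four times the size of the payload, and the assembler then has to parse it back into the same bytes; writing the section data straight into the object file skips both steps.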
Re: Skipping assembler when producing slim LTO files
On Wed, Sep 24, 2014 at 11:47 PM, Ian Lance Taylor wrote:
> On Wed, Sep 24, 2014 at 10:04 AM, Steven Bosscher wrote:
> > Are you using -pipe? AFAIR this still isn't the default, even on GNU/Linux, but it is typically a lot faster than without.
>
> Is that true even when TMPDIR is on a ram disk? There's no obvious reason that it should be true in a parallel build. Using -pipe effectively constrains communication between the compiler and the assembler to work in PIPE_BUF blocks. Using TMPDIR introduces no such constraints, and in a big program a parallel build should obscure the fact that the compiler and assembler are serialized for each individual compilation unit.

I've done my most recent timings on a machine that has /dev/md3 mounted on /tmp. That's gcc110 on the compile farm. With/without -pipe made a significant difference.

If TMPDIR is a tmpfs or other kind of ram disk, I suppose the benefits would be less (to the point of vanishing). Unfortunately I can't test it...

Ciao!
Steven
Re: Skipping assembler when producing slim LTO files
> On Wed, Sep 24, 2014 at 11:47 PM, Ian Lance Taylor wrote:
> > On Wed, Sep 24, 2014 at 10:04 AM, Steven Bosscher wrote:
> > > Are you using -pipe? AFAIR this still isn't the default, even on GNU/Linux, but it is typically a lot faster than without.
> >
> > Is that true even when TMPDIR is on a ram disk? There's no obvious reason that it should be true in a parallel build. Using -pipe effectively constrains communication between the compiler and the assembler to work in PIPE_BUF blocks. Using TMPDIR introduces no such constraints, and in a big program a parallel build should obscure the fact that the compiler and assembler are serialized for each individual compilation unit.
>
> I've done my most recent timings on a machine that has /dev/md3 mounted on /tmp. That's gcc110 on the compile farm. With/without -pipe made a significant difference.
>
> If TMPDIR is a tmpfs or other kind of ram disk, I suppose the benefits would be less (to the point of vanishing). Unfortunately I can't test it...

OK, I tried it on my hello world benchmark with tmpfs, and -pipe really seems like a small loss. I wonder if we can work out better defaults that work for most people. I use tmpfs as I am worried about my notebook SSD still being alive and well in 3 years, but it is still far from mainstream.

Honza
Re: Skipping assembler when producing slim LTO files
On Thu, Sep 25, 2014 at 12:47 AM, Ian Lance Taylor i...@google.com wrote:
> Is that true even when TMPDIR is on a ram disk? There's no obvious reason that it should be true in a parallel build. Using -pipe effectively constrains communication between the compiler and the assembler to work in PIPE_BUF blocks. Using TMPDIR introduces no such constraints, and in a big program a parallel build should obscure the fact that the compiler and assembler are serialized for each individual compilation unit.

As an aside, I think what matters is the capacity of the pipe rather than PIPE_BUF. PIPE_BUF is the largest chunk that can be written atomically, but since we don't have a case of multiple processes writing to the same pipe(???), it doesn't matter. On a typical x86(-64) Linux system, PIPE_BUF is 4k while the capacity is by default 64k (can be increased with fcntl(fd, F_SETPIPE_SZ, ...), perhaps worth trying to see if it makes any difference?).

Still, it seems to me that making -pipe the default would make sense, if the tradeoff appears to be a small loss in the case when /tmp is a tmpfs vs. a much larger gain when /tmp is a normal fs.

--
Janne Blomqvist
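The capacity experiment suggested above can be sketched as follows. This is Linux-specific: F_GETPIPE_SZ and F_SETPIPE_SZ are Linux fcntl commands, the fallback numeric values are the Linux ones for headers that hide them, and the helper name pipe_capacity is made up for this example. It only inspects a fresh pipe; it does not touch the GCC driver.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes F_SETPIPE_SZ / F_GETPIPE_SZ on glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Fallbacks for the case where the feature macro above came too late.
   These are the Linux values; other systems do not have the calls.  */
#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031
#endif
#ifndef F_GETPIPE_SZ
#define F_GETPIPE_SZ 1032
#endif

/* Create a pipe, optionally ask the kernel to resize it to WANT bytes
   (WANT == 0 leaves the default), and return the resulting capacity
   in bytes, or -1 on error.  */
int pipe_capacity(int want)
{
    int fds[2];

    assert(want >= 0);
    if (pipe(fds) != 0)
        return -1;
    if (want > 0)
        /* May be refused, e.g. above /proc/sys/fs/pipe-max-size for an
           unprivileged process; the old capacity is then kept.  */
        (void)fcntl(fds[1], F_SETPIPE_SZ, want);
    int cap = fcntl(fds[1], F_GETPIPE_SZ);
    close(fds[0]);
    close(fds[1]);
    return cap;
}
```

Calling pipe_capacity(0) reports the default capacity, and pipe_capacity(1 << 20) shows whether the kernel grants a larger buffer. Whether a larger buffer actually changes -pipe build times would still have to be measured.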
Skipping assembler when producing slim LTO files
Hi,
This patch is something I was playing around with, with the assistance of Ian Taylor. It seems I need a bit more help though :)

It adds support for direct output of slim LTO files to the compiler binary. It works as a proof of concept, but there are two key parts missing:
1) Extension of libiberty's simple file to handle outputting symbols into COMMON. This is needed to output __gnu_lto_v1 and __gnu_lto_slim. Search for TODO in the patch below.
2) Support in the driver to properly execute the *1 binary.

I also disabled outputting the ident directive, but I think that one may not be necessary because the files are identified by the gnu_lto_v1 symbols. We could add it later.

Currently the path bypassing the asm stage can be tested as follows:

jan@linux-ujxe:~/trunk/build/gcc> cat a.c
main ()
{
  printf ("Hello world\n");
}
jan@linux-ujxe:~/trunk/build/gcc> ./xgcc -B ./ -O3 a.c -flto -S -fbypass-asm=crtbegin.o -o a.o
jan@linux-ujxe:~/trunk/build/gcc> ./xgcc -B ./ -O2 a.o -flto
jan@linux-ujxe:~/trunk/build/gcc> ./a.out
Hello world

The implementation is pretty straightforward except for -fbypass-asm requiring one existing OBJ file to fetch the target's file attributes from. This is definitely not optimal, but libiberty currently can't build output files from scratch. As Ian suggested, I plan to simply arrange for the driver to pass crtbegin around, at least to start with. We may want to bypass this later and store proper attributes in the binary.

Ian, would you be so kind as to implement the ability to output those two symbols into lto-object-simple? I think we can start with ELF-only support.

The large chunk just moves lto-object around with very small changes in it, so the patch is fairly easy.

I did just a quick benchmark with an unoptimized cc1 binary compiling the file above.
For 1000 invocations with bypass I get:

real    0m14.186s
user    0m10.957s
sys     0m2.424s

While the default path gets:

real    0m21.913s
user    0m13.856s
sys     0m5.705s

With the OpenSUSE 13.1 default GCC 4.8.3 build:

real    0m15.160s
user    0m8.481s
sys     0m5.159s

(the difference here is most likely the optimized vs. unoptimized binary; perf shows contains_struct_check near the top, so startup overhead still dominates)

And with clang-3.4:

real    0m30.097s
user    0m22.012s
sys     0m6.649s

That is a fairly nice speedup IMO. With an optimized build the difference should be more visible because cc1 startup issues will become less important.

I definitely see ASM file overhead as a measurable issue with real-world benchmarks (libreoffice build). Clearly we produce several GBs of object files going through a crappy and bloated text encoding just for the sake of doing it.

Honza

Index: Makefile.in
===================================================================
--- Makefile.in (revision 215518)
+++ Makefile.in (working copy)
@@ -1300,6 +1300,7 @@
  lto-section-out.o \
  lto-opts.o \
  lto-compress.o \
+ lto-object.o \
  mcf.o \
  mode-switching.o \
  modulo-sched.o \
Index: common.opt
===================================================================
--- common.opt (revision 215518)
+++ common.opt (working copy)
@@ -923,6 +923,9 @@
 Common Report Var(flag_btr_bb_exclusive) Optimization
 Restrict target load migration not to re-use registers in any basic block

+fbypass-asm=
+Common Joined Var(flag_bypass_asm)
+
 fcall-saved-
 Common Joined RejectNegative Var(common_deferred_options) Defer
 -fcall-saved-<register> Mark <register> as being preserved across functions
Index: langhooks.c
===================================================================
--- langhooks.c (revision 215518)
+++ langhooks.c (working copy)
@@ -40,6 +40,10 @@
 #include "cgraph.h"
 #include "timevar.h"
 #include "output.h"
+#include "tree-ssa-alias.h"
+#include "gimple-expr.h"
+#include "gimple.h"
+#include "lto-streamer.h"

 /* Do nothing; in many cases the default hook.  */

@@ -653,6 +657,19 @@
 {
   section *section;

+  if (flag_bypass_asm)
+    {
+      static int initialized = false;
+      if (!initialized)
+        {
+          gcc_assert (asm_out_file == NULL);
+          lto_set_current_out_file (lto_obj_file_open (asm_file_name, true));
+          initialized = true;
+        }
+      lto_obj_begin_section (name);
+      return;
+    }
+
   /* Save the old section so we can restore it in lto_end_asm_section.  */
   gcc_assert (!saved_section);
   saved_section = in_section;
@@ -669,8 +686,13 @@
    implementation just calls assemble_string.  */

 void
-lhd_append_data (const void *data, size_t len, void *)
+lhd_append_data (const void *data, size_t len, void *v)
 {
+  if (flag_bypass_asm)
+    {
+      lto_obj_append_data (data, len, v);
+      return;
+    }
   if (data)
     assemble_string ((const char *)data, len);
 }
@@ -683,6 +705,11 @@
 void
 lhd_end_section (void)
 {
+  if (flag_bypass_asm)
+    {
+      lto_obj_end_section ();
+      return;
+    }
   if (saved_section)