Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

Jan Hubicka Mon, 31 Aug 2020 04:16:06 -0700

> On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi
> <giuliano.belina...@usp.br> wrote:
> >
> > Hi,
> >
> > This is the final report of the "Automatic Parallel Compilation
> > Viability" project.  Please notice that this report is pretty
> > similar to the delivered from the 2nd evaluation, as this phase
> > consisted of mostly rebasing and bug fixing.
> >
> > Please, reply this message for any question or suggestion.
> 
> Thank you for your great work Giuliano!


Indeed, it is quite amazing work :)
> 
> It's odd that LTO emulated parallelism is winning here,
> I'd have expected it to be slower.  One factor might
> be different partitioning choices and the other might
> be that the I/O required is faster than the GC induced
> COW overhead after forking.  Note you can optimize
> one COW operation by re-using the main process for
> compiling the last partition.  I suppose you tested
> this on a system with a fast SSD so I/O overhead is
> small?

At the time I implemented fork based parallelism for WPA (which I think
we could recover by bit generalizing guiliano's patches), I had same
outcome: forked ltranses was simply running slower than those after
streaming.  This was however tested on Firefox in my estimate sometime
around 2013. I never tried it on units comparable to insn-emit (which
would be differnt at that time anyway). I was mostly aiming to get it
fully transparent with streaming but never quite finished it since, at
that time, it I tought time is better spent on optimizing LTO data
layout.

I suppose we want to keep both mechanizms in both WPA and normal
compilation and make compiler to choose fitting one.

Honza
> 
> Thanks again,
> Richard.
> 
> > Thank you,
> > Giuliano.
> >
> > --- 8< -----------
> >
> > # Automatic Parallel Compilation Viability: Final Report
> >
> > ## Complete Tasks
> >
> > For the third evaluation, we expected to deliver the product as a
> > series of patches for trunk.  The patch series were in fact delivered
> > [1], but several items must be fixed before merge.
> >
> >
> > Overall, the project works and speedups ranges from 0.95x to 3.3x.
> > Bootstrap is working, and therefore this can be used in an experimental
> > state.
> >
> > ## How to use
> >
> > 1. Clone the autopar_devel branch:
> > ```
> > git clone --single-branch --branch devel/autopar_devel \
> >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> > ```
> > 2. Follow the standard compilation options provided in the Compiling
> > GCC page, and install it on some directory. For instance:
> >
> > ```
> > cd gcc_autopar_devel
> > mkdir build && cd build
> > ../configure --disable-bootstrap --enable-languages=c,c++
> > make -j 8
> > make DESTDIR=/tmp/gcc11_autopar install
> > ```
> >
> > 3. If you want to test whether your version is working, just launch
> > Gcc with `-fparallel-jobs=2` when compiling a file with -c.
> >
> > 5. If you want to compile a project with this version it uses GNU
> > Makefiles, you must modify the compilation rule command and prepend a
> > `+` token to it. For example, in Git's Makefile, Change:
> > ```
> > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> >         $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > $(EXTRA_CPPFLAGS) $<
> > ```
> > to:
> > ```
> > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> >         +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > $(EXTRA_CPPFLAGS) $<
> > ```
> > as well as point the CC variable to the installed gcc, and
> > append a `-fparallel-jobs=jobserver` on your CFLAGS variable.
> >
> > # How the parallelism works in this project
> >
> > In LTO, the Whole Program Analysis decides how to partition the
> > callgraph for running the LTRANS stage in parallel.  This project
> > works very similar to this, however with some changes.
> >
> > The first was to modify the LTO structure so that it accepts
> > the compilation without IR streaming to files.  This avoid an IO
> > overhead when compiling in parallel.
> >
> > The second was to use a custom partitioner to find which nodes
> > should be in the same partition.  This was mainly done to bring COMDAT
> > together, as well as symbols that are part of other symbols, and even
> > private symbols so that we do not output hidden global symbols.
> >
> > However, experiment showed that bringing private symbols together did
> > not yield a interesting speedup on some large files, and therefore
> > we implemented two modes of partitioning:
> >
> > 1. Partition without static promotion. This is the safer method to use,
> > as we do not modify symbols in the Compilation Unit. This may lead to
> > speedups in files that have multiple entries points with low
> > connectivity between then (such as insn-emit.c), however this will not
> > provide speedups when this hypothesis is not true (gimple-match.c is an
> > example of this). This is the default mode.
> >
> > 2. Partition with static promotion to global. This is a more aggressive
> > method, as we can decide to promote some functions to global to increase
> > parallelism opportunity. This also will change the final assembler name
> > of the promoted function to avoid collision with functions of others
> > Compilation Units. To use this mode, the user has to manually specify
> > --param=promote-statics=1, as they must be aware of this.
> >
> > Currently, partitioner 2. do not account the number of nodes to be
> > promoted.  Implementing this certainly will reduce impact on produced
> > code.
> >
> > ## Jobserver Integration
> >
> > We implemented a interface to communicate with the GNU Make's Jobserver
> > that is able to detect when the GNU Make Jobserver is active, thanks to
> > Nathan Sidwell. This works as follows:
> >
> > When -fparallel-jobs=jobserver is provided, GCC will try to detect if
> > there is a running Jobserver in which we can communicate to. If true,
> > we return the token that Make originally gave to us, then we wait for
> > make for a new token that, when provided, will launch a forked child
> > process with the partition information; or fall back to default
> > compilation if Jobserver is not detected with a warning. Now if
> > -fparallel-jobs=auto, this warning will not be provided and the default
> > number of jobs will be set to 2, unless a single core CPU is detected.
> > For more information about the implementation, check gcc/jobserver.cc
> > file in the autopar_devel branch.
> >
> > ## Speedups
> >
> > Speedups ranged from 0.95x to 1.9x on a Quad-Core Intel Core-i7 8565U,
> > and from 0.99x to 3.3x when testing on a 32-Cores AMD Opteron 6376.
> > As a comparison, the theoretical maximum speedup by parallelizing this
> > part of compilation is 4x, so there are things that can be improved
> > here.
> >
> > Results are shown in the following table. There are three columns:
> >
> > * Sequential, which means running the compilation with just "-c -O2".
> > * Without Static Promotion, which means appending 
> > -fparallel-jobs=<NUM_THREADS>
> > * With Static Promotion, which means appending -fparallel-jobs=<NUM_THREADS>
> > and --param=promote-statics=1
> > * Speedup, which is Sequential / MIN (W/O Static Promotion, Static 
> > Promotion)
> > * LTO Emulated Parallelism, which consist of running a two step:
> >    g++ -c <FILE> -O2 -o t.il.o -flto -fno-fat-lto-objects && \
> >    g++ -o t.o t.il.o -O2 -r -flinker-output=nolto-rel -flto=<NUM_THREADS>
> >
> > The test was the result of a single execution with a previous warm up
> > execution. The compiled GCC had checking enabled, and therefore release
> > version might have better timings in both sequential and parallel, but the
> > speedup may remain the same.
> >
> > CPU: Intel Core-i7 8565U (4 Cores, 8 Threads)
> > |                |            | Without Static | With Static |         | 
> > LTO Emulated |
> > | File           | Sequential |    Promotion   |  Promotion  | Speedup |  
> > Parallelism |
> > |----------------|------------|----------------|-----------------------|--------------|
> > | gimple-match.c |     60s    |       63s      |     34s     |   1.7x  |    
> >  32s      |
> > | insn-emit.c    |     37s    |       19s      |     20s     |   1.9x  |    
> >  20s      |
> >
> > CPU: 4x AMD Opteron 6376 (32 Cores, 64 Threads)
> > |                |            | Without Static | With Static |         | 
> > LTO Emulated |
> > | File           | Sequential |    Promotion   |  Promotion  | Speedup |  
> > Parallelism |
> > |----------------|------------|----------------|-----------------------|--------------|
> > | gimple-match.c |    2m42s   |      2m43s     |     49s     |   3.3x  |    
> >  46s      |
> > | insn-emit.c    |    1m35s   |       30s      |     30s     |   3.1x  |    
> >  32s      |
> >
> > There was also a reduction from 35m32s to 29m30s when bootstrapping
> > GCC when -fparallel-jobs=8 on the Opteron machine.
> >
> > ## Conclusions
> >
> > This project can still be improved on a lot of places. For example,
> > just by looking at the table above, we can conclude that:
> >
> > * Promoting statics to globals can increase parallelism opportunity on
> > compilation.
> > * Even with the IO Streaming overhead, the LTO Emulated Parallelism
> > was faster than with our method. This means that either our
> > partitioner is slow or we are losing time into something which can
> > be improved.
> > * This can be used as a way to reduce overall compilation time in
> > many-core machines.
> >
> > ## Fixed bugs from phase 2:
> >
> > * Make -fparallel-jobs=jobserver automatically detect if jobserver is
> > present.
> > * Fix a bug which caused the resulting object file to not be the same
> > in every compilation, therefore making bootstrap comparison work as
> > expected.
> > * Fix a bug which caused LTO to crash due to a uninitialized output
> > FILE pointer.
> >
> > ## TODOs:
> >
> > * Maybe increase minimal partition size (--param=lto-min-partitions).
> > * Improve checking for unreasonable bias (unbalanced partitons).
> > * Use cost of partitioning symbols into account.
> > * Avoid forking for the first (or last) partitions.
> > * Modify the driver to use the SPEC language instead of injecting
> > commands in the execute () function.
> > * Use the main process to write the assembler files instead of the
> > worker process.  This removes the requirement for sorting later in
> > the driver.
> > * Remove hidden "-fPIC" and "fPIE" from the driver.
> > * Check for race condition in additional_asm_file, as it is created
> > later in the compilation process.
> > * Replace all array allocations of size greater or equal the number
> > of nodes in the graph from the stack to the heap (i.e. avoid alloca).
> > * Use 64-bit integers when computing sizes.
> > * Check for ways to reduce the time spent inside the kernel.
> > * Add support to every GCC frontend.
> > * Rely on ipa_summary instead of setting a custom cost to nodes
> > without summary.
> > * Merge the current partitioning logic with the LTO logic.
> > * Merge compute_boundary function with what is already done in LTO.
> >
> > Cosmetic:
> >
> > * Change n to num_nodes when that is related to the number of nodes of
> > the graph.
> > * Explain better what the compression[] array does in lto-partition.c
> >
> > [1] - https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552346.html

Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

Reply via email to