Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

Jan Hubicka Mon, 31 Aug 2020 09:40:16 -0700

> > I guess before investigating the current state in detail
> > it might be worth exploring Honzas wish of sharing
> > the actual partitioning code between LTO and -fparallel-jobs.
> > 
> > Note that larger objects take a bigger hit from the GC COW
> > issue so at some point that becomes dominant because the
> > first GC walk for each partition is the same as doing a GC
> > walk for the whole object.  Eventually it makes sense to
> > turn off GC completely for smaller partitions.
> 
> Just a side note, I added a ggc_collect () before start forking
> and it did not improve things.


We was discussing it on IRC. Your problem is similar to what precompiled
headers are hitting, too. They essentially make a memory dump of GGC
memory that is mmaped back at the time the compiler recognizes corresponding
#include.

There is mechanism in garbage collector to push and pop context which
was partly removed in meantime.  Pushing context makes all data in
garbage collector permanent until you popo it again and the marking is
not supposed to dirtify the pages.  It is still done for PCH, see 
ggc_pch_read and the line setting contxt_depth to 1.

In theory setting context-depth to 1 before forking and fixing paces we
dirtify the pages should help both the fork based parallelism and the
PCHs. As richi mentioned the ggc seems to set in_use_p anyway that may
trigger unwanted COW.

I would also trim memory pool and malloc before forking, that should be
cheap and work. It is already done by WPA before fork as well.

Honza
> 
> Thank you,
> Giuliano.
> 
> > 
> > Richard.
> > 
> > > Honza
> > > >
> > > > Thanks again,
> > > > Richard.
> > > >
> > > > > Thank you,
> > > > > Giuliano.
> > > > >
> > > > > --- 8< -----------
> > > > >
> > > > > # Automatic Parallel Compilation Viability: Final Report
> > > > >
> > > > > ## Complete Tasks
> > > > >
> > > > > For the third evaluation, we expected to deliver the product as a
> > > > > series of patches for trunk.  The patch series were in fact delivered
> > > > > [1], but several items must be fixed before merge.
> > > > >
> > > > >
> > > > > Overall, the project works and speedups ranges from 0.95x to 3.3x.
> > > > > Bootstrap is working, and therefore this can be used in an 
> > > > > experimental
> > > > > state.
> > > > >
> > > > > ## How to use
> > > > >
> > > > > 1. Clone the autopar_devel branch:
> > > > > ```
> > > > > git clone --single-branch --branch devel/autopar_devel \
> > > > >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> > > > > ```
> > > > > 2. Follow the standard compilation options provided in the Compiling
> > > > > GCC page, and install it on some directory. For instance:
> > > > >
> > > > > ```
> > > > > cd gcc_autopar_devel
> > > > > mkdir build && cd build
> > > > > ../configure --disable-bootstrap --enable-languages=c,c++
> > > > > make -j 8
> > > > > make DESTDIR=/tmp/gcc11_autopar install
> > > > > ```
> > > > >
> > > > > 3. If you want to test whether your version is working, just launch
> > > > > Gcc with `-fparallel-jobs=2` when compiling a file with -c.
> > > > >
> > > > > 5. If you want to compile a project with this version it uses GNU
> > > > > Makefiles, you must modify the compilation rule command and prepend a
> > > > > `+` token to it. For example, in Git's Makefile, Change:
> > > > > ```
> > > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > > > >         $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > > > > $(EXTRA_CPPFLAGS) $<
> > > > > ```
> > > > > to:
> > > > > ```
> > > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > > > >         +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > > > > $(EXTRA_CPPFLAGS) $<
> > > > > ```
> > > > > as well as point the CC variable to the installed gcc, and
> > > > > append a `-fparallel-jobs=jobserver` on your CFLAGS variable.
> > > > >
> > > > > # How the parallelism works in this project
> > > > >
> > > > > In LTO, the Whole Program Analysis decides how to partition the
> > > > > callgraph for running the LTRANS stage in parallel.  This project
> > > > > works very similar to this, however with some changes.
> > > > >
> > > > > The first was to modify the LTO structure so that it accepts
> > > > > the compilation without IR streaming to files.  This avoid an IO
> > > > > overhead when compiling in parallel.
> > > > >
> > > > > The second was to use a custom partitioner to find which nodes
> > > > > should be in the same partition.  This was mainly done to bring COMDAT
> > > > > together, as well as symbols that are part of other symbols, and even
> > > > > private symbols so that we do not output hidden global symbols.
> > > > >
> > > > > However, experiment showed that bringing private symbols together did
> > > > > not yield a interesting speedup on some large files, and therefore
> > > > > we implemented two modes of partitioning:
> > > > >
> > > > > 1. Partition without static promotion. This is the safer method to 
> > > > > use,
> > > > > as we do not modify symbols in the Compilation Unit. This may lead to
> > > > > speedups in files that have multiple entries points with low
> > > > > connectivity between then (such as insn-emit.c), however this will not
> > > > > provide speedups when this hypothesis is not true (gimple-match.c is 
> > > > > an
> > > > > example of this). This is the default mode.
> > > > >
> > > > > 2. Partition with static promotion to global. This is a more 
> > > > > aggressive
> > > > > method, as we can decide to promote some functions to global to 
> > > > > increase
> > > > > parallelism opportunity. This also will change the final assembler 
> > > > > name
> > > > > of the promoted function to avoid collision with functions of others
> > > > > Compilation Units. To use this mode, the user has to manually specify
> > > > > --param=promote-statics=1, as they must be aware of this.
> > > > >
> > > > > Currently, partitioner 2. do not account the number of nodes to be
> > > > > promoted.  Implementing this certainly will reduce impact on produced
> > > > > code.
> > > > >
> > > > > ## Jobserver Integration
> > > > >
> > > > > We implemented a interface to communicate with the GNU Make's 
> > > > > Jobserver
> > > > > that is able to detect when the GNU Make Jobserver is active, thanks 
> > > > > to
> > > > > Nathan Sidwell. This works as follows:
> > > > >
> > > > > When -fparallel-jobs=jobserver is provided, GCC will try to detect if
> > > > > there is a running Jobserver in which we can communicate to. If true,
> > > > > we return the token that Make originally gave to us, then we wait for
> > > > > make for a new token that, when provided, will launch a forked child
> > > > > process with the partition information; or fall back to default
> > > > > compilation if Jobserver is not detected with a warning. Now if
> > > > > -fparallel-jobs=auto, this warning will not be provided and the 
> > > > > default
> > > > > number of jobs will be set to 2, unless a single core CPU is detected.
> > > > > For more information about the implementation, check gcc/jobserver.cc
> > > > > file in the autopar_devel branch.
> > > > >
> > > > > ## Speedups
> > > > >
> > > > > Speedups ranged from 0.95x to 1.9x on a Quad-Core Intel Core-i7 8565U,
> > > > > and from 0.99x to 3.3x when testing on a 32-Cores AMD Opteron 6376.
> > > > > As a comparison, the theoretical maximum speedup by parallelizing this
> > > > > part of compilation is 4x, so there are things that can be improved
> > > > > here.
> > > > >
> > > > > Results are shown in the following table. There are three columns:
> > > > >
> > > > > * Sequential, which means running the compilation with just "-c -O2".
> > > > > * Without Static Promotion, which means appending 
> > > > > -fparallel-jobs=<NUM_THREADS>
> > > > > * With Static Promotion, which means appending 
> > > > > -fparallel-jobs=<NUM_THREADS>
> > > > > and --param=promote-statics=1
> > > > > * Speedup, which is Sequential / MIN (W/O Static Promotion, Static 
> > > > > Promotion)
> > > > > * LTO Emulated Parallelism, which consist of running a two step:
> > > > >    g++ -c <FILE> -O2 -o t.il.o -flto -fno-fat-lto-objects && \
> > > > >    g++ -o t.o t.il.o -O2 -r -flinker-output=nolto-rel 
> > > > > -flto=<NUM_THREADS>
> > > > >
> > > > > The test was the result of a single execution with a previous warm up
> > > > > execution. The compiled GCC had checking enabled, and therefore 
> > > > > release
> > > > > version might have better timings in both sequential and parallel, 
> > > > > but the
> > > > > speedup may remain the same.
> > > > >
> > > > > CPU: Intel Core-i7 8565U (4 Cores, 8 Threads)
> > > > > |                |            | Without Static | With Static |        
> > > > >  | LTO Emulated |
> > > > > | File           | Sequential |    Promotion   |  Promotion  | 
> > > > > Speedup |  Parallelism |
> > > > > |----------------|------------|----------------|-----------------------|--------------|
> > > > > | gimple-match.c |     60s    |       63s      |     34s     |   1.7x 
> > > > >  |     32s      |
> > > > > | insn-emit.c    |     37s    |       19s      |     20s     |   1.9x 
> > > > >  |     20s      |
> > > > >
> > > > > CPU: 4x AMD Opteron 6376 (32 Cores, 64 Threads)
> > > > > |                |            | Without Static | With Static |        
> > > > >  | LTO Emulated |
> > > > > | File           | Sequential |    Promotion   |  Promotion  | 
> > > > > Speedup |  Parallelism |
> > > > > |----------------|------------|----------------|-----------------------|--------------|
> > > > > | gimple-match.c |    2m42s   |      2m43s     |     49s     |   3.3x 
> > > > >  |     46s      |
> > > > > | insn-emit.c    |    1m35s   |       30s      |     30s     |   3.1x 
> > > > >  |     32s      |
> > > > >
> > > > > There was also a reduction from 35m32s to 29m30s when bootstrapping
> > > > > GCC when -fparallel-jobs=8 on the Opteron machine.
> > > > >
> > > > > ## Conclusions
> > > > >
> > > > > This project can still be improved on a lot of places. For example,
> > > > > just by looking at the table above, we can conclude that:
> > > > >
> > > > > * Promoting statics to globals can increase parallelism opportunity on
> > > > > compilation.
> > > > > * Even with the IO Streaming overhead, the LTO Emulated Parallelism
> > > > > was faster than with our method. This means that either our
> > > > > partitioner is slow or we are losing time into something which can
> > > > > be improved.
> > > > > * This can be used as a way to reduce overall compilation time in
> > > > > many-core machines.
> > > > >
> > > > > ## Fixed bugs from phase 2:
> > > > >
> > > > > * Make -fparallel-jobs=jobserver automatically detect if jobserver is
> > > > > present.
> > > > > * Fix a bug which caused the resulting object file to not be the same
> > > > > in every compilation, therefore making bootstrap comparison work as
> > > > > expected.
> > > > > * Fix a bug which caused LTO to crash due to a uninitialized output
> > > > > FILE pointer.
> > > > >
> > > > > ## TODOs:
> > > > >
> > > > > * Maybe increase minimal partition size (--param=lto-min-partitions).
> > > > > * Improve checking for unreasonable bias (unbalanced partitons).
> > > > > * Use cost of partitioning symbols into account.
> > > > > * Avoid forking for the first (or last) partitions.
> > > > > * Modify the driver to use the SPEC language instead of injecting
> > > > > commands in the execute () function.
> > > > > * Use the main process to write the assembler files instead of the
> > > > > worker process.  This removes the requirement for sorting later in
> > > > > the driver.
> > > > > * Remove hidden "-fPIC" and "fPIE" from the driver.
> > > > > * Check for race condition in additional_asm_file, as it is created
> > > > > later in the compilation process.
> > > > > * Replace all array allocations of size greater or equal the number
> > > > > of nodes in the graph from the stack to the heap (i.e. avoid alloca).
> > > > > * Use 64-bit integers when computing sizes.
> > > > > * Check for ways to reduce the time spent inside the kernel.
> > > > > * Add support to every GCC frontend.
> > > > > * Rely on ipa_summary instead of setting a custom cost to nodes
> > > > > without summary.
> > > > > * Merge the current partitioning logic with the LTO logic.
> > > > > * Merge compute_boundary function with what is already done in LTO.
> > > > >
> > > > > Cosmetic:
> > > > >
> > > > > * Change n to num_nodes when that is related to the number of nodes of
> > > > > the graph.
> > > > > * Explain better what the compression[] array does in lto-partition.c
> > > > >
> > > > > [1] - 
> > > > > https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552346.html

Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

Reply via email to