On August 31, 2020 6:21:27 PM GMT+02:00, Giuliano Belinassi <giuliano.belina...@usp.br> wrote:
>Hi, Richi.
>
>On 08/31, Richard Biener wrote:
>> On Mon, Aug 31, 2020 at 1:15 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>> >
>> > > On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi
>> > > <giuliano.belina...@usp.br> wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > This is the final report of the "Automatic Parallel Compilation
>> > > > Viability" project. Please notice that this report is pretty
>> > > > similar to the one delivered for the 2nd evaluation, as this
>> > > > phase consisted mostly of rebasing and bug fixing.
>> > > >
>> > > > Please reply to this message with any question or suggestion.
>> > >
>> > > Thank you for your great work, Giuliano!
>> >
>> > Indeed, it is quite amazing work :)
>> > >
>> > > It's odd that LTO emulated parallelism is winning here;
>> > > I'd have expected it to be slower. One factor might
>> > > be different partitioning choices and the other might
>> > > be that the I/O required is faster than the GC-induced
>> > > COW overhead after forking. Note you can optimize
>> > > away one COW operation by re-using the main process for
>> > > compiling the last partition. I suppose you tested
>> > > this on a system with a fast SSD, so I/O overhead is
>> > > small?
>> >
>> > At the time I implemented fork-based parallelism for WPA (which I
>> > think we could recover by somewhat generalizing Giuliano's patches),
>> > I had the same outcome: forked ltranses were simply running slower
>> > than those after streaming. This was however tested on Firefox,
>> > by my estimate sometime around 2013. I never tried it on units
>> > comparable to insn-emit (which would be different at that time
>> > anyway). I was mostly aiming to get it fully transparent with
>> > streaming but never quite finished it since, at that time, I
>> > thought time was better spent on optimizing the LTO data layout.
>> >
>> > I suppose we want to keep both mechanisms in both WPA and normal
>> > compilation and make the compiler choose the fitting one.
>>
>> I repeated Giuliano's experiment on gimple-match.ii: producing LTO
>> bytecode takes 5.3s, and the link step takes 9.5s with two jobs,
>> 6.6s with three, 5.0s with four and 2.4s with 32.
>>
>> With -fparallel-jobs=N and --param promote-statics=1 I
>> see 14.8s, 13.9s and 13.5s here. With 8 jobs the reduction
>> is to 11s.
>>
>> It looks like LTO partitions this much more aggressively - I see
>> 36 partitions generated for gimple-match.c while -fparallel-jobs
>> creates "only" 27. -fparallel-jobs doesn't seem to honor the
>> various lto-partition --params, btw? The relevant ones would be
>> --param lto-partitions (the maximum number of partitions to
>> generate) and --param lto-min-partition (the minimum size for a
>> partition). I always thought that lto-min-partition should be
>> higher for -fparallel-jobs (which I envisioned to be enabled by
>> default).
>
>There is a partition balancing mechanism that can be disabled
>with --param=balance-partitions=0.
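As a rough illustration for readers following the thread, the kind of balancing being discussed (greedily fill partitions up to a size target derived from the job count, then fold undersized leftovers into a neighbor) can be modeled like this. This is a hedged toy sketch: the function and parameter names are invented here, and the real heuristic lives in lto-partition.c on the branch.

```python
# Toy model of size-balanced partitioning. All names are illustrative;
# this is NOT the lto-partition.c implementation.

def balance_partitions(symbol_sizes, n_jobs, min_partition_size):
    total = sum(symbol_sizes)
    # Aim for roughly total/n_jobs per partition, but never below the
    # minimum size at which parallel compilation pays off.
    target = max(total // n_jobs, min_partition_size)

    partitions, current, current_size = [], [], 0
    for size in symbol_sizes:
        current.append(size)
        current_size += size
        if current_size >= target:       # partition is big enough; close it
            partitions.append(current)
            current, current_size = [], 0
    if current:
        partitions.append(current)

    # Fold a trailing partition that stayed below the minimum size
    # into its predecessor, so no job gets trivially small work.
    while len(partitions) > 1 and sum(partitions[-1]) < min_partition_size:
        tail = partitions.pop()
        partitions[-1].extend(tail)
    return partitions

parts = balance_partitions([10, 40, 5, 30, 25, 8, 2],
                           n_jobs=3, min_partition_size=20)
print(len(parts), [sum(p) for p in parts])   # -> 2 [50, 70]
```

Note how the result depends on `n_jobs`, which is exactly the reproducibility concern raised below.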
Ah, I used =1 for this...

>Assuming that you used -fparallel-jobs=32, it may be possible
>that it merged small partitions until it reached the average
>size of max_size / 33, which resulted in 27 partitions.

Note that the partitioning shouldn't depend on the argument to
-fparallel-jobs, for the sake of reproducible builds.

>The only lto parameter that I use is --param=lto-min-partition,
>controlling the minimum size at which it will run in parallel.
>
>> I guess before investigating the current state in detail
>> it might be worth exploring Honza's wish of sharing
>> the actual partitioning code between LTO and -fparallel-jobs.
>>
>> Note that larger objects take a bigger hit from the GC COW
>> issue, so at some point that becomes dominant, because the
>> first GC walk for each partition costs the same as a GC
>> walk over the whole object. Eventually it makes sense to
>> turn off GC completely for smaller partitions.
>
>Just a side note: I added a ggc_collect () before starting to
>fork and it did not improve things.

You need to force collection; ggc_collect () is usually a no-op.
Also see Honza's response here. Some experiments are needed here.

Richard.

>Thank you,
>Giuliano.
>
>> Richard.
>>
>> > Honza
>> > >
>> > > Thanks again,
>> > > Richard.
>> > >
>> > > > Thank you,
>> > > > Giuliano.
>> > > >
>> > > > --- 8< -----------
>> > > >
>> > > > # Automatic Parallel Compilation Viability: Final Report
>> > > >
>> > > > ## Complete Tasks
>> > > >
>> > > > For the third evaluation, we expected to deliver the product
>> > > > as a series of patches for trunk. The patch series was in
>> > > > fact delivered [1], but several items must be fixed before
>> > > > the merge.
>> > > >
>> > > > Overall, the project works, and speedups range from 0.95x
>> > > > to 3.3x. Bootstrap is working, and therefore this can be
>> > > > used in an experimental state.
>> > > >
>> > > > ## How to use
>> > > >
>> > > > 1. Clone the autopar_devel branch:
>> > > > ```
>> > > > git clone --single-branch --branch devel/autopar_devel \
>> > > >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
>> > > > ```
>> > > > 2. Follow the standard compilation options provided on the
>> > > > Compiling GCC page, and install it into some directory. For
>> > > > instance:
>> > > >
>> > > > ```
>> > > > cd gcc_autopar_devel
>> > > > mkdir build && cd build
>> > > > ../configure --disable-bootstrap --enable-languages=c,c++
>> > > > make -j 8
>> > > > make DESTDIR=/tmp/gcc11_autopar install
>> > > > ```
>> > > >
>> > > > 3. If you want to test whether your version is working, just
>> > > > launch GCC with `-fparallel-jobs=2` when compiling a file
>> > > > with -c.
>> > > >
>> > > > 4. If you want to compile a project that uses GNU Makefiles
>> > > > with this version, you must modify the compilation rule
>> > > > command and prepend a `+` token to it. For example, in
>> > > > Git's Makefile, change:
>> > > > ```
>> > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
>> > > >     $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) $(EXTRA_CPPFLAGS) $<
>> > > > ```
>> > > > to:
>> > > > ```
>> > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
>> > > >     +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) $(EXTRA_CPPFLAGS) $<
>> > > > ```
>> > > > as well as point the CC variable to the installed gcc, and
>> > > > append `-fparallel-jobs=jobserver` to your CFLAGS variable.
>> > > >
>> > > > # How the parallelism works in this project
>> > > >
>> > > > In LTO, the Whole Program Analysis decides how to partition
>> > > > the callgraph for running the LTRANS stage in parallel.
>> > > > This project works very similarly, but with some changes.
>> > > >
>> > > > The first was to modify the LTO structure so that it accepts
>> > > > compilation without streaming the IR to files. This avoids
>> > > > an I/O overhead when compiling in parallel.
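[As an aside for readers: the fork-per-partition flow described here can be modeled with the short sketch below. It assumes POSIX fork and a dummy stand-in for per-partition compilation, and it reuses the main process for the last partition as suggested earlier in the thread; it is a hedged illustration, not the branch's actual C++ implementation.]

```python
# Sketch: fork one child per callgraph partition. Each child inherits
# the already-built IR (no streaming to disk) and reports its result
# through a pipe. Names and the "compile" stand-in are invented here.
import os

def compile_partition(index, symbols):
    # Stand-in for compiling one partition into its own assembler file.
    return f"partition-{index}: {len(symbols)} symbols"

def parallel_compile(partitions):
    children = []
    for i, part in enumerate(partitions[:-1]):
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                 # child: compile, report, exit
            os.close(r)
            os.write(w, compile_partition(i, part).encode())
            os._exit(0)
        os.close(w)                  # parent keeps only the read end
        children.append((pid, r))
    results = []
    for pid, r in children:
        with os.fdopen(r, "rb") as f:
            results.append(f.read().decode())
        os.waitpid(pid, 0)
    # Reuse the main process for the last partition, saving one
    # copy-on-write fork.
    results.append(compile_partition(len(partitions) - 1, partitions[-1]))
    return results

print(parallel_compile([["a", "b"], ["c"], ["d", "e", "f"]]))
```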
>> > > >
>> > > > The second was to use a custom partitioner to decide which
>> > > > nodes should be in the same partition. This was mainly done
>> > > > to bring COMDATs together, as well as symbols that are part
>> > > > of other symbols, and even private symbols, so that we do
>> > > > not output hidden global symbols.
>> > > >
>> > > > However, experiments showed that bringing private symbols
>> > > > together did not yield an interesting speedup on some large
>> > > > files, and therefore we implemented two modes of
>> > > > partitioning:
>> > > >
>> > > > 1. Partition without static promotion. This is the safer
>> > > > method to use, as we do not modify symbols in the
>> > > > Compilation Unit. This may lead to speedups in files that
>> > > > have multiple entry points with low connectivity between
>> > > > them (such as insn-emit.c), but it will not provide speedups
>> > > > when this hypothesis does not hold (gimple-match.c is an
>> > > > example of this). This is the default mode.
>> > > >
>> > > > 2. Partition with static promotion to global. This is a
>> > > > more aggressive method, as we can decide to promote some
>> > > > functions to global to increase the parallelism opportunity.
>> > > > This also changes the final assembler name of the promoted
>> > > > function to avoid collisions with functions of other
>> > > > Compilation Units. To use this mode, the user has to
>> > > > manually specify --param=promote-statics=1, as they must be
>> > > > aware of this.
>> > > >
>> > > > Currently, partitioner 2 does not account for the number of
>> > > > nodes to be promoted. Implementing this will certainly
>> > > > reduce the impact on the produced code.
>> > > >
>> > > > ## Jobserver Integration
>> > > >
>> > > > We implemented an interface to communicate with GNU Make's
>> > > > Jobserver that is able to detect when the Jobserver is
>> > > > active, thanks to Nathan Sidwell.
>> > > > This works as follows:
>> > > >
>> > > > When -fparallel-jobs=jobserver is provided, GCC will try to
>> > > > detect whether there is a running Jobserver we can
>> > > > communicate with. If so, we return the token that Make
>> > > > originally gave to us, then wait for a new token from Make
>> > > > that, when provided, will launch a forked child process with
>> > > > the partition information; if no Jobserver is detected, we
>> > > > fall back to the default compilation with a warning. With
>> > > > -fparallel-jobs=auto, this warning is not emitted and the
>> > > > default number of jobs is set to 2, unless a single-core CPU
>> > > > is detected. For more information about the implementation,
>> > > > check the gcc/jobserver.cc file in the autopar_devel branch.
>> > > >
>> > > > ## Speedups
>> > > >
>> > > > Speedups ranged from 0.95x to 1.9x on a Quad-Core Intel
>> > > > Core-i7 8565U, and from 0.99x to 3.3x when testing on a
>> > > > 32-core AMD Opteron 6376. As a comparison, the theoretical
>> > > > maximum speedup from parallelizing this part of compilation
>> > > > is 4x, so there are things that can be improved here.
>> > > >
>> > > > Results are shown in the following table. There are five
>> > > > columns:
>> > > >
>> > > > * Sequential, which means running the compilation with just
>> > > >   "-c -O2".
>> > > > * Without Static Promotion, which means appending
>> > > >   -fparallel-jobs=<NUM_THREADS>.
>> > > > * With Static Promotion, which means appending
>> > > >   -fparallel-jobs=<NUM_THREADS> and
>> > > >   --param=promote-statics=1.
>> > > > * Speedup, which is Sequential / MIN (Without Static
>> > > >   Promotion, With Static Promotion).
>> > > > * LTO Emulated Parallelism, which consists of running the
>> > > >   two steps:
>> > > >   g++ -c <FILE> -O2 -o t.il.o -flto -fno-fat-lto-objects && \
>> > > >   g++ -o t.o t.il.o -O2 -r -flinker-output=nolto-rel -flto=<NUM_THREADS>
>> > > >
>> > > > The test was the result of a single execution with a
>> > > > previous warm-up execution.
>> > > > The compiled GCC had checking enabled, and therefore a
>> > > > release version might have better timings in both sequential
>> > > > and parallel modes, but the speedup should remain about the
>> > > > same.
>> > > >
>> > > > CPU: Intel Core-i7 8565U (4 Cores, 8 Threads)
>> > > > | File           | Sequential | Without Static Promotion | With Static Promotion | Speedup | LTO Emulated Parallelism |
>> > > > |----------------|------------|--------------------------|-----------------------|---------|--------------------------|
>> > > > | gimple-match.c | 60s        | 63s                      | 34s                   | 1.7x    | 32s                      |
>> > > > | insn-emit.c    | 37s        | 19s                      | 20s                   | 1.9x    | 20s                      |
>> > > >
>> > > > CPU: 4x AMD Opteron 6376 (32 Cores, 64 Threads)
>> > > > | File           | Sequential | Without Static Promotion | With Static Promotion | Speedup | LTO Emulated Parallelism |
>> > > > |----------------|------------|--------------------------|-----------------------|---------|--------------------------|
>> > > > | gimple-match.c | 2m42s      | 2m43s                    | 49s                   | 3.3x    | 46s                      |
>> > > > | insn-emit.c    | 1m35s      | 30s                      | 30s                   | 3.1x    | 32s                      |
>> > > >
>> > > > There was also a reduction from 35m32s to 29m30s when
>> > > > bootstrapping GCC with -fparallel-jobs=8 on the Opteron
>> > > > machine.
>> > > >
>> > > > ## Conclusions
>> > > >
>> > > > This project can still be improved in a lot of places. For
>> > > > example, just by looking at the table above, we can conclude
>> > > > that:
>> > > >
>> > > > * Promoting statics to globals can increase the parallelism
>> > > >   opportunity in compilation.
>> > > > * Even with the I/O streaming overhead, the LTO Emulated
>> > > >   Parallelism was faster than our method. This means that
>> > > >   either our partitioner is slow or we are losing time on
>> > > >   something which can be improved.
>> > > > * This can be used as a way to reduce overall compilation
>> > > >   time on many-core machines.
>> > > >
>> > > > ## Fixed bugs from phase 2:
>> > > >
>> > > > * Make -fparallel-jobs=jobserver automatically detect
>> > > >   whether a jobserver is present.
>> > > > * Fix a bug which caused the resulting object file to
>> > > >   differ between compilations, so that bootstrap comparison
>> > > >   now works as expected.
>> > > > * Fix a bug which caused LTO to crash due to an
>> > > >   uninitialized output FILE pointer.
>> > > >
>> > > > ## TODOs:
>> > > >
>> > > > * Maybe increase the minimal partition size
>> > > >   (--param=lto-min-partition).
>> > > > * Improve checking for unreasonable bias (unbalanced
>> > > >   partitions).
>> > > > * Take the cost of partitioning symbols into account.
>> > > > * Avoid forking for the first (or last) partition.
>> > > > * Modify the driver to use the SPEC language instead of
>> > > >   injecting commands in the execute () function.
>> > > > * Use the main process to write the assembler files instead
>> > > >   of the worker processes. This removes the requirement for
>> > > >   sorting later in the driver.
>> > > > * Remove hidden "-fPIC" and "-fPIE" from the driver.
>> > > > * Check for a race condition in additional_asm_file, as it
>> > > >   is created later in the compilation process.
>> > > > * Move all array allocations of size greater than or equal
>> > > >   to the number of nodes in the graph from the stack to the
>> > > >   heap (i.e. avoid alloca).
>> > > > * Use 64-bit integers when computing sizes.
>> > > > * Check for ways to reduce the time spent inside the kernel.
>> > > > * Add support for every GCC frontend.
>> > > > * Rely on ipa_summary instead of setting a custom cost for
>> > > >   nodes without a summary.
>> > > > * Merge the current partitioning logic with the LTO logic.
>> > > > * Merge the compute_boundary function with what is already
>> > > >   done in LTO.
>> > > >
>> > > > Cosmetic:
>> > > >
>> > > > * Rename n to num_nodes where it refers to the number of
>> > > >   nodes in the graph.
>> > > > * Better explain what the compression[] array does in
>> > > >   lto-partition.c.
>> > > >
>> > > > [1] - https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552346.html
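[Editorial aside: the jobserver handshake the report describes follows GNU Make's classic pipe-of-tokens protocol, which can be modeled roughly as below. The pipe setup and token loop are the standard scheme; the function names and the sequential loop are simplifications made here, not the code in gcc/jobserver.cc.]

```python
# Model of the Make jobserver protocol: Make pre-loads a pipe with one
# token per available job slot; a client blocks reading a token before
# starting a job and writes the token back when the job finishes.
import os

def make_jobserver(extra_jobs):
    # What Make does at startup: create the token pipe.
    r, w = os.pipe()
    os.write(w, b"+" * extra_jobs)   # one token per additional job slot
    return r, w

def run_with_jobserver(r, w, partitions):
    done = []
    for part in partitions:
        token = os.read(r, 1)        # blocks until a slot is free
        done.append(part)            # stand-in for running a compile job
        os.write(w, token)           # return the token to Make
    return done

r, w = make_jobserver(extra_jobs=2)
print(run_with_jobserver(r, w, ["part0", "part1", "part2"]))
```

In the real compiler the job would be the forked per-partition compilation rather than this sequential loop.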