Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

2020-08-31 Thread Richard Biener via Gcc
On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi
 wrote:
>
> Hi,
>
> This is the final report of the "Automatic Parallel Compilation
> Viability" project.  Please notice that this report is pretty
> similar to the one delivered for the 2nd evaluation, as this phase
> consisted mostly of rebasing and bug fixing.
>
> Please reply to this message with any questions or suggestions.

Thank you for your great work Giuliano!

It's odd that LTO emulated parallelism is winning here,
I'd have expected it to be slower.  One factor might
be different partitioning choices and the other might
be that the I/O required is faster than the GC induced
COW overhead after forking.  Note you can optimize
one COW operation by re-using the main process for
compiling the last partition.  I suppose you tested
this on a system with a fast SSD so I/O overhead is
small?

Thanks again,
Richard.

> Thank you,
> Giuliano.
>
> --- 8< ---
>
> # Automatic Parallel Compilation Viability: Final Report
>
> ## Complete Tasks
>
> For the third evaluation, we expected to deliver the product as a
> series of patches for trunk.  The patch series was in fact delivered
> [1], but several items must be fixed before it can be merged.
>
>
> Overall, the project works and speedups range from 0.95x to 3.3x.
> Bootstrap works, and therefore this can already be used in an
> experimental state.
>
> ## How to use
>
> 1. Clone the autopar_devel branch:
> ```
> git clone --single-branch --branch devel/autopar_devel \
>   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> ```
> 2. Follow the standard build instructions from the Compiling GCC
> page and install it into some directory. For instance:
>
> ```
> cd gcc_autopar_devel
> mkdir build && cd build
> ../configure --disable-bootstrap --enable-languages=c,c++
> make -j 8
> make DESTDIR=/tmp/gcc11_autopar install
> ```
>
> 3. If you want to test whether your build is working, just launch
> GCC with `-fparallel-jobs=2` when compiling a file with `-c`.
>
> 4. If you want to compile a project that uses GNU Makefiles with this
> version, you must modify the compilation rule command and prepend a
> `+` token to it (so the Make jobserver is passed down). For example,
> in Git's Makefile, change:
> ```
> $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> $(EXTRA_CPPFLAGS) $<
> ```
> to:
> ```
> $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> $(EXTRA_CPPFLAGS) $<
> ```
> as well as point the CC variable to the installed gcc and
> append `-fparallel-jobs=jobserver` to your CFLAGS variable.
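
For illustration, the instructions above amount to an invocation along
these lines (a sketch only: the install prefix assumes the DESTDIR from
step 2 plus GCC's default /usr/local prefix, which may differ on your setup):

```
# after prepending `+` to the compilation rule as shown above:
make -j8 CC=/tmp/gcc11_autopar/usr/local/bin/gcc \
         CFLAGS='-O2 -fparallel-jobs=jobserver'
```
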
>
> ## How the parallelism works in this project
>
> In LTO, the Whole Program Analysis decides how to partition the
> callgraph for running the LTRANS stage in parallel.  This project
> works in a very similar way, but with some changes.
>
> The first was to modify the LTO infrastructure so that it accepts
> compilation without streaming the IR to files.  This avoids an I/O
> overhead when compiling in parallel.
>
> The second was to use a custom partitioner to find which nodes
> should be in the same partition.  This was mainly done to keep COMDAT
> groups together, as well as symbols that are part of other symbols, and even
> private symbols so that we do not output hidden global symbols.
>
> However, experiments showed that keeping private symbols together did
> not yield an interesting speedup on some large files, and therefore
> we implemented two modes of partitioning:
>
> 1. Partition without static promotion. This is the safer method to use,
> as we do not modify symbols in the Compilation Unit. This may lead to
> speedups in files that have multiple entry points with low
> connectivity between them (such as insn-emit.c), but it will not
> provide speedups when this hypothesis does not hold (gimple-match.c is an
> example of this). This is the default mode.
>
> 2. Partition with static promotion to global. This is a more aggressive
> method, as we can decide to promote some functions to global to increase
> parallelism opportunities. This will also change the final assembler name
> of the promoted functions to avoid collisions with functions from other
> Compilation Units. To use this mode, the user has to manually specify
> --param=promote-statics=1, so that they are aware of this behavior.
>
> Currently, partitioning mode 2 does not take the number of nodes to be
> promoted into account.  Implementing this would certainly reduce the
> impact on the produced code.
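
For illustration, the two modes described above are selected on the
command line roughly like this (a sketch reusing the file names mentioned
in the report; `gcc` is the autopar_devel build):

```
gcc -O2 -fparallel-jobs=4 -c insn-emit.c          # mode 1: no static promotion (default)
gcc -O2 -fparallel-jobs=4 --param=promote-statics=1 \
    -c gimple-match.c                             # mode 2: allow promotion of statics
```
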
>
> ## Jobserver Integration
>
> We implemented an interface to communicate with GNU Make's Jobserver
> that is able to detect when the Jobserver is active, thanks to
> Nathan Sidwell. It works as follows:
>
> When -fparallel-jobs=jobserver is provided, GCC will try to detect
> whether there is a running Jobserver we can communicate with. If so,
> we return the token that Make originally gave to us, then we wait for
> Make to hand out a new token which, when provided, will launch a forked
> child process with 

Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

2020-08-31 Thread Jan Hubicka
> On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi
>  wrote:
> >
> > Hi,
> >
> > This is the final report of the "Automatic Parallel Compilation
> > Viability" project.  Please notice that this report is pretty
> > similar to the delivered from the 2nd evaluation, as this phase
> > consisted of mostly rebasing and bug fixing.
> >
> > Please, reply this message for any question or suggestion.
> 
> Thank you for your great work Giuliano!

Indeed, it is quite amazing work :)
> 
> It's odd that LTO emulated parallelism is winning here,
> I'd have expected it to be slower.  One factor might
> be different partitioning choices and the other might
> be that the I/O required is faster than the GC induced
> COW overhead after forking.  Note you can optimize
> one COW operation by re-using the main process for
> compiling the last partition.  I suppose you tested
> this on a system with a fast SSD so I/O overhead is
> small?

At the time I implemented fork-based parallelism for WPA (which I think
we could recover by generalizing Giuliano's patches a bit), I had the same
outcome: forked ltranses were simply running slower than those run after
streaming.  This was however tested on Firefox, by my estimate sometime
around 2013. I never tried it on units comparable to insn-emit (which
would have been different at that time anyway). I was mostly aiming to get
it fully transparent with streaming but never quite finished it since, at
that time, I thought the time was better spent on optimizing the LTO data
layout.

I suppose we want to keep both mechanisms in both WPA and normal
compilation and make the compiler choose the fitting one.

Honza
> 
> Thanks again,
> Richard.
> 
> > Thank you,
> > Giuliano.
> >
> > --- 8< ---
> >
> > # Automatic Parallel Compilation Viability: Final Report
> >
> > ## Complete Tasks
> >
> > For the third evaluation, we expected to deliver the product as a
> > series of patches for trunk.  The patch series were in fact delivered
> > [1], but several items must be fixed before merge.
> >
> >
> > Overall, the project works and speedups ranges from 0.95x to 3.3x.
> > Bootstrap is working, and therefore this can be used in an experimental
> > state.
> >
> > ## How to use
> >
> > 1. Clone the autopar_devel branch:
> > ```
> > git clone --single-branch --branch devel/autopar_devel \
> >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> > ```
> > 2. Follow the standard compilation options provided in the Compiling
> > GCC page, and install it on some directory. For instance:
> >
> > ```
> > cd gcc_autopar_devel
> > mkdir build && cd build
> > ../configure --disable-bootstrap --enable-languages=c,c++
> > make -j 8
> > make DESTDIR=/tmp/gcc11_autopar install
> > ```
> >
> > 3. If you want to test whether your version is working, just launch
> > Gcc with `-fparallel-jobs=2` when compiling a file with -c.
> >
> > 5. If you want to compile a project with this version it uses GNU
> > Makefiles, you must modify the compilation rule command and prepend a
> > `+` token to it. For example, in Git's Makefile, Change:
> > ```
> > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > $(EXTRA_CPPFLAGS) $<
> > ```
> > to:
> > ```
> > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > $(EXTRA_CPPFLAGS) $<
> > ```
> > as well as point the CC variable to the installed gcc, and
> > append a `-fparallel-jobs=jobserver` on your CFLAGS variable.
> >
> > # How the parallelism works in this project
> >
> > In LTO, the Whole Program Analysis decides how to partition the
> > callgraph for running the LTRANS stage in parallel.  This project
> > works very similar to this, however with some changes.
> >
> > The first was to modify the LTO structure so that it accepts
> > the compilation without IR streaming to files.  This avoid an IO
> > overhead when compiling in parallel.
> >
> > The second was to use a custom partitioner to find which nodes
> > should be in the same partition.  This was mainly done to bring COMDAT
> > together, as well as symbols that are part of other symbols, and even
> > private symbols so that we do not output hidden global symbols.
> >
> > However, experiment showed that bringing private symbols together did
> > not yield a interesting speedup on some large files, and therefore
> > we implemented two modes of partitioning:
> >
> > 1. Partition without static promotion. This is the safer method to use,
> > as we do not modify symbols in the Compilation Unit. This may lead to
> > speedups in files that have multiple entries points with low
> > connectivity between then (such as insn-emit.c), however this will not
> > provide speedups when this hypothesis is not true (gimple-match.c is an
> > example of this). This is the default mode.
> >
> > 2. Partition with static promotion to global. This is a more aggressive
> > method, as we can decide to promote some functions to global to increase

Re: LTO slows down calculix by more than 10% on aarch64

2020-08-31 Thread Prathamesh Kulkarni via Gcc
On Fri, 28 Aug 2020 at 17:27, Richard Biener  wrote:
>
> On Fri, Aug 28, 2020 at 1:17 PM Prathamesh Kulkarni
>  wrote:
> >
> > On Wed, 26 Aug 2020 at 16:50, Richard Biener  
> > wrote:
> > >
> > > On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
> > >  wrote:
> > > >
> > > > Hi,
> > > > We're seeing a consistent regression >10% on calculix with -O2 -flto vs 
> > > > -O2
> > > > on aarch64 in our validation CI. I tried to investigate this issue a
> > > > bit, and it seems the regression comes from inlining of orthonl into
> > > > e_c3d. Disabling that brings back the performance. However, inlining
> > > > orthonl into e_c3d increases its size from 3187 to 3837, by around
> > > > 16.9%, which isn't too large.
> > > >
> > > > I have attached two test-cases, e_c3d.f that has orthonl manually
> > > > inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> > > > which contains unmodified function.
> > > > (gauss.f is included by e_c3d.f). For reproducing, just passing -O2 is
> > > > sufficient.
> > > >
> > > > It seems that inlining orthonl causes 20 hoistings into block 181,
> > > > which are then hoisted to block 173, in particular hoistings of
> > > > w(1, 1) ... w(3, 3), which weren't
> > > > possible without inlining. The hoistings happen because the basic block
> > > > that computes orthonl at line 672 has w(1, 1) ... w(3, 3) and the
> > > > following block at line 1035 in e_c3d.f:
> > > >
> > > > senergy=
> > > >  &(s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > >  &+s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > >  &+s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > >
> > > > Disabling hoisting into blocks 173 (and 181) brings back most of the
> > > > performance. I am not able to understand why (or whether) these hoistings of
> > > > w(1, 1) ...
> > > > w(3, 3) are causing the slowdown, however. Looking at the assembly, the hot
> > > > code path from perf in e_c3d shows the following code-gen diff:
> > > > For the inlined version:
> > > > .L122:
> > > > ldr d15, [x1, -248]
> > > > add w0, w0, 1
> > > > add x2, x2, 24
> > > > add x1, x1, 72
> > > > fmul    d15, d17, d15
> > > > fmul    d15, d15, d18
> > > > fmul    d14, d15, d14
> > > > fmadd   d16, d14, d31, d16
> > > > cmp w0, 4
> > > > beq .L121
> > > > ldr d14, [x2, -8]
> > > > b   .L122
> > > >
> > > > and for non-inlined version:
> > > > .L118:
> > > > ldr d0, [x1, -248]
> > > > add w0, w0, 1
> > > > ldr d2, [x2, -8]
> > > > add x1, x1, 72
> > > > add x2, x2, 24
> > > > fmul    d0, d3, d0
> > > > fmul    d0, d0, d5
> > > > fmul    d0, d0, d2
> > > > fmadd   d1, d4, d0, d1
> > > > cmp w0, 4
> > > > bne .L118
> > >
> > > I wonder if you have profiles.  The inlined version has a
> > > non-empty latch block (looks like some PRE is happening
> > > there?).  Possibly your uarch does not like the closely spaced
> > > branches (does your assembly show the layout as it is?)?
> > Hi Richard,
> > I have uploaded profiles obtained by perf here:
> > -O2: https://people.linaro.org/~prathamesh.kulkarni/o2_perf.data
> > -O2 -flto: https://people.linaro.org/~prathamesh.kulkarni/o2_lto_perf.data
> >
> > For the above loop, it shows the following:
> > -O2:
> >   0.01 │ f1c:   ldur   d0, [x1, #-248]
> >   3.53 │        add    w0, w0, #0x1
> >        │        ldur   d2, [x2, #-8]
> >   3.54 │        add    x1, x1, #0x48
> >        │        add    x2, x2, #0x18
> >   5.89 │        fmul   d0, d3, d0
> >  14.12 │        fmul   d0, d0, d5
> >  14.14 │        fmul   d0, d0, d2
> >  14.13 │        fmadd  d1, d4, d0, d1
> >   0.00 │        cmp    w0, #0x4
> >   3.52 │      ↑ b.ne   f1c
> >
> > -O2 -flto:
> >   5.47  │ 1124:  ldur   d15, [x1, #-248]
> >   2.19  │        add    w0, w0, #0x1
> >   1.10  │        add    x2, x2, #0x18
> >   2.18  │        add    x1, x1, #0x48
> >   4.37  │        fmul   d15, d17, d15
> >  13.13  │        fmul   d15, d15, d18
> >  13.13  │        fmul   d14, d15, d14
> >  13.14  │        fmadd  d16, d14, d31, d16
> >         │        cmp    w0, #0x4
> >   3.28  │      ↓ b.eq   1154
> >   0.00  │        ldur   d14, [x2, #-8]
> >   2.19  │      ↑ b      1124
> >
> > IIUC, the biggest relative difference comes from the load [x1, #-248],
> > which in LTO's case takes 5.47% of overall samples:
> >   5.47  │ 1124:  ldur   d15, [x1, #-248]
> > while in the case of -O2 it's just 0.01%:
> >   0.01  │  f1c:  ldur   d0, [x1, #-248]
> >
> > I wonder if that's (one of) the main factor(s) behind the slowdown, or
> > whether it's not too relevant?
>
> This looks more like it's the branch, since branch costs are usually
> attributed to the target rather than the branch itself.  You could
> try re-ordering the code so the loop entry jumps around the
> latch, which can

Re: LTO slows down calculix by more than 10% on aarch64

2020-08-31 Thread Prathamesh Kulkarni via Gcc
On Fri, 28 Aug 2020 at 17:33, Alexander Monakov  wrote:
>
> On Fri, 28 Aug 2020, Prathamesh Kulkarni via Gcc wrote:
>
> > I wonder if that's (one of) the main factor(s) behind slowdown or it's
> > not too relevant ?
>
> Probably not. Some advice to make your search more directed:
>
> Pass '-n' to 'perf report'. Relative sample ratios are hard to reason about
> when they are computed against different bases; it's much easier to see that
> a loop is slowing down if it went from 4000 to 4500 in absolute sample count
> as opposed to 90% to 91% in relative sample ratio.
>
> Before diving into 'perf report', be sure to fully account for differences
> in 'perf stat' output. Do the programs execute the same number of
> instructions,
> so the difference is only in scheduling? Do the programs suffer from the same
> number of branch mispredictions? Please show the output of 'perf stat' on the
> mailing list too, so everyone is on the same page about that.
>
> I also suspect that the dramatic slowdown has to do with the extra branch.
> Your CPU might have some specialized counters for branch prediction, see
> 'perf list'.
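
Spelled out, the checks suggested above look roughly like this (the binary
names are placeholders and the branch-related event names vary by CPU, so
treat this only as a starting point):

```
perf stat -e instructions,branches,branch-misses ./calculix-o2
perf stat -e instructions,branches,branch-misses ./calculix-o2-lto
perf report -n -i o2_lto_perf.data   # -n: absolute sample counts
perf list | grep -i branch           # look for uarch-specific branch events
```
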
Hi Alexander,
Thanks for the suggestions! I am in the process of doing the
benchmarking experiments,
and will post the results soon.

Thanks,
Prathamesh
>
> Alexander


Re: LTO slows down calculix by more than 10% on aarch64

2020-08-31 Thread Jan Hubicka
> Thanks for the suggestions.
> Is it possible to modify the assembly files emitted after the ltrans phase?
> IIUC, the linker invokes lto1 twice, for wpa and ltrans, and then links
> the obtained object files, which doesn't make it possible to hand-edit
> assembly files post-ltrans?
> In particular, I wanted to modify calculix.ltrans16.ltrans.s, which
> contains e_c3d to avoid the extra branch.
> (If that doesn't work out, I can proceed with manually inlining in the
> source and then modifying generated assembly).

It is not intended to work that way, but for a smaller benchmark you can
just keep the .s files, modify them and then compile them again with
gfortran *.s or so.
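
A sketch of that workflow, assuming -save-temps is passed at link time to
keep the per-partition assembly (the flags and file names are guesses based
on the names mentioned earlier in the thread):

```
gfortran -O2 -flto -save-temps -o calculix *.f   # keeps calculix.ltrans*.ltrans.s
$EDITOR calculix.ltrans16.ltrans.s               # hand-edit the e_c3d loop
gfortran -O2 -o calculix *.ltrans.s              # reassemble and relink the edited files
```
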

Honza
> 
> Thanks,
> Prathamesh
> >
> > Richard.
> >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > > which corresponds to the following loop in line 1014.
> > > > > do n1=1,3
> > > > >   s(iii1,jjj1)=s(iii1,jjj1)
> > > > >  &  +anisox(m1,k1,n1,l1)
> > > > >  &  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> > > > >  &  *weight
> > > > >
> > > > > I am not sure why hoisting would have any direct effect on this loop,
> > > > > except perhaps that hoisting allocated more registers and led to
> > > > > increased register pressure. Perhaps that's why it's using higher-
> > > > > numbered regs for code-gen in the inlined version? However, disabling
> > > > > hoisting in blocks 173 and 181 also leads to 6 extra spills overall
> > > > > (by grepping for str to sp), so
> > > > > hoisting is also helping here? I am not sure how to proceed further,
> > > > > and would be grateful for suggestions.
> > > > >
> > > > > Thanks,
> > > > > Prathamesh


Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

2020-08-31 Thread Richard Biener via Gcc
On Mon, Aug 31, 2020 at 1:15 PM Jan Hubicka  wrote:
>
> > On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi
> >  wrote:
> > >
> > > Hi,
> > >
> > > This is the final report of the "Automatic Parallel Compilation
> > > Viability" project.  Please notice that this report is pretty
> > > similar to the delivered from the 2nd evaluation, as this phase
> > > consisted of mostly rebasing and bug fixing.
> > >
> > > Please, reply this message for any question or suggestion.
> >
> > Thank you for your great work Giuliano!
>
> Indeed, it is quite amazing work :)
> >
> > It's odd that LTO emulated parallelism is winning here,
> > I'd have expected it to be slower.  One factor might
> > be different partitioning choices and the other might
> > be that the I/O required is faster than the GC induced
> > COW overhead after forking.  Note you can optimize
> > one COW operation by re-using the main process for
> > compiling the last partition.  I suppose you tested
> > this on a system with a fast SSD so I/O overhead is
> > small?
>
> At the time I implemented fork based parallelism for WPA (which I think
> we could recover by bit generalizing guiliano's patches), I had same
> outcome: forked ltranses was simply running slower than those after
> streaming.  This was however tested on Firefox in my estimate sometime
> around 2013. I never tried it on units comparable to insn-emit (which
> would be differnt at that time anyway). I was mostly aiming to get it
> fully transparent with streaming but never quite finished it since, at
> that time, it I tought time is better spent on optimizing LTO data
> layout.
>
> I suppose we want to keep both mechanizms in both WPA and normal
> compilation and make compiler to choose fitting one.

I repeated Giuliano's experiment on gimple-match.ii: producing
the LTO bytecode takes 5.3s, and the link step takes
9.5s with two jobs, 6.6s with three, 5.0s with four
and 2.4s with 32.

With -fparallel-jobs=N and --param promote-statics=1 I
see 14.8s, 13.9s and 13.5s here.  With 8 jobs the reduction
is to 11s.
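
The -fparallel-jobs side of that comparison can be reproduced roughly like
this (a sketch; it assumes a saved preprocessed gimple-match.ii and the
autopar_devel g++ in PATH):

```
time g++ -O2 -fparallel-jobs=2 --param promote-statics=1 -c gimple-match.ii
time g++ -O2 -fparallel-jobs=4 --param promote-statics=1 -c gimple-match.ii
time g++ -O2 -fparallel-jobs=8 --param promote-statics=1 -c gimple-match.ii
```
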

It looks like LTO much more aggressively partitions
this - I see 36 partitions generated for gimple-match.c
while -fparallel-jobs creates "only" 27.  -fparallel-jobs
doesn't seem to honor the various lto-partition
--params btw?  The relevant ones would be
--param lto-partitions (the max. number of partitions
to generate) and --param lto-min-partition
(the minimum size for a partition).  I always thought
that lto-min-partition should be higher for
-fparallel-jobs (which I envisioned to be enabled by
default).

I guess before investigating the current state in detail
it might be worth exploring Honza's wish of sharing
the actual partitioning code between LTO and -fparallel-jobs.

Note that larger objects take a bigger hit from the GC COW
issue so at some point that becomes dominant because the
first GC walk for each partition is the same as doing a GC
walk for the whole object.  Eventually it makes sense to
turn off GC completely for smaller partitions.

Richard.

> Honza
> >
> > Thanks again,
> > Richard.
> >
> > > Thank you,
> > > Giuliano.
> > >
> > > --- 8< ---
> > >
> > > # Automatic Parallel Compilation Viability: Final Report
> > >
> > > ## Complete Tasks
> > >
> > > For the third evaluation, we expected to deliver the product as a
> > > series of patches for trunk.  The patch series were in fact delivered
> > > [1], but several items must be fixed before merge.
> > >
> > >
> > > Overall, the project works and speedups ranges from 0.95x to 3.3x.
> > > Bootstrap is working, and therefore this can be used in an experimental
> > > state.
> > >
> > > ## How to use
> > >
> > > 1. Clone the autopar_devel branch:
> > > ```
> > > git clone --single-branch --branch devel/autopar_devel \
> > >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> > > ```
> > > 2. Follow the standard compilation options provided in the Compiling
> > > GCC page, and install it on some directory. For instance:
> > >
> > > ```
> > > cd gcc_autopar_devel
> > > mkdir build && cd build
> > > ../configure --disable-bootstrap --enable-languages=c,c++
> > > make -j 8
> > > make DESTDIR=/tmp/gcc11_autopar install
> > > ```
> > >
> > > 3. If you want to test whether your version is working, just launch
> > > Gcc with `-fparallel-jobs=2` when compiling a file with -c.
> > >
> > > 5. If you want to compile a project with this version it uses GNU
> > > Makefiles, you must modify the compilation rule command and prepend a
> > > `+` token to it. For example, in Git's Makefile, Change:
> > > ```
> > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > > $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > > $(EXTRA_CPPFLAGS) $<
> > > ```
> > > to:
> > > ```
> > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > > +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > > $(EXTRA_CPPFLAGS) $<
> > > ```
> > > as well as point the CC variable to the installed gcc, and
> > > append a `-fparallel-jobs

Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

2020-08-31 Thread Giuliano Belinassi via Gcc
Hi, Richi.

On 08/31, Richard Biener wrote:
> On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi
>  wrote:
> >
> > Hi,
> >
> > This is the final report of the "Automatic Parallel Compilation
> > Viability" project.  Please notice that this report is pretty
> > similar to the delivered from the 2nd evaluation, as this phase
> > consisted of mostly rebasing and bug fixing.
> >
> > Please, reply this message for any question or suggestion.
> 
> Thank you for your great work Giuliano!

Thank you :)

> 
> It's odd that LTO emulated parallelism is winning here,
> I'd have expected it to be slower.  One factor might
> be different partitioning choices and the other might
> be that the I/O required is faster than the GC induced
> COW overhead after forking.  Note you can optimize
> one COW operation by re-using the main process for
> compiling the last partition.  I suppose you tested
> this on a system with a fast SSD so I/O overhead is
> small?

The Core i7 machine runs on a fast NVMe SSD.  The Opteron machine seems
to run on a RAID setup with conventional HDDs, but I am not sure how it
is configured, as I have no physical access to that machine.

Thank you,
Giuliano

> 
> Thanks again,
> Richard.
> 
> > Thank you,
> > Giuliano.
> >
> > --- 8< ---
> >
> > # Automatic Parallel Compilation Viability: Final Report
> >
> > ## Complete Tasks
> >
> > For the third evaluation, we expected to deliver the product as a
> > series of patches for trunk.  The patch series were in fact delivered
> > [1], but several items must be fixed before merge.
> >
> >
> > Overall, the project works and speedups ranges from 0.95x to 3.3x.
> > Bootstrap is working, and therefore this can be used in an experimental
> > state.
> >
> > ## How to use
> >
> > 1. Clone the autopar_devel branch:
> > ```
> > git clone --single-branch --branch devel/autopar_devel \
> >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> > ```
> > 2. Follow the standard compilation options provided in the Compiling
> > GCC page, and install it on some directory. For instance:
> >
> > ```
> > cd gcc_autopar_devel
> > mkdir build && cd build
> > ../configure --disable-bootstrap --enable-languages=c,c++
> > make -j 8
> > make DESTDIR=/tmp/gcc11_autopar install
> > ```
> >
> > 3. If you want to test whether your version is working, just launch
> > Gcc with `-fparallel-jobs=2` when compiling a file with -c.
> >
> > 5. If you want to compile a project with this version it uses GNU
> > Makefiles, you must modify the compilation rule command and prepend a
> > `+` token to it. For example, in Git's Makefile, Change:
> > ```
> > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > $(EXTRA_CPPFLAGS) $<
> > ```
> > to:
> > ```
> > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > $(EXTRA_CPPFLAGS) $<
> > ```
> > as well as point the CC variable to the installed gcc, and
> > append a `-fparallel-jobs=jobserver` on your CFLAGS variable.
> >
> > # How the parallelism works in this project
> >
> > In LTO, the Whole Program Analysis decides how to partition the
> > callgraph for running the LTRANS stage in parallel.  This project
> > works very similar to this, however with some changes.
> >
> > The first was to modify the LTO structure so that it accepts
> > the compilation without IR streaming to files.  This avoid an IO
> > overhead when compiling in parallel.
> >
> > The second was to use a custom partitioner to find which nodes
> > should be in the same partition.  This was mainly done to bring COMDAT
> > together, as well as symbols that are part of other symbols, and even
> > private symbols so that we do not output hidden global symbols.
> >
> > However, experiment showed that bringing private symbols together did
> > not yield a interesting speedup on some large files, and therefore
> > we implemented two modes of partitioning:
> >
> > 1. Partition without static promotion. This is the safer method to use,
> > as we do not modify symbols in the Compilation Unit. This may lead to
> > speedups in files that have multiple entries points with low
> > connectivity between then (such as insn-emit.c), however this will not
> > provide speedups when this hypothesis is not true (gimple-match.c is an
> > example of this). This is the default mode.
> >
> > 2. Partition with static promotion to global. This is a more aggressive
> > method, as we can decide to promote some functions to global to increase
> > parallelism opportunity. This also will change the final assembler name
> > of the promoted function to avoid collision with functions of others
> > Compilation Units. To use this mode, the user has to manually specify
> > --param=promote-statics=1, as they must be aware of this.
> >
> > Currently, partitioner 2. do not account the number of nodes to be
> > promoted.  Implementing this certainly will reduce impact on produced
> > code.

Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

2020-08-31 Thread Giuliano Belinassi via Gcc
Hi, Richi.

On 08/31, Richard Biener wrote:
> On Mon, Aug 31, 2020 at 1:15 PM Jan Hubicka  wrote:
> >
> > > On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > This is the final report of the "Automatic Parallel Compilation
> > > > Viability" project.  Please notice that this report is pretty
> > > > similar to the delivered from the 2nd evaluation, as this phase
> > > > consisted of mostly rebasing and bug fixing.
> > > >
> > > > Please, reply this message for any question or suggestion.
> > >
> > > Thank you for your great work Giuliano!
> >
> > Indeed, it is quite amazing work :)
> > >
> > > It's odd that LTO emulated parallelism is winning here,
> > > I'd have expected it to be slower.  One factor might
> > > be different partitioning choices and the other might
> > > be that the I/O required is faster than the GC induced
> > > COW overhead after forking.  Note you can optimize
> > > one COW operation by re-using the main process for
> > > compiling the last partition.  I suppose you tested
> > > this on a system with a fast SSD so I/O overhead is
> > > small?
> >
> > At the time I implemented fork based parallelism for WPA (which I think
> > we could recover by bit generalizing guiliano's patches), I had same
> > outcome: forked ltranses was simply running slower than those after
> > streaming.  This was however tested on Firefox in my estimate sometime
> > around 2013. I never tried it on units comparable to insn-emit (which
> > would be differnt at that time anyway). I was mostly aiming to get it
> > fully transparent with streaming but never quite finished it since, at
> > that time, it I tought time is better spent on optimizing LTO data
> > layout.
> >
> > I suppose we want to keep both mechanizms in both WPA and normal
> > compilation and make compiler to choose fitting one.
> 
> I repeated Giulianos experiment on gimple-match.ii and
> producing LTO bytecode takes 5.3s and the link step
> 9.5s with two jobs, 6.6s with three and 5.0s with four
> and 2.4s with 32.
> 
> With -fparallel-jobs=N and --param promote-statics=1 I
> see 14.8s, 13.9s and 13.5s here.  With 8 jobs the reduction
> is to 11s.
> 
> It looks like LTO much more aggressively partitions
> this - I see 36 partitions generated for gimple-match.c
> while -fparallel-jobs creates "only" 27.  -fparallel-jobs
> doesn't seem to honor the various lto-partition
> --params btw?  The relevant ones would be
> --param lto-partitions (the max. number of partitions
> to generate) and --param lto-min-partition
> (the minimum size for a partition).  I always thought
> that lto-min-partition should be higher for
> -fparallel-jobs (which I envisioned to be enabled by
> default).

There is a partition balancing mechanism that can be disabled
with --param=balance-partitions=0.

Assuming that you used -fparallel-jobs=32, it may be possible
that it merged small partitions until it reached the average
size of max_size / 33, which resulted in 27 partitions.

The only LTO parameter that I use is --param=lto-min-partition,
which controls the minimum size at which it will run
in parallel.
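
For reference, the knobs mentioned in this subthread combine on the command
line like this (the values are only illustrative):

```
gcc -O2 -fparallel-jobs=8 \
    --param=promote-statics=1 \
    --param=balance-partitions=0 \
    --param=lto-min-partition=10000 \
    -c gimple-match.c
```
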

> 
> I guess before investigating the current state in detail
> it might be worth exploring Honzas wish of sharing
> the actual partitioning code between LTO and -fparallel-jobs.
> 
> Note that larger objects take a bigger hit from the GC COW
> issue so at some point that becomes dominant because the
> first GC walk for each partition is the same as doing a GC
> walk for the whole object.  Eventually it makes sense to
> turn off GC completely for smaller partitions.

Just a side note: I added a ggc_collect () before starting to fork
and it did not improve things.

Thank you,
Giuliano.

> 
> Richard.
> 
> > Honza
> > >
> > > Thanks again,
> > > Richard.
> > >
> > > > Thank you,
> > > > Giuliano.
> > > >
> > > > --- 8< ---
> > > >
> > > > # Automatic Parallel Compilation Viability: Final Report
> > > >
> > > > ## Complete Tasks
> > > >
> > > > For the third evaluation, we expected to deliver the product as a
> > > > series of patches for trunk.  The patch series were in fact delivered
> > > > [1], but several items must be fixed before merge.
> > > >
> > > >
> > > > Overall, the project works and speedups ranges from 0.95x to 3.3x.
> > > > Bootstrap is working, and therefore this can be used in an experimental
> > > > state.
> > > >
> > > > ## How to use
> > > >
> > > > 1. Clone the autopar_devel branch:
> > > > ```
> > > > git clone --single-branch --branch devel/autopar_devel \
> > > >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> > > > ```
> > > > 2. Follow the standard compilation options provided in the Compiling
> > > > GCC page, and install it on some directory. For instance:
> > > >
> > > > ```
> > > > cd gcc_autopar_devel
> > > > mkdir build && cd build
> > > > ../configure --disable-bootstrap --enable-languages=c,c++
> > > > make -j 8
> > > > make DESTDIR=/tmp/gcc11_autopar install
> > > > ```
> > > >
> >

Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

2020-08-31 Thread Jan Hubicka
> > I guess before investigating the current state in detail
> > it might be worth exploring Honzas wish of sharing
> > the actual partitioning code between LTO and -fparallel-jobs.
> > 
> > Note that larger objects take a bigger hit from the GC COW
> > issue so at some point that becomes dominant because the
> > first GC walk for each partition is the same as doing a GC
> > walk for the whole object.  Eventually it makes sense to
> > turn off GC completely for smaller partitions.
> 
> Just a side note, I added a ggc_collect () before start forking
> and it did not improve things.

We were discussing it on IRC. Your problem is similar to what precompiled
headers are hitting, too. They essentially make a memory dump of GGC
memory that is mmapped back at the time the compiler recognizes the
corresponding #include.

There is a mechanism in the garbage collector to push and pop a context,
which was partly removed in the meantime.  Pushing a context makes all
data in the garbage collector permanent until you pop it again, and the
marking is not supposed to dirty the pages.  It is still done for PCH;
see ggc_pch_read and the line setting context_depth to 1.

In theory, setting context_depth to 1 before forking and fixing the places
where we dirty the pages should help both the fork-based parallelism and
the PCHs. As Richi mentioned, the ggc seems to set in_use_p anyway, which
may trigger unwanted COW.

I would also trim the memory pool and malloc before forking; that should
be cheap and work. It is already done by WPA before the fork as well.

Honza
> 
> Thank you,
> Giuliano.
> 
> > 
> > Richard.
> > 
> > > Honza
> > > >
> > > > Thanks again,
> > > > Richard.
> > > >
> > > > > Thank you,
> > > > > Giuliano.
> > > > >
> > > > > --- 8< ---
> > > > >
> > > > > # Automatic Parallel Compilation Viability: Final Report
> > > > >
> > > > > ## Complete Tasks
> > > > >
> > > > > For the third evaluation, we expected to deliver the product as a
> > > > > series of patches for trunk.  The patch series were in fact delivered
> > > > > [1], but several items must be fixed before merge.
> > > > >
> > > > >
> > > > > Overall, the project works and speedups ranges from 0.95x to 3.3x.
> > > > > Bootstrap is working, and therefore this can be used in an 
> > > > > experimental
> > > > > state.
> > > > >
> > > > > ## How to use
> > > > >
> > > > > 1. Clone the autopar_devel branch:
> > > > > ```
> > > > > git clone --single-branch --branch devel/autopar_devel \
> > > > >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> > > > > ```
> > > > > 2. Follow the standard compilation options provided in the Compiling
> > > > > GCC page, and install it on some directory. For instance:
> > > > >
> > > > > ```
> > > > > cd gcc_autopar_devel
> > > > > mkdir build && cd build
> > > > > ../configure --disable-bootstrap --enable-languages=c,c++
> > > > > make -j 8
> > > > > make DESTDIR=/tmp/gcc11_autopar install
> > > > > ```
> > > > >
> > > > > 3. If you want to test whether your version is working, just launch
> > > > > Gcc with `-fparallel-jobs=2` when compiling a file with -c.
> > > > >
> > > > > 5. If you want to compile a project with this version it uses GNU
> > > > > Makefiles, you must modify the compilation rule command and prepend a
> > > > > `+` token to it. For example, in Git's Makefile, Change:
> > > > > ```
> > > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > > > > $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > > > > $(EXTRA_CPPFLAGS) $<
> > > > > ```
> > > > > to:
> > > > > ```
> > > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > > > > +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > > > > $(EXTRA_CPPFLAGS) $<
> > > > > ```
> > > > > as well as point the CC variable to the installed gcc, and
> > > > > append a `-fparallel-jobs=jobserver` on your CFLAGS variable.
> > > > >
> > > > > # How the parallelism works in this project
> > > > >
> > > > > In LTO, the Whole Program Analysis decides how to partition the
> > > > > callgraph for running the LTRANS stage in parallel.  This project
> > > > > works very similar to this, however with some changes.
> > > > >
> > > > > The first was to modify the LTO structure so that it accepts
> > > > > the compilation without IR streaming to files.  This avoid an IO
> > > > > overhead when compiling in parallel.
> > > > >
> > > > > The second was to use a custom partitioner to find which nodes
> > > > > should be in the same partition.  This was mainly done to bring COMDAT
> > > > > together, as well as symbols that are part of other symbols, and even
> > > > > private symbols so that we do not output hidden global symbols.
> > > > >
> > > > > However, experiment showed that bringing private symbols together did
> > > > > not yield a interesting speedup on some large files, and therefore
> > > > > we implemented two modes of partitioning:
> > > > >
> > > > > 1. Partition without static promotion. This is the safer method to 
> > > > > use,
> > > > > as we do not modif

Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

2020-08-31 Thread Richard Biener via Gcc
On August 31, 2020 6:21:27 PM GMT+02:00, Giuliano Belinassi 
 wrote:
>Hi, Richi.
>
>On 08/31, Richard Biener wrote:
>> On Mon, Aug 31, 2020 at 1:15 PM Jan Hubicka  wrote:
>> >
>> > > On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi
>> > >  wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > This is the final report of the "Automatic Parallel Compilation
>> > > > Viability" project.  Please notice that this report is pretty
>> > > > similar to the delivered from the 2nd evaluation, as this phase
>> > > > consisted of mostly rebasing and bug fixing.
>> > > >
>> > > > Please, reply this message for any question or suggestion.
>> > >
>> > > Thank you for your great work Giuliano!
>> >
>> > Indeed, it is quite amazing work :)
>> > >
>> > > It's odd that LTO emulated parallelism is winning here,
>> > > I'd have expected it to be slower.  One factor might
>> > > be different partitioning choices and the other might
>> > > be that the I/O required is faster than the GC induced
>> > > COW overhead after forking.  Note you can optimize
>> > > one COW operation by re-using the main process for
>> > > compiling the last partition.  I suppose you tested
>> > > this on a system with a fast SSD so I/O overhead is
>> > > small?
>> >
>> > At the time I implemented fork based parallelism for WPA (which I
>think
>> > we could recover by bit generalizing guiliano's patches), I had
>same
>> > outcome: forked ltranses was simply running slower than those after
>> > streaming.  This was however tested on Firefox in my estimate
>sometime
>> > around 2013. I never tried it on units comparable to insn-emit
>(which
>> > would be differnt at that time anyway). I was mostly aiming to get
>it
>> > fully transparent with streaming but never quite finished it since,
>at
>> > that time, it I tought time is better spent on optimizing LTO data
>> > layout.
>> >
>> > I suppose we want to keep both mechanizms in both WPA and normal
>> > compilation and make compiler to choose fitting one.
>> 
>> I repeated Giulianos experiment on gimple-match.ii and
>> producing LTO bytecode takes 5.3s and the link step
>> 9.5s with two jobs, 6.6s with three and 5.0s with four
>> and 2.4s with 32.
>> 
>> With -fparallel-jobs=N and --param promote-statics=1 I
>> see 14.8s, 13.9s and 13.5s here.  With 8 jobs the reduction
>> is to 11s.
>> 
>> It looks like LTO much more aggressively partitions
>> this - I see 36 partitions generated for gimple-match.c
>> while -fparallel-jobs creates "only" 27.  -fparallel-jobs
>> doesn't seem to honor the various lto-partition
>> --params btw?  The relevant ones would be
>> --param lto-partitions (the max. number of partitions
>> to generate) and --param lto-min-partition
>> (the minimum size for a partition).  I always thought
>> that lto-min-partition should be higher for
>> -fparallel-jobs (which I envisioned to be enabled by
>> default).
>
>There is a partition balancing mechanism that can be disabled
>with --param=balance-partitions=0.

Ah, I used =1 for this... 

>Assuming that you used -fparallel-jobs=32, it may be possible
>that it merged small partitions until it reached the average
>size of max_size / 33, which resulted in 27 partitions.

Note that the partitioning shouldn't depend on the argument to -fparallel-jobs 
for the sake of reproducible builds. 

>The only lto parameter that I use is --param=lto-min-partition
>controlling the minimum size in which that it will run
>in parallel. 
>
>> 
>> I guess before investigating the current state in detail
>> it might be worth exploring Honzas wish of sharing
>> the actual partitioning code between LTO and -fparallel-jobs.
>> 
>> Note that larger objects take a bigger hit from the GC COW
>> issue so at some point that becomes dominant because the
>> first GC walk for each partition is the same as doing a GC
>> walk for the whole object.  Eventually it makes sense to
>> turn off GC completely for smaller partitions.
>
>Just a side note, I added a ggc_collect () before start forking
>and it did not improve things.

You need to force a collection; ggc_collect () is usually a no-op. Also see
Honza's response here. Some experiments are needed here.

Richard. 

>Thank you,
>Giuliano.
>
>> 
>> Richard.
>> 
>> > Honza
>> > >
>> > > Thanks again,
>> > > Richard.
>> > >
>> > > > Thank you,
>> > > > Giuliano.
>> > > >
>> > > > --- 8< ---
>> > > >
>> > > > # Automatic Parallel Compilation Viability: Final Report
>> > > >
>> > > > ## Complete Tasks
>> > > >
>> > > > For the third evaluation, we expected to deliver the product as
>a
>> > > > series of patches for trunk.  The patch series were in fact
>delivered
>> > > > [1], but several items must be fixed before merge.
>> > > >
>> > > >
>> > > > Overall, the project works and speedups ranges from 0.95x to
>3.3x.
>> > > > Bootstrap is working, and therefore this can be used in an
>experimental
>> > > > state.
>> > > >
>> > > > ## How to use
>> > > >
>> > > > 1. Clone the autopar_devel branch:
>> > > > ```
>> > > > git clone --sin

Built-in Specs ignored... unless -specs=dump is specified

2020-08-31 Thread Giacomo Tesio
Hello everybody!

To clean up my port of GCC (9.2.0) to Jehanne OS (http://jehanne.io)
I'd like to add a `--posixly` command line option to the gcc driver
that should be expanded to a couple of -isystem and -L options
to ease the compilation of POSIX programs (Jehanne is a Plan 9
derivative, so it's not a POSIX system).

So I defined the macros LINK_SPEC, LIB_SPEC and CPP_SPEC in
gcc/config/jehanne.h (see attachment) to substitute such command line
options as required.

What's weird is that, if I dump the specs file with `gcc -dumpspecs`
and then load it with `gcc -specs=dump` it works like a charm,
properly translating the --posixly option.

However, without the -specs option, it doesn't recognise the option.

I have no idea what I'm missing.
I tried to put the dump into PREFIX/lib/gcc/x86_64-jehanne/9.2.0/specs
(also in the attachments) but it doesn't change anything. 
Yet, by using the -specs= option it works.
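
Condensed, the symptom is the following (file names are made up):

```
gcc -dumpspecs > dump
gcc -specs=dump --posixly -c hello.c   # works: --posixly is translated
gcc --posixly -c hello.c               # the driver does not recognise --posixly
```
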


Am I missing something obvious?

Thanks for your help and sorry if the issue is dumb: I tried my best to
understand the GCC Internals but I was unable to fix this.


Giacomo


specs
Description: Binary data
/*
 * This file is part of Jehanne.
 *
 * Copyright (C) 2016-2020 Giacomo Tesio 
 *
 * Jehanne is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, version 2 of the License.
 *
 * Jehanne is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with Jehanne.  If not, see .
 */

#undef TARGET_JEHANNE
#define TARGET_JEHANNE 1

#undef DRIVER_SELF_SPECS
#define DRIVER_SELF_SPECS "%{-posixly}"

/* Default arguments you want when running x86_64-jehanne-gcc */
#undef LINK_SPEC
#define LINK_SPEC "%{-posixly:-L/posix/lib}"

#undef LIB_SPEC
#define LIB_SPEC "%{-posixly:%{!shared:%{g*:-lg} %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}} -ljehanne"


#undef STANDARD_STARTFILE_PREFIX
#define STANDARD_STARTFILE_PREFIX "/arch/amd64/lib/"

#undef CPP_SPEC
#define CPP_SPEC "%{-posixly:-isystem/posix/include}"


/* Architecture specific header (u.h) goes here (from config.gcc) */
#define ARCH_INCLUDE_DIR NATIVE_SYSTEM_HEADER_DIR 

/* The default include dir is /sys/include */
#define PORTABLE_INCLUDE_DIR "/sys/include"

#ifdef GPLUSPLUS_INCLUDE_DIR
/* Pick up GNU C++ generic include files.  */
# define ID_GPLUSPLUS { GPLUSPLUS_INCLUDE_DIR, "G++", 1, 1, GPLUSPLUS_INCLUDE_DIR_ADD_SYSROOT, 0 },
#else
# define ID_GPLUSPLUS 
#endif
#ifdef GPLUSPLUS_TOOL_INCLUDE_DIR
/* Pick up GNU C++ target-dependent include files.  */
# define ID_GPLUSPLUS_TOOL { GPLUSPLUS_TOOL_INCLUDE_DIR, "G++", 1, 1, GPLUSPLUS_INCLUDE_DIR_ADD_SYSROOT, 1 },
#else
# define ID_GPLUSPLUS_TOOL
#endif
#ifdef GPLUSPLUS_BACKWARD_INCLUDE_DIR
/* Pick up GNU C++ backward and deprecated include files.  */
# define ID_GPLUSPLUS_BACKWARD { GPLUSPLUS_BACKWARD_INCLUDE_DIR, "G++", 1, 1, GPLUSPLUS_INCLUDE_DIR_ADD_SYSROOT, 0 },
#else
# define ID_GPLUSPLUS_BACKWARD
#endif
#ifdef GCC_INCLUDE_DIR
/* This is the dir for gcc's private headers.  */
# define ID_GCC { GCC_INCLUDE_DIR, "GCC", 0, 0, 0, 0 },
#else
# define ID_GCC
#endif
#ifdef PREFIX_INCLUDE_DIR
# define ID_PREFIX { PREFIX_INCLUDE_DIR, 0, 0, 1, 0, 0 },
#else
# define ID_PREFIX
#endif
#if defined (CROSS_INCLUDE_DIR) && defined (CROSS_DIRECTORY_STRUCTURE) && !defined (TARGET_SYSTEM_ROOT)
# define ID_CROSS { CROSS_INCLUDE_DIR, "GCC", 0, 0, 0, 0 },
#else
# define ID_CROSS
#endif
#ifdef TOOL_INCLUDE_DIR
/* Another place the target system's headers might be.  */
# define ID_TOOL { TOOL_INCLUDE_DIR, "BINUTILS", 0, 1, 0, 0 },
#else
# define ID_TOOL
#endif

#undef INCLUDE_DEFAULTS
#define INCLUDE_DEFAULTS				\
  {							\
    ID_GPLUSPLUS					\
    ID_GPLUSPLUS_TOOL					\
    ID_GPLUSPLUS_BACKWARD				\
    ID_GCC						\
    ID_PREFIX						\
    ID_CROSS						\
    ID_TOOL						\
    { PORTABLE_INCLUDE_DIR, 0, 0, 0, 1, 0 },		\
    { ARCH_INCLUDE_DIR, 0, 0, 0, 1, 0 },		\
    { 0, 0, 0, 0, 0, 0 }				\
  }

/* Files that are linked before user code.
   The %s tells gcc to look for these files in the library directory. */
#undef STARTFILE_SPEC
#define STARTFILE_SPEC "crt0.o%s crti.o%s crtbegin.o%s"
 
/* Files that are linked after user code. */
#undef ENDFILE_SPEC
#define ENDFILE_SPEC "crtend.o%s crtn.o%s"
 
/* Fix https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67132 */
#undef	WCHAR_TYPE
#define WCHAR_TYPE "unsigned int"
#undef	WCHAR_TYPE_SIZE
#define WCHAR_TYPE_SIZE 32

#undef  LINK_GCC_C_SEQUENCE_SPEC
#define LINK_GCC_C_SEQUENCE_SPEC "%G %L"

/* Additional predefined macros. */
#undef TARGET_OS_CPP_BUILTINS
#define TARGET_OS_CPP_BUILTINS()  \
  do {\
builtin_define ("__jehanne__"

Re: Built-in Specs ignored... unless -specs=dump is specified

2020-08-31 Thread Andrew Pinski via Gcc
On Mon, Aug 31, 2020 at 4:34 PM Giacomo Tesio  wrote:
>
> Hello everybody!
>
> To cleanup my port of GCC (9.2.0) to Jehanne OS (http://jehanne.io)
> I'd like to add a `--posixly` command line options to the gcc driver
> that should be expanded to a couple of -isystem and -L options
> to ease the compilation of POSIX programs (Jehanne is a Plan 9
> derivative so it's not a POSIX system).
>
> So I defined in the gcc/config/jehanne.h (see attachment) the macros
> LINK_SPEC, LIB_SPEC and CPP_SPEC that substitute such command line
> options as required.

You might need to add it to a .opt file like it is done in darwin.opt:
; Driver options.

all_load
Driver RejectNegative Alias(Zall_load)
Load all members of archive libraries, rather than only those that
satisfy undefined symbols.
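
For the --posixly case that might translate into a gcc/config/jehanne.opt
entry along these lines (untested; the exact record spelling for a
double-dash driver option is a guess, see gcc/doc/options.texi for the
format, and the new file would also need to be listed in extra_options in
config.gcc):

```
; hypothetical gcc/config/jehanne.opt
-posixly
Driver
Compile and link against the POSIX environment under /posix.
```
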

Thanks,
Andrew Pinski

>
> What's weird is that, if I dump the specs file with `gcc -dumpspecs`
> and then load it with `gcc -specs=dump` it works like a charm,
> properly translating the --posixly option.
>
> However without the -specs option, it can't recognise such option.
>
> I have no idea of what I'm missing.
> I tried to put the dump into PREFIX/lib/gcc/x86_64-jehanne/9.2.0/specs
> (also in the attachments) but it doesn't change anything.
> Yet, by using the -specs= option it works.
>
>
> Am I missing something obvious?
>
> Thanks for your help and sorry if the issue is dumb: I tried my best to
> understand the Gcc Internals but I was unable to fix this.
>
>
> Giacomo


Re: Future debug options: -f* or -g*?

2020-08-31 Thread David Blaikie via Gcc
Hey Mark - saw a little of/bits about your presentation at LPC 2020 GNU
Tools Track (& your thread on the gdb list about debug_names). Wondering
if you (or anyone else you know who's contributing to debug info in GCC)
have some thoughts on this flag naming issue. It'd be great to get some
alignment between GCC and Clang here, so as we both add new flags going
forward, they're at least categorically consistent for users, even
if we aren't necessarily implementing the exact same flags/flag names
(though in the -gsplit-dwarf case, it'd be good for any new semantics/name
to match exactly).

On Wed, Jul 29, 2020 at 9:46 AM David Blaikie  wrote:

> On Fri, Jul 10, 2020 at 12:09 PM Nathan Sidwell  wrote:
> >
> > On 7/9/20 3:28 PM, Fangrui Song via Gcc wrote:
> > > Fix email addresses:)
> > >
> >
> > IMHO the -f ones are misnamed.
> > -fFOO -> affect generated code (non-target-specific) or language feature
> > -gFOO -> affect debug info
> > -mFOO -> machine-specific option
> >
> > the -fdump options are misnamed btw, I remember Jeff Law pointed that
> out after
> > Mark Mitchell added the first one.  I'm not sure why we didn't rename it
> right
> > then.  I'll bet there are other -foptions that don't match my
> comfortable world
> > view :)
>
> Appreciate the perspective, for sure!
>
> It sounds like some folks who've worked on this a fair bit (at least
> myself, Eric Christopher, and Cary Coutant) have had a different
> perspective for quite a while - that -g flags generally turn on debug
> info emission (& customize it in some way, potentially) and -f flags
> modify the emission but don't enable it.
>
> Specifically this conversation arose around changing the semantics of
> -gsplit-dwarf, which currently enables debug info emission and
> customizes the nature of that emission. There's a desire to separate
> these semantics.
>
> I have a few issues with this that I'd like to avoid:
>
> 1) changing the semantics of an existing flag (I think it's best to
> introduce a new one, perhaps deprecate the old one), especially across
> two compilers, issues around version compatibility, etc seem less than
> ideal
> 2) especially given the existing semantics of the flag it seems like
> it'd add to the confusion to make -gsplit-dwarf no longer imply -g2,
> whereas adding -fsplit-dwarf would be less ambiguous/more clear that
> this is not going to turn on debug info emission
>
> Going forward for new flags, I still feel like (given the current
> proliferation of -g flags that do enable debug info emission and tweak
> it) the ambiguity of -g flags is problematic from a usability
> perspective, but I'd be less opposed to new flags using -g than I am
> to changing the semantics of an existing -g flag.
>
> If GCC is outright "hard no" on -fsplit-dwarf and (sounds like) has
> already changed -gsplit-dwarf semantics, Clang will essentially have
> to follow to avoid greater confusion, but I'd like to avoid that if
> possible.
>
> Thoughts? Are there other ways we could reduce the ambiguity between
> "enables debug info" and "tweaks debug info emission, if enabled"?
>
> - Dave
>
> >
> > nathan
> >
> > > On 2020-07-09, Fangrui Song wrote:
> > >> Both GCC and Clang have implemented many debugging options under -f
> and
> > >> -g. Whether options go to -f or -g appears to be pretty arbitrary
> decisions.
> > >>
> > >> A non-complete list of GCC supported debug options is documented here
> at
> > >> https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html
> > >>
> > >> I think these options belong to 3 categories:
> > >>
> > >> (a) -f* & doesn't imply -g2: -fdebug-types-section
> > >> -feliminate-unused-debug-types,
> > >>   -fdebug-default-version=5 (this exists in clang purely because
> -gdwarf-5
> > >> implies -g2 & there is need to not imply -g2)
> > >> (b) -g* & implies -g2: -gsplit-dwarf (I want to move this one to (c)
> > >> http://lists.llvm.org/pipermail/cfe-dev/2020-July/066202.html )
> > >>-gdwarf-5, -ggdb, -gstabs
> > >> (c) -g* but does not imply -g2: -ggnu-pubnames, -gcolumn-info,
> -gstrict-dwarf,
> > >> -gz, ...
> > >>the list appears to be much longer than (b)
> > >>
> > >> ( (b) isn't very good due to its non-orthogonality. The interaction with
> -g0 -g1
> > >>  and -g3 can be non-obvious sometimes.)
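
For illustration, the three categories behave like this on a compile line
(using flags taken from the list above):

```
gcc -c foo.c -fdebug-types-section   # (a): -f* tweak, does not by itself enable debug info
gcc -c foo.c -gsplit-dwarf           # (b): -g* flag that implies -g2 and splits the DWARF
gcc -c foo.c -g1 -ggnu-pubnames      # (c): -g* tweak that does not raise the debug level
```
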
> > >>
> > >> Cary Coutant kindly shared with me his understanding about debug
> > >> options (attached at the end). Many established options probably can't
> > >> be fixed now. Some are still fixable (-gsplit-dwarf).
> > >>
> > >> This post is mainly about future debug options. Shall we figure out a
> > >> convention for future debug options?
> > >>
> > >> Personally I'd prefer (c) but I won't object to (a).
> > >> I'd avoid (b).
> > >>
> > >>> In retrospect, I regret not naming the option -fsplit-dwarf, which
> > >>> clearly would not have implied -g, and would have fit in with a few
> > >>> other dwarf-related -f options. (I don't know whether Richard's
> > >>> objection to it