On 2018-11-02 11:39, Magnus Ihse Bursie wrote:
On 2018-11-02 00:53, Ioi Lam wrote:
Maybe precompiled.hpp can be periodically (weekly?) updated by a
robot, which parses the dependencies files generated by gcc, and pick
the most popular N files?
I think that's tricky to implement automatically. However, I've done
more or less that, and I've got some wonderful results! :-)
Ok, I'm done running my tests.
TL;DR: I've managed to reduce wall-clock time from 2m 45s (with pch) or
2m 23s (without pch), to 1m 55s. The cpu time spent went from 52m 27s
(with pch) or 55m 30s (without pch) to 41m 10s. This is a huge gain for
our automated builds! And a clear improvement even for the ordinary
developer.
The list of included header files is reduced to just 37. The winning
combination was to include all header files that were included in more
than 130 different files, but to exclude all files with names matching
"*.inline.hpp". A hoped-for side benefit of not pulling in the
*.inline.hpp files is that the risk of pch/non-pch failures will diminish.
However, these 37 files in turn pull in an additional 201 header files.
Of these, three are *.inline.hpp:
share/jfr/recorder/checkpoint/types/traceid/jfrTraceIdBits.inline.hpp,
os_cpu/linux_x86/bytes_linux_x86.inline.hpp and
os_cpu/linux_x86/copy_linux_x86.inline.hpp. This looks like a problem
with the header files to me.
With some exceptions (mostly related to JFR), these additional 201 files
have "generic" looking names (like share/gc/g1/g1_globals.hpp), which
indicates to me that it is reasonable to have them in this list, just as
the list of the original 37 tended to be quite general and high-level
includes. However, some files (like
share/jfr/instrumentation/jfrEventClassTransformer.hpp) have perhaps
leaked in where they do not really belong. It might be worth letting a
hotspot engineer spend some cycles checking these files to see if
anything can be improved.
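For those who want to try this at home, here is a minimal sketch of the
kind of selection script I describe above (not the exact one I used; the
build directory path and cutoff are assumptions you will need to adapt).
It assumes gcc has been run with -MMD or similar, so that .d dependency
files are available to parse:

import glob
from collections import Counter

CUTOFF = 130  # sweet spot will differ per machine and JVM configuration
counts = Counter()

# Each .d file lists one object file's prerequisites:
#   foo.o: foo.cpp bar.hpp baz.hpp \
#          qux.hpp
for dep_file in glob.glob("build/**/*.d", recursive=True):  # path assumed
    with open(dep_file) as f:
        tokens = f.read().replace("\\\n", " ").split()
    # Count each header at most once per translation unit.
    counts.update({t for t in tokens if t.endswith(".hpp")})

selected = sorted(h for h, n in counts.items()
                  if n >= CUTOFF and not h.endswith(".inline.hpp"))
for header in selected:
    print(f'# include "{header}"')

The output is then suitable for pasting into precompiled.hpp.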
Caveats: I have only run this on my local linux build with the default
server JVM configuration. Other machines will have different sweet
spots. Other JVM variants/feature combinations will have different sweet
spots. And, most importantly, I have not tested this at all on Windows.
Nevertheless, I'm almost prepared to suggest a patch that uses this
selection of files if running on gcc, just as is, because of the speed
improvements I measured.
And some data:
Here is my log from my runs. The "on or above" value is the cutoff I
used: a header file was selected if at least that many files included
it. As you can see, there is not much difference between cutoffs from
130 to 150, or (without the inline files) between 110 and 150. (There
were a lot of additional inline files in the positions below 130.) All
other things being equal, I'd prefer a solution with fewer files; it is
less likely to go bad.
real 2m45.623s
user 52m27.813s
sys 5m27.176s
hotspot with original pch
real 2m23.837s
user 55m30.448s
sys 3m39.739s
hotspot without pch
real 1m59.533s
user 42m50.019s
sys 3m0.893s
hotspot new pch on or above 250
real 1m58.937s
user 42m18.994s
sys 3m0.245s
hotspot new pch on or above 200
real 2m0.729s
user 42m16.636s
sys 2m57.125s
hotspot new pch on or above 170
real 1m58.064s
user 42m9.618s
sys 2m57.635s
hotspot new pch on or above 150
real 1m58.053s
user 42m9.796s
sys 2m58.732s
hotspot new pch on or above 130
real 2m3.364s
user 42m54.818s
sys 3m2.737s
hotspot new pch on or above 100
real 2m6.698s
user 44m30.434s
sys 3m12.015s
hotspot new pch on or above 70
real 2m0.598s
user 41m17.810s
sys 2m56.258s
hotspot new pch on or above 150 without inline
real 1m55.981s
user 41m10.076s
sys 2m51.983s
hotspot new pch on or above 130 without inline
real 1m56.449s
user 41m10.667s
sys 2m53.808s
hotspot new pch on or above 110 without inline
And here is the "winning" list (which I declared as "on or above 130,
without inline"). I encourage everyone to try this on their own system,
and report back the results!
#ifndef DONT_USE_PRECOMPILED_HEADER
# include "classfile/classLoaderData.hpp"
# include "classfile/javaClasses.hpp"
# include "classfile/systemDictionary.hpp"
# include "gc/shared/collectedHeap.hpp"
# include "gc/shared/gcCause.hpp"
# include "logging/log.hpp"
# include "memory/allocation.hpp"
# include "memory/iterator.hpp"
# include "memory/memRegion.hpp"
# include "memory/resourceArea.hpp"
# include "memory/universe.hpp"
# include "oops/instanceKlass.hpp"
# include "oops/klass.hpp"
# include "oops/method.hpp"
# include "oops/objArrayKlass.hpp"
# include "oops/objArrayOop.hpp"
# include "oops/oop.hpp"
# include "oops/oopsHierarchy.hpp"
# include "runtime/atomic.hpp"
# include "runtime/globals.hpp"
# include "runtime/handles.hpp"
# include "runtime/mutex.hpp"
# include "runtime/orderAccess.hpp"
# include "runtime/os.hpp"
# include "runtime/thread.hpp"
# include "runtime/timer.hpp"
# include "services/memTracker.hpp"
# include "utilities/align.hpp"
# include "utilities/bitMap.hpp"
# include "utilities/copy.hpp"
# include "utilities/debug.hpp"
# include "utilities/exceptions.hpp"
# include "utilities/globalDefinitions.hpp"
# include "utilities/growableArray.hpp"
# include "utilities/macros.hpp"
# include "utilities/ostream.hpp"
# include "utilities/ticks.hpp"
#endif // !DONT_USE_PRECOMPILED_HEADER
/Magnus
I'd still like to run some more tests, but preliminary data indicates
that there is much to be gained by having a more sensible list of
files in the precompiled header.
The fewer files we have on this list, the less likely it is to become
(drastically) outdated. So I don't think we need to do this
automatically, but perhaps manually every now and then when we feel
build times are increasing.
/Magnus
- Ioi
On 11/1/18 4:38 PM, David Holmes wrote:
It's not at all obvious to me that the way we use PCH is the
right/best way to use it. We dump every header we think it would be
good to precompile into precompiled.hpp and then just ask gcc to
precompile that. The result is a ~250MB file that has to be read in
and processed for every source file! That doesn't seem very
efficient to me.
Cheers,
David
On 2/11/2018 3:18 AM, Erik Joelsson wrote:
Hello,
My point here, which wasn't very clear, is that Mac and Linux seem
to lose just as much real compile time. The big difference in these
tests was rather the number of cpus in the machine (32 threads in
the linux box vs 8 on the mac). The total amount of work done
increased when PCH was disabled; that's the user time. Here is my
theory on why the real (wall clock) time was not consistent with
user time between these experiments:
With pch the time line (simplified) looks like this:
1. Single thread creating PCH
2. All cores compiling C++ files
When disabling pch it's just:
1. All cores compiling C++ files
To gain speed with PCH, the time spent in 1 must be less than the
time saved in 2. The potential time saved in 2 goes down as the
number of CPUs goes up. I'm pretty sure that if I repeated the
experiment on Linux on a smaller box (typically one we use in CI),
the results would look similar to Macosx, and similarly, if I had
access to a much bigger mac, it would behave like the big Linux
box. This is why I'm saying this should be done for both of these
platforms or neither.
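To make that concrete, here is a toy model of the timeline above (a
sketch; all figures are invented for illustration, not measurements
from our builds):

# With PCH: pay the serial PCH step up front, then compile slightly
# cheaper files in parallel. Without PCH: everything is parallel.
def wall_with_pch(t_pch, n_files, t_file, t_saved, cores):
    return t_pch + n_files * (t_file - t_saved) / cores

def wall_without_pch(n_files, t_file, cores):
    return n_files * t_file / cores

# Assume 1000 files at 4 s each, PCH saving 1 s per file and taking
# 60 s to build. PCH then wins only while cores < 1000 * 1 / 60 ~ 17.
for cores in (4, 8, 16, 32):
    pch = wall_with_pch(60, 1000, 4.0, 1.0, cores)
    npch = wall_without_pch(1000, 4.0, cores)
    print(f"{cores:2d} cores: with pch {pch:6.1f}s, without {npch:6.1f}s")

With these made-up numbers the break-even point is around 17 cores,
which matches the pattern of the 8-thread mac gaining from PCH and the
32-thread Linux box losing.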
In addition to this, the experiment only built hotspot. If we were to
instead build the whole JDK, then the time wasted in 1 in the
PCH case would be negated to a large extent by other build targets
running concurrently, so for a full build, PCH is still providing
value.
The question here is whether, if the value of PCH isn't very big,
it's worth keeping when it's also creating as much grief as
described here. There is no doubt that there is value, however. And
given the examination done by Magnus, it seems this value could be
increased.
The main reason we haven't disabled PCH in CI before is this: we
really, really want CI builds to be fast. We don't have a ton of
spare capacity to just throw at it. PCH made builds faster, so we
used it. My other reason is consistency between builds.
Supporting multiple different modes of building creates the
potential for inconsistencies. For that reason I would definitely
not support having PCH on by default but turned off in our
CI/dev-submit. We pick one or the other as the official build
configuration, and we stick with the official build configuration
for all builds of any official capacity (which includes CI).
In the current CI setup, we have a bunch of tiers that execute one
after the other. The jdk-submit currently only runs tier1. In tier2
I've put slowdebug builds with PCH disabled, just to help verify a
common developer configuration. These builds are not meant to be
used for testing or anything like that, they are just run for
verification, which is why this is ok. We could argue that it would
make sense to move the linux-x64-slowdebug without pch build to
tier1 so that it's included in dev-submit.
/Erik
On 2018-11-01 03:38, Magnus Ihse Bursie wrote:
On 2018-10-31 00:54, Erik Joelsson wrote:
Below are the corresponding numbers from a Mac (Mac Pro (Late
2013), 3.7 GHz, Quad-Core Intel Xeon E5, 16 GB). To be clear, the
-npch runs are without precompiled headers. Here we see a slight
degradation in both user time and wall clock time when disabling
PCH. My guess is that the user time increase is about the same as
on Linux, but because of the lower cpu count, the extra load is
not as easily covered.
These tests were run with just building hotspot. This means that
the precompiled header is generated alone on one core while
nothing else is happening, which would explain this degradation
in build speed. If we were instead building the whole product, we
would see a better correlation between user and real time.
Given the very small benefit here, it could make sense to disable
precompiled headers by default for Linux and Mac, just as we did
with ccache.
I do know that the benefit is huge on Windows though, so we
cannot remove the feature completely. Any other comments?
Well, if you show that disabling precompiled headers costs time on
macosx, and no-one (as far as I've seen) has complained about PCH
on mac, then why not keep them on as default there? That the gain
is small is no argument for losing it. (I remember a time when you
were hunting seconds in the build time ;-))
On linux, the story seems different, though. People experience PCH
as a problem, and there is a net loss of time, at least on
selected testing machines. It makes sense to turn it off by
default, then.
/Magnus
/Erik
macosx-x64
real 4m13.658s
user 27m17.595s
sys 2m11.306s
macosx-x64-npch
real 4m27.823s
user 30m0.434s
sys 2m18.669s
macosx-x64-debug
real 5m21.032s
user 35m57.347s
sys 2m20.588s
macosx-x64-debug-npch
real 5m33.728s
user 38m10.311s
sys 2m27.587s
macosx-x64-slowdebug
real 3m54.439s
user 25m32.197s
sys 2m8.750s
macosx-x64-slowdebug-npch
real 4m11.987s
user 27m59.857s
sys 2m18.093s
On 2018-10-30 14:00, Erik Joelsson wrote:
Hello,
On 2018-10-30 13:17, Aleksey Shipilev wrote:
On 10/30/2018 06:26 PM, Ioi Lam wrote:
Is there any advantage of using precompiled headers on Linux?
I have measured it recently on the shenandoah repositories, and
fastdebug/release build times were no better with PCH than
without. Actually, it gets worse when you touch a single header
that is in the PCH list, and you end up recompiling the entire
Hotspot. I would be in favor of disabling it by default.
I just did a measurement on my local workstation (2x8 cores with
2x HT, Ubuntu 18.04, using the Oracle devkit GCC 7.3.0). I ran
"time make hotspot" with clean build directories.
linux-x64:
real 4m6.657s
user 61m23.090s
sys 6m24.477s
linux-x64-npch
real 3m41.130s
user 66m11.824s
sys 4m19.224s
linux-x64-debug
real 4m47.117s
user 75m53.740s
sys 8m21.408s
linux-x64-debug-npch
real 4m42.877s
user 84m30.764s
sys 4m54.666s
linux-x64-slowdebug
real 3m54.564s
user 44m2.828s
sys 6m22.785s
linux-x64-slowdebug-npch
real 3m23.092s
user 55m3.142s
sys 4m10.172s
These numbers support your claim. Wall clock time is actually
increased with PCH enabled, but total user time is decreased.
Does not seem worth it to me.
It's on by default and we keep having
breakage where someone forgets to add an #include. The
latest instance is JDK-8213148.
Yes, we catch most of these breakages in CIs, which tells me that
adding it to jdk-submit would cover most of the breakage during
pre-integration testing.
jdk-submit currently runs what we call "tier1". We do have
builds of Linux slowdebug with precompiled headers disabled in
tier2. We also build solaris-sparcv9 in tier1, which does not
support precompiled headers at all, so to not be caught by
jdk-submit you would have to be in Linux-specific code. The
example bug does not seem to be that. Mach5/jdk-submit was down
over the weekend and yesterday, so my suspicion is the offending
code in this case was never tested.
That said, given that we get practically no benefit from PCH on
Linux/GCC, we should probably just turn it off by default for
Linux and/or GCC. I think we need to investigate macOS as well
here.
/Erik
-Aleksey