[clang] Bfi precision (PR #66285)
WenleiHe wrote: Thanks for investigating, Matthias. So it looks like the regression is indeed just clang non-PGO build being unlucky after the BFI improvement.. Sounds good to leave it as is. https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
david-xl wrote: > And digging even deeper: > > * FWIW I noticed that I only used `clang -c` as benchmark previously, should > have used `clang -c -O3` resulting in this: > > ``` > Old-BFI: insn: 37,821,687,947 (baseline) > New-BFI: insn: 38,133,312,923 +0.82% > Old-BFI, no-cold: insn: 37,423,365,184 -1.05% > New-BFI, no-cold: insn: 37,438,736,713 -1.01% > ``` > > * The problematic part of the code that is inlined/not-inlined is actually > marked as `LLVM_UNLIKELY` here: > https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/ADT/DenseMap.h#L607 > and intuitively one would think that it is probably best to not inline > `grow` (as will happen with the new BFI version) > * In practice not-inlining `grow` mostly ends up being worse because of > knock-on effects as far as I can tell. These are the inlining decisions I > noticed for the most prominent situation: > > ``` > - `grow` inlined (Old-BFI): > - Instruction::getMetadataImpl > -> Value::getMetadata not inlined > -> DenseMap::operator[] inlined > -> DenseMap::FindAndConstruct inlined > -> DenseMap::InsertIntoBucket not inlined, size >likely too big with `grow` inlined here > -> DenseMap::grow inlined > - `grow` inlined (New-BFI): > - Instruction::getMadataImpl > -> Value::getMetadata inlined > -> DenseMap::operator[]inlined > -> DenseMap::FindAndConstruct not inlined, size > -> DenseMap::InsertIntoBucket inlined > -> DenseMap::grow not inlined > ``` > > Though if you look into this then I would state that the code got better for > the wrong reasons! Not inlining `grow` is a sensible decision in this context > and the `LLVM_UNLIKELY` annotation makes sense (I actually added some > counters and see the unlikely branch taken somewhere between 1% and 10% of > the cases depending on inputs, so seems sensible). > > Unfortunately the particular code here `getMetadataImpl` never inserts new > things into the map, but unfortunately `operator[]` gives you that behavior > by default so nobody noticed. So not inlining `InsertIntoBucket` happened to > be a great decision previously that the compiler did by accident without > having good data to support this. Now with better but still insufficient data > (as this is PGO) we happen to end up inlining `InsertIntoBucket` wasting code > size leading to worse inlining decisions further up the stack... > > Long story short: This ended up another of the many stories where the > compiler cannot make the right inlining decisions without having actual > runtime data. It tends to make a better decision based on the data that ends > up being wrong anyway here. I'm gonna leave things as is and rather put up > some improvements to LLVM code instead! Good analysis. Your plan sounds good but lowering the cold threshold is probably a good thing to do with the new BFI. The FindAndConstruct method in DenseMap can probably be annotated with buitin_expect to indicate the insertion is a unlikely path. https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
dexonsmith wrote: Interesting. Probably `Value::getMetadata()` could/should call `DenseMap::find()` instead of `operator[]()` and assert that it's found before dereferencing, because `Value::hasMetadata()` (which, IIRC, consults a bit stored in `Value`) has already promised something will be there. Probably you'll find me on the git-blame for this... Your plan SGTM! https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
MatzeB wrote: Some very ad-hoc benchmarking. Of clang compilation speed (measured in instructions as reported by valgrind/callgrind which I think somewhat matches the setup of nikic) compiling `sqlite3` of CTMark: Old-BFI (this PR reverted), New-BFI (this PR applied), no-cold (cold-callsite-rel-freq set to 0 to disable `InlineCostCallAnalyzer::isColdCallSite` behavior for non-PGO builds: ``` Old-BFI: size: 59,802,772 (baseline) insn: 2,812,833,144 (baseline) New-BFI: size: 60,247,076 +0.74% insn: 2,818,639,641 +0.21% Old-BFI, no-cold: size: 60,773,988 +1.62% insn: 2,806,521,932 -0.22% New-BFI, no-cold: size: 60,741,700 +1.57% insn: 2,803,489,298 -0.33% ``` I could benchmark more with llvm-test-suite and put up a PR to disable cold-callsite-rel-freq by default if people support this direction... https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
vitalybuka wrote: FYI @ldionne This patch causing test_vector2.pass.cpp trigger memory leak https://lab.llvm.org/buildbot/#/builders/168/builds/16422 Not sure if this was leaking but lsan didn't see, it's possible. Or some regression in lsan. https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
david-xl wrote: Contracting my past self (one of the reviewers of the patch), I think coldness check should be based on a global threshold, which exists when synthetic entry count propagation is enabled (without PGO). https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
dexonsmith wrote: > Seems this got introduced in https://reviews.llvm.org/D34312 with the rough > idea that we shouldn't inline into parts of the code that > `_builtin_expect(...)` deems unlikely. Which makes sense when you express it > like this, but I guess numeric thresholds can go wrong... Heh, yeah, the premise seems correct, but a percentage-based numeric threshold doesn't seem right. You kind of want a flag for the block. Or a special value, like "freq=0", which indicates "annotation says this is cold". https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
MatzeB wrote: Seems this got introduced in https://reviews.llvm.org/D34312 with the rough idea that we shouldn't inline into parts of the code that `_builtin_expect(...)` deems unlikely. Which makes sense when you say it express it like this, but I guess numeric thresholds can go wrong... https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
david-xl wrote: Agree that deciding coldness based on local entry frequency (2%) is a bad idea though. https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
dexonsmith wrote: Not entirely accidental. When BPI/BFI first landed it was heavily profiled to be sure it didn't pessimize non-PGO code. I don't see why we'd suddenly be okay with pessimizing it. Under 2% isn't hard to hit for hot path code. Lots of functions will have strings of early exit conditions. https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
david-xl wrote: Yes -- the revert can wait until more data is available. I agree that it should help performance in theory. https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
WenleiHe wrote: > This is concerning. Can this be reverted for now and we can help with some > internal performance testing. @xur-llvm Internally, with this change, on a few large workloads, we saw ~0.2-0.5% performance improvements, all with 2-3% .text size reduction at the same time. Our testing covered both sampling pgo and instrumentation pgo. The change makes profile more accurate, which is a good improvement by itself. But PGO performance isn't just about profile accuracy, and it depends on training quality and representativeness too. Hypothetically, if some of the training isn't that representative relative to perf testing scenario, by making profile more accurate, we could also make profile deviate more from what's needed for best perf on testing scenario. That said, I think it'd be fair to revert if this is proven to be a general regression. For that, can we wait for your results before reverting? https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
MatzeB wrote: Oh so this isn't even a PGO enabled build, interesting... https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
nikic wrote: This is a standard release build. The exact cmake config is https://gist.github.com/nikic/1bf97b5b6aade298b87f41cd13bf37d5 where the stage1 build has the same configuration but is built with gcc. I'm testing CTMark, but the impact is quite consistent across the board, so anything will probably do. https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] Bfi precision (PR #66285)
nikic wrote: It looks like this has made clang 0.7% slower (in the sense that the clang stage2 binary is optimized worse -- there is no negative impact on stage1). The binary is also 0.7% larger. It looks like this change has significant impact on inlining heuristics. Not really sure what to make of that, I assume you determined that this ends up being faster on average, so maybe clang is just unlucky. https://github.com/llvm/llvm-project/pull/66285 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits