Re: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v2]

2024-06-06 Thread MaxXing
> This patch changes the algorithm of `Node::dominates` to make the result more 
> precise, and allows the iterators of `ConcurrentHashMap` to be scalar 
> replaced.
> 
> The previous algorithm will return a conservative result when encountering a 
> dead control flow, and only try the first two input paths of a multi-input 
> Region node, which may prevent the scalar replacement in some cases.
> 
> For example, with G1 GC enabled, C2 generates GC barriers for 
> `ConcurrentHashMap` iteration operations at some early phases, and then 
> eliminates them in a later IGVN, but `LoadNode` is also idealized in the same 
> IGVN. This causes `LoadNode::Ideal` to see some dead barrier control flows, 
> and refuse to split some instance field loads through Phi due to the 
> conservative result of `Node::dominates`, and thus the scalar replacement can 
> not be applied to iterators in the later macro elimination phase.
> 
> This patch allows `Node::dominates` to try other paths of the last 
> multi-input Region node when the first path is dead, and makes 
> `ConcurrentHashMap` iteration ~30% faster:
> 
> 
> Benchmark(nkeys)  Mode  CntScore   
> Error  Units
> Maps.testConcurrentHashMapIterators1  avgt   15   414099.085 ± 
> 33230.945  ns/op   # baseline
> Maps.testConcurrentHashMapIterators1  avgt   15   315490.281 ±  
> 3037.056  ns/op   # patch
> 
> 
> Testing: tier1-4.

MaxXing has updated the pull request incrementally with one additional commit 
since the last revision:

  Revert last commit, and push the `LoadNode` back to the worklist to wait for 
the dead code to be removed.

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/19496/files
  - new: https://git.openjdk.org/jdk/pull/19496/files/e3330ece..b5db38dc

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=19496&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19496&range=00-01

  Stats: 107 lines in 4 files changed: 39 ins; 34 del; 34 mod
  Patch: https://git.openjdk.org/jdk/pull/19496.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/19496/head:pull/19496

PR: https://git.openjdk.org/jdk/pull/19496


Re: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v2]

2024-06-06 Thread MaxXing
On Wed, 5 Jun 2024 05:40:12 GMT, Tobias Hartmann  wrote:

>> MaxXing has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Revert last commit, and push the `LoadNode` back to the worklist to wait 
>> for the dead code to be removed.
>
> Impressive results! I haven't looked at the change yet but here are a few 
> questions / requests:
> - Could you add a screenshot of the IR of the case you are describing?
> - Wouldn't it help to add the LoadNode back to the IGVN worklist and wait for 
> the dead path to be removed?
> - Could you add an [IR 
> framework](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/README.md)
>  test that verifies that the optimization works as expected?
> 
> Thanks,
> Tobias

@TobiHartmann  Hi Tobias, thanks for your reply!

> Could you add a screenshot of the IR of the case you are describing?

Sure. Here's a simple example of iterating over the keys of `ConcurrentHashMap`:


public long sumMapKeys() {
  long sum = 0;
  Enumeration it = map.keys();
  while (it.hasMoreElements()) {
sum += (Long) it.nextElement();
  }
  return sum;
}


And here's what `-XX:+PrintEscapeAnalysis -XX:+PrintEliminateAllocations` says:


JavaObject(6) NoEscape(NoEscape) [ 183F 189F 205F 196F 191F 179F 186F 202F 208F 
233F 637F 1069F [ 104 109 ]] 92  Allocate  === 76 6 69 8 1 (90 89 24 1 1 1 
22 1 1 43 77 87 ) [[ 93 94 95 102 103 104 ]]  rawptr:NotNull ( int:>=0, 
java/lang/Object:NotNull *, bool, top, bool ) ConcurrentHashMap::keys @ bci:16 
(line 2152) MyBenchmark::sumMapKeys @ bci:6 (line 105) !jvms: 
ConcurrentHashMap::keys @ bci:16 (line 2152) MyBenchmark::sumMapKeys @ bci:6 
(line 105)
LocalVar(60) [ 92P [ 109 183b 189b 205b ]]104  Proj  === 92  [[ 105 109 183 
189 205 ]] #5 !jvms: ConcurrentHashMap::keys @ bci:16 (line 2152) 
MyBenchmark::sumMapKeys @ bci:6 (line 105)
LocalVar(103) [ 104 92P [ 196b 191b 179b 186b 202b 208b 233b 637b 1069b ]]
109  CheckCastPP  === 106 104  [[ 1801 1771 1754 1584 1503 1503 1490 1490 1466 
1466 1451 1451 208 1393 1381 1381 179 179 186 186 208 196 191 191 196 202 202 
228 297 233 233 1363 1363 1393 1321 1321 1311 637 637 648 648 998 988 1311 1301 
477 477 487 487 497 497 569 1301 672 685 767 672 978 920 539 539 1069 1069 557 
557 1128 569 631 1061 685 631 ]]  
#java/util/concurrent/ConcurrentHashMap$KeyIterator 
(java/util/Iterator,java/util/Enumeration):NotNull:exact *  
Oop:java/util/concurrent/ConcurrentHashMap$KeyIterator 
(java/util/Iterator,java/util/Enumeration):NotNull:exact * !jvms: 
ConcurrentHashMap::keys @ bci:16 (line 2152) MyBenchmark::sumMapKeys @ bci:6 
(line 105)

NotScalar (Field load)  109  CheckCastPP  === 106 104  [[ 1801 1771 1754 1584 
1503 1503 1490 1490 1128 569 1451 1451 208 1393 1381 1381 672 685 1069 1069 208 
196 191 191 196 978 685 631 297 767 672 569 1301 1393 1321 1321 1311 557 557 
631 1061 998 988 1311 1301 477 477 487 487 497 497 ]]  
#java/util/concurrent/ConcurrentHashMap$KeyIterator 
(java/util/Iterator,java/util/Enumeration):NotNull:exact *,iid=92  
Oop:java/util/concurrent/ConcurrentHashMap$KeyIterator 
(java/util/Iterator,java/util/Enumeration):NotNull:exact *,iid=92 !jvms: 
ConcurrentHashMap::keys @ bci:16 (line 2152) MyBenchmark::sumMapKeys @ bci:6 
(line 105)
    2186  LoadI  === _ 1973 191  [[ 2185 ]]  
@java/util/concurrent/ConcurrentHashMap$Traverser+12 *, name=index, idx=9; #int 
!orig=[2178],[2171],[1373] !jvms: ConcurrentHashMap$Traverser::advance @ bci:51 
(line 3369) ConcurrentHashMap$KeyIterator::next @ bci:28 (line 3468) 
ConcurrentHashMap$KeyIterator::nextElement @ bci:1 (line 3472) 
MyBenchmark::sumMapKeys @ bci:21 (line 107)


It shows that scalar replacement is aborted due to field load 2186:

https://github.com/openjdk/jdk/assets/5129820/2e35769a-6e0f-4c8a-96ec-09923a5eb2c0";>

As we can see its memory is a Phi, and it should be split by 
`LoadNode::split_through_phi` if its address `109 CheckCastPP` dominates the 
control flow of the Phi node `1330 Region`.

The control node of `CheckCastPP` is `106 Proj`:

https://github.com/openjdk/jdk/assets/5129820/ebf6f7f8-fcd1-4d83-8734-3f1864dcc946";>

And it does dominate 1330, although not that obvious:

https://github.com/openjdk/jdk/assets/5129820/e53b25de-c966-4f89-b30e-54ccc35f645b";>

But `Node::dominates` think it doesn't because of the dead control flow.

> Wouldn't it help to add the LoadNode back to the IGVN worklist and wait for 
> the dead path to be removed?

I tried to revert the change of main algorithm of `Node::dominates`, and just 
add some code to add the LoadNode back to the worklist if we met dead path. It 
works, still makes the iteration ~30% faster:


Benchmark  (nkeys)  Mode  CntScore  
 Error  Units
Maps.testConcurrentHashMapIterators  1  avgt   15   312720.415 ±  
3255.500  ns/op


Thanks for pointing this out. I updated this PR, and the latest commit is 
passing test tier1-4.

>