New issue 2901: Disable inlining globally hurts performance
https://bitbucket.org/pypy/pypy/issues/2901/disable-inlining-globally-hurts
Tinho Lee:
`call_assembler_r` shows up as a heavy operation when I profile my project
with the JIT log, because of the packing and unpacking of parameters and
results. Inlining should help with this, but it does not always work as
expected.
I have constructed two test cases, one for a recursive call and another for
an overridden `__getattr__`. All tests were run with the environment variable
PYPY_GC_MIN=20G to eliminate the influence of the GC.
Here are the basic functions defined for the recursive call test case
```
#!python
import time

def rec(n):
    return 1 if n <= 1 else rec(n - 1) + n

def a():
    # five shallow calls: no recursion actually happens
    rec(1)
    rec(1)
    rec(1)
    rec(1)
    rec(1)

def b():
    # three deep calls: each recurses 100 levels
    rec(100)
    rec(100)
    rec(100)

def test_a():
    start = time.time()
    for n in xrange(10000000):
        a()
    print 'a', time.time() - start

def test_b():
    start = time.time()
    for n in xrange(100000):
        b()
    print 'b', time.time() - start
```
and the test code
```
#!python
print 'warm up'
test_a()
test_b()
print 'test'
test_a()
test_b()
```
The order of warm up is important in this case. Here is the result if `test_a`
is called before `test_b`:
```
warm up
a 0.0164198875427
b 0.196079015732
test
a 0.0130050182343
b 0.189415931702
```
And the result if `test_b` is called before `test_a`:
```
warm up
b 0.198947906494
a 1.29781603813
test
a 1.27277779579
b 0.186721801758
```
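Since the two orderings have to be measured in separate processes, it may help to drive both from one script. The following is only a sketch under assumptions: the `run` helper, the `TESTS` table and the `iterations` argument are hypothetical additions, and the timings are returned as well as printed so they can be compared. The test functions are kept separate, as in the original, so that each loop still gets its own trace.

```python
import time

try:
    xrange  # Python 2 / PyPy2, as used in the report
except NameError:
    xrange = range  # Python 3


def rec(n):
    return 1 if n <= 1 else rec(n - 1) + n


def a():
    rec(1)
    rec(1)
    rec(1)
    rec(1)
    rec(1)


def b():
    rec(100)
    rec(100)
    rec(100)


def test_a(iterations=10000000):
    start = time.time()
    for n in xrange(iterations):
        a()
    return time.time() - start


def test_b(iterations=100000):
    start = time.time()
    for n in xrange(iterations):
        b()
    return time.time() - start


TESTS = {'a': test_a, 'b': test_b}  # hypothetical lookup table


def run(warmup_order, iterations=None):
    """Warm up in `warmup_order` ('ab' or 'ba'), then time a and b again.

    Returns a list of (phase, name, seconds) tuples.
    """
    iterations = iterations or {'a': 10000000, 'b': 100000}
    results = []
    for phase, order in (('warm up', warmup_order), ('test', 'ab')):
        print(phase)
        for name in order:
            seconds = TESTS[name](iterations[name])
            print('%s %s' % (name, seconds))
            results.append((phase, name, seconds))
    return results
```

Calling `run('ab')` in one process and `run('ba')` in a fresh one should correspond to the two result listings above.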
The performance of function `a` is greatly hurt by the order of warm up. I
have found some clues in the JIT log. If `test_a` is warmed up first, `rec`
is inlined into the function `a` and everything works fine. However, when
`test_b` is warmed up first, `rec` is detected as a recursive function call
and inlining is disabled for it, which also affects the performance of `a`.
The implementation lies in `_opimpl_recursive_call` of pyjitpl.py.
```
#!python
if count >= memmgr.max_unroll_recursion:
    # This function is recursive and has exceeded the
    # maximum number of unrollings we allow. We want to stop
    # inlining it further and to make sure that, if it
    # hasn't happened already, the function is traced
    # separately as soon as possible.
    if have_debug_prints():
        loc = targetjitdriver_sd.warmstate.get_location_str(greenboxes)
        debug_print("recursive function (not inlined):", loc)
    warmrunnerstate.dont_trace_here(greenboxes)
```
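The effect of that flag can be illustrated with a toy model. This is not PyPy code: `ToyWarmState`, `trace_call` and `trace_caller` are hypothetical names, the limit of 7 is an assumed value, and the only point is that the "do not inline" decision is keyed on the callee alone, so it applies to every later caller.

```python
class ToyWarmState(object):
    """Hypothetical stand-in for the JIT's per-function bookkeeping."""

    def __init__(self, max_unroll_recursion=7):  # 7 is an assumed value
        self.max_unroll_recursion = max_unroll_recursion
        self.not_inlinable = set()  # keyed by callee only, not by caller

    def trace_call(self, callee, depth_in_trace):
        """Decide how a call to `callee` is handled while tracing, given
        that `callee` is already inlined `depth_in_trace` times."""
        if callee in self.not_inlinable:
            return 'call_assembler'
        if depth_in_trace >= self.max_unroll_recursion:
            # Too many unrollings: flag the callee for everyone.
            self.not_inlinable.add(callee)
            return 'call_assembler'
        return 'inline'


def trace_caller(state, callee, recursion_depth):
    """Trace one caller whose call to `callee` recurses `recursion_depth`
    levels deep; stop at the first call that is not inlined."""
    decisions = []
    for depth in range(recursion_depth):
        decision = state.trace_call(callee, depth)
        decisions.append(decision)
        if decision != 'inline':
            break
    return decisions


state = ToyWarmState()
print(trace_caller(state, 'rec', 100))  # like b(): hits the limit, flags rec
print(trace_caller(state, 'rec', 1))    # like a(): traced later, no inlining
```

In this model, tracing the shallow caller first would still inline `rec`, which matches the order dependence seen in the timings.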
The same behavior is found for non-recursive function calls. To make it easy
to reproduce, `pypy` is started with `--jit trace_limit=300`. Here are the
basic definitions
```
#!python
import time

class X(object):
    def __getattr__(self, name):
        return len(name)

def a(x):
    return x.a + x.b + x.c

def b(x):
    # extra operations that make the trace of b longer than that of a
    a = 1
    a += 1
    a += 2
    a += 3
    a += 4
    a += 5
    a += 6
    a += 7
    a += 8
    a += 9
    return x.a + x.b + x.c

def test_a(x):
    start = time.time()
    for n in xrange(1000000):
        a(x)
    print 'a', time.time() - start

def test_b(x):
    start = time.time()
    for n in xrange(1000000):
        b(x)
    print 'b', time.time() - start
```
and the test case
```
#!python
x = X()
print 'warm up'
test_a(x)
test_b(x)
print 'test'
test_a(x)
test_b(x)
```
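The trace limit can presumably also be lowered from inside the test script instead of on the command line. A minimal sketch, assuming `pypyjit.set_param` accepts the same `name=value` string as `--jit`; the helper name `set_trace_limit` is hypothetical, and on interpreters without the `pypyjit` module it does nothing.

```python
def set_trace_limit(limit):
    """Try to apply a JIT trace limit at runtime.

    Returns True if the parameter was passed to the JIT, False when the
    pypyjit module is not available (for example on CPython).
    """
    try:
        import pypyjit  # only importable on PyPy
    except ImportError:
        return False
    pypyjit.set_param('trace_limit=%d' % limit)
    return True


# Would have to run before the warm-up phase to affect it.
applied = set_trace_limit(300)
print('trace limit applied: %s' % applied)
```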
The order of warm-up also has a significant influence on the result.
When `test_a` is called before `test_b`, `__getattr__` is inlined, which gives
better performance.
```
warm up
a 0.0307860374451
b 0.114766836166
test
a 0.0264270305634
b 0.110279083252
```
When `test_b` is called before `test_a`, inlining is disabled for
`__getattr__`, which results in worse performance for `a`.
```
warm up
b 0.115270137787
a 0.111267089844
test
a 0.10728096962
b 0.108850002289
```
I found a possible explanation in `blackhole_if_trace_too_long` in
pyjitpl.py. If `test_b` is called first, the trace is detected as too long
because a small trace limit is set. The biggest function in `b`, which is
`__getattr__`, is then marked as non-inlinable.
```
#!python
def blackhole_if_trace_too_long(self):
    warmrunnerstate = self.jitdriver_sd.warmstate
    if (self.history.length() > warmrunnerstate.trace_limit or
            self.history.trace_tag_overflow()):
        jd_sd, greenkey_of_huge_function = self.find_biggest_function()
        self.staticdata.stats.record_aborted(greenkey_of_huge_function)
        self.portal_trace_positions = None
        if greenkey_of_huge_function is not None:
            jd_sd.warmstate.disable_noninlinable_function(
                greenkey_of_huge_function)
            self.aborted_tracing_jitdriver = jd_sd
            self.aborted_tracing_greenkey = greenkey_of_huge_function
            if self.current_merge_points:
                jd_sd = self.jitdriver_sd
                greenkey = self.current_merge_points[0][0][:jd_sd.num_green_args]
                warmrunnerstate.JitCell.trace_next_iteration(greenkey)
        raise SwitchToBlackhole(Counters.ABORT_TOO_LONG)
```
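A toy model of this abort path shows how one over-long trace can change what later traces inline. Everything here is hypothetical: `ToyTracer`, the operation counts and the limit of 30 are invented for illustration, the length check is done once at the end of the trace rather than incrementally, and only inlined callees are considered when picking the biggest function.

```python
class ToyTracer(object):
    """Hypothetical tracer with one global set of non-inlinable functions."""

    def __init__(self, trace_limit):
        self.trace_limit = trace_limit
        self.not_inlinable = set()  # shared by every trace

    def trace(self, own_ops, calls):
        """Trace one loop body with `own_ops` operations of its own and a
        list of (callee, ops) calls. Returns the list of inlined callees,
        or None if the trace was aborted for being too long."""
        length = own_ops
        inlined = []
        for callee, ops in calls:
            if callee in self.not_inlinable:
                length += 1  # one residual call instead of the body
            else:
                length += ops
                inlined.append((ops, callee))
        if length > self.trace_limit:
            if inlined:
                # Disable the biggest inlined function, globally.
                _, biggest = max(inlined)
                self.not_inlinable.add(biggest)
            return None
        return [callee for _, callee in inlined]


tracer = ToyTracer(trace_limit=30)
getattr_calls = [('__getattr__', 8)] * 3  # made-up sizes

print(tracer.trace(12, getattr_calls))  # like b: too long, aborted
print(tracer.trace(12, getattr_calls))  # b retraced without inlining
print(tracer.trace(3, getattr_calls))   # like a: would fit, but no inlining
```

With a fresh tracer and the short body traced first, all three `__getattr__` calls are inlined, mirroring the first set of timings.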
It seems that disabling inlining has a global effect on all tracing paths. In
the cases above, inlining should stay enabled in `a` and be disabled only in
`b`, but in fact it is disabled in both `a` and `b`.
Is there any proposal that could help improve the performance?
_______________________________________________
pypy-issue mailing list
[email protected]
https://mail.python.org/mailman/listinfo/pypy-issue