#32919: Micro-optimisations for lexing & parsing of templates.
------------------------------------------------+------------------------
               Reporter:  Keryn Knight          |          Owner:  nobody
                   Type:  Cleanup/optimization  |         Status:  new
              Component:  Template system       |        Version:  dev
               Severity:  Normal                |       Keywords:
           Triage Stage:  Unreviewed            |      Has patch:  0
    Needs documentation:  0                     |    Needs tests:  0
Patch needs improvement:  0                     |  Easy pickings:  0
                  UI/UX:  0                     |
------------------------------------------------+------------------------
 I'm bundling all of these proposed changes into one ticket because on
 their own they're all very small things and none of them move the needle
 individually. I do have a patchset for the changes I'm describing below
 which I'll tidy up and put into a PR for discussion.

 Outside of looking at the pstats output format, the only proper
 benchmarking I've done is with timeit ... on a laptop. And unfortunately
 the bits being discussed are all quick enough that it's easy to foresee
 the differences being within the margin of error, especially given the
 results can be affected by CPU throttling, or other tasks running
 concurrently (though I've tried to avoid that as much as possible). It's
 entirely possible that I accidentally got the baseline results during a
 throttled period, etc. So I'm going to -- possibly wrongly -- lean on the
 pstats output. But by all accounts, the theory should be sound, because
 it's "do less work"...

 Note that `DEBUG=False` for all of these, but it's irrelevant because I'm
 not going through the template loaders, so there's no cached Loader in
 play (and I'm just not getting the DebugLexer as a result). All of the
 work is based on instantiating a Template from a string, never getting as
 far as rendering it.

 == Baseline

 To get a baseline of where things are, I've read the contents of the
 `admin/change_list.html` into a variable, to get the following:
 {{{
 In [3]: %timeit -r5 -n5000 Template(changelist)
 2.08 ms ± 29.1 µs per loop (mean ± std. dev. of 5 runs, 5000 loops each)
 }}}
 To try and boost the signal-to-noise ratio over ms fluctuations, we'll
 also do:
 {{{
 def runtemplate(source=changelist, n=1):
     for x in range(n):
         t = Template(source)
 }}}
 and get the baseline for that:
 {{{
 In [2]: %timeit -r5 -n1 runtemplate(n=5000)
 10.5 s ± 335 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
 }}}

 From there, I've cprofiled (well, yappi actually) `runtemplate(n=1000)` to
 amortize any other setup/import/machinery costs and boost the lexing &
 parsing parts into the spotlight. I've cherry-picked the relevant things
 that I'm putting forth for change, and for the purpose of extracting the
 distinct `get_nodes_by_type` methods being run, copy-pasted the default
 `Node.get_nodes_by_type` (sigh) to each.
 I've added a "percentage of time" at the end of each line, calculated
 as `(tottime / overall time) * 100` (eg: `Lexer.create_token` is `(1.418 /
 30.018) * 100`) ... these are ideally going to go down in the general
 case, but with each subsequent change may go back up because something
 else changed position, so they're a guide more than anything.

 {{{
 6776637 function calls (6946213 primitive calls) in 30.018 seconds
 Ordered by: internal time
    ncalls       tottime  percall  cumtime  percall
 filename:lineno(function)
 1000/31000      2.410    0.002   24.801    0.001
 django/template/base.py:445(Parser.parse) (8%)
    160000       1.418    0.000    2.436    0.000
 django/template/base.py:367(Lexer.create_token) (4.7%)
    687159       1.292    0.000    1.293    0.000   builtins:0(isinstance)
 (4.3%)
 2000/56000      1.199    0.001    5.095    0.000
 django/template/base.py:962(NodeList.get_nodes_by_type) (4%)
    131000       1.100    0.000    1.835    0.000
 django/template/base.py:512(Parser.extend_nodelist) (3.7%)
      1000       1.047    0.001    4.206    0.004
 django/template/base.py:353(Lexer.tokenize) (3.5%)
     59000       0.911    0.000    3.220    0.000
 django/template/base.py:768(Variable.__init__) (3%)
    423338       0.853    0.000    0.853    0.000   __builtin__:0(<method
 'startswith' of 'str' objects>) (2.8%)
    140000       0.820    0.000    1.345    0.000
 django/template/base.py:972(TextNode.get_nodes_by_type) (2.7%)
 371343/371373   0.770    0.000    1.235    0.000   builtins:0(getattr)
 (2.6%)
     24000       0.141    0.000    0.231    0.000
 django/template/base.py:1024(VariableNode.get_nodes_by_type) (0.5%)
     10000       0.059    0.000    0.097    0.000
 django/templatetags/i18n.py:72(TranslateNode.get_nodes_by_type) (0.2%)
      6000       0.036    0.000    0.059    0.000
 django/template/defaulttags.py:434(URLNode.get_nodes_by_type) (0.1%)
      4000       0.024    0.000    0.039    0.000
 django/templatetags/static.py:95(StaticNode.get_nodes_by_type) (0.1%)
      2000       0.012    0.000    0.020    0.000
 django/template/defaulttags.py:361(LoadNode.get_nodes_by_type) (0%)
      2000       0.012    0.000    0.020    0.000
 django/template/library.py:186(SimpleNode.get_nodes_by_type) (0%)
      2000       0.012    0.000    0.020    0.000
 django/template/defaulttags.py:53(CsrfTokenNode.get_nodes_by_type) (0%)

 }}}
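 As an aside, an equivalent dump can be collected with stock
 cProfile/pstats; a minimal sketch, with the `Template(changelist)` body
 stubbed out so it's self-contained:

```python
import cProfile
import io
import pstats

def runtemplate(n=1):
    # Stand-in for the real body, which does Template(changelist)
    # on each iteration.
    for _ in range(n):
        pass

profiler = cProfile.Profile()
profiler.enable()
runtemplate(n=1000)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime").print_stats(10)  # top 10 by own time
print(stream.getvalue())
```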

 == After changes

 So that you don't need to read all the rest of this just to find out the
 difference, I'll front-load the final/ideal results. There's also an
 easier-to-read summary of the changes at the end.

 {{{
 5682638 function calls (5852221 primitive calls) in 25.186 seconds
 Ordered by: internal time
    ncalls     tottime  percall  cumtime  percall filename:lineno(function)
 1000/31000      2.000    0.002   21.065    0.001
 django/template/base.py:446(Parser.parse) (7.9%)
 2000/56000      1.160    0.001    3.711    0.000
 django/template/base.py:966(NodeList.get_nodes_by_type) (4.6%)
    556159       1.014    0.000    1.015    0.000 builtins:0(isinstance)
 (4%)
      1000       1.007    0.001    3.088    0.003
 django/template/base.py:353(Lexer.tokenize) (4%)
    160000       0.892    0.000    1.387    0.000
 django/template/base.py:367(Lexer.create_token) (3.5%)
    131000       0.815    0.000    1.292    0.000
 django/template/base.py:514(Parser.extend_nodelist) (3.2%)
     59000       0.748    0.000    3.081    0.000
 django/template/base.py:770(Variable.__init__) (2.9%)
    140000       0.528    0.000    0.778    0.000
 django/template/base.py:976(TextNode.get_nodes_by_type) (2.1%)
 181343/181373   0.342    0.000    0.343    0.000 builtins:0(getattr)
 (1.4%)
    107338       0.222    0.000    0.222    0.000 __builtin__:0(<method
 'startswith' of 'str' objects>) (0.9%)
     14000       0.142    0.000    0.243    0.000
 django/template/defaulttags.py:280(IfNode.__init__) (0.5%)
     51000       0.101    0.000    0.101    0.000
 django/template/defaulttags.py:288(IfNode.__iter__) (0.4%)
     24000       0.091    0.000    0.134    0.000
 django/template/base.py:1028(VariableNode.get_nodes_by_type) (0.4%)
     10000       0.038    0.000    0.056    0.000
 django/templatetags/i18n.py:72(TranslateNode.get_nodes_by_type) (0.2%)
      ... other get_nodes_by_type are all the same minuscule parts ...
 }}}

 and the timeit output:
 {{{
 In [3]: %timeit -r5 -n5000 Template(changelist)
 1.82 ms ± 20 µs per loop (mean ± std. dev. of 5 runs, 5000 loops each)
 In [4]: %timeit -r5 -n1 runtemplate(n=5000)
 9.38 s ± 186 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
 }}}

 == Proposed changes

 And now I'm going to describe the various bits, hopefully maybe possibly
 convince you with stats, and ideally avoid adding in much in the way of
 extra complexity.
 From here on out, the percentages listed are the difference from the
 ''previous'' execution, rather than the baseline.

 === Optimising Lexer.create_token

 Currently the lexer has many repeated calls to
 `token_string.startswith(...)` to determine what kind of token to make.
 Hoisting the first 2 chars as `token_start = token_string[0:2]` and then
 doing `token_start == ...` is notionally faster because it's the hot loop
 for lexing:

 {{{
     ncalls   tottime  percall  cumtime  percall filename:lineno(function)
    160000      0.892    0.000    1.380    0.000
 django/template/base.py:367(Lexer.create_token) (4.7% -> 3.2%)
    165338      0.331    0.000    0.331    0.000 __builtin__:0(<method
 'startswith' of 'str' objects> (2.8% -> 1.2%)
 }}}

 Obviously indexing into a string isn't free, it's just not trackable as a
 method call, but `startswith` has the extra ergonomics of accepting a
 `tuple` of values to check, which we don't need. The lexer already nods
 to micro-optimisation being desirable, with the comment: `We could do
 len(BLOCK_TAG_START) to be more "correct", but we've hard-coded the 2s
 here for performance. And it's not like the TAG_START values are going to
 change anytime, anyway.` way back in
 234b7e372000381657250fe133cbcfa90ffbb2a9, so I think this is a safe
 proposal for the same reasons.

 === Optimising Parser.parse

 The method currently carries the comment `Use the raw values here for
 TokenType.* for a tiny performance boost.` We can micro-optimise it
 further, because for each of the 3 if branches the opcodes are:
 {{{
 34 LOAD_FAST    3 (token)
 36 LOAD_ATTR    3 (token_type)
 38 LOAD_ATTR    4 (value)
 40 LOAD_CONST   2 (0 or 1 or 2)
 42 COMPARE_OP   2 (==)
 }}}
 By setting `token_type = token.token_type.value` and then re-using that
 variable, it becomes:
 {{{
 34 LOAD_FAST    3 (token)
 36 LOAD_ATTR    3 (token_type)
 38 LOAD_ATTR    4 (value)
 40 STORE_FAST   4 (token_type)
 }}}
 followed by, for each comparison:
 {{{
 42 LOAD_FAST    4 (token_type)
 44 LOAD_CONST   2 (0 or 1 or 2)
 46 COMPARE_OP   2 (==)
 }}}

 which makes a small difference, at least when instrumented, because again
 it's the hot loop:
 {{{
     ncalls   tottime  percall  cumtime  percall filename:lineno(function)
 1000/31000     1.894    0.002   22.146    0.001
 django/template/base.py:446(Parser.parse) (8% -> 7.2%)
 }}}
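 In miniature, and hedging that the real `Parser.parse` is far larger, the
 shape of the change is just:

```python
from enum import Enum

class TokenType(Enum):
    TEXT = 0
    VAR = 1
    BLOCK = 2
    COMMENT = 3

class Token:
    def __init__(self, token_type, contents):
        self.token_type = token_type
        self.contents = contents

def handle(token):
    # One attribute-chain lookup up front; each branch below is then a
    # LOAD_FAST + LOAD_CONST + COMPARE_OP rather than repeating the
    # two LOAD_ATTRs per comparison.
    token_type = token.token_type.value
    if token_type == 0:  # TokenType.TEXT
        return "text"
    elif token_type == 1:  # TokenType.VAR
        return "var"
    elif token_type == 2:  # TokenType.BLOCK
        return "block"
```

 Running `dis.dis(handle)` shows roughly the pattern above, though the
 exact opcodes vary by Python version.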


 === Removing isinstance from Parser.extend_nodelist

 Currently the method includes the line `if isinstance(nodelist, NodeList)
 and not isinstance(node, TextNode):`, but everywhere the method is used
 it's re-using `nodelist = NodeList()`, and I'm fairly certain that a 3rd
 party using the method and passing something which isn't a `NodeList`
 would error further on due to missing extension methods. So the first
 check is ''always'' true.

 {{{
 django3.pstat
    ncalls   tottime  percall  cumtime  percall filename:lineno(function)
    556159     0.997    0.000    0.998    0.000 builtins:0(isinstance)
 (4.7% -> 3.9%)
    131000     0.793    0.000    1.263    0.000
 django/template/base.py:514(Parser.extend_nodelist) (4.0% -> 3.1%)
 }}}
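 A hedged sketch of the simplified method (the real one does more, e.g.
 debug-token bookkeeping; only the relevant guard is shown here):

```python
class TextNode:
    def __init__(self, s):
        self.s = s

class NodeList(list):
    contains_nontext = False

def extend_nodelist(nodelist, node, token=None):
    # The isinstance(nodelist, NodeList) half of the guard is gone:
    # every call site inside the parser constructs NodeList() itself.
    if not isinstance(node, TextNode):
        nodelist.contains_nontext = True
    nodelist.append(node)
```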

 === Opting nodes out of having a nodelist

 Slightly less sure of this one; it's sound in terms of passing the test
 suite, and makes sense from what I know about Nodes/NodeLists and their
 nature and how/when they're shared...
 All nodes currently inherit `child_nodelists = ('nodelist',)` if none is
 given, but AFAICT a bunch of nodes can't & don't actually have that
 attribute. This default definition causes `get_nodes_by_type` to run and
 execute the following:
 {{{
     nodelist = getattr(self, attr, None)
     if nodelist: ...
 }}}
 By declaring `child_nodelists = ()` for those where it's not possible,
 those lines can be avoided. The `Node` subclasses I've noted and adapted
 then behave like this:
 {{{
 django4.pstat
    ncalls       tottime  percall  cumtime  percall
 filename:lineno(function)
 2000/56000      1.139    0.001    4.148    0.000
 django/template/base.py:964(NodeList.get_nodes_by_type) (4.5%, different
 implementation)
 12000/70000     0.633    0.000    3.870    0.000
 django/template/base.py:938(BlockNode.get_nodes_by_type) (2.5%,
 unaffected/untouched)
 181343/181373   0.391    0.000    0.837    0.000 builtins:0(getattr) (2.8%
 -> 1.5%)
    140000       0.519    0.000    0.770    0.000
 django/template/base.py:974(TextNode.get_nodes_by_type) (3.0% -> 2.1%)
     24000       0.089    0.000    0.132    0.000
 django/template/base.py:1026(VariableNode.get_nodes_by_type) (0.5% ->
 0.4%)
     10000       0.038    0.000    0.056    0.000
 django/templatetags/i18n.py:72(TranslateNode.get_nodes_by_type) (0.2%)
      ... others are minuscule or same ...
 }}}

 Assuming my understanding is correct (it may not be!), there are almost
 certainly other nodes which can omit the implicit child nodelist; these
 are just the ones I have accounted for, but they're also probably the
 lion's share of the cost anyway (with TextNode being the highest).
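 Sketching the mechanism with a cut-down `Node` (the `get_nodes_by_type`
 body is paraphrased from `django/template/base.py`; treat it as
 illustrative):

```python
class Node:
    # Default: assume one child nodelist stored on self.nodelist.
    child_nodelists = ("nodelist",)

    def get_nodes_by_type(self, nodetype):
        nodes = []
        if isinstance(self, nodetype):
            nodes.append(self)
        for attr in self.child_nodelists:
            nodelist = getattr(self, attr, None)  # skipped when () is set
            if nodelist:
                nodes.extend(nodelist.get_nodes_by_type(nodetype))
        return nodes

class NodeList(list):
    def get_nodes_by_type(self, nodetype):
        nodes = []
        for node in self:
            nodes.extend(node.get_nodes_by_type(nodetype))
        return nodes

class TextNode(Node):
    # Leaf node: never has children, so opt out of the getattr probe.
    child_nodelists = ()
```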

 === Optimising Variable.__init__

 As with the micro-optimisations to `Lexer.create_token`, hoisting the
 first and last characters of the given string into variables and avoiding
 `startswith` yields some small improvements after the variable node init:
 {{{
 django5.pstat
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     59000    0.773    0.000    3.013    0.000
 django/template/base.py:770(Variable.__init__) (3.6% -> 2.9%)
    107338    0.228    0.000    0.228    0.000 __builtin__:0(<method
 'startswith' of 'str' objects>) (1.3% -> 0.9%)
 }}}
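 The shape of the change, as a hypothetical helper (the real
 `Variable.__init__` does considerably more, e.g. handling `_(...)`
 translation markers and numeric literals):

```python
def quoted_literal(var):
    # Hoist the first and last characters once, instead of calling
    # var.startswith(...)/var.endswith(...) for each quote style.
    if len(var) >= 2:
        first, last = var[0], var[-1]
        if first == last and first in ('"', "'"):
            return var[1:-1]  # the unquoted string literal
    return None
```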

 == Summary of profiled changes

 Remember these timings are intentionally inflated by compiling the same
 `str` into a `Template` 1000 times...
 First using Yappi, which is what I was using throughout, precisely because
 it's ''slower'' and more accurate WRT asyncio etc.

 === Yappi percentage changes

 Based solely on the work done, rather than secondary effects:

 {{{
 Parser.parse:           8%   -> 7.9%
 Lexer.create_token:     4.7% -> 3.5%
 isinstance:             4.3% -> 4%
 Parser.extend_nodelist: 3.7% -> 3.2%
 Variable.__init__:      3%   -> 2.9%
 str.startswith:         2.8% -> 0.9%
 getattr:                2.6% -> 1.4%
 Node.get_nodes_by_type: 6.1% -> 5.4%
 }}}

 === Yappi timing changes:

 More important than the swapping around of percentages, based on the own
 time (tottime):
 {{{
 Parser.parse:           2.41s -> 2s
 Lexer.create_token:     1.41s -> 0.9s
 isinstance:             1.29s -> 1s
 Parser.extend_nodelist: 1.1s  -> 0.81s
 Variable.__init__:      0.9s  -> 0.75s
 str.startswith:         0.85s -> 0.22s
 getattr:                0.77s -> 0.34s
 get_nodes_by_type:      1.80s -> 1.35s
 }}}

 === cProfile percentage changes

 {{{
 Parser.parse:           9%   -> 8.5%
 Lexer.create_token:     4.1% -> 3.8%
 isinstance:             1.9% -> 1.8%
 Parser.extend_nodelist: 2.8% -> 2.5%
 Variable.__init__:      3%   -> 2.9%
 str.startswith:         2.1% -> 0.8%
 getattr:                1.3% -> 0.7%
 Node.get_nodes_by_type: 3.2% -> 2.8%
 }}}

 === cProfile timing changes:
 {{{
 Total time:             4.148s -> 3.833s

 Parser.parse:           370ms -> 320ms
 Lexer.create_token:     170ms -> 140ms
 isinstance:             77ms  -> 69ms
 Parser.extend_nodelist: 118ms -> 97ms
 Variable.__init__:      240ms -> 230ms
 str.startswith:         86ms  -> 29ms
 getattr:                52ms  -> 27ms
 get_nodes_by_type:      131ms -> 107ms

 Numbers should all be correct, though I'll admit I've re-written and tried
 to edit this for clarity a couple of times, and it got long enough that I
 could've pasted a wrong number somewhere. I've also rebased from `main`
 once previously, so ... yeah. Scientific method may be out of the window.

-- 
Ticket URL: <https://code.djangoproject.com/ticket/32919>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
