Re: [luatex] runtime performance of lua(la)tex incl. gprof

Taco Hoekwater Wed, 21 Jun 2023 03:04:29 -0700

> On 21 Jun 2023, at 11:18, Axel Kittenberger <[email protected]> wrote:
> 
> Hi guys,
> 
> Thanks a lot for your replies, it was an interesting read. I pondered a bit 
> what could be done and how much is unavoidable fate due to 32bit/truetype etc.

The sparse arrays are 64-bit data values in a 21-bit address space split into 
three 7bit segments.

A full non-sparse version of one of the arrays would occupy 4MB of memory. That 
is not a lot these days, but besides a dozen or so static ones, each of the 
loaded font instances has one of those for the glyph indices; there are 
potentially 32k catcode tables of one each; and there are potentially 16k 
hyphenation codes, also of one each.

That adds up, even for a somewhat ’normal’ smallish run. Full Unicode support 
is quite expensive in terms of memory, we found. 

If even a relatively simple TeX run needs some 100s of Mbs of memory, that 
introduces it own slowdown due to L2/L3 cache misses, operating system disk 
cache shrinkage,  and possibly even process memory swapping. You do not see 
that in the low-level profile counts, but it becomes visible when using 
high-level running time “time” tests.

That is why there is a sparse array implementation in the first place. Not a 
quick as flat access, but faster than btree while using a lot less memory. 

Using dummy data doesn’t help I believe, because you then need to check the 
memory pointer against the dummy pointer instead of against NULL. That should 
not be helpful. 

However, forcing head->tree != NULL is nice indeed. 

Best wishes,
Taco

> 
> For example one of the larger chunks in the profile is get_sa_item() 
> (relating to 32bit and sparse arrays). Albeit the profile wasn't as lopsided 
> as I hoped.
> 
> The current implementation is:
> ---
> sa_tree_item get_sa_item(const sa_tree head, const int n)
> {
>     if (head->tree != NULL) {
>         register int h = HIGHPART_PART(n);
>         if (head->tree[h] != NULL) {
>             register int m = MIDPART_PART(n);
>             if (head->tree[h][m] != NULL) {
>                 return head->tree[h][m][LOWPART_PART(n)];
>             }
>         }
>     }
>     return head->dflt;
> }
> ---
> Since someone put in those "register" directives I believe that person 
> already did something similar as I and pointed it out as one bottleneck. 
> However that register directive very likely does absolutely nothing for most 
> compilers as they would optimize it anyway.
> 
> At first I wondered about the dereferencing of the same pointers multiple 
> times. The first experiment was to store the pointers in variables instead of 
> repeating the same array operations. But I must say the effect was zero. The 
> compiler certainly sees these non-volatile pointers cannot change inbetween 
> and optimizes that anyhow.
> 
> However one thing where we indeed can help the compiler is by allowing it the 
> option to static inline that function. This improves the runtime already a 
> bit (when compiled with -O3). On the other hand it likely increases the 
> binary a bit, that is the typical tradeoff.
> 
> Another starting point is that first if. Looking at new_sa_tree() the 
> likelihood of tree to be NULL is very unlikely. Actually I think there may 
> only be one way via undump_sa_tree() to ever be NULL, when undump(x) was 0.. 
> which I believe also cannot happen, as long all undumped trees have to be 
> dumped first. But I admit I have no idea what this whole dump API does.
> 
> For experimenting I changed the code in a way for tree to be always non-null. 
> I may easily miss some special cases though, but IMO a head without any tree 
> makes little sense. And while at it, reduce one deferencation by putting the 
> HIGHPART as array directly into head, since it is non null and non changing 
> for a given head anyway.
> 
> Also my previous post with the one-shot time measurements was a bit 
> unprecise, now comparing patch with n=20.
> 
> without patch
> 1: /disk/axel/texlive23-build/source/work2/texk/web2c/luatex liptest.tex
>             Mean        Std.Dev.    Min         Median      Max
> real        2.072       0.053       1.984       2.067       2.197      
> user        2.021       0.050       1.932       2.010       2.128      
> sys         0.050       0.012       0.028       0.048       0.068 
> 
> with patch:
> 
> ===> multitime results
> 1: /disk/axel/texlive23-build/source/work2/texk/web2c/luatex liptest.tex
>             Mean        Std.Dev.    Min         Median      Max
> real        1.948       0.021       1.917       1.943       2.012       
> user        1.899       0.023       1.867       1.895       1.956       
> sys         0.045       0.011       0.024       0.048       0.064
> 
> A 5% performance increase, IMO not bad considering it from the original 
> profile to be responsible for 8% of total runtime (thus 8% max improvement).
> 
> I think get_sa_item() could even be made totally branch less -- something 
> modern CPUs love -- by filling the HIGH and MID parts with a link to an array 
> with default values instead of NULLs (albeit ~512 Byte more memory required 
> for the default dummy trees). But it would require some more code adaptations 
> to test for dummy tree links everywhere where it currently tests for a NULL 
> pointer.
> 
> I admit though, I'm fiddling with a code base with no understanding of the 
> larger picture it fits in.
> 
> Kind regards, Axel
> 
> On Tue, Jun 20, 2023 at 3:11 PM Axel Kittenberger <[email protected]> wrote:
> Hello,
> 
> First let me say, I really don't want this to be a complaint. I'm just 
> wondering.
> 
> I considered switching my department's default compiler from pdflatex to 
> lualatex.
> Some subtle differences were to be expected and with test documents so far 
> easily catered for.
> The output is okay.
> 
> However what surprised me is a complicated test document which took ~150 
> seconds with pdflatex now takes 210 seconds with lualatex.
> 
> Trying to figure out if this is some of the many packages it does, I kept 
> simplifying.
> 
> --- laliptest.tex ---
> \documentclass{article}
> \input{plipsum}
> \begin{document}
> \newcount\ii
> \ii=100
> \loop
>   \lipsum{1-100}
>   \advance\ii-1
> \ifnum \ii>0
> \repeat
> \end{document}
> ---------
> 
> This most simple document doesn't use any package, but plipsum which can be 
> replaced with plain text too. Compile time results:
> 
> pdflatex: user 0m1.920s (3.1 MB result)
> lualatex: user 0m17.565s (3.8 MB result)
> 
> 8 times slower.
> 
> Versions tested with:
> pdfTeX 3.141592653-2.6-1.40.24 (TeX Live 2022/Debian)
> This is LuaHBTeX, Version 1.15.0 (TeX Live 2022/Debian)
> 
> Since LaTeX also includes a lot of stuff already, same tests with plain TeX.
> 
> --- liptest.tex ---
> \input{plipsum}
> \newcount\i
> \i=100
> \loop
>   \lipsum{1-100}
>   \advance\i-1
> \ifnum \i>0
> \repeat
> \end
> ---------
> pdftex: user 0m1.053s (2.9 MB result)
> luatex: user 0m1.943s (3.1 MB result)
> 
> This isn't as bad as the LaTeX variants, but still almost a factor two.
> Searching about this online turns up results about microtype or front loading 
> etc.
> Both cannot be an issue, since microtype is off and frontloading must be a 
> fixed offset, but the compile time increases linearly with document length.
> 
> This now took me a while, but I managed to compile luatex with "-gp" to 
> create a gprof profile and this is
> the result:
> ----------
> Flat profile:
> 
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  14.63      0.42     0.42  2409555     0.00     0.00  longest_match
>   8.71      0.67     0.25   295700     0.00     0.00  hnj_hyphen_hyphenate
>   8.19      0.91     0.24 52832741     0.00     0.00  get_sa_item
>   6.62      1.10     0.19      773     0.00     0.00  deflate_slow
>   3.48      1.20     0.10 30117352     0.00     0.00  char_info
>   2.79      1.28     0.08    10000     0.00     0.00  ext_do_line_break
>   2.79      1.36     0.08      773     0.00     0.00  compress_block
>   2.09      1.42     0.06  2978422     0.00     0.00  calc_pdfpos
>   2.09      1.48     0.06   515855     0.00     0.00  handle_lig_word
>   1.74      1.53     0.05 14032575     0.00     0.00  char_exists
>   1.74      1.58     0.05  4689611     0.00     0.00  flush_node
>   1.74      1.63     0.05  2896557     0.00     0.00  output_one_char
>   1.74      1.68     0.05   227877     0.00     0.00  hash_normalized
>   1.74      1.73     0.05    41510     0.00     0.00  hlist_out
>   1.74      1.78     0.05    23020     0.00     0.00  fix_node_list
>   1.74      1.83     0.05     2319     0.00     0.00  adler32_z
>   1.39      1.87     0.04   227877     0.00     0.00  hash_insert_normalized
>   1.39      1.91     0.04    39615     0.00     0.00  fm_scan_line
>   1.39      1.95     0.04    11510     0.00     0.00  hnj_hyphenation
>   1.05      1.98     0.03  3831639     0.00     0.00  get_x_token
>   1.05      2.01     0.03  2896557     0.00     0.00  get_charinfo_whd
>   1.05      2.04     0.03  2382502     0.00     0.00  add_kern_before
>   1.05      2.07     0.03   303962     0.00     0.00  luaS_hash
>   1.05      2.10     0.03    10000     0.00     0.00  ext_post_line_break
> -------
> So it's not like there is one function that takes the bulk of the slowdown as 
> I expected (and often
> happens in reality an innocent looking small thing takes so much)
> 
> longest_match() is something from zlib.
> 
> I'm just really surprised, I keep following this project for a while now, 
> since I consider it highly interesting and thought since I read one of the 
> major steps was rewriting the TeX core from somewhat idiosyncratic WEB to C, 
> I expected it to be even a bit faster...
> 
> And this is the profile of pdftex in comparison.
> ----------
> Flat profile:
> 
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  nam
>  29.48      0.51     0.51  2362906     0.00     0.00  longest_match
>  13.29      0.74     0.23  5876210     0.00     0.00  zdividescaled
>  11.56      0.94     0.20      775     0.00     0.00  deflate_slow
>   4.62      1.02     0.08    41510     0.00     0.00  pdfhlistout
>   4.62      1.10     0.08      774     0.00     0.00  compress_block
>   3.47      1.16     0.06        1     0.06     1.59  maincontrol
>   2.89      1.21     0.05      423     0.00     0.00  inflate_fast
>   2.31      1.25     0.04   227877     0.00     0.00  hash_insert_normalized
>   2.31      1.29     0.04    41510     0.00     0.00  zhpack
>   1.73      1.32     0.03 17821585     0.00     0.00  zeffectivechar
>   1.73      1.35     0.03   825830     0.00     0.00  zpdfprintint
>   1.73      1.38     0.03   260088     0.00     0.00  read_line
>   1.73      1.41     0.03   223361     0.00     0.00  pqdownheap
>   1.73      1.44     0.03    39615     0.00     0.00  fm_scan_line
>   1.45      1.47     0.03  1274937     0.00     0.00  zgetnode
>   1.16      1.49     0.02  2896157     0.00     0.00  zadvcharwidth
>   1.16      1.51     0.02   579800     0.00     0.00  ztrybreak
>   1.16      1.53     0.02   227877     0.00     0.00  hash_normalized
>   1.16      1.55     0.02    26742     0.00     0.00  zflushnodelist
>   0.87      1.56     0.02  1274936     0.00     0.00  zfreenode
>   0.58      1.57     0.01  4160738     0.00     0.00  getnext
>   0.58      1.58     0.01  3419912     0.00     0.00  zgetautokern
>   0.58      1.59     0.01  2896161     0.00     0.00  hasfmentry
>   0.58      1.60     0.01  2896159     0.00     0.00  isscalable
>   0.58      1.61     0.01  2896157     0.00     0.00  zpdfprintchar
> --------
> Both weren't exactly the same version as tested previously, I self compiled 
> the newest texlive tagged as release.
> (This is LuaTeX, Version 1.16.0 (TeX Live 2023))
> (pdfTeX 3.141592653-2.6-1.40.25 (TeX Live 2023))
> 
> Runtimes when compiled with -O3 are almost the same as the native debian 
> above, and I profiled the plain TeX variants only.
> 
> So zlib also takes a bulk, in relation even larger. So not the culprit.
> Different implementation of hyphenation seems to be one factor I'd "blame"
> 
> Turning it off with \language=255 improves it:
> 
> pdftex: user 0m1.029s
> luatex: user 0m1.596s
> 
> but there is still more.
> which is get_sa_item()/char_info().
> 
> And reading the comments managed-sa.c, it seems main the issue is being 
> sparse? So I guess the way to improve that would be to be not sparse?
> 
> Anyway, that was my report to this, unfortunately I'm holding off pushing it 
> as the new default compiler for us, since the slowdown is a bad sell for 
> something which only sometimes is userful.
> 
> PS: personally I use Lua to calculate some complexer drawing for tikz, as 
> using a "normal" programming
> language is much easier to me than doing more complicated pgf macros. But 
> also in the end it just generates .tex code, which I simply  feed into 
> pdflatex, it's only this gets complicated which files people ought to change 
> and which are autogenerated .tex files.
> 
> Kind regards, Axel
> <diff>

— 
Taco Hoekwater              E: [email protected]
genderfluid (all pronouns)
Re: [luatex] runtime performance of lua(la)tex incl. gprof

Reply via email to