> On 21 Jun 2023, at 11:18, Axel Kittenberger <axk...@gmail.com> wrote:
>
> Hi guys,
>
> Thanks a lot for your replies, it was an interesting read. I pondered a bit
> what could be done and how much is unavoidable fate due to 32bit/truetype etc.
The sparse arrays are 64-bit data values in a 21-bit address space split into
three 7bit segments.
A full non-sparse version of one of the arrays would occupy 4MB of memory. That
is not a lot these days, but besides a dozen or so static ones, each of the
loaded font instances has one of those for the glyph indices; there are
potentially 32k catcode tables of one each; and there are potentially 16k
hyphenation codes, also of one each.
That adds up, even for a somewhat ’normal’ smallish run. Full Unicode support
is quite expensive in terms of memory, we found.
If even a relatively simple TeX run needs some 100s of Mbs of memory, that
introduces it own slowdown due to L2/L3 cache misses, operating system disk
cache shrinkage, and possibly even process memory swapping. You do not see
that in the low-level profile counts, but it becomes visible when using
high-level running time “time” tests.
That is why there is a sparse array implementation in the first place. Not a
quick as flat access, but faster than btree while using a lot less memory.
Using dummy data doesn’t help I believe, because you then need to check the
memory pointer against the dummy pointer instead of against NULL. That should
not be helpful.
However, forcing head->tree != NULL is nice indeed.
Best wishes,
Taco
>
> For example one of the larger chunks in the profile is get_sa_item()
> (relating to 32bit and sparse arrays). Albeit the profile wasn't as lopsided
> as I hoped.
>
> The current implementation is:
> ---
> sa_tree_item get_sa_item(const sa_tree head, const int n)
> {
> if (head->tree != NULL) {
> register int h = HIGHPART_PART(n);
> if (head->tree[h] != NULL) {
> register int m = MIDPART_PART(n);
> if (head->tree[h][m] != NULL) {
> return head->tree[h][m][LOWPART_PART(n)];
> }
> }
> }
> return head->dflt;
> }
> ---
> Since someone put in those "register" directives I believe that person
> already did something similar as I and pointed it out as one bottleneck.
> However that register directive very likely does absolutely nothing for most
> compilers as they would optimize it anyway.
>
> At first I wondered about the dereferencing of the same pointers multiple
> times. The first experiment was to store the pointers in variables instead of
> repeating the same array operations. But I must say the effect was zero. The
> compiler certainly sees these non-volatile pointers cannot change inbetween
> and optimizes that anyhow.
>
> However one thing where we indeed can help the compiler is by allowing it the
> option to static inline that function. This improves the runtime already a
> bit (when compiled with -O3). On the other hand it likely increases the
> binary a bit, that is the typical tradeoff.
>
> Another starting point is that first if. Looking at new_sa_tree() the
> likelihood of tree to be NULL is very unlikely. Actually I think there may
> only be one way via undump_sa_tree() to ever be NULL, when undump(x) was 0..
> which I believe also cannot happen, as long all undumped trees have to be
> dumped first. But I admit I have no idea what this whole dump API does.
>
> For experimenting I changed the code in a way for tree to be always non-null.
> I may easily miss some special cases though, but IMO a head without any tree
> makes little sense. And while at it, reduce one deferencation by putting the
> HIGHPART as array directly into head, since it is non null and non changing
> for a given head anyway.
>
> Also my previous post with the one-shot time measurements was a bit
> unprecise, now comparing patch with n=20.
>
> without patch
> 1: /disk/axel/texlive23-build/source/work2/texk/web2c/luatex liptest.tex
> Mean Std.Dev. Min Median Max
> real 2.072 0.053 1.984 2.067 2.197
> user 2.021 0.050 1.932 2.010 2.128
> sys 0.050 0.012 0.028 0.048 0.068
>
> with patch:
>
> ===> multitime results
> 1: /disk/axel/texlive23-build/source/work2/texk/web2c/luatex liptest.tex
> Mean Std.Dev. Min Median Max
> real 1.948 0.021 1.917 1.943 2.012
> user 1.899 0.023 1.867 1.895 1.956
> sys 0.045 0.011 0.024 0.048 0.064
>
> A 5% performance increase, IMO not bad considering it from the original
> profile to be responsible for 8% of total runtime (thus 8% max improvement).
>
> I think get_sa_item() could even be made totally branch less -- something
> modern CPUs love -- by filling the HIGH and MID parts with a link to an array
> with default values instead of NULLs (albeit ~512 Byte more memory required
> for the default dummy trees). But it would require some more code adaptations
> to test for dummy tree links everywhere where it currently tests for a NULL
> pointer.
>
> I admit though, I'm fiddling with a code base with no understanding of the
> larger picture it fits in.
>
> Kind regards, Axel
>
> On Tue, Jun 20, 2023 at 3:11 PM Axel Kittenberger <axk...@gmail.com> wrote:
> Hello,
>
> First let me say, I really don't want this to be a complaint. I'm just
> wondering.
>
> I considered switching my department's default compiler from pdflatex to
> lualatex.
> Some subtle differences were to be expected and with test documents so far
> easily catered for.
> The output is okay.
>
> However what surprised me is a complicated test document which took ~150
> seconds with pdflatex now takes 210 seconds with lualatex.
>
> Trying to figure out if this is some of the many packages it does, I kept
> simplifying.
>
> --- laliptest.tex ---
> \documentclass{article}
> \input{plipsum}
> \begin{document}
> \newcount\ii
> \ii=100
> \loop
> \lipsum{1-100}
> \advance\ii-1
> \ifnum \ii>0
> \repeat
> \end{document}
> ---------
>
> This most simple document doesn't use any package, but plipsum which can be
> replaced with plain text too. Compile time results:
>
> pdflatex: user 0m1.920s (3.1 MB result)
> lualatex: user 0m17.565s (3.8 MB result)
>
> 8 times slower.
>
> Versions tested with:
> pdfTeX 3.141592653-2.6-1.40.24 (TeX Live 2022/Debian)
> This is LuaHBTeX, Version 1.15.0 (TeX Live 2022/Debian)
>
> Since LaTeX also includes a lot of stuff already, same tests with plain TeX.
>
> --- liptest.tex ---
> \input{plipsum}
> \newcount\i
> \i=100
> \loop
> \lipsum{1-100}
> \advance\i-1
> \ifnum \i>0
> \repeat
> \end
> ---------
> pdftex: user 0m1.053s (2.9 MB result)
> luatex: user 0m1.943s (3.1 MB result)
>
> This isn't as bad as the LaTeX variants, but still almost a factor two.
> Searching about this online turns up results about microtype or front loading
> etc.
> Both cannot be an issue, since microtype is off and frontloading must be a
> fixed offset, but the compile time increases linearly with document length.
>
> This now took me a while, but I managed to compile luatex with "-gp" to
> create a gprof profile and this is
> the result:
> ----------
> Flat profile:
>
> Each sample counts as 0.01 seconds.
> % cumulative self self total
> time seconds seconds calls s/call s/call name
> 14.63 0.42 0.42 2409555 0.00 0.00 longest_match
> 8.71 0.67 0.25 295700 0.00 0.00 hnj_hyphen_hyphenate
> 8.19 0.91 0.24 52832741 0.00 0.00 get_sa_item
> 6.62 1.10 0.19 773 0.00 0.00 deflate_slow
> 3.48 1.20 0.10 30117352 0.00 0.00 char_info
> 2.79 1.28 0.08 10000 0.00 0.00 ext_do_line_break
> 2.79 1.36 0.08 773 0.00 0.00 compress_block
> 2.09 1.42 0.06 2978422 0.00 0.00 calc_pdfpos
> 2.09 1.48 0.06 515855 0.00 0.00 handle_lig_word
> 1.74 1.53 0.05 14032575 0.00 0.00 char_exists
> 1.74 1.58 0.05 4689611 0.00 0.00 flush_node
> 1.74 1.63 0.05 2896557 0.00 0.00 output_one_char
> 1.74 1.68 0.05 227877 0.00 0.00 hash_normalized
> 1.74 1.73 0.05 41510 0.00 0.00 hlist_out
> 1.74 1.78 0.05 23020 0.00 0.00 fix_node_list
> 1.74 1.83 0.05 2319 0.00 0.00 adler32_z
> 1.39 1.87 0.04 227877 0.00 0.00 hash_insert_normalized
> 1.39 1.91 0.04 39615 0.00 0.00 fm_scan_line
> 1.39 1.95 0.04 11510 0.00 0.00 hnj_hyphenation
> 1.05 1.98 0.03 3831639 0.00 0.00 get_x_token
> 1.05 2.01 0.03 2896557 0.00 0.00 get_charinfo_whd
> 1.05 2.04 0.03 2382502 0.00 0.00 add_kern_before
> 1.05 2.07 0.03 303962 0.00 0.00 luaS_hash
> 1.05 2.10 0.03 10000 0.00 0.00 ext_post_line_break
> -------
> So it's not like there is one function that takes the bulk of the slowdown as
> I expected (and often
> happens in reality an innocent looking small thing takes so much)
>
> longest_match() is something from zlib.
>
> I'm just really surprised, I keep following this project for a while now,
> since I consider it highly interesting and thought since I read one of the
> major steps was rewriting the TeX core from somewhat idiosyncratic WEB to C,
> I expected it to be even a bit faster...
>
> And this is the profile of pdftex in comparison.
> ----------
> Flat profile:
>
> Each sample counts as 0.01 seconds.
> % cumulative self self total
> time seconds seconds calls s/call s/call nam
> 29.48 0.51 0.51 2362906 0.00 0.00 longest_match
> 13.29 0.74 0.23 5876210 0.00 0.00 zdividescaled
> 11.56 0.94 0.20 775 0.00 0.00 deflate_slow
> 4.62 1.02 0.08 41510 0.00 0.00 pdfhlistout
> 4.62 1.10 0.08 774 0.00 0.00 compress_block
> 3.47 1.16 0.06 1 0.06 1.59 maincontrol
> 2.89 1.21 0.05 423 0.00 0.00 inflate_fast
> 2.31 1.25 0.04 227877 0.00 0.00 hash_insert_normalized
> 2.31 1.29 0.04 41510 0.00 0.00 zhpack
> 1.73 1.32 0.03 17821585 0.00 0.00 zeffectivechar
> 1.73 1.35 0.03 825830 0.00 0.00 zpdfprintint
> 1.73 1.38 0.03 260088 0.00 0.00 read_line
> 1.73 1.41 0.03 223361 0.00 0.00 pqdownheap
> 1.73 1.44 0.03 39615 0.00 0.00 fm_scan_line
> 1.45 1.47 0.03 1274937 0.00 0.00 zgetnode
> 1.16 1.49 0.02 2896157 0.00 0.00 zadvcharwidth
> 1.16 1.51 0.02 579800 0.00 0.00 ztrybreak
> 1.16 1.53 0.02 227877 0.00 0.00 hash_normalized
> 1.16 1.55 0.02 26742 0.00 0.00 zflushnodelist
> 0.87 1.56 0.02 1274936 0.00 0.00 zfreenode
> 0.58 1.57 0.01 4160738 0.00 0.00 getnext
> 0.58 1.58 0.01 3419912 0.00 0.00 zgetautokern
> 0.58 1.59 0.01 2896161 0.00 0.00 hasfmentry
> 0.58 1.60 0.01 2896159 0.00 0.00 isscalable
> 0.58 1.61 0.01 2896157 0.00 0.00 zpdfprintchar
> --------
> Both weren't exactly the same version as tested previously, I self compiled
> the newest texlive tagged as release.
> (This is LuaTeX, Version 1.16.0 (TeX Live 2023))
> (pdfTeX 3.141592653-2.6-1.40.25 (TeX Live 2023))
>
> Runtimes when compiled with -O3 are almost the same as the native debian
> above, and I profiled the plain TeX variants only.
>
> So zlib also takes a bulk, in relation even larger. So not the culprit.
> Different implementation of hyphenation seems to be one factor I'd "blame"
>
> Turning it off with \language=255 improves it:
>
> pdftex: user 0m1.029s
> luatex: user 0m1.596s
>
> but there is still more.
> which is get_sa_item()/char_info().
>
> And reading the comments managed-sa.c, it seems main the issue is being
> sparse? So I guess the way to improve that would be to be not sparse?
>
> Anyway, that was my report to this, unfortunately I'm holding off pushing it
> as the new default compiler for us, since the slowdown is a bad sell for
> something which only sometimes is userful.
>
> PS: personally I use Lua to calculate some complexer drawing for tikz, as
> using a "normal" programming
> language is much easier to me than doing more complicated pgf macros. But
> also in the end it just generates .tex code, which I simply feed into
> pdflatex, it's only this gets complicated which files people ought to change
> and which are autogenerated .tex files.
>
> Kind regards, Axel
> <diff>
—
Taco Hoekwater E: t...@bittext.nl
genderfluid (all pronouns)