Re: Poor org-babel-tangle-file performance with more than 100 trivial noweb-references

Tim Cross Tue, 25 Jan 2022 09:27:31 -0800


Ihor Radchenko <[email protected]> writes:


> pareto optimal <[email protected]> writes:
>
>> Using =emacs -Q= to tangle org files with more than over 100 noweb-refs gets 
>> slow fast.
>>
>> Given this org code for N=2:
>> Using Gcc Emacs 28.0.91 (which I typicall use) I get these results:
>>
>> | N blocks | runtime | # of GCs |
>> |----------+---------+----------|
>> |       10 |   0.027 |        0 |
>> |       20 |   0.049 |        0 |
>> |       40 |   0.121 |        0 |
>> |       80 |   0.391 |        0 |
>> |      160 |   1.426 |        0 |
>> |      320 |   6.765 |        0 |
>> |      640 |  23.972 |        0 |
>>
>> so roughly it scales like:
>>
>> #+begin_example
>> 0.8x (10-20)
>> 2.5x  (20-40)
>> 3x  (40-80)
>> 3.5x  (160-320)
>> 4x  (320-640)
>> #+end_example
>>
>> Though I'm not sure how much that can tell us... my guess is noweb-ref's 
>> don't use a dictionary with =O(1)= lookups?
>
> Thanks for the nice benchmark and for providing a test case!
>
> You are using stable version of Org. We did some important performance
> improvements on main. The same benchmark using latest main (26decec):
>
> | N blocks | runtime | # of GCs |
> |----------+---------+----------|
> |       10 |   0.030 |        0 |
> |       20 |   0.067 |        0 |
> |       40 |   0.065 |        0 |
> |       80 |   0.081 |        0 |
> |      160 |   0.214 |        0 |
> |      320 |   0.597 |        0 |
> |      640 |   2.225 |        0 |
> |     1280 |   9.221 |        0 |
> |     2560 |  36.966 |        0 |
>
> And with disabled self-verification code:
> (setq org-element--cache-self-verify nil)
> | N blocks | runtime | # of GCs |
> |----------+---------+----------|
> |       10 |   0.078 |        0 |
> |       20 |   0.075 |        0 |
> |       40 |   0.063 |        0 |
> |       80 |   0.085 |        0 |
> |      160 |   0.095 |        0 |
> |      320 |   0.178 |        0 |
> |      640 |   0.311 |        0 |
> |     1280 |   0.819 |        0 |
> |     2560 |   2.302 |        0 |
> |     5120 |   8.878 |        0 |
> |    10240 |  32.706 |        0 |
>
> Graphical comparison:
>
>
>
>
> There is ~80x improvement.
>
> However, the scaling is still not quite linear:
>
>
>
>
> So, there is still some power-law nonlinearity in the tangle code.
>
> Benchmarking revealed the following:
>         8259  95%               - org-babel-tangle-file
>         8259  95%                - org-babel-tangle
>         8170  94%                 - org-babel-tangle-collect-blocks
>         3620  41%                  - org-in-archived-heading-p
>         3600  41%                     org-before-first-heading-p
>
> Showing that nonlinear scaling comes from regexp search.
>
> Caching org-before-first-heading-p improves the performance further (see
> Org 9.6-dev no self-verify + patch curves):
>
> | N blocks | runtime | # of GCs |
> |----------+---------+----------|
> |       10 |   0.118 |        0 |
> |       20 |   0.106 |        0 |
> |       40 |   0.136 |        0 |
> |       80 |   0.157 |        0 |
> |      160 |   0.139 |        0 |
> |      320 |   0.212 |        0 |
> |      640 |   0.542 |        0 |
> |     1280 |   0.797 |        0 |
> |     2560 |   1.533 |        0 |
> |     5120 |   4.651 |        0 |
> |    10240 |  16.745 |        0 |
>
> The profiling gives:
>
>        16817  63%                    - org-babel-tangle-file
>        16280  61%                     - org-babel-tangle
>        16200  61%                      - org-babel-tangle-collect-blocks
>         1360   5%                       + org-babel-tangle-single-block
>         1210   4%                       + org-babel-get-src-block-info
>
> Most of the CPU time is spent in org-babel-tangle-collect-blocks, which
> is basically another regexp search for all the source blocks in buffer.
> The scaling is still slightly non-linear - maybe your source block
> regexp is too complex:
>
> (defvar org-babel-src-block-regexp
>   (concat
>    ;; (1) indentation                 (2) lang
>    "^\\([ \t]*\\)#\\+begin_src[ \t]+\\([^ \f\t\n\r\v]+\\)[ \t]*"
>    ;; (3) switches
>    "\\([^\":\n]*\"[^\"\n*]*\"[^\":\n]*\\|[^\":\n]*\\)"
>    ;; (4) header arguments
>    "\\([^\n]*\\)\n"
>    ;; (5) body
>    "\\([^\000]*?\n\\)??[ \t]*#\\+end_src")
>   "Regexp used to identify code blocks.")
>
> On the other hand, the test data is already within quite non-realistic
> over 5000 number of blocks. Probably, further performance improvement is
> not very useful (we already have some overheads at smaller N).
>
> Best,
> Ihor
>
> -----
> The Org file used to generate plots:
>
> #+name: nocache
> | N blocks | runtime | # of GCs |
>
> |----------+---------+----------|
> |       10 |   0.027 |        0 |
> |       20 |   0.049 |        0 |
> |       40 |   0.121 |        0 |
> |       80 |   0.391 |        0 |
> |      160 |   1.426 |        0 |
> |      320 |   6.765 |        0 |
> |      640 |  23.972 |        0 |
>
> #+name: cache
> | N blocks | runtime | # of GCs |
>
> |----------+---------+----------|
> |       10 |   0.030 |        0 |
> |       20 |   0.067 |        0 |
> |       40 |   0.065 |        0 |
> |       80 |   0.081 |        0 |
> |      160 |   0.214 |        0 |
> |      320 |   0.597 |        0 |
> |      640 |   2.225 |        0 |
> |     1280 |   9.221 |        0 |
> |     2560 |  36.966 |        0 |
>
> #+name: cache-no-self
> | N blocks | runtime | # of GCs |
>
> |----------+---------+----------|
> |       10 |   0.078 |        0 |
> |       20 |   0.075 |        0 |
> |       40 |   0.063 |        0 |
> |       80 |   0.085 |        0 |
> |      160 |   0.095 |        0 |
> |      320 |   0.178 |        0 |
> |      640 |   0.311 |        0 |
> |     1280 |   0.819 |        0 |
> |     2560 |   2.302 |        0 |
> |     5120 |   8.878 |        0 |
> |    10240 |  32.706 |        0 |
>
> #+name: cache-no-self+fix
> | N blocks | runtime | # of GCs |
>
> |----------+---------+----------|
> |       10 |   0.118 |        0 |
> |       20 |   0.106 |        0 |
> |       40 |   0.136 |        0 |
> |       80 |   0.157 |        0 |
> |      160 |   0.139 |        0 |
> |      320 |   0.212 |        0 |
> |      640 |   0.542 |        0 |
> |     1280 |   0.797 |        0 |
> |     2560 |   1.533 |        0 |
> |     5120 |   4.651 |        0 |
> |    10240 |  16.745 |        0 |
>
> #+begin_src gnuplot :var d1=nocache :var d2=cache :var d3=cache-no-self :var 
> d4=cache-no-self+fix :file benchmark1.png
> set title 'Tangle code performance timing'
> US='u 1:2 w lp ps 2'
> set xlabel "N blocks"
> set ylabel "Time, sec"
> set key top right
> plot d1 @US t'Org 9.5.2 stable', d2 @US t'Org 9.6-dev', d3 @US t'Org 9.6-dev 
> no self-verify', d4 @US t'Org 9.6-dev no self-verify + patch'
> #+end_src
>
>
> #+RESULTS[edd2a2d5d80b31876917faee349ed71ba0fe239a]:
> [[file:/home/yantar92/.data/52/0930af-75ae-4d88-ae6a-d8dde39ecc72/benchmark1.png]]
>
> #+begin_src gnuplot :var d1=nocache :var d2=cache :var d3=cache-no-self :var 
> d4=cache-no-self+fix :file benchmark2.png
> set title 'Tangle code performance scaling (normalized)'
> US='w lp ps 2'
> set xlabel "N blocks"
> set ylabel "Time, normalized by time at N=640"
> set key top left
> set xrange [0:6000]
> plot d2 u 1:($2/2.225) @US t'Org 9.6-dev', d3 u 1:($2/0.311) @US t'Org 
> 9.6-dev no self-verify', d1 u 1:($2/23.972) @US t'Org 9.5.2 stable', d4 u 
> 1:($2/0.542) @US t'Org 9.6-dev no self-verify + patch', x*x/870000 t'x^2', 
> #+end_src
>
> #+RESULTS[b170ef78ac377f6d2ad1c2eabb20cd62dc0ea33f]:
> [[file:/home/yantar92/.data/52/0930af-75ae-4d88-ae6a-d8dde39ecc72/benchmark2.png]]

This is nice work. Well done on all you and others have been doing along these
lines to improve performance.

Re: Poor org-babel-tangle-file performance with more than 100 trivial noweb-references

Reply via email to