Ivan Krasilnikov wrote:
> Recently I was using vim to edit large, multi-gigabyte text files and
> I noticed that writing a file back to disk takes significantly longer
> time than reading it.
>
> Here's the amount of time it took vim (v7.3.081) on my machine (24G RAM)
> to open a file with N 20-byte long lines and execute :wq command with
> default options and disabled swap files via -n flag. Also shown is
> the number of entries inserted into mf_hash table (explained below).
>
> N real user sys items in mf_hash at peak
> 1000000 (19Mb) 0.58 0.33 0.16 5967
> 2000000 (38Mb) 1.02 0.57 0.19 11931
> 4000000 (76Mb) 2.33 1.48 0.39 23860
> 8000000 (153Mb) 5.48 3.31 1.23 47716
> 16000000 (305Mb) 13.79 9.77 1.40 95429
> 32000000 (610Mb) 66.10 59.55 2.64 190855
> 64000000 (1.2Gb) 329.30 311.52 7.45 381707
> 128000000 (2.4Gb) 1366.13 1224.87 17.96 763410
>
> Just reading a large file without writing it back is much faster:
>
> N real user sys
> 1000000 0.23 0.17 0.05
> 2000000 0.44 0.31 0.12
> 4000000 0.88 0.70 0.17
> 8000000 1.72 1.35 0.36
> 16000000 4.13 2.69 1.41
> 32000000 7.22 5.58 1.60
> 64000000 14.62 11.11 3.44
> 128000000 27.49 22.13 5.24
>
> Profiler shows that much of the time during write is spent in
> mf_find_hash(),
> which implements a fixed 64-bucket chained hash table mf_hash, and which is
> called once for every line written. This number of buckets is inadequate for
> the number of items inserted into this hash table for large files.
>
> Attached is a patch which makes mf_hash a dynamically growing hashtable
> with a fixed maximum load factor. Here's the runtime of my write test
> case with this patch:
>
> max load factor = 1 max load factor = 64
> N real user sys real user sys
> 1000000 0.46 0.25 0.11 0.52 0.34 0.08
> 2000000 0.98 0.50 0.24 0.93 0.56 0.18
> 4000000 1.98 1.06 0.39 1.81 1.08 0.42
> 8000000 4.08 2.17 0.71 3.97 2.25 0.81
> 16000000 8.24 4.29 1.47 7.83 4.74 1.53
> 32000000 16.85 8.60 3.05 16.78 9.45 3.26
> 64000000 32.62 17.16 5.55 32.98 19.08 5.94
> 128000000 66.71 34.38 11.27 66.51 38.09 12.40
>
> Other values of load factor up to ~ 256 seem also good to me.
It looks good, but this is quite a big change to an important part of
the code. We need a few good tests for this.
The most tricky function is mf_hash_grow(). It's only invoked for big
files. Well, a buffer with lots of lines, the lines can be short. With
a script adding lines with the line number, and then checking that the
right number can be read back, should cover the basic case.
The requirement to have the "next" and "previous" pointers in a specific
position isn't very clean. Perhaps it's not too inconvenient to
use mf_hashitem_T in those places and access its members through a few
macros (with type casts).
An alternative is to do everything with bhdr_T and use a type cast for
NR_TRANS only. That simplifies things, although it still has the
requirement that the two structures start with those three specific
items.
For code layout: please put { in a separate line in a few places.
Keep the lines shorter than 80 characters.
--
Be thankful to be in a traffic jam, because it means you own a car.
/// Bram Moolenaar -- [email protected] -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///
--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php