On Wed, Apr 16, 2008 at 10:53:04PM +0200, Stefan Behnel wrote:
> Hi,

Hi Stefan,
> long mail, bottom line being: 30% to multiple times faster parsing with a
> different hash function in dict.c. Keep reading.
>
> I did some profiling on lxml using callgrind. lxml configures the parser to
> use a global per-thread dictionary, so everything it parses ends up in one
> dictionary instead of tons of calls to malloc(). That by itself is quite an
> impressive performance improvement.

I think that's one of the dangers of the approach taken in lxml of reusing
the same dictionary for all parsing (at least within a thread): the
dictionary can only grow, and this leads to false performance expectations
which may just blow up once your module has run for a few million documents.
The hash function is just a small factor in the performance; reusing the
same dict over and over should IMHO be done only if you are parsing a single
kind of document, basically working on a fixed-size vocabulary. I don't see
how to fix this problem in general for an unbounded vocabulary.

I'm not sure from the description whether you used the xmlDict for anything
outside of what libxml2 uses it for, i.e. markup, indentation, and tiny
strings. I hope you didn't try to use it for the content, because that would
indeed be a performance disaster in practice: as random strings get added to
the dict, you would lose all the performance gain of the dictionary.

> What I found in callgrind was that a huge part of the overall time is spent
> in xmlDictLookup. And that this time is almost completely wasted on calls
> to memcmp(), i.e. in walking through the content of one hash bucket and
> comparing strings.
>
> Next, I did some testing on the current hash function (against
> /usr/share/dict/words). For small dict sizes (64, 128, 256), the
> distribution is fine. However, it seems that the distribution degrades
> with increasing size of the hash table.
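This kind of bucket-distribution measurement can be reproduced with a small
standalone harness. A minimal sketch follows; note the rotating hash is a
toy stand-in (not libxml2's actual xmlDictComputeKey) and the keys are
generated rather than read from /usr/share/dict/words:

```c
#include <stdio.h>

/* Toy rotating string hash, only illustrative of the simple hashes
 * under discussion; NOT libxml2's actual xmlDictComputeKey(). */
static unsigned toy_hash(const char *s) {
    unsigned h = 0;
    while (*s)
        h = h ^ ((h << 5) + (h >> 3) + (unsigned char)*s++);
    return h;
}

/* Hash `n` generated keys into `nbuckets` chains and return how many
 * chains end up longer than `limit` entries (the "bad bucket" count). */
static int long_chains(unsigned nbuckets, int n, int limit) {
    int counts[8192] = {0};   /* assumes nbuckets <= 8192 */
    int i, over = 0;
    char key[32];

    for (i = 0; i < n; i++) {
        snprintf(key, sizeof key, "tag%d", i);  /* generated tag names */
        counts[toy_hash(key) % nbuckets]++;
    }
    for (i = 0; i < (int)nbuckets; i++)
        if (counts[i] > limit)
            over++;
    return over;
}
```

Printing `long_chains()` for growing table sizes over the same key set is
enough to see whether a hash degrades as the table grows, which is the
effect described above.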
> Given a dict size of 8192, for example, the "words" file leads to 80% of
> the buckets being larger than 4, but with an almost linearly increasing
> distribution. To make things worse, a larger dict size is encouraged by
> the size maintenance algorithm, which increases the size when it has to
> look through more than 4 values sequentially (if I understood that
> correctly).

Yes, to avoid degenerating into linear lookups, the dictionary size
increases if it finds a clash list longer than a predefined size (might be
4, I can't remember right now). This is costly, but since in general you
reuse most of the words found in the markup, the experiments showed it was
worth it. For the profiling I used real-world XML with very large markup,
like the TEI (and, for a smaller set, DocBook) DTDs, and used callgrind too
at the time (around 2.5.0).

> According to the profiling data, there appears to be a 50% chance of
> having to grow the dictionary when a new entry gets added to the dict in
> xmlDictLookup().
>
> A little web search offers a couple of completely unreadable hash
> functions that come with some obviously biased benchmarks. :) However,
> this one
>
>   http://www.azillionmonkeys.com/qed/hash.html
>
> is quite short and readable and seems to do what I was looking for. In a
> (non representative) benchmark, I get a speedup of a factor of 7 in lxml
> when parsing from memory ("bench_XML" benchmark in lxml's bench_etree.py
> suite). That's from over 300 msecs down to 40!

I would prefer to see benchmarks done with xmllint directly, to avoid side
effects of lxml interning more strings than libxml2 does.

> However, that benchmark uses generated tag names, so it's not necessarily
> representative for an average XML file. More realistic benchmarks, like
> parsing and XPath searching an XML file containing the old testament, run
> between 2% and 30% faster. Also, the number of cases where the dictionary
> size is checked for an increase (i.e.
> where there were more than 4 entries in a bucket) drop quite dramatically
> according to callgrind, from 49% to below 2%. Hmmm, I have difficulties in
> believing those numbers myself...
>
> I attached a patch against libxml2 2.6.32 that replaces the current
> xmlDictComputeKey() implementation with the SuperFastHash() function from
> the web page.
>
> Note that this is not a production ready patch, not even a "ready for
> inclusion" patch. It is just meant to let others give it a try to see if
> they get similar results. The problem with hash functions is that they may
> work great for some input but worse for others. This hash function is
> slower than the current one, actually a lot slower. But the achieved hash
> distribution seems to be so much better that it wins the contest by a long
> way. And there may still be space left for improvements before the final
> inclusion.
>
> So I would really like to get some feedback from others.

I'm not against changing the hash function if one can prove it works well
in the libxml2 context and not just the lxml one ;-) . Basically I'm afraid
of changes which would try to fix an abuse of the dictionary but might
penalize the normal libxml2 users. If you tried to intern all strings
passed to the user in lxml, I really think you need to revisit this,
because that's untenable; I hope I misunderstood (very probable ;-) .

And I'm all for improving the hash function. I wrote it at the time after
finding that a number of 'classic' algorithms actually had abominable
behaviours in libxml2's use; it's certainly not perfect. I just want the
evaluation to be done in a pure libxml2 context; as long as there is no
degradation, I'm fine with this. Of course, running the full regression
tests for libxml2 and libxslt without problems is important before any
patch is accepted too, but hash function changes should only affect
performance.
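One way to evaluate a candidate hash in isolation is to compare the worst
collision chain each function produces over the same key set, since chain
length is exactly what the resize policy keys on. In the sketch below,
FNV-1a is only a stand-in for "some well-mixed candidate" (the actual patch
used SuperFastHash) and the keys are generated:

```c
#include <stdio.h>

/* FNV-1a, used here purely as a stand-in candidate replacement hash;
 * the patch under discussion used SuperFastHash instead. */
static unsigned fnv1a(const char *s) {
    unsigned h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Longest collision chain `hash` produces over `n` generated keys in a
 * table of `nbuckets` slots. Run it with both the old and the candidate
 * hash over identical keys to compare their distributions. */
static int max_chain(unsigned (*hash)(const char *),
                     unsigned nbuckets, int n) {
    int counts[8192] = {0};   /* assumes nbuckets <= 8192 */
    int i, max = 0;
    char key[32];

    for (i = 0; i < n; i++) {
        snprintf(key, sizeof key, "entry-%d", i);
        counts[hash(key) % nbuckets]++;
    }
    for (i = 0; i < (int)nbuckets; i++)
        if (counts[i] > max)
            max = counts[i];
    return max;
}
```

A lower `max_chain` for the candidate over representative markup names
means fewer memcmp() calls per lookup and fewer forced resizes; that still
has to be confirmed against real documents with xmllint, as argued above.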
Since you seem to be interested in the performance of the hash algorithm: I
tried to drop the string comparisons on lookup when possible. I have an old
patch for this which I'm enclosing, but I never applied it since I had
problems at the time (can't remember why/where, it's just an FYI patch ;-)

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library http://libvirt.org/
[EMAIL PROTECTED]    | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

_______________________________________________
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml
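The idea behind that FYI patch, skipping the string comparison on lookup
when possible, amounts to storing each entry's full hash value (and length)
next to the string and calling memcmp() only when those cheap fields
already match. A minimal sketch, with hypothetical types rather than the
real xmlDict structures:

```c
#include <string.h>

/* Hypothetical interned-string entry; NOT the actual xmlDict layout.
 * The full hash is remembered at insertion time so that lookups can
 * reject most non-matching bucket entries without touching the bytes. */
typedef struct {
    const char *str;
    unsigned    hash;   /* full (unmasked) hash stored at insertion */
    unsigned    len;
} entry_t;

static int entry_matches(const entry_t *e, const char *s,
                         unsigned hash, unsigned len) {
    if (e->hash != hash || e->len != len)
        return 0;                       /* cheap rejects, no memcmp() */
    return memcmp(e->str, s, len) == 0; /* rare: confirm real equality */
}
```

Since the bucket index is the hash modulo the table size, entries in the
same bucket usually still differ in their full hash, so the memcmp() only
runs for genuine matches (or full 32-bit collisions), at the cost of one
extra word per entry.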