On Wed, Apr 16, 2008 at 10:53:04PM +0200, Stefan Behnel wrote:
> Hi,

  Hi Stefan,

> long mail, bottom line being: 30% to multiple times faster parsing with a
> different hash function in dict.c. Keep reading.
> 
> I did some profiling on lxml using callgrind. lxml configures the parser to
> use a global per-thread dictionary, so everything it parses ends up in one
> dictionary instead of tons of calls to malloc(). That by itself is quite an
> impressive performance improvement.
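
  For reference, the interning contract being exploited there is just the
public xmlDict API; a minimal sketch (plain libxml2, nothing lxml-specific;
compile with e.g. gcc demo.c `xml2-config --cflags --libs`):

    #include <stdio.h>
    #include <libxml/dict.h>
    #include <libxml/xmlstring.h>

    int main(void) {
        xmlDictPtr dict = xmlDictCreate();
        /* Repeated lookups of the same string return the same interned
         * pointer, so no per-occurrence malloc() is needed. */
        const xmlChar *a = xmlDictLookup(dict, BAD_CAST "para", -1);
        const xmlChar *b = xmlDictLookup(dict, BAD_CAST "para", -1);
        printf("same pointer: %s\n", (a == b) ? "yes" : "no");
        printf("entries: %d\n", xmlDictSize(dict));
        xmlDictFree(dict);
        return 0;
    }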

  I think this is one of the dangers of the approach taken in lxml of
reusing the same dictionary for all parsing (at least within a thread): the
dictionary can only grow, which leads to false expectations about
performance that may just blow up once your module has run for a few
million documents. The hash function is only a small factor in overall
performance; reusing the same dict over and over should IMHO be done only
if you are parsing a single kind of document, i.e. working with a
fixed-size vocabulary.

  I don't see how to fix this problem in general for an infinite vocabulary.
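
  To make that concrete, a sketch (hypothetical driver, not lxml code) that
feeds an "infinite vocabulary" into a dict and watches it grow without
bound:

    #include <stdio.h>
    #include <libxml/dict.h>
    #include <libxml/xmlstring.h>

    int main(void) {
        xmlDictPtr dict = xmlDictCreate();
        char name[32];
        int i;

        /* Every "document" contributes names never seen before, and the
         * dict has no eviction, so it only ever gets bigger. */
        for (i = 1; i <= 1000000; i++) {
            snprintf(name, sizeof(name), "elem-%d", i);
            xmlDictLookup(dict, BAD_CAST name, -1);
            if (i % 200000 == 0)
                printf("%8d lookups -> %8d entries\n", i, xmlDictSize(dict));
        }
        xmlDictFree(dict);
        return 0;
    }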

  I'm not sure from the description whether you used the xmlDict for
anything outside of what libxml2 uses it for, i.e. markup, indentation, and
tiny strings. I hope you didn't try to use it for content, because that
would indeed be a performance disaster in practice: as random strings get
added to the dict, you would lose all the performance gain of the
dictionary.
  
> What I found in callgrind was that a huge part of the overall time is spent in
> xmlDictLookup. And that this time is almost completely wasted on calls to
> memcmp(), i.e. in walking through the content of one hash bucket and comparing
> strings.
> 
> Next, I did some testing on the current hash function (against
> /usr/share/dict/words). For small dict sizes (64,128,256), the distribution is
> fine. However, it seems that the distribution degrades with increasing size of
> the hash table. Given a dict size of 8192, for example, the "words" file leads
> to 80% of the buckets being larger than 4, but with an almost linearly
> increasing distribution. To make things worse, a larger dict size is
> encouraged by the size maintenance algorithm, which increases the size when it
> has to look through more than 4 values sequentially (if I understood that
> correctly).
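
  (A harness along these lines makes that distribution test easy to
reproduce; the hash below is FNV-1a as a stand-in, NOT the dict.c function,
so plug in whichever candidate you want to measure. Feed it
/usr/share/dict/words on stdin:)

    #include <stdio.h>
    #include <string.h>

    #define SIZE 8192   /* dict size from the test described above */

    /* Stand-in hash (FNV-1a); replace with the function under test. */
    static unsigned int hash(const char *s) {
        unsigned int h = 2166136261u;
        while (*s != '\0') {
            h ^= (unsigned char) *s++;
            h *= 16777619u;
        }
        return h;
    }

    int main(void) {
        static int buckets[SIZE];
        char word[256];
        int i, words = 0, used = 0, over4 = 0, max = 0;

        while (fgets(word, sizeof(word), stdin) != NULL) {
            word[strcspn(word, "\n")] = '\0';   /* strip the newline */
            buckets[hash(word) % SIZE]++;
            words++;
        }
        for (i = 0; i < SIZE; i++) {
            if (buckets[i] > 0) used++;
            if (buckets[i] > 4) over4++;
            if (buckets[i] > max) max = buckets[i];
        }
        printf("%d words: %d/%d buckets used, %d chains over 4, longest %d\n",
               words, used, SIZE, over4, max);
        return 0;
    }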

  Yes, to avoid degrading into linear lookup time, the dictionary size
increases when it finds a clash list longer than a predefined size (it
might be 4, I can't remember right now). This is costly, but since in
general you reuse most of the words found in the markup, experiments showed
it to be worth it; the check has roughly the shape sketched below.
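
  (Illustrative only: the threshold is a guess and this is not the actual
dict.c source.)

    #define MAX_HASH_LEN 4   /* the "predefined size"; possibly 4 */

    typedef struct entry {
        struct entry *next;   /* clash list link; key fields omitted */
    } entry;

    /* If this returns 1 the caller doubles the table size and rehashes
     * every entry -- that rehash is the costly part. */
    static int chain_too_long(const entry *bucket) {
        int len = 0;
        const entry *e;

        for (e = bucket; e != NULL; e = e->next)
            if (++len > MAX_HASH_LEN)
                return 1;
        return 0;
    }
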
  For the profiling I used real-world XML with very large markup
vocabularies, like the TEI DTD (and DocBook for a smaller set), and used
callgrind too at the time (around 2.5.0).

> According to the profiling data, there appears to be a 50% chance
> of having to grow the dictionary when a new entry gets added to the dict in
> xmlDictLookup().
> 
> A little web search offers a couple of completely unreadable hash functions
> that come with some obviously biased benchmarks. :) However, this one
> 
> http://www.azillionmonkeys.com/qed/hash.html
> 
> is quite short and readable and seems to do what I was looking for. In a (non
> representative) benchmark, I get a speedup of a factor of 7 in lxml when
> parsing from memory ("bench_XML" benchmark in lxml's bench_etree.py suite).
> That's from over 300 msecs down to 40!

  I would prefer to see benchmarks done with xmllint directly, to avoid
the side effects of lxml doing more string interning than libxml2 itself
does.
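
  For example (both options exist in xmllint; adjust the input file):

    xmllint --noout --timing big.xml
    xmllint --noout --timing --repeat big.xml

(--repeat parses the document 100 times, which smooths out the numbers.)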

> However, that benchmark uses generated tag names, so it's not necessarily
> representative for an average XML file. More realistic benchmarks, like
> parsing and XPath searching an XML file containing the old testament, run
> between 2% and 30% faster. Also, the number of cases where the dictionary size
> is checked for an increase (i.e. where there were more than 4 entries in a
> bucket) drop quite dramatically according to callgrind, from 49% to below 2%.
> Hmmm, I have difficulties in believing those numbers myself...
> 
> I attached a patch against libxml2 2.6.32 that replaces the current
> xmlDictComputeKey() implementation with the SuperFastHash() function from the
> web page.
> 
> Note that this is not a production ready patch, not even a "ready for
> inclusion" patch. It is just meant to let others give it a try to see if they
> get similar results. The problem with hash functions is that they may work
> great for some input but worse for others. This hash function is slower than
> the current one, actually a lot slower. But the achieved hash distribution
> seems to be so much better that it wins the contest by a long way. And there
> may still be space left for improvements before the final inclusion.
> 
> So I would really like to get some feedback from others.
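
  For anyone wanting to try other candidates, the shape of such a swap is
simple (FNV-1a standing in below; this is neither the SuperFastHash code
nor the original dict.c function, and the helper's signature is assumed):

    /* Drop-in shape for the hash helper in dict.c; signature assumed,
     * body is an FNV-1a stand-in, not the patch under discussion. */
    static unsigned long
    xmlDictComputeKey(const xmlChar *name, int namelen) {
        unsigned long h = 2166136261UL;
        int i;

        for (i = 0; i < namelen; i++) {
            h ^= name[i];
            h *= 16777619UL;
        }
        return h;
    }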

  I'm not against changing the hash function if one can prove it works
well in the libxml2 context and not just the lxml one ;-) . Basically I'm
afraid of changes which try to fix an abuse of the dictionary but end up
penalizing normal libxml2 users.
  If you tried to intern all strings passed to the user in lxml, I really
think you need to revisit this, because that's untenable; I hope I
misunderstood (very probable ;-) . And I'm all for improving the hash
function: I wrote it at the time after finding that a number of 'classic'
algorithms actually had abominable behaviour in libxml2 use, so it's
certainly not perfect. I just want the evaluation to be done in a pure
libxml2 context; as long as there is no degradation, I'm fine with this.
Of course, running the full regression tests for libxml2 and libxslt
without problems is also important before any patch is accepted, but hash
function changes should only affect performance.
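
  For the record, that means something like (assuming the usual "make
tests" targets in both source trees):

    cd libxml2 && make tests
    cd ../libxslt && make tests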

  Since you seem to be interested in the performance of the hash
algorithm: I once tried to drop the string comparisons on lookup where
possible. I have an old patch for this which I'm enclosing, but I never
applied it since I ran into problems at the time (I can't remember why or
where, so it's just a FYI patch ;-).
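
  The idea, in a nutshell: keep the full (pre-modulo) hash in each entry so
that an integer compare rejects almost every non-match before memcmp() is
ever called. A self-contained toy version of the technique, not the
enclosed patch itself:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SIZE 256   /* toy fixed-size table; real code grows it */

    typedef struct entry {
        struct entry *next;
        unsigned long okey;   /* full hash, before % SIZE */
        size_t len;
        char *name;
    } entry;

    static entry *table[SIZE];

    static unsigned long hash(const char *s, size_t len) {
        unsigned long h = 2166136261UL;   /* FNV-1a stand-in */
        size_t i;
        for (i = 0; i < len; i++) {
            h ^= (unsigned char) s[i];
            h *= 16777619UL;
        }
        return h;
    }

    /* Lookup-or-insert: the (okey, len) compare filters the clash list
     * so memcmp() runs only on a probable hit. */
    static const char *intern(const char *name) {
        size_t len = strlen(name);
        unsigned long okey = hash(name, len);
        entry *e;

        for (e = table[okey % SIZE]; e != NULL; e = e->next) {
            if (e->okey == okey && e->len == len &&
                memcmp(e->name, name, len) == 0)
                return e->name;
        }
        e = malloc(sizeof(*e));
        if (e == NULL) abort();
        e->okey = okey;
        e->len = len;
        e->name = strdup(name);
        e->next = table[okey % SIZE];
        table[okey % SIZE] = e;
        return e->name;
    }

    int main(void) {
        const char *a = intern("para");
        const char *b = intern("para");
        printf("interned once: %s\n", (a == b) ? "yes" : "no");
        return 0;
    }

The trade-off is one extra word per entry, which is usually cheap next to
the interned strings themselves.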

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/