Re: [xml] Better hash function for dict.c

Stefan Behnel Thu, 17 Apr 2008 14:18:00 -0700

Hi Daniel,

Daniel Veillard wrote:
> On Wed, Apr 16, 2008 at 10:53:04PM +0200, Stefan Behnel wrote:
>> lxml configures the parser to
>> use a global per-thread dictionary, so everything it parses ends up in one
>> dictionary instead of tons of calls to malloc(). That by itself is quite an
>> impressive performance improvement.
> 
>   I think it's one of the dangers of the approach taken in lxml to reuse
> the same dictionary for all parsing (at least in a thread), as the
> dictionary can only grow and this leads to false expectations of performance
> which may just blow up when your module has run for a few million documents.


Even a few million documents rarely mean more than a couple of vocabularies.
Think of HTML engines, for example, which only have one vocabulary in total,
or even a corporate ESB, which maybe has a couple of hundred small to
medium-sized languages with redundant terminology. Although I wouldn't deny
that there might really be cases where the assumption of a sufficiently
bounded dict size does not hold, I've never seen one.
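
To make the bounded-vocabulary argument concrete, here is a minimal sketch of
string interning. It is a toy and not libxml2's dict.c: toy_dict, toy_intern,
toy_parse_many and toy_free are made-up names, the bucket count is fixed, and
the hash is a plain multiply-by-33 function chosen only for brevity. It shows
that the number of stored entries follows the number of distinct names, not
the number of documents parsed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_BUCKETS 64            /* fixed size, no resizing in this sketch */

typedef struct toy_entry {
    struct toy_entry *next;       /* next entry in the same bucket */
    char *name;                   /* the single shared copy of the string */
} toy_entry;

typedef struct {
    toy_entry *buckets[TOY_BUCKETS];
    size_t count;                 /* number of distinct names stored */
} toy_dict;

/* Illustrative hash only (assumption: not the function used by libxml2). */
unsigned long toy_hash(const char *s) {
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Return the shared copy of name, adding it the first time it is seen. */
const char *toy_intern(toy_dict *d, const char *name) {
    assert(d != NULL && name != NULL);
    size_t idx = toy_hash(name) % TOY_BUCKETS;
    for (toy_entry *e = d->buckets[idx]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->name;       /* already known: no allocation */

    toy_entry *e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    size_t len = strlen(name) + 1;
    e->name = malloc(len);
    if (e->name == NULL) {
        free(e);
        return NULL;
    }
    memcpy(e->name, name, len);
    e->next = d->buckets[idx];
    d->buckets[idx] = e;
    d->count++;
    return e->name;
}

/* Simulate parsing ndocs documents that all share one small vocabulary. */
size_t toy_parse_many(toy_dict *d, size_t ndocs) {
    static const char *vocab[] = { "html", "head", "body", "div", "p" };
    for (size_t i = 0; i < ndocs; i++)
        for (size_t j = 0; j < sizeof vocab / sizeof vocab[0]; j++)
            toy_intern(d, vocab[j]);
    return d->count;
}

/* Release every entry and reset the dictionary. */
void toy_free(toy_dict *d) {
    for (size_t i = 0; i < TOY_BUCKETS; i++) {
        toy_entry *e = d->buckets[i];
        while (e != NULL) {
            toy_entry *next = e->next;
            free(e->name);
            free(e);
            e = next;
        }
        d->buckets[i] = NULL;
    }
    d->count = 0;
}
```

With a zero-initialised toy_dict, calling toy_parse_many() for a thousand or a
hundred thousand documents leaves the same five entries in the table. Growth
only happens when a document brings names that have not been seen before,
which is the case Daniel is worried about.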


> I don't see how to fix this problem in general for an infinite vocabulary.

Any solution will just fail here.


> I'm not sure from the description if you used the xmlDict for anything
> outside of what libxml2 uses it for, i.e. markup, indentations, and tiny
> strings. I hope you didn't try to use it for the content

We (obviously) don't do that. It's not configured in any other way than by
libxml2.


>> So I would really like to get some feedback from others.
> 
>   I'm not against changing the hash function if one can prove it works well
> in the libxml2 context and not just lxml one ;-) . Basically I'm afraid of
> changes which would try to fix an abuse of the dictionary but might penalize
> the normal libxml2 users.
>   And I'm all for improving the hash function, I wrote it
> at the time after finding a number of 'classic' algorithms had actually
> abominable behaviours in libxml2 use, it's certainly not perfect. I just want
> the evaluation to be done in a pure libxml2 context, as long as there is 
> no degradation, I'm fine with this. Of course running the full regression
> tests for libxml2 and libxslt without problems is important before any patch
> is accepted too, but hash function changes should only affect performance.

I agree. However, I don't have any other code available than lxml, that's why
I'm asking for feedback here. :)


>   Since you seem to be interested in the performance of the hash
> algorithm, I tried to drop the string comparisons on lookup when possible
> I have an old patch for this which I'm enclosing, but I never applied it
> since I had problems at the time (can't remember why/where, it's just 
> a FYI patch ;-)

Sure, I can give it a try and compare it to the other two.
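
For reference, one common way to avoid string comparisons on lookup is to
cache the full hash value in each entry and only compare bytes when the hash
and the length both match. The sketch below illustrates that general idea
under stated assumptions; it is not the enclosed patch, whose details are not
shown in this thread, and hc_entry, hc_hash, hc_find and hc_compare_calls are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct hc_entry {
    struct hc_entry *next;   /* next entry in the same bucket chain */
    const char *name;
    size_t len;
    unsigned long hash;      /* full hash, cached when the entry is created */
} hc_entry;

/* Counts how often memcmp() actually ran; for demonstration only. */
size_t hc_compare_calls = 0;

/* Illustrative hash only (assumption: not the function used by libxml2). */
unsigned long hc_hash(const char *s, size_t len) {
    unsigned long h = 5381;
    for (size_t i = 0; i < len; i++)
        h = h * 33 + (unsigned char)s[i];
    return h;
}

/* Walk one bucket chain; compare bytes only if hash and length agree. */
const hc_entry *hc_find(const hc_entry *chain, const char *name, size_t len) {
    assert(name != NULL);
    unsigned long h = hc_hash(name, len);
    for (const hc_entry *e = chain; e != NULL; e = e->next) {
        if (e->hash != h || e->len != len)
            continue;        /* cheap reject, no string comparison needed */
        hc_compare_calls++;
        if (memcmp(e->name, name, len) == 0)
            return e;
    }
    return NULL;
}
```

Entries that merely share a bucket are rejected by an integer comparison, so
the byte comparison is reserved for real matches and full hash collisions.
The cost is one extra word per entry.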

Stefan

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml
