Hi Daniel, Daniel Veillard wrote: > On Wed, Apr 16, 2008 at 10:53:04PM +0200, Stefan Behnel wrote: >> lxml configures the parser to >> use a global per-thread dictionary, so everything it parses ends up in one >> dictionary instead of tons of calls to malloc(). That by itself is quite an >> impressive performance improvement. > > I think it's one of the danger of the approach taken in lxml to reuse > the same dictionnary for all parsing (at least in a thread), as the > dictionary can only grow and this leads to false expectations of performances > which may just blow up when your module has run for a few million documents.
Even a few million documents rarely means more than a couple of vocabularies. Think of HTML engines, for example, which only have one vocabulary in total, or even a corporate ESB, which maybe has a couple of hundred small to medium sized languages with redundant terminology. Although I wouldn't deny that there might really be cases where the assumption of a sufficiently bounded dict size does not hold, I've never seen one. > I don't see how to fix this problem in general for an infinite vocabulary. Any solution will just fail here. > I'm not sure from the description if you used the xmlDict for anything out > side of what libxml2 uses it for, i.e. markup, indentations, and tiny strings. > I hope you didn't tried to use it for the content We (obviously) don't do that. It's not configured in any other way than by libxml2. >> So I would really like to get some feedback from others. > > I'm not against changing the hash function if one can prove it works well > in the libxml2 context and not just lxml one ;-) . Basically I'm afraid of > changes which would try to fix an abuse of the dictionary but might penalize > the normal libxml2 users. > And I'm all for improving the hash function, I wrote it > at the time after finding a number of 'classic' algorithms had actually > abominable behaviours in libxml2 use, it's certainly not perfect. I just want > the evaluation to be done in a pure libxml2 context, as long as there is > no degradation, I'm fine with this. Of course running the full regression > tests for libxml2 and libxslt without problems is important before any patch > is accepted too, but hash function chnages should only affect performances. I agree. However, I don't have any other code available than lxml, that's why I'm asking for feedback here. :) > Since you seems to be interested in the performances of the hash > algorithm, I tried to drop the string comparisons on lookup when possible > I have an old patch for this which I'm enclosing, but I never applied it > since I had problems at the time (can't remember why/where, it's just > a FYI patch ;-) Sure, I can give it a try and compare it to the other two. Stefan _______________________________________________ xml mailing list, project page http://xmlsoft.org/ xml@gnome.org http://mail.gnome.org/mailman/listinfo/xml