On Fri, Nov 04, 2011 at 11:32:19AM +0100, Stefan Behnel wrote:
> Hi,
> 
> almost exactly two years ago, I brought up the topic of the
> surprisingly unpredictable XPath performance on this list (thread
> titled "confusing xpath performance characteristics", without
> response at the time). The problem is not the actual search, but the
> merging of node sets after the fact. The effect is that a detailed
> expression like
> 
>      /xs:schema/xs:complexType//xs:element[@name="equity"]/@type
> 
> is several orders of magnitude slower than the less constrained expression
> 
>     //xs:element[@name="equity"]/@type
> 
> The problem here is that the evaluator finds several matches for
> "xs:complexType", searches each subtree, and then goes into merging
> the subresults and removing duplicates along the way, using a
> quadratic algorithm. The runtime of this approach quickly explodes
> with the number of nodes in the node set, especially since it can
> get applied several times while going up the expression tree.

  yeah I agree that's far from optimal. What I tried to do is speed
up the comparison of nodes if one calls xmlXPathOrderDocElems() on the
document before XPath or XSLT processing on it.
  It's definitely a known issue, c.f. comment on line 12055 of xpath.c
there are trivial cases where libxml2 will use
  xmlXPathNodeSetMergeAndClearNoDupls()
when we know we can't reach duplicates, when walking axis child,
attributes or namespaces, but for anything else one need more context
to be able to make the optimization.

> There are several surprising expressions where this shows, e.g.
> 
>     descendant-or-self::dl/descendant::dt
> 
> is very fast, whereas the semantically only slightly different
> 
>     descendant-or-self::dl/descendant-or-self::*/dt
> 
> is already quite a bit slower, but the equivalent
> 
>     descendant-or-self::dl//dt
> 
> is orders of magnitude slower. You can test them against the 4.7MB
> HTML5 spec page at
> 
> http://www.w3.org/TR/html5/Overview.html
> 
> The last approach takes literally hours, whereas the first two
> finish within the order of seconds. I ran this within callgrind, and
> the top function that takes 99.4% of the overall runtime is
> xmlXPathNodeSetMergeAndClear(), specifically the inner loop starting
> with the comment "skip duplicates".
> 
> There are two issues here. One is that the duplicate removal uses
> the easiest to implement, and unluckily also slowest algorithm. This
> is understandable because doing better is not trivial at all in C.
> However, the algorithm could be improved quite substantially, e.g.
> by using merge sort (even based on an arbitrary sort criteria like
> the node address). If eventual sorting in document order is
> required, a merge sort is the best thing to do anyway, as it could
> be applied recursively along the XPath evaluation tree, thus always
> yielding sorted node sets at each stage.
> 
> The second issue is that the duplicate removal is not necessary at
> all in a wide variety of important cases. As long as the
> subexpression on the right does not access any parents (which likely
> applies to >95% of the real world XPath expressions), i.e. the
> subresults originate from distinct subtrees, there can be no
> duplicates, and the subresults are automatically in document order.
> Thus, the merging will collapse into a simple concatenation. I admit
> that this case will sometimes be hard to detect because the left
> part of the expression may already have matched overlapping parts of
> the tree. But I think it is worth more than just a try.

  Completely agreed, but goal was to ensure correctness, and then
optimize, problem is that few optimizations were made.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
dan...@veillard.com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml

Reply via email to