On 15/07/2024 08:24, Bernhard Liebl wrote:
Hello,

As a rather inexperienced XQuery and BaseX user, I'm puzzled by the
timings of running the same query on two very similar XML files.

For an academic paper, I'm trying to benchmark finding certain
successive (windowed) <item> tags inside a <seq> tag. This is the full
query:

declare function local:check-sequence($items as element(item)*) as element(item)* {
    if ($items[1]/@x = $items[2]/@x and $items[2]/@x = $items[3]/@x) then
        if (sum($items/@y) < 0.5) then $items else ()
    else ()
};

for $seq in doc("data.xml")//seq
return
  <seq>{
    for $i in 1 to (count($seq/item) - 2)
    let $window := subsequence($seq/item, $i, 3)
    return <match>{local:check-sequence($window)}</match>
  }</seq>
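
For reference, the timings below are the query info that BaseX reports;
from the command line this can be obtained with something like the
following, assuming the query is saved as query.xq (the -V flag prints
detailed query information):

basex -V query.xq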

Here's the puzzling thing. When I run this on an XML file that has one
<seq> with 10000 <item> elements, i.e. <seqs><seq> <item>[*10000]
</seq></seqs> (a generator sketch for this shape follows the timings),
I get the following reasonable timings:

- Parsing: 0.63 ms
- Compiling: 0.95 ms
- Optimizing: 30.59 ms
- Evaluating: 11.48 ms
- Printing: 0.61 ms
- Total Time: 44.27 ms
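
To make that shape concrete: data like this can be generated with a
query roughly along the following lines. The @x and @y values here are
arbitrary placeholders and not the values from my actual file:

<seqs>
  <seq>{
    (: placeholder attribute values, only to illustrate the structure :)
    for $i in 1 to 10000
    return <item x="{$i mod 3}" y="0.0001"/>
  }</seq>
</seqs>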

However, once I run the same query on an XML file with two (!) <seq>
elements of 10000 items each, the runtime does not merely double, but is
now more than 50 times that of the first query:

- Parsing: 0.6 ms
- Compiling: 1.3 ms
- Optimizing: 48.13 ms
- Evaluating: 2843.67 ms
- Printing: 1.12 ms
- Total Time: 2894.82 ms

I don't see where I would be introducing any non-linear runtime; in
particular, check-sequence should take constant time per window. Am I
missing something fundamental here?

The result size of the second version is about twice that of the first
version.

Thanks in advance for any pointers.

You would make it a lot easier to reproduce the results if you provided
the sample data, perhaps in a repository.


The other issue that the description of a "window" raises is whether
you know about the XQuery 3.1 window clause
(https://www.w3.org/TR/xquery-31/#id-windows); perhaps that is a more
suitable way to approach the task. Whether it solves your performance
problem remains to be seen and would need the sample data to check.
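
Untested, since I don't have your data, but with a sliding window the
inner loop could look roughly like this (reusing your
local:check-sequence function; the "only end" condition restricts the
output to complete 3-item windows):

for $seq in doc("data.xml")//seq
return
  <seq>{
    for sliding window $w in $seq/item
        start at $s when true()
        only end at $e when $e - $s eq 2
    return <match>{local:check-sequence($w)}</match>
  }</seq>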
