Hi Andy,

On 06.04.2015 at 13:13, Andy Bunce wrote:
I have compared the timings for databases created from sources with and
without the -s option. In principle these should provide very similar
results.

xmlgen /f 0.2 /o folder/auction

xmlgen /f 0.2  /o folder/auction /s 400

In the first case a single file auction is created. In the second 35
files (auction00000... auction00034) are created.
In both cases the file or files are loaded to a single database to
query.  In most cases the query performance is very similar.
The exception is q09.xq. This appears to be 2 or 3 orders of magnitude
slower against the database created from the split sources.

The actual query result seems to be the same. I have seen the same
performance effect for factors f= 0.2, 0.5, 1 but not the somewhat
trivial f=0.

Any idea what is happening here?

This is a case of heuristics not quite working out every time. The query on the single file is so fast because it is rewritten to two nested index lookups:

for $p in db:open-pre("auction_full",0)/site/people/person
let $a :=
  for $t in db:attribute("auction_full", $p/@id)
      /self::person/parent::buyer/parent::closed_auction
  return element item {
    db:attribute("auction_full", $t/itemref/@item)/self::id/parent::item[
      parent::europe/parent::regions/parent::site/parent::document-node()
    ]/name/text()
  }
return element person { attribute name { $p/name/text() }, $a }

This rewriting only works if `$ca` as well as `$ei` are inlined into their respective `for` loops (which are then rewritten to XPath expressions and finally index lookups).
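For reference, the un-rewritten query has roughly the following shape. This is a sketch reconstructed from the optimized query above, not a copy of your q09.xq: the bindings of `$auction`, `$ca` and `$ei` and the tiny inline document are assumptions made so that it runs on its own.

```xquery
(: Tiny stand-in for the xmlgen output (an assumption for illustration);
   element and attribute names are taken from the rewritten query above. :)
let $auction := document {
  <site>
    <regions>
      <europe>
        <item id="item0"><name>clock</name></item>
      </europe>
    </regions>
    <people>
      <person id="person0"><name>Ann</name></person>
      <person id="person1"><name>Bob</name></person>
    </people>
    <closed_auctions>
      <closed_auction>
        <buyer person="person0"/>
        <itemref item="item0"/>
      </closed_auction>
    </closed_auctions>
  </site>
}
(: These two bindings are the ones that have to be inlined. :)
let $ca := $auction/site/closed_auctions/closed_auction
let $ei := $auction/site/regions/europe/item
for $p in $auction/site/people/person
let $a :=
  (: Without inlining, this scans all of $ca for every person ... :)
  for $t in $ca
  where $p/@id = $t/buyer/@person
  return element item {
    (: ... and all of $ei for every matching closed auction. :)
    (for $t2 in $ei
     where $t/itemref/@item = $t2/@id
     return $t2)/name/text()
  }
return element person { attribute name { $p/name/text() }, $a }
```

Once `$ca` and `$ei` are inlined, each `for`/`where` pair becomes a path with a comparison on an attribute value, and that is the form that can be turned into a `db:attribute` lookup. If they stay as variables, both inner loops remain plain scans.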

Since the expressions bound to those variables are not constants at compile time, inlining them into a loop (in this case `for $p in $auction/...`) will initially duplicate work, so it is not generally safe to do. Because of exactly the index rewritings we are talking about here, we still want to inline *cheap* axis paths. Cheapness is determined by the `Path#cheap()` method [1]. Currently it only accepts a path whose root is a single document node. A database built from the split sources contains one document node per file (35 in your case), so the paths are not considered cheap, they are not inlined, and the index rewriting never happens. That is the cause of the differences you are seeing.

While every heuristic can be tweaked (and I have no opinion on this specific one), there will always be cases like this where small changes lead to unexpectedly big differences in running time. The alternative is to just be consistently slow ;-).

Hope that helps,
  Leo

[1] https://github.com/BaseXdb/basex/blob/20dbe4c/basex-core/src/main/java/org/basex/query/expr/path/Path.java#L280-295
