Thanks for the explanation, Leo. So perhaps the surprise is that no other queries seem to be affected by this.
/Andy

On 6 April 2015 at 14:43, Leonard Wörteler <leonard.woerte...@uni-konstanz.de> wrote:

> Hi Andy,
>
> On 06.04.2015 at 13:13, Andy Bunce wrote:
>
>> I have compared the timings for databases created from sources with and
>> without the -s option. In principle these should provide very similar
>> results.
>>
>> xmlgen /f 0.2 /o folder/auction
>>
>> xmlgen /f 0.2 /o folder/auction /s 400
>>
>> In the first case a single file auction is created. In the second 35
>> files (auction00000... auction00034) are created.
>> In both cases the file or files are loaded to a single database to
>> query. In most cases the query performance is very similar.
>> The exception is q09.xq. This appears to be 2 or 3 orders of magnitude
>> slower against the database created from the split sources.
>>
>> The actual query result seems to be the same. I have seen the same
>> performance effect for factors f= 0.2, 0.5, 1 but not the somewhat
>> trivial f=0.
>>
>> Any idea what is happening here?
>>
>
> this is a case of heuristics not quite working out every time. The query
> working on a single file is so fast because it is rewritten to two nested
> index lookups:
>
>   for $p in db:open-pre("auction_full",0)/site/people/person
>   let $a :=
>     for $t in db:attribute("auction_full", $p/@id)
>       /self::person/parent::buyer/parent::closed_auction
>     return element item {
>       db:attribute("auction_full", $t/itemref/@item)/self::id/parent::item[
>         parent::europe/parent::regions/parent::site/parent::document-node()
>       ]/name/text()
>     }
>   return element person { attribute name { $p/name/text() }, $a }
>
> This rewriting only works if `$ca` as well as `$ei` are inlined into their
> respective `for` loops (which are then rewritten to XPath expressions and
> finally index lookups).
>
> Since the expressions bound to those variables are not constants at
> compile time, inlining them into a loop (in this case
> `for $p in $auction/...`) will initially duplicate work, so it is not
> generally safe to do. Because of exactly those index rewritings we are
> talking about here, we still want to inline *cheap* axis paths. The
> cheapness is determined in the `Path#cheap()` method [1]. Currently it only
> allows for a single document node as root node, which is the cause of the
> differences you are seeing.
>
> While every heuristic can be tweaked (and I have no opinion on this
> specific one), there will always be cases like this where small changes
> lead to unexpectedly big differences in running time. The alternative is to
> just be consistently slow ;-).
>
> Hope that helps,
> Leo
>
> [1] https://github.com/BaseXdb/basex/blob/20dbe4c/basex-core/src/main/java/org/basex/query/expr/path/Path.java#L280-295
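
To make the effect of the root check concrete, here is a toy Java model of the decision Leo describes. It is only a sketch under stated assumptions: `InliningSketch`, `cheapRoot` and `plan` are hypothetical names and this is not the code of `Path#cheap()`, which may check more than the root. The only rule taken from the thread is that a path counts as cheap when its root is a single document node.

```java
public class InliningSketch {

    /**
     * Hypothetical stand-in for the root check described in the thread:
     * a path is treated as cheap only if it starts from exactly one
     * document node. The real Path#cheap() may apply further conditions.
     */
    static boolean cheapRoot(int rootDocumentNodes) {
        return rootDocumentNodes == 1;
    }

    /**
     * Toy optimizer decision for a variable bound to an axis path:
     * cheap paths are inlined into the loop (enabling the index rewrite),
     * all others stay bound to the variable.
     */
    static String plan(int rootDocumentNodes) {
        return cheapRoot(rootDocumentNodes)
            ? "inline, then index lookup"
            : "keep variable, no index rewrite";
    }

    public static void main(String[] args) {
        // Database built from the single xmlgen output file: one document node.
        System.out.println("1 document:   " + plan(1));
        // Database built from the 35 split files: 35 document nodes.
        System.out.println("35 documents: " + plan(35));
    }
}
```

Under this model, the database built from one file takes the inlining path while the database built from the 35 split files does not, which matches the behaviour reported for q09.xq.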