Gioele, did you check in the execution plan that your query actually uses an index?
One way to force the use of the text index could be to start your query with:
    db:text('your-collection-name', 'arci')/parent::tei:orth/
and so on.
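To make that concrete, here is a rough, untested sketch of how the whole query might look when it starts from the text index. It assumes that the database was created with the text index enabled, that 'your-collection-name' is replaced by the real database name, and that each tei:orth holds a single text node (the index matches whole text nodes, whereas `. = "arci"` compares the string value of the element). The restriction to /tei:TEI/tei:text/tei:body is left out here.

```xquery
declare namespace tei = 'http://www.tei-c.org/ns/1.0';

(: 'your-collection-name' is a placeholder for the actual database name. :)
db:text('your-collection-name', 'arci')
  (: go from the indexed text node up to its tei:orth element and keep it
     only if the nearest xml:lang (on itself or an ancestor) starts with 'san' :)
  /parent::tei:orth[
    starts-with(ancestor-or-self::*[@xml:lang][1]/@xml:lang, 'san')
  ]
  /parent::tei:form
  (: finally climb to the enclosing tei:entry or tei:re :)
  /parent::*[self::tei:entry or self::tei:re]
```

Because the path starts from the index hits and only walks upwards, it should avoid the `//` scan entirely, but the execution plan is still worth checking.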
Regards,
-----Original Message-----
From: Fabrice Etanchaud
Sent: Friday, 12 June 2015 11:13
To: [email protected]
Subject: RE: [basex-talk] Optimization of a slow query with `//`
Hello Gioele,
I seem to remember that the use of namespaces slowed down (or maybe
invalidated) the structure index.
Someone @BaseX will certainly correct me if I am wrong, but if your data uses a
single namespace, what about reloading the data with the "skip namespaces"
option enabled and testing whether performance improves?
Another solution could be to create an index collection, where the keys would
be your search terms and the values the node pre or node id values of your
(sub-)documents.
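As a rough, untested sketch of that second idea: the query below builds a small lookup database that maps each tei:orth string to the node id of its enclosing entry. The database name 'orth-index' and the element and attribute names (index, hit, key, id) are made up for illustration, and the function names (db:open, db:node-id, db:create) are the ones I know from BaseX 8; they may be named differently in other versions.

```xquery
declare namespace tei = 'http://www.tei-c.org/ns/1.0';

(: Hypothetical lookup document: one <hit> per tei:orth,
   keyed by the string value of the tei:orth element. :)
let $index :=
  <index>{
    for $orth in db:open('your-collection-name')//tei:orth
    let $entry := $orth/parent::tei:form
                       /parent::*[self::tei:entry or self::tei:re]
    where exists($entry)
    return <hit key="{ string($orth) }" id="{ db:node-id($entry) }"/>
  }</index>
(: Store the lookup document in its own database. :)
return db:create('orth-index', $index, 'index.xml')
```

A lookup could then resolve the stored ids back to nodes, again assuming the BaseX 8 function name db:open-id:

```xquery
for $id in db:open('orth-index')//hit[@key = 'arci']/@id
return db:open-id('your-collection-name', xs:integer($id))
```

I would store node ids rather than pre values, since pre values can change when the main database is updated. The xml:lang test is not part of this index and would still have to be applied to the returned nodes.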
Best regards,
Fabrice
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On behalf of Gioele Barabucci
Sent: Friday, 12 June 2015 10:42
To: [email protected]
Subject: [basex-talk] Optimization of a slow query with `//`
Hello,
I am working on an application that retrieves its data from a TEI XML file via
BaseX. The following query lies at the core of this application but is too slow
to be used in production: on a modern PC it requires about 600 ms to run over a
4MB file (1/10 of the complete dataset). Any suggestion on how to improve its
performance (without changing the underlying TEI files) would be much
appreciated.
Here is the query:
    declare namespace tei='http://www.tei-c.org/ns/1.0';
    /tei:TEI/tei:text/tei:body//
      *[self::tei:entry or self::tei:re]
       [./tei:form/tei:orth[. = "arci"]
          [ancestor-or-self::*
             [@xml:lang][1]
             [(starts-with(@xml:lang, "san"))]
          ]
       ]
In human terms, it should return all the `tei:entry` or `tei:re` elements that
* have the word "arci" in their `tei:form/tei:orth` element, and
* whose nearest `xml:lang` attribute (looking upwards from that `tei:orth`)
  starts with 'san'.
I ran some tests and it turned out that the main culprit is the use of `//` in
the first line. (_Main_ culprit, not the only one...)
I use the `//` axis because I do not know what the structure of the
underlying TEI file. I expect BaseX to keep track of all the `tei:entry` and
`tei:re` elements and their parents, so selecting the correct ones should be
quite fast anyway. But the measurements disagree with my assumptions...
What could I do to improve the performance of this query?
Now, some remarks based on some small tests I have done:
1. Removing the
       [ancestor-or-self::*[....]]
   predicate cuts the run time in half, but the query is still way too slow.
2. Changing
       ./tei:form/tei:orth[. = "arci"]
   to
       ./tei:form[1]/tei:orth[1][. = "arci"]
   makes the query even slower.
3. Changing `starts-with(@xml:lang, "san")` to `@xml:lang = 'san-xxx'` has a
   negligible effect.
4. Dropping the `[1]` from
       [@xml:lang][1]
   makes the whole query twice as fast.
Regards,
--
Gioele Barabucci <[email protected]>