Dear all,

My name is Bram Vanroy, and I am an intern at the Centre for Computational Linguistics (CCL; http://www.arts.kuleuven.be/ling/ccl [Dutch]) at the University of Leuven. My supervisor, Vincent Vandeghinste, was in contact with this mailing list some time ago, more specifically with Dirk Kirsten.

My internship is titled "Fine-tuning the GrETEL Treebank Query Engine". GrETEL stands for Greedy Extraction of Trees for Empirical Linguistics and is available at http://gretel.ccl.kuleuven.be/gretel-2.0/. Its goal is to provide users with a fast, user-friendly online tool to search through text corpora backed by treebanks. Accessibility is an important point for us: users do not need to be proficient in any programming language, strict formalism, or treebank-specific annotation; every query can be executed through an intuitive graphical interface. More advanced users can use XPath to write the representation of the syntactic structure that they are looking for. BaseX is our tool of choice as a database for our corpora in XML format.
Initially, GrETEL provided access to smaller corpora such as CGN (9 million words) and Lassy Small (1 million words). We would like to expand the searchable corpora by also making the full SoNaR corpus available (500 million words). This is already partially possible in GrETEL 2.0, but for efficiency reasons its capabilities are restricted: users can search only one component at a time, and the largest component in the corpus is not available due to its size (15 million sentences). We have applied these restrictions because the search time for the whole corpus was too long, which would drastically reduce the tool's user-friendliness.

Steps have already been taken to improve search times in larger corpora. (See "Making a Large Treebank Searchable Online. The SoNaR Case." by Vincent Vandeghinste and Liesbeth Augustinus; http://nederbooms.ccl.kuleuven.be/documentation/LREC2014-GrETELSoNaR.pdf.) To spare you the effort of reading the whole article, I quote the passage most relevant to this email:

"The general idea behind our approach is to restrict the search space by splitting up the data in many small databases, allowing for faster retrieval of syntactic structures. We organise the data in databases that contain all bottom-up subtrees for which the two top levels (i.e. the root and its children) adhere to the same syntactic pattern. When querying the database for certain syntactic constructions, we know on which databases we have to apply the XPath query which would otherwise have to be applied on the whole data set. We have called this method GrETEL Indexing (GrInd)." (p. 17)

So, to optimise searching, the data has been pulled apart, in a sense, which should make the search space smaller and consequently the search time shorter. In the future we would like to apply this technique to parallel corpora as well.
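To make the grouping idea in the quoted passage more concrete, here is a minimal Python sketch. It is not the actual GrInd implementation. It assumes Alpino-style XML in which every constituent is a `node` element with a `cat` attribute (phrases) or a `pos` attribute (words). The helper names `label`, `pattern` and `grind` are hypothetical, as is the decision to ignore the order of the children when building the key.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def label(node):
    # Assumption: phrasal nodes carry 'cat', leaf nodes carry 'pos'.
    return node.get("cat") or node.get("pos") or "UNK"


def pattern(node):
    """Build a key from the two top levels: root label plus child labels."""
    # Assumption: child order is ignored, so the labels are sorted.
    children = sorted(label(child) for child in node.findall("node"))
    return label(node) + "(" + ",".join(children) + ")"


def grind(tree_root):
    """Group every subtree that has children by its top-two-level pattern."""
    buckets = defaultdict(list)
    for node in tree_root.iter("node"):
        if node.find("node") is not None:  # skip leaves
            buckets[pattern(node)].append(node)
    return buckets


# Tiny hand-written example tree (illustrative data only).
SAMPLE = """
<alpino_ds>
  <node cat="smain">
    <node cat="np">
      <node pos="det" word="de"/>
      <node pos="noun" word="kat"/>
    </node>
    <node pos="verb" word="slaapt"/>
  </node>
</alpino_ds>
"""

buckets = grind(ET.fromstring(SAMPLE))
for key in sorted(buckets):
    print(key, len(buckets[key]))
```

Each key in `buckets` stands for one small database. A query whose two top levels match a given key would then only need to be run against that one bucket instead of the whole data set.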
We have not yet tested what effect this change has on query time, which is what I am going to find out during my internship. I have already analysed the XPath queries that users have made since GrETEL saw its first user, and found that the queries are at most ten levels deep, with most between one and five. The number of nodes per query varies between one and 24, but most searches are for structures that contain between one and eight nodes. Based on this information, I am writing example XPaths that I am going to run through BaseX as a sort of benchmark. I can then compare the query speeds of the split-up corpus and the regular one.

The problem that I have encountered is that BaseX seems to cache very efficiently. Obviously this is not a problem on production websites, but for benchmarking it may not be ideal. My first question to you, then, is: is it possible to disable caching when testing queries locally? And how exactly does BaseX handle caching? More specifically, if I enter a query, what is cached, and for how long? This information may be useful for analysing our logs.

If you have any feedback on GrETEL or the new approach of GrInding, or if you have any ideas to improve search time for large corpora, I would love to hear from you. You can contact me via this email address or on LinkedIn. I reply to each email as extensively as possible.

Thank you in advance,

Kind regards
Bram Vanroy
https://be.linkedin.com/in/bramvanroy