Re: [basex-talk] Benchmarking and caching in BaseX

2016-02-21 Thread Bram Vanroy | KU Leuven
I was going to write a more extensive email, but then I saw that many of you 
are also active on StackOverflow. Therefore I have moved my question there. I 
hope to see you there!

http://stackoverflow.com/questions/35536286/benchmarking-in-basex


Kind regards




Re: [basex-talk] Benchmarking and caching in BaseX

2016-02-16 Thread Christian Grün
Hi Bram,

> I did read on your website that it is possible to communicate with BaseX from 
> Java. Is there any documentation or guidelines on this?

We are putting quite some time into our documentation, so I hope that
the existing articles will give you some initial help (see e.g.
[1,2]).

> I am knowledgeable with Java, so I assume I should be able to conjure up a 
> benchmark script in Java. The only thing that I don't know is how to contact 
> the database and insert a query.

As you will see in the examples ("QueryExample.java" and others),
there is no need to insert queries in a database. Instead, you can
directly send your query strings to the BaseX server and retrieve the
results.

Cheers,
Christian

[1] http://docs.basex.org/wiki/Java_Examples
[2] http://docs.basex.org/wiki/Clients


Re: [basex-talk] Benchmarking and caching in BaseX

2016-02-16 Thread Bram Vanroy
Good morning Christian

Thank you for the quick reply! I am indeed surprised that BaseX does little 
caching of its own. Now that I think of it, it does make sense: if results are 
loaded into memory, they will be accessible much faster for subsequent queries, 
and they will reside in memory until overwritten or wiped. At least that is how 
I see it; I am no computer expert!

I have now gathered all XPath structures that I would like to benchmark (~100; 
I'm not sure if this is enough?). Since I am no hero in XQuery, I will ask my 
supervisor if he can write a script for this purpose (he loves Perl, so I 
assume he'll come up with something). I did read on your website that it is 
possible to communicate with BaseX from Java. Is there any documentation or 
guidelines on this? I am familiar with Java, so I assume I should be able to 
put together a benchmark script in Java. The only thing that I don't know is 
how to contact the database and insert a query. Could you point me to a 
tutorial-like source, if available? If not, I will ask my supervisor for help.

Finally, I'd like to thank you for the benchmarking tips; they are very 
useful!


Kind regards

Bram
https://be.linkedin.com/in/bramvanroy




Re: [basex-talk] Benchmarking and caching in BaseX

2016-02-15 Thread Christian Grün
Hi Bram,

Thanks for the summary on your work on Treebank and BaseX!

> The problem that I have encountered is that BaseX seems to
> cache very efficiently. Obviously this is not a problem on production
> websites but for benchmarking it may not be ideal. My first question to you,
> then, is: is it possible to disable caching when testing queries locally?
> And how exactly does BaseX handle the caching? Or more specifically, if I
> enter a query: what is cached, and for how long? This information may be
> useful to analyse our logs with.

You may be surprised to hear that BaseX does not have any particular
caching strategies for queries and query results. Various
optimizations exist for caching IO data on a lower level, though. As
these strategies reach down to the OS and hardware disk access level,
it’s hardly possible to disable all of them. Usually, it’s simply your
main memory that distorts your performance measurements, because the
relevant disk data will only be pulled once from disk as long as
enough main memory is available. Besides that, Java programs
generally get faster the longer they run (due to Just-in-Time
compilation, JIT), and so on.

In practice, if you do benchmarking, it’s usually good to “warm up”
your BaseX instance by running various initial queries, to use
the client/server architecture, and to look at e.g. the execution time
output by the -v or -V command-line flag. In order to simulate
real-life query patterns, you should run your test queries in random
order, and run a great number of different queries. Moreover, it’s
advisable to run your queries multiple times and eventually take
the mean or minimum value as the result. If this value differs by more
than 5% when repeating the test, then you should possibly increase the
number of runs.
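
The procedure described above (warm-up, random order, repeated runs, mean or minimum, 5% check) could be sketched in Java roughly as follows. This is only an illustration under assumptions: the class QueryBenchmark, its method names and the `runner` hook are made up for this sketch and are not part of BaseX. In a real setup the runner would send each query string to the BaseX server through one of the documented Java clients. The example queries in main are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Consumer;

/** Benchmark sketch: untimed warm-up rounds, random query order, repeated timed runs. */
final class QueryBenchmark {

  /** Minimum and mean wall-clock time of one query, in milliseconds. */
  static final class Stats {
    final double min;
    final double mean;
    Stats(double min, double mean) { this.min = min; this.mean = mean; }
  }

  /**
   * Times every query {@code runs} times after {@code warmupRounds} untimed rounds.
   * {@code runner} is a hypothetical hook that sends one query string to the database.
   */
  static Map<String, Stats> benchmark(Consumer<String> runner, List<String> queries,
      int warmupRounds, int runs, long seed) {
    if (runs < 1) throw new IllegalArgumentException("runs must be at least 1");
    Random random = new Random(seed);

    // Warm-up: run all queries without timing them, so that the JIT compiler
    // and the OS-level disk caches have settled before measuring starts.
    for (int i = 0; i < warmupRounds; i++) {
      for (String query : shuffled(queries, random)) runner.accept(query);
    }

    // Timed runs: a fresh random order in every round.
    Map<String, List<Long>> nanos = new LinkedHashMap<>();
    for (String query : queries) nanos.put(query, new ArrayList<>());
    for (int i = 0; i < runs; i++) {
      for (String query : shuffled(queries, random)) {
        long start = System.nanoTime();
        runner.accept(query);
        nanos.get(query).add(System.nanoTime() - start);
      }
    }

    Map<String, Stats> stats = new LinkedHashMap<>();
    for (Map.Entry<String, List<Long>> entry : nanos.entrySet()) {
      stats.put(entry.getKey(), summarize(entry.getValue()));
    }
    return stats;
  }

  /** Reduces a non-empty list of nanosecond timings to minimum and mean in milliseconds. */
  static Stats summarize(List<Long> nanos) {
    long min = Long.MAX_VALUE;
    long sum = 0;
    for (long value : nanos) {
      min = Math.min(min, value);
      sum += value;
    }
    return new Stats(min / 1_000_000.0, sum / (double) nanos.size() / 1_000_000.0);
  }

  /** True if two repeated measurements differ by at most {@code tolerance} (0.05 = 5%). */
  static boolean isStable(double first, double second, double tolerance) {
    return Math.abs(first - second) <= tolerance * Math.min(first, second);
  }

  /** Returns a shuffled copy; the caller's list is left untouched. */
  private static List<String> shuffled(List<String> queries, Random random) {
    List<String> copy = new ArrayList<>(queries);
    Collections.shuffle(copy, random);
    return copy;
  }

  public static void main(String[] args) {
    // Placeholder queries and a stand-in runner that does no real database work.
    List<String> queries = Arrays.asList("//a", "//a/b", "count(//c)");
    Consumer<String> standIn = query -> query.hashCode();
    Map<String, Stats> stats = benchmark(standIn, queries, 3, 20, 42L);
    for (Map.Entry<String, Stats> entry : stats.entrySet()) {
      System.out.printf("%s: min %.3f ms, mean %.3f ms%n",
          entry.getKey(), entry.getValue().min, entry.getValue().mean);
    }
  }
}
```

Note that this measures wall-clock time on the client side, which includes communication overhead and is therefore not the same figure as the execution time reported by the -v or -V flag. To apply the 5% rule, the whole benchmark would be run twice and the two minimum (or mean) values for each query compared with isStable(first, second, 0.05).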

I hope this helps a bit; I invite you to report back on your experiences,
Christian


[basex-talk] Benchmarking and caching in BaseX

2016-02-15 Thread Bram Vanroy | KU Leuven
Dear all

My name is Bram Vanroy, and I am an intern at the Centre for Computational
Linguistics (CCL; http://www.arts.kuleuven.be/ling/ccl [Dutch]) at the
University of Leuven. My supervisor, Vincent Vandeghinste, was in contact
with this mailing list some time ago, more specifically with Dirk Kirsten.
My internship is titled "Fine-tuning the GrETEL Treebank Query Engine".
GrETEL stands for Greedy Extraction of Trees for Empirical Linguistics;
it is available at http://gretel.ccl.kuleuven.be/gretel-2.0/. Its goal is to
provide users with a fast, user-friendly on-line tool to search through text
corpora backed by treebanks. Accessibility is an important point for us:
users do not need to be proficient in any programming languages, strict
formalisms, or treebank-specific annotations; every query can be executed by
using an intuitive graphical interface. More advanced users can use XPath to
write the representation of the syntactic structure that they are looking
for. BaseX is our tool of choice as a database for our corpora in XML
format.

Initially, GrETEL provided access to smaller corpora such as CGN (9 million
words) and Lassy Small (1 million words). We would like to expand the
searchable corpora by also making the full Sonar corpus available (500
million words). This is already partially possible in GrETEL 2.0, but for
efficiency reasons capabilities are restricted: users can only search in
one component at a time, and the largest component in the corpus is not
available due to its size (15 million sentences). We have applied these
restrictions because the search time for the whole corpus was too long,
which in turn would decrease the user-friendliness of the tool drastically.

Steps have already been taken to improve search times in larger corpora.
(See "Making a Large Treebank Searchable Online. The SoNaR Case." by Vincent
Vandeghinste and Liesbeth Augustinus;
http://nederbooms.ccl.kuleuven.be/documentation/LREC2014-GrETELSoNaR.pdf.)
To spare you the effort of going through the whole article, I quote the
passage that is most relevant for this email:

The general idea behind our approach is to restrict the search space by
splitting up the data in many small databases, allowing for faster retrieval
of syntactic structures. We organise the data in databases that contain all
bottom-up subtrees for which the two top levels (i.e. the root and its
children) adhere to the same syntactic pattern. When querying the database
for certain syntactic constructions, we know on which databases we have to
apply the XPath query which would otherwise have to be applied on the whole
data set. We have called this method GrETEL Indexing (GrInd). (p. 17)
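
To make the quoted idea more concrete, here is a minimal Java sketch of grouping subtrees by their two top levels. It is not the GrInd implementation: the Node class, the key format (root label plus sorted child labels), the category labels and the in-memory map standing in for the many small databases are all assumptions made for illustration, and the lookup only handles exact matches of the two top levels.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: buckets subtrees by the pattern of their two top levels. */
final class PatternIndexSketch {

  /** Minimal tree node: a category label plus child nodes. */
  static final class Node {
    final String label;
    final List<Node> children;
    Node(String label, Node... children) {
      this.label = label;
      this.children = Arrays.asList(children);
    }
  }

  /** Hypothetical key for the two top levels: root label and sorted child labels. */
  static String patternKey(Node root) {
    List<String> childLabels = new ArrayList<>();
    for (Node child : root.children) childLabels.add(child.label);
    Collections.sort(childLabels);
    return root.label + "(" + String.join(",", childLabels) + ")";
  }

  /** Files the tree and each of its subtrees under the key of its own two top levels. */
  static void index(Node tree, Map<String, List<Node>> buckets) {
    buckets.computeIfAbsent(patternKey(tree), key -> new ArrayList<>()).add(tree);
    for (Node child : tree.children) index(child, buckets);
  }

  /** Returns only the bucket whose key matches the query's two top levels. */
  static List<Node> candidates(Node queryPattern, Map<String, List<Node>> buckets) {
    return buckets.getOrDefault(patternKey(queryPattern), Collections.emptyList());
  }

  public static void main(String[] args) {
    // Each bucket stands in for one small database.
    Map<String, List<Node>> buckets = new TreeMap<>();
    index(new Node("smain", new Node("np", new Node("det"), new Node("n")), new Node("v")), buckets);
    index(new Node("smain", new Node("v"), new Node("np", new Node("adj"), new Node("n"))), buckets);
    for (Map.Entry<String, List<Node>> entry : buckets.entrySet()) {
      System.out.println(entry.getKey() + " holds " + entry.getValue().size() + " subtree(s)");
    }
  }
}
```

The point of the sketch is that a query is evaluated only against the bucket that shares its top-level pattern, rather than against every subtree in the data set.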

 

So to optimise searching, the data has been pulled apart, in a sense,
which should make the search space smaller and, in turn, the search time
shorter. In the future we would like to apply this technique to parallel
corpora as well. We have not yet tested what effect this change has on
query time, which is what I am going to find out during my internship.

I have already analysed the XPath queries that users have made since GrETEL
saw its first user, and found that the queries are at most ten embedded
levels deep, with most between one and five. The number of nodes per
query varies between one and 24, but most searches are for structures that
contain between one and eight nodes. Based on this information, I am writing
example XPaths that I am going to run through BaseX as a sort of benchmark.
I can then compare the query speeds of the split-up corpus and the
regular one.

The problem that I have encountered is that BaseX seems to
cache very efficiently. Obviously this is not a problem on production
websites, but for benchmarking it may not be ideal. My first question to you,
then, is: is it possible to disable caching when testing queries locally?
And how exactly does BaseX handle the caching? Or more specifically, if I
enter a query: what is cached, and for how long? This information may be
useful for analysing our logs.

If you have any feedback on GrETEL or on the new approach of GrInding, or if
you have any ideas for improving search time in large corpora, I would love
to hear from you. You can contact me via this email address or on LinkedIn.
I reply to each email as extensively as possible.

Thank you in advance,

Kind regards

Bram Vanroy
https://be.linkedin.com/in/bramvanroy