Hi Christian
Hi BaseX folk

I am sorry for the very late reply! As you can imagine I am rather busy. I will 
try to answer your questions as well as I can below. But before I do that, I 
think it is important that I provide more information on our set-up. Be warned, 
this is a very long email!

The supervisor (promotor) of my project has contacted this mailing list before. 
We are using BaseX in quite a peculiar manner. To reduce the search space of 
our 500-million-token corpus, we have introduced a process called GrInding. You 
can read all about it here [1].

To summarize, as a pre-processing step the corpus is scanned and divided into 
sections. Each subtree (one parent node and its direct children, recursively) 
is taken out, and a breadth-first pattern is assigned to it based on properties 
of that tree. Maybe this is clearer with an example, taken from a thesis that I 
am writing:

A breadth-first pattern can be seen as a summary of the first level of nodes. 
From each of these nodes we take the rel attribute and the cat or pt attribute, 
and join these by a %-sign. The results of all nodes are then sorted 
alphabetically (to better manage all the data) and concatenated with an 
underscore separator. Finally, the topmost node’s category is prepended to the 
whole string. For instance, 1 is an XPath structure based on the dependency 
structure of de man (the man) which is an NP consisting of a head noun and an 
article that is a determiner. 2 is its breadth-first representation: the top 
node’s cat (np) followed by each node’s rel, %, and cat or pt. In other words, 
det%lid is the first child node, and hd%n represents the second one.

1       /node[@cat="np" and node[@rel="det" and @pt="lid"] and node[@rel="hd" 
and @pt="n"]]
2       npdet%lid_hd%n
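
To make this concrete, here is a small Python sketch of the pattern 
construction. The dict-based node representation is of course only an 
illustration; the real data is XML trees:

```python
def breadth_first_pattern(node):
    """From each first-level child, join its rel attribute with its cat
    (or pt) attribute by a %-sign; sort the results alphabetically,
    concatenate them with underscores, and prepend the top node's cat."""
    parts = sorted(
        f'{child["rel"]}%{child.get("cat") or child["pt"]}'
        for child in node["children"]
    )
    return node["cat"] + "_".join(parts)

# "de man" (the man): an NP with a determiner article and a head noun
de_man = {
    "cat": "np",
    "children": [
        {"rel": "det", "pt": "lid"},
        {"rel": "hd", "pt": "n"},
    ],
}
print(breadth_first_pattern(de_man))  # npdet%lid_hd%n
```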

Note that this is done for each tree and all its subtrees, so that every 
pattern in that one tree can be matched. The breadth-first pattern is then used 
as the name of a wrapper file that, ironically, contains so-called “includes”. 
These includes are simply the file names of patterns that are subsumed by said 
pattern. As an example, an include of 2 could be npdet%lid_hd%n_mod%adj. That 
pattern is more specific, and thus should also be found by the more general 
pattern 2. In addition to includes, these files also contain the subtrees that 
exactly match the given pattern. The subtrees carry a reference to their 
original ancestor tree, so that when the XPath input matches a found subtree, 
we can retrieve the whole sentence from a different database, by ID. I hope 
that this somewhat clarifies the process. As you hopefully gather from this, 
the purpose of the technique is to narrow down the search space quickly.
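
A rough sketch of how such includes could be resolved, with an invented 
mapping for illustration (in the real system the include names are read from 
the wrapper files, not from a dict):

```python
def resolve_includes(pattern, includes):
    """Collect the pattern itself plus every (transitively) included,
    more specific pattern; these are the databases a query must visit."""
    seen = set()
    stack = [pattern]
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(includes.get(p, []))
    return seen

# invented include mapping, mirroring the example above
includes = {
    "npdet%lid_hd%n": ["npdet%lid_hd%n_mod%adj"],
    "npdet%lid_hd%n_mod%adj": [],
}
print(sorted(resolve_includes("npdet%lid_hd%n", includes)))
```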

The approach explained above also implies that we had to create a lot of BaseX 
databases. A lot. Around 10 million of them. As a consequence, looking for the 
same XPath in another pattern (i.e. another include) means running another 
query. From testing, the overhead of this seems minimal, and benchmarks show 
that the GrInded version of our corpus is indeed queried much faster than the 
original version.

Now, to reply to your questions:

        •       Have you done some simple performance tests?
Yes, I have done some simple tests, comparing the search time when using BaseX 
from the command line and when using BaseX through an online interface. The 
results are very similar (considering the web interface does some small 
post-processing steps). This also answers your next question: there is only 
very little overhead in returning the results to the browser. As I mentioned 
before, the benchmarks do show that the technique with 10 million databases 
proves to be faster than the other variant.

        •       Would a query that returns single results really be faster than 
one that returns 10 results?
Yes. In a search space of 500 million tokens, you can imagine that a rare 
pattern may take a lot of time to query, even in the GrInded version. In 
theory it is possible that the first database immediately yields a hit, and 
that the second (and last) hit is only found in the very last sentence of said 
huge corpus. Therefore, considering user-friendliness, we opted for flushing 
each individual result as it is found. In the theoretical example I just gave, 
the user would quickly be presented with a matched sentence (we provide means 
to “explore” that sentence in a graphical user interface), so at least they 
have something to do while the next sentence is looked for. If they had to 
wait for that second and last hit, it would take a very long time before 
anything actually “happens” on their screen.

        •       Do you sort your search results? If yes, returning 100 results 
instead of 10 should not be much slower.
As I am not entirely sure what you mean by that, I don’t think we do. By 
sorting, do you mean the XQuery order by clause?

I hope these answers address your questions. I have a couple of my own as 
well.


First of all, my benchmarks made it clear that the GrInding was a success. But 
in the results I could also see some strange behaviour: the GrInded version 
enjoyed a great caching effect, whereas the non-GrInded version did not. To me 
this is odd. The GrInded version often needs to go through more databases than 
the non-GrInded version (depending on the query), but wouldn’t that mean that 
BaseX’s cache is cleared more often? I could imagine that the garbage 
collector passes by after a query, or at least a session, is closed. Do you 
have any idea how this is possible?

Secondly, and lastly, I wonder about something in the online implementation 
that I built. Currently, searching goes like this. First, ajax requests are 
sent to the server, each asking for ONE hit only (always the next one, 
obviously), using the XQuery position() function and some PHP variable 
juggling. As a callback to each request, the next request is sent. After a 
while (after X seconds or X found hits) I initiate a new ajax request which 
only counts all hits. This way the user at least gets an idea: “currently 
we’ve already found X of Y total hits”. And finally, if all results haven’t 
been found yet, I fire a last request which searches for all remaining results 
in a single batch. By staging each step, I am providing the user with some 
feedback along the way.
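
In pseudo-Python, the staged flow looks roughly like this, with a plain list 
standing in for the corpus and each function call standing in for an ajax 
request that would run an XQuery positional filter against BaseX:

```python
def fetch_hit(results, n):
    """Return the n-th hit (1-based), like $results[position() = $n]."""
    return results[n - 1] if n <= len(results) else None

def search_flow(results, eager_limit=3):
    """Mimic the staged flow: stream single hits, then count, then batch."""
    shown = []
    # Step 1: request one hit at a time so the user sees matches immediately.
    for n in range(1, eager_limit + 1):
        hit = fetch_hit(results, n)
        if hit is None:
            break
        shown.append(hit)
    # Step 2: a separate count request, for "found X of Y total hits" feedback.
    total = len(results)
    # Step 3: one final batch request for all remaining hits.
    shown.extend(results[len(shown):])
    return shown, total

print(search_flow(["s1", "s2", "s3", "s4", "s5"]))
```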

My first question is: is count() actually faster than getting all results? Or 
does count() retrieve all the hits anyway, in which case I should count and 
get all results in one step? Secondly, it seems that when the last step is 
initiated, the other processes hang, leaving the user without any feedback; 
the processes literally seem to stop running. Does this happen because BaseX 
does not handle different sessions asynchronously, and new queries block 
others? If that is not the problem, I assume I may have an issue with using 
and modifying locked PHP $_SESSION variables, and I’ll have to look into that. 
Finally, I simply want to ask what the best flow is for opening and closing 
BaseX sessions, and when one should open a new session. At the moment, I open 
up a new session on each iteration, and for counting, and for fetching all 
results. The DRY enthusiast in me does not like this. The thing is that these 
scripts are run in separate files, and through an ajax request from the main 
file, so I am not sure how I can “pass a session around”. I suppose that 
storing a BaseX session object in a $_SESSION variable won’t work in the most 
desirable way? Or maybe this is not the recommended set-up and I am using 
BaseX sessions completely wrong? Any insight is greatly appreciated.

I bet that this email was more than you bargained for :-), but I thought it 
would only be useful if I provided all details to you in full. I hope you find 
it interesting as well as useful.


Kind regards
Thank you in advance
Thank you for your patience

Bram Vanroy
LinkedIn [2]

[1] http://nederbooms.ccl.kuleuven.be/documentation/LREC2014-GrETELSoNaR.pdf 
[2] https://www.linkedin.com/in/bramvanroy



-----Original message-----
From: Christian Grün [mailto:christian.gr...@gmail.com] 
Sent: Monday, 25 April 2016 12:23
To: Bram Vanroy | KU Leuven <bram.vanr...@student.kuleuven.be>
CC: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Querying Basex from a web interface

Hi Bram,

> I have implemented a system where on page load the ten first results are 
> shown, and when a user clicks a button an Ajax request is sent, and the ten 
> next results are looked up and appended to HTML. This works great, especially 
> in cases where many hits are found.

I’m glad to hear it works!

> However, this is still quite slow when you are querying for rare cases. 
> Therefore I am curious about your opinion on the following idea:
>
> What if you set the difference between $start and $end to 1, and immediately 
> return each hit, and fire the next ajax request?

Before spending some more work here, I am interested in hearing what exactly is 
the bottleneck in your setup:

* Have you done some simple performance tests?
* Does it take most time to evaluate the query or to return results to the 
browser?
* Would a query that returns single results really be faster than one that 
returns 10 results?
* Do you sort your search results? If yes, returning 100 results instead of 10 
should not be much slower.

Thanks in advance
Christian


> -----Original message-----
> Van: Christian Grün [mailto:christian.gr...@gmail.com]
> Sent: Thursday, 21 April 2016 15:22
> To: Bram Vanroy | KU Leuven <bram.vanr...@student.kuleuven.be>
> CC: BaseX <basex-talk@mailman.uni-konstanz.de>
> Subject: Re: [basex-talk] Querying Basex from a web interface
>
> PS: The current way to only return chunks of the result is to pass on the 
> index to the first and last result in your result, and to do the filtering in 
> XQuery:
>
>   $results[position() = $start to $end]
>
> By returning one more result than requested, the client will know that there 
> will be more results. This may be helpful, because computing the total result 
> size is often much more expensive than returning only the first results.
>
> Does this help?
> Christian
>
>
>
> On Thu, Apr 21, 2016 at 3:18 PM, Christian Grün <christian.gr...@gmail.com> 
> wrote:
>> Hi Bram,
>>
>> Thanks for your query. This is just a quick response, but you could 
>> have a look at the new %rest:single annotation in BaseX [1]. Your 
>> feedback is welcome.
>>
>> Christian
>>
>> [1] http://docs.basex.org/wiki/RESTXQ#Query_Execution
>>
>>
>>
>> On Thu, Apr 21, 2016 at 3:11 PM, Bram Vanroy | KU Leuven 
>> <bram.vanr...@student.kuleuven.be> wrote:
>>> Hello all
>>>
>>>
>>>
>>> I have been reading through many emails on this list and I’ve learnt 
>>> a lot about how Basex works and how others use it. A month or so 
>>> back I have sent an email myself to this list concerning caching. 
>>> Even though I have some more questions about that, I will leave that 
>>> for another time. Today I am concerned about retrieving chunked input from 
>>> Basex.
>>>
>>>
>>>
>>> (Question also found on StackOverflow, with a nice bounty! :-)
>>> http://stackoverflow.com/questions/36675388/efficient-and-user-friendly-way-to-present-slow-loading-results)
>>>
>>>
>>>
>>> Case at hand: we use Basex to query a 50 million tokens corpus. We 
>>> also make this available to other users through a website. The thing 
>>> is that this is slow. For our own projects that’s no problem, we 
>>> dive straight into the back-end and run a search command from 
>>> terminal and let the query run for all the time it needs. However, 
>>> for users it is paramount that they get a quick response. At the 
>>> moment it is taking too long. I don’t blame BaseX. We love BaseX and 
>>> are astounded by its efficiency and optimisations! However, we want to 
>>> deliver the best user-experience to our users.
>>>
>>>
>>>
>>> We call a new session from PHP, wait to receive the results, do some 
>>> post-processing and then load the result page. As said, this takes 
>>> too much time. We’ve been looking into some solutions. The best one 
>>> that I think should be possible, is returning chunks of the results.
>>> Do you know those websites that allow you to see results but only, 
>>> like, 20 per page? I think something similar is appropriate. When a 
>>> user has searched for a pattern, we only show the 20 or so first 
>>> results just so they can get an idea of the results they’d find.
>>> Then, when they click a button, we should query for the twenty next 
>>> results which are then appended to the list (JavaScript solution I 
>>> guess), and so on. Until all results have been found. Additionally, 
>>> I will also provide a button from which users can download all 
>>> results in a text file. This is allowed to take a longer time. The main 
>>> thing is that users should get early feedback and results on their query.
>>>
>>>
>>>
>>> Now the question is if something like this is possible in an 
>>> efficient manner in BaseX. Can you form a query that only finds the
>>> 20 first results, and then the following 20 and so on – and is this 
>>> faster than searching everything at once? In other words, when I am 
>>> searching for the results
>>> 120-140 (after having pushed the button a couple of times), is BaseX 
>>> smart enough to skip the search space it has already done to find 
>>> the
>>> 120 previous hits? If so, that would be great. Could you help me on 
>>> my way, with some PHP/XQuery code that is suitable?
>>>
>>>
>>>
>>> I also highly encourage you to participate on StackOverflow. As I 
>>> said, I am offering a 200 bounty – for the people who are interested in 
>>> Internet fame. :-)
>>>
>>>
>>>
>>>
>>>
>>> Thank you for your time
>>>
>>>
>>>
>>> Kind regards
>>>
>>>
>>>
>>> Bram Vanroy
>>>
>>> http://bramvanroy.be
>
