#146: BibSort: new module
--------------------------+----------------------
Reporter: lmarian | Owner: lmarian
Type: enhancement | Status: in_merge
Priority: major | Milestone: v1.0
Component: BibSort | Version:
Resolution: | Keywords:
--------------------------+----------------------
Comment (by lmarian):
BibSort is a new module designed to sort search results quickly.
Currently, when a user requests that search results be sorted in a
particular order, the search_engine goes into the bibxxx tables, grabs
the data for each record and sorts it. While this can be fine
performance-wise for a small result set, for large sets it puts too much
stress on the database (to avoid overloading the search_engine and the
db, we currently use CFG_WEBSEARCH_NB_RECORDS_TO_SORT to limit the
number of records sorted).
This new module should make no distinction between the 'Rank by' and
'Sort by' functionalities.
The following idea could be implemented: for each sorting method, the
whole set of data could be sorted and then split evenly into several
sorting buckets (the number should depend on the size of the
repository). It could be something like:
sort by title:
bucket 1: titles A->D
bucket 2: titles E->L
bucket 3: titles M->R
..
OR sort by citation count:
bucket 1: #citations > 1000
bucket 2: 1000 >= #citations > 500
bucket 3: 500 >= #citations > 100
..
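A minimal sketch of how such buckets could be built, assuming the sort
values for all records fit in a plain dict. The name build_buckets and
its parameters are hypothetical and not part of any existing module;
each bucket is kept as a list of record IDs in sort order.

```python
def build_buckets(sort_keys, num_buckets):
    """Sort record IDs by their sort value and split them into buckets.

    sort_keys:   dict mapping record ID -> sort value (e.g. a title).
    num_buckets: desired number of buckets (assumed to be >= 1).
    Returns a list of buckets; each bucket is a list of record IDs
    in sort order, and all buckets but the last have the same size.
    """
    if not sort_keys:
        return []
    # Break ties on the record ID so the order is deterministic.
    ordered = sorted(sort_keys, key=lambda recid: (sort_keys[recid], recid))
    size = -(-len(ordered) // num_buckets)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


titles = {1: "Zeta", 2: "Alpha", 3: "Mu", 4: "Beta", 5: "Omega"}
buckets = build_buckets(titles, 2)
print(buckets)  # [[2, 4, 3], [5, 1]]
```

For a descending method such as citation count, the same function could
be fed negated counts as sort values.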
At search time, if the user has requested sorting and sorting buckets
have already been created for the requested sorting method, the
search_engine will take rg (page size) and jrec (jump to record) and
compute the window of records that should be displayed to the user (for
example, if the user wants to see records 1 to 10 sorted by citation
count, there is no point in sorting the complete set of search results,
which could be of the order of 10^5^). It will then intersect the search
results with the first bucket (bucket 1) and check whether the
intersection provides enough records (10 in this example) to show to the
user. If so, it displays them; if not, it continues intersecting the
search results with the following buckets until 10 records have been
found.
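The search-time lookup described above could be sketched as follows.
This is an illustration under assumptions, not the search_engine code:
sorted_window is a hypothetical name, the search results are a plain
Python set, and each bucket is assumed to be a list of record IDs
already in sort order.

```python
def sorted_window(hitset, buckets, rg=10, jrec=1):
    """Return records jrec .. jrec+rg-1 of hitset in bucket order.

    hitset:  set of record IDs matching the search.
    buckets: list of buckets, each a list of record IDs in sort order.
    rg:      page size.
    jrec:    1-based position of the first record to display.
    """
    needed = jrec - 1 + rg  # records required to fill the window
    found = []
    for bucket in buckets:
        # Intersect this bucket with the hits, keeping the bucket order.
        found.extend(recid for recid in bucket if recid in hitset)
        if len(found) >= needed:
            break  # later buckets cannot affect the requested window
    return found[jrec - 1:needed]


buckets = [[2, 4, 3], [5, 1]]
hits = {1, 3, 4, 5}
print(sorted_window(hits, buckets, rg=2, jrec=1))  # [4, 3]
print(sorted_window(hits, buckets, rg=2, jrec=3))  # [5, 1]
```

In this sketch, records that appear in no bucket (for example, records
without a value for the sort field) are simply left out of the window;
how they should be handled is an open design choice.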
The scope of BibSort would be to construct and maintain these buckets
offline so that the search engine can cache and use them. Bucket updates
need to be fast, so that the buckets stay up to date with the latest
changes made to the metadata.
The advantages of using sorting buckets: there will be no need to look
in the bibxxx tables for each record retrieved; only a small part of the
whole data set has to be intersected with the search results, which
produces fast output for the user; and there will be no limit on the
maximum number of records that can be sorted (making operations like
'sort by citation count' possible on the whole database).
--
Ticket URL: <http://invenio-software.org/ticket/146#comment:5>
Invenio <http://invenio-software.org>