#146: BibSort: new module
--------------------------+----------------------
  Reporter:  lmarian      |      Owner:  lmarian
      Type:  enhancement  |     Status:  in_merge
  Priority:  major        |  Milestone:  v1.0
 Component:  BibSort      |    Version:
Resolution:               |   Keywords:
--------------------------+----------------------

Comment (by lmarian):

 BibSort is a new module designed to sort search results quickly.

 Currently, when a user requests that search results be sorted in a
 particular order, the search_engine goes into the bibxxx tables, fetches
 the data for each record and sorts it. While this may be fine
 performance-wise for a small result set, for large sets it puts too much
 stress on the database (to avoid overloading the search_engine and the
 db, we currently use CFG_WEBSEARCH_NB_RECORDS_TO_SORT to limit the number
 of records sorted).

 This new module should make no distinction between the 'Rank by' and
 'Sort by' functionalities.

 The following idea could be implemented: for each sorting method, the
 whole set of data could be sorted and then split evenly into several
 sorting buckets (the number should depend on the size of the repository).
 It could be something like:
 sort by title:
 bucket 1: titles A->D
 bucket 2: titles E->L
 bucket 3: titles M->R
 ..

 OR sort by citation count:
 bucket 1: #citations > 1000
 bucket 2: 1000 > #citations > 500
 bucket 3: 500 > #citations > 100
 ..
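
 As a rough illustration of the bucket construction described above, here
 is a minimal Python sketch. The function name build_buckets, its
 arguments, and the use of plain lists of record IDs are all assumptions
 made for the example; the ticket does not specify how buckets are stored
 or how their number is chosen.

```python
def build_buckets(sort_values, num_buckets):
    """Split record IDs, ordered by their sort value, into even buckets.

    sort_values: dict mapping record ID -> sortable value (hypothetical
                 input; in practice this would come from the metadata).
    num_buckets: how many buckets to create (assumed to be chosen from
                 the size of the repository).
    Returns a list of lists of record IDs; the first bucket sorts first.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")

    # Sort the whole set once; ties are broken by record ID so the
    # result is deterministic.
    ordered = sorted(sort_values, key=lambda recid: (sort_values[recid], recid))

    # Distribute the records as evenly as possible: the first `extra`
    # buckets receive one additional record.
    size, extra = divmod(len(ordered), num_buckets)
    buckets = []
    start = 0
    for i in range(num_buckets):
        end = start + size + (1 if i < extra else 0)
        buckets.append(ordered[start:end])
        start = end
    return buckets


# Usage: five records sorted by title and split into two buckets.
titles = {1: "Zeta", 2: "Alpha", 3: "Gamma", 4: "Beta", 5: "Delta"}
title_buckets = build_buckets(titles, 2)
```

 For a descending order such as citation count, the same sketch would work
 with negated counts as the sort values.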

 At search time, if the user has requested sorting and sorting buckets
 have already been created for the requested sorting method, the
 search_engine will take rg (page size) and jrec (jump to record) and
 compute the window of records that should be displayed to the user (for
 example, if the user wants to see records 1 to 10 sorted by citation
 count, there is no point in sorting the complete set of search results,
 which could be of the order of 10^5). It will then intersect the search
 results with the first bucket (bucket 1) and check whether the
 intersection provides enough records (10 in this example) to show to the
 user. If so, it displays them; if not, it continues intersecting the
 search results with the following buckets until 10 records have been
 found.
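
 The search-time step could look roughly like the Python sketch below. It
 assumes that each bucket is a list of record IDs already stored in sorted
 order, which the ticket does not state; the function name sorted_window
 is hypothetical, while rg and jrec follow the parameter names used above.

```python
def sorted_window(search_results, buckets, rg=10, jrec=1):
    """Return the record IDs to display for one page of sorted results.

    search_results: iterable of record IDs matching the query.
    buckets: list of buckets, each a list of record IDs in sorted order
             (an assumption of this sketch).
    rg: page size; jrec: 1-based position of the first record to show.
    """
    hits = set(search_results)
    start = max(jrec, 1) - 1   # 0-based index of the first record wanted
    needed = start + rg        # how many sorted hits we must collect

    found = []
    for bucket in buckets:
        # Intersect the search results with this bucket, keeping the
        # bucket's order.
        found.extend(recid for recid in bucket if recid in hits)
        if len(found) >= needed:
            break              # enough records; later buckets are skipped
    return found[start:needed]


# Usage: two buckets, three matching records, page size 2.
example_buckets = [[2, 4, 5], [3, 1]]
first_page = sorted_window({1, 3, 5}, example_buckets, rg=2, jrec=1)
second_page = sorted_window({1, 3, 5}, example_buckets, rg=2, jrec=3)
```

 Note that for a later page (large jrec) the loop still has to walk the
 earlier buckets to count how many hits precede the window.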

 The scope of BibSort would be to construct and maintain these buckets
 offline so that the search engine can cache and use them. Updating the
 buckets needs to be fast so that they stay up to date with the latest
 changes to the metadata.

 The advantages of using sorting buckets: there is no longer any need to
 look in the bibxxx tables for each record retrieved; only a small part of
 the whole data set has to be intersected with the search results, which
 produces fast output for the user; and there is no limit on the maximum
 number of records that can be sorted (making operations like 'sort by
 citation count' possible on the whole database).

-- 
Ticket URL: <http://invenio-software.org/ticket/146#comment:5>
Invenio <http://invenio-software.org>
