In Solr 4.7 an exciting new feature was added that allows one to page through a 
complete result set without having to worry about missing or double results at 
page boundaries while keeping resource utilization low.

I have a common use case that has similar performance and consistency problems 
that could be solved by extending the way CursorMarks work:

A. The user executes a search and obtains thousands of results of which he sees 
the first 'page'.
   Apart from scrolling through the list he also has a scrollbar (or paging 
controls) to jump to anywhere in the list.
B. The user uses the scrollbar to jump to an arbitrary place in the list.
C. The user scrolls down a bit (but past the current 'page') to find what he's 
looking for.
D. The user realizes he's too far down and scrolls up a bit again (but before 
the current 'page' again...)

(Yes, I know that users should be educated to refine their search, but 
unfortunately, if the client for which the application is developed specifies 
that it should be possible to use it this way...)

For the moment this is implemented by using the start/rows parameters to get 
the appropriate 'page' and this has the disadvantages that cursorMark solves:
- Solr (actually I use Lucene directly, but that doesn't matter here) needs to 
store *all* documents up to document (start+rows) to be able to return just 
the rows requested. Except for step A (where start==0), this may be a huge 
performance hit.
- If the index is modified concurrently (especially when using NRT), jumping to 
the next/previous page can cause documents to be repeated or skipped at page 
boundaries (as explained in 
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results)
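
To make the second disadvantage concrete, here is a toy Python sketch (not 
Solr or Lucene code). The integer "documents" and the helper names 
offset_page, demo_insert and demo_delete are made up for illustration; the 
sort key is simply the document value.

```python
def offset_page(docs, start, rows):
    """Offset paging: sort the whole index, then slice [start, start+rows)."""
    return sorted(docs)[start:start + rows]


def demo_insert():
    """A document that sorts before page 1 is added between two requests."""
    index = [10, 20, 30, 40, 50, 60]
    first = offset_page(index, 0, 3)
    index.append(5)                     # concurrent update
    second = offset_page(index, 3, 3)
    return first, second


def demo_delete():
    """A document on page 1 is deleted between two requests."""
    index = [10, 20, 30, 40, 50, 60]
    first = offset_page(index, 0, 3)
    index.remove(10)                    # concurrent update
    second = offset_page(index, 3, 3)
    return first, second


first, second = demo_insert()
repeated = set(first) & set(second)     # 30 shows up on both pages

first_d, second_d = demo_delete()
seen = set(first_d) | set(second_d)
skipped = {40} - seen                   # 40 is never shown on either page
```

In the insert case document 30 appears on both pages; in the delete case 
document 40 falls between the two pages and is never shown.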

Here's the way an extension to the cursorMark system could solve the problem:
A. Solr/Lucene executes the search and returns the total number of hits and the 
requested number of top documents.
   start=0, rows=n, cursorMark=*
B. start=x, rows=n, cursorMark=*: Here Solr should allow combining both 
start!=0 and cursorMark=*. It should execute a normal request using start=x and 
rows=n and add two cursorMarks: one corresponding to the sort values of the 
first document and one corresponding to the sort values of the last document.
C. Use cursorMark to get the 'next' pages: This is the same way cursorMark 
works for the moment:  the user passes the cursorMark corresponding to the sort 
values of the last document.
D. Use the cursorMark corresponding to the sort values of the first document to 
get the 'previous' pages.

In terms of implementing these changes, I've been looking at the source code 
and already did the easy ones :)
- If a cursorMark is passed (either cursorMark=* or a 'real' value), Solr 
should return two cursorMarks in the result: nextCursorMark as before and 
prevCursorMark corresponding to the sort values of the first document. Done.
- start!=0 and cursorMark=* should no longer be mutually exclusive (but 
start!=0 and cursorMark!=* should). Done.
- When returning a result using a cursorMark, the start value returned should 
correspond to the actual position of the first document in the full result set. 
For the next page, this equals the number of documents skipped during 
processing, but unfortunately I didn't see a way (yet) to pass that information 
along everywhere. This start value, together with the (possibly changed) 
numFound value, can be used in the GUI to adjust the position of the scrollbar 
or the paging controls accordingly without having to estimate it.
- Implementing reverse paging could actually be easier than it sounds by 
internally reversing the sort order (really reversing, not just reversing 
ASC/DESC!) using the cursor as in the normal case and afterwards reversing the 
obtained list of documents. I've updated PagingFieldCollector in 
TopFieldCollector.java by negating the values in reverseMul and overriding 
topDocs(start, howMany), but I still have to check every place where partial 
results are merged as well...
- Implement as many test cases for the paging up case as already exist for the 
paging down case (help! :)
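
The reverse-paging idea and the start computation can be sketched in a few 
lines of Python. This is only a toy model under simplifying assumptions, not 
the actual PagingFieldCollector change: documents are plain numbers that are 
their own unique sort key, negating a key stands in for negating reverseMul, 
and the names collect_after, next_page and previous_page are hypothetical.

```python
import heapq


def collect_after(docs, cursor, rows, reverse=False):
    """Collect the top `rows` docs strictly after `cursor` in the
    (possibly reversed) sort order.

    Returns (skipped, page): `skipped` counts the docs at or before the
    cursor in that order, `page` is in the order used for collecting.
    """
    sign = -1 if reverse else 1          # toy stand-in for negated reverseMul
    keyed = [sign * d for d in docs]
    c = sign * cursor
    skipped = sum(1 for k in keyed if k <= c)
    page = heapq.nsmallest(rows, (k for k in keyed if k > c))
    return skipped, [sign * k for k in page]


def next_page(docs, next_cursor, rows):
    """Paging down: start is simply the number of skipped documents."""
    skipped, page = collect_after(docs, next_cursor, rows)
    return skipped, page


def previous_page(docs, prev_cursor, rows):
    """Paging up: collect in reversed order, then reverse the page."""
    skipped, page = collect_after(docs, prev_cursor, rows, reverse=True)
    page.reverse()                       # restore the original sort order
    start = len(docs) - skipped - len(page)
    return start, page


docs = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# Current page is [60, 70, 80]: prevCursorMark ~ 60, nextCursorMark ~ 80.
down = next_page(docs, 80, 3)
up = previous_page(docs, 60, 3)
```

Here `down` holds the start value and documents of the following page and `up` 
those of the preceding page; in both cases start is derived from counts made 
during collection rather than estimated.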

While working on the code, I thought of another use case as well: refreshing 
the current page:
Instead of passing the same start value again, the prevCursorMark could be 
passed, but with a hint that the documents on or after this cursorMark should 
be returned.

Which brings me to the question of how to specify the new behavior to Solr 
without affecting the current behavior.

I propose that prevCursorMark and nextCursorMark simply encode the sort values 
for the first and last document (as nextCursorMark does now) and that a simple 
prefix is used when cursorMark should be used differently:
">": documents after the cursor position: use with nextCursorMark to get the 
next page of results
">=": documents after or on the cursor position: use with prevCursorMark to 
refresh the same page keeping the same sort position for the first document
"<": documents before the cursor position: use with prevCursorMark to get the 
previous page of results
"<=": documents before or on the cursor position: use with nextCursorMark to 
get the same page keeping the same sort position for the last document (for 
completeness, useful?)

So if prevCursorMark was "ABC" and nextCursorMark was "DEF",
- "<ABC" would return the previous page
- ">DEF" or "DEF" would return the next page
- ">=ABC" would return the same page (but with 'fresh' values/documents), 
keeping 'visual' position the same
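
As a rough illustration of how these prefixes could be interpreted, here is a 
small Python sketch. It is not Solr code: parse_cursor_mark and page_for are 
hypothetical names, the toy "marks" are the raw sort keys instead of encoded 
sort values, and it assumes an encoded mark never starts with "<" or ">" 
itself.

```python
import operator

# Proposed prefixes mapped to comparisons against the cursor position.
OPS = {
    ">": operator.gt,    # next page (use with nextCursorMark)
    ">=": operator.ge,   # refresh, anchored on the first document
    "<": operator.lt,    # previous page (use with prevCursorMark)
    "<=": operator.le,   # refresh, anchored on the last document
}


def parse_cursor_mark(value):
    """Split a cursorMark into (operator, mark)."""
    for op in (">=", "<=", ">", "<"):    # two-character prefixes first
        if value.startswith(op):
            return op, value[len(op):]
    return ">", value                    # bare mark keeps today's behavior


def page_for(sorted_keys, cursor_mark, rows):
    """Select one page from an already sorted list of unique keys."""
    op, mark = parse_cursor_mark(cursor_mark)
    matches = [k for k in sorted_keys if OPS[op](k, mark)]
    if op.startswith(">"):
        return matches[:rows]                        # first rows after the cursor
    return matches[max(0, len(matches) - rows):]     # last rows before the cursor


keys = ["A", "B", "C", "D", "E", "F", "G"]
# Current page is C, D, E: prevCursorMark ~ "C", nextCursorMark ~ "E".
previous = page_for(keys, "<C", 3)
following = page_for(keys, ">E", 3)
refreshed = page_for(keys, ">=C", 3)
```

A bare mark is treated exactly like the ">" form, so existing clients that 
pass nextCursorMark unchanged would see no difference.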

I'd appreciate any comments on this or if anyone else has already started work 
on similar changes.
In the meantime I'll continue working on what I have and check how I can make 
my changes available (through a patch attached to a new issue in Jira?)

Luc Vanlerberghe
