[jira] [Commented] (SOLR-5463) Provide cursor/token based "searchAfter" support that works with arbitrary sorting (ie: "deep paging")

Hoss Man (JIRA) Mon, 18 Nov 2013 17:34:35 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826056#comment-13826056
 ]


Hoss Man commented on SOLR-5463:
--------------------------------


I've been reading up on the internals of IndexSearcher.searchAfter and the 
associated PagingFieldCollector used (as well as some of the problems 
encountered in SOLR-1726) and I'm not convinced it would be a slam dunk to try 
to use them directly in Solr:

* IndexSearcher.searchAfter/PagingFieldCollector relies on the "client" (ie: 
Solr) passing back the FieldDoc of the last doc returned, and has expectations 
that the (lucene) docid contained in that FieldDoc will be meaningful
** We could perhaps serialize a representation of the "last" FieldDoc to 
include in the response of each request, and then deserialize that into a 
suitable impostor object on the "searchAfter" request -- but there is still the 
problem of the internal docid, which will be misleading in a multishard 
distributed solr setup
* There are a variety of code paths in SolrIndexSearcher for executing searches 
and it's not immediately obvious (to me) if/when it would make sense to augment 
each of those paths with PagingFieldCollector (see yonik's comment in 
SOLR-1726 about faceting).
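
To make the first point concrete, here is a rough sketch (plain Java, no Lucene or Solr classes) of a token that carries only the sort values of the last doc and deliberately leaves out the internal docid. Everything in it is an assumption for illustration: the class name SortValuesToken, treating every sort value as a String, the separator character, and the Base64 encoding are all hypothetical and not anything that exists in Solr.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: none of these names exist in Solr or Lucene.
final class SortValuesToken {
    // Assumption: this control character never appears inside a sort value.
    private static final String SEP = "\u0001";

    // Encode the sort values of the last doc on a page as an opaque string.
    // The internal (lucene) docid is intentionally left out, since it means
    // nothing to any other shard.
    static String encode(List<String> sortValues) {
        if (sortValues.isEmpty()) {
            throw new IllegalArgumentException("need at least one sort value");
        }
        String joined = String.join(SEP, sortValues);
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(joined.getBytes(StandardCharsets.UTF_8));
    }

    // Decode a token back into the list of sort values it was built from.
    static List<String> decode(String token) {
        byte[] raw = Base64.getUrlDecoder().decode(token);
        String joined = new String(raw, StandardCharsets.UTF_8);
        return Arrays.asList(joined.split(Pattern.quote(SEP), -1));
    }

    public static void main(String[] args) {
        List<String> last = Arrays.asList("2013-11-18", "doc42");
        String token = encode(last);
        System.out.println(token + " -> " + decode(token));
    }
}
```

Because the token holds nothing shard-local, the same string could in principle be handed to every shard unchanged. Real sort values are typed (numbers, dates, strings), so a real encoding would need to preserve those types rather than flatten everything to strings.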

With that in mind, the approach I'm going to pursue (largely for my own sanity) 
is:

* Attempt a minimally invasive straw man implementation of "searchAfter" type 
functionality that works in distributed mode -- ideally w/o modifying any 
existing Solr code.
* Use this straw man implementation to sanity check that the end user API is 
useful
* Build up good comprehensive (passing) tests against this straw man
* Circle back and revisit the implementation details looking for opportunities 
to:
** refactor to eliminate duplication of similar code
** improve performance

My current idea is to implement this straw man solution using a new 
SearchComponent that would run _after_ QueryComponent, along the lines of...

* prepare:
** No-Op unless "searchAfter" param is specified
*** Use some marker value to mean "first page"
** assert that start==0 (doesn't make sense when using searchAfter)
** assert that uniqueKey is one of the sort fields (to ensure consistent 
ordering)
** if searchAfter param value indicates this is not the first request:
*** deserialize the token into a list of sort values
*** add a new PostFilter that restricts to documents based on those values and 
the sort directions (same basic logic as PagingFieldCollector)
* process:
** No-Op unless "searchAfter" param is specified
** do nothing if this is a shard request
** for regular old single node solr requests: serialize the sort values of the 
last doc in the Doc List (that QueryComponent has already built) and put it in 
the response as the "next" searchAfter token
* finishStage:
** No-Op unless "searchAfter" param is specified and stage is "DONE"
** serialize the sort values of the last doc in the Doc List (that 
QueryComponent already merged) and put it in the response as the "next" 
searchAfter token
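
The core of the steps above can be sketched in plain Java, with no Solr or Lucene classes involved. This is only a sketch under stated assumptions: SearchAfterSketch, SortClause, validate, compareRows and fetchPage are hypothetical names, a doc is reduced to an Object[] of its sort values, null stands in for the "first page" marker, and an in-memory list filter stands in for the PostFilter. The filtering and the sorting are done in one method here purely for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: plain Java stand-ins, not Solr or Lucene classes.
final class SearchAfterSketch {

    // One sort clause: a field name plus its direction.
    static final class SortClause {
        final String field;
        final boolean descending;
        SortClause(String field, boolean descending) {
            this.field = field;
            this.descending = descending;
        }
    }

    // The "prepare" sanity checks: start must be 0, and the uniqueKey must be
    // one of the sort fields so that ordering is consistent.
    static void validate(int start, List<SortClause> sort, String uniqueKey) {
        if (start != 0) {
            throw new IllegalArgumentException("start must be 0 with searchAfter");
        }
        boolean hasKey = false;
        for (SortClause c : sort) {
            hasKey |= c.field.equals(uniqueKey);
        }
        if (!hasKey) {
            throw new IllegalArgumentException("sort must include " + uniqueKey);
        }
    }

    // Compare two rows of sort values in result order, honoring directions.
    @SuppressWarnings("unchecked")
    static int compareRows(List<SortClause> sort, Object[] a, Object[] b) {
        for (int i = 0; i < sort.size(); i++) {
            int c = ((Comparable<Object>) a[i]).compareTo(b[i]);
            if (c != 0) {
                return sort.get(i).descending ? Integer.compare(0, c) : c;
            }
        }
        return 0;
    }

    // Stand-in for the filter: keep only docs that sort strictly after the
    // "after" values (null is the marker for "first page"), then take one page.
    static List<Object[]> fetchPage(List<Object[]> docs, List<SortClause> sort,
                                    Object[] after, int rows) {
        List<Object[]> matches = new ArrayList<>();
        for (Object[] doc : docs) {
            if (after == null || compareRows(sort, doc, after) > 0) {
                matches.add(doc);
            }
        }
        matches.sort((x, y) -> compareRows(sort, x, y));
        return new ArrayList<>(matches.subList(0, Math.min(rows, matches.size())));
    }

    public static void main(String[] args) {
        List<SortClause> sort = Arrays.asList(
                new SortClause("score", true), new SortClause("id", false));
        validate(0, sort, "id");
        List<Object[]> docs = new ArrayList<>();
        docs.add(new Object[] {5, "a"});
        docs.add(new Object[] {5, "b"});
        docs.add(new Object[] {3, "c"});
        docs.add(new Object[] {7, "d"});

        Object[] after = null;  // first page
        List<Object[]> current;
        while (!(current = fetchPage(docs, sort, after, 2)).isEmpty()) {
            for (Object[] doc : current) {
                System.out.println(Arrays.toString(doc));
            }
            // The "next" token is just the sort values of the last doc returned.
            after = current.get(current.size() - 1);
        }
    }
}
```

Since the comparison uses only sort values and sort directions, two docs with the same score are still ordered by the uniqueKey, which is why the uniqueKey check matters: without it, docs that tie with the last doc of a page would be either skipped or repeated.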




> Provide cursor/token based "searchAfter" support that works with arbitrary 
> sorting (ie: "deep paging")
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-5463
>                 URL: https://issues.apache.org/jira/browse/SOLR-5463
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>
> I'd like to revisit a solution to the problem of "deep paging" in Solr, 
> leveraging an HTTP based API similar to how IndexSearcher.searchAfter works 
> at the lucene level: require the clients to provide back a token indicating 
> the sort values of the last document seen on the previous "page".  This is 
> similar to the "cursor" model I've seen in several other REST APIs that 
> support "pagination" over large sets of results (notably the twitter API and 
> its "since_id" param) except that we'll want something that works with 
> arbitrary multi-level sort criteria that can be either ascending or descending.
> SOLR-1726 laid some initial ground work here and was committed quite a while 
> ago, but the key bit of argument parsing to leverage it was commented out due 
> to some problems (see comments in that issue).  It's also somewhat out of 
> date at this point: at the time it was committed, IndexSearcher only supported 
> searchAfter for simple scores, not arbitrary field sorts; and the params 
> added in SOLR-1726 suffer from this limitation as well.
> ---
> I think it would make sense to start fresh with a new issue with a focus on 
> ensuring that we have deep paging which:
> * supports arbitrary field sorts in addition to sorting by score
> * works in distributed mode



--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
