[jira] [Commented] (LUCENE-5299) Refactor Collector API for parallelism

Uwe Schindler (JIRA) Mon, 21 Oct 2013 10:33:43 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800847#comment-13800847
 ]


Uwe Schindler commented on LUCENE-5299:
---------------------------------------

I like the change from abstract class to interface! In general, I like the new 
approach to managing collection much more than the broken abstract Collector 
class we currently have. It is now more obvious *where* to do the merging (it 
should be part of e.g. TopDocCollector). IndexSearcher should not deal with 
this at all! Users with conventional serial collection can still do this with 
the abstract base class SerialCollector and don't even need to change their 
code (except for changing the superclass). This is cool because the serial 
collector implements both interfaces and returns itself as the sub collector!
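
To illustrate the "returns itself" idea, here is a minimal, self-contained 
sketch. It is not the code from the patch: the interface shapes, the method 
names (subCollector, setNextSegment, done) and the SegmentContext and 
DocIdListCollector classes are assumptions made up for this example. Only the 
names Collector, sub collector and SerialCollector come from the discussion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a per-segment context; not the Lucene class.
final class SegmentContext {
    final int docBase;
    SegmentContext(int docBase) { this.docBase = docBase; }
}

// Assumed shape: collects segment-local doc IDs for exactly one segment.
interface SubCollector {
    void collect(int doc);
    void done();
}

// Assumed shape: hands out one SubCollector per segment.
interface Collector {
    SubCollector subCollector(SegmentContext context);
}

// Serial base class: implements both interfaces and returns itself as the
// sub collector, so an existing serial collector only swaps its superclass.
abstract class SerialCollector implements Collector, SubCollector {
    @Override
    public final SubCollector subCollector(SegmentContext context) {
        setNextSegment(context);
        return this;
    }

    // Called once per segment, before any collect() call for that segment.
    protected abstract void setNextSegment(SegmentContext context);

    @Override
    public void done() {
        // Nothing to merge in the serial case.
    }
}

// Example serial collector (hypothetical): records global doc IDs in order.
final class DocIdListCollector extends SerialCollector {
    private final List<Integer> globalDocs = new ArrayList<>();
    private int docBase;

    @Override
    protected void setNextSegment(SegmentContext context) {
        docBase = context.docBase;
    }

    @Override
    public void collect(int doc) {
        globalDocs.add(docBase + doc);
    }

    List<Integer> docs() {
        return globalDocs;
    }
}
```

Because the serial collector is its own sub collector, it keeps a single 
mutable state and must only ever be driven by one thread, segment after 
segment.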

My biggest concern is not the complexity of the API (it is actually simpler 
and easier to understand!). It is rather the fact that parallelizing individual 
Lucene queries is in most cases not the best thing to do if you have many 
users. It only makes sense if you have very few concurrent queries, which is 
not the typical use case for full-text search. The overhead of merging 
outweighs the gain, especially when many users hit your search engine in 
parallel! I generally don't recommend that users use the parallelization 
currently available in IndexSearcher. Every user gets one thread, and if you 
have many users, buy more processors. Additional per-query parallelism does 
not scale as the user base grows.

> Refactor Collector API for parallelism
> --------------------------------------
>
>                 Key: LUCENE-5299
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5299
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Shikhar Bhushan
>         Attachments: benchmarks.txt, LUCENE-5299.patch
>
>
> h2. Motivation
> We should be able to scale-up better with Solr/Lucene by utilizing multiple 
> CPU cores, and not have to resort to scaling-out by sharding (with all the 
> associated distributed system pitfalls) when the index size does not warrant 
> it.
> Presently, IndexSearcher has an optional constructor arg for an 
> ExecutorService, which gets used for searching in parallel for call paths 
> where one of the TopDocCollectors is created internally. The 
> per-atomic-reader search happens in parallel and then the 
> TopDocs/TopFieldDocs results are merged with locking around the merge bit.
> However, there are some problems with this approach:
> * If arbitrary Collector args come into play, we can't parallelize. Note that 
> even if ultimately results are going to a TopDocCollector it may be wrapped 
> inside e.g. an EarlyTerminatingCollector or TimeLimitingCollector or both.
> * The special-casing with parallelism baked on top does not scale; there are 
> many Collectors that could potentially lend themselves to parallelism, and 
> special-casing means the parallelization has to be re-implemented if a 
> different permutation of collectors is to be used.
> h2. Proposal
> A refactoring of collectors that allows for parallelization at the level of 
> the collection protocol. 
> Some requirements that should guide the implementation:
> * easy migration path for collectors that need to remain serial
> * the parallelization should be composable (when collectors wrap other 
> collectors)
> * allow collectors to pick the optimal solution (e.g. there might be memory 
> tradeoffs to be made) by advising the collector about whether a search will 
> be parallelized, so that the serial use-case is not penalized.
> * encourage use of non-blocking constructs and lock-free parallelism; 
> blocking is not advisable for the hot-spot of a search, besides wasting 
> pooled threads.
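
The proposal quoted above (parallelism at the level of the collection 
protocol, with the merge owned by the collector and no locking in the hot 
path) could look roughly like the sketch below. Everything in it is an 
assumption for illustration, not the patch: TopScoresCollector, its inner Sub 
class and ParallelSearchSketch are made-up names, and each "segment" is 
reduced to a plain array of scores so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical collector that keeps the n highest scores. The merge lives
// here, not in the searcher.
final class TopScoresCollector {
    private final int n;
    // Non-blocking queue: each sub collector publishes its partial result
    // exactly once, so collect() itself never touches shared state.
    private final ConcurrentLinkedQueue<List<Float>> partials =
            new ConcurrentLinkedQueue<>();

    TopScoresCollector(int n) { this.n = n; }

    // One instance per segment; only ever used by a single thread.
    final class Sub {
        private final List<Float> scores = new ArrayList<>();

        void collect(float score) { scores.add(score); }

        void done() {
            scores.sort(Collections.reverseOrder());
            int keep = Math.min(n, scores.size());
            partials.add(new ArrayList<>(scores.subList(0, keep)));
        }
    }

    Sub subCollector() { return new Sub(); }

    // Merge step: call only after every sub collector has finished.
    List<Float> topScores() {
        List<Float> merged = new ArrayList<>();
        for (List<Float> partial : partials) {
            merged.addAll(partial);
        }
        merged.sort(Collections.reverseOrder());
        return new ArrayList<>(merged.subList(0, Math.min(n, merged.size())));
    }
}

// Stand-in for the searcher: it schedules one task per segment and never
// looks at the results itself.
final class ParallelSearchSketch {
    static List<Float> search(float[][] segments, int n, ExecutorService executor) {
        TopScoresCollector collector = new TopScoresCollector(n);
        List<Future<?>> futures = new ArrayList<>();
        for (float[] segment : segments) {
            TopScoresCollector.Sub sub = collector.subCollector();
            futures.add(executor.submit(() -> {
                for (float score : segment) {
                    sub.collect(score);
                }
                sub.done();
            }));
        }
        try {
            for (Future<?> future : futures) {
                future.get(); // wait for all segments before merging
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return collector.topScores();
    }
}
```

A caller would pass in a shared pool, e.g. one created with 
Executors.newFixedThreadPool(...), and shut it down when finished. Note that 
the merge in topScores() is extra work that a serial search does not pay, 
which is the overhead discussed in the comment above.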


