[ https://issues.apache.org/jira/browse/LUCENE-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800954#comment-13800954 ]

Shikhar Bhushan commented on LUCENE-5299:
-----------------------------------------

Thanks for your comments [~thetaphi], I really appreciate the vote of 
confidence in the API changes :)

bq. My biggest concern is not the complexity of the API (it is actually simpler 
and easier to understand!): it is more the fact that parallelism of Lucene 
Queries is in most cases not the best thing to do (if you have many users). It 
only makes sense if you have very few queries - which is not what full-text 
search is used for. The overhead of merging is higher than what you gain, 
especially when many users hit your search engine in parallel! I generally 
don't recommend that users use the parallelization currently available in 
IndexSearcher. Every user gets one thread, and if you have many users, buy more 
processors. With additional parallelism this does not scale as the userbase grows.

There is certainly more work to be done overall per search request for those 
Collectors where parallelization implies merge step(s) [1]. It could mean 
better latency at the cost of additional hardware to sustain the same level of 
load, but that is a choice that should be available when developing search 
applications.

[1] There are trivially parallelizable collectors where the merge step is 
either very small or non-existent, e.g. TotalHitCountCollector, or even 
FacetCollector 
(https://github.com/shikhar/lucene-solr/commit/032683da739bf15c1a8afe9f15cb2586baa0b201?w=1)
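
To make footnote [1] concrete, below is a minimal, hypothetical sketch (plain 
java.util.concurrent, not the API from the attached patch) of why hit counting 
parallelizes so cheaply: each slice of the index is counted independently, and 
the entire merge step is a lock-free sum.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: each per-slice task stands in for running a query's
// scorer over one leaf reader and counting its hits.
public class PerSliceCountSketch {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<Callable<Integer>> perSliceCounts = new ArrayList<>();
      perSliceCounts.add(() -> 120); // hits on slice 0
      perSliceCounts.add(() -> 340); // hits on slice 1
      perSliceCounts.add(() -> 7);   // hits on slice 2

      int totalHits = 0;
      for (Future<Integer> f : pool.invokeAll(perSliceCounts)) {
        totalHits += f.get(); // the entire "merge": a plain sum, no shared state, no locking
      }
      System.out.println("totalHits = " + totalHits);
    } finally {
      pool.shutdown();
    }
  }
}
{code}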

> Refactor Collector API for parallelism
> --------------------------------------
>
>                 Key: LUCENE-5299
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5299
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Shikhar Bhushan
>         Attachments: benchmarks.txt, LUCENE-5299.patch
>
>
> h2. Motivation
> We should be able to scale up better with Solr/Lucene by utilizing multiple 
> CPU cores, and not have to resort to scaling out by sharding (with all the 
> associated distributed-system pitfalls) when the index size does not warrant 
> it.
> Presently, IndexSearcher has an optional constructor arg for an 
> ExecutorService, which is used to search in parallel on the call paths 
> where one of the TopDocCollectors is created internally. The 
> per-atomic-reader search happens in parallel and then the 
> TopDocs/TopFieldDocs results are merged, with locking around the merge step.
> However, there are some problems with this approach:
> * If arbitrary Collector args come into play, we can't parallelize. Note that 
> even if results are ultimately going to a TopDocCollector, it may be wrapped 
> inside e.g. an EarlyTerminatingCollector or a TimeLimitingCollector, or both.
> * The special-casing with parallelism baked on top does not scale: there are 
> many Collectors that could potentially lend themselves to parallelism, and 
> special-casing means the parallelization has to be re-implemented whenever a 
> different permutation of collectors is to be used.
> h2. Proposal
> A refactoring of collectors that allows for parallelization at the level of 
> the collection protocol. 
> Some requirements that should guide the implementation:
> * easy migration path for collectors that need to remain serial
> * the parallelization should be composable (when collectors wrap other 
> collectors)
> * allow collectors to pick the optimal strategy (e.g. there may be memory 
> tradeoffs to be made) by advising the collector of whether a search will be 
> parallelized, so that the serial use case is not penalized.
> * encourage the use of non-blocking constructs and lock-free parallelism; 
> blocking is not advisable in the hot spot of a search, besides wasting 
> pooled threads.
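
For reference, here is a minimal sketch of the existing ExecutorService-based 
path described in the Motivation above (Lucene 4.x API; the IndexReader is 
assumed to already be open, and the field/term are placeholders):

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class ExecutorSearchSketch {
  // `reader` is assumed to be an already-open IndexReader over the target index.
  static TopDocs searchInParallel(IndexReader reader) throws Exception {
    ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    try {
      // The searcher fans the search out per atomic (leaf) reader and merges
      // the per-leaf TopDocs internally, with locking around the merge.
      IndexSearcher searcher = new IndexSearcher(reader, pool);
      // Only call paths that create a TopDocs-producing collector internally
      // are parallelized; passing an arbitrary Collector means serial collection.
      return searcher.search(new TermQuery(new Term("body", "lucene")), 10);
    } finally {
      pool.shutdown();
    }
  }
}
{code}

This is the special-casing the proposal aims to generalize: as soon as the 
top-level collector is wrapped (early termination, time limiting, etc.), the 
executor is no longer used.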



--
This message was sent by Atlassian JIRA
(v6.1#6144)
