Brief comment (without reading all the comments, so FWIW) on segment merging 
and sort order. The default TMP will not guarantee anything; the “cost” of a 
merge takes into account a number of things. I _think_ the index will tend 
towards retaining the insertion order in an insert-only index, but by no means 
is that guaranteed (or even checked).

That said, if it’s an insert-only index, extending TMP or creating an 
“InsertOnlyMergePolicy” would be pretty easy. You’d want to borrow a few 
concepts from TMP when it comes to merging “like sized” segments rather than 
rewriting hugely dissimilar segments, and maybe make the segment max larger, 
but it should be pretty straight-forward. You’d also want to order the actual 
merge once the list of segments to merge was determined, which I’m not 
absolutely sure happens (no evidence one way or the other).

All this may be totally irrelevant, and for all I know there’s already one at 
the Lucene level, haven’t checked.

FWIW,
Erick

> On Mar 12, 2020, at 09:29, Michael Sokolov (Jira) <[email protected]> wrote:
> 
> 
>    [ 
> https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057967#comment-17057967
>  ] 
> 
> Michael Sokolov commented on LUCENE-8929:
> -----------------------------------------
> 
> Thanks for the insightful comments, [~jim.ferenczi], you've given me a lot to 
> think about! I had not really considered sorting segments: that makes a lot 
> of sense when documents are at least roughly inserted in sort order. I would 
> have thought merges would interfere with that opto, but I guess for the most 
> part it works out? The performance improvements you saw are stunning. It 
> would be great if we could get the segment sorting ideas merged into the 
> Lucene code base, no? I wonder how we determine when they are applicable 
> though. In Elasticsearch is it done based on some a-priori knowledge, or do 
> you analyze the distribution and turn on the opto automatically? That would 
> be compelling I think. On the other hand, the use case inspiring this does 
> not tend to correlate index sort order and insertion order, so I don't think 
> it would benefit as much from segment sorting (except due to chance, or in 
> special cases), so I think these are really two separate optimizations and 
> issues. We should be sure to structure the code in such a way that can 
> accomodate them all and properly choose which one to apply. We don't have a 
> formal query planner in Lucene, but I guess we are beginning to evolve one.
> 
> I think the idea of splitting collectors is a good one, to avoid overmuch 
> complexity in a single collector, but there is also a good deal of shared 
> code across these. I can give that a try and see what it looks like. 
> 
> By the way, I did also run a test using luceneutil's "modification timestamp" 
> field as the index sort and saw similar gains. I think that field is more 
> tightly correlated with insertion order, and also has much higher 
> cardinality, so it makes a good counterpoint: I'll post results here later 
> once I can do a workup.
> 
> I hear your concern about the non-determinism due to tie-breaking, but I * 
> think* this is accounted for by including (global) docid in the comparison in 
> MaxScoreTerminator.LeafState? I may be missing something though. It doesn't 
> seem we have a good unit test checking for this tiebreak. I'll add to 
> TestTopFieldCollector.testRandomMaxScoreTermination to make sure that case is 
> covered.
> 
> I'm not sure what to say about the `LeafFieldComparator` idea - it sounds 
> powerful, but I am also a bit leery of these complex Comparators - they make 
> other things more difficult since it becomes challenging to reason about the 
> sort order "from the outside". I had to resort to some "instanceof" hackery 
> to restrict consideration to cases where the comparator is numeric, and 
> extracting the sort value from the comparator is pretty messy too. We pay a 
> complexity cost here to handle some edge cases of more abstract comparators.  
> 
>> Early Terminating CollectorManager
>> ----------------------------------
>> 
>>                Key: LUCENE-8929
>>                URL: https://issues.apache.org/jira/browse/LUCENE-8929
>>            Project: Lucene - Core
>>         Issue Type: Sub-task
>>           Reporter: Atri Sharma
>>           Priority: Major
>>         Time Spent: 7h 20m
>> Remaining Estimate: 0h
>> 
>> We should have an early terminating collector manager which accurately 
>> tracks hits across all of its collectors and determines when there are 
>> enough hits, allowing all the collectors to abort.
>> The options for the same are:
>> 1) Shared total count : Global "scoreboard" where all collectors update 
>> their current hit count. At the end of each document's collection, collector 
>> checks if N > threshold, and aborts if true
>> 2) State Reporting Collectors: Collectors report their total number of 
>> counts collected periodically using a callback mechanism, and get a proceed 
>> or abort decision.
>> 1) has the overhead of synchronization in the hot path, 2) can collect 
>> unnecessary hits before aborting.
>> I am planning to work on 2), unless objections
> 
> 
> 
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to