[jira] [Commented] (LUCENE-8788) Order LeafReaderContexts by Estimated Number Of Hits
[ https://issues.apache.org/jira/browse/LUCENE-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847455#comment-16847455 ]

Jim Ferenczi commented on LUCENE-8788:
--------------------------------------

{quote}
I like the idea [~jim.ferenczi] proposed. I can open a Jira for that and work on a patch for it as well, unless Jim wants to do it himself?
{quote}

Something is needed on the search side, and this issue is the right place to add such functionality. I wonder if we need an issue for the merge side, though, since it is already possible to change the order of segments in a custom FilterMergePolicy. I tried it in a POC and the change is trivial, so I am not sure that we need to do anything in core.

> Order LeafReaderContexts by Estimated Number Of Hits
> ----------------------------------------------------
>
>                 Key: LUCENE-8788
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8788
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Atri Sharma
>            Priority: Major
>
> We offer no guarantee on the order in which an IndexSearcher will look at
> segments during a search operation. This can be improved for use cases where
> an engine using Lucene invokes early termination and uses the partially
> collected hits. A better model would be if we sorted segments by the
> estimated number of hits, thus increasing the probability of the overall
> relevance of the returned partial results.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8788) Order LeafReaderContexts by Estimated Number Of Hits
[ https://issues.apache.org/jira/browse/LUCENE-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844820#comment-16844820 ]

Adrien Grand commented on LUCENE-8788:
--------------------------------------

I think this is definitely worth exploring. It looks like a subset of LUCENE-8727, since we are only aiming at using fully collected slices here to speed up slices that have not been collected yet.
[jira] [Commented] (LUCENE-8788) Order LeafReaderContexts by Estimated Number Of Hits
[ https://issues.apache.org/jira/browse/LUCENE-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844806#comment-16844806 ]

Atri Sharma commented on LUCENE-8788:
-------------------------------------

[~jpountz] Yes, that is precisely the idea, i.e. "learning" from previous collections to make decisions for the next set of collections. We can batch up slices into "families", i.e. a set of sets, where each set is collected sequentially with shared metastate as you described above. We could potentially collect multiple families in parallel. WDYT?

Thanks for validating the idea. I will work on a PoC patch now.

I like the idea [~jim.ferenczi] proposed. I can open a Jira for that and work on a patch for it as well, unless Jim wants to do it himself?
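To make the "families" idea a bit more concrete, here is a rough plain-Java sketch, not Lucene code and not the PoC patch. It assumes the query asks for the k smallest values of a numeric field, models each slice as a bare `long[]` of field values, and uses a hypothetical `FamilyCollector` class. Slices inside a family are collected one after another against a shared heap (the "metastate"), families run in parallel, and the per-family results are merged at the end.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: a "family" is a list of slices, a slice is a long[] of field values. */
final class FamilyCollector {

    /** Collects the slices of one family sequentially, sharing one bounded max-heap. */
    static List<Long> collectFamily(List<long[]> family, int k) {
        // Max-heap: the head is the worst hit currently retained.
        PriorityQueue<Long> heap = new PriorityQueue<>(Collections.reverseOrder());
        for (long[] slice : family) {
            for (long value : slice) {
                if (heap.size() < k) {
                    heap.add(value);
                } else if (value < heap.peek()) {
                    // State learned from earlier slices decides what later slices keep.
                    heap.poll();
                    heap.add(value);
                }
            }
        }
        return new ArrayList<>(heap);
    }

    /** Collects every family on its own thread and merges the per-family top-k lists. */
    static List<Long> topK(List<List<long[]>> families, int k) {
        if (k <= 0 || families.isEmpty()) {
            return new ArrayList<>();
        }
        ExecutorService pool = Executors.newFixedThreadPool(families.size());
        try {
            List<Future<List<Long>>> futures = new ArrayList<>();
            for (List<long[]> family : families) {
                futures.add(pool.submit(() -> collectFamily(family, k)));
            }
            List<Long> merged = new ArrayList<>();
            for (Future<List<Long>> future : futures) {
                merged.addAll(future.get());
            }
            Collections.sort(merged);
            return new ArrayList<>(merged.subList(0, Math.min(k, merged.size())));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch the shared state never crosses family boundaries, so no synchronization is needed inside a family; the price is that one family cannot benefit from what another has already learned.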
[jira] [Commented] (LUCENE-8788) Order LeafReaderContexts by Estimated Number Of Hits
[ https://issues.apache.org/jira/browse/LUCENE-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844796#comment-16844796 ]

Adrien Grand commented on LUCENE-8788:
--------------------------------------

Do I get your idea right that your plan is to select multiple slices, but to collect them sequentially rather than in parallel, so that collection of a slice can leverage information that was gathered in previous slices?

For instance, if a user wants the top 10 hits sorted by a numeric field foo, and the 10th best hit has a value of 7 for field foo after collecting the first slice, we could ignore documents whose value for the foo field is greater than 7 in follow-up slices. And then we can order slices in the order that best suits us, since Lucene has no expectation regarding the order in which slices are collected, so we could sort slices by increasing minimum (or maximum, or median) foo value.

This could be especially useful in the worst-case scenario where index order is inversely correlated with sort order. For instance, lots of users end up pushing logs to Lucene indices, and usually more recent logs get higher doc IDs. So fetching the most recent logs hits the worst-case scenario I just mentioned. Index sorting could help address this problem, but these users often have lots of data and care about indexing rate, while index sorting adds overhead to indexing.

A related idea that [~jimczi] mentioned to me would be to shuffle segments both at merge time and when opening point-in-time views, in order to avoid ever having an index order that is inversely correlated with sort order, similarly to how one can avoid running into quicksort's worst case by shuffling the array first.
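The pruning described in this comment can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not Lucene code: `SequentialSliceCollector`, `Slice` and `docsVisited` are hypothetical names, a slice is reduced to the foo values of its documents, and the sort is ascending on foo. Once k hits are held, the worst retained value acts as the bound, and a slice whose minimum cannot beat it is skipped entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of sequential slice collection with a shared competitive bound. */
final class SequentialSliceCollector {

    /** Stand-in for a slice: only the values of field foo, plus their precomputed minimum. */
    static final class Slice {
        final long[] values;
        final long min;

        Slice(long... values) {
            this.values = values;
            this.min = Arrays.stream(values).min().orElse(Long.MAX_VALUE);
        }
    }

    /** Number of documents actually looked at, to make the effect of ordering visible. */
    long docsVisited = 0;

    /** Returns the k smallest foo values, collecting slices one after another. */
    List<Long> topK(List<Slice> slices, int k, boolean sortSlicesByMin) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        List<Slice> order = new ArrayList<>(slices);
        if (sortSlicesByMin) {
            order.sort(Comparator.comparingLong((Slice s) -> s.min));
        }
        // Max-heap: the head is the k-th best hit so far, i.e. the current bound.
        PriorityQueue<Long> heap = new PriorityQueue<>(Collections.reverseOrder());
        for (Slice slice : order) {
            if (heap.size() == k && slice.min >= heap.peek()) {
                continue; // nothing in this slice can improve the top k
            }
            for (long value : slice.values) {
                docsVisited++;
                if (heap.size() < k) {
                    heap.add(value);
                } else if (value < heap.peek()) {
                    heap.poll();
                    heap.add(value);
                }
            }
        }
        List<Long> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }
}
```

With three slices holding {90, 95, 99}, {50, 60, 70} and {1, 2, 3} and k = 2, collecting in the given order visits all 9 documents, while sorting slices by increasing minimum visits only the 3 documents of the last slice. That is the inverse-correlation worst case in miniature.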
[jira] [Commented] (LUCENE-8788) Order LeafReaderContexts by Estimated Number Of Hits
[ https://issues.apache.org/jira/browse/LUCENE-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841035#comment-16841035 ]

Atri Sharma commented on LUCENE-8788:
-------------------------------------

Folks, any thoughts on this?

I am envisioning adding a generic mechanism that allows users to order slices in any custom order. Note that segments within a slice will still be ordered by docID to maintain guarantees while collecting hits. This can be useful for custom early termination logic, and it gives users better control to customize query execution for their specific workloads.
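One possible shape for such a mechanism, sketched in plain Java rather than against the real IndexSearcher API: the caller supplies a `Comparator` over slices, and segments inside each slice stay sorted by doc ID. `SegmentInfo`, `SliceOrdering` and the `estimatedHits` field are hypothetical names used only for illustration, and the sketch assumes the hit estimate comes from somewhere else.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a segment; not a Lucene class. */
final class SegmentInfo {
    final String name;
    final int docBase;        // first global doc ID of the segment
    final long estimatedHits; // assumed to be provided by some estimator

    SegmentInfo(String name, int docBase, long estimatedHits) {
        this.name = name;
        this.docBase = docBase;
        this.estimatedHits = estimatedHits;
    }
}

/** Hypothetical sketch of a user-pluggable slice order. */
final class SliceOrdering {

    /** Example order from the issue description: slices with the most estimated hits first. */
    static final Comparator<List<SegmentInfo>> BY_ESTIMATED_HITS_DESC =
        Comparator.comparingLong((List<SegmentInfo> slice) -> totalEstimatedHits(slice)).reversed();

    static long totalEstimatedHits(List<SegmentInfo> slice) {
        long sum = 0;
        for (SegmentInfo segment : slice) {
            sum += segment.estimatedHits;
        }
        return sum;
    }

    /**
     * Returns a new list of slices sorted by the caller's comparator.
     * Segments inside each slice are always sorted by docBase, whatever the slice order is.
     */
    static List<List<SegmentInfo>> order(List<List<SegmentInfo>> slices,
                                         Comparator<List<SegmentInfo>> sliceOrder) {
        List<List<SegmentInfo>> result = new ArrayList<>();
        for (List<SegmentInfo> slice : slices) {
            List<SegmentInfo> copy = new ArrayList<>(slice);
            copy.sort(Comparator.comparingInt((SegmentInfo s) -> s.docBase));
            result.add(copy);
        }
        result.sort(sliceOrder);
        return result;
    }
}
```

Passing the order as a `Comparator` keeps the policy with the caller, so ordering by estimated hits is just one option among others.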