[ https://issues.apache.org/jira/browse/LUCENE-7482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
AMIRAULT Martin updated LUCENE-7482: ------------------------------------ Description: We are currently using Lucene here in my company for our main product. Our search functionnality is quite basic and the results are always sorted given a predefined field. The user is only able to choose the sort order (Asc/Desc). I am currently investigating using the index sort feature with EarlyTerminationSortingCollector. This is quite a shame searching on a sorted index in reverse order do not have any optimization and was wondering if it would be possible to make it faster by creating a special "ReverseSortingCollector" for this purpose. I am aware the posting list is designed to be always iterated in the same order, so it is not about early-terminating the search but more about filtering-out unneeded documents more efficiently. If a segment is sorted in reverse order, we can work out easily the docId from which documents should be collected. Here is a sample quick code: {code:title=ReverseSortingCollector.java|borderStyle=solid} public class ReverseSortingCollector extends FilterCollector { /** Sort used to sort the search results */ protected final Sort sort; /** Number of documents to collect in each segment */ protected final int numDocsToCollect; [...] @Override public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException { LeafReader reader = context.reader(); Sort segmentSort = reader.getIndexSort(); if (isReverseOrder(sort, segmentSort)) {//segment is sorted in reverse order than the search sort //Here we can easily work out the docNum from which we should collect long collectFrom = context.reader().numDocs() - numDocsToCollect; return new FilterLeafCollector(in.getLeafCollector(context)) { @Override public void collect(int doc) throws IOException { if (doc >= collectFrom) {//only delegates super.collect(doc); } } }; }else{ return in.getLeafCollector(context); } } } {code} This is specially efficient when used along with TopFieldCollector as a lot of docValue lookup would not take place. In my experiment it reduced search time by 90%. However I was wondering if it is correct, as my knowledge of Lucene is still quite limited. Especially is it correct to assume that LeafReader docId always span from 0=>LeafReader.numDocs() ? Note : Does not support paging. Could be eventually implemented by providing a way to look up the docId to match from the last document collected (eg for LongPoint querying the docId closest to the previously returned value...) was: We are currently using Lucene here in my company for our main product. Our search functionnality is quite basic and the results are always sorted given a predefined field. The user is only able to choose the sort order (Asc/Desc). I am currently investigating using the index sort feature with EarlyTerminationSortingCollector. This is quite a shame searching on a sorted index in reverse order do not have any optimization and was wondering if it would be possible to make it faster by creating a special "ReverseSortingCollector" for this purpose. I am aware the posting list is designed to be always iterated in the same order, so it is not about early-terminating the search but more about filtering-out unneeded documents more efficiently. If a segment is sorted in reverse order, we can work out easily the docId from which documents should be collected. Here is a sample quick code: {code:title=ReverseSortingCollector.java|borderStyle=solid} public class ReverseSortingCollector extends FilterCollector { /** Sort used to sort the search results */ protected final Sort sort; /** Number of documents to collect in each segment */ protected final int numDocsToCollect; [...] @Override public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException { LeafReader reader = context.reader(); Sort segmentSort = reader.getIndexSort(); if (isReverseOrder(sort, segmentSort)) {//segment is sorted in reverse order than the search sort //Here we can easily work out the docNum from which we should collect long collectFrom = context.reader().numDocs() - numDocsToCollect; return new FilterLeafCollector(in.getLeafCollector(context)) { @Override public void collect(int doc) throws IOException { if (doc >= collectFrom) {//only delegates super.collect(doc); } } }; }else{ return in.getLeafCollector(context); } } } {code} This is specially efficient when used along with TopFieldCollector as a lot of docValue lookup would not take place. In my experiment it reduced search time by 90%. However I was wondering if it is correct, as my knowledge of Lucene is still quite limited. Especially is it correct to assume that LeafReader docId always span from 0->LeafReader.numDocs() ? Note : Does not support paging. Could be eventually implemented by providing a way to look up the docId to match from the last document collected (eg for LongPoint querying the docId closest to the previously returned value...) > Faster sorted index search for reverse order search > --------------------------------------------------- > > Key: LUCENE-7482 > URL: https://issues.apache.org/jira/browse/LUCENE-7482 > Project: Lucene - Core > Issue Type: New Feature > Reporter: AMIRAULT Martin > Priority: Minor > > We are currently using Lucene here in my company for our main product. > Our search functionnality is quite basic and the results are always sorted > given a predefined field. The user is only able to choose the sort order > (Asc/Desc). > I am currently investigating using the index sort feature with > EarlyTerminationSortingCollector. > This is quite a shame searching on a sorted index in reverse order do not > have any optimization and was wondering if it would be possible to make it > faster by creating a special "ReverseSortingCollector" for this purpose. > I am aware the posting list is designed to be always iterated in the same > order, so it is not about early-terminating the search but more about > filtering-out unneeded documents more efficiently. > If a segment is sorted in reverse order, we can work out easily the docId > from which documents should be collected. > Here is a sample quick code: > {code:title=ReverseSortingCollector.java|borderStyle=solid} > public class ReverseSortingCollector extends FilterCollector { > /** Sort used to sort the search results */ > protected final Sort sort; > /** Number of documents to collect in each segment */ > protected final int numDocsToCollect; > > [...] > @Override > public LeafCollector getLeafCollector(LeafReaderContext context) throws > IOException { > LeafReader reader = context.reader(); > Sort segmentSort = reader.getIndexSort(); > if (isReverseOrder(sort, segmentSort)) {//segment is sorted in > reverse order than the search sort > > //Here we can easily work out the docNum from which we > should collect > long collectFrom = context.reader().numDocs() - > numDocsToCollect; > > return new FilterLeafCollector(in.getLeafCollector(context)) { > @Override > public void collect(int doc) throws IOException { > if (doc >= collectFrom) {//only delegates > super.collect(doc); > } > } > }; > }else{ > return in.getLeafCollector(context); > } > } > > } > {code} > This is specially efficient when used along with TopFieldCollector as a lot > of docValue lookup would not take place. > In my experiment it reduced search time by 90%. > However I was wondering if it is correct, as my knowledge of Lucene is still > quite limited. > Especially is it correct to assume that LeafReader docId always span from > 0=>LeafReader.numDocs() ? > Note : Does not support paging. Could be eventually implemented by providing > a way to look up the docId to match from the last document collected (eg for > LongPoint querying the docId closest to the previously returned value...) -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org