[ https://issues.apache.org/jira/browse/LUCENE-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875943#comment-16875943 ]
ASF subversion and git services commented on LUCENE-8885: --------------------------------------------------------- Commit f5991d96bab4297b2c975dc2fb1a8124b91f34b6 in lucene-solr's branch refs/heads/branch_8x from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f5991d9 ] LUCENE-8885: Optimise BKD reader by exploiting cardinality information stored on leaves (#746) The commit adds the method InstersectVisitor#visit(DocIdSetIterator, byte[]). > Optimise BKD reader by exploiting cardinality information stored on leaves > -------------------------------------------------------------------------- > > Key: LUCENE-8885 > URL: https://issues.apache.org/jira/browse/LUCENE-8885 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Ignacio Vera > Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > In LUCENE-8688 it was introduce a new storing strategy for leaves contains > duplicated points. In such case the points are stored together with the > cardinality. We still call the IntersectVisitor once per document therefore > we are checking many times the same point agains the query. The idea is to > check the point once and then add all the documents. > The API of the IntersectVisitor does not allow that, and therefore to exploit > that property we need to either change the API or extend it. Here are the > possibilities I can think of: > 1) Modify the API by replacing the method IntersectVisitor#visit(byte[], int) > by the following method: > {code:java} > /** Called for leaf cells that intersects the leaf to test if the point > matches to the query > * In case it matches, the implementor must call {@link > IntersectVisitor#visit(int)} with the > * documents associated with this point are visited */ > boolean matches(byte[] packedValue) throws IOException; > {code} > This will allow the BKD reader to check if a point matches the query and if > true then Coll the method IntersectVisitor#visit(int) for all documents > associated with that point. > The drawback of this approach is backwards compatibility and the need to > update all classes implement this interface. > 2) Extends the API by adding a new default method in the IntersectVisitor > interface: > {code:java} > /** Called for documents in a leaf cell that crosses the query. The consumer > * should scrutinize the packedValue to decide whether to accept it. If > accepted it should > * consider only the {@code numberDocs} documents starting at {@code > offset} In the 1D case, > * values are visited in increasing order, and in the case of ties, in > increasing > * docID order. */ > default void visit(int[] docID, int offset, int numberDocs, byte[] > packedValue) throws IOException { > for ( int i =offset; i < offset + numberDocs; i++) { > visit(docID[i], packedValue); > } > } > {code} > The merit of this approach is that is backwards compatible and it is up to > the implementors to override this method and get the benefits for this > optimisation.The biggest downside is that it assumes that the codec has doc > IDs available in an int[] slice as opposed to streaming them from disk > directly to the IntersectVisitor for instance as [~jpountz] noted. > Maybe there are more options I did not think about so looking forward to > hearing opining if we should do this change at all and if so, how to approach > it. My +1 goes to 1). -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org