[
https://issues.apache.org/jira/browse/LUCENE-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873323#comment-16873323
]
Adrien Grand commented on LUCENE-8885:
--------------------------------------
I don't think backward compatibility should be a concern for such low-level
APIs. I like 1 but then we wouldn't have an API that we could use for merges
anymore since we wouldn't have a reliable way to know which byte[] maps to
which docID(s)? Maybe 2 is more practical, and we could make it look a bit
nicer by replacing the int[] with a DocIdSetIterator, ie. {{void
visit(DocIdSetIterator docs, byte[] packedValue) throws IOException}}?
> Optimise BKD reader by exploiting cardinality information stored on leaves
> --------------------------------------------------------------------------
>
> Key: LUCENE-8885
> URL: https://issues.apache.org/jira/browse/LUCENE-8885
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ignacio Vera
> Priority: Major
>
> In LUCENE-8688 it was introduce a new storing strategy for leaves contains
> duplicated points. In such case the points are stored together with the
> cardinality. We still call the IntersectVisitor once per document therefore
> we are checking many times the same point agains the query. The idea is to
> check the point once and then add all the documents.
> The API of the IntersectVisitor does not allow that, and therefore to exploit
> that property we need to either change the API or extend it. Here are the
> possibilities I can think of:
> 1) Modify the API by replacing the method IntersectVisitor#visit(byte[], int)
> by the following method:
> {code:java}
> /** Called for leaf cells that intersects the leaf to test if the point
> matches to the query
> * In case it matches, the implementor must call {@link
> IntersectVisitor#visit(int)} with the
> * documents associated with this point are visited */
> boolean matches(byte[] packedValue) throws IOException;
> {code}
> This will allow the BKD reader to check if a point matches the query and if
> true then Coll the method IntersectVisitor#visit(int) for all documents
> associated with that point.
> The drawback of this approach is backwards compatibility and the need to
> update all classes implement this interface.
> 2) Extends the API by adding a new default method in the IntersectVisitor
> interface:
> {code:java}
> /** Called for documents in a leaf cell that crosses the query. The consumer
> * should scrutinize the packedValue to decide whether to accept it. If
> accepted it should
> * consider only the {@code numberDocs} documents starting at {@code
> offset} In the 1D case,
> * values are visited in increasing order, and in the case of ties, in
> increasing
> * docID order. */
> default void visit(int[] docID, int offset, int numberDocs, byte[]
> packedValue) throws IOException {
> for ( int i =offset; i < offset + numberDocs; i++) {
> visit(docID[i], packedValue);
> }
> }
> {code}
> The merit of this approach is that is backwards compatible and it is up to
> the implementors to override this method and get the benefits for this
> optimisation.The biggest downside is that it assumes that the codec has doc
> IDs available in an int[] slice as opposed to streaming them from disk
> directly to the IntersectVisitor for instance as [~jpountz] noted.
> Maybe there are more options I did not think about so looking forward to
> hearing opining if we should do this change at all and if so, how to approach
> it. My +1 goes to 1).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]