Re: Questions about the new vector API

Michael Sokolov Wed, 17 Mar 2021 06:51:49 -0700

I see, right, we can create a Codec that applies the values taken from
the schema for a given field, sure, that works.
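
For illustration, the schema-driven lookup could look roughly like the
sketch below. It is deliberately self-contained and uses no Lucene
classes: SchemaVectorCodecSketch, FieldParams, VectorFormatStandIn and
the fallback values are all hypothetical stand-ins. Only the parameter
names maxConn and beamWidth come from this thread. In real code the
lookup would presumably sit in a per-field hook of a Codec subclass,
similar in spirit to what SchemaCodecFactory does.

```java
import java.util.Map;

// Self-contained sketch: none of these types are Lucene classes.
class SchemaVectorCodecSketch {

    /** Hypothetical per-field graph parameters, as they might be read from a schema. */
    static final class FieldParams {
        final int maxConn;
        final int beamWidth;

        FieldParams(int maxConn, int beamWidth) {
            this.maxConn = maxConn;
            this.beamWidth = beamWidth;
        }
    }

    /** Hypothetical stand-in for a vector format configured via its constructor. */
    static final class VectorFormatStandIn {
        final int maxConn;
        final int beamWidth;

        VectorFormatStandIn(int maxConn, int beamWidth) {
            this.maxConn = maxConn;
            this.beamWidth = beamWidth;
        }
    }

    // Placeholder fallback values, chosen arbitrarily for this example.
    static final FieldParams DEFAULTS = new FieldParams(16, 16);

    private final Map<String, FieldParams> schema;

    SchemaVectorCodecSketch(Map<String, FieldParams> schema) {
        this.schema = Map.copyOf(schema);
    }

    /** Plays the role of a per-field hook on a Codec: field name in, configured format out. */
    VectorFormatStandIn formatForField(String field) {
        FieldParams p = schema.getOrDefault(field, DEFAULTS);
        return new VectorFormatStandIn(p.maxConn, p.beamWidth);
    }

    public static void main(String[] args) {
        SchemaVectorCodecSketch codec = new SchemaVectorCodecSketch(
            Map.of("title_vector", new FieldParams(32, 200)));
        VectorFormatStandIn format = codec.formatForField("title_vector");
        System.out.println("maxConn=" + format.maxConn + " beamWidth=" + format.beamWidth);
    }
}
```

With this shape the parameters never need to be written to the index;
they are resolved from the schema each time a format is requested for a
field.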


On Wed, Mar 17, 2021 at 3:10 AM Adrien Grand <[email protected]> wrote:
>
> Configuring the codec based on the schema is something that Solr does via 
> SchemaCodecFactory. 
> https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java
>
> Would a similar approach work in your case?
>
> On Tue, Mar 16, 2021 at 10:21 PM Michael Sokolov <[email protected]> wrote:
>>
>> > I was more thinking of moving VectorValues#search to 
>> > LeafReader#searchNearestVectors or something along those lines. I agree 
>> > that VectorReader should only be exposed on CodecReader.
>>
>> Ah, OK, yes that makes sense to me. I guess we were maybe reluctant to
>> add such visible API changes early on in the project.
>>
>> > I wonder if things would be simpler if we were more opinionated and made 
>> > vectors specifically about nearest-neighbor search. Then we have a clearer 
>> > message, use vectors for NN search and doc values otherwise. As far as I 
>> > know, reinterpreting bytes as floats shouldn't add much overhead. The main 
>> > problem I know of is that the JVM won't auto-vectorize if you read floats 
>> > dynamically from a byte[], but this is something that should be alleviated 
>> > by the JDK vector API?
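
As a rough illustration of the decoding step being discussed, the
sketch below round-trips a float[] through a byte[] using
java.nio.ByteBuffer. The class name BytesToFloatsSketch, the helper
names, and the little-endian layout are assumptions made for this
example; the thread does not say how vectors would be serialized into
a binary doc value.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper class; not part of Lucene.
class BytesToFloatsSketch {

    /** Packs floats into bytes (little-endian is an assumption of this sketch). */
    static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN);
        buf.asFloatBuffer().put(vector);
        return buf.array();
    }

    /** The extra decoding step: bulk-copies the bytes back into a float[]. */
    static float[] decode(byte[] bytes) {
        float[] out = new float[bytes.length / Float.BYTES];
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(out);
        return out;
    }

    /** Plain scalar loop over an already decoded float[]. */
    static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] stored = decode(encode(new float[] {1f, 2f, 3f}));
        System.out.println(dotProduct(stored, new float[] {4f, 5f, 6f}));
    }
}
```

Decoding into a float[] once and then running the scoring loop over
that array keeps the loop itself free of per-element byte reads, which
is the case the auto-vectorization remark is concerned with.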
>>
>> > Actually this is the thing that feels odd to me: if we end up with 
>> > constants for both LSH and HNSW, then we are adding the requirement that 
>> > all vector formats must implement both LSH and HNSW as they will need to 
>> > support all SearchStrategy constants?
>>
>> Hmm I see I didn't think this all the way through ... I guess I had it
>> in mind that there would probably only ever be a single format with
>> internal variants for different vector index types, but as I have
>> worked more with Lucene's index formats I see that is awkward, and I'm
>> certainly open to restructuring it in a more natural way. Similarly
>> for the NONE format - BinaryDocValues can be used for such
>> (non-searchable) vectors. Indeed we had such an implementation and
>> although we recently switched it to use the NONE format for
>> uniformity, it could easily be switched back.
>>
>> Regarding the graph construction parameters (maxConn and beamWidth)
>> I'm not sure what the right approach is exactly. We struggled to find
>> the best API for this. I guess my concern about the PerField* approach
>> is (at least as I think I understand it) it needs to be configured in
>> code when creating a Codec. But we would like to be able to read such
>> parameters from a schema configuration. I think of them as in the same
>> spirit as an Analyzer. However I may not have fully appreciated the
>> intention of, or how to make the best use of PerField formats. It is
>> true we don't really need to write these parameters to the index;
>> we're free to use different values when merging for example.
>>
>> -Mike
>>
>> On Tue, Mar 16, 2021 at 2:15 PM Adrien Grand <[email protected]> wrote:
>> >
>> > Hello Mike,
>> >
>> > On Tue, Mar 16, 2021 at 5:05 PM Michael Sokolov <[email protected]> wrote:
>> >>
>> >> I think the reason we have search() on VectorValues is that we have
>> >> LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>> >> but no way to access the VectorReader. Do you think we should also
>> >> have LeafReader.getVectorReader()? Today it's only on CodecReader.
>> >
>> >
>> > I was more thinking of moving VectorValues#search to 
>> > LeafReader#searchNearestVectors or something along those lines. I agree 
>> > that VectorReader should only be exposed on CodecReader.
>> >
>> >>
>> >> Re: SearchStrategy.NONE; the idea is we support efficient access to
>> >> floating point values. Using BinaryDocValues for this will always
>> >> require an additional decoding step. I can see that the naming is
>> >> confusing there. The intent is that you index the vector values, but
>> >> no additional indexing data structure.
>> >
>> >
>> > I wonder if things would be simpler if we were more opinionated and made 
>> > vectors specifically about nearest-neighbor search. Then we have a clearer 
>> > message, use vectors for NN search and doc values otherwise. As far as I 
>> > know, reinterpreting bytes as floats shouldn't add much overhead. The main 
>> > problem I know of is that the JVM won't auto-vectorize if you read floats 
>> > dynamically from a byte[], but this is something that should be alleviated 
>> > by the JDK vector API?
>> >
>> >> Also: the reason HNSW is
>> >> mentioned in these SearchStrategy enums is to make room for other
>> >> vector indexing approaches, like LSH. There was a lot of discussion
>> >> that we wanted an API that allowed for experimenting with other
>> >> techniques for indexing and searching vector values.
>> >
>> >
>> > Actually this is the thing that feels odd to me: if we end up with 
>> > constants for both LSH and HNSW, then we are adding the requirement that 
>> > all vector formats must implement both LSH and HNSW as they will need to 
>> > support all SearchStrategy constants? Would it be possible to have a 
>> > single API and then two implementations of VectorsFormat, LSHVectorsFormat 
>> > on the one hand and HNSWVectorsFormat on the other hand?
>> >
>> >> Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>> >> but I think the situation is more akin to Points, where we have the
>> >> options on IndexableField. The metadata we store there (dimension and
>> >> score function) don't really result in different formats, i.e. code
>> >> paths for indexing and storage; they are more like parameters to the
>> >> format, in my mind. Perhaps the situation will look different when we
>> >> get our second vector indexing strategy (like LSH).
>> >
>> >
>> > Having the dimension count and the score function on the FieldType 
>> > actually makes sense to me. I was more wondering whether maxConn and 
>> > beamWidth actually belong to the FieldType, or if they should be made 
>> > constructor arguments of Lucene90VectorFormat.
>> >
>> > --
>> > Adrien
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
