Consistent plural naming makes sense to me. I think it ended up
singular because I am biased to avoid plural names unless there is a
useful distinction to be made. But consistency should trump my
predilections.

I think the reason we have search() on VectorValues is that we have
LeafReader.getVectorValues() (by analogy to the DocValues iterators),
but no way to access the VectorReader. Do you think we should also
have LeafReader.getVectorReader()? Today it's only on CodecReader.

Re: SearchStrategy.NONE: the idea is to support efficient access to
floating-point values. Using BinaryDocValues for this would always
require an additional decoding step. I can see that the naming is
confusing there. The intent is that you index the vector values, but
build no additional indexing data structure. Also: the reason HNSW is
mentioned in these SearchStrategy enums is to make room for other
vector indexing approaches, like LSH. There was a lot of discussion
that we wanted an API that allowed for experimenting with other
techniques for indexing and searching vector values.
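
To make the "additional decoding step" concrete, here is a rough
sketch of what a reader pays when floats are stored as opaque bytes,
as they would be in a binary doc values field. It uses only java.nio,
not any Lucene API; the class and method names are hypothetical, and
the little-endian layout is an assumption chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class VectorDecodeSketch {

  // Hypothetical helper: pack a float vector into bytes, as one would
  // before handing it to a binary field. Byte order is an assumption.
  public static byte[] encode(float[] vector) {
    ByteBuffer buf =
        ByteBuffer.allocate(vector.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    buf.asFloatBuffer().put(vector); // the float view writes into buf's backing array
    return buf.array();
  }

  // The extra step: every read has to turn the bytes back into floats
  // before any distance computation can run.
  public static float[] decode(byte[] bytes) {
    float[] vector = new float[bytes.length / Float.BYTES];
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vector);
    return vector;
  }

  public static void main(String[] args) {
    float[] original = {0.5f, -1.25f, 3.0f};
    float[] roundTripped = decode(encode(original));
    System.out.println(Arrays.toString(roundTripped));
  }
}
```

A format that knows it is storing float vectors can hand back the
float[] directly and keep this conversion out of caller code.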

Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
but I think the situation is more akin to Points, where we have the
options on IndexableField. The metadata we store there (dimension and
score function) doesn't really result in different formats, i.e. code
paths for indexing and storage; it is more like a set of parameters to
the format, in my mind. Perhaps the situation will look different when we
get our second vector indexing strategy (like LSH).


On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida
<tomoko.uchida.1...@gmail.com> wrote:
>
> > Should we rename VectorFormat to VectorsFormat? This would be more 
> > consistent with other file formats that use the plural, like 
> > PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
>
> +1 for using plural form for consistency - if we reconsider the names, how 
> about VectorValuesFormat so that it follows the naming convention for 
> XXXValues?
>
> DocValuesFormat / DocValues
> PointValuesFormat / PointValues
> VectorValuesFormat / VectorValues (currently, VectorFormat / VectorValues)
>
> > Should SearchStrategy constants avoid explicit references to HNSW?
>
> Also +1 for decoupling HNSW specific implementations from general vectors, 
> though I am not fully sure if we can strictly separate the similarity metrics 
> and search algorithms for vectors.
> LUCENE-9322 (unified vectors API) was resolved months ago, does it achieve 
> its goal? I haven't followed the issue in months because of my laziness...
>
> Thanks,
> Tomoko
>
>
> On Tue, Mar 16, 2021 at 19:32 Adrien Grand <jpou...@gmail.com> wrote:
>>
>> Hello,
>>
>> I've tried to catch up on the vector API and I have the following questions. 
>> I've tried to read through discussions on JIRA first in case it had been 
>> covered, but it's possible I missed some relevant ones.
>>
>> Should VectorValues#search be on VectorReader instead? It felt a bit odd to 
>> me to have the search logic on the iterator.
>>
>> Do we need SearchStrategy.NONE? Documentation suggests that it allows 
>> storing vectors but that NN search won't be supported. This looks like a 
>> use-case for binary doc values to me? It also slightly caught me by surprise 
>> due to the inconsistency with IndexOptions.NONE, which means "do not index 
>> this field" (and likewise for DocValuesType.NONE), so I first assumed that 
>> SearchStrategy.NONE also meant "do not index this field as a vector".
>>
>> While postings and doc-value formats allow per-field configuration via 
>> PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors use a different 
>> mechanism where VectorField#createHnswType sets attributes on the field type 
>> that the vectors writer then reads. Should we have a PerFieldVectorsFormat 
>> instead and configure these options via the vectors format?
>>
>> Should SearchStrategy constants avoid explicit references to HNSW? The rest 
>> of the API seems to try to be agnostic of the way that NN search is 
>> implemented. Could we make SearchStrategy only about the similarity metric 
>> that is used for vectors? This particular point seems discussed on 
>> LUCENE-9322 but I couldn't find the conclusion.
>>
>> Should we rename VectorFormat to VectorsFormat? This would be more 
>> consistent with other file formats that use the plural, like PostingsFormat, 
>> DocValuesFormat, TermVectorsFormat, etc.?
>>
>> --
>> Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
