Re: Questions about the new vector API

Robert Muir Tue, 16 Mar 2021 11:16:14 -0700

If you click the github link from here, it says in the README.md: "Focus on
datasets that fit in RAM. Out of core ANN could be the topic of a later
comparison."


But a quick Google search on some of these out-of-core ANN algorithms shows
some promise. Here is the summary of the first one I stumbled on:

We propose a novel approach to compute KNN on large datasets by leveraging
both disk and main memory efficiently. The main rationale of our approach
is to minimize random accesses to disk, maximize sequential accesses to
data and efficient usage of only the available memory. We evaluate our
approach on large datasets, in terms of performance and memory consumption.
The evaluation shows that our approach requires only 7% of the time needed
by an in-memory baseline to compute a KNN graph.

https://hal.inria.fr/hal-01336673/document
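
To make the sequential-access idea concrete, here is a minimal Java sketch of an exact brute-force k-NN scan. It is not the algorithm from the paper above, and none of the names (SequentialKnn, Hit, topK) are Lucene APIs; they are made up for illustration. It reads the vectors once, strictly in order, through a plain Iterator, and keeps only a k-sized heap in memory.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class SequentialKnn {

    /** Hypothetical result holder (position in the stream, squared distance). */
    public static final class Hit {
        public final int doc;
        public final float dist;

        public Hit(int doc, float dist) {
            this.doc = doc;
            this.dist = dist;
        }
    }

    /** Squared Euclidean distance between two vectors of equal length. */
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Returns the k nearest vectors to the query, closest first, using a
     * single forward pass and no random access to the vector data.
     */
    public static List<Hit> topK(Iterator<float[]> vectors, float[] query, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Max-heap on distance: the worst of the current best k sits on top.
        PriorityQueue<Hit> heap = new PriorityQueue<>(
                Comparator.comparingDouble((Hit h) -> h.dist).reversed());
        int doc = 0;
        while (vectors.hasNext()) {
            float dist = squaredDistance(vectors.next(), query);
            if (heap.size() < k) {
                heap.add(new Hit(doc, dist));
            } else if (dist < heap.peek().dist) {
                heap.poll(); // evict the current worst hit
                heap.add(new Hit(doc, dist));
            }
            doc++;
        }
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort(Comparator.comparingDouble((Hit h) -> h.dist));
        return hits;
    }
}
```

Calling SequentialKnn.topK(list.iterator(), query, 10) on any collection of float[] works. The cost is a full scan per query, which is the trade-off against a graph index like HNSW that visits far fewer vectors but needs random access to do so.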

On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <[email protected]> wrote:

> ann-benchmarks.com maintains open benchmarks of a bunch of ANN
> (approximate NN) algorithms. When we started this effort, HNSW was at
> the top of the heap in most of the benchmarks.
>
> On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <[email protected]> wrote:
> >
> > Where are the alternative algorithms that work on sequential iterators
> > and don't need random access?
> >
> > Seems like these should be the ones we initially add to Lucene, and HNSW
> > should be put aside for now? (is it a toy, or can we do it without
> > jazillions of random accesses?)
> >
> > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov <[email protected]> wrote:
> >>
> >> There's also some good discussion on
> >> https://issues.apache.org/jira/browse/LUCENE-9583 about random access
> >> vs iterator pattern that never got fully resolved. We said we would
> >> revisit after KNN (LUCENE-9004) landed, and now it has. The usage of
> >> random access is pretty well-established there, maybe we should
> >> abandon the iterator API since it is redundant (you can always iterate
> >> over a random access API if you know the size)?
> >>
> >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov <[email protected]> wrote:
> >> >
> >> > Also, Tomoko re:LUCENE-9322, did it succeed? I guess we won't know for
> >> > sure unless someone revives
> >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something like
> >> > that
> >> >
> >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <[email protected]> wrote:
> >> > >
> >> > > Consistent plural naming makes sense to me. I think it ended up
> >> > > singular because I am biased to avoid plural names unless there is a
> >> > > useful distinction to be made. But consistency should trump my
> >> > > predilections.
> >> > >
> >> > > I think the reason we have search() on VectorValues is that we have
> >> > > LeafReader.getVectorValues() (by analogy to the DocValues iterators),
> >> > > but no way to access the VectorReader. Do you think we should also
> >> > > have LeafReader.getVectorReader()? Today it's only on CodecReader.
> >> > >
> >> > > Re: SearchStrategy.NONE; the idea is we support efficient access to
> >> > > floating point values. Using BinaryDocValues for this will always
> >> > > require an additional decoding step. I can see that the naming is
> >> > > confusing there. The intent is that you index the vector values, but
> >> > > no additional indexing data structure. Also: the reason HNSW is
> >> > > mentioned in these SearchStrategy enums is to make room for other
> >> > > vector indexing approaches, like LSH. There was a lot of discussion
> >> > > that we wanted an API that allowed for experimenting with other
> >> > > techniques for indexing and searching vector values.
> >> > >
> >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
> >> > > but I think the situation is more akin to Points, where we have the
> >> > > options on IndexableField. The metadata we store there (dimension and
> >> > > score function) doesn't really result in different formats, i.e. code
> >> > > paths for indexing and storage; they are more like parameters to the
> >> > > format, in my mind. Perhaps the situation will look different when we
> >> > > get our second vector indexing strategy (like LSH).
> >> > >
> >> > >
> >> > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida
> >> > > <[email protected]> wrote:
> >> > > >
> >> > > > > Should we rename VectorFormat to VectorsFormat? This would be more consistent with other file formats that use the plural, like PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
> >> > > >
> >> > > > +1 for using plural form for consistency - if we reconsider the names, how about VectorValuesFormat so that it follows the naming convention for XXXValues?
> >> > > >
> >> > > > DocValuesFormat / DocValues
> >> > > > PointValuesFormat / PointValues
> >> > > > VectorValuesFormat / VectorValues (currently, VectorFormat / VectorValues)
> >> > > >
> >> > > > > Should SearchStrategy constants avoid explicit references to HNSW?
> >> > > >
> >> > > > Also +1 for decoupling HNSW specific implementations from general vectors, though I am not fully sure if we can strictly separate the similarity metrics and search algorithms for vectors.
> >> > > > LUCENE-9322 (unified vectors API) was resolved months ago, does it achieve its goal? I haven't followed the issue in months because of my laziness...
> >> > > >
> >> > > > Thanks,
> >> > > > Tomoko
> >> > > >
> >> > > >
> >> > > > On Tue, Mar 16, 2021 at 19:32 Adrien Grand <[email protected]> wrote:
> >> > > >>
> >> > > >> Hello,
> >> > > >>
> >> > > >> I've tried to catch up on the vector API and I have the following questions. I've tried to read through discussions on JIRA first in case it had been covered, but it's possible I missed some relevant ones.
> >> > > >>
> >> > > >> Should VectorValues#search be on VectorReader instead? It felt a bit odd to me to have the search logic on the iterator.
> >> > > >>
> >> > > >> Do we need SearchStrategy.NONE? Documentation suggests that it allows storing vectors but that NN search won't be supported. This looks like a use-case for binary doc values to me? It also slightly caught me by surprise due to the inconsistency with IndexOptions.NONE, which means "do not index this field" (and likewise for DocValuesType.NONE), so I first assumed that SearchStrategy.NONE also meant "do not index this field as a vector".
> >> > > >>
> >> > > >> While postings and doc-value formats allow per-field configuration via PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors use a different mechanism where VectorField#createHnswType sets attributes on the field type that the vectors writer then reads. Should we have a PerFieldVectorsFormat instead and configure these options via the vectors format?
> >> > > >>
> >> > > >> Should SearchStrategy constants avoid explicit references to HNSW? The rest of the API seems to try to be agnostic of the way that NN search is implemented. Could we make SearchStrategy only about the similarity metric that is used for vectors? This particular point seems discussed on LUCENE-9322 but I couldn't find the conclusion.
> >> > > >>
> >> > > >> Should we rename VectorFormat to VectorsFormat? This would be more consistent with other file formats that use the plural, like PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
> >> > > >>
> >> > > >> --
> >> > > >> Adrien
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
>
>
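
On the point quoted above that you can always iterate over a random access API if you know the size: a minimal Java sketch of that adapter follows. The RandomAccessVectors interface and its method names are hypothetical stand-ins chosen for illustration, not Lucene's actual interfaces.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class RandomAccessIteration {

    /** Hypothetical random-access view over stored vectors (not a Lucene type). */
    public interface RandomAccessVectors {
        /** Number of vectors available. */
        int size();

        /** Vector stored at the given ordinal, 0 <= ord < size(). */
        float[] vectorValue(int ord);
    }

    /** Wraps a random-access view in a forward-only iterator by walking ordinals 0..size-1. */
    public static Iterator<float[]> iterate(RandomAccessVectors vectors) {
        return new Iterator<float[]>() {
            private int ord = 0;

            @Override
            public boolean hasNext() {
                return ord < vectors.size();
            }

            @Override
            public float[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return vectors.vectorValue(ord++);
            }
        };
    }
}
```

The adapter only works in this direction. Going the other way, from a forward-only iterator to random access, needs buffering or re-reading, which is why the choice between the two API shapes matters for algorithms that cannot hold everything in RAM.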
