Re: Questions about the new vector API

Robert Muir Tue, 16 Mar 2021 10:28:23 -0700

Maybe that is so, but we should factor in everything, such as large-scale
indexing, not requiring the whole data set to be in RAM, etc. Hey, it's Lucene!


Because HNSW has dominated the nightly benchmarks, I have been digging
through stacktraces and trying to figure out ways to make it work
efficiently, and I'm not sure what to do.
Especially merge is painful: it seems to cause a storm of page
faults/random accesses due to how it works, and I don't know yet how to
make it better.
It seems to rebuild the entire graph, spraying random accesses across a
"slow-wrapper" that binary searches each sub on every access.
I don't see any way to even amortize the pain with some kind of bulk merge
trick.
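
To make the cost concrete, here is a rough sketch of the access pattern
described above. It is an illustration under assumptions, not Lucene's actual
merge code: SlowWrapperSketch, Segment, MergedView, vectorValue and
sumSequential are all hypothetical names, and segments are assumed to be
non-empty. Random access over a merged view pays one binary search per lookup
to find the owning sub, while a sequential pass needs no lookup at all.

```java
import java.util.Arrays;

/** Hypothetical sketch; none of these names are Lucene classes. */
final class SlowWrapperSketch {

    /** One segment's vectors, addressed by a segment-local ordinal. */
    static final class Segment {
        final float[][] vectors;
        Segment(float[][] vectors) { this.vectors = vectors; }
    }

    /** Merged random-access view over non-empty segments. */
    static final class MergedView {
        final Segment[] subs;
        final int[] starts;      // starts[i] = first global ordinal of subs[i]
        int binarySearches = 0;  // counts the per-access lookups

        MergedView(Segment... subs) {
            this.subs = subs;
            this.starts = new int[subs.length];
            int next = 0;
            for (int i = 0; i < subs.length; i++) {
                starts[i] = next;
                next += subs[i].vectors.length;
            }
        }

        /** Random access: every call pays a binary search to find the sub. */
        float[] vectorValue(int globalOrd) {
            binarySearches++;
            int idx = Arrays.binarySearch(starts, globalOrd);
            if (idx < 0) {
                // Not an exact start: take the sub just before the insertion point.
                idx = -idx - 2;
            }
            return subs[idx].vectors[globalOrd - starts[idx]];
        }
    }

    /** Sequential pass: walks each sub in order, no lookup needed. */
    static float sumSequential(Segment... subs) {
        float sum = 0f;
        for (Segment sub : subs) {
            for (float[] v : sub.vectors) {
                sum += v[0];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Segment a = new Segment(new float[][] {{1f}, {2f}});
        Segment b = new Segment(new float[][] {{3f}, {4f}, {5f}});
        MergedView view = new MergedView(a, b);
        int[] graphOrder = {4, 0, 3, 1, 2}; // stand-in for a graph-driven access order
        float sum = 0f;
        for (int ord : graphOrder) {
            sum += view.vectorValue(ord)[0];
        }
        System.out.println("random sum=" + sum + " searches=" + view.binarySearches);
        System.out.println("sequential sum=" + sumSequential(a, b));
    }
}
```

Both loops read the same values, but the random-access loop does one search
per vector touched, and in a real index each of those reads can also land on a
different page.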

So if we find algorithms that scale better, I think we should give them
preference. For example, algorithms that allow per-segment/sequential
indexing and merging.

On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <msoko...@gmail.com> wrote:

> ann-benchmarks.com maintains open benchmarks of a bunch of ANN
> (approximate NN) algorithms. When we started this effort, HNSW was at
> the top of the heap in most of the benchmarks.
>
> On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <rcm...@gmail.com> wrote:
> >
> > Where are the alternative algorithms that work on sequential iterators
> > and don't need random access?
> >
> > Seems like these should be the ones we initially add to lucene, and HNSW
> > should be put aside for now? (is it a toy, or can we do it without
> > jazillions of random accesses?)
> >
> > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >>
> >> There's also some good discussion on
> >> https://issues.apache.org/jira/browse/LUCENE-9583 about random access
> >> vs iterator pattern that never got fully resolved. We said we would
> >> revisit after KNN (LUCENE-9004) landed, and now it has. The usage of
> >> random access is pretty well-established there, maybe we should
> >> abandon the iterator API since it is redundant (you can always iterate
> >> over a random access API if you know the size)?
> >>
> >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >> >
> >> > Also, Tomoko, re: LUCENE-9322, did it succeed? I guess we won't know for
> >> > sure unless someone revives
> >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something like
> >> > that
> >> >
> >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >> > >
> >> > > Consistent plural naming makes sense to me. I think it ended up
> >> > > singular because I am biased to avoid plural names unless there is a
> >> > > useful distinction to be made. But consistency should trump my
> >> > > predilections.
> >> > >
> >> > > I think the reason we have search() on VectorValues is that we have
> >> > > LeafReader.getVectorValues() (by analogy to the DocValues iterators),
> >> > > but no way to access the VectorReader. Do you think we should also
> >> > > have LeafReader.getVectorReader()? Today it's only on CodecReader.
> >> > >
> >> > > Re: SearchStrategy.NONE; the idea is we support efficient access to
> >> > > floating point values. Using BinaryDocValues for this will always
> >> > > require an additional decoding step. I can see that the naming is
> >> > > confusing there. The intent is that you index the vector values, but
> >> > > no additional indexing data structure. Also: the reason HNSW is
> >> > > mentioned in these SearchStrategy enums is to make room for other
> >> > > vector indexing approaches, like LSH. There was a lot of discussion
> >> > > that we wanted an API that allowed for experimenting with other
> >> > > techniques for indexing and searching vector values.
> >> > >
> >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
> >> > > but I think the situation is more akin to Points, where we have the
> >> > > options on IndexableField. The metadata we store there (dimension and
> >> > > score function) don't really result in different formats, i.e., code
> >> > > paths for indexing and storage; they are more like parameters to the
> >> > > format, in my mind. Perhaps the situation will look different when we
> >> > > get our second vector indexing strategy (like LSH).
> >> > >
> >> > >
> >> > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida
> >> > > <tomoko.uchida.1...@gmail.com> wrote:
> >> > > >
> >> > > > > Should we rename VectorFormat to VectorsFormat? This would be
> >> > > > > more consistent with other file formats that use the plural, like
> >> > > > > PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
> >> > > >
> >> > > > +1 for using plural form for consistency - if we reconsider the
> >> > > > names, how about VectorValuesFormat so that it follows the naming
> >> > > > convention for XXXValues?
> >> > > >
> >> > > > DocValuesFormat / DocValues
> >> > > > PointValuesFormat / PointValues
> >> > > > VectorValuesFormat / VectorValues (currently, VectorFormat / VectorValues)
> >> > > >
> >> > > > > Should SearchStrategy constants avoid explicit references to
> >> > > > > HNSW?
> >> > > >
> >> > > > Also +1 for decoupling HNSW-specific implementations from general
> >> > > > vectors, though I am not fully sure if we can strictly separate the
> >> > > > similarity metrics and search algorithms for vectors.
> >> > > > LUCENE-9322 (unified vectors API) was resolved months ago; did it
> >> > > > achieve its goal? I haven't followed the issue in months because of
> >> > > > my laziness...
> >> > > >
> >> > > > Thanks,
> >> > > > Tomoko
> >> > > >
> >> > > >
> >> > > > On Tue, Mar 16, 2021 at 7:32 PM Adrien Grand <jpou...@gmail.com> wrote:
> >> > > >>
> >> > > >> Hello,
> >> > > >>
> >> > > >> I've tried to catch up on the vector API and I have the
> >> > > >> following questions. I've tried to read through discussions on JIRA
> >> > > >> first in case it had been covered, but it's possible I missed some
> >> > > >> relevant ones.
> >> > > >>
> >> > > >> Should VectorValues#search be on VectorReader instead? It felt a
> >> > > >> bit odd to me to have the search logic on the iterator.
> >> > > >>
> >> > > >> Do we need SearchStrategy.NONE? Documentation suggests that it
> >> > > >> allows storing vectors but that NN search won't be supported. This
> >> > > >> looks like a use-case for binary doc values to me? It also slightly
> >> > > >> caught me by surprise due to the inconsistency with
> >> > > >> IndexOptions.NONE, which means "do not index this field" (and
> >> > > >> likewise for DocValuesType.NONE), so I first assumed that
> >> > > >> SearchStrategy.NONE also meant "do not index this field as a
> >> > > >> vector".
> >> > > >>
> >> > > >> While postings and doc-value formats allow per-field
> >> > > >> configuration via PerFieldPostingsFormat/PerFieldDocValuesFormat,
> >> > > >> vectors use a different mechanism where VectorField#createHnswType
> >> > > >> sets attributes on the field type that the vectors writer then
> >> > > >> reads. Should we have a PerFieldVectorsFormat instead and configure
> >> > > >> these options via the vectors format?
> >> > > >>
> >> > > >> Should SearchStrategy constants avoid explicit references to
> >> > > >> HNSW? The rest of the API seems to try to be agnostic of the way
> >> > > >> that NN search is implemented. Could we make SearchStrategy only
> >> > > >> about the similarity metric that is used for vectors? This
> >> > > >> particular point seems discussed on LUCENE-9322 but I couldn't find
> >> > > >> the conclusion.
> >> > > >>
> >> > > >> Should we rename VectorFormat to VectorsFormat? This would be
> >> > > >> more consistent with other file formats that use the plural, like
> >> > > >> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
> >> > > >>
> >> > > >> --
> >> > > >> Adrien
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
