Re: Questions about the new vector API

Michael Sokolov Tue, 16 Mar 2021 13:47:46 -0700

Yeah, HNSW is problematic in a few ways: (1) merging is costly due to
the need to completely recreate the graph. (2) searching across a
segmented index sacrifices much of the performance benefit of HNSW
since the cost of searching HNSW graphs scales ~logarithmically with
the size of the graph, so splitting into multiple graphs and then
merge sorting results is pretty expensive. I guess the random access /
scan forward dynamic is another problematic area.


On Tue, Mar 16, 2021 at 1:28 PM Robert Muir <rcm...@gmail.com> wrote:
>
> Maybe that is so, but we should factor in everything: such as large scale 
> indexing, not requiring whole data set to be in RAM, etc. Hey, it's Lucene!
>
> Because HNSW has dominated the nightly benchmarks, I have been digging 
> through stacktraces and trying to figure out ways to make it work 
> efficiently, and I'm not sure what to do.
> Especially merge is painful: it seems to cause a storm of page faults/random 
> accesses due to how it works, and I don't know yet how to make it better.
> It seems to rebuild the entire graph, spraying random accesses across a 
> "slow-wrapper" that binary searches each sub on every access.
> I don't see any way to even amortize the pain with some kind of bulk merge 
> trick.
>
> So if we find algorithms that scale better, I think we should lend a 
> preference towards them. For example, algorithms that allow 
> per-segment/sequential index and merge.
>
> On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <msoko...@gmail.com> wrote:
>>
>> ann-benchmarks.com maintains open benchmarks of a bunch of ANN
>> (approximate NN) algorithms. When we started this effort, HNSW was at
>> the top of the heap in most of the benchmarks.
>>
>> On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <rcm...@gmail.com> wrote:
>> >
>> > Where are the alternative algorithms that work on sequential iterators and 
>> > don't need random access?
>> >
>> > Seems like these should be the ones we initially add to lucene, and HNSW 
>> > should be put aside for now? (is it a toy, or can we do it without 
>> > jazillions of random accesses?)
>> >
>> > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov <msoko...@gmail.com> 
>> > wrote:
>> >>
>> >> There's also some good discussion on
>> >> https://issues.apache.org/jira/browse/LUCENE-9583 about random access
>> >> vs iterator pattern that never got fully resolved. We said we would
>> >> revisit after KNN (LUCENE-9004) landed, and now it has. The usage of
>> >> random access is pretty well-established there, maybe we should
>> >> abandon the iterator API since it is redundant (you can always iterate
>> >> over a random access API if you know the size)?
>> >>
>> >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov <msoko...@gmail.com> 
>> >> wrote:
>> >> >
>> >> > Also, Tomoko re:LUCENE-9322, did it succeed? I guess we won't know for
>> >> > sure unless someone revives
>> >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something like
>> >> > that
>> >> >
>> >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <msoko...@gmail.com> 
>> >> > wrote:
>> >> > >
>> >> > > Consistent plural naming makes sense to me. I think it ended up
>> >> > > singular because I am biased to avoid plural names unless there is a
>> >> > > useful distinction to be made. But consistency should trump my
>> >> > > predilections.
>> >> > >
>> >> > > I think the reason we have search() on VectorValues is that we have
>> >> > > LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>> >> > > but no way to access the VectorReader. Do you think we should also
>> >> > > have LeafReader.getVectorReader()? Today it's only on CodecReader.
>> >> > >
>> >> > > Re: SearchStrategy.NONE; the idea is we support efficient access to
>> >> > > floating point values. Using BinaryDocValues for this will always
>> >> > > require an additional decoding step. I can see that the naming is
>> >> > > confusing there. The intent is that you index the vector values, but
>> >> > > no additional indexing data structure. Also: the reason HNSW is
>> >> > > mentioned in these SearchStrategy enums is to make room for other
>> >> > > vector indexing approaches, like LSH. There was a lot of discussion
>> >> > > that we wanted an API that allowed for experimenting with other
>> >> > > techniques for indexing and searching vector values.
>> >> > >
>> >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>> >> > > but I think the situation is more akin to Points, where we have the
>> >> > > options on IndexableField. The metadata we store there (dimension and
>> >> > > score function) don't really result in different formats, ie code
>> >> > > paths for indexing and storage; they are more like parameters to the
>> >> > > format, in my mind. Perhaps the situation will look different when we
>> >> > > get our second vector indexing strategy (like LSH).
>> >> > >
>> >> > >
>> >> > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida
>> >> > > <tomoko.uchida.1...@gmail.com> wrote:
>> >> > > >
>> >> > > > > Should we rename VectorFormat to VectorsFormat? This would be 
>> >> > > > > more consistent with other file formats that use the plural, like 
>> >> > > > > PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
>> >> > > >
>> >> > > > +1 for using plural form for consistency - if we reconsider the 
>> >> > > > names, how about VectorValuesFormat so that it follows the naming 
>> >> > > > convention for XXXValues?
>> >> > > >
>> >> > > > DocValuesFormat / DocValues
>> >> > > > PointValuesFormat / PointValues
>> >> > > > VectorValuesFormat / VectorValues (currently, VectorFormat / 
>> >> > > > VectorValues)
>> >> > > >
>> >> > > > > Should SearchStrategy constants avoid explicit references to HNSW?
>> >> > > >
>> >> > > > Also +1 for decoupling HNSW specific implementations from general 
>> >> > > > vectors, though I am not fully sure if we can strictly separate the 
>> >> > > > similarity metrics and search algorithms for vectors.
>> >> > > > LUCENE-9322 (unified vectors API) was resolved months ago, does it 
>> >> > > > achieve its goal? I haven't followed the issue in months because of 
>> >> > > > my laziness...
>> >> > > >
>> >> > > > Thanks,
>> >> > > > Tomoko
>> >> > > >
>> >> > > >
>> >> > > > 2021年3月16日(火) 19:32 Adrien Grand <jpou...@gmail.com>:
>> >> > > >>
>> >> > > >> Hello,
>> >> > > >>
>> >> > > >> I've tried to catch up on the vector API and I have the following 
>> >> > > >> questions. I've tried to read through discussions on JIRA first in 
>> >> > > >> case it had been covered, but it's possible I missed some relevant 
>> >> > > >> ones.
>> >> > > >>
>> >> > > >> Should VectorValues#search be on VectorReader instead? It felt a 
>> >> > > >> bit odd to me to have the search logic on the iterator.
>> >> > > >>
>> >> > > >> Do we need SearchStrategy.NONE? Documentation suggests that it 
>> >> > > >> allows storing vectors but that NN search won't be supported. This 
>> >> > > >> looks like a use-case for binary doc values to me? It also 
>> >> > > >> slightly caught me by surprise due to the inconsistency with 
>> >> > > >> IndexOptions.NONE, which means "do not index this field" (and 
>> >> > > >> likewise for DocValuesType.NONE), so I first assumed that 
>> >> > > >> SearchStrategy.NONE also meant "do not index this field as a 
>> >> > > >> vector".
>> >> > > >>
>> >> > > >> While postings and doc-value formats allow per-field configuration 
>> >> > > >> via PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors use a 
>> >> > > >> different mechanism where VectorField#createHnswType sets 
>> >> > > >> attributes on the field type that the vectors writer then reads. 
>> >> > > >> Should we have a PerFieldVectorsFormat instead and configure these 
>> >> > > >> options via the vectors format?
>> >> > > >>
>> >> > > >> Should SearchStrategy constants avoid explicit references to HNSW? 
>> >> > > >> The rest of the API seems to try to be agnostic of the way that NN 
>> >> > > >> search is implemented. Could we make SearchStrategy only about the 
>> >> > > >> similarity metric that is used for vectors? This particular point 
>> >> > > >> seems discussed on LUCENE-9322 but I couldn't find the conclusion.
>> >> > > >>
>> >> > > >> Should we rename VectorFormat to VectorsFormat? This would be more 
>> >> > > >> consistent with other file formats that use the plural, like 
>> >> > > >> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
>> >> > > >>
>> >> > > >> --
>> >> > > >> Adrien
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Questions about the new vector API

Reply via email to