I agree that Lucene should support vector sizes that depend on the model
one chooses. For example, Weaviate seems to do this:
https://weaviate.slack.com/archives/C017EG2SL3H/p1659981294040479
Thanks
Michael
On 07.08.22 at 22:48, Marcus Eagan wrote:
Hi Lucene Team,
In general, I have advised very strongly against our team at MongoDB
modifying the Lucene source, except in scenarios where we have a strong
need for a particular customization. Ultimately, people can do what
they like.
That being said, we have a number of customers preparing to use Lucene
for dense vector search. There are many language models that produce
embeddings with more than 1024 dimensions. I remember Michael Wechner's
email <https://www.mail-archive.com/dev@lucene.apache.org/msg314281.html>
about one instance with OpenAI:
I just tried to test the OpenAI model
"text-similarity-davinci-001" with 12288 dimensions.
It seems that customers who attempt to use these models should not be
turned away; it could be sufficient to explain the issues. The only
ones I have identified are two expected ones, very slow indexing
throughput and high CPU usage, and a less well-defined risk of
increased numerical error.
I opened an issue <https://github.com/apache/lucene/issues/1060> and a
PR <https://github.com/apache/lucene/pull/1061> for the discussion as
well. I would appreciate guidance on where we think the warning should
go; burying it in a Javadoc is a less than ideal experience, and it
would be better to warn at startup. In the PR, I increased the max
limit by a factor of twenty. We should let users use the system based
on their needs, even if it was not designed or optimized for the models
they bring, because we need the feedback and the data from the real
world.
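To make the placement question concrete, here is a hypothetical sketch
of what a raised cap with an explicit warning could look like. This is
my own illustration, not the actual diff in the PR; the constant names,
values, and the use of java.util.logging are assumptions:

    import java.util.logging.Logger;

    // Hypothetical sketch only, not the code from the PR: a raised cap
    // plus a warning surfaced where the field is defined, rather than
    // only in a Javadoc.
    final class VectorDimensionCheck {
      private static final Logger LOG =
          Logger.getLogger(VectorDimensionCheck.class.getName());

      static final int OLD_MAX_DIMENSIONS = 1024;       // cap before the change
      static final int NEW_MAX_DIMENSIONS = 20 * 1024;  // raised twenty-fold

      static void check(int dimension) {
        if (dimension <= 0 || dimension > NEW_MAX_DIMENSIONS) {
          throw new IllegalArgumentException(
              "vector dimension must be in [1, " + NEW_MAX_DIMENSIONS
                  + "], got " + dimension);
        }
        if (dimension > OLD_MAX_DIMENSIONS) {
          // The expected trade-offs from this thread: slower indexing,
          // higher CPU usage, and possibly more numerical error.
          LOG.warning("vector dimension " + dimension + " exceeds "
              + OLD_MAX_DIMENSIONS
              + "; expect slower indexing and higher CPU usage");
        }
      }
    }

Warning at field-definition time keeps the message next to the decision
that triggers it, which seems closer in spirit to a startup warning than
a note buried in Javadoc.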
Is there something I'm overlooking from a risk standpoint?
Best,
--
Marcus Eagan