I'm adding Lucene HNSW to Cassandra for vector search. One of my test
harnesses loads 50k OpenAI embeddings. It works as expected; as someone
pointed out, cost should be linear w.r.t. vector size, and that is what I
see. I would not be afraid of increasing the max size.
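The linearity is easy to see from the distance computation itself: every HNSW comparison is a dot product that touches each dimension exactly once, so per-comparison cost scales with dimension count. A rough illustration in plain Python (not Lucene's Java internals; names here are just for the sketch):

```python
# Cost of one vector comparison is proportional to the dimension:
# a dot product touches every component exactly once.
def dot(a, b):
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))

# Floating-point multiplies per comparison at each dimension:
ops_1024 = 1024   # current Lucene max
ops_1536 = 1536   # text-embedding-ada-002
ops_2048 = 2048   # limit proposed in the Elasticsearch PR

# Raising the limit from 1024 to 1536 scales per-comparison work
# by 1.5x, and to 2048 by 2x -- linear, not quadratic.
print(ops_1536 / ops_1024)  # 1.5
print(ops_2048 / ops_1024)  # 2.0
```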
In parallel, Cassandra is also adding numerical indexes using Lucene's k-d
tree. We definitely expect people to want to compose the two (topK vector
matches that also satisfy some other predicates). But I agree that classic
term-based relevance queries are probably less useful combined with vector
search.

On Tue, May 9, 2023 at 11:59 AM Jun Luo <luo.jun...@gmail.com> wrote:

> The PR mentioned an Elasticsearch PR
> <https://github.com/elastic/elasticsearch/pull/95257> that increased the
> dim to 2048 in Elasticsearch.
>
> Curious how you use Lucene's KNN search. Lucene's KNN supports one vector
> per document. Usually multiple/many vectors are needed for a document's
> content. We will have to split the document content into chunks and
> create one Lucene document per document chunk.
>
> The ChatGPT plugin directly stores the chunk text in the underlying
> vector db. If there are lots of documents, will it be a concern to store
> the full document content in Lucene? In the traditional inverted-index
> use case, is it common to store the full document content in Lucene?
>
> Another question: if you use Lucene as a vector db, do you still need the
> inverted index? Wondering what the use case would be for using the
> inverted index together with the vector index. If we don't need the
> inverted index, would it be better to use other vector dbs? For example,
> PostgreSQL also added vector support recently.
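On the one-vector-per-document question: yes, the chunk-per-Lucene-document layout described above is the natural fit. A toy sketch of that shape in plain Python, with brute-force cosine scoring standing in for Lucene's HNSW search (all names, the 3-dim toy vectors, and the word-based chunker are illustrative, not real embeddings or real Lucene API):

```python
import math

def chunk(text, max_words=4):
    """Split a document into fixed-size word chunks; each chunk becomes
    its own record (in Lucene: one Document per chunk)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, records, k=2):
    """Brute-force stand-in for an HNSW topK search over chunk vectors.
    Each record is (chunk_id, chunk_text, embedding)."""
    scored = [(cosine(query_vec, emb), chunk_id, text)
              for chunk_id, text, emb in records]
    scored.sort(reverse=True)
    return scored[:k]

# Toy 3-dim "embeddings", one per chunk (a real setup would embed each
# chunk with e.g. text-embedding-ada-002, giving 1536 dims per chunk).
records = [
    ("doc1#0", "lucene adds hnsw vector search", [1.0, 0.0, 0.0]),
    ("doc1#1", "kd tree indexes numeric fields", [0.0, 1.0, 0.0]),
    ("doc2#0", "postgresql added vector support", [0.7, 0.7, 0.0]),
]
hits = top_k([1.0, 0.1, 0.0], records, k=2)
```

Retrieval then returns the best-matching chunks (and their parent doc ids), so the full document text need not live in the vector store, only enough to reconstruct or locate the chunk.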
>
> Thanks,
> Jun
>
> On Sat, May 6, 2023 at 1:44 PM Michael Wechner <michael.wech...@wyona.com>
> wrote:
>
>> There is already a pull request for Elasticsearch which also mentions
>> the max size 1024:
>>
>> https://github.com/openai/chatgpt-retrieval-plugin/pull/83
>>
>> Am 06.05.23 um 19:00 schrieb Michael Wechner:
>> > Hi together,
>> >
>> > I recently set up the ChatGPT retrieval plugin locally
>> >
>> > https://github.com/openai/chatgpt-retrieval-plugin
>> >
>> > I think it would be nice to consider submitting a Lucene
>> > implementation for this plugin
>> >
>> > https://github.com/openai/chatgpt-retrieval-plugin#future-directions
>> >
>> > The plugin by default uses OpenAI's model "text-embedding-ada-002"
>> > with 1536 dimensions
>> >
>> > https://openai.com/blog/new-and-improved-embedding-model
>> >
>> > which means one won't be able to use it out of the box with Lucene.
>> >
>> > Similar request here:
>> >
>> > https://learn.microsoft.com/en-us/answers/questions/1192796/open-ai-text-embedding-dimensions
>> >
>> > I understand we just recently had a lengthy discussion about
>> > increasing the max dimension, and whatever one thinks of OpenAI, the
>> > fact is that it has a huge impact, and I think it would be nice for
>> > Lucene to be part of this "revolution". All we have to do is increase
>> > the limit from 1024 to 1536, or even 2048, for example.
>> >
>> > Since the performance seems to be linear with the vector dimension,
>> > several members have done performance tests successfully, and 1024
>> > seems to have been chosen as the max dimension quite arbitrarily in
>> > the first place, I think it should not be a problem to increase the
>> > max dimension by a factor of 1.5 or 2.
>> >
>> > WDYT?
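For scale, storage is linear in dimension as well: a float vector costs 4 bytes per dimension, so back-of-the-envelope figures for a 50k-vector corpus like the test harness above look as follows (assumed raw flat-storage sizes, ignoring HNSW graph overhead):

```python
BYTES_PER_FLOAT = 4
N_VECTORS = 50_000  # size of the test corpus mentioned above

def raw_vector_bytes(dims, n=N_VECTORS):
    """Raw flat-storage size of n float vectors, ignoring index overhead."""
    return dims * BYTES_PER_FLOAT * n

MIB = 1024 * 1024
print(raw_vector_bytes(1024) / MIB)  # ~195 MiB at the current limit
print(raw_vector_bytes(1536) / MIB)  # ~293 MiB for ada-002 (1.5x)
print(raw_vector_bytes(2048) / MIB)  # ~391 MiB (2x)
```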
>> >
>> > Thanks
>> >
>> > Michael
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced