I tracked down a weird bug I was seeing: our cosine similarity was
returning NaN with high-dimension vectors.  The fix is here:
https://github.com/apache/lucene/pull/12281
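
For anyone who wants to sanity-check their own similarity code, here is a minimal Python sketch of one generic way a cosine can come out as NaN under IEEE 754 arithmetic: the accumulators overflow to infinity and the final division becomes inf/inf. It is not the code from the PR and does not claim to reproduce the Lucene bug. `cosine_naive` and `cosine_scaled` are hypothetical helper names, and the huge component values are an assumption chosen only to trigger the overflow.

```python
import math


def cosine_naive(a, b):
    """Textbook cosine similarity (hypothetical helper, not Lucene's code)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = norm_a = norm_b = 0.0
    for x, y in zip(a, b):
        dot += x * y
        norm_a += x * x
        norm_b += y * y
    denom = math.sqrt(norm_a) * math.sqrt(norm_b)
    if denom == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    # If dot and denom have both overflowed to inf, this yields NaN.
    return dot / denom


def cosine_scaled(a, b):
    """Same quantity, but each vector is rescaled by its largest component
    first so the sums of squares stay in range (cosine is scale-invariant)."""
    scale_a = max(abs(x) for x in a)
    scale_b = max(abs(y) for y in b)
    if scale_a == 0.0 or scale_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return cosine_naive([x / scale_a for x in a], [y / scale_b for y in b])


big = [1e200] * 4                      # assumed values, chosen to overflow
print(math.isnan(cosine_naive(big, big)))
print(cosine_scaled(big, big))
```

A check like `math.isnan(...)` on the result is cheap to add to a test harness when experimenting with larger dimensions.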

On Tue, May 9, 2023 at 12:15 PM Jonathan Ellis <jbel...@gmail.com> wrote:

> I'm adding Lucene HNSW to Cassandra for vector search.  One of my test
> harnesses loads 50k openai embeddings.  Works as expected; as someone
> pointed out, it should be linear wrt vector size and that is what I see.  I
> would not be afraid of increasing the max size.
>
> In parallel, Cassandra is also adding numerical indexes using Lucene's k-d
> tree.  We definitely expect people to want to compose the two (topK vector
> matches that also satisfy some other predicates).
>
> But I agree that classic term based relevance queries are probably less
> useful combined w/ vector search.
>
>
> On Tue, May 9, 2023 at 11:59 AM Jun Luo <luo.jun...@gmail.com> wrote:
>
>> The PR mentions an Elasticsearch PR
>> <https://github.com/elastic/elasticsearch/pull/95257> that increased the
>> max dimension to 2048 in Elasticsearch.
>>
>> Curious how you use Lucene's KNN search. Lucene's KNN supports one vector
>> per document, but a document's content usually needs several vectors. We
>> will have to split the document content into chunks and create one Lucene
>> document per chunk.
>>
>> The ChatGPT plugin stores the chunk text directly in the underlying vector db.
>> If there are lots of documents, will it be a concern to store the full
>> document content in Lucene? In the traditional inverted index use case, is
>> it common to store the full document content in Lucene?
>>
>> Another question: if you use Lucene as a vector db, do you still need the
>> inverted index? I wonder what the use case would be for combining an
>> inverted index with a vector index. If we don't need the inverted index, would it be
>> better to use other vector dbs? For example, PostgreSQL also added vector
>> support recently.
>>
>> Thanks,
>> Jun
>>
>> On Sat, May 6, 2023 at 1:44 PM Michael Wechner <michael.wech...@wyona.com>
>> wrote:
>>
>>> There is already a pull request for Elasticsearch, which also
>>> mentions the max size of 1024
>>>
>>> https://github.com/openai/chatgpt-retrieval-plugin/pull/83
>>>
>>>
>>>
>>> On 06.05.23 at 19:00, Michael Wechner wrote:
>>> > Hi all,
>>> >
>>> > I recently set up the ChatGPT retrieval plugin locally
>>> >
>>> > https://github.com/openai/chatgpt-retrieval-plugin
>>> >
>>> > I think it would be nice to consider submitting a Lucene implementation
>>> > for this plugin
>>> >
>>> > https://github.com/openai/chatgpt-retrieval-plugin#future-directions
>>> >
>>> > By default, the plugin uses OpenAI's model "text-embedding-ada-002"
>>> > with 1536 dimensions
>>> >
>>> > https://openai.com/blog/new-and-improved-embedding-model
>>> >
>>> > which means one won't be able to use it out of the box with Lucene.
>>> >
>>> > Similar request here
>>> >
>>> > https://learn.microsoft.com/en-us/answers/questions/1192796/open-ai-text-embedding-dimensions
>>> >
>>> >
>>> > I understand we just recently had a lengthy discussion about
>>> > increasing the max dimension, and whatever one thinks of OpenAI, the
>>> > fact is that it has a huge impact, and I think it would be nice if
>>> > Lucene could be part of this "revolution". All we have to do is
>>> > increase the limit from 1024 to 1536 or even 2048, for example.
>>> >
>>> > Since the performance seems to be linear in the vector dimension,
>>> > several members have run performance tests successfully, and 1024
>>> > seems to have been chosen as the max dimension quite arbitrarily in
>>> > the first place, I think it should not be a problem to increase the
>>> > max dimension by a factor of 1.5 or 2.
>>> >
>>> > WDYT?
>>> >
>>> > Thanks
>>> >
>>> > Michael
>>> >
>>> >
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>>> >
>>>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
