Yes, you would split the document into multiple chunks. The ChatGPT retrieval plugin does this by itself; AFAIK the default chunk size is 200 tokens (https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py).
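Conceptually the chunking looks something like the sketch below. Note this is a simplified illustration: the actual plugin counts tokens with tiktoken and has extra logic (sentence boundaries, minimum chunk size), whereas here "tokens" are just whitespace-separated words.

```python
# Simplified fixed-size chunking, in the spirit of the plugin's
# services/chunks.py. Real token counting (tiktoken) is replaced by a
# plain whitespace split for illustration.

def chunk_text(text: str, chunk_size: int = 200) -> list[str]:
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

# 450 "tokens" -> chunks of 200 + 200 + 50
chunks = chunk_text("word " * 450, chunk_size=200)
print(len(chunks))  # 3
```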

It also creates a unique ID for each document you upload, which is saved as "document_id" (at least for Weaviate) together with the chunk text.
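In other words, every chunk carries the same document_id, so all chunks of a document can be looked up (or deleted) together later. A rough sketch of that idea (the field names mirror the plugin's metadata, but the dict layout and helper name here are illustrative):

```python
# Illustrative sketch: assign one document_id per uploaded document and
# attach it to every chunk record, plus a per-chunk chunk_id.
import uuid

def to_chunk_records(text: str, chunk_size: int = 200) -> list[dict]:
    document_id = str(uuid.uuid4())  # one ID for the whole document
    tokens = text.split()
    return [
        {"document_id": document_id,
         "chunk_id": f"{document_id}_{n}",
         "text": " ".join(tokens[i:i + chunk_size])}
        for n, i in enumerate(range(0, len(tokens), chunk_size))
    ]

records = to_chunk_records("hello world " * 250)  # 500 tokens -> 3 chunks
print(records[0]["document_id"] == records[2]["document_id"])  # True
```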

Re a Lucene implementation, you might want to store the chunk text outside of the Lucene index and keep only a chunk ID (alongside the vector) in the index itself.
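One way to do this would be an external key-value store keyed by chunk ID (SQLite below, purely as an example): the Lucene side would then index only the vector plus a stored chunk_id field, and KNN hits get resolved back to text via this store. The schema and helper names are illustrative, not taken from any existing implementation.

```python
# External chunk-text store keyed by chunk_id; the Lucene index would
# hold only the vector and the chunk_id, keeping stored text out of it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (chunk_id TEXT PRIMARY KEY, text TEXT)")

def put_chunk(chunk_id: str, text: str) -> None:
    conn.execute("INSERT INTO chunks VALUES (?, ?)", (chunk_id, text))

def get_chunk(chunk_id: str) -> str:
    row = conn.execute(
        "SELECT text FROM chunks WHERE chunk_id = ?", (chunk_id,)).fetchone()
    return row[0]

put_chunk("doc1_0", "first 200-token chunk of doc1 ...")
print(get_chunk("doc1_0"))
```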

HTH

Michael

On 09.05.23 at 04:14, Jun Luo wrote:
The PR mentions an Elasticsearch PR <https://github.com/elastic/elasticsearch/pull/95257> that increased the dimension limit to 2048 in Elasticsearch.

Curious how you use Lucene's KNN search. Lucene's KNN supports one vector per document, while usually multiple/many vectors are needed for a document's content, so we would have to split the document content into chunks and create one Lucene document per chunk.

The ChatGPT plugin stores the chunk text directly in the underlying vector db. If there are lots of documents, will it be a concern to store the full document content in Lucene? In the traditional inverted index use case, is it common to store the full document content in Lucene?

Another question: if you use Lucene as a vector db, do you still need the inverted index? I am wondering what the use case would be for combining the inverted index with the vector index. If we don't need the inverted index, would it be better to use other vector dbs? For example, PostgreSQL also added vector support recently.

Thanks,
Jun

On Sat, May 6, 2023 at 1:44 PM Michael Wechner <michael.wech...@wyona.com> wrote:

    there is already a pull request for Elasticsearch which also
    mentions the max size 1024

    https://github.com/openai/chatgpt-retrieval-plugin/pull/83



    On 06.05.23 at 19:00, Michael Wechner wrote:
    > Hi Together
    >
    > I recently setup ChatGPT retrieval plugin locally
    >
    > https://github.com/openai/chatgpt-retrieval-plugin
    >
    > I think it would be nice to consider submitting a Lucene
    > implementation for this plugin
    >
    > https://github.com/openai/chatgpt-retrieval-plugin#future-directions
    >
    > The plugin uses OpenAI's model "text-embedding-ada-002" with 1536
    > dimensions by default
    >
    > https://openai.com/blog/new-and-improved-embedding-model
    >
    > which means one won't be able to use it out-of-the-box with Lucene.
    >
    > Similar request here
    >
    > https://learn.microsoft.com/en-us/answers/questions/1192796/open-ai-text-embedding-dimensions
    >
    > I understand we just recently had a lengthy discussion about
    > increasing the max dimension, and whatever one thinks of OpenAI,
    > the fact is that it has a huge impact, and I think it would be nice
    > if Lucene could be part of this "revolution". All we have to do is
    > increase the limit from 1024 to 1536 or even 2048, for example.
    >
    > Since the performance seems to be linear with the vector dimension,
    > several members have done performance tests successfully, and 1024
    > seems to have been chosen as max dimension quite arbitrarily in the
    > first place, I think it should not be a problem to increase the max
    > dimension by a factor of 1.5 or 2.
    >
    > WDYT?
    >
    > Thanks
    >
    > Michael
    >
    >
    >
    >
    > ---------------------------------------------------------------------
    > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
    > For additional commands, e-mail: dev-h...@lucene.apache.org
    >


