Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-09 Thread Michael Wechner
I assumed that you would wrap Lucene into a minimal REST service, or use Solr or Elasticsearch. On 09.05.23 at 19:07, jim ferenczi wrote: Lucene is a library. I don’t see how it would be exposed in this plugin, which is about services. On Tue, 9 May 2023 at 18:00, Jun Luo wrote: The pr

Re: TermInSetQuery: seekExact vs. seekCeil

2023-05-09 Thread Michael McCandless
Besides not being able to use the bloom filter, seekCeil is also just more costly than seekExact since it is essentially both .seekExact and .next in a single operation. Are either of the two approaches using the intersect method of TermsEnum? It might be faster if the number of terms is over
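The cost difference described above can be pictured with a pure-Java sketch over a sorted set (this is not the actual TermsEnum API, just an analogy): seekExact only answers membership, so a miss can be rejected cheaply (e.g. by a bloom filter), while seekCeil must also position at the next term on a miss, which is roughly a seekExact plus a next() in one operation.

```java
import java.util.List;
import java.util.TreeSet;

public class SeekSketch {
    static final TreeSet<String> TERMS =
        new TreeSet<>(List.of("apple", "banana", "cherry"));

    // seekExact analogue: pure membership test; a miss needs no further work,
    // so a bloom filter can short-circuit it.
    static boolean seekExact(String term) {
        return TERMS.contains(term);
    }

    // seekCeil analogue: on a miss it must still locate the smallest
    // term >= target, i.e. extra positioning work beyond the lookup.
    static String seekCeil(String term) {
        return TERMS.ceiling(term);
    }

    public static void main(String[] args) {
        System.out.println(seekExact("avocado")); // false: done immediately
        System.out.println(seekCeil("avocado"));  // banana: had to position anyway
    }
}
```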

Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-09 Thread Michael Wechner
Yes, you would split the document into multiple chunks; the ChatGPT retrieval plugin does this by itself, and AFAIK the default chunk size is 200 tokens (https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py). It also creates a unique ID for each document
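A minimal sketch of that chunking step in Java (the plugin itself uses a real tokenizer in Python; whitespace splitting and the `Chunker` class here are stand-ins for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

public class Chunker {
    // Split text on whitespace and group tokens into fixed-size chunks.
    // The plugin's default chunk size is 200 tokens; 2 is used below
    // only so the example output stays small.
    static List<String> chunk(String text, int chunkSize) {
        String[] tokens = text.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < tokens.length; i += chunkSize) {
            int end = Math.min(i + chunkSize, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("one two three four five", 2));
        // Like the plugin, assign each document a unique ID:
        String docId = UUID.randomUUID().toString();
        System.out.println(docId);
    }
}
```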

Re: Dimensions Limit for KNN vectors - Next Steps

2023-05-09 Thread Michael Wechner
+1 Michael Wechner On 09.05.23 at 14:08, Alessandro Benedetti wrote: *Proposed option*: make the limit configurable *Motivation*: The system administrator can enforce a limit that users need to respect, in line with whatever the admin decides is acceptable for them. The default

Re: TermInSetQuery: seekExact vs. seekCeil

2023-05-09 Thread Greg Miller
Thanks for the feedback Robert. This approach sounds like a better path to follow. I'll explore it. I agree that we should provide default behavior that is overall best for our users, and not for one specific use-case such as Amazon search :). Mike- TermInSetQuery used to use seekExact, and now

Re: Dimensions Limit for KNN vectors - Next Steps

2023-05-09 Thread Marcus Eagan
Over the past month, and lots of working with Lucene, I've moved to Robert Muir's camp. *Proposed option*: We focus our efforts on improving the testing infrastructure, stability, and performance of the feature as is prior to introducing more complexity. Someone could benefit the community to

Call for Presentations, Community Over Code 2023

2023-05-09 Thread Rich Bowen
(Note: You are receiving this because you are subscribed to the dev@ list for one or more Apache Software Foundation projects.) The Call for Presentations (CFP) for Community Over Code (formerly Apachecon) 2023 is open at https://communityovercode.org/call-for-presentations/, and will close Thu,

Re: TermInSetQuery: seekExact vs. seekCeil

2023-05-09 Thread Robert Muir
I remember the benefits from Terms.intersect being pretty huge. Rather than simple ping-pong, the whole monster gets handed off directly to the codec's term dictionary implementation. For the default terms dictionary using blocktree, this saves time seeking to terms you don't care about (because
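The contrast between ping-pong and handing the whole term set to the codec can be sketched as a single-pass merge of two sorted lists (a conceptual analogy in plain Java, not the actual Terms.intersect API): one forward walk over both sides, instead of re-descending the term dictionary once per query term.

```java
import java.util.ArrayList;
import java.util.List;

public class IntersectSketch {
    // Merge-style intersection of two sorted term lists: one forward pass,
    // analogous to Terms.intersect letting blocktree skip whole blocks,
    // versus the query "ping-ponging" a fresh seek per term.
    static List<String> intersect(List<String> dictTerms, List<String> queryTerms) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < dictTerms.size() && j < queryTerms.size()) {
            int cmp = dictTerms.get(i).compareTo(queryTerms.get(j));
            if (cmp == 0) {
                out.add(dictTerms.get(i));
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // dictionary term not queried: skip it
            } else {
                j++; // query term absent from dictionary: skip it
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(intersect(
            List.of("a", "b", "d", "e"),
            List.of("b", "c", "e"))); // [b, e]
    }
}
```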

Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-09 Thread Alessandro Benedetti
I tried my best in the previous thread to set out a plan of action for deciding what should be done with that limit. I tried to summarise the possible next steps multiple times, but the discussion steered in other directions (fierce opposition, benchmarking, etc.). I created a new thread:

Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-09 Thread jim ferenczi
Lucene is a library. I don’t see how it would be exposed in this plugin, which is about services. On Tue, 9 May 2023 at 18:00, Jun Luo wrote: > The PR mentioned an Elasticsearch PR that increased the dim to 2048 in Elasticsearch.

Re: Building the website

2023-05-09 Thread Alan Woodward
Ah never mind, it looks like GitHub is having problems today, will see if it picks things up again later... > On 9 May 2023, at 13:02, Alan Woodward wrote: > > I’m trying to update the website with details of the 9.6 release, and the > website staging build is failing with the following error

Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-09 Thread Jonathan Ellis
I'm adding Lucene HNSW to Cassandra for vector search. One of my test harnesses loads 50k openai embeddings. Works as expected; as someone pointed out, it should be linear wrt vector size and that is what I see. I would not be afraid of increasing the max size. In parallel, Cassandra is also

Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-09 Thread Jonathan Ellis
It looks like the framework is designed to support self-hosted plugins. On Tue, May 9, 2023 at 12:13 PM jim ferenczi wrote: > Lucene is a library. I don’t see how it would be exposed in this plugin, which is about services. > On Tue, 9 May 2023 at 18:00, Jun Luo wrote: >> The pr

Re: HNSW questions

2023-05-09 Thread Jonathan Ellis
I don't see anything to make sure vectors are unique in IndexingChain down to FieldWriter; is that handled somewhere else? Or is it just up to the user to make sure no documents end up with duplicate vectors? On Wed, Apr 19, 2023 at 5:07 AM Michael Sokolov wrote: > Oh identical vectors.
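If deduplication does turn out to be the user's responsibility, a user-side check before indexing could look like the following sketch (the `VectorDedup` class is hypothetical, not part of Lucene; it just wraps float[] so equality works by content):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class VectorDedup {
    // float[] uses identity-based equals/hashCode, so wrap the contents
    // in a List<Float> to get content-based set membership.
    private final Set<List<Float>> seen = new HashSet<>();

    // Returns true the first time a vector's exact contents are seen,
    // false for any later duplicate.
    boolean add(float[] vector) {
        List<Float> key = new ArrayList<>(vector.length);
        for (float v : vector) key.add(v);
        return seen.add(key);
    }

    public static void main(String[] args) {
        VectorDedup dedup = new VectorDedup();
        System.out.println(dedup.add(new float[] {1f, 2f})); // true: first time
        System.out.println(dedup.add(new float[] {1f, 2f})); // false: duplicate
    }
}
```

Note this only catches exact duplicates; near-duplicate embeddings would need a similarity threshold instead.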

Building the website

2023-05-09 Thread Alan Woodward
I’m trying to update the website with details of the 9.6 release, and the website staging build is failing with the following error messages: remote: Error running update hook: /x1/gitbox/hooks/update.d/01-sync-repo.py remote: error: hook declined to update refs/heads/asf-staging

Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-09 Thread Bruno Roustant
I agree with Robert Muir that an increase of the 1024 limit as it currently stands in FloatVectorValues or ByteVectorValues would bind the API: we could not decrease it afterwards, even if we needed to change the vector engine. Would it be possible to move the limit definition to a HNSW specific

Re: Dimensions Limit for KNN vectors - Next Steps

2023-05-09 Thread Alessandro Benedetti
*Proposed option*: make the limit configurable *Motivation*: The system administrator can enforce a limit that users need to respect, in line with whatever the admin decides is acceptable for them. The default can stay the current one. This should open the doors for Apache Solr,
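One way the "configurable limit" proposal could look in practice is a system-property override with the current value as the default. This is a hypothetical sketch (the property name `example.knn.maxDimensions` and the `VectorLimits` class are invented for illustration, not an actual Lucene API):

```java
public class VectorLimits {
    // Keep today's hard limit as the default...
    static final int DEFAULT_MAX_DIMENSIONS = 1024;

    // ...but let an administrator raise (or lower) it at startup, e.g.
    //   java -Dexample.knn.maxDimensions=2048 ...
    static int maxDimensions() {
        return Integer.getInteger("example.knn.maxDimensions", DEFAULT_MAX_DIMENSIONS);
    }

    // Enforcement point a field writer could call before accepting a vector.
    static void checkDimension(int dim) {
        if (dim <= 0 || dim > maxDimensions()) {
            throw new IllegalArgumentException(
                "vector dimension " + dim + " outside (0, " + maxDimensions() + "]");
        }
    }

    public static void main(String[] args) {
        System.out.println(maxDimensions()); // 1024 unless overridden
        checkDimension(512);                  // accepted
    }
}
```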

Dimensions Limit for KNN vectors - Next Steps

2023-05-09 Thread Alessandro Benedetti
We had a very long-running (and heated) thread about this (*[Proposal] Remove max number of dimensions for KNN vectors*). Without repeating any of it, I recommend we move this forward in this way: *We stop any discussion* and everyone interested proposes an option with a motivation, then we

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-05-09 Thread Alessandro Benedetti
To proceed in a pragmatic way I opened this new thread: *Dimensions Limit for KNN vectors - Next Steps*. This is meant to address the main point in this discussion. For the following points: 2) [medium task] We all want more benchmarks for Lucene vector-based search, with a good variety of

Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-09 Thread Jun Luo
The PR mentioned an Elasticsearch PR that increased the dim to 2048 in Elasticsearch. Curious how you use Lucene's KNN search. Lucene's KNN supports one vector per document. Usually multiple vectors are needed for a document's content. We

[RESULT] [VOTE] Release Lucene 9.6.0 RC2

2023-05-09 Thread Alan Woodward
It's been >72h since the vote was initiated and the result is: +1: 9 (7 binding), 0: 0, -1: 0. This vote has PASSED.

Re: TermInSetQuery: seekExact vs. seekCeil

2023-05-09 Thread Robert Muir
The better solution is to use Terms.intersect. Then the postings format can do the right thing. But this query doesn't use Terms.intersect today, instead doing the ping-ponging itself. That's the problem. We must *not* tune our algorithms for Amazon's search, but instead for what is best for users

Re: TermInSetQuery: seekExact vs. seekCeil

2023-05-09 Thread Greg Miller
Thanks Patrick. I tend to agree with you for the default behavior. Bloom filter usage seems like a bit of a less-common case on the surface at least (e.g., it's expected behavior for query terms to not be present in a given segment with enough frequency to justify the additional codec layer). A