Our use case is the following:
We have a dataset of several thousand questions and answers, for which
we generate vectors with various models and services, for example:
- SentenceBERT: all-mpnet-base-v2
- Aleph Alpha: luminous-base
- Cohere: multilingual-22-12
- OpenAI: text-similarity-ada-001 or text-similarity-davinci-001
These models produce vectors with different dimensionalities.
Depending on the model and on the benchmarks/datasets, the accuracy is
higher when the vector dimension is higher.
We index these vectors with Lucene and we use Lucene to do a similarity
search.
I think this is exactly why Lucene KNN / ANN was developed, to do
similarity search, right?
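Concretely, the similarity search boils down to scoring every stored vector against the query vector and keeping the best matches. Here is a minimal brute-force sketch in Python (the toy 3-dimensional vectors are stand-ins for real embeddings; Lucene's KNN/HNSW support exists precisely so that this per-query full scan is not needed):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query, vectors, k=2):
    # Brute-force k-nearest-neighbour search: score every stored vector,
    # then keep the ids of the k highest-scoring ones.
    scored = sorted(vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
vectors = {"q1": [1.0, 0.0, 0.0], "q2": [0.9, 0.1, 0.0], "q3": [0.0, 1.0, 0.0]}
print(top_k([1.0, 0.0, 0.0], vectors, k=2))  # q1 and q2 are closest to the query
```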
Thanks
Michael W
Am 09.04.23 um 13:47 schrieb Robert Muir:
I don't care. You guys personally attacked me first. And then it turns
out you were being dishonest the entire time and hiding your true
intent, which was not search at all but instead some ChatGPT pyramid
scheme or similar.
I'm done with this thread.
On Sun, Apr 9, 2023 at 7:37 AM Alessandro Benedetti
<a.benede...@sease.io> wrote:
I don't think this tone and language are appropriate for a community of
volunteers and people of science.
I personally find it offensive to generalise the Lucene people here as "crazy people
hyped about ChatGPT".
I personally don't give a damn about ChatGPT, except for the fact that it is a very
interesting technology.
As usual I see very little motivation and a lot of "convince me".
We're discussing a limit that raises an exception.
Improving performance is absolutely important, and no one here is saying we
won't address it; it's just a separate discussion.
On Sun, 9 Apr 2023, 12:59 Robert Muir, <rcm...@gmail.com> wrote:
Also, please let's only discuss SEARCH. Lucene is a SEARCH ENGINE
LIBRARY, not a vector database or whatever trash is being proposed
here.
I think we should table this and revisit it after the ChatGPT hype has dissipated.
This hype is causing people to behave irrationally; it is why I can't
converse with basically anyone on this thread, because they are all
stating crazy things that don't make sense.
On Sun, Apr 9, 2023 at 6:25 AM Robert Muir <rcm...@gmail.com> wrote:
Yes, it's very clear that folks on this thread are ignoring reason
entirely and are completely swooned by ChatGPT hype.
And what happens when they make ChatGPT-8, which uses even more dimensions?
Backwards-compatibility decisions can't be made based on garbage hype such
as cryptocurrency or ChatGPT.
Trying to convince me we should bump it because of ChatGPT, well, I
think that has the opposite effect.
Please, let me see real technical arguments for why this limit needs to be
bumped, not including trash like ChatGPT.
On Sat, Apr 8, 2023 at 7:50 PM Marcus Eagan <marcusea...@gmail.com> wrote:
Given the massive amounts of funding going into the development and investigation
of the project, I think it would be good to at least have Lucene be a part of the
conversation. Simply because academics typically focus on vectors of <= 784
dimensions does not mean all users will. A large swathe of very important users of
the Lucene project never exceeds 500k documents, though they are shifting to other
search engines to try out very popular embeddings.
I think giving our users the opportunity to build chat bots or LLM memory
machines using Lucene is a positive development, even if some datasets won't be
able to work well. In most cases we don't limit the number of fields someone
can add, though we did just undeprecate that API to better support multi-tenancy.
But people still add so many fields that they can crash their clusters with mapping
explosions when the count is unlimited. The limit on vectors feels similar. I expect
more people to dig into Lucene, due to its openness and robustness, as they run into
problems. Today, they are forced to consider other engines that are more
permissive.
Not every important or valuable Lucene workload is in the millions of
documents. Many of them only have lots of queries, or computationally expensive
access patterns over B-trees. We can document that it is very ill-advised to
make a deployment with vectors that are too large. What others will do with it is on
them.
On Sat, Apr 8, 2023 at 2:29 PM Adrien Grand <jpou...@gmail.com> wrote:
As Dawid pointed out earlier on this thread, this is the rule for
Apache projects: a single -1 vote on a code change is a veto and
cannot be overridden. Furthermore, Robert is one of the people on this
project who worked the most on debugging subtle bugs, making Lucene
more robust and improving our test framework, so I'm listening when he
voices quality concerns.
The argument against removing/raising the limit that resonates with me
the most is that it is a one-way door. As MikeS highlighted earlier on
this thread, implementations may want to take advantage of the fact
that there is a limit at some point too. This is why I don't want to
remove the limit and would prefer a slight increase, such as 2048 as
suggested in the original issue, which would enable most of the things
that users who have been asking about raising the limit would like to
do.
I agree that the merge-time memory usage and slow indexing rate are
not great. But it's still possible to index multi-million vector
datasets with a 4GB heap without hitting OOMEs regardless of the
number of dimensions, and the feedback I'm seeing is that many users
are still interested in indexing multi-million vector datasets despite
the slow indexing rate. I wish we could do better, and vector indexing
is certainly more of an expert feature than text indexing, but it is still usable in
my opinion. I understand how giving Lucene more information about
vectors prior to indexing (e.g. clustering information as Jim pointed
out) could help make merging faster and more memory-efficient, but I
would really like to avoid making it a requirement for indexing
vectors as it also makes this feature much harder to use.
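For a rough sense of why the dimension count matters for memory, raw float32 storage alone is numVectors × dims × 4 bytes. The sketch below computes just that (it deliberately ignores HNSW graph links and codec overhead, so real usage is higher):

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_component=4):
    # Raw float32 vector storage only; ignores HNSW graph links,
    # norms, and any per-codec overhead.
    return num_vectors * dims * bytes_per_component

# One million vectors at a few common dimensionalities, in GiB:
for dims in (384, 768, 1024, 2048):
    gib = raw_vector_bytes(1_000_000, dims) / 2**30
    print(f"{dims:5d} dims -> {gib:.2f} GiB")
```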
On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
<a.benede...@sease.io> wrote:
I am very attentive to listening to opinions, but I am unconvinced here, and I am
not sure that a single person's opinion should be allowed to be detrimental to such
an important project.
The limit, as far as I know, literally just raises an exception.
Removing it won't alter the current performance in any way for users in
low-dimensional spaces.
Removing it will just enable more users to use Lucene.
If new users in certain situations will be unhappy with the performance, they
may contribute improvements.
This is how you make progress.
If it's a reputation thing, trust me that not allowing users to play with
high-dimensional spaces will damage it just as much.
To me it's really a no-brainer.
Removing the limit and enabling people to use high-dimensional vectors will take
minutes.
Improving the HNSW implementation can take months.
Pick one to begin with...
And there's no one paying me here, no company interest whatsoever; actually I
pay people to contribute. I am just convinced it's a good idea.
On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
I disagree with your categorization. I put in plenty of work and
experienced plenty of pain myself, writing tests and fighting these
issues, after I saw that, two releases in a row, vector indexing fell
over and hit integer overflows etc. on small datasets:
https://github.com/apache/lucene/pull/11905
Attacking me isn't helping the situation.
PS: when I said "the one guy who wrote the code", I didn't mean it in
any kind of demeaning fashion, really. I meant to describe the current
state of usability with respect to indexing a few million docs with
high dimensions. You can scroll up the thread and see that at least
one other committer on the project experienced similar pain as me.
Then, think about users who aren't committers trying to use the
functionality!
On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com> wrote:
What you said about increasing dimensions requiring a bigger RAM buffer on
merge is wrong. That's the point I was trying to make. Your concerns about
merge costs are not wrong, but your conclusion that we need to limit dimensions
is not justified.
You complain that HNSW sucks and doesn't scale, but when I show that it scales
linearly with dimension, you just ignore that and complain about something
entirely different.
You demand that people run all kinds of tests to prove you wrong, but when they
do, you don't listen, and you won't put in the work yourself, or you complain that
it's too hard.
Then you complain about people not meeting you halfway. Wow.
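For what it's worth, the linear-with-dimension behaviour is what the distance computation itself predicts: a single dot product touches each vector component exactly once, i.e. O(d) work per comparison. A trivial sketch that counts the multiply-add operations:

```python
def dot_with_count(a, b):
    # Dot product that also reports how many multiply-add operations it used.
    ops = 0
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
        ops += 1
    return total, ops

# Doubling the dimension doubles the per-comparison work, nothing more:
_, ops_768 = dot_with_count([0.0] * 768, [0.0] * 768)
_, ops_1536 = dot_with_count([0.0] * 1536, [0.0] * 1536)
print(ops_768, ops_1536)  # 768 1536
```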
On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
<michael.wech...@wyona.com> wrote:
What exactly do you consider reasonable?
Let's begin a real discussion by being HONEST about the current
status. Please put political correctness and your own company's wishes
aside; we know it's not in a good state.
The current status is that the one guy who wrote the code can set a
multi-gigabyte RAM buffer and index a small dataset with 1024
dimensions in HOURS (I didn't ask what hardware).
My concern is everyone else except the one guy; I want it to be
usable. Increasing dimensions just means an even bigger multi-gigabyte
RAM buffer and a bigger heap to avoid OOM on merge.
It is also a permanent backwards-compatibility decision: we have to
support it once we do this, and we can't just say "oops" and flip it
back.
It is unclear to me if the multi-gigabyte RAM buffer is really to
avoid merges because they are so slow and it would be DAYS otherwise,
or if it's to avoid merges so it doesn't hit OOM.
Also, from personal experience, it takes trial and error (meaning
experiencing OOM on merge!!!) before you get those heap values correct
for your dataset. This usually means starting over, which is
frustrating and wastes more time.
Jim mentioned some ideas about the memory usage in IndexWriter; that seems
to me like a good idea. Maybe the multi-gigabyte RAM buffer can be
avoided in this way and performance improved by writing bigger
segments with Lucene's defaults. But this doesn't mean we can simply
ignore the horrors of what happens on merge. Merging needs to scale so
that indexing really scales.
At least it shouldn't spike RAM on trivial data amounts and cause OOM,
and it definitely shouldn't burn hours and hours of CPU in O(n^2)
fashion when indexing.
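The O(n^2) worry can be made concrete with a deliberately naive model (an assumption for illustration, not Lucene's actual HNSW merge code): if a merge rebuilds the graph by re-inserting all n vectors, and inserting vector i compares against all i previously inserted vectors, the total number of distance computations is n(n-1)/2, so doubling the data roughly quadruples the work:

```python
def naive_rebuild_comparisons(n):
    # Worst-case rebuild model: inserting vector i compares against all i
    # previously inserted vectors, so the total is 0 + 1 + ... + (n-1) = n*(n-1)/2.
    return sum(range(n))

# Doubling the dataset roughly quadruples the work under this model:
print(naive_rebuild_comparisons(1_000))  # 499500
print(naive_rebuild_comparisons(2_000))  # 1999000
```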
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
--
Adrien
--
Marcus Eagan