Thanks Kent - I think I tried something similar to what you did: I took a
set of 256d vectors I had and concatenated them to make bigger ones, then
shifted the dimensions to make more of them. Here are a few single-threaded
indexing test runs; all tests were run with M=16.


8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter RAM buffer size=1994MB)
8M 1024d float vectors indexed in 1h48m (16G heap, IndexWriter RAM buffer size=1994MB)
4M 2048d float vectors indexed in 1h44m (4G heap, IndexWriter RAM buffer size=1994MB)

Increasing the vector dimension makes indexing take longer (scaling
*linearly*) but doesn't lead to RAM issues. I think we could get to OOM
while merging with a small heap and a large number of vectors, or by
increasing M, but none of that has anything to do with vector dimension.
Also, if merge RAM usage is a problem, I think we could address it by adding
accounting to the merge process and simply not merging graphs when they
exceed the buffer size (as we already do when flushing).
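
For anyone who wants to reproduce this, the harness boils down to roughly
the following (a minimal sketch assuming the Lucene 9.5 APIs -
KnnFloatVectorField, Lucene95HnswVectorsFormat - with random floats standing
in for the real vector source; not my exact code):

    import java.nio.file.Paths;
    import java.util.Random;
    import org.apache.lucene.codecs.KnnVectorsFormat;
    import org.apache.lucene.codecs.lucene95.Lucene95Codec;
    import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class KnnIndexingTest {
      public static void main(String[] args) throws Exception {
        int dim = 1024;           // 2048 needs the max-dimension limit patched
        int numDocs = 8_000_000;
        IndexWriterConfig iwc = new IndexWriterConfig()
            .setRAMBufferSizeMB(1994)  // flush by RAM usage, as in the runs above
            .setCodec(new Lucene95Codec() {
              @Override
              public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                return new Lucene95HnswVectorsFormat(16, 100); // M=16
              }
            });
        Random random = new Random(42);
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/knn-test"));
             IndexWriter writer = new IndexWriter(dir, iwc)) {
          for (int i = 0; i < numDocs; i++) {
            // the real test concatenates/shifts 256d source vectors; random
            // floats are just a stand-in here
            float[] vector = new float[dim];
            for (int j = 0; j < dim; j++) {
              vector[j] = random.nextFloat();
            }
            Document doc = new Document();
            doc.add(new KnnFloatVectorField("vec", vector,
                VectorSimilarityFunction.EUCLIDEAN));
            writer.addDocument(doc);
          }
        }
      }
    }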

Robert, since yours is the only on-the-record veto here, does this change
your thinking at all? If not, could you share some test results that didn't
go the way you expected? Maybe we can find a mitigation if we focus on a
specific issue.

On Fri, Apr 7, 2023 at 5:18 AM Kent Fitch <kent.fi...@gmail.com> wrote:
>
> Hi,
> I have been testing Lucene with a custom vector similarity and loaded 192M
> vectors of dimension 512, stored as bytes. (Yes, segment merges use a lot of
> Java memory...)
>
> As this was a performance test, the 192M vectors were derived by dithering
> 47k original vectors in such a way as to allow realistic ANN evaluation of
> HNSW. The original 47k vectors were generated by ada-002 on source newspaper
> article text. After dithering, I used PQ to reduce their dimensionality from
> 1536 floats to 512 bytes - 3 source dimensions to a 1-byte code, with 512
> codebooks, each learned to minimize total encoding error using Lloyd's
> algorithm (hence the need for the custom similarity). BTW, HNSW retrieval was
> accurate and fast enough for the use case I was investigating, as long as a
> machine with 128GB of memory was available, since the graph needs to be
> cached in memory for reasonable query rates.
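>
> (For anyone following along, a rough sketch of what that encoding step looks
> like, assuming one k-means codebook of 256 centroids per group of 3 source
> dimensions; the names and shapes are illustrative, not the code actually
> used:)
>
>     /**
>      * Product-quantize a 1536-dim float vector into 512 bytes: each group
>      * of 3 dimensions is replaced by the index of its nearest centroid in
>      * that group's codebook (256 centroids of 3 floats, trained with
>      * Lloyd's/k-means).
>      */
>     static byte[] pqEncode(float[] vector, float[][][] codebooks) {
>       int groups = codebooks.length;        // 512
>       int subDim = vector.length / groups;  // 3
>       byte[] codes = new byte[groups];
>       for (int g = 0; g < groups; g++) {
>         int best = 0;
>         float bestDist = Float.MAX_VALUE;
>         for (int c = 0; c < codebooks[g].length; c++) {  // 256 candidates
>           float dist = 0;
>           for (int d = 0; d < subDim; d++) {
>             float diff = vector[g * subDim + d] - codebooks[g][c][d];
>             dist += diff * diff;
>           }
>           if (dist < bestDist) {
>             bestDist = dist;
>             best = c;
>           }
>         }
>         codes[g] = (byte) best;
>       }
>       return codes;
>     }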
>
> Anyway, if you want them, you are welcome to those 47k vectors of 1536
> floats, which can be readily dithered to generate very large and realistic
> test vector sets.
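>
> (A minimal sketch of the kind of dithering meant here - adding small Gaussian
> noise to a base vector; the noise scale is an illustrative knob you would
> tune so the derived vectors stay realistic without collapsing into
> duplicates:)
>
>     import java.util.Random;
>
>     /** Derive a synthetic test vector from a real base vector by adding
>      *  small Gaussian noise; repeated draws over 47k bases can produce
>      *  hundreds of millions of realistic-looking vectors. */
>     static float[] dither(float[] base, float noiseScale, Random random) {
>       float[] out = new float[base.length];
>       for (int i = 0; i < base.length; i++) {
>         out[i] = base[i] + (float) (random.nextGaussian() * noiseScale);
>       }
>       return out;
>     }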
>
> Best regards,
>
> Kent Fitch
>
>
> On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <michael.wech...@wyona.com> 
> wrote:
>>
>> you might want to use SentenceBERT to generate vectors
>>
>> https://sbert.net
>>
>> For example, the model "all-mpnet-base-v2" generates vectors with
>> dimension 768.
>>
>> We have SentenceBERT running as a web service, which we could open for these 
>> tests, but because of network latency it should be faster running locally.
>>
>> HTH
>>
>> Michael
>>
>>
>> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
>>
>> I've started to look on the internet, and surely someone will come along,
>> but the challenge, I suspect, is that these vectors are expensive to
>> generate, so people have not gone all in on generating such large vectors
>> for large datasets. They certainly have not made them easy to find. Here is
>> the most promising one, but it is probably too small:
>> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
>>
>>  I'm still in and out of the office at the moment, but when I return, I can 
>> ask my employer if they will sponsor a 10 million document collection so 
>> that you can test with that. Or, maybe someone from work will see and ask 
>> them on my behalf.
>>
>> Alternatively, next week, I may get some time to set up a server with an 
>> open source LLM to generate the vectors. It still won't be free, but it 
>> would be 99% cheaper than paying the LLM companies if we can be slow.
>>
>>
>>
>> On Thu, Apr 6, 2023 at 9:42 PM Michael Wechner <michael.wech...@wyona.com> 
>> wrote:
>>>
>>> Great, thank you!
>>>
>>> How much RAM etc. did you run this test on?
>>>
>>> Do the vectors really have to be based on real data for testing the
>>> indexing?
>>> I understand that it matters if you want to test the quality of the search
>>> results, but for testing scalability itself it should not actually matter,
>>> right?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
>>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
>>> > minutes with a single thread. I have some 256d vectors, but only about
>>> > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
>>> > vectors I can use for testing? If all else fails I can test with
>>> > noise, but that tends to lead to meaningless results.
>>> >
>>> > On Thu, Apr 6, 2023 at 3:52 PM Michael Wechner
>>> > <michael.wech...@wyona.com> wrote:
>>> >>
>>> >>
>>> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
>>> >>> Well, I'm asking people to actually try to test using such high
>>> >>> dimensions. Based on my own experience, I consider it unusable. It
>>> >>> seems other folks may have run into trouble too. If the project
>>> >>> committers can't even really use vectors with such high dimension
>>> >>> counts, then it's not in an OK state for users, and we shouldn't bump
>>> >>> the limit.
>>> >>>
>>> >>> I'm happy to discuss/compromise etc., but simply bumping the limit
>>> >>> without addressing the underlying usability/scalability is a real
>>> >>> no-go,
>>> >> I agree that this needs to be addressed
>>> >>
>>> >>
>>> >>
>>> >>>    it is not really solving anything, nor is it giving users any
>>> >>> freedom or allowing them to do something they couldn't do before.
>>> >>> Because if it still doesn't work, it still doesn't work.
>>> >> I disagree, because it *does work* with "smaller" document sets.
>>> >>
>>> >> Currently we have to compile Lucene ourselves to avoid the exception
>>> >> when using a model with a vector dimension greater than 1024, which is
>>> >> of course possible, but not really convenient.
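>>> >>
>>> >> (To make the failure mode concrete - a tiny sketch of what hitting the
>>> >> limit looks like with a stock Lucene 9.x build; the field class is
>>> >> KnnVectorField in older 9.x releases and the exact exception message
>>> >> varies between versions:)
>>> >>
>>> >>     import org.apache.lucene.document.KnnFloatVectorField;
>>> >>     import org.apache.lucene.index.VectorSimilarityFunction;
>>> >>
>>> >>     float[] embedding = new float[1536];  // e.g. an ada-002 embedding
>>> >>     try {
>>> >>       new KnnFloatVectorField("vec", embedding,
>>> >>           VectorSimilarityFunction.COSINE);
>>> >>     } catch (IllegalArgumentException e) {
>>> >>       // stock builds reject dimensions above the hard-coded 1024 limit,
>>> >>       // so going higher currently means patching and recompiling Lucene
>>> >>       System.err.println(e.getMessage());
>>> >>     }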
>>> >>
>>> >> As I wrote before, to resolve this discussion, I think we should test
>>> >> and address possible issues.
>>> >>
>>> >> I will try to stop discussing now :-) and instead try to understand
>>> >> better the actual issues. Would be great if others could join on this!
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael
>>> >>
>>> >>
>>> >>
>>> >>> We all need to be on the same page, grounded in reality, not fantasy,
>>> >>> where if we set a limit of 1024 or 2048, that you can actually index
>>> >>> vectors with that many dimensions and it actually works and scales.
>>> >>>
>>> >>> On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
>>> >>> <a.benede...@sease.io> wrote:
>>> >>>> As I said earlier, a max limit limits usability.
>>> >>>> It's not forcing users with small vectors to pay the performance
>>> >>>> penalty of big vectors; it's literally preventing some users from
>>> >>>> using Lucene/Solr/Elasticsearch at all.
>>> >>>> As far as I know, the max limit is only used to raise an exception;
>>> >>>> it's not used to initialise or optimise data structures (please
>>> >>>> correct me if I'm wrong).
>>> >>>>
>>> >>>> Improving the algorithm performance is a separate discussion.
>>> >>>> I don't see how the fact that indexing billions of vectors, of
>>> >>>> whatever dimension, is slow correlates with a usability parameter.
>>> >>>>
>>> >>>> What about potential users who need just a few high-dimensional vectors?
>>> >>>>
>>> >>>> As I said before, I am a big +1 for NOT just raising it blindly, but
>>> >>>> I believe we need to remove the limit or size it in a way that is not
>>> >>>> a problem for either users or internal data structure optimizations,
>>> >>>> if any.
>>> >>>>
>>> >>>>
>>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcm...@gmail.com> wrote:
>>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to index
>>> >>>>> a few million vectors with 768 or 1024 dimensions, which is allowed today.
>>> >>>>>
>>> >>>>> IMO, based on how painful it is, the limit already seems too high. I
>>> >>>>> realize that will sound controversial, but please at least try it out!
>>> >>>>>
>>> >>>>> Voting +1 without at least doing this is really the
>>> >>>>> "weak/unscientifically minded" approach.
>>> >>>>>
>>> >>>>> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
>>> >>>>> <michael.wech...@wyona.com> wrote:
>>> >>>>>> Thanks for your feedback!
>>> >>>>>>
>>> >>>>>> I agree that it should not crash.
>>> >>>>>>
>>> >>>>>> So far we have not experienced crashes ourselves, but we have not
>>> >>>>>> indexed millions of vectors.
>>> >>>>>>
>>> >>>>>> I will try to reproduce the crash; maybe this will help us move
>>> >>>>>> forward.
>>> >>>>>>
>>> >>>>>> Thanks
>>> >>>>>>
>>> >>>>>> Michael
>>> >>>>>>
>>> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
>>> >>>>>>>> Can you describe your crash in more detail?
>>> >>>>>>> I can't. That experiment was a while ago and a quick test to see if 
>>> >>>>>>> I
>>> >>>>>>> could index rather large-ish USPTO (patent office) data as vectors.
>>> >>>>>>> Couldn't do it then.
>>> >>>>>>>
>>> >>>>>>>> How much RAM?
>>> >>>>>>> My indexing jobs run with rather smallish heaps to give space for 
>>> >>>>>>> I/O
>>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the 
>>> >>>>>>> problem.
>>> >>>>>>> I recall segment merging grew slower and slower and then simply
>>> >>>>>>> crashed. Lucene should work with low heap requirements, even if it
>>> >>>>>>> slows down. Throwing RAM at the indexing/segment-merging problem
>>> >>>>>>> is... I don't know - not elegant?
>>> >>>>>>>
>>> >>>>>>> Anyway. My main point was to remind folks about how Apache works -
>>> >>>>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
>>> >>>>>>> remains unconvinced, he or she can block the change. (I didn't 
>>> >>>>>>> invent
>>> >>>>>>> those rules).
>>> >>>>>>>
>>> >>>>>>> D.
>>> >>>>>>>
>>
>>
>> --
>> Marcus Eagan
>>
>>
