I've started to look on the internet, and surely someone will come
through, but the challenge, I suspect, is that these vectors are
expensive to generate, so people have not gone all in on generating
such large vectors for large datasets. They certainly have not made
them easy to find. Here is the most promising one, but it is probably
too small:
https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
I'm still in and out of the office at the moment, but when I return, I
can ask my employer whether they will sponsor a 10 million document
collection so that you can test with that. Or maybe someone from work
will see this and ask them on my behalf.
Alternatively, next week I may get some time to set up a server with an
open-source LLM to generate the vectors. It still won't be free, but it
would be 99% cheaper than paying the LLM companies if we can afford to
be slow.
On Thu, Apr 6, 2023 at 9:42 PM Michael Wechner
<michael.wech...@wyona.com> wrote:
Great, thank you!
How much RAM, etc. did you run this test on?
Do the vectors really have to be based on real data for testing the
indexing?
I understand that it matters if you want to test the quality of the
search results, but for testing the scalability itself it should not
actually matter, right?
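For a pure scalability test, synthetic vectors should be enough. Below
is a minimal sketch, assuming Gaussian components normalized to unit
length so the vectors at least lie on the unit sphere like typical
embeddings; the dimension, document count and class name are
placeholders, not anything from this thread:

import java.util.Random;

// Sketch: synthetic unit-length vectors for a scalability-only test.
// Gaussian components, then normalization; the values are not meant to
// resemble any real embedding model's output distribution.
public class SyntheticVectors {

  static float[] randomUnitVector(Random random, int dim) {
    float[] v = new float[dim];
    double norm = 0;
    for (int i = 0; i < dim; i++) {
      v[i] = (float) random.nextGaussian();
      norm += (double) v[i] * v[i];
    }
    float scale = (float) (1.0 / Math.sqrt(norm));
    for (int i = 0; i < dim; i++) {
      v[i] *= scale;
    }
    return v;
  }

  public static void main(String[] args) {
    Random random = new Random(42);
    int dim = 1024;           // illustrative: today's maximum
    int numDocs = 8_000_000;  // illustrative: matches the test size below
    for (int i = 0; i < numDocs; i++) {
      float[] vector = randomUnitVector(random, dim);
      // hand the vector off to the indexer here (see the indexing
      // sketch further down in the thread)
    }
  }
}

As pointed out elsewhere in this thread, random noise says nothing
about result quality, but it does exercise indexing throughput, heap
usage and segment merging.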
Thanks
Michael
On 07.04.23 at 01:19, Michael Sokolov wrote:
> I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> minutes with a single thread. I have some 256K vectors, but only
> about 2M of them. Can anybody point me to a large set (say 8M+) of
> 1024+ dim vectors I can use for testing? If all else fails I can test
> with noise, but that tends to lead to meaningless results.
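For anyone wanting to reproduce a run like the one described above, a
minimal indexing loop might look roughly like the sketch below (Lucene
9.x-era API as far as I can tell; the field names, index path,
similarity function and document count are assumptions, and in newer
releases KnnVectorField may appear as KnnFloatVectorField):

import java.nio.file.Paths;
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch: index N random float vectors as a KNN vector field and time it.
public class IndexVectorsTest {
  public static void main(String[] args) throws Exception {
    int dim = 1024;          // at the current limit
    int numDocs = 1_000_000; // scale up (e.g. to 8M) for a real test
    Random random = new Random(0);
    long start = System.nanoTime();
    try (Directory dir = FSDirectory.open(Paths.get("vector-index"));
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      for (int i = 0; i < numDocs; i++) {
        float[] vector = new float[dim];
        for (int j = 0; j < dim; j++) {
          vector[j] = random.nextFloat(); // replace with real or synthetic embeddings
        }
        Document doc = new Document();
        doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
        doc.add(new KnnVectorField("vector", vector, VectorSimilarityFunction.COSINE));
        writer.addDocument(doc);
      }
      writer.forceMerge(1); // exercise segment merging, where the pain tends to show up
    }
    System.out.println("Indexed " + numDocs + " docs in "
        + (System.nanoTime() - start) / 1_000_000_000L + "s");
  }
}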
>
> On Thu, Apr 6, 2023 at 3:52 PM Michael Wechner
> <michael.wech...@wyona.com> wrote:
>>
>>
>> On 06.04.23 at 17:47, Robert Muir wrote:
>>> Well, I'm asking people to actually try testing with such high
>>> dimensions. Based on my own experience, I consider it unusable. It
>>> seems other folks may have run into trouble too. If the project
>>> committers can't even really use vectors with such high dimension
>>> counts, then it's not in an OK state for users, and we shouldn't
>>> bump the limit.
>>>
>>> I'm happy to discuss/compromise, etc., but simply bumping the limit
>>> without addressing the underlying usability/scalability is a real
>>> no-go,
>> I agree that this needs to be addressed.
>>
>>
>>
>>> it is not really solving anything, nor is it giving users any
>>> freedom or allowing them to do something they couldn't do before.
>>> Because if it still doesn't work, it still doesn't work.
>> I disagree, because it *does work* with "smaller" document sets.
>>
>> Currently we have to compile Lucene ourselves to avoid the exception
>> when using a model with a vector dimension greater than 1024, which
>> is of course possible, but not really convenient.
>>
>> As I wrote before, to resolve this discussion, I think we should
>> test and address possible issues.
>>
>> I will try to stop discussing now :-) and instead try to better
>> understand the actual issues. It would be great if others could join
>> in on this!
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>>> We all need to be on the same page, grounded in reality, not
>>> fantasy: if we set a limit of 1024 or 2048, you should actually be
>>> able to index vectors with that many dimensions, and it should
>>> actually work and scale.
>>>
>>> On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
>>> <a.benede...@sease.io> wrote:
>>>> As I said earlier, a max limit limits usability.
>>>> It's not forcing users with small vectors to pay the performance
>>>> penalty of big vectors; it's literally preventing some users from
>>>> using Lucene/Solr/Elasticsearch at all.
>>>> As far as I know, the max limit is used to raise an exception;
>>>> it's not used to initialise or optimise data structures (please
>>>> correct me if I'm wrong).
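For what it's worth, that matches my reading of the Lucene 9.x
behaviour current as of this thread: the limit appears to be plain
argument validation, raised as soon as a vector field with too many
dimensions is created, before any index structures are sized. A rough
sketch (the field name is illustrative, and the exact point where the
check fires may differ between releases):

import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;

// Sketch: what exceeding the 1024-dimension cap looks like from the API side.
public class DimensionLimitDemo {
  public static void main(String[] args) {
    float[] vector = new float[2048]; // larger than the current cap
    try {
      new KnnVectorField("embedding", vector, VectorSimilarityFunction.COSINE);
    } catch (IllegalArgumentException e) {
      // Expected today: the field is rejected up front; no data
      // structures were allocated or sized based on the limit.
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}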
>>>>
>>>> Improving the algorithm's performance is a separate discussion.
>>>> I don't see how the fact that indexing billions of vectors of
>>>> whatever dimension is slow relates to a usability parameter.
>>>>
>>>> What about potential users who need a few high-dimensional
>>>> vectors?
>>>>
>>>> As I said before, I am a big +1 for NOT just raising it blindly,
>>>> but I believe we need to either remove the limit or size it in a
>>>> way that is not a problem for users or for internal data structure
>>>> optimizations, if any.
>>>>
>>>>
>>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcm...@gmail.com> wrote:
>>>>> I'd ask anyone voting +1 to raise this limit to at least try to
>>>>> index a few million vectors with 756 or 1024 dimensions, which is
>>>>> allowed today.
>>>>>
>>>>> IMO, based on how painful it is, it seems the limit is already
>>>>> too high. I realize that will sound controversial, but please at
>>>>> least try it out!
>>>>>
>>>>> Voting +1 without at least doing this is really the
>>>>> "weak/unscientifically minded" approach.
>>>>>
>>>>> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
>>>>> <michael.wech...@wyona.com> wrote:
>>>>>> Thanks for your feedback!
>>>>>>
>>>>>> I agree that it should not crash.
>>>>>>
>>>>>> So far we have not experienced crashes ourselves, but we have
>>>>>> not indexed millions of vectors.
>>>>>>
>>>>>> I will try to reproduce the crash; maybe this will help us to
>>>>>> move forward.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> On 05.04.23 at 18:30, Dawid Weiss wrote:
>>>>>>>> Can you describe your crash in more detail?
>>>>>>> I can't. That experiment was a while ago and was a quick test
>>>>>>> to see if I could index rather large-ish USPTO (patent office)
>>>>>>> data as vectors. I couldn't do it then.
>>>>>>>
>>>>>>>> How much RAM?
>>>>>>> My indexing jobs run with rather smallish heaps to leave space
>>>>>>> for I/O buffers. Think 4-8GB at most. So yes, that could have
>>>>>>> been the problem. I recall segment merging grew slower and
>>>>>>> slower and then simply crashed. Lucene should work with low
>>>>>>> heap requirements, even if it slows down. Throwing RAM at the
>>>>>>> indexing/segment merging problem is... I don't know, not
>>>>>>> elegant?
>>>>>>>
>>>>>>> Anyway, my main point was to remind folks about how Apache
>>>>>>> works: code is merged in when there are no vetoes. If Rob (or
>>>>>>> anybody else) remains unconvinced, he or she can block the
>>>>>>> change. (I didn't invent those rules.)
>>>>>>>
>>>>>>> D.
>>>>>>>
>>>>>>>
--
Marcus Eagan