You might want to use SentenceBERT to generate vectors:

https://sbert.net

For example, the model "all-mpnet-base-v2" generates vectors with dimension 768.

We have SentenceBERT running as a web service, which we could open up for these tests, but because of network latency it would probably be faster to run it locally.
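
If it helps, here is a minimal sketch of generating such vectors locally with the sentence-transformers library (the sample sentences are just placeholders):

from sentence_transformers import SentenceTransformer

# Minimal sketch: generate 768-dim vectors locally (no web service round trip).
model = SentenceTransformer("all-mpnet-base-v2")
sentences = ["Lucene is a search library.", "HNSW indexes dense vectors."]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (2, 768)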

HTH

Michael


On 07.04.23 at 10:11, Marcus Eagan wrote:
I've started to look on the internet, and surely someone will come up with something, but the challenge, I suspect, is that these vectors are expensive to generate, so people have not gone all in on generating such large vectors for large datasets. They certainly have not made them easy to find. Here is the most promising one, but it is probably too small: https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download

I'm still in and out of the office at the moment, but when I return I can ask my employer whether they will sponsor a 10-million-document collection so that you can test with that. Or maybe someone from work will see this and ask on my behalf.

Alternatively, next week I may get some time to set up a server with an open-source LLM to generate the vectors. It still won't be free, but it would be about 99% cheaper than paying the LLM companies if we can tolerate it being slow.
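
As a rough sketch of what that batch job could look like with a locally hosted open model (the model choice, corpus reader, file name, and sizes below are all placeholders, not a tested setup):

import numpy as np
from sentence_transformers import SentenceTransformer

# Sketch: batch-encode a large corpus with a locally hosted open model and
# write the float32 vectors to a memory-mapped file for later indexing tests.
model = SentenceTransformer("all-mpnet-base-v2")  # 768-dim output, placeholder model

def load_texts(start, end):
    # Placeholder corpus reader; replace with e.g. a Wikipedia dump iterator.
    return ["document number %d" % i for i in range(start, end)]

num_docs, dim, batch = 1_000_000, 768, 1024
out = np.memmap("vectors.f32", dtype=np.float32, mode="w+", shape=(num_docs, dim))
for start in range(0, num_docs, batch):
    texts = load_texts(start, min(start + batch, num_docs))
    out[start:start + len(texts)] = model.encode(texts, batch_size=256)
out.flush()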



On Thu, Apr 6, 2023 at 9:42 PM Michael Wechner <michael.wech...@wyona.com> wrote:

    Great, thank you!

    How much RAM, etc. did you run this test on?

    Do the vectors really have to be based on real data for testing the
    indexing?
    I understand that it matters if you want to test the quality of the
    search results, but for testing the scalability itself it should not
    actually matter, right?
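
    For a pure scalability test, synthetic vectors might already be enough,
    e.g. random vectors with some cluster structure rather than pure i.i.d.
    noise (just a sketch; the sizes and cluster count below are arbitrary
    placeholders):

    import numpy as np

    # Sketch: write N synthetic high-dimensional vectors with cluster
    # structure to a float32 memmap, so the data is not pure i.i.d. noise.
    rng = np.random.default_rng(42)
    num_vecs, dim, num_clusters = 8_000_000, 1536, 1_000
    centers = rng.standard_normal((num_clusters, dim), dtype=np.float32)

    out = np.memmap("synthetic.f32", dtype=np.float32, mode="w+", shape=(num_vecs, dim))
    chunk = 50_000
    for start in range(0, num_vecs, chunk):
        n = min(chunk, num_vecs - start)
        idx = rng.integers(0, num_clusters, size=n)
        noise = 0.1 * rng.standard_normal((n, dim), dtype=np.float32)
        out[start:start + n] = centers[idx] + noise
    out.flush()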

    Thanks

    Michael

    On 07.04.23 at 01:19, Michael Sokolov wrote:
    > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
    > minutes with a single thread. I have some 256K vectors, but only about
    > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
    > vectors I can use for testing? If all else fails I can test with
    > noise, but that tends to lead to meaningless results
    >
    > On Thu, Apr 6, 2023 at 3:52 PM Michael Wechner
    > <michael.wech...@wyona.com> wrote:
    >>
    >>
    >> On 06.04.23 at 17:47, Robert Muir wrote:
    >>> Well, I'm asking people to actually try to test using such high
    >>> dimensions. Based on my own experience, I consider it unusable. It
    >>> seems other folks may have run into trouble too. If the project
    >>> committers can't even really use vectors with such high dimension
    >>> counts, then it's not in an OK state for users, and we shouldn't
    >>> bump the limit.
    >>>
    >>> I'm happy to discuss/compromise etc., but simply bumping the limit
    >>> without addressing the underlying usability/scalability is a real
    >>> no-go,
    >> I agree that this needs to be addressed.
    >>
    >>
    >>
    >>>    it is not really solving anything, nor is it giving users any
    >>> freedom or allowing them to do something they couldn't do before.
    >>> Because if it still doesn't work, it still doesn't work.
    >> I disagree, because it *does work* with "smaller" document sets.
    >>
    >> Currently we have to compile Lucene ourselves to not get the exception
    >> when using a model with vector dimension greater than 1024,
    >> which is of course possible, but not really convenient.
    >>
    >> As I wrote before, to resolve this discussion, I think we should test
    >> and address possible issues.
    >>
    >> I will try to stop discussing now :-) and instead try to better
    >> understand the actual issues. It would be great if others could join
    >> in on this!
    >>
    >> Thanks
    >>
    >> Michael
    >>
    >>
    >>
    >>> We all need to be on the same page, grounded in reality, not fantasy,
    >>> where if we set a limit of 1024 or 2048, that you can actually index
    >>> vectors with that many dimensions and it actually works and scales.
    >>>
    >>> On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
    >>> <a.benede...@sease.io> wrote:
    >>>> As I said earlier, a max limit limits usability.
    >>>> It's not forcing users with small vectors to pay the performance
    >>>> penalty of big vectors; it's literally preventing some users from
    >>>> using Lucene/Solr/Elasticsearch at all.
    >>>> As far as I know, the max limit is used to raise an exception; it's
    >>>> not used to initialise or optimise data structures (please correct
    >>>> me if I'm wrong).
    >>>>
    >>>> Improving the algorithm performance is a separate discussion.
    >>>> I don't see how the fact that indexing billions of vectors, of
    >>>> whatever dimension, is slow relates to a usability parameter.
    >>>>
    >>>> What about potential users that need just a few high-dimensional
    >>>> vectors?
    >>>>
    >>>> As I said before, I am a big +1 for NOT just raising it blindly, but
    >>>> I believe we need to remove the limit or size it in a way that is not
    >>>> a problem for either users or internal data-structure optimizations,
    >>>> if any.
    >>>>
    >>>>
    >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcm...@gmail.com> wrote:
    >>>>> I'd ask anyone voting +1 to raise this limit to at least try to
    >>>>> index a few million vectors with 768 or 1024 dimensions, which is
    >>>>> allowed today.
    >>>>>
    >>>>> IMO, based on how painful it is, it seems the limit is already too
    >>>>> high. I realize that will sound controversial, but please at least
    >>>>> try it out!
    >>>>>
    >>>>> Voting +1 without at least doing this is really the
    >>>>> "weak/unscientifically minded" approach.
    >>>>>
    >>>>> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
    >>>>> <michael.wech...@wyona.com> wrote:
    >>>>>> Thanks for your feedback!
    >>>>>>
    >>>>>> I agree that it should not crash.
    >>>>>>
    >>>>>> So far we have not experienced crashes ourselves, but we did not
    >>>>>> index millions of vectors.
    >>>>>>
    >>>>>> I will try to reproduce the crash; maybe this will help us to
    >>>>>> move forward.
    >>>>>>
    >>>>>> Thanks
    >>>>>>
    >>>>>> Michael
    >>>>>>
    >>>>>> On 05.04.23 at 18:30, Dawid Weiss wrote:
    >>>>>>>> Can you describe your crash in more detail?
    >>>>>>> I can't. That experiment was a while ago and a quick test to see
    >>>>>>> if I could index rather large-ish USPTO (patent office) data as
    >>>>>>> vectors. Couldn't do it then.
    >>>>>>>
    >>>>>>>> How much RAM?
    >>>>>>> My indexing jobs run with rather smallish heaps to give space for
    >>>>>>> I/O buffers. Think 4-8GB at most. So yes, it could have been the
    >>>>>>> problem. I recall segment merging grew slower and slower and then
    >>>>>>> simply crashed. Lucene should work with low heap requirements,
    >>>>>>> even if it slows down. Throwing RAM at the indexing/segment
    >>>>>>> merging problem is... I don't know - not elegant?
    >>>>>>>
    >>>>>>> Anyway. My main point was to remind folks about how Apache works -
    >>>>>>> code is merged in when there are no vetoes. If Rob (or anybody
    >>>>>>> else) remains unconvinced, he or she can block the change. (I
    >>>>>>> didn't invent those rules).
    >>>>>>>
    >>>>>>> D.



--
Marcus Eagan
