On 18.05.23 at 12:22, Michael McCandless wrote:
I love all the energy and passion going into debating all the ways to
poke at this limit, but please let's also spend some of this passion
on actually improving the scalability of our aKNN implementation!
E.g. Robert opened an exciting "Plan B" (
https://github.com/apache/lucene/issues/12302 ) to work around
OpenJDK's crazy slowness on enabling access to vectorized SIMD CPU
instructions (the Java Vector API, JEP 426:
https://openjdk.org/jeps/426 ). This could help postings and doc
values performance too!
Agreed, but I do not think the MAX_DIMENSIONS decision should depend on
this: whatever improvements are eventually accomplished, there will
very likely always be some limit.
Thanks
Michael
Mike McCandless
http://blog.mikemccandless.com
On Thu, May 18, 2023 at 5:24 AM Alessandro Benedetti
<a.benede...@sease.io> wrote:
That's great and a good plan B, but let's try to keep this thread
focused on collecting votes for a week (let's keep discussions on the
nice PR opened by David or on the discussion thread we already have on
the mailing list :)
On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya,
<ichattopadhy...@gmail.com> wrote:
That sounds promising, Michael. Can you share
scripts/steps/code to reproduce this?
On Thu, 18 May, 2023, 1:16 pm Michael Wechner,
<michael.wech...@wyona.com> wrote:
I just implemented it and tested it with OpenAI's
text-embedding-ada-002, which uses 1536 dimensions, and
it works just fine :-)
Thanks
Michael
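A minimal sketch of the indexing/search flow being described, using the Lucene 9.x float-vector API (the class name and toy 4-dimensional vectors are mine; real ada-002 embeddings have 1536 dimensions, which still trips the hardcoded 1024 limit on a stock build, hence the workaround discussed elsewhere in this thread):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class KnnSketch {
  // Index a few vectors and run a k=2 nearest-neighbour query;
  // returns the number of hits found.
  static long indexAndSearch() throws Exception {
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      // Toy 4-dimensional vectors stand in for the 1536-dimensional
      // ada-002 embeddings, which exceed the current 1024 limit.
      float[][] vectors = {{1, 0, 0, 0}, {0, 1, 0, 0}, {0.9f, 0.1f, 0, 0}};
      for (float[] v : vectors) {
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("embedding", v,
            VectorSimilarityFunction.COSINE));
        writer.addDocument(doc);
      }
      writer.commit();
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs top = searcher.search(
            new KnnFloatVectorQuery("embedding", new float[] {1, 0, 0, 0}, 2), 2);
        return top.totalHits.value;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(indexAndSearch());
  }
}
```

Run with the Lucene core jar on the classpath; swapping in real embeddings only changes the float[] contents and their length.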
On 18.05.23 at 00:29, Michael Wechner wrote:
IIUC KnnVectorField is deprecated and one is supposed to
use KnnFloatVectorField when using float as vector
values, right?
On 17.05.23 at 16:41, Michael Sokolov wrote:
see https://markmail.org/message/kf4nzoqyhwacb7ri
On Wed, May 17, 2023 at 10:09 AM David Smiley
<dsmi...@apache.org> wrote:
> easily be circumvented by a user
This is a revelation to me and others, if true.
Michael, please then point to a test or code snippet
that shows the Lucene user community what they want
to see so they are unblocked from their explorations
of vector search.
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
<msoko...@gmail.com> wrote:
I think I've said before on this list we don't
actually enforce the limit in any way that can't
easily be circumvented by a user. The codec
already supports any size vector - it doesn't
impose any limit. The way the API is written you
can *already today* create an index with max-int
sized vectors and we are committed to supporting
that going forward by our backwards
compatibility policy as Robert points out. This
wasn't intentional, I think, but those are the facts.
Given that, I think this whole discussion is not
really necessary.
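A sketch of what such circumvention might look like, assuming (as this message suggests) that the dimension check sits in KnnFloatVectorField's field-type factory rather than in FieldType or the codec; treat the helper class and method names as illustrative, not a supported API:

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorEncoding;
import org.apache.lucene.index.VectorSimilarityFunction;

public class BigVectorField {
  // Build the FieldType directly instead of going through
  // KnnFloatVectorField's usual factory, which is where the
  // max-dimensions check is assumed to live.
  public static KnnFloatVectorField bigVector(String name, float[] vector) {
    FieldType ft = new FieldType();
    ft.setVectorAttributes(vector.length, VectorEncoding.FLOAT32,
        VectorSimilarityFunction.DOT_PRODUCT);
    ft.freeze();
    return new KnnFloatVectorField(name, vector, ft);
  }
}
```

If the check really is confined to the factory, a field built this way carries whatever dimension the vector has, and the codec indexes it as-is.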
On Tue, May 16, 2023 at 4:50 AM Alessandro
Benedetti <a.benede...@sease.io> wrote:
Hi all,
we have finalized all the options proposed
by the community and we are ready to vote
for the preferred one and then proceed with
the implementation.
*Option 1*
Keep it as it is (dimension limit hardcoded
to 1024)
*Motivation*:
We are close to improving on many fronts.
Given the criticality of Lucene in computing
infrastructure and the concerns raised by
one of the most active stewards of the
project, I think we should keep working
toward improving the feature as is and move
to up the limit after we can demonstrate
improvement unambiguously.
*Option 2*
make the limit configurable, for example
through a system property
*Motivation*:
The system administrator can enforce a limit
that their users need to respect, in line
with whatever the admin decides is
acceptable for them.
The default can stay the current one.
This should open the doors for Apache Solr,
Elasticsearch, OpenSearch, and any sort of
plugin development.
*Option 3*
Move the max dimension limit to a lower
level, into the HNSW-specific
implementation. Once there, this limit
would not bind any other potential vector
engine alternative/evolution.
*Motivation:* There seem to be contradictory
performance interpretations about the
current HNSW implementation. Some consider
its performance ok, some not, and it depends
on the target data set and use case.
Increasing the max dimension limit where it
currently lives (in the top-level
FloatVectorValues) would not allow
potential alternatives (e.g. for other
use-cases) to be based on a lower limit.
*Option 4*
Make it configurable and move it to an
appropriate place.
In particular, a
simple Integer.getInteger("lucene.hnsw.maxDimensions",
1024) should be enough.
*Motivation*:
Both are good and not mutually exclusive and
could happen in any order.
Someone suggested perfecting what the
_default_ limit should be, but I've not seen
an argument _against_ configurability.
Especially in this way -- a toggle that
doesn't bind Lucene's APIs in any way.
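For concreteness, the Integer.getInteger idiom from Option 4 behaves as sketched below (the class name is mine; only the property name and the 1024 default come from the proposal):

```java
public class MaxDimensions {
  static final int DEFAULT_MAX_DIMENSIONS = 1024;

  // Integer.getInteger reads a system property and returns the
  // supplied default when the property is unset or not a valid integer,
  // so misconfiguration falls back to today's behaviour.
  static int maxDimensions() {
    return Integer.getInteger("lucene.hnsw.maxDimensions",
        DEFAULT_MAX_DIMENSIONS);
  }

  public static void main(String[] args) {
    // Default, e.g. when started without -Dlucene.hnsw.maxDimensions=...
    System.out.println(maxDimensions());
    // An operator raising the limit, e.g. -Dlucene.hnsw.maxDimensions=2048
    System.setProperty("lucene.hnsw.maxDimensions", "2048");
    System.out.println(maxDimensions());
  }
}
```

The toggle is invisible to anyone who never sets the property, which is why it doesn't bind Lucene's APIs.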
I'll keep this [VOTE] open for a week and
then proceed to the implementation.
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/
e-mail: a.benede...@sease.io
*Sease* - Information Retrieval Applied
Consulting | Training | Open Source
Website: Sease.io <http://sease.io/>
LinkedIn
<https://linkedin.com/company/sease-ltd> |
Twitter <https://twitter.com/seaseltd> |
Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
Github <https://github.com/seaseltd>