Custom SliceExecutor and slices computation in IndexSearcher

2023-05-18 Thread SorabhApache
Hi All,

For concurrent segment search, lucene uses the *slices* method to compute
the number of work units which can be processed concurrently.

a) It calculates *slices* in the constructor of *IndexSearcher* with
default thresholds for document count and segment count.
b) It provides an implementation of *SliceExecutor* (i.e.
QueueSizeBasedExecutor) based on the executor type, which applies
backpressure in concurrent execution based on a limiting factor of 1.5
times the passed-in threadpool's maxPoolSize.
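As a rough illustration of (a), the grouping can be sketched like this (a simplified stand-in that uses plain per-segment doc counts instead of LeafReaderContext; the thresholds and exact algorithm in real Lucene may differ):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SliceSketch {
    // Hypothetical simplification: each leaf (segment) is just its doc count.
    static List<List<Integer>> slices(List<Integer> leafDocCounts,
                                      int maxDocsPerSlice, int maxSegmentsPerSlice) {
        List<Integer> sorted = new ArrayList<>(leafDocCounts);
        sorted.sort(Collections.reverseOrder()); // biggest segments first
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long docSum = 0;
        for (int docs : sorted) {
            // Start a new slice once either threshold would be exceeded.
            if (!current.isEmpty()
                    && (docSum + docs > maxDocsPerSlice
                        || current.size() >= maxSegmentsPerSlice)) {
                groups.add(current);
                current = new ArrayList<>();
                docSum = 0;
            }
            current.add(docs);
            docSum += docs;
        }
        if (!current.isEmpty()) groups.add(current);
        return groups;
    }

    public static void main(String[] args) {
        // Six segments against a 250k-doc / 5-segment budget -> 2 slices.
        System.out.println(slices(
                List.of(200_000, 90_000, 80_000, 50_000, 10_000, 5_000),
                250_000, 5).size()); // prints 2
    }
}
```

Each slice is one concurrently executable work unit; fewer, larger slices mean less parallelism but less scheduling overhead.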

In OpenSearch, we have a search threadpool which serves search requests to
all the lucene indices (or OpenSearch shards) assigned to a node. Each
node can receive requests for some or all of the indices on that node.
I am exploring a mechanism by which I can dynamically control the max
slices for each lucene index search request. For example: search requests
to some indices on a node would have max 4 slices each, and others 2
slices each. The threadpool shared to execute these slices would then not
have any limiting factor. In this model the top-level search threadpool
limits the number of active search requests, which in turn limits the
number of work units in the SliceExecutor threadpool.
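For context, the limiting factor in (b) amounts to executing the task on the caller thread once the executor's queue passes the limit; a rough sketch of that idea (hypothetical names, not the actual QueueSizeBasedExecutor):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BackpressureSketch {
    // Once the queue passes ~1.5x maxPoolSize, the caller runs the task itself
    // instead of enqueueing it, so submitters slow down under load.
    static void execute(ThreadPoolExecutor pool, Runnable task) {
        int limit = (int) (pool.getMaximumPoolSize() * 1.5);
        if (pool.getQueue().size() >= limit) {
            task.run();          // backpressure: caller pays the cost
        } else {
            pool.execute(task);
        }
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(2, 2, 0, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < 10; i++) {
            execute(pool, completed::incrementAndGet);
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(completed.get()); // all 10 ran, some possibly on the caller thread
    }
}
```

Removing this limiting factor, as proposed above, shifts admission control entirely to the top-level search threadpool.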

To support this, a derived implementation of IndexSearcher could take an
input value in its constructor to control the slice count computation. But
even though the *slices* method is protected, it is called from the
constructor of the base IndexSearcher class, which prevents the derived
class from using the passed-in input.
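The ordering problem described above is the standard Java pitfall of calling an overridable method from a base-class constructor; a self-contained illustration (hypothetical minimal classes, not the real IndexSearcher):

```java
class Base {
    final int[] slices;

    Base() {
        // Virtual dispatch: this runs the subclass override *before*
        // the subclass constructor body has executed.
        slices = computeSlices();
    }

    int[] computeSlices() { return new int[] {1}; }
}

class Derived extends Base {
    final int maxSlices; // still holds the default 0 while Base's ctor runs

    Derived(int maxSlices) {
        super();
        this.maxSlices = maxSlices;
    }

    @Override
    int[] computeSlices() {
        return new int[maxSlices]; // reads maxSlices == 0 here!
    }
}

public class CtorPitfall {
    public static void main(String[] args) {
        System.out.println(new Derived(4).slices.length); // prints 0, not 4
    }
}
```

This is exactly why the subclass cannot influence the slice computation through its own constructor argument today.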

To achieve this I can think of the following ways (in order of preference)
and would like to submit a pull request. I wanted to get some feedback on
whether option 1 looks fine or whether some other approach would be better.

1. Provide another constructor in IndexSearcher which takes 4 input
parameters:
  protected IndexSearcher(IndexReaderContext context, Executor executor,
      SliceExecutor sliceExecutor,
      Function<List<LeafReaderContext>, LeafSlice[]> sliceProvider)

2. Make the `leafSlices` member protected and non-final. After it is
initialized by IndexSearcher (using Lucene's default mechanism), the
derived implementation can update it if need be (e.g. based on some input
parameter to its own constructor). Also make the constructor with the
SliceExecutor input protected, so that a derived implementation can
provide its own SliceExecutor. This mechanism will involve redundant
computation of leafSlices.


Thanks,
Sorabh


Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-18 Thread Nicholas Knize
Difficult to keep up with this topic when it's spread across issues, PRs,
and mailing lists. My poll response is option 3; -1 to option 2. I think
the configuration should be moved to the HNSW-specific implementation. At
this point of technical maturity, it doesn't make sense (to me) to have
the config be a global system property.

Given the conversation fragmentation, I'll ask here what I asked in my
comment on the github issue.

"Can anyone smart here post their benchmarks to substantiate their claims?"

For as enthusiastic a topic as vector dimensionality is, it sure is
discouraging that there isn't empirical data to help make an informed
decision about what the recommended limit should be. I've only seen broad
benchmark claims like "We benchmarked a patched Lucene/Solr. We fully
understand (we measured it :-P)". It sure would be useful to see these
benchmarks! Not having them to help improve these arbitrary limits seems
like a serious disservice to the Lucene/Solr user community. I think until
trustworthy numbers are made available, all we'll have is conjecture and
opinions.

IMHO, given Java's lag in SIMD vector support, I'd rather see equal energy
put into Robert's Vector API Integration, Plan B proposal. I'm not trying
to minimize the importance of adding a configuration to the HNSW
dimensionality; I just think we have the requisite expertise on this
project to fix the bigger performance issues that are a direct result of
Java's vector performance deficiencies.

Nicholas Knize, Ph.D., GISP
Principal Engineer - Search  |  Amazon
Apache Lucene PMC Member and Committer
nkn...@apache.org


On Thu, May 18, 2023 at 7:07 AM Michael Wechner 
wrote:

>
>
> On 18.05.23 at 12:22, Michael McCandless wrote:
>
>
> I love all the energy and passion going into debating all the ways to poke
> at this limit, but please let's also spend some of this passion on actually
> improving the scalability of our aKNN implementation!  E.g. Robert opened
> an exciting "Plan B" ( https://github.com/apache/lucene/issues/12302 ) to
> workaround OpenJDK's crazy slowness on enabling access to vectorized SIMD
> CPU instructions (the Java Vector API, JEP 426:
> https://openjdk.org/jeps/426 ).  This could help postings and doc values
> performance too!
>
>
>
> agreed, but I do not think the MAX_DIMENSIONS decision should depend on
> this, because I think whatever improvements can be accomplished eventually,
> very likely there will always be some limit.
>
> Thanks
>
> Michael
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, May 18, 2023 at 5:24 AM Alessandro Benedetti 
> wrote:
>
>> That's great and a good plan B, but let's try to focus this thread of
>> collecting votes for a week (let's keep discussions on the nice PR opened
>> by David or the discussion thread we have in the mailing list already :)
>>
>> On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya, <
>> ichattopadhy...@gmail.com> wrote:
>>
>>> That sounds promising, Michael. Can you share scripts/steps/code to
>>> reproduce this?
>>>
>>> On Thu, 18 May, 2023, 1:16 pm Michael Wechner, <
>>> michael.wech...@wyona.com> wrote:
>>>
 I just implemented it and tested it with OpenAI's
 text-embedding-ada-002, which is using 1536 dimensions and it works very
 fine :-)

 Thanks

 Michael



 On 18.05.23 at 00:29, Michael Wechner wrote:

 IIUC KnnVectorField is deprecated and one is supposed to use
 KnnFloatVectorField when using float as vector values, right?

 On 17.05.23 at 16:41, Michael Sokolov wrote:

 see https://markmail.org/message/kf4nzoqyhwacb7ri

 On Wed, May 17, 2023 at 10:09 AM David Smiley 
 wrote:

> > easily be circumvented by a user
>
> This is a revelation to me and others, if true.  Michael, please then
> point to a test or code snippet that shows the Lucene user community what
> they want to see so they are unblocked from their explorations of vector
> search.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov 
> wrote:
>
>> I think I've said before on this list we don't actually enforce the
>> limit in any way that can't easily be circumvented by a user. The codec
>> already supports any size vector - it doesn't impose any limit. The way 
>> the
>> API is written you can *already today* create an index with max-int sized
>> vectors and we are committed to supporting that going forward by our
>> backwards compatibility policy as Robert points out. This wasn't
>> intentional, I think, but it is the facts.
>>
>> Given that, I think this whole discussion is not really necessary.
>>
>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <

Re: Allowing tests to use multiple cores

2023-05-18 Thread Michael McCandless
Hmm, I think that setting just tells the JVM to pretend the underlying
hardware has only one core?  I.e. forcing
"Runtime.getRuntime().availableProcessors()"
to return 1.

But your test is still free to launch multiple threads to test concurrency,
and they should run on multiple actual CPU cores if your hardware has them?
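To illustrate the distinction: a setting like `-XX:ActiveProcessorCount=1` (presumably the flag involved) only changes what `availableProcessors()` reports, while a test can still spawn as many threads as it likes (self-contained sketch, unrelated to the Lucene test framework):

```java
public class CoresDemo {
    public static void main(String[] args) throws InterruptedException {
        // What the JVM believes (can be pinned, e.g. via -XX:ActiveProcessorCount=1):
        System.out.println(Runtime.getRuntime().availableProcessors());

        // ...but nothing stops a test from launching more threads than that.
        int n = 4;
        int[] ran = new int[n];
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int id = i;
            threads[i] = new Thread(() -> ran[id] = 1);
            threads[i].start();
        }
        for (Thread t : threads) t.join(); // join() makes the writes visible here

        int total = 0;
        for (int r : ran) total += r;
        System.out.println(total); // 4: every thread ran, whatever the reported core count
    }
}
```

Whether those threads actually execute in parallel on distinct cores is up to the OS scheduler, which is Mike's point.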

Mike McCandless

http://blog.mikemccandless.com


On Tue, May 16, 2023 at 7:26 PM Jonathan Ellis  wrote:

> Hi all,
>
> I found out last week that my concurrent HNSW [1] was not as bug-free as I
> had thought.  It was passing the same tests as the serial HNSW, but the
> gradle configuration was limiting the test JVMs to a single core.  I had a
> much more interesting time debugging when I hacked out that limitation
> [2].  Is there a best-practice way to opt into multi-core tests without
> this blunt hammer?
>
> [1] https://github.com/apache/lucene/pull/12254
> [2]
> https://github.com/apache/lucene/pull/12254/commits/e6fbf0afb7da7af49a7a4fdbc578fde0da10d162
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-18 Thread Michael Wechner



On 18.05.23 at 12:22, Michael McCandless wrote:


I love all the energy and passion going into debating all the ways to 
poke at this limit, but please let's also spend some of this passion 
on actually improving the scalability of our aKNN implementation!  
E.g. Robert opened an exciting "Plan B" ( 
https://github.com/apache/lucene/issues/12302 ) to workaround 
OpenJDK's crazy slowness on enabling access to vectorized SIMD CPU 
instructions (the Java Vector API, JEP 426: 
https://openjdk.org/jeps/426 ).  This could help postings and doc 
values performance too!



agreed, but I do not think the MAX_DIMENSIONS decision should depend on 
this, because I think whatever improvements can be accomplished 
eventually, very likely there will always be some limit.


Thanks

Michael



Mike McCandless

http://blog.mikemccandless.com


On Thu, May 18, 2023 at 5:24 AM Alessandro Benedetti 
 wrote:


That's great and a good plan B, but let's try to focus this thread
of collecting votes for a week (let's keep discussions on the nice
PR opened by David or the discussion thread we have in the mailing
list already :)

On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya,
 wrote:

That sounds promising, Michael. Can you share
scripts/steps/code to reproduce this?

On Thu, 18 May, 2023, 1:16 pm Michael Wechner,
 wrote:

I just implemented it and tested it with OpenAI's
text-embedding-ada-002, which is using 1536 dimensions and
it works very fine :-)

Thanks

Michael



Am 18.05.23 um 00:29 schrieb Michael Wechner:

IIUC KnnVectorField is deprecated and one is supposed to
use KnnFloatVectorField when using float as vector
values, right?

Am 17.05.23 um 16:41 schrieb Michael Sokolov:

see https://markmail.org/message/kf4nzoqyhwacb7ri

On Wed, May 17, 2023 at 10:09 AM David Smiley
 wrote:

> easily be circumvented by a user

This is a revelation to me and others, if true. 
Michael, please then point to a test or code snippet
that shows the Lucene user community what they want
to see so they are unblocked from their explorations
of vector search.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
 wrote:

I think I've said before on this list we don't
actually enforce the limit in any way that can't
easily be circumvented by a user. The codec
already supports any size vector - it doesn't
impose any limit. The way the API is written you
can *already today* create an index with max-int
sized vectors and we are committed to supporting
that going forward by our backwards
compatibility policy as Robert points out. This
wasn't intentional, I think, but it is the facts.

Given that, I think this whole discussion is not
really necessary.

On Tue, May 16, 2023 at 4:50 AM Alessandro
Benedetti  wrote:

Hi all,
we have finalized all the options proposed
by the community and we are ready to vote
for the preferred one and then proceed with
the implementation.

*Option 1*
Keep it as it is (dimension limit hardcoded
to 1024)
*Motivation*:
We are close to improving on many fronts.
Given the criticality of Lucene in computing
infrastructure and the concerns raised by
one of the most active stewards of the
project, I think we should keep working
toward improving the feature as is and move
to up the limit after we can demonstrate
improvement unambiguously.

*Option 2*
make the limit configurable, for example
through a system property
*Motivation*:
The system administrator can enforce a limit
its users need to respect that it's in line
with whatever the admin decided to be
acceptable for them.
The default can stay the current one.
This should open the doors for Apache 

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-18 Thread Michael Wechner

It is basically the code which Michael Sokolov posted at

https://markmail.org/message/kf4nzoqyhwacb7ri

except
 - that I have replaced KnnVectorField with KnnFloatVectorField, because
KnnVectorField is deprecated.
 - that I don't hard-code the dimension as 2048 and the metric as
EUCLIDEAN, but take the dimension and metric (VectorSimilarityFunction)
used by the model, which in the case of, for example,
text-embedding-ada-002 are 1536 and COSINE
(https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use)
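For reference, COSINE similarity is just a length-normalized dot product; a tiny self-contained sketch (not Lucene's VectorSimilarityFunction code):

```java
public class CosineDemo {
    // cos(a, b) = (a . b) / (|a| * |b|)
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        float[] a = {1f, 0f};
        float[] b = {1f, 1f};
        System.out.printf("%.4f%n", cosine(a, b)); // prints 0.7071 (= 1/sqrt(2))
    }
}
```

For vectors already normalized to unit length, cosine similarity and dot product coincide, which is why the choice of metric matters less for such embeddings.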


HTH

Michael



On 18.05.23 at 11:10, Ishan Chattopadhyaya wrote:
That sounds promising, Michael. Can you share scripts/steps/code to 
reproduce this?


On Thu, 18 May, 2023, 1:16 pm Michael Wechner, 
 wrote:


I just implemented it and tested it with OpenAI's
text-embedding-ada-002, which is using 1536 dimensions and it
works very fine :-)

Thanks

Michael



On 18.05.23 at 00:29, Michael Wechner wrote:

IIUC KnnVectorField is deprecated and one is supposed to use
KnnFloatVectorField when using float as vector values, right?

On 17.05.23 at 16:41, Michael Sokolov wrote:

see https://markmail.org/message/kf4nzoqyhwacb7ri

On Wed, May 17, 2023 at 10:09 AM David Smiley
 wrote:

> easily be circumvented by a user

This is a revelation to me and others, if true.  Michael,
please then point to a test or code snippet that shows the
Lucene user community what they want to see so they are
unblocked from their explorations of vector search.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
 wrote:

I think I've said before on this list we don't actually
enforce the limit in any way that can't easily be
circumvented by a user. The codec already supports any
size vector - it doesn't impose any limit. The way the
API is written you can *already today* create an index
with max-int sized vectors and we are committed to
supporting that going forward by our backwards
compatibility policy as Robert points out. This wasn't
intentional, I think, but it is the facts.

Given that, I think this whole discussion is not really
necessary.

On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
 wrote:

Hi all,
we have finalized all the options proposed by the
community and we are ready to vote for the preferred
one and then proceed with the implementation.

*Option 1*
Keep it as it is (dimension limit hardcoded to 1024)
*Motivation*:
We are close to improving on many fronts. Given the
criticality of Lucene in computing infrastructure
and the concerns raised by one of the most active
stewards of the project, I think we should keep
working toward improving the feature as is and move
to up the limit after we can demonstrate improvement
unambiguously.

*Option 2*
make the limit configurable, for example through a
system property
*Motivation*:
The system administrator can enforce a limit its
users need to respect that it's in line with
whatever the admin decided to be acceptable for them.
The default can stay the current one.
This should open the doors for Apache Solr,
Elasticsearch, OpenSearch, and any sort of plugin
development

*Option 3*
Move the max dimension limit lower level to a HNSW
specific implementation. Once there, this limit
would not bind any other potential vector engine
alternative/evolution.
*Motivation:* There seem to be contradictory
performance interpretations about the current HNSW
implementation. Some consider its performance ok,
some not, and it depends on the target data set and
use case. Increasing the max dimension limit where
it is currently (in top level FloatVectorValues)
would not allow potential alternatives (e.g. for
other use-cases) to be based on a lower limit.

*Option 4*
Make it configurable and move it to an appropriate
place.
In particular, a
simple Integer.getInteger("lucene.hnsw.maxDimensions",
1024) should be enough.
*Motivation*:
Both are good and 

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-18 Thread Michael McCandless
This isn't really a VOTE (no specific code change is being proposed), but
rather a poll?

Anyway, I would prefer Option 3: put the limit check into the HNSW
algorithm itself.  This is the right place for the limit check, since HNSW
has its own scaling behaviour.  It might have other limits, like max
fanout, etc.  And we really should fix the loophole Mike S posted -- that's
just a dangerous long-term trap for users, thinking they have the back
compat promise of Lucene, when in fact they do not.

I love all the energy and passion going into debating all the ways to poke
at this limit, but please let's also spend some of this passion on actually
improving the scalability of our aKNN implementation!  E.g. Robert opened
an exciting "Plan B" ( https://github.com/apache/lucene/issues/12302 ) to
workaround OpenJDK's crazy slowness on enabling access to vectorized SIMD
CPU instructions (the Java Vector API, JEP 426: https://openjdk.org/jeps/426
).  This could help postings and doc values performance too!

Mike McCandless

http://blog.mikemccandless.com


On Thu, May 18, 2023 at 5:24 AM Alessandro Benedetti 
wrote:

> That's great and a good plan B, but let's try to focus this thread of
> collecting votes for a week (let's keep discussions on the nice PR opened
> by David or the discussion thread we have in the mailing list already :)
>
> On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya, <
> ichattopadhy...@gmail.com> wrote:
>
>> That sounds promising, Michael. Can you share scripts/steps/code to
>> reproduce this?
>>
>> On Thu, 18 May, 2023, 1:16 pm Michael Wechner, 
>> wrote:
>>
>>> I just implemented it and tested it with OpenAI's
>>> text-embedding-ada-002, which is using 1536 dimensions and it works very
>>> fine :-)
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>>
>>> On 18.05.23 at 00:29, Michael Wechner wrote:
>>>
>>> IIUC KnnVectorField is deprecated and one is supposed to use
>>> KnnFloatVectorField when using float as vector values, right?
>>>
>>> On 17.05.23 at 16:41, Michael Sokolov wrote:
>>>
>>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>>
>>> On Wed, May 17, 2023 at 10:09 AM David Smiley 
>>> wrote:
>>>
 > easily be circumvented by a user

 This is a revelation to me and others, if true.  Michael, please then
 point to a test or code snippet that shows the Lucene user community what
 they want to see so they are unblocked from their explorations of vector
 search.

 ~ David Smiley
 Apache Lucene/Solr Search Developer
 http://www.linkedin.com/in/davidwsmiley


 On Wed, May 17, 2023 at 7:51 AM Michael Sokolov 
 wrote:

> I think I've said before on this list we don't actually enforce the
> limit in any way that can't easily be circumvented by a user. The codec
> already supports any size vector - it doesn't impose any limit. The way 
> the
> API is written you can *already today* create an index with max-int sized
> vectors and we are committed to supporting that going forward by our
> backwards compatibility policy as Robert points out. This wasn't
> intentional, I think, but it is the facts.
>
> Given that, I think this whole discussion is not really necessary.
>
> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
> a.benede...@sease.io> wrote:
>
>> Hi all,
>> we have finalized all the options proposed by the community and we
>> are ready to vote for the preferred one and then proceed with the
>> implementation.
>>
>> *Option 1*
>> Keep it as it is (dimension limit hardcoded to 1024)
>> *Motivation*:
>> We are close to improving on many fronts. Given the criticality of
>> Lucene in computing infrastructure and the concerns raised by one of the
>> most active stewards of the project, I think we should keep working 
>> toward
>> improving the feature as is and move to up the limit after we can
>> demonstrate improvement unambiguously.
>>
>> *Option 2*
>> make the limit configurable, for example through a system property
>> *Motivation*:
>> The system administrator can enforce a limit its users need to
>> respect that it's in line with whatever the admin decided to be 
>> acceptable
>> for them.
>> The default can stay the current one.
>> This should open the doors for Apache Solr, Elasticsearch,
>> OpenSearch, and any sort of plugin development
>>
>> *Option 3*
>> Move the max dimension limit lower level to a HNSW specific
>> implementation. Once there, this limit would not bind any other potential
>> vector engine alternative/evolution.
>> *Motivation:* There seem to be contradictory performance
>> interpretations about the current HNSW implementation. Some consider its
>> performance ok, some not, and it depends on the target data set and use
>> case. Increasing the max dimension limit where it is currently (in top
>> level 

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-18 Thread Alessandro Benedetti
That's great and a good plan B, but let's try to focus this thread of
collecting votes for a week (let's keep discussions on the nice PR opened
by David or the discussion thread we have in the mailing list already :)

On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya, 
wrote:

> That sounds promising, Michael. Can you share scripts/steps/code to
> reproduce this?
>
> On Thu, 18 May, 2023, 1:16 pm Michael Wechner, 
> wrote:
>
>> I just implemented it and tested it with OpenAI's text-embedding-ada-002,
>> which is using 1536 dimensions and it works very fine :-)
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> On 18.05.23 at 00:29, Michael Wechner wrote:
>>
>> IIUC KnnVectorField is deprecated and one is supposed to use
>> KnnFloatVectorField when using float as vector values, right?
>>
>> On 17.05.23 at 16:41, Michael Sokolov wrote:
>>
>> see https://markmail.org/message/kf4nzoqyhwacb7ri
>>
>> On Wed, May 17, 2023 at 10:09 AM David Smiley  wrote:
>>
>>> > easily be circumvented by a user
>>>
>>> This is a revelation to me and others, if true.  Michael, please then
>>> point to a test or code snippet that shows the Lucene user community what
>>> they want to see so they are unblocked from their explorations of vector
>>> search.
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov 
>>> wrote:
>>>
 I think I've said before on this list we don't actually enforce the
 limit in any way that can't easily be circumvented by a user. The codec
 already supports any size vector - it doesn't impose any limit. The way the
 API is written you can *already today* create an index with max-int sized
 vectors and we are committed to supporting that going forward by our
 backwards compatibility policy as Robert points out. This wasn't
 intentional, I think, but it is the facts.

 Given that, I think this whole discussion is not really necessary.

 On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
 a.benede...@sease.io> wrote:

> Hi all,
> we have finalized all the options proposed by the community and we are
> ready to vote for the preferred one and then proceed with the
> implementation.
>
> *Option 1*
> Keep it as it is (dimension limit hardcoded to 1024)
> *Motivation*:
> We are close to improving on many fronts. Given the criticality of
> Lucene in computing infrastructure and the concerns raised by one of the
> most active stewards of the project, I think we should keep working toward
> improving the feature as is and move to up the limit after we can
> demonstrate improvement unambiguously.
>
> *Option 2*
> make the limit configurable, for example through a system property
> *Motivation*:
> The system administrator can enforce a limit its users need to respect
> that it's in line with whatever the admin decided to be acceptable for
> them.
> The default can stay the current one.
> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
> and any sort of plugin development
>
> *Option 3*
> Move the max dimension limit lower level to a HNSW specific
> implementation. Once there, this limit would not bind any other potential
> vector engine alternative/evolution.
> *Motivation:* There seem to be contradictory performance
> interpretations about the current HNSW implementation. Some consider its
> performance ok, some not, and it depends on the target data set and use
> case. Increasing the max dimension limit where it is currently (in top
> level FloatVectorValues) would not allow potential alternatives (e.g. for
> other use-cases) to be based on a lower limit.
>
> *Option 4*
> Make it configurable and move it to an appropriate place.
> In particular, a
> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
> enough.
> *Motivation*:
> Both are good and not mutually exclusive and could happen in any order.
> Someone suggested to perfect what the _default_ limit should be, but
> I've not seen an argument _against_ configurability.  Especially in this
> way -- a toggle that doesn't bind Lucene's APIs in any way.
>
> I'll keep this [VOTE] open for a week and then proceed to the
> implementation.
> --
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benede...@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io 
> LinkedIn  | Twitter
>  | Youtube
>  | Github
> 
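The `Integer.getInteger` toggle proposed in Option 4 above behaves like this (a minimal sketch; `lucene.hnsw.maxDimensions` is the proposed property name, not an existing Lucene flag):

```java
public class MaxDimToggle {
    // Hypothetical helper mirroring Option 4's toggle.
    static int maxDimensions() {
        // Reads the system property as an int, falling back to 1024 if unset/unparseable.
        return Integer.getInteger("lucene.hnsw.maxDimensions", 1024);
    }

    public static void main(String[] args) {
        System.out.println(maxDimensions()); // no property set: prints 1024
        System.setProperty("lucene.hnsw.maxDimensions", "2048");
        System.out.println(maxDimensions()); // now prints 2048
    }
}
```

Because the property is read through a plain JDK call rather than a public Lucene API, the toggle adds no back-compat surface, which is the point made in the motivation.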

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-18 Thread Ishan Chattopadhyaya
That sounds promising, Michael. Can you share scripts/steps/code to
reproduce this?

On Thu, 18 May, 2023, 1:16 pm Michael Wechner, 
wrote:

> I just implemented it and tested it with OpenAI's text-embedding-ada-002,
> which is using 1536 dimensions and it works very fine :-)
>
> Thanks
>
> Michael
>
>
>
> On 18.05.23 at 00:29, Michael Wechner wrote:
>
> IIUC KnnVectorField is deprecated and one is supposed to use
> KnnFloatVectorField when using float as vector values, right?
>
> On 17.05.23 at 16:41, Michael Sokolov wrote:
>
> see https://markmail.org/message/kf4nzoqyhwacb7ri
>
> On Wed, May 17, 2023 at 10:09 AM David Smiley  wrote:
>
>> > easily be circumvented by a user
>>
>> This is a revelation to me and others, if true.  Michael, please then
>> point to a test or code snippet that shows the Lucene user community what
>> they want to see so they are unblocked from their explorations of vector
>> search.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov 
>> wrote:
>>
>>> I think I've said before on this list we don't actually enforce the
>>> limit in any way that can't easily be circumvented by a user. The codec
>>> already supports any size vector - it doesn't impose any limit. The way the
>>> API is written you can *already today* create an index with max-int sized
>>> vectors and we are committed to supporting that going forward by our
>>> backwards compatibility policy as Robert points out. This wasn't
>>> intentional, I think, but it is the facts.
>>>
>>> Given that, I think this whole discussion is not really necessary.
>>>
>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>> a.benede...@sease.io> wrote:
>>>
 Hi all,
 we have finalized all the options proposed by the community and we are
 ready to vote for the preferred one and then proceed with the
 implementation.

 *Option 1*
 Keep it as it is (dimension limit hardcoded to 1024)
 *Motivation*:
 We are close to improving on many fronts. Given the criticality of
 Lucene in computing infrastructure and the concerns raised by one of the
 most active stewards of the project, I think we should keep working toward
 improving the feature as is and move to up the limit after we can
 demonstrate improvement unambiguously.

 *Option 2*
 make the limit configurable, for example through a system property
 *Motivation*:
 The system administrator can enforce a limit its users need to respect
 that it's in line with whatever the admin decided to be acceptable for
 them.
 The default can stay the current one.
 This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
 and any sort of plugin development

 *Option 3*
 Move the max dimension limit lower level to a HNSW specific
 implementation. Once there, this limit would not bind any other potential
 vector engine alternative/evolution.
 *Motivation:* There seem to be contradictory performance
 interpretations about the current HNSW implementation. Some consider its
 performance ok, some not, and it depends on the target data set and use
 case. Increasing the max dimension limit where it is currently (in top
 level FloatVectorValues) would not allow potential alternatives (e.g. for
 other use-cases) to be based on a lower limit.

 *Option 4*
 Make it configurable and move it to an appropriate place.
 In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
 1024) should be enough.
 *Motivation*:
 Both are good and not mutually exclusive and could happen in any order.
 Someone suggested to perfect what the _default_ limit should be, but
 I've not seen an argument _against_ configurability.  Especially in this
 way -- a toggle that doesn't bind Lucene's APIs in any way.

 I'll keep this [VOTE] open for a week and then proceed to the
 implementation.
 --
 *Alessandro Benedetti*
 Director @ Sease Ltd.
 *Apache Lucene/Solr Committer*
 *Apache Solr PMC Member*

 e-mail: a.benede...@sease.io


 *Sease* - Information Retrieval Applied
 Consulting | Training | Open Source

 Website: Sease.io 
 LinkedIn  | Twitter
  | Youtube
  | Github
 

>>>
>
>


Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-18 Thread Michael Wechner
I just implemented it and tested it with OpenAI's
text-embedding-ada-002, which uses 1536 dimensions, and it works very
well :-)


Thanks

Michael



On 18.05.23 at 00:29, Michael Wechner wrote:
IIUC KnnVectorField is deprecated and one is supposed to use 
KnnFloatVectorField when using float as vector values, right?


On 17.05.23 at 16:41, Michael Sokolov wrote:

see https://markmail.org/message/kf4nzoqyhwacb7ri

On Wed, May 17, 2023 at 10:09 AM David Smiley  wrote:

> easily be circumvented by a user

This is a revelation to me and others, if true. Michael, please
then point to a test or code snippet that shows the Lucene user
community what they want to see so they are unblocked from their
explorations of vector search.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
 wrote:

I think I've said before on this list we don't actually
enforce the limit in any way that can't easily be
circumvented by a user. The codec already supports any size
vector - it doesn't impose any limit. The way the API is
written you can *already today* create an index with max-int
sized vectors and we are committed to supporting that going
forward by our backwards compatibility policy as Robert
points out. This wasn't intentional, I think, but it is the
facts.

Given that, I think this whole discussion is not really
necessary.

On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
 wrote:

Hi all,
we have finalized all the options proposed by the
community and we are ready to vote for the preferred one
and then proceed with the implementation.

*Option 1*
Keep it as it is (dimension limit hardcoded to 1024)
*Motivation*:
We are close to improving on many fronts. Given the
criticality of Lucene in computing infrastructure and the
concerns raised by one of the most active stewards of the
project, I think we should keep working toward improving
the feature as is and move to up the limit after we can
demonstrate improvement unambiguously.

*Option 2*
make the limit configurable, for example through a system
property
*Motivation*:
The system administrator can enforce a limit its users
need to respect that it's in line with whatever the admin
decided to be acceptable for them.
The default can stay the current one.
This should open the doors for Apache Solr,
Elasticsearch, OpenSearch, and any sort of plugin development

*Option 3*
Move the max dimension limit lower level to a HNSW
specific implementation. Once there, this limit would not
bind any other potential vector engine
alternative/evolution.
*Motivation:* There seem to be contradictory performance
interpretations about the current HNSW implementation.
Some consider its performance ok, some not, and it
depends on the target data set and use case. Increasing
the max dimension limit where it is currently (in top
level FloatVectorValues) would not allow
potential alternatives (e.g. for other use-cases) to be
based on a lower limit.

*Option 4*
Make it configurable and move it to an appropriate place.
In particular, a
simple Integer.getInteger("lucene.hnsw.maxDimensions",
1024) should be enough.
*Motivation*:
Both are good and not mutually exclusive and could happen
in any order.
Someone suggested to perfect what the _default_ limit
should be, but I've not seen an argument _against_
configurability.  Especially in this way -- a toggle that
doesn't bind Lucene's APIs in any way.

I'll keep this [VOTE] open for a week and then proceed to
the implementation.
--
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io 
LinkedIn  |
Twitter  | Youtube
 |
Github