Re: Any recommended issues to work on for a newcomer?

2024-05-13 Thread Michael Wechner

Great, sounds like we have a plan :-)

Hank and I can get started trying to understand the internals better ...

Thanks

Michael

On 13.05.24 at 18:21, Alessandro Benedetti wrote:
Sure, we can make it work, but in a distributed environment you first have 
to run each query distributed (aggregating across all nodes) and then apply 
RRF on top of the aggregated ranked lists.
I suspect that doing RRF per node first and then aggregating per shard 
won't return the same results.
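
For example, with the commonly used RRF constant k=60, each list credits a 
document with 1/(60 + rank): a document ranked 1st on its own node but 3rd 
in the globally aggregated list would receive 1/61 from per-node fusion but 
1/63 from global fusion, so the two fused orderings can diverge.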

When I go back to working on the task I'll be able to elaborate more!

Cheers
--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter 
<https://twitter.com/seaseltd> | Youtube 
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github 
<https://github.com/seaseltd>



On Mon, 13 May 2024 at 14:12, Adrien Grand  wrote:

> Maybe Adrien Grand and others might also have some feedback :-)

I'd suggest the signature look something like `TopDocs
TopDocs#rrf(int topN, int k, TopDocs[] hits)`, to be consistent
with `TopDocs#merge`. Internally, it should look at
`ScoreDoc#shardIndex` and `ScoreDoc#doc` to figure out which hits map
to the same document.
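
A minimal sketch of what such a utility could look like (the class name is
made up, and keying on ScoreDoc.shardIndex is an assumption based on this
suggestion, not a committed Lucene API; RRF scores each hit as the sum over
lists of 1/(k + rank)):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TotalHits;

public class TopDocsRrf {

    /** Fuses ranked lists: score(d) = sum over lists of 1 / (k + rank(d)), rank starting at 1. */
    public static TopDocs rrf(int topN, int k, TopDocs[] hits) {
        // Key each hit by (shardIndex, doc) so the same document retrieved by
        // several queries collapses into a single fused entry.
        Map<Long, ScoreDoc> fused = new HashMap<>();
        for (TopDocs topDocs : hits) {
            int rank = 1;
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                long key = (((long) scoreDoc.shardIndex) << 32) | (scoreDoc.doc & 0xFFFFFFFFL);
                ScoreDoc contribution = new ScoreDoc(scoreDoc.doc, 1f / (k + rank++), scoreDoc.shardIndex);
                fused.merge(key, contribution, (a, b) -> {
                    a.score += b.score;
                    return a;
                });
            }
        }
        List<ScoreDoc> sorted = new ArrayList<>(fused.values());
        sorted.sort((a, b) -> Float.compare(b.score, a.score)); // highest fused score first
        ScoreDoc[] top = sorted.subList(0, Math.min(topN, sorted.size())).toArray(new ScoreDoc[0]);
        return new TopDocs(new TotalHits(fused.size(), TotalHits.Relation.EQUAL_TO), top);
    }
}

An application would then call it the same way it calls `TopDocs#merge`,
e.g. TopDocsRrf.rrf(10, 60, new TopDocs[] {topDocsKeyword, topDocsVector}).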

> Back in the day, I was reasoning on this and I didn't think
Lucene was the right place for an interleaving algorithm, given
that Reciprocal Rank Fusion is affected by distribution and it's
not supposed to work per node.

To me this is like `TopDocs#merge`. There are changes needed on
the application side to hook this call into the logic that
combines hits that come from multiple shards (multiple queries in
the case of RRF), but Lucene can still provide the merging logic.

On Mon, May 13, 2024 at 1:41 PM Michael Wechner
 wrote:

Thanks for your feedback Alessandro!

I am using Lucene independently of Solr, OpenSearch, or
Elasticsearch, but would like to combine different result sets
using RRF, and therefore think that Lucene itself could actually
be a good place.

Looking forward to your additional elaboration!

Thanks

Michael





On 13.05.2024 at 12:34, Alessandro Benedetti wrote:

This is not strictly related to Lucene, but I'll give a talk
at Berlin Buzzwords on how I am implementing Reciprocal Rank
Fusion in Apache Solr.
I'll resume my work on the contribution next week and have
more to share later.

Back in the day, I was reasoning on this and I didn't think
Lucene was the right place for an interleaving algorithm,
given that Reciprocal Rank Fusion is affected by distribution
and it's not supposed to work per node.
I think I evaluated the possibility of doing it as a Lucene
query or a Lucene component but then ended up with a
different approach.
I'll elaborate more when I go back to the task!

Cheers
--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
Github <https://github.com/seaseltd>


On Sat, 11 May 2024 at 09:10, Michael Wechner
 wrote:

sure, no problem!

Maybe Adrien Grand and others might also have some
feedback :-)

Thanks

Michael

On 10.05.24 at 23:03, Chang Hank wrote:

Thank you for these useful resources, please allow me to
spend some time looking into it.
I’ll let you know asap!!

Thanks

        Hank


On May 10, 2024, at 12:34 PM, Michael Wechner <michael.wech...@wyona.com> wrote:

also we might want to consider how this relates to


https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html

In vector search reranking has become quite popular, e.g.

https://docs.cohere.com/docs/reranking

IIUC LangChain (python) for example adds the reranker
as an argument to the searcher/retriever


https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/

So maybe th

Re: Any recommended issues to work on for a newcomer?

2024-05-13 Thread Michael Wechner
Thanks for your feedback Alessandro!

I am using Lucene independently of Solr, OpenSearch, or Elasticsearch, but would 
like to combine different result sets using RRF, and therefore think that Lucene 
itself could actually be a good place.

Looking forward to your additional elaboration!

Thanks

Michael




> On 13.05.2024 at 12:34, Alessandro Benedetti wrote:
> 
> This is not strictly related to Lucene, but I'll give a talk at Berlin 
> Buzzwords on how I am implementing Reciprocal Rank Fusion in Apache Solr.
> I'll resume my work on the contribution next week and have more to share 
> later.
> 
> Back in the day, I was reasoning on this and I didn't think Lucene was the 
> right place for an interleaving algorithm, given that Reciprocal Rank Fusion 
> is affected by distribution and it's not supposed to work per node.
> I think I evaluated the possibility of doing it as a Lucene query or a Lucene 
> component but then ended up with a different approach.
> I'll elaborate more when I go back to the task!
> 
> Cheers
> --
> Alessandro Benedetti
> Director @ Sease Ltd.
> Apache Lucene/Solr Committer
> Apache Solr PMC Member
> 
> e-mail: a.benede...@sease.io
> 
> 
> Sease - Information Retrieval Applied
> Consulting | Training | Open Source
> 
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter 
> <https://twitter.com/seaseltd> | Youtube 
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github 
> <https://github.com/seaseltd>
> 
> On Sat, 11 May 2024 at 09:10, Michael Wechner <michael.wech...@wyona.com> wrote:
> sure, no problem!
> 
> Maybe Adrien Grand and others might also have some feedback :-)
> 
> Thanks
> 
> Michael
> 
> On 10.05.24 at 23:03, Chang Hank wrote:
>> Thank you for these useful resources, please allow me to spend some time 
>> looking into it.
>> I’ll let you know asap!!
>> 
>> Thanks
>> 
>> Hank
>> 
>>> On May 10, 2024, at 12:34 PM, Michael Wechner <michael.wech...@wyona.com> wrote:
>>> 
>>> also we might want to consider how this relates to
>>> 
>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html
>>> 
>>> In vector search reranking has become quite popular, e.g.
>>> 
>>> https://docs.cohere.com/docs/reranking
>>> 
>>> IIUC LangChain (python) for example adds the reranker as an argument to the 
>>> searcher/retriever
>>> 
>>> https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/
>>> 
>>> So maybe the following might make sense as well
>>> 
>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
>>> TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50, new CohereReranker());
>>>
>>> TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword, topDocsVector);
>>> 
>>> WDYT?
>>> 
>>> Thanks
>>> 
>>> Michael
>>> 
>>> 
>>>> On 10.05.24 at 21:08, Michael Wechner wrote:
>>>> great, yes, let's get started :-)
>>>> 
>>>> What about the following pseudo code, assuming that there might be 
>>>> alternative ranking algorithms to RRF
>>>> 
>>>> StoredFields storedFieldsKeyword = indexReaderKeyword.storedFields();
>>>> StoredFields storedFieldsVector = indexReaderVector.storedFields();
>>>> 
>>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
>>>> TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50);
>>>> 
>>>> Ranker ranker = new RRFRanker();
>>>> TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, topDocsVector);
>>>> 
>>>> for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
>>>>     Document docK = storedFieldsKeyword.document(scoreDoc.doc);
>>>>     Document docV = storedFieldsVector.document(scoreDoc.doc);
>>>>     ...
>>>> }
>>>> 
>>>> see also
>>>> 
>>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/To

Re: Any recommended issues to work on for a newcomer?

2024-05-11 Thread Michael Wechner

sure, no problem!

Maybe Adrien Grand and others might also have some feedback :-)

Thanks

Michael

On 10.05.24 at 23:03, Chang Hank wrote:
Thank you for these useful resources, please allow me to spend some 
time looking into it.

I’ll let you know asap!!

Thanks

Hank

On May 10, 2024, at 12:34 PM, Michael Wechner 
 wrote:


also we might want to consider how this relates to

https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html

In vector search reranking has become quite popular, e.g.

https://docs.cohere.com/docs/reranking

IIUC LangChain (python) for example adds the reranker as an argument 
to the searcher/retriever


https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/

So maybe the following might make sense as well

TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50, new CohereReranker());

TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword, topDocsVector);
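
For reference, the existing utility being compared against is
TopDocs.merge(int topN, TopDocs[] shardHits); a sketch of how it is called
today (plain score-based merging, which is exactly what an RRF variant would
replace, since keyword and vector scores are not on the same scale):

TopDocs[] shardHits = new TopDocs[] { topDocsKeyword, topDocsVector };
TopDocs merged = TopDocs.merge(10, shardHits);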


WDYT?

Thanks

Michael


On 10.05.24 at 21:08, Michael Wechner wrote:

great, yes, let's get started :-)

What about the following pseudo code, assuming that there might be 
alternative ranking algorithms to RRF


StoredFields storedFieldsKeyword = indexReaderKeyword.storedFields();
StoredFields storedFieldsVector = indexReaderVector.storedFields();

TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50);

Ranker ranker = new RRFRanker();
TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, topDocsVector);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document docK = storedFieldsKeyword.document(scoreDoc.doc);
    Document docV = storedFieldsVector.document(scoreDoc.doc);
    ...
}

see also

https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/TopDocs.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html

WDYT?

Thanks

Michael




On 10.05.24 at 20:01, Chang Hank wrote:

Hi Michael,

Sounds good to me.
Let’s do it!!

Cheers,
Hank

On May 10, 2024, at 10:50 AM, Michael Wechner 
 wrote:


Hi Hank

Very cool!

Adrien Grand suggested implementing it as a utility method on the 
TopDocs class, and since Adrien worked for a decade on Lucene 
(https://www.elastic.co/de/blog/author/adrien-grand) I guess it 
makes sense to follow his advice :-) We could create a PR and work 
together on it, WDYT?

All the best

Michael

On 10.05.24 at 18:51, Chang Hank wrote:

Hi Michael,

Thank you for the reply.
This is really a cool issue to work on, I’m happy to work on this 
with you. I’ll try to do research on RRF first.

Also, are we going to implement this on the TopDocs class?

Best,
Hank


On May 9, 2024, at 11:08 PM, Michael Wechner 
 wrote:


Hi Hank

Thanks for offering your help!

I recently suggested implementing RRF (Reciprocal Rank Fusion)

https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz

but still have not found the time to really work on this.

Maybe you would be interested in doing this, or we could work on it 
together somehow?


Thanks

Michael



On 10.05.24 at 07:27, Chang Hank wrote:

Hi everyone,

I’m Hank Chang, currently studying Information Retrieval 
topics. I’m really interested in contributing to Apache Lucene 
and enhancing my understanding of the field.
I’ve reviewed several issues posted on the Github repository 
but haven’t found a straightforward starting point. Could 
someone please recommend suitable issues for a newcomer like me 
or suggest areas I could assist with?


Thank you for your time and guidance.

Best regards,
Hank Chang

Re: Any recommended issues to work on for a newcomer?

2024-05-10 Thread Michael Wechner

also we might want to consider how this relates to

https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html

In vector search reranking has become quite popular, e.g.

https://docs.cohere.com/docs/reranking

IIUC LangChain (python) for example adds the reranker as an argument to 
the searcher/retriever


https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/

So maybe the following might make sense as well

TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50, new CohereReranker());

TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword, topDocsVector);


WDYT?

Thanks

Michael


On 10.05.24 at 21:08, Michael Wechner wrote:

great, yes, let's get started :-)

What about the following pseudo code, assuming that there might be 
alternative ranking algorithms to RRF


StoredFields storedFieldsKeyword = indexReaderKeyword.storedFields();
StoredFields storedFieldsVector = indexReaderVector.storedFields();

TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50);

Ranker ranker = new RRFRanker();
TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, topDocsVector);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document docK = storedFieldsKeyword.document(scoreDoc.doc);
    Document docV = storedFieldsVector.document(scoreDoc.doc);
    ...
}

see also

https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/TopDocs.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html

WDYT?

Thanks

Michael




On 10.05.24 at 20:01, Chang Hank wrote:

Hi Michael,

Sounds good to me.
Let’s do it!!

Cheers,
Hank

On May 10, 2024, at 10:50 AM, Michael Wechner 
 wrote:


Hi Hank

Very cool!

Adrien Grand suggested implementing it as a utility method on the 
TopDocs class, and since Adrien worked for a decade on Lucene 
(https://www.elastic.co/de/blog/author/adrien-grand) I guess it makes 
sense to follow his advice :-) We could create a PR and work 
together on it, WDYT?

All the best

Michael

On 10.05.24 at 18:51, Chang Hank wrote:

Hi Michael,

Thank you for the reply.
This is really a cool issue to work on, I’m happy to work on this 
with you. I’ll try to do research on RRF first.

Also, are we going to implement this on the TopDocs class?

Best,
Hank


On May 9, 2024, at 11:08 PM, Michael Wechner 
 wrote:


Hi Hank

Thanks for offering your help!

I recently suggested implementing RRF (Reciprocal Rank Fusion)

https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz

but still have not found the time to really work on this.

Maybe you would be interested in doing this, or we could work on it 
together somehow?


Thanks

Michael



On 10.05.24 at 07:27, Chang Hank wrote:

Hi everyone,

I’m Hank Chang, currently studying Information Retrieval topics. 
I’m really interested in contributing to Apache Lucene and 
enhancing my understanding of the field.
I’ve reviewed several issues posted on the Github repository but 
haven’t found a straightforward starting point. Could someone 
please recommend suitable issues for a newcomer like me or 
suggest areas I could assist with?


Thank you for your time and guidance.

Best regards,
Hank Chang

Re: Any recommended issues to work on for a newcomer?

2024-05-10 Thread Michael Wechner

great, yes, let's get started :-)

What about the following pseudo code, assuming that there might be 
alternative ranking algorithms to RRF


StoredFields storedFieldsKeyword = indexReaderKeyword.storedFields();
StoredFields storedFieldsVector = indexReaderVector.storedFields();

TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50);

Ranker ranker = new RRFRanker();
TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, topDocsVector);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document docK = storedFieldsKeyword.document(scoreDoc.doc);
    Document docV = storedFieldsVector.document(scoreDoc.doc);
    ...
}

see also

https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/TopDocs.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html

WDYT?

Thanks

Michael




On 10.05.24 at 20:01, Chang Hank wrote:

Hi Michael,

Sounds good to me.
Let’s do it!!

Cheers,
Hank

On May 10, 2024, at 10:50 AM, Michael Wechner 
 wrote:


Hi Hank

Very cool!

Adrien Grand suggested implementing it as a utility method on the 
TopDocs class, and since Adrien worked for a decade on Lucene 
(https://www.elastic.co/de/blog/author/adrien-grand) I guess it makes 
sense to follow his advice :-) We could create a PR and work together 
on it, WDYT?

All the best

Michael

On 10.05.24 at 18:51, Chang Hank wrote:

Hi Michael,

Thank you for the reply.
This is really a cool issue to work on, I’m happy to work on this 
with you. I’ll try to do research on RRF first.

Also, are we going to implement this on the TopDocs class?

Best,
Hank


On May 9, 2024, at 11:08 PM, Michael Wechner 
 wrote:


Hi Hank

Thanks for offering your help!

I recently suggested implementing RRF (Reciprocal Rank Fusion)

https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz

but still have not found the time to really work on this.

Maybe you would be interested in doing this, or we could work on it 
together somehow?


Thanks

Michael



On 10.05.24 at 07:27, Chang Hank wrote:

Hi everyone,

I’m Hank Chang, currently studying Information Retrieval topics. 
I’m really interested in contributing to Apache Lucene and enhancing 
my understanding of the field.
I’ve reviewed several issues posted on the Github repository but 
haven’t found a straightforward starting point. Could someone 
please recommend suitable issues for a newcomer like me or suggest 
areas I could assist with?


Thank you for your time and guidance.

Best regards,
Hank Chang

Re: Any recommended issues to work on for a newcomer?

2024-05-10 Thread Michael Wechner

Hi Hank

Very cool!

Adrien Grand suggested implementing it as a utility method on the 
TopDocs class, and since Adrien worked for a decade on Lucene 
(https://www.elastic.co/de/blog/author/adrien-grand) I guess it makes 
sense to follow his advice :-) We could create a PR and work together on 
it, WDYT?

All the best

Michael

On 10.05.24 at 18:51, Chang Hank wrote:

Hi Michael,

Thank you for the reply.
This is really a cool issue to work on, I’m happy to work on this with 
you. I’ll try to do research on RRF first.

Also, are we going to implement this on the TopDocs class?

Best,
Hank


On May 9, 2024, at 11:08 PM, Michael Wechner 
 wrote:


Hi Hank

Thanks for offering your help!

I recently suggested implementing RRF (Reciprocal Rank Fusion)

https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz

but still have not found the time to really work on this.

Maybe you would be interested in doing this, or we could work on it 
together somehow?


Thanks

Michael



On 10.05.24 at 07:27, Chang Hank wrote:

Hi everyone,

I’m Hank Chang, currently studying Information Retrieval topics. I’m 
really interested in contributing to Apache Lucene and enhancing my 
understanding of the field.
I’ve reviewed several issues posted on the Github repository but 
haven’t found a straightforward starting point. Could someone please 
recommend suitable issues for a newcomer like me or suggest areas I 
could assist with?


Thank you for your time and guidance.

Best regards,
Hank Chang

Re: Any recommended issues to work on for a newcomer?

2024-05-10 Thread Michael Wechner

Hi Hank

Thanks for offering your help!

I recently suggested implementing RRF (Reciprocal Rank Fusion)

https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz

but still have not found the time to really work on this.

Maybe you would be interested in doing this, or we could work on it together 
somehow?


Thanks

Michael



On 10.05.24 at 07:27, Chang Hank wrote:

Hi everyone,

I’m Hank Chang, currently studying Information Retrieval topics. I’m really 
interested in contributing to Apache Lucene and enhancing my understanding of 
the field.
I’ve reviewed several issues posted on the Github repository but haven’t found 
a straightforward starting point. Could someone please recommend suitable 
issues for a newcomer like me or suggest areas I could assist with?

Thank you for your time and guidance.

Best regards,
Hank Chang

Re: Announcing githubsearch!

2024-02-19 Thread Michael Wechner

thank you very much!

On 19.02.24 at 17:39, Michael McCandless wrote:

Hi Team,

~1.5 years ago (August 2022) we migrated our Lucene issue tracking 
from Jira to GitHub. Thank you Tomoko for all the hard work doing such 
a complex, multi-phased, high-fidelity migration!


I finally finished also migrating jirasearch to GitHub: 
githubsearch.mikemccandless.com. It was tricky because 
GitHub issues/PRs are fundamentally more complex than Jira's data 
model, and the GitHub REST API is also quite rich / heavily 
normalized. All of the source code for githubsearch lives here.
The UI remains its barebones self ;)


Githubsearch is dog food for us: it showcases Lucene (currently 9.8.0), and many of 
its fun features like infix autosuggest, block join queries (each 
comment is a sub-document on the issue/PR), DrillSideways faceting, 
near-real-time indexing/searching, synonyms (try “oome”), 
expressions, non-relevance and blended-relevance sort, etc. (This old 
blog post goes 
into detail.)  Plus, it’s meta-fun to use Lucene to search its own 
issues, to help us be more productive in improving Lucene! Nicely 
recursive.


In addition to good ol’ searching by text, githubsearch has some new/fun features:


  * Drill down to just PRs or issues
  * Filter by “review requested” for a given user: poor Adrien has 8
    (open) now (sorry)! Or see your mentions (Robert is mentioned in 27
    open issues/PRs). Or PRs that you reviewed (Uwe has reviewed 9
    still-open PRs). Or issues and PRs where a user has had any
    involvement at all (Dawid has interacted on 197 issues/PRs).
  * Find still-open PRs that were created by a New Contributor (an
    author who has no changes merged into our repository) or Contributor
    (non-committer who has had some changes merged into our repository)
    or Member
  * Here are the uber-stale (last touched more than a month ago) open
    PRs by outside contributors. We should ideally keep this at 0, but
    it’s 83 now!
  * “Link to this search” to get a short-er, more permanent URL (it is
NOT a URL shortener, though!)
  * Save named searches you frequently run (they just save to local
cookie state on that one browser)

I’m sure there are exciting bugs, feedback/patches welcome!  If you 
see problems, please reply to this email or file an issue here.


Note that jirasearch remains running, to search Solr, Tika and Infra issues.


Happy Searching,

Mike McCandless

http://blog.mikemccandless.com


Re: The need for a Lucene 9.9.2 release

2024-01-23 Thread Michael Wechner

thanks for discovering and fixing!

Michael

On 23.01.24 at 18:36, Chris Hegarty wrote:

Hi,

We’ve encountered a serious issue with the recent Lucene 9.9.1 release, which 
warrants a 9.9.2.

The issue is an NPE when sampling for quantization in 
Lucene99HnswScalarQuantizedVectorsFormat [1]. Thankfully Ben has already 
resolved the issue, and backported it to the appropriate branches.

I don’t see any other potential issues that would warrant being pulled into 
this release.

I’m happy to be Release Manager for 9.9.2 (given my recent experience on 
9.9.1). I’ll start the release process tomorrow and notify this list when 
artifacts are ready.

Thanks,
-Chris.

[1] https://github.com/apache/lucene/pull/13027

Re: Welcome Stefan Vodita as Lucene committer

2024-01-19 Thread Michael Wechner
Hi Stefan, thank you very much for your contributions and helping to 
improve Lucene!


All the best

Michael

On 19.01.24 at 20:03, Stefan Vodita wrote:

Thank you all! It's an honor to join the project as a committer.

I'm originally from a small town in southern Romania, so I'm really looking
forward to seeing #12172 resolved, since both the characters in question (ș, ț)
are supposed to show up in my name.

In university, I had professors who contributed to open software and I was
lucky enough to be given a taste of the open source world. I had become a
teaching assistant for a few of the courses (Data Structures, Control 
Theory),
and it had crossed my mind to stay at the university. Then I got an offer to
come work at Amazon, in Ireland. They gave me a list of teams I could join that
only had the names of the teams - I thought Search Engine Tech sounded the
coolest. I was right! That's how I first learned about Lucene and started
working with/on it. It's a privilege, Lucene is an amazing piece of software and
I'm proud to be contributing.

Outside programming, I like history and philosophy. I've been a voracious
reader basically since I learned how to read. Recently, I've been going down
a spiral of increasingly obscure books, but nothing has topped Dostoevsky's
classic, The Brothers Karamazov. Knowing books also happens to be useful
for thinking up faceting examples, so that's a plus.
When I was in middle-school, I half-willingly went through 4 years of classical
guitar training and was left with a life-long desire to be a good musician
despite my inconsistent practice habits. Practice will have to wait until I
finish up the next PR - looking forward to many more in the future!

Cheers,
Stefan

On Thu, 18 Jan 2024 at 15:56, Michael McCandless 
 wrote:


Hi Team,

I'm pleased to announce that Stefan Vodita has accepted the Lucene
PMC's invitation to become a committer!

Stefan, the tradition is that new committers introduce themselves
with a brief bio.

Congratulations, welcome, and thank you for all your improvements
to Lucene and our community,

Mike McCandless

http://blog.mikemccandless.com



Re: SPLADE implementation

2023-11-15 Thread Michael Wechner
I got it running now :-) thanks again. See the code below, which 
might help others as well.

I don't quite understand the correlation between weights, scores, etc. 
yet, but will try to figure it out from the documentation at


https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/document/FeatureField.html

Thanks

Michael

String question = "What animals live in the rainforests of Brazil?";

Query questionQuery = parser.parse(question);

List<String> features = getFeatures(question); // For example "jungle" as an alternative to "rainforests"
if (features.size() > 0) {
    BooleanQuery.Builder bqb = new BooleanQuery.Builder();
    bqb.add(questionQuery, BooleanClause.Occur.SHOULD);
    for (String feature : features) {
        // TODO: Replace hard-coded weight
        bqb.add(new BooleanClause(FeatureField.newLinearQuery("feature_field_name", feature, 0.3F), BooleanClause.Occur.SHOULD));
    }
    BooleanQuery termExpansionQuery = bqb.build();
    log.info("Term expansion query: " + termExpansionQuery);
    return termExpansionQuery;
} else {
    log.info("Regular query: " + questionQuery);
    return questionQuery;
}
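
The index-time counterpart of Adrien's recipe (quoted below) might look
something like this (a sketch, assuming an open IndexWriter; the terms and
weights are illustrative model output):

// Index time: one FeatureField per (term, weight) pair produced by the model.
Document doc = new Document();
doc.add(new FeatureField("feature_field_name", "jungle", 0.83F));
doc.add(new FeatureField("feature_field_name", "rainforest", 1.2F));
indexWriter.addDocument(doc);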



On 15.11.23 at 11:35, Michael Wechner wrote:

thank you very much, will try this :-)


On 15.11.23 at 11:25, Adrien Grand wrote:

Say your model produces a set of weighted terms:
 - At index time, for each (term, weight) pair, you add a "new 
FeatureField(fieldName, term, weight)" field to your document.
 - At search time, for each (term, weight) pair, you add a "new 
BooleanClause(FeatureField.newLinearQuery(fieldName, term, weight))" 
to your BooleanQuery.


On Wed, Nov 15, 2023 at 11:08 AM Michael Wechner 
 wrote:


Hi Adrien

Ah ok, I did not realize this, thanks for pointing this out!

I don't quite understand though, how you would implement the
"SPLADE" approach using FeatureField from the documentation at


https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/document/FeatureField.html

For example when indexing a document or doing a query and I use
some language model (e.g. BERT) to do the term expansion, how
do I then make use of FeatureField exactly?

I tried to find some code examples, but couldn't, do you maybe
have some pointers?

Thanks

Michael


On 15.11.23 at 10:34, Adrien Grand wrote:

Hi Michael,

What functionality are you missing? Lucene already supports
indexing/querying weighted terms using FeatureField.

On Wed, Nov 15, 2023 at 10:03 AM Michael Wechner
 wrote:

Hi

I have found the following issue re a possible SPLADE
implementation

https://github.com/apache/lucene/issues/11799

Is somebody still working on this?

Thanks

Michael






-- 
Adrien




--
Adrien




Re: SPLADE implementation

2023-11-15 Thread Michael Wechner

thank you very much, will try this :-)


On 15.11.23 at 11:25, Adrien Grand wrote:

Say your model produces a set of weighted terms:
 - At index time, for each (term, weight) pair, you add a "new 
FeatureField(fieldName, term, weight)" field to your document.
 - At search time, for each (term, weight) pair, you add a "new 
BooleanClause(FeatureField.newLinearQuery(fieldName, term, weight))" 
to your BooleanQuery.


On Wed, Nov 15, 2023 at 11:08 AM Michael Wechner 
 wrote:


Hi Adrien

Ah ok, I did not realize this, thanks for pointing this out!

I don't quite understand though, how you would implement the
"SPLADE" approach using FeatureField from the documentation at


https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/document/FeatureField.html

For example when indexing a document or doing a query and I use
some language model (e.g. BERT) to do the term expansion, how
do I then make use of FeatureField exactly?

I tried to find some code examples, but couldn't, do you maybe
have some pointers?

Thanks

Michael


On 15.11.23 at 10:34, Adrien Grand wrote:

Hi Michael,

What functionality are you missing? Lucene already supports
indexing/querying weighted terms using FeatureField.

On Wed, Nov 15, 2023 at 10:03 AM Michael Wechner
 wrote:

Hi

I have found the following issue re a possible SPLADE
implementation

https://github.com/apache/lucene/issues/11799

Is somebody still working on this?

Thanks

Michael






-- 
Adrien




--
Adrien


Re: SPLADE implementation

2023-11-15 Thread Michael Wechner

Hi Adrien

Ah ok, I did not realize this, thanks for pointing this out!

I don't quite understand though, how you would implement the "SPLADE" 
approach using FeatureField from the documentation at


https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/document/FeatureField.html

For example when indexing a document or doing a query and I use some 
language model (e.g. BERT) to do the term expansion, how do I then make 
use of FeatureField exactly?

I tried to find some code examples, but couldn't, do you maybe have some 
pointers?


Thanks

Michael


On 15.11.23 at 10:34, Adrien Grand wrote:

Hi Michael,

What functionality are you missing? Lucene already supports 
indexing/querying weighted terms using FeatureField.


On Wed, Nov 15, 2023 at 10:03 AM Michael Wechner 
 wrote:


Hi

I have found the following issue re a possible SPLADE implementation

https://github.com/apache/lucene/issues/11799

Is somebody still working on this?

Thanks

Michael






--
Adrien


SPLADE implementation

2023-11-15 Thread Michael Wechner

Hi

I have found the following issue re a possible SPLADE implementation

https://github.com/apache/lucene/issues/11799

Is somebody still working on this?

Thanks

Michael






Re: Quantization for vector search

2023-11-04 Thread Michael Wechner

Hi Ben

On 04.11.23 at 14:41, Benjamin Trent wrote:

Hey Michael,

In short, it's being worked on :).


cool, thanks!



Could you point to the LinkedIn post?


https://www.linkedin.com/posts/reimersnils_%3F%3F%3F%3F%3F%3F-%3F%3F%3F%3F%3F-%3F%3F-%3F%3F%3F-%3F%3F%3F-activity-7125863813064581120-bO6N/?utm_source=share_medium=member_desktop


Is Nils talking about the model outputting quantized output, or that their 
default output is easily compressible because of how the embeddings 
are built?


it is not clear to me from the post, but maybe you understand the post 
(link above) better




I have done a bad job of linking the ongoing work back to that 
original issue:


The initial implementation of adding int8 (really, it's int7 because of 
signed bytes...): https://github.com/apache/lucene/pull/12582


A significant refactor to make adding new quantized storage easier: 
https://github.com/apache/lucene/pull/12729


Lucene already supports folks just giving it signed `byte[]` values, 
but this only gets you so far. The additional work should get Lucene 
further down the road towards better lossy compression for vectors.
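
A minimal sketch of that existing byte[] path (the field name and values are
made up; the application does its own quantization to signed bytes and Lucene
just indexes them):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnByteVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;

byte[] vector = new byte[] {12, -3, 54, 101, -77}; // pre-quantized by the application
Document doc = new Document();
doc.add(new KnnByteVectorField("embedding", vector, VectorSimilarityFunction.DOT_PRODUCT));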


very cool, thank you!

All the best

Michael





Thanks!

Ben

On Sat, Nov 4, 2023 at 4:07 AM Michael Wechner 
 wrote:


Hi

If I understand correctly some devs are working on introducing
quantization for vector search or at least considering it

https://github.com/apache/lucene/issues/12497

Just curious: what is the status on this, and is somebody
working on this actively?


It came to mind because Cohere recently made their new
embedding model "Embed v3" available

https://txt.cohere.com/introducing-embed-v3/

and IIUC, Cohere intends to also provide embeddings optimized
for compression soon.

Nils Reimers recently wrote on LinkedIn:


"... what we see on the BioASQ dataset:
4x - 99.99% search quality
16x - 99.9% search quality
32x - 95% search quality
64x - 85% search quality
But it requires that the respective vector DB supports these
modes, what we currently work on with partners."


This might be interesting for Lucene as well; I am not sure
whether somebody at Lucene is already working on something like this.

Thanks

Michael



Quantization for vector search

2023-11-04 Thread Michael Wechner

Hi

If I understand correctly some devs are working on introducing 
quantization for vector search or at least considering it


https://github.com/apache/lucene/issues/12497

Just curious: what is the status on this, and is somebody working 
on this actively?



It came to mind because Cohere recently made their new embedding 
model "Embed v3" available

https://txt.cohere.com/introducing-embed-v3/

and IIUC, Cohere intends to also provide embeddings optimized for 
compression soon.


Nils Reimers recently wrote on LinkedIn:


"... what we see on the BioASQ dataset:
4x - 99.99% search quality
16x - 99.9% search quality
32x - 95% search quality
64x - 85% search quality
But it requires that the respective vector DB supports these modes, what 
we currently work on with partners."



This might be interesting for Lucene as well; I am not sure 
whether somebody at Lucene is already working on something like this.


Thanks

Michael

Re: Update TermInSetQuery Example?

2023-10-21 Thread Michael Wechner

Hi Uwe

Thanks for the hints re the other source code samples, will do this and 
will create a PR.


Thanks

Michael

On 21.10.23 at 09:52, Uwe Schindler wrote:

Hi Michael,

Go ahead. Maybe scan through the remaining source files with a 
grep/regex:


$ fgrep -R 'new BooleanQuery(' *
lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java: return new BooleanQuery(minimumNumberShouldMatch, clauses.toArray(new BooleanClause[0]));
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java: * BooleanQuery bq = new BooleanQuery();
lucene/queries/src/java/org/apache/lucene/queries/spans/package-info.java: * Query query = new BooleanQuery();
lucene/spatial-extras/src/java/org/apache/lucene/spatial/bbox/BBoxStrategy.java: // BooleanQuery qNotDisjoint = new BooleanQuery();


The first one is a false positive (the builder calls the BQ ctor), but 
all others should possibly be fixed. There may be other combinations 
not detected because of source code formatting.


Uwe

On 20.10.2023 at 23:46, Michael Wechner wrote:

Hi

I recently found TermInSetQuery example at

https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/search/TermInSetQuery.html 



but if I understand correctly one should now use BooleanQuery.Builder 
instead of BooleanQuery itself, right?


BooleanQuery.Builder bqb = new BooleanQuery.Builder();
bqb.add(new TermQuery(new Term("field", "foo")), BooleanClause.Occur.SHOULD);
bqb.add(new TermQuery(new Term("field", "bar")), BooleanClause.Occur.SHOULD);

Query q2 = new ConstantScoreQuery(bqb.build());

If so, I would be happy to do a minor pull request or feel free to 
update it directly.


Thanks

Michael


Update TermInSetQuery Example?

2023-10-20 Thread Michael Wechner

Hi

I recently found TermInSetQuery example at

https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/search/TermInSetQuery.html

but if I understand correctly one should now use BooleanQuery.Builder 
instead of BooleanQuery itself, right?


BooleanQuery.Builder bqb = new BooleanQuery.Builder();
bqb.add(new TermQuery(new Term("field", "foo")), BooleanClause.Occur.SHOULD);
bqb.add(new TermQuery(new Term("field", "bar")), BooleanClause.Occur.SHOULD);

Query q2 = new ConstantScoreQuery(bqb.build());
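
For comparison, a sketch of the equivalent TermInSetQuery itself (using the
9.x constructor that takes a collection of BytesRef terms; imports assumed:
java.util.List, org.apache.lucene.util.BytesRef):

Query q1 = new TermInSetQuery("field", List.of(new BytesRef("foo"), new BytesRef("bar")));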

If so, I would be happy to do a minor pull request or feel free to 
update it directly.


Thanks

Michael


Re: Multimodal search

2023-10-16 Thread Michael Wechner
btw, here are some other examples of hybrid search implementations, 
using RRF


https://weaviate.io/blog/hybrid-search-explained
https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking
https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html

but as written below, I don't think this really addresses the problem of 
accuracy at its core.


Thanks

Michael


On 15.10.23 at 21:05, Michael Wechner wrote:

Hi Navneet

I also observe that various "vector search DBs" are implementing 
hybrid search, because the accuracy with embeddings is often not good 
enough.
Vectors are often too "mushy" and hybrid search can help to improve 
accuracy, just as re-ranking does, but I think there is a better way.


Depending on the dataset and the expertise of a human, answers by 
"humans" are much more accurate, because I think "humans" extract 
"features" from input and then operate on these "features". 
See for example


https://medium.com/aleph-alpha-blog/multimodality-attention-is-all-you-need-is-all-we-needed-526c45abdf0

and see the principles behind DALL-E and CLIP.

I think the same or similar principles could be re-used to implement a 
more accurate search.


I have built a very simple PoC and it looks promising, that using this 
approach provides a much higher accuracy, because the similarity score 
is much more distinct.


Of course there are various challenges, but I think it is worth exploring.

I also understand that within an existing "ecosystem", change, i.e. 
trying something new, can be difficult, but I guess I am not the only 
one seeing low accuracy as a fundamental problem, right?


Thanks

Michael





On 14.10.23 at 09:38, Navneet Verma wrote:

Hi Michael,
Please correct me if I am wrong: I think what you are trying to say 
with multimodal search is to combine both text search and vector 
search to improve the accuracy of search results. As per my 
understanding of the search space, people are coining this "hybrid 
search". We recently launched a query clause in OpenSearch called 
"hybrid" which takes this hybrid approach and combines scores of text 
and vector search 
globally (https://opensearch.org/blog/hybrid-search/). As per our 
experiments we saw accuracy being better than text search and vector 
search alone. Just curious if you are thinking something like this or 
you have a completely different thought.


I agree that currently to improve the accuracy of search results 
there have been techniques like re-ranking that are very popular.



Thanks
Navneet

On Fri, Oct 13, 2023 at 8:53 AM Michael Wechner 
 wrote:


Thanks for your feedback and the link to the OpenSearch
implementation!

I think the embedding approach as it exists today is not and will
not be able to provide good enough accuracy.
Many people try to fix this with re-ranking, which helps, but
does not really fix the actual problem.

I think we focus too much on text, because text/language is
actually just a representation of the "models" we create in our
minds from the reality we perceive via our senses.

When you take multimodality into account from the very beginning,
then you will be forced to approach search differently
and I would argue that this will lead to a much more powerful
search implementation, which is able to provide better accuracy
and also the capability that the implementation knows much better
what it does not know.

I do not mean to sound philosophical, but I actually have a quite
clear implementation in my mind, or rather on paper, and I would be
interested to know whether the Lucene community is interested in
reconsidering search from the ground up.

I think the Lucene community has fantastic knowledge /
expertise, but I think it is time to evolve quite radically, and
not just do another vector search implementation.

WDYT?

Thanks

Michael







On 13.10.23 at 00:49, Michael Froh wrote:

We recently added multimodal search in OpenSearch:
https://github.com/opensearch-project/neural-search/pull/359

Since Lucene ultimately just cares about embeddings, does Lucene
itself really need to be multimodal? Wherever the embeddings
come from, Lucene can index the vectors and combine with textual
queries, right?

Thanks,
Froh

On Thu, Oct 12, 2023 at 12:59 PM Michael Wechner
 wrote:

Hi

Did anyone of the Lucene committers consider making Lucene
multimodal?

With a quick Google search I found for example

https://dl.acm.org/doi/abs/10.1145/3503161.3548768

https://sigir-ecom.github.io/ecom2018/ecom18Papers/paper7.pdf

Thanks

Michael




Re: Multimodal search

2023-10-15 Thread Michael Wechner

Hi Navneet

I also observe that various "vector search DBs" are implementing hybrid 
search, because the accuracy with embeddings is often not good enough.
Vectors are often too "mushy" and hybrid search can help to improve 
accuracy, just as re-ranking does, but I think there is a better way.


Depending on the dataset and the expertise of a human, answers by 
"humans" are much more accurate, because I think "humans" extract 
"features" from input and then operate on these "features". See for example


https://medium.com/aleph-alpha-blog/multimodality-attention-is-all-you-need-is-all-we-needed-526c45abdf0

and see the principles behind DALL-E and CLIP.

I think the same or similar principles could be re-used to implement a 
more accurate search.


I have built a very simple PoC and it looks promising, that using this 
approach provides a much higher accuracy, because the similarity score 
is much more distinct.


Of course there are various challenges, but I think it is worth exploring.

I also understand that within an existing "ecosystem", change, i.e. 
trying something new, can be difficult, but I guess I am not the only one 
seeing low accuracy as a fundamental problem, right?


Thanks

Michael





On 14.10.23 at 09:38, Navneet Verma wrote:

Hi Michael,
Please correct me if I am wrong: I think what you are trying to say 
with multimodal search is to combine both text search and vector 
search to improve the accuracy of search results. As per my 
understanding of the search space, people are coining this "hybrid 
search". We recently launched a query clause in OpenSearch called 
"hybrid" which takes this hybrid approach and combines scores of text 
and vector search 
globally (https://opensearch.org/blog/hybrid-search/). As per our 
experiments we saw accuracy being better than text search and vector 
search alone. Just curious if you are thinking something like this or 
you have a completely different thought.


I agree that currently to improve the accuracy of search results there 
have been techniques like re-ranking that are very popular.



Thanks
Navneet

On Fri, Oct 13, 2023 at 8:53 AM Michael Wechner 
 wrote:


Thanks for your feedback and the link to the OpenSearch
implementation!

I think the embedding approach as it exists today is not and will
not be able to provide good enough accuracy.
Many people try to fix this with re-ranking, which helps, but does
not really fix the actual problem.

I think we focus too much on text, because text/language is
actually just a representation of the "models" we create in our
minds from the reality we perceive via our senses.

When you take multimodality into account from the very beginning,
then you will be forced to approach search differently
and I would argue that this will lead to a much more powerful
search implementation, which is able to provide better accuracy
and also the capability that the implementation knows much better
what it does not know.

I do not mean to sound philosophical, but I actually have a quite
clear implementation in my mind, or rather on paper, and I would be
interested to know whether the Lucene community is interested in
reconsidering search from the ground up.

I think the Lucene community has fantastic knowledge /
expertise, but I think it is time to evolve quite radically, and
not just do another vector search implementation.

WDYT?

Thanks

Michael







On 13.10.23 at 00:49, Michael Froh wrote:

We recently added multimodal search in OpenSearch:
https://github.com/opensearch-project/neural-search/pull/359

Since Lucene ultimately just cares about embeddings, does Lucene
itself really need to be multimodal? Wherever the embeddings come
from, Lucene can index the vectors and combine with textual
queries, right?

Thanks,
Froh

On Thu, Oct 12, 2023 at 12:59 PM Michael Wechner
 wrote:

Hi

Did anyone of the Lucene committers consider making Lucene
multimodal?

With a quick Google search I found for example

https://dl.acm.org/doi/abs/10.1145/3503161.3548768

https://sigir-ecom.github.io/ecom2018/ecom18Papers/paper7.pdf

Thanks

Michael




Re: Multimodal search

2023-10-13 Thread Michael Wechner

Thanks for your feedback and the link to the OpenSearch implementation!

I think the embedding approach as it exists today is not and will not be 
able to provide good enough accuracy.
Many people try to fix this with re-ranking, which helps, but does not 
really fix the actual problem.


I think we focus too much on text, because text/language is actually 
just a representation of the "models" we create in our minds from the 
reality we perceive via our senses.


When you take multimodality into account from the very beginning, then 
you will be forced to approach search differently
and I would argue that this will lead to a much more powerful search 
implementation, which is able to provide better accuracy and also the 
capability that the implementation knows much better what it does not know.


I do not mean to sound philosophical, but I actually have a quite clear 
implementation in my mind, or rather on paper, and I would be interested
to know whether the Lucene community is interested in reconsidering search 
from the ground up.


I think the Lucene community has fantastic knowledge / expertise, but 
I think it is time to evolve quite radically, and not just do another 
vector search implementation.


WDYT?

Thanks

Michael







On 13.10.23 at 00:49, Michael Froh wrote:
We recently added multimodal search in OpenSearch: 
https://github.com/opensearch-project/neural-search/pull/359


Since Lucene ultimately just cares about embeddings, does Lucene 
itself really need to be multimodal? Wherever the embeddings come 
from, Lucene can index the vectors and combine with textual queries, 
right?


Thanks,
Froh

On Thu, Oct 12, 2023 at 12:59 PM Michael Wechner 
 wrote:


Hi

Did anyone of the Lucene committers consider making Lucene multimodal?

With a quick Google search I found for example

https://dl.acm.org/doi/abs/10.1145/3503161.3548768

https://sigir-ecom.github.io/ecom2018/ecom18Papers/paper7.pdf

Thanks

Michael




Multimodal search

2023-10-12 Thread Michael Wechner

Hi

Did anyone of the Lucene committers consider making Lucene multimodal?

With a quick Google search I found for example

https://dl.acm.org/doi/abs/10.1145/3503161.3548768

https://sigir-ecom.github.io/ecom2018/ecom18Papers/paper7.pdf

Thanks

Michael




Vector Search with OpenAI Embeddings: Lucene Is All You Need

2023-08-31 Thread Michael Wechner

Hi Together

You might be interesed in this paper / article

https://arxiv.org/abs/2308.14963

Thanks

Michael


Re: Lucene 9.7 release

2023-06-09 Thread Michael Wechner

Thank you very much, Adrien!

On 09.06.23 at 18:20, Tomás Fernández Löbbe wrote:

+1
Thanks Adrien

On Fri, Jun 9, 2023 at 9:19 AM Michael McCandless 
 wrote:


+1, thanks Adrien!

Mike McCandless

http://blog.mikemccandless.com


On Fri, Jun 9, 2023 at 12:11 PM Patrick Zhai 
wrote:

+1, thank you Adrien!

On Fri, Jun 9, 2023, 09:08 Adrien Grand  wrote:

Hello all,

There is some good stuff that is scheduled for 9.7
already, I found the following changes in the changelog
that look especially interesting:
 - Concurrent query rewrites for vector queries.
 - Speedups to vector indexing/search via integration of
the Panama vector API.
 - Reduced overhead of soft deletes.
 - Support for update by query.

I propose we start the process for a 9.7 release, and I
volunteer to be the release manager. I suggest the
following schedule:
 - Feature freeze on June 16th, one week from now. This is
when the 9.7 branch will be cut.
 - Open a vote on June 21st, which we'll possibly delay if
blockers get identified.

-- 
Adrien




Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-18 Thread Michael Wechner



On 18.05.23 at 12:22, Michael McCandless wrote:


I love all the energy and passion going into debating all the ways to 
poke at this limit, but please let's also spend some of this passion 
on actually improving the scalability of our aKNN implementation!  
E.g. Robert opened an exciting "Plan B" ( 
https://github.com/apache/lucene/issues/12302 ) to workaround 
OpenJDK's crazy slowness on enabling access to vectorized SIMD CPU 
instructions (the Java Vector API, JEP 426: 
https://openjdk.org/jeps/426 ).  This could help postings and doc 
values performance too!



agreed, but I do not think the MAX_DIMENSIONS decision should depend on 
this, because whatever improvements are eventually accomplished, very 
likely there will always be some limit.


Thanks

Michael



Mike McCandless

http://blog.mikemccandless.com


On Thu, May 18, 2023 at 5:24 AM Alessandro Benedetti 
 wrote:


That's great and a good plan B, but let's try to keep this thread
focused on collecting votes for a week (let's keep discussions on the
nice PR opened by David or on the discussion thread we already have
in the mailing list :)

On Thu, 18 May 2023, 10:10 Ishan Chattopadhyaya,
 wrote:

That sounds promising, Michael. Can you share
scripts/steps/code to reproduce this?

On Thu, 18 May, 2023, 1:16 pm Michael Wechner,
 wrote:

I just implemented it and tested it with OpenAI's
text-embedding-ada-002, which uses 1536 dimensions, and
it works just fine :-)

Thanks

Michael



On 18.05.23 at 00:29, Michael Wechner wrote:

IIUC KnnVectorField is deprecated and one is supposed to
use KnnFloatVectorField when using float as vector
values, right?

On 17.05.23 at 16:41, Michael Sokolov wrote:

see https://markmail.org/message/kf4nzoqyhwacb7ri

On Wed, May 17, 2023 at 10:09 AM David Smiley
 wrote:

> easily be circumvented by a user

This is a revelation to me and others, if true. 
Michael, please then point to a test or code snippet
that shows the Lucene user community what they want
to see so they are unblocked from their explorations
of vector search.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
 wrote:

I think I've said before on this list we don't
actually enforce the limit in any way that can't
easily be circumvented by a user. The codec
already supports any size vector - it doesn't
impose any limit. The way the API is written you
can *already today* create an index with max-int
sized vectors and we are committed to supporting
that going forward by our backwards
compatibility policy as Robert points out. This
wasn't intentional, I think, but it is the facts.

Given that, I think this whole discussion is not
really necessary.

On Tue, May 16, 2023 at 4:50 AM Alessandro
Benedetti  wrote:

Hi all,
we have finalized all the options proposed
by the community and we are ready to vote
for the preferred one and then proceed with
the implementation.

*Option 1*
Keep it as it is (dimension limit hardcoded
to 1024)
*Motivation*:
We are close to improving on many fronts.
Given the criticality of Lucene in computing
infrastructure and the concerns raised by
one of the most active stewards of the
project, I think we should keep working
toward improving the feature as is and move
to up the limit after we can demonstrate
improvement unambiguously.

*Option 2*
make the limit configurable, for example
through a system property
*Motivation*:
The system administrator can enforce a limit
its users need to respect that it's in line
with whatever the admin decided to be
acceptable for them.
The default can stay the current one.
This should open the doo

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-18 Thread Michael Wechner

It is basically the code which Michael Sokolov posted at

https://markmail.org/message/kf4nzoqyhwacb7ri

except
 - that I have replaced KnnVectorField by KnnFloatVectorField, because 
KnnVectorField is deprecated.
 - that I don't hard-code the dimension as 2048 and the metric as 
EUCLIDEAN, but take the dimension and metric (VectorSimilarityFunction) 
used by the model, which in the case of, for example, 
text-embedding-ada-002 are 1536 and COSINE 
(https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use)
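
For illustration, a minimal sketch of what that indexing code might look
like (class and field names here are mine, and it assumes a Lucene build
whose max-dimension limit admits 1536 dims, which is exactly what this
thread is debating):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.FSDirectory;

public class IndexEmbeddingSketch {
  public static void main(String[] args) throws Exception {
    // Dimension and metric come from the model; for text-embedding-ada-002
    // that is 1536 dims and COSINE.
    int dims = 1536;
    VectorSimilarityFunction metric = VectorSimilarityFunction.COSINE;
    float[] embedding = new float[dims]; // in reality: the vector returned by the embedding API

    try (FSDirectory dir = FSDirectory.open(Paths.get("embedding-index"));
        IndexWriter writer =
            new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      // Throws IllegalArgumentException if dims exceed the configured limit.
      doc.add(new KnnFloatVectorField("embedding", embedding, metric));
      doc.add(new StoredField("text", "the indexed text"));
      writer.addDocument(doc);
    }
  }
}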


HTH

Michael



Am 18.05.23 um 11:10 schrieb Ishan Chattopadhyaya:
That sounds promising, Michael. Can you share scripts/steps/code to 
reproduce this?


On Thu, 18 May, 2023, 1:16 pm Michael Wechner, 
 wrote:


I just implemented it and tested it with OpenAI's
text-embedding-ada-002, which uses 1536 dimensions, and it
works very well :-)

Thanks

Michael



Am 18.05.23 um 00:29 schrieb Michael Wechner:

IIUC KnnVectorField is deprecated and one is supposed to use
KnnFloatVectorField when using float as vector values, right?

Am 17.05.23 um 16:41 schrieb Michael Sokolov:

see https://markmail.org/message/kf4nzoqyhwacb7ri

On Wed, May 17, 2023 at 10:09 AM David Smiley
 wrote:

> easily be circumvented by a user

This is a revelation to me and others, if true.  Michael,
please then point to a test or code snippet that shows the
Lucene user community what they want to see so they are
unblocked from their explorations of vector search.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
 wrote:

I think I've said before on this list we don't actually
enforce the limit in any way that can't easily be
circumvented by a user. The codec already supports any
size vector - it doesn't impose any limit. The way the
API is written you can *already today* create an index
with max-int sized vectors and we are committed to
supporting that going forward by our backwards
compatibility policy as Robert points out. This wasn't
intentional, I think, but those are the facts.

Given that, I think this whole discussion is not really
necessary.

On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
 wrote:

Hi all,
we have finalized all the options proposed by the
community and we are ready to vote for the preferred
one and then proceed with the implementation.

*Option 1*
Keep it as it is (dimension limit hardcoded to 1024)
*Motivation*:
We are close to improving on many fronts. Given the
criticality of Lucene in computing infrastructure
and the concerns raised by one of the most active
stewards of the project, I think we should keep
working toward improving the feature as is and move
to raise the limit after we can demonstrate
improvement unambiguously.

*Option 2*
Make the limit configurable, for example through a
system property.
*Motivation*:
The system administrator can enforce a limit that
its users need to respect and that is in line with
whatever the admin has decided to be acceptable for them.
The default can stay the current one.
This should open the doors for Apache Solr,
Elasticsearch, OpenSearch, and any sort of plugin
development.

*Option 3*
Move the max dimension limit down to an HNSW-specific
implementation. Once there, this limit
would not bind any other potential vector engine
alternative/evolution.
*Motivation*:
There seem to be contradictory
performance interpretations of the current HNSW
implementation. Some consider its performance OK,
some not, and it depends on the target data set and
use case. Increasing the max dimension limit where
it currently lives (in the top-level FloatVectorValues)
would not allow potential alternatives (e.g. for
other use-cases) to be based on a lower limit.

*Option 4*
Make it configurable and move it to an appropriate
place.
In particular, a simple
Integer.getInteger("lucene.hnsw.maxDimensions",
1024) should be enough.
*Motivation*:
Both are good and not mutually exclusive and could
happen in any order.

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-18 Thread Michael Wechner
I just implemented it and tested it with OpenAI's 
text-embedding-ada-002, which uses 1536 dimensions, and it works very 
well :-)


Thanks

Michael



Am 18.05.23 um 00:29 schrieb Michael Wechner:
IIUC KnnVectorField is deprecated and one is supposed to use 
KnnFloatVectorField when using float as vector values, right?


Am 17.05.23 um 16:41 schrieb Michael Sokolov:

see https://markmail.org/message/kf4nzoqyhwacb7ri

On Wed, May 17, 2023 at 10:09 AM David Smiley  wrote:

> easily be circumvented by a user

This is a revelation to me and others, if true. Michael, please
then point to a test or code snippet that shows the Lucene user
community what they want to see so they are unblocked from their
explorations of vector search.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
 wrote:

I think I've said before on this list we don't actually
enforce the limit in any way that can't easily be
circumvented by a user. The codec already supports any size
vector - it doesn't impose any limit. The way the API is
written you can *already today* create an index with max-int
sized vectors and we are committed to supporting that going
forward by our backwards compatibility policy as Robert
points out. This wasn't intentional, I think, but those
are the facts.

Given that, I think this whole discussion is not really
necessary.

On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
 wrote:

Hi all,
we have finalized all the options proposed by the
community and we are ready to vote for the preferred one
and then proceed with the implementation.

*Option 1*
Keep it as it is (dimension limit hardcoded to 1024)
*Motivation*:
We are close to improving on many fronts. Given the
criticality of Lucene in computing infrastructure and the
concerns raised by one of the most active stewards of the
project, I think we should keep working toward improving
the feature as is and move to raise the limit after we can
demonstrate improvement unambiguously.

*Option 2*
Make the limit configurable, for example through a system
property.
*Motivation*:
The system administrator can enforce a limit that its users
need to respect and that is in line with whatever the admin
has decided to be acceptable for them.
The default can stay the current one.
This should open the doors for Apache Solr,
Elasticsearch, OpenSearch, and any sort of plugin development.

*Option 3*
Move the max dimension limit down to an HNSW-specific
implementation. Once there, this limit would not
bind any other potential vector engine
alternative/evolution.
*Motivation*:
There seem to be contradictory performance
interpretations of the current HNSW implementation.
Some consider its performance OK, some not, and it
depends on the target data set and use case. Increasing
the max dimension limit where it currently lives (in the
top-level FloatVectorValues) would not allow
potential alternatives (e.g. for other use-cases) to be
based on a lower limit.

*Option 4*
Make it configurable and move it to an appropriate place.
In particular, a simple
Integer.getInteger("lucene.hnsw.maxDimensions",
1024) should be enough.
*Motivation*:
Both are good and not mutually exclusive and could happen
in any order.
Someone suggested perfecting what the _default_ limit
should be, but I've not seen an argument _against_
configurability. Especially in this way -- a toggle that
doesn't bind Lucene's APIs in any way.

I'll keep this [VOTE] open for a week and then proceed to
the implementation.
--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> |
Twitter <https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
Github <https://github.com/seaseltd>
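
As a concrete sketch of what Option 4 proposes (the enclosing class and
the validation helper below are hypothetical, not actual Lucene code;
only the Integer.getInteger call comes from the proposal):

public final class HnswLimits {
  // The limit is read once from a system property, falling back to
  // today's default of 1024.
  public static final int MAX_DIMENSIONS =
      Integer.getInteger("lucene.hnsw.maxDimensions", 1024);

  static void checkDimension(int dimension) {
    if (dimension <= 0 || dimension > MAX_DIMENSIONS) {
      throw new IllegalArgumentException(
          "vector dimension must be in (0, " + MAX_DIMENSIONS + "] but got " + dimension);
    }
  }
}

An admin would then opt in per JVM, e.g.
java -Dlucene.hnsw.maxDimensions=2048 ..., without any Lucene API change.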





Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-17 Thread Michael Wechner
IIUC KnnVectorField is deprecated and one is supposed to use 
KnnFloatVectorField when using float as vector values, right?


Am 17.05.23 um 16:41 schrieb Michael Sokolov:

see https://markmail.org/message/kf4nzoqyhwacb7ri

On Wed, May 17, 2023 at 10:09 AM David Smiley  wrote:

> easily be circumvented by a user

This is a revelation to me and others, if true. Michael, please
then point to a test or code snippet that shows the Lucene user
community what they want to see so they are unblocked from their
explorations of vector search.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, May 17, 2023 at 7:51 AM Michael Sokolov
 wrote:

I think I've said before on this list we don't actually
enforce the limit in any way that can't easily be circumvented
by a user. The codec already supports any size vector - it
doesn't impose any limit. The way the API is written you can
*already today* create an index with max-int sized vectors and
we are committed to supporting that going forward by our
backwards compatibility policy as Robert points out. This
wasn't intentional, I think, but those are the facts.

Given that, I think this whole discussion is not really necessary.

On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
 wrote:

Hi all,
we have finalized all the options proposed by the
community and we are ready to vote for the preferred one
and then proceed with the implementation.

*Option 1*
Keep it as it is (dimension limit hardcoded to 1024)
*Motivation*:
We are close to improving on many fronts. Given the
criticality of Lucene in computing infrastructure and the
concerns raised by one of the most active stewards of the
project, I think we should keep working toward improving
the feature as is and move to raise the limit after we can
demonstrate improvement unambiguously.

*Option 2*
Make the limit configurable, for example through a system
property.
*Motivation*:
The system administrator can enforce a limit that its users
need to respect and that is in line with whatever the admin
has decided to be acceptable for them.
The default can stay the current one.
This should open the doors for Apache Solr, Elasticsearch,
OpenSearch, and any sort of plugin development.

*Option 3*
Move the max dimension limit down to an HNSW-specific
implementation. Once there, this limit would not bind any
other potential vector engine alternative/evolution.
*Motivation*:
There seem to be contradictory performance
interpretations of the current HNSW implementation.
Some consider its performance OK, some not, and it depends
on the target data set and use case. Increasing the max
dimension limit where it currently lives (in the top-level
FloatVectorValues) would not allow potential alternatives
(e.g. for other use-cases) to be based on a lower limit.

*Option 4*
Make it configurable and move it to an appropriate place.
In particular, a simple
Integer.getInteger("lucene.hnsw.maxDimensions",
1024) should be enough.
*Motivation*:
Both are good and not mutually exclusive and could happen
in any order.
Someone suggested perfecting what the _default_ limit
should be, but I've not seen an argument _against_
configurability. Especially in this way -- a toggle that
doesn't bind Lucene's APIs in any way.

I'll keep this [VOTE] open for a week and then proceed to
the implementation.
--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io 
LinkedIn  |
Twitter  | Youtube
 |
Github 



Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-17 Thread Michael Wechner
I am trying to better understand the code. IIUC, vector MAX_DIMENSIONS is 
currently used inside


lucene/core/src/java/org/apache/lucene/document/FieldType.java
lucene/core/src/java/org/apache/lucene/document/KnnFloatVectorField.java
lucene/core/src/java/org/apache/lucene/document/KnnByteVectorField.java
lucene/core/src/java/org/apache/lucene/index/FloatVectorValues.java
    public static final int MAX_DIMENSIONS = 1024;
lucene/core/src/java/org/apache/lucene/index/ByteVectorValues.java
    public static final int MAX_DIMENSIONS = 1024;

and when you write that it should be moved to the HNSW-specific 
code, do you mean somewhere in


lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapByteVectorValues.java
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapFloatVectorValues.java
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
lucene/core/src/java/org/apache/lucene/util/hnsw/RandomAccessVectorValues.java

?

Thanks

Michael




Am 17.05.23 um 03:50 schrieb Robert Muir:
by the way, I agree with the idea to MOVE THE LIMIT UNCHANGED to the 
hnsw-specific code.


This way, someone can write an alternative codec with vectors using some 
other, completely different approach that incorporates a different, more 
appropriate limit (maybe lower, maybe higher) depending upon their 
tradeoffs. We should encourage this, as I think it is the "only true 
fix" to the scalability issues: use a scalable algorithm! Also, 
alternative codecs don't force the project into many years of index 
backwards compatibility, which is really my primary concern. We 
can lock ourselves into a truly bad place and become irrelevant 
(especially with scalar code implementing all this vector stuff, it is 
really senseless). In the meantime I suggest we try to reduce pain for 
the default codec with the current implementation, if possible. If it 
is not possible, we need a new codec that performs.


On Tue, May 16, 2023 at 8:53 PM Robert Muir  wrote:

Gus, I think I explained myself multiple times on issues and in
this thread. The performance is unacceptable, everyone knows it,
but nobody is talking about it.
I don't need to explain myself time and time again here.
You don't seem to understand the technical issues (at least you
sure as fuck don't know how service loading works or you wouldn't
have opened https://github.com/apache/lucene/issues/12300 )

I'm just the only one here completely unconstrained by any of
Silicon Valley's influences to speak my true mind, without any
repercussions, so I do it. I don't give any fucks about ChatGPT.

I'm standing by my technical veto. If you bypass it, I'll revert
the offending commit.

As far as fixing the technical performance goes, I just opened an issue
with some ideas to at least improve CPU usage by a factor of N. It
does not help with the crazy heap memory usage or other issues of the
KNN implementation causing shit like OOMs on merge. But it is one
step: https://github.com/apache/lucene/issues/12302



On Tue, May 16, 2023 at 7:45 AM Gus Heck  wrote:

Robert,

Can you explain in clear technical terms the standard that
must be met for performance? A benchmark that must run in X
time on Y hardware for example (and why that test is
suitable)? Or some other reproducible criteria? So far I've
heard you give an *opinion* that it's unusable, but that's not
a technical criterion; others may have a different concept of
what is usable to them.

Forgive me if I misunderstand, but the essence of your
argument has seemed to be

"Performance isn't good enough, therefore we should force
anyone who wants to experiment with something bigger to fork
the code base to do it"

Thus, it is necessary to have a clear unambiguous standard
that anyone can verify for "good enough". A clear standard
would also focus efforts at improvement.

Where are the goal posts?

FWIW I'm +1 on any of 2-4 since I believe the existence of a
hard limit is fundamentally counterproductive in an open
source setting, as it will lead to *fewer people* pushing
the limits. Extremely few people are going to get into the
nitty-gritty of optimizing things unless they are staring at
code that they can prove does something interesting, but doesn't
run fast enough for their purposes. If people hit a hard limit,
more of them give up and never develop the code that will
motivate them to look for optimizations.

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-16 Thread Michael Wechner

+1 to Gus' reply.

I think that Robert's veto or anyone else's veto is fair enough, but I 
also think that anyone who is vetoing should be very clear about the 
objectives / goals to be achieved, in order to get a +1.


If no clear objectives / goals can be defined and agreed on, then the 
whole thing becomes arbitrary.


Therefore I would also be interested to know which objectives / goals 
have to be met so that there will be a +1 on this vote.


Thanks

Michael



Am 16.05.23 um 13:45 schrieb Gus Heck:

Robert,

Can you explain in clear technical terms the standard that must be met 
for performance? A benchmark that must run in X time on Y hardware for 
example (and why that test is suitable)? Or some other reproducible 
criteria? So far I've heard you give an *opinion* that it's unusable, 
but that's not a technical criterion; others may have a different 
concept of what is usable to them.


Forgive me if I misunderstand, but the essence of your argument has 
seemed to be


"Performance isn't good enough, therefore we should force anyone who 
wants to experiment with something bigger to fork the code base to do it"


Thus, it is necessary to have a clear unambiguous standard that anyone 
can verify for "good enough". A clear standard would also focus 
efforts at improvement.


Where are the goal posts?

FWIW I'm +1 on any of 2-4 since I believe the existence of a hard 
limit is fundamentally counterproductive in an open source setting, as 
it will lead to *fewer people* pushing the limits. Extremely few 
people are going to get into the nitty-gritty of optimizing things 
unless they are staring at code that they can prove does something 
interesting, but doesn't run fast enough for their purposes. If people 
hit a hard limit, more of them give up and never develop the code that 
will motivate them to look for optimizations.


-Gus

On Tue, May 16, 2023 at 6:04 AM Robert Muir  wrote:

I still feel -1 (veto) on increasing this limit. Sending more
emails does not change the technical facts or make the veto go away.

On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti
 wrote:

Hi all,
we have finalized all the options proposed by the community
and we are ready to vote for the preferred one and then
proceed with the implementation.

*Option 1*
Keep it as it is (dimension limit hardcoded to 1024)
*Motivation*:
We are close to improving on many fronts. Given the
criticality of Lucene in computing infrastructure and the
concerns raised by one of the most active stewards of the
project, I think we should keep working toward improving the
feature as is and move to raise the limit after we can
demonstrate improvement unambiguously.

*Option 2*
Make the limit configurable, for example through a system property.
*Motivation*:
The system administrator can enforce a limit that its users
need to respect and that is in line with whatever the admin
has decided to be acceptable for them.
The default can stay the current one.
This should open the doors for Apache Solr, Elasticsearch,
OpenSearch, and any sort of plugin development.

*Option 3*
Move the max dimension limit down to an HNSW-specific
implementation. Once there, this limit would not bind any
other potential vector engine alternative/evolution.
*Motivation*:
There seem to be contradictory performance
interpretations of the current HNSW implementation. Some
consider its performance OK, some not, and it depends on the
target data set and use case. Increasing the max dimension
limit where it currently lives (in the top-level FloatVectorValues)
would not allow potential alternatives (e.g. for other
use-cases) to be based on a lower limit.

*Option 4*
Make it configurable and move it to an appropriate place.
In particular, a simple
Integer.getInteger("lucene.hnsw.maxDimensions", 1024)
should be enough.
*Motivation*:
Both are good and not mutually exclusive and could happen in
any order.
Someone suggested perfecting what the _default_ limit should
be, but I've not seen an argument _against_ configurability.
Especially in this way -- a toggle that doesn't bind Lucene's
APIs in any way.

I'll keep this [VOTE] open for a week and then proceed to the
implementation.
--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io 
LinkedIn  | Twitter

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-16 Thread Michael Wechner

My non-binding vote goes to Option 2 or Option 4.

Thanks

Michael Wechner


Am 16.05.23 um 10:51 schrieb Alessandro Benedetti:

My vote goes to *Option 4*.
--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter 
<https://twitter.com/seaseltd> | Youtube 
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github 
<https://github.com/seaseltd>



On Tue, 16 May 2023 at 09:50, Alessandro Benedetti 
 wrote:


Hi all,
we have finalized all the options proposed by the community and we
are ready to vote for the preferred one and then proceed with the
implementation.

*Option 1*
Keep it as it is (dimension limit hardcoded to 1024)
*Motivation*:
We are close to improving on many fronts. Given the criticality of
Lucene in computing infrastructure and the concerns raised by one
of the most active stewards of the project, I think we should keep
working toward improving the feature as is and move to raise the
limit after we can demonstrate improvement unambiguously.

*Option 2*
Make the limit configurable, for example through a system property.
*Motivation*:
The system administrator can enforce a limit that its users need to
respect and that is in line with whatever the admin has decided to
be acceptable for them.
The default can stay the current one.
This should open the doors for Apache Solr, Elasticsearch,
OpenSearch, and any sort of plugin development.

*Option 3*
Move the max dimension limit down to an HNSW-specific
implementation. Once there, this limit would not bind any other
potential vector engine alternative/evolution.
*Motivation*:
There seem to be contradictory performance
interpretations of the current HNSW implementation. Some
consider its performance OK, some not, and it depends on the
target data set and use case. Increasing the max dimension limit
where it currently lives (in the top-level FloatVectorValues) would not
allow potential alternatives (e.g. for other use-cases) to be
based on a lower limit.

*Option 4*
Make it configurable and move it to an appropriate place.
In particular, a simple
Integer.getInteger("lucene.hnsw.maxDimensions", 1024)
should be enough.
*Motivation*:
Both are good and not mutually exclusive and could happen in any
order.
Someone suggested perfecting what the _default_ limit should be,
but I've not seen an argument _against_ configurability.
Especially in this way -- a toggle that doesn't bind Lucene's APIs
in any way.

I'll keep this [VOTE] open for a week and then proceed to the
implementation.
--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
Github <https://github.com/seaseltd>



Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-16 Thread Michael Wechner

Hi Alessandro

Thank you very much for summarizing and starting the vote.

I am not sure whether I really understand the difference between Option 
2 and Option 4, or is it just about implementation details?


Thanks

Michael



Am 16.05.23 um 10:50 schrieb Alessandro Benedetti:

Hi all,
we have finalized all the options proposed by the community and we are 
ready to vote for the preferred one and then proceed with the 
implementation.


*Option 1*
Keep it as it is (dimension limit hardcoded to 1024)
*Motivation*:
We are close to improving on many fronts. Given the criticality of 
Lucene in computing infrastructure and the concerns raised by one of 
the most active stewards of the project, I think we should keep 
working toward improving the feature as is and move to raise the limit 
after we can demonstrate improvement unambiguously.

*Option 2*
Make the limit configurable, for example through a system property.
*Motivation*:
The system administrator can enforce a limit that its users need to 
respect and that is in line with whatever the admin has decided to be 
acceptable for them.
The default can stay the current one.
This should open the doors for Apache Solr, Elasticsearch, OpenSearch, 
and any sort of plugin development.

*Option 3*
Move the max dimension limit down to an HNSW-specific 
implementation. Once there, this limit would not bind any other 
potential vector engine alternative/evolution.
*Motivation*:
There seem to be contradictory performance 
interpretations of the current HNSW implementation. Some consider 
its performance OK, some not, and it depends on the target data set 
and use case. Increasing the max dimension limit where it currently 
lives (in the top-level FloatVectorValues) would not allow 
potential alternatives (e.g. for other use-cases) to be based on a 
lower limit.

*Option 4*
Make it configurable and move it to an appropriate place.
In particular, a 
simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be 
enough.
*Motivation*:
Both are good and not mutually exclusive and could happen in any order.
Someone suggested perfecting what the _default_ limit should be, but 
I've not seen an argument _against_ configurability. Especially in 
this way -- a toggle that doesn't bind Lucene's APIs in any way.


I'll keep this [VOTE] open for a week and then proceed to the 
implementation.

--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io 
LinkedIn  | Twitter 
 | Youtube 
 | Github 



Re: Dimensions Limit for KNN vectors - Next Steps

2023-05-09 Thread Michael Wechner

+1

Michael Wechner

Am 09.05.23 um 14:08 schrieb Alessandro Benedetti:


*Proposed option*: make the limit configurable
*Motivation*:
The system administrator can enforce a limit that its users need to 
respect and that is in line with whatever the admin has decided to be 
acceptable for them.

The default can stay the current one.
This should open the doors for Apache Solr, Elasticsearch, OpenSearch, 
and any sort of plugin development.

--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter 
<https://twitter.com/seaseltd> | Youtube 
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github 
<https://github.com/seaseltd>



On Tue, 9 May 2023 at 13:07, Alessandro Benedetti 
 wrote:


We had a very long-running (and heated) thread about this
(/[Proposal] Remove max number of dimensions for KNN vectors/).
Without repeating any of it, I recommend we move this forward in
this way:
*We stop any discussion* and everyone interested proposes an
option with a motivation, then we aggregate the options and create
a Vote.

_Please, DO NOT use this thread for anything else than your
proposed option._
All e-mails in this thread should be structured:
*Proposed Option:*
*Motivation:*

Let's keep this open for 1 week and then I'll aggregate the
options and set up the VOTE thread.
If you have anything else to add, please use the old thread.

Cheers

--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
Github <https://github.com/seaseltd>



Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-09 Thread Michael Wechner
I assumed that you would wrap Lucene into a minimal REST service or use 
Solr or Elasticsearch


Am 09.05.23 um 19:07 schrieb jim ferenczi:
Lucene is a library. I don’t see how it would be exposed in this 
plugin which is about services.



On Tue, 9 May 2023 at 18:00, Jun Luo  wrote:

The PR mentioned an Elasticsearch PR
<https://github.com/elastic/elasticsearch/pull/95257> that
increased the dims to 2048 in Elasticsearch.

Curious how you use Lucene's KNN search. Lucene's KNN supports one
vector per document. Usually multiple/many vectors are needed for
a document's content. We will have to split the document content
into chunks and create one Lucene document per document chunk.

The ChatGPT plugin directly stores the chunk text in the underlying
vector db. If there are lots of documents, will it be a concern to
store the full document content in Lucene? In the traditional
inverted index use case, is it common to store the full document
content in Lucene?

Another question: if you use Lucene as a vector db, do you still
need the inverted index? Wondering what would be the use case to
use inverted index together with vector index. If we don't need
the inverted index, will it be better to use other vector dbs? For
example, PostgreSQL also added vector support recently.

Thanks,
Jun

On Sat, May 6, 2023 at 1:44 PM Michael Wechner
 wrote:

there is already a pull request for Elasticsearch which also
mentions the max size of 1024

https://github.com/openai/chatgpt-retrieval-plugin/pull/83



Am 06.05.23 um 19:00 schrieb Michael Wechner:
> Hi Together
>
> I recently set up the ChatGPT retrieval plugin locally
>
> https://github.com/openai/chatgpt-retrieval-plugin
>
> I think it would be nice to consider submitting a Lucene
implementation
> for this plugin
>
>
https://github.com/openai/chatgpt-retrieval-plugin#future-directions
>
> The plugin is using by default OpenAI's model
"text-embedding-ada-002"
> with 1536 dimensions
>
> https://openai.com/blog/new-and-improved-embedding-model
>
> which means one won't be able to use it out of the box
with Lucene.
>
> Similar request here
>
>

https://learn.microsoft.com/en-us/answers/questions/1192796/open-ai-text-embedding-dimensions

>
>
> I understand we just recently had a lengthy discussion about
> increasing the max dimension and whatever one thinks of
OpenAI, fact
> is, that it has a huge impact and I think it would be nice
that Lucene
> could be part of this "revolution". All we have to do is
increase the
> limit from 1024 to 1536 or even 2048 for example.
>
> Since the performance seems to be linear with the vector
dimension and
> several members have done performance tests successfully and
1024
> seems to have been chosen as max dimension quite arbitrarily
in the
> first place, I think it should not be a problem to increase
the max
> dimension by a factor 1.5 or 2.
>
> WDYT?
>
> Thanks
>
> Michael
>
>
>
>
-
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-09 Thread Michael Wechner
Yes, you would split the document into multiple chunks; the 
ChatGPT retrieval plugin does this by itself, where AFAIK the default 
chunk size is 200 tokens 
(https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py).


Also it creates a unique ID for each document you upload, which is saved 
as "document_id" (at least for Weaviate) together with the chunk text.


Re a Lucene implementation, you might want to store the chunk text 
outside of the Lucene index using only a chunk id.


HTH

Michael
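
A rough sketch of that "one Lucene document per chunk" pattern (embed()
and the field names below are placeholders of mine, not a fixed schema):

import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.VectorSimilarityFunction;

public class ChunkIndexer {
  // One Lucene document per chunk; the chunk text itself can live outside
  // the index, keyed by chunk_id, as suggested above.
  static void indexChunks(IndexWriter writer, String documentId, List<String> chunks)
      throws Exception {
    for (int i = 0; i < chunks.size(); i++) {
      float[] vector = embed(chunks.get(i)); // placeholder: call your embedding model
      Document doc = new Document();
      doc.add(new StringField("document_id", documentId, Field.Store.YES));
      doc.add(new StringField("chunk_id", documentId + "#" + i, Field.Store.YES));
      doc.add(new KnnFloatVectorField("embedding", vector, VectorSimilarityFunction.COSINE));
      writer.addDocument(doc);
    }
  }

  static float[] embed(String text) {
    // Stand-in for the embedding API call (e.g. ada-002 -> 1536 dims).
    return new float[1536];
  }
}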

Am 09.05.23 um 04:14 schrieb Jun Luo:
The PR mentioned an Elasticsearch PR 
<https://github.com/elastic/elasticsearch/pull/95257> that increased 
the dims to 2048 in Elasticsearch.


Curious how you use Lucene's KNN search. Lucene's KNN supports one 
vector per document. Usually multiple/many vectors are needed for a 
document's content. We will have to split the document content into 
chunks and create one Lucene document per document chunk.


The ChatGPT plugin directly stores the chunk text in the underlying vector 
db. If there are lots of documents, will it be a concern to store the 
full document content in Lucene? In the traditional inverted index use 
case, is it common to store the full document content in Lucene?


Another question: if you use Lucene as a vector db, do you still need 
the inverted index? Wondering what would be the use case to use 
inverted index together with vector index. If we don't need the 
inverted index, will it be better to use other vector dbs? For 
example, PostgreSQL also added vector support recently.


Thanks,
Jun

On Sat, May 6, 2023 at 1:44 PM Michael Wechner 
 wrote:


there is already a pull request for Elasticsearch which also
mentions the max size of 1024

https://github.com/openai/chatgpt-retrieval-plugin/pull/83



Am 06.05.23 um 19:00 schrieb Michael Wechner:
> Hi Together
>
> I recently set up the ChatGPT retrieval plugin locally
>
> https://github.com/openai/chatgpt-retrieval-plugin
>
> I think it would be nice to consider submitting a Lucene
implementation
> for this plugin
>
> https://github.com/openai/chatgpt-retrieval-plugin#future-directions
>
> The plugin is using by default OpenAI's model
"text-embedding-ada-002"
> with 1536 dimensions
>
> https://openai.com/blog/new-and-improved-embedding-model
>
> which means one won't be able to use it out of the box with
Lucene.
>
> Similar request here
>
>

https://learn.microsoft.com/en-us/answers/questions/1192796/open-ai-text-embedding-dimensions

>
>
> I understand we just recently had a lengthy discussion about
> increasing the max dimension and whatever one thinks of OpenAI,
fact
> is, that it has a huge impact and I think it would be nice that
Lucene
> could be part of this "revolution". All we have to do is
increase the
> limit from 1024 to 1536 or even 2048 for example.
>
> Since the performance seems to be linear with the vector
dimension and
> several members have done performance tests successfully and 1024
> seems to have been chosen as max dimension quite arbitrarily in the
> first place, I think it should not be a problem to increase the max
> dimension by a factor 1.5 or 2.
>
> WDYT?
>
> Thanks
>
> Michael
>
>
>
>
-
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-06 Thread Michael Wechner
there is already a pull request for Elasticsearch which also 
mentions the max size of 1024


https://github.com/openai/chatgpt-retrieval-plugin/pull/83



Am 06.05.23 um 19:00 schrieb Michael Wechner:

Hi Together

I recently set up the ChatGPT retrieval plugin locally

https://github.com/openai/chatgpt-retrieval-plugin

I think it would be nice to consider submitting a Lucene implementation 
for this plugin


https://github.com/openai/chatgpt-retrieval-plugin#future-directions

The plugin is using by default OpenAI's model "text-embedding-ada-002" 
with 1536 dimensions


https://openai.com/blog/new-and-improved-embedding-model

which means one won't be able to use it out of the box with Lucene.

Similar request here

https://learn.microsoft.com/en-us/answers/questions/1192796/open-ai-text-embedding-dimensions 



I understand we just recently had a lengthy discussion about 
increasing the max dimension, and whatever one thinks of OpenAI, the fact 
is that it has a huge impact, and I think it would be nice if Lucene 
could be part of this "revolution". All we have to do is increase the 
limit from 1024 to 1536 or even 2048, for example.


Since the performance seems to be linear with the vector dimension, 
several members have done performance tests successfully, and 1024 
seems to have been chosen as the max dimension quite arbitrarily in the 
first place, I think it should not be a problem to increase the max 
dimension by a factor of 1.5 or 2.


WDYT?

Thanks

Michael



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-06 Thread Michael Wechner

Hi Together

I recently set up the ChatGPT retrieval plugin locally

https://github.com/openai/chatgpt-retrieval-plugin

I think it would be nice to consider submitting a Lucene implementation 
for this plugin


https://github.com/openai/chatgpt-retrieval-plugin#future-directions

The plugin is using by default OpenAI's model "text-embedding-ada-002" 
with 1536 dimensions


https://openai.com/blog/new-and-improved-embedding-model

which means one won't be able to use it out of the box with Lucene.

Similar request here

https://learn.microsoft.com/en-us/answers/questions/1192796/open-ai-text-embedding-dimensions

I understand we just recently had a lengthy discussion about increasing 
the max dimension, and whatever one thinks of OpenAI, the fact is that it 
has a huge impact, and I think it would be nice if Lucene could be part 
of this "revolution". All we have to do is increase the limit from 1024 
to 1536 or even 2048, for example.


Since the performance seems to be linear with the vector dimension, 
several members have done performance tests successfully, and 1024 seems 
to have been chosen as the max dimension quite arbitrarily in the first 
place, I think it should not be a problem to increase the max dimension 
by a factor of 1.5 or 2.


WDYT?

Thanks

Michael



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Seeking Tools and Methods to Measure Lucene's Indexing Performance

2023-05-06 Thread Michael Wechner

thanks for the pointer!

I have added it to the Lucene FAQ

https://cwiki.apache.org/confluence/display/LUCENE/LuceneFAQ#LuceneFAQ-HowisLucene'sindexingandsearchperformancemeasured?

Thanks

Michael



Am 06.05.23 um 06:18 schrieb Ishan Chattopadhyaya:

Check Lucene bench: https://home.apache.org/~mikemccand/lucenebench/

On Sat, 6 May, 2023, 9:30 am donghai tang,  wrote:

Hello Lucene Community,

I am in the process of learning about Lucene's indexing
capabilities, and I'm keen on conducting experiments to evaluate
its performance. However, I haven't come across any official tools
specifically designed for measuring Lucene's indexing performance.

I would be extremely grateful if any of you could share your
experiences with tools you've used in the past or suggest
alternative methods for evaluating Lucene's indexing performance.



Re: Concurrent HNSW index

2023-04-27 Thread Michael Wechner

+1 for a pull request

Thanks

Michael

Am 27.04.23 um 20:53 schrieb Ishan Chattopadhyaya:

+1, please contribute to Lucene. Thanks!

On Thu, 27 Apr, 2023, 10:59 pm Jonathan Ellis,  wrote:

Hi all,

I've created an HNSW index implementation that allows for
concurrent build and querying.  On my i9-12900 (8 performance
cores and 8 efficiency) I get a bit less than 10x speedup of wall
clock time for building and querying the "siftsmall" and "sift"
datasets from http://corpus-texmex.irisa.fr/. The small dataset is
10k vectors while the large is 1M. This speedup feels pretty good
for a data structure that isn't completely parallelizable, and
it's good to see that it's consistent as the dataset gets larger.

The concurrent classes achieve identical recall compared to the
non-concurrent versions within my ability to test it, and are
drop-in replacements for OnHeapHnswGraph and HnswGraphBuilder; I
use threadlocals to work around the places where the existing API
assumes no concurrency.

The concurrent classes also pass the existing test suite with the
exception of the RAM usage ones; the estimator doesn't know about
AtomicReference etc.  (Big thanks to Michael Sokolov for
testAknnDiverse which made it much easier to track down subtle
problems!)

My motivation is

1. It is faster to query a single on-heap HNSW index than to
query multiple such indexes and combine the results.
2. Even with some contention necessarily occurring during building
of the index, we still come out way ahead in terms of total
efficiency vs creating per-thread indexes and combining them.
Since combining such indexes boils down to "pick the largest and
then add all the other nodes normally," you don't really benefit
from having computed the others previously.

I am currently adding this to Cassandra as code in our repo, but
my preference would be to upstream it.  Is Lucene open to a pull
request?

-- 
Jonathan Ellis

co-founder, http://www.datastax.com
@spyced



Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-21 Thread Michael Wechner
Yes, they are, though it should help us to test performance and 
scalability :-)


Am 21.04.23 um 09:24 schrieb Ishan Chattopadhyaya:

Seems like they were all 768 dimensions.

On Fri, 21 Apr, 2023, 11:48 am Michael Wechner, 
 wrote:


Hi Together

Cohere just published approx. 100 million embeddings based on Wikipedia
content

https://txt.cohere.com/embedding-archives-wikipedia/

resp.

https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings
https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings


HTH

Michael



Am 13.04.23 um 07:58 schrieb Michael Wechner:

Hi Kent

Great, thank you very much!

Will download it later today :-)

All the best

Michael

Am 13.04.23 um 01:35 schrieb Kent Fitch:

Hi Michael (and anyone else who wants just over 240K "real
world" ada-002 vectors of dimension 1536),
you are welcome to retrieve a tar.gz file which contains:
- 47K embeddings of Canberra Times news article text from 1994
- 38K embeddings of the first paragraphs of wikipedia articles
about organisations
- 156.6K embeddings of the first paragraphs of wikipedia
articles about people


https://drive.google.com/file/d/13JP_5u7E8oZO6vRg0ekaTgBDQOaj-W00/view?usp=sharing

The file is about 1.7GB and will expand to about 4.4GB. This
file will be accessible for at least a week, and I hope you don't
hit any Google Drive download limits trying to retrieve it.

The embeddings were generated using my openAI account and you
are welcome to use them for any purpose you like.

best wishes,

Kent Fitch

On Wed, Apr 12, 2023 at 4:37 PM Michael Wechner
 wrote:

thank you very much for your feedback!

In a previous post (April 7) you wrote you could make
available the 47K ada-002 vectors, which would be great!

Would it make sense to set up a public GitHub repo, such that
others could use or also contribute vectors?

Thanks

    Michael Wechner


Am 12.04.23 um 04:51 schrieb Kent Fitch:

I only know some characteristics of the openAI ada-002
vectors, although they are very popular as
embeddings/text-characterisations as they allow more
accurate/"human meaningful" semantic search results with
fewer dimensions than their predecessors - I've evaluated a
few different embedding models, including some BERT
variants, CLIP ViT-L-14 (with 768 dims, which was quite
good), openAI's ada-001 (1024 dims) and babbage-001 (2048
dims), and ada-002 are qualitatively the best, although
that will certainly change!

In any case, ada-002 vectors have interesting
characteristics that I think mean you could confidently
create synthetic vectors which would be hard to distinguish
from "real" vectors.  I found this from looking at 47K
ada-002 vectors generated across a full year (1994) of
newspaper articles from the Canberra Times and 200K
wikipedia articles:
- there is no discernible/significant correlation between
values in any pair of dimensions
- all but 5 of the 1536 dimensions have an almost identical
distribution of values shown in the central blob on these
graphs (that just show a few of these 1531 dimensions with
clumped values and the 5 "outlier" dimensions, but all 1531
non-outlier dims are in there, which makes for some easy
quantisation from float to byte if you don't want to go the
full kmeans/clustering/Lloyds-algorithm approach):

https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing

https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing

https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
- the variance of the value of each dimension is
characteristic:

https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228

This probably represents something significant about how
the ada-002 embeddings are created, but I think it also
means creating "realistic" values is possible.  I did not
use this information when testing recall & performance on
Lucene's HNSW implementation on 192m documents, as I
slightly dithered the values of a "real" set on 47K docs
and stored other fields in the doc that referenced the
"base" document that the dithers were made from, and used
different dithering magnitudes so that I could test recall
with different neighbour sizes ("M"),
construction-beamwidth and search-beamwidths.

    best reg

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-21 Thread Michael Wechner

Hi Together

Cohere just published approx. 100 million embeddings based on Wikipedia content

https://txt.cohere.com/embedding-archives-wikipedia/

resp.

https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings
https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings


HTH

Michael



Am 13.04.23 um 07:58 schrieb Michael Wechner:

Hi Kent

Great, thank you very much!

Will download it later today :-)

All the best

Michael

Am 13.04.23 um 01:35 schrieb Kent Fitch:
Hi Michael (and anyone else who wants just over 240K "real world" 
ada-002 vectors of dimension 1536),

you are welcome to retrieve a tar.gz file which contains:
- 47K embeddings of Canberra Times news article text from 1994
- 38K embeddings of the first paragraphs of wikipedia articles about 
organisations
- 156.6K embeddings of the first paragraphs of wikipedia articles 
about people


https://drive.google.com/file/d/13JP_5u7E8oZO6vRg0ekaTgBDQOaj-W00/view?usp=sharing

The file is about 1.7GB and will expand to about 4.4GB. This file 
will be accessible for at least a week, and I hope you don't hit any 
Google Drive download limits trying to retrieve it.


The embeddings were generated using my openAI account and you are 
welcome to use them for any purpose you like.


best wishes,

Kent Fitch

On Wed, Apr 12, 2023 at 4:37 PM Michael Wechner 
 wrote:


thank you very much for your feedback!

In a previous post (April 7) you wrote you could make available
the 47K ada-002 vectors, which would be great!

Would it make sense to set up a public GitHub repo, such that
others could use or also contribute vectors?

Thanks

    Michael Wechner


Am 12.04.23 um 04:51 schrieb Kent Fitch:

I only know some characteristics of the openAI ada-002 vectors,
although they are very popular as
embeddings/text-characterisations as they allow more
accurate/"human meaningful" semantic search results with fewer
dimensions than their predecessors - I've evaluated a few
different embedding models, including some BERT variants, CLIP
ViT-L-14 (with 768 dims, which was quite good), openAI's ada-001
(1024 dims) and babbage-001 (2048 dims), and ada-002 are
qualitatively the best, although that will certainly change!

In any case, ada-002 vectors have interesting characteristics
that I think mean you could confidently create synthetic vectors
which would be hard to distinguish from "real" vectors.  I found
this from looking at 47K ada-002 vectors generated across a full
year (1994) of newspaper articles from the Canberra Times and
200K wikipedia articles:
- there is no discernible/significant correlation between values
in any pair of dimensions
- all but 5 of the 1536 dimensions have an almost identical
distribution of values shown in the central blob on these graphs
(that just show a few of these 1531 dimensions with clumped
values and the 5 "outlier" dimensions, but all 1531 non-outlier
dims are in there, which makes for some easy quantisation from
float to byte if you don't want to go the full
kmeans/clustering/Lloyds-algorithm approach):

https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing

https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing

https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
- the variance of the value of each dimension is characteristic:

https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228

This probably represents something significant about how the
ada-002 embeddings are created, but I think it also means
creating "realistic" values is possible.  I did not use this
information when testing recall & performance on Lucene's HNSW
implementation on 192m documents, as I slightly dithered the
values of a "real" set on 47K docs and stored other fields in
the doc that referenced the "base" document that the dithers
were made from, and used different dithering magnitudes so that
I could test recall with different neighbour sizes ("M"),
construction-beamwidth and search-beamwidths.

best regards

Kent Fitch




On Wed, Apr 12, 2023 at 5:08 AM Michael Wechner
 wrote:

I understand what you mean that it seems to be artificial, but I
don't understand why this matters for testing the performance and
scalability of the indexing.

Let's assume the limit of Lucene would be 4 instead of 1024
and there
are only open source models generating vectors with 4
dimensions, for
example


0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814



Re: Lucene 9.6 release

2023-04-19 Thread Michael Wechner

+1

Thanks!

Michael

Am 19.04.23 um 18:09 schrieb Benjamin Trent:

+1 !

You rock Alan!

On Wed, Apr 19, 2023, 9:54 AM Ignacio Vera  wrote:

+1

Thanks Alan!

On Wed, Apr 19, 2023 at 1:27 PM Alan Woodward
 wrote:

Hi all,

It’s been a while since our last release, and we have a number
of nice improvements and optimisations sitting in the 9x
branch.  I propose that we start the process for a 9.6
release, and I will volunteer to be the release manager.  If
there are no objections, I will cut a release branch one week
today, April 26th.

- Alan
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-12 Thread Michael Wechner

Hi Kent

Great, thank you very much!

Will download it later today :-)

All the best

Michael

Am 13.04.23 um 01:35 schrieb Kent Fitch:
Hi Michael (and anyone else who wants just over 240K "real world" 
ada-002 vectors of dimension 1536),

you are welcome to retrieve a tar.gz file which contains:
- 47K embeddings of Canberra Times news article text from 1994
- 38K embeddings of the first paragraphs of wikipedia articles about 
organisations
- 156.6K embeddings of the first paragraphs of wikipedia articles 
about people


https://drive.google.com/file/d/13JP_5u7E8oZO6vRg0ekaTgBDQOaj-W00/view?usp=sharing

The file is about 1.7GB and will expand to about 4.4GB. This file will 
be accessible for at least a week, and I hope you don't hit any Google 
Drive download limits trying to retrieve it.


The embeddings were generated using my openAI account and you are 
welcome to use them for any purpose you like.


best wishes,

Kent Fitch

On Wed, Apr 12, 2023 at 4:37 PM Michael Wechner 
 wrote:


thank you very much for your feedback!

In a previous post (April 7) you wrote you could make available
the 47K ada-002 vectors, which would be great!

Would it make sense to set up a public GitHub repo, such that others
could use or also contribute vectors?

Thanks

    Michael Wechner


Am 12.04.23 um 04:51 schrieb Kent Fitch:

I only know some characteristics of the openAI ada-002 vectors,
although they are very popular as
embeddings/text-characterisations as they allow more
accurate/"human meaningful" semantic search results with fewer
dimensions than their predecessors - I've evaluated a few
different embedding models, including some BERT variants, CLIP
ViT-L-14 (with 768 dims, which was quite good), openAI's ada-001
(1024 dims) and babbage-001 (2048 dims), and ada-002 are
qualitatively the best, although that will certainly change!

In any case, ada-002 vectors have interesting characteristics
that I think mean you could confidently create synthetic vectors
which would be hard to distinguish from "real" vectors.  I found
this from looking at 47K ada-002 vectors generated across a full
year (1994) of newspaper articles from the Canberra Times and
200K wikipedia articles:
- there is no discernible/significant correlation between values
in any pair of dimensions
- all but 5 of the 1536 dimensions have an almost identical
distribution of values shown in the central blob on these graphs
(that just show a few of these 1531 dimensions with clumped
values and the 5 "outlier" dimensions, but all 1531 non-outlier
dims are in there, which makes for some easy quantisation from
float to byte if you dont want to go the full
kmeans/clustering/Lloyds-algorithm approach):

https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing

https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing

https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
- the variance of the value of each dimension is characteristic:

https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228

This probably represents something significant about how the
ada-002 embeddings are created, but I think it also means
creating "realistic" values is possible.  I did not use this
information when testing recall & performance on Lucene's HNSW
implementation on 192m documents, as I slightly dithered the
values of a "real" set on 47K docs and stored other fields in the
doc that referenced the "base" document that the dithers were
made from, and used different dithering magnitudes so that I
could test recall with different neighbour sizes ("M"),
construction-beamwidth and search-beamwidths.

best regards

Kent Fitch




On Wed, Apr 12, 2023 at 5:08 AM Michael Wechner
 wrote:

I understand what you mean that it seems to be artificial, but I
don't understand why this matters for testing the performance and
scalability of the indexing.

Let's assume the limit of Lucene would be 4 instead of 1024
and there
are only open source models generating vectors with 4
dimensions, for
example


0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814


0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844


-0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106


-0.007012288551777601,-0.02666585892435,0.044495150446891785,-0.038030195981264114

and now I concatenate them to vectors with 8 dimensions 

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-12 Thread Michael Wechner

thank you very much for your feedback!

In a previous post (April 7) you wrote you could make available the 47K 
ada-002 vectors, which would be great!


Would it make sense to set up a public GitHub repo, such that others could 
use or also contribute vectors?


Thanks

Michael Wechner


On 12.04.23 at 04:51, Kent Fitch wrote:
I only know some characteristics of the openAI ada-002 vectors,
although they are very popular as embeddings/text-characterisations,
as they allow more accurate/"human meaningful" semantic search results
with fewer dimensions than their predecessors - I've evaluated a few
different embedding models, including some BERT variants, CLIP
ViT-L-14 (with 768 dims, which was quite good), openAI's ada-001 (1024
dims) and babbage-001 (2048 dims), and ada-002 is qualitatively the
best, although that will certainly change!


In any case, ada-002 vectors have interesting characteristics that I 
think mean you could confidently create synthetic vectors which 
would be hard to distinguish from "real" vectors.  I found this from 
looking at 47K ada-002 vectors generated across a full year (1994) of 
newspaper articles from the Canberra Times and 200K wikipedia articles:
- there is no discernible/significant correlation between values in 
any pair of dimensions
- all but 5 of the 1536 dimensions have an almost identical
distribution of values, shown in the central blob on these graphs (they
show just a few of these 1531 dimensions with clumped values and the 5
"outlier" dimensions, but all 1531 non-outlier dims are in there,
which makes for some easy quantisation from float to byte if you don't
want to go the full kmeans/clustering/Lloyds-algorithm approach):

https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
- the variance of the value of each dimension is characteristic:
https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228

This probably represents something significant about how the ada-002
embeddings are created, but I think it also means creating "realistic"
values is possible.  I did not use this information when testing
recall & performance of Lucene's HNSW implementation on 192m
documents; instead, I slightly dithered the values of a "real" set of
47K docs, stored other fields in each doc referencing the "base"
document the dithers were made from, and used different dithering
magnitudes so that I could test recall with different neighbour sizes
("M"), construction-beamwidths and search-beamwidths.


best regards

Kent Fitch




On Wed, Apr 12, 2023 at 5:08 AM Michael Wechner 
 wrote:


I understand what you mean that it seems to be artificial, but I
don't understand why this matters for testing the performance and
scalability of the indexing.

Let's assume the limit of Lucene were 4 instead of 1024 and there
were only open-source models generating vectors with 4 dimensions, for
example


0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814


0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844


-0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106


-0.007012288551777601,-0.02666585892435,0.044495150446891785,-0.038030195981264114

and now I concatenate them to vectors with 8 dimensions



0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844


-0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.02666585892435,0.044495150446891785,-0.038030195981264114

and normalize them to length 1.

Why should this be any different from a model acting like a black
box that generates vectors with 8 dimensions?




On 11.04.23 at 19:05, Michael Sokolov wrote:
>> What exactly do you consider real vector data? Vector data
which is based on texts written by humans?
> We have plenty of text; the problem is coming up with a realistic
> vector model that requires as many dimensions as people seem to be
> demanding. As I said above, after surveying huggingface I couldn't
> find any text-based model using more than 768 dimensions. So far we
> have some ideas of generating higher-dimensional data by
dithering or
> concatenating existing data, but it seems artificial.
>
> On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner
>  wrote:
>> What exactly do you consider real vector data? 

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-11 Thread Michael Wechner
I understand what you mean that it seems to be artificial, but I don't
understand why this matters for testing the performance and scalability
of the indexing.


Let's assume the limit of Lucene were 4 instead of 1024 and there
were only open-source models generating vectors with 4 dimensions, for
example


0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814

0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844

-0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106

-0.007012288551777601,-0.02666585892435,0.044495150446891785,-0.038030195981264114

and now I concatenate them to vectors with 8 dimensions


0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844

-0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.02666585892435,0.044495150446891785,-0.038030195981264114

and normalize them to length 1.

Why should this be any different from a model acting like a black
box that generates vectors with 8 dimensions?
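
For illustration, here is that concatenate-and-normalise step as a
minimal Java sketch (ConcatVectors is a hypothetical helper written for
this example, not a Lucene API):

    class ConcatVectors {
      static float[] concatAndNormalise(float[] a, float[] b) {
        float[] v = new float[a.length + b.length];
        System.arraycopy(a, 0, v, 0, a.length);
        System.arraycopy(b, 0, v, a.length, b.length);
        double norm = 0;
        for (float x : v) {
          norm += x * x;
        }
        float inv = (float) (1.0 / Math.sqrt(norm)); // scale to length 1
        for (int i = 0; i < v.length; i++) {
          v[i] *= inv;
        }
        return v;
      }
    }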





On 11.04.23 at 19:05, Michael Sokolov wrote:

What exactly do you consider real vector data? Vector data which is based on 
texts written by humans?

We have plenty of text; the problem is coming up with a realistic
vector model that requires as many dimensions as people seem to be
demanding. As I said above, after surveying huggingface I couldn't
find any text-based model using more than 768 dimensions. So far we
have some ideas of generating higher-dimensional data by dithering or
concatenating existing data, but it seems artificial.

On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner
 wrote:

What exactly do you consider real vector data? Vector data which is based on 
texts written by humans?

I am asking because I recently attended the following presentation by
Anastassia Shaitarova (UZH Institute for Computational Linguistics,
https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)



Can we Identify Machine-Generated Text? An Overview of Current Approaches
by Anastassia Shaitarova (UZH Institute for Computational Linguistics)

The detection of machine-generated text has become increasingly important due 
to the prevalence of automated content generation and its potential for misuse. 
In this talk, we will discuss the motivation for automatic detection of 
generated text. We will present the currently available methods, including 
feature-based classification as a “first line-of-defense.” We will provide an 
overview of the detection tools that have been made available so far and 
discuss their limitations. Finally, we will reflect on some open problems 
associated with the automatic discrimination of generated texts.



and her conclusion was that it has become basically impossible to differentiate
between text generated by humans and text generated by, for example, ChatGPT.

Whereas others have a slightly different opinion, see for example

https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/

But I would argue that real-world and synthetic data have become close enough
that testing the performance and scalability of indexing should be possible
with synthetic data.

I completely agree that we have to base our discussions and decisions on 
scientific methods and that we have to make sure that Lucene performs and 
scales well and that we understand the limits and what is going on under the 
hood.

Thanks

Michael W





On 11.04.23 at 14:29, Michael McCandless wrote:

+1 to test on real vector data -- if you test on synthetic data you draw 
synthetic conclusions.

Can someone post the theoretical performance (CPU and RAM required) of HNSW 
construction?  Do we know/believe our HNSW implementation has achieved that 
theoretical big-O performance?  Maybe we have some silly performance bug that's 
causing it not to?

As I understand it, HNSW makes the tradeoff of costly construction for faster 
searching, which is typically the right tradeoff for search use cases.  We do 
this in other parts of the Lucene index too.

Lucene will do a logarithmic number of merges over time, i.e. each doc will be
merged O(log(N)) times in its lifetime in the index.  We need to multiply that
by the cost of re-building the whole HNSW graph on each merge.  BTW, other
things in Lucene, like BKD/dimensional points, also rebuild the whole data
structure on each merge, I think?  But, as Rob pointed out, stored fields
merging does indeed do some sneaky tricks to avoid excessive block
decompression/recompression on each merge.
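
As a back-of-envelope illustration of that multiplication (my numbers,
purely hypothetical, not figures from the thread):

    class MergeCost {
      public static void main(String[] args) {
        long maxDocs = 10_000_000L;
        int mergeFactor = 10;
        // Each doc is rewritten roughly log_mergeFactor(maxDocs) times over
        // its lifetime, and the HNSW graph build is paid again on each rewrite.
        long rewrites = Math.round(Math.ceil(Math.log(maxDocs) / Math.log(mergeFactor)));
        System.out.println("graph build paid ~" + rewrites + "x per doc"); // ~7
      }
    }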


As I understand it, vetoes must have technical merit. I'm not sure that this veto rises 
to "technical merit" on 2 counts:

Actually I think Robert's veto stands on its technical merit already.  Robert's
take on technical matters very much resonates with me, even if he is sometimes
prickly

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-11 Thread Michael Wechner
05
> >>
> >> Attacking me isn't helping the situation.
> >>
> >> PS: when I said the "one guy who wrote the code"
I didn't mean it in
> >> any kind of demeaning fashion really. I meant to
describe the current
> >> state of usability with respect to indexing a few
million docs with
> >> high dimensions. You can scroll up the thread and
see that at least
> >> one other committer on the project experienced
similar pain as me.
> >> Then, think about users who aren't committers
trying to use the
> >> functionality!
> >>
> >> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov
 wrote:
> >> >
> >> > What you said about increasing dimensions
requiring a bigger ram buffer on merge is wrong.
That's the point I was trying to make. Your concerns
about merge costs are not wrong, but your conclusion
that we need to limit dimensions is not justified.
> >> >
> >> > You complain that HNSW sucks and doesn't scale,
but when I show it scales linearly with dimension you
just ignore that and complain about something entirely
different.
> >> >
> >> > You demand that people run all kinds of tests
to prove you wrong, but when they do, you don't listen,
you won't put in the work yourself, or you complain
that it's too hard.
> >> >
> >> > Then you complain about people not meeting you
half way. Wow
> >> >
> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir
 wrote:
> >> >>
> >> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
> >> >>  wrote:
> >> >> >
> >> >> > What exactly do you consider reasonable?
> >> >>
> >> >> Let's begin a real discussion by being HONEST
about the current
> >> status. Please put political correctness or your
own company's wishes
> >> >> aside, we know it's not in a good state.
> >> >>
> >> Current status is that the one guy who wrote the
code can set a
> >> >> multi-gigabyte ram buffer and index a small
dataset with 1024
> >> dimensions in HOURS (I didn't ask what hardware).
> >> >>
> >> My concerns are about everyone else except the one
guy; I want it to be
> >> >> usable. Increasing dimensions just means even
bigger multi-gigabyte
> >> >> ram buffer and bigger heap to avoid OOM on merge.
> >> >> It is also a permanent backwards compatibility
decision, we have to
> >> >> support it once we do this and we can't just
say "oops" and flip it
> >> >> back.
> >> >>
> >> >> It is unclear to me, if the multi-gigabyte ram
buffer is really to
> >> >> avoid merges because they are so slow and it
would be DAYS otherwise,
> >> or if it's to avoid merges so it doesn't hit OOM.
> >> >> Also from personal experience, it takes trial
and error (means
> >> >> experiencing OOM on merge!!!) before you get
those heap values correct
> >> >> for your dataset. This usually means starting
over which is
> >> >> frustrating and wastes more time.
> >> >>
> >> >> Jim mentioned some ideas about the memory
usage in IndexWriter, seems
> >> >> to me like its a good idea. maybe the
   

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-09 Thread Michael Wechner
I think for testing the performance and scalability one can also use 
synthetic data and it does not have to be real world data in the sense 
of vectors generated from real world text.


But I think the more people revisit the testing of performance and 
scalability the better and any help on this would be great!


Thanks

Michael W



On 09.04.23 at 20:43, Dawid Weiss wrote:

We do have a dataset built from Wikipedia in luceneutil. It comes in 100 and 
300 dimensional varieties and can easily enough generate large numbers of 
vector documents from the articles data. To go higher we could concatenate 
vectors from that and I believe the performance numbers would be plausible.

Apologies - I wasn't clear - I thought of building the 1k or 2k
vectors that would be realistic. Perhaps using glove or perhaps using
some other software but something that would reflect a true 2k
dimensional space accurately with "real" data underneath. I am not
familiar enough with the field to tell whether a simple concatenation
is a good enough simulation - perhaps it is.

I would really prefer to focus on doing this kind of assessment of
feasibility/ limitations rather than arguing back and forth. I did my
experiment a while ago and I can't really tell whether there have been
improvements in the indexing/ merging part - your email contradicts my
experience Mike, so I'm a bit intrigued and would like to revisit it.
But it'd be ideal to work with real vectors rather than a simulation.

Dawid








Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-09 Thread Michael Wechner
en asking about raising the limit would like to
do.

I agree that the merge-time memory usage and slow indexing rate are
not great. But it's still possible to index multi-million vector
datasets with a 4GB heap without hitting OOMEs regardless of the
number of dimensions, and the feedback I'm seeing is that many users
are still interested in indexing multi-million vector datasets despite
the slow indexing rate. I wish we could do better, and vector indexing
is certainly more expert than text indexing, but it still is usable in
my opinion. I understand how giving Lucene more information about
vectors prior to indexing (e.g. clustering information as Jim pointed
out) could help make merging faster and more memory-efficient, but I
would really like to avoid making it a requirement for indexing
vectors as it also makes this feature much harder to use.

On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
 wrote:

I am very attentive to listening to opinions, but I am unconvinced here, and I
am not sure that a single person's opinion should be allowed to be detrimental
for such an important project.

The limit as far as I know is literally just raising an exception.
Removing it won't alter in any way the current performance for users in low 
dimensional space.
Removing it will just enable more users to use Lucene.

If new users in certain situations will be unhappy with the performance, they 
may contribute improvements.
This is how you make progress.

If it's a reputation thing, trust me that not allowing users to play with high 
dimensional space will equally damage it.

To me it's really a no-brainer.
Removing the limit and enabling people to use high-dimensional vectors will
take minutes.
Improving the HNSW implementation can take months.
Pick one to begin with...

And there's no-one paying me here, no company interest whatsoever, actually I 
pay people to contribute, I am just convinced it's a good idea.


On Sat, 8 Apr 2023, 18:57 Robert Muir,  wrote:

I disagree with your categorization. I put in plenty of work and
experienced plenty of pain myself, writing tests and fighting these
issues, after I saw that, two releases in a row, vector indexing fell
over and hit integer overflows etc on small datasets:

https://github.com/apache/lucene/pull/11905

Attacking me isn't helping the situation.

PS: when I said the "one guy who wrote the code" I didn't mean it in
any kind of demeaning fashion really. I meant to describe the current
state of usability with respect to indexing a few million docs with
high dimensions. You can scroll up the thread and see that at least
one other committer on the project experienced similar pain as me.
Then, think about users who aren't committers trying to use the
functionality!

On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov  wrote:

What you said about increasing dimensions requiring a bigger ram buffer on 
merge is wrong. That's the point I was trying to make. Your concerns about 
merge costs are not wrong, but your conclusion that we need to limit dimensions 
is not justified.

You complain that HNSW sucks and doesn't scale, but when I show it scales
linearly with dimension you just ignore that and complain about something
entirely different.

You demand that people run all kinds of tests to prove you wrong, but when they
do, you don't listen, you won't put in the work yourself, or you complain that
it's too hard.

Then you complain about people not meeting you half way. Wow

On Sat, Apr 8, 2023, 12:40 PM Robert Muir  wrote:

On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
 wrote:

What exactly do you consider reasonable?

Let's begin a real discussion by being HONEST about the current
status. Please put political correctness or your own company's wishes
aside; we know it's not in a good state.

Current status is that the one guy who wrote the code can set a
multi-gigabyte ram buffer and index a small dataset with 1024
dimensions in HOURS (I didn't ask what hardware).

My concerns are about everyone else except the one guy; I want it to be
usable. Increasing dimensions just means an even bigger multi-gigabyte
ram buffer and a bigger heap to avoid OOM on merge.
It is also a permanent backwards compatibility decision, we have to
support it once we do this and we can't just say "oops" and flip it
back.

It is unclear to me, if the multi-gigabyte ram buffer is really to
avoid merges because they are so slow and it would be DAYS otherwise,
or if it's to avoid merges so it doesn't hit OOM.
Also from personal experience, it takes trial and error (means
experiencing OOM on merge!!!) before you get those heap values correct
for your dataset. This usually means starting over which is
frustrating and wastes more time.

Jim mentioned some ideas about the memory usage in IndexWriter; seems
to me like it's a good idea. Maybe the multi-gigabyte ram buffer can be
avoided in this way and performance improved by writing bigger
segments with Lucene's defaults. But this doesn't mean we can simply
ignore the horr

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-08 Thread Michael Wechner

What exactly do you consider reasonable?

I think it would help if we could specify concrete requirements re 
performance and scalability, because then we have a concrete goal which 
we can work with.

Do such requirements already exist or what would be a good starting point?

Re 2x worse, I think Michael Sokolov already pointed out that things 
take longer linearly with vector dimension, which is quite obvious for 
example for a brute force implementation. I would argue this will be the 
case for any implementation.


And last I would like to ask again, slightly different, do we want 
people to use Lucene, which will give us an opportunity to learn from 
and progress?


Thanks

Michael



On 08.04.23 at 13:04, Robert Muir wrote:

I don't think we have. The performance needs to be reasonable in order
to bump this limit. Otherwise bumping this limit makes the worst-case
2x worse than it already is!

Moreover, its clear something needs to happen to address the
scalability/lack of performance. I'd hate for this limit to be in the
way of that. Because of backwards compatibility, it's a one-way,
permanent, irreversible change.

I'm not sold by any means in any way yet. My vote remains the same.

On Fri, Apr 7, 2023 at 10:57 PM Michael Wechner
 wrote:

sorry to interrupt, but I think we get side-tracked from the original 
discussion to increase the vector dimension limit.

I think improving the vector indexing performance is one thing and making sure 
Lucene does not crash when increasing the vector dimension limit is another.

I think it is great to find better ways to index vectors, but I think this 
should not prevent people from being able to use models with higher vector 
dimensions than 1024.

The following comparison might not be perfect, but imagine we have invented a 
combustion engine, which is strong enough to move a car in the flat area, but 
when applying it to a truck to move things over mountains it will fail, because 
it is not strong enough. Would you prevent people from using the combustion 
engine for a car in the flat area?

Thanks

Michael



On 08.04.23 at 00:15, jim ferenczi wrote:


Keep in mind, there may be other ways to do it. In general if merging

something is going to be "heavyweight", we should think about it to
prevent things from going really bad overall.

Yep I agree. Personally I don't see how we can solve this without prior
knowledge of the vectors. Faiss has a nice implementation that fits naturally
with Lucene called IVF (
https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
but if we want to avoid running kmeans on every merge we'd need to provide
the clusters for the entire index before indexing the first vector.
It's a complex issue…

On Fri, 7 Apr 2023 at 22:58, Robert Muir  wrote:

Personally i'd have to re-read the paper, but in general the merging
issue has to be addressed somehow to fix the overall indexing time
problem. It seems it gets "dodged" with huge rambuffers in the emails
here.
Keep in mind, there may be other ways to do it. In general if merging
something is going to be "heavyweight", we should think about it to
prevent things from going really bad overall.

As an example, I'm most familiar with adding DEFLATE compression to
stored fields. Previously, we'd basically decompress and recompress
the stored fields on merge, and LZ4 is so fast that it wasn't
obviously a problem. But with DEFLATE it got slower/heavier (more
intense compression algorithm), something had to be done or indexing
would be unacceptably slow. Hence if you look at storedfields writer,
there is "dirtiness" logic etc so that recompression is amortized over
time and doesn't happen on every merge.

On Fri, Apr 7, 2023 at 5:38 PM jim ferenczi  wrote:

I am also not sure that diskann would solve the merging issue. The idea
described in the paper is to run kmeans first to create multiple graphs, one per
cluster. In our case the vectors in each segment could belong to different
clusters, so I don’t see how we could merge them efficiently.

On Fri, 7 Apr 2023 at 22:28, jim ferenczi  wrote:

The inference time (and cost) to generate these big vectors must be quite large 
too ;).
Regarding the ram buffer, we could drastically reduce the size by writing the 
vectors on disk instead of keeping them in the heap. With 1k dimensions the ram 
buffer is filled with these vectors quite rapidly.

On Fri, 7 Apr 2023 at 21:59, Robert Muir  wrote:

On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov  wrote:

8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)

Robert, since you're the only on-the-record veto here, does this
change your thinking at all, or if not could you share some test
results that didn't go the way you expected? Maybe we can find some
mitigation if we focus on a specific issue.


My scale concerns are both space and time. What does the execution

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Michael Wechner
sorry to interrupt, but I think we get side-tracked from the original 
discussion to increase the vector dimension limit.


I think improving the vector indexing performance is one thing and 
making sure Lucene does not crash when increasing the vector dimension 
limit is another.


I think it is great to find better ways to index vectors, but I think 
this should not prevent people from being able to use models with higher 
vector dimensions than 1024.


The following comparison might not be perfect, but imagine we have 
invented a combustion engine, which is strong enough to move a car in 
the flat area, but when applying it to a truck to move things over 
mountains it will fail, because it is not strong enough. Would you 
prevent people from using the combustion engine for a car in the flat area?


Thanks

Michael



On 08.04.23 at 00:15, jim ferenczi wrote:

> Keep in mind, there may be other ways to do it. In general if merging
something is going to be "heavyweight", we should think about it to
prevent things from going really bad overall.

Yep I agree. Personally I don't see how we can solve this without
prior knowledge of the vectors. Faiss has a nice implementation that
fits naturally with Lucene called IVF (

https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html)
but if we want to avoid running kmeans on every merge we'd need to
provide the clusters for the entire index before indexing the first
vector.

It's a complex issue…

On Fri, 7 Apr 2023 at 22:58, Robert Muir  wrote:

Personally i'd have to re-read the paper, but in general the merging
issue has to be addressed somehow to fix the overall indexing time
problem. It seems it gets "dodged" with huge rambuffers in the emails
here.
Keep in mind, there may be other ways to do it. In general if merging
something is going to be "heavyweight", we should think about it to
prevent things from going really bad overall.

As an example, I'm most familiar with adding DEFLATE compression to
stored fields. Previously, we'd basically decompress and recompress
the stored fields on merge, and LZ4 is so fast that it wasn't
obviously a problem. But with DEFLATE it got slower/heavier (more
intense compression algorithm), something had to be done or indexing
would be unacceptably slow. Hence if you look at storedfields writer,
there is "dirtiness" logic etc so that recompression is amortized over
time and doesn't happen on every merge.

On Fri, Apr 7, 2023 at 5:38 PM jim ferenczi
 wrote:
>
> I am also not sure that diskann would solve the merging issue.
The idea described in the paper is to run kmeans first to create
multiple graphs, one per cluster. In our case the vectors in each
segment could belong to different clusters, so I don’t see how we
could merge them efficiently.
>
> On Fri, 7 Apr 2023 at 22:28, jim ferenczi
 wrote:
>>
>> The inference time (and cost) to generate these big vectors
must be quite large too ;).
>> Regarding the ram buffer, we could drastically reduce the size
by writing the vectors on disk instead of keeping them in the
heap. With 1k dimensions the ram buffer is filled with these
vectors quite rapidly.
>>
>> On Fri, 7 Apr 2023 at 21:59, Robert Muir  wrote:
>>>
>>> On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov
 wrote:
>>> >
>>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer
size=1994)
>>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW
buffer size=1994)
>>> >
>>> > Robert, since you're the only on-the-record veto here, does this
>>> > change your thinking at all, or if not could you share some test
>>> > results that didn't go the way you expected? Maybe we can
find some
>>> > mitigation if we focus on a specific issue.
>>> >
>>>
>>> My scale concerns are both space and time. What does the execution
>>> time look like if you don't set insanely large IW rambuffer? The
>>> default is 16MB. Just concerned we're shoving some problems
under the
>>> rug :)
>>>
>>> Even with the yuge RAMbuffer, we're still talking about almost
2 hours
>>> to index 4M documents with these 2k vectors. Whereas you'd measure
>>> this in seconds with typical Lucene indexing, it's nothing.
>>>




Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Michael Wechner

you might want to use SentenceBERT to generate vectors

https://sbert.net

whereas for example the model "all-mpnet-base-v2" generates vectors with 
dimension 768


We have SentenceBERT running as a web service, which we could open for 
these tests, but because of network latency it should be faster running 
locally.


HTH

Michael


On 07.04.23 at 10:11, Marcus Eagan wrote:
I've started to look on the internet, and surely someone will come up
with something, but the challenge I suspect is that these vectors are
expensive to generate, so people have not gone all-in on generating
such large vectors for large datasets. They certainly have not made
them easy to find. Here is the most promising, but it is probably too small:
https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download 



 I'm still in and out of the office at the moment, but when I return, 
I can ask my employer if they will sponsor a 10 million document 
collection so that you can test with that. Or, maybe someone from work 
will see and ask them on my behalf.


Alternatively, next week, I may get some time to set up a server with 
an open source LLM to generate the vectors. It still won't be free, 
but it would be 99% cheaper than paying the LLM companies if we can be 
slow.




On Thu, Apr 6, 2023 at 9:42 PM Michael Wechner 
 wrote:


Great, thank you!

How much RAM; etc. did you run this test on?

Do the vectors really have to be based on real data for testing the
indexing?
I understand, if you want to test the quality of the search
results it
does matter, but for testing the scalability itself it should not
matter
actually, right?

Thanks

Michael

On 07.04.23 at 01:19, Michael Sokolov wrote:
> I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> minutes with a single thread. I have some 256K vectors, but only
about
> 2M of them. Can anybody point me to a large set (say 8M+) of
1024+ dim
> vectors I can use for testing? If all else fails I can test with
> noise, but that tends to lead to meaningless results
>
> On Thu, Apr 6, 2023 at 3:52 PM Michael Wechner
>  wrote:
>>
>>
>> On 06.04.23 at 17:47, Robert Muir wrote:
>>> Well, I'm asking people to actually try testing with such high
>>> dimensions.
>>> Based on my own experience, I consider it unusable. It seems other
>>> folks may have run into trouble too. If the project committers can't
>>> even really use vectors with such high dimension counts, then it's not
>>> in an OK state for users, and we shouldn't bump the limit.
>>>
>>> I'm happy to discuss/compromise etc, but simply bumping the limit
>>> without addressing the underlying usability/scalability is a real
>>> no-go,
>> I agree that this needs to be addressed
>>
>>
>>
>>>    it is not really solving anything, nor is it giving users any
>>> freedom or allowing them to do something they couldn't do before.
>>> Because if it still doesn't work, it still doesn't work.
>> I disagree, because it *does work* with "smaller" document sets.
>>
>> Currently we have to compile Lucene ourselves to not get the
exception
>> when using a model with vector dimension greater than 1024,
>> which is of course possible, but not really convenient.
>>
>> As I wrote before, to resolve this discussion, I think we
should test
>> and address possible issues.
>>
>> I will try to stop discussing now :-) and instead try to understand
>> better the actual issues. Would be great if others could join
on this!
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>>> We all need to be on the same page, grounded in reality, not
fantasy,
>>> where if we set a limit of 1024 or 2048, that you can actually
index
>>> vectors with that many dimensions and it actually works and
scales.
>>>
>>> On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
>>>  wrote:
>>>> As I said earlier, a max limit limits usability.
>>>> It's not forcing users with small vectors to pay the
>>>> performance penalty of big vectors, it's literally preventing some
>>>> users from using Lucene/Solr/Elasticsearch at all.
>>>> As far as I know, the max limit is used to raise an
exception, it's not used to initialise or optimise data structures
(please correct me if I'm wrong).
>>>>
>>>> Improving the algorithm performance is a separate discussion.
>>>> I

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Wechner

Great, thank you!

How much RAM; etc. did you run this test on?

Do the vectors really have to be based on real data for testing the 
indexing?
I understand, if you want to test the quality of the search results it 
does matter, but for testing the scalability itself it should not matter 
actually, right?


Thanks

Michael

On 07.04.23 at 01:19, Michael Sokolov wrote:

I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
minutes with a single thread. I have some 256K vectors, but only about
2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
vectors I can use for testing? If all else fails I can test with
noise, but that tends to lead to meaningless results

On Thu, Apr 6, 2023 at 3:52 PM Michael Wechner
 wrote:



On 06.04.23 at 17:47, Robert Muir wrote:

Well, I'm asking people to actually try testing with such high dimensions.
Based on my own experience, I consider it unusable. It seems other
folks may have run into trouble too. If the project committers can't
even really use vectors with such high dimension counts, then it's not
in an OK state for users, and we shouldn't bump the limit.

I'm happy to discuss/compromise etc, but simply bumping the limit
without addressing the underlying usability/scalability is a real
no-go,

I agree that this needs to be addressed




   it is not really solving anything, nor is it giving users any
freedom or allowing them to do something they couldn't do before.
Because if it still doesn't work, it still doesn't work.

I disagree, because it *does work* with "smaller" document sets.

Currently we have to compile Lucene ourselves to not get the exception
when using a model with vector dimension greater than 1024,
which is of course possible, but not really convenient.

As I wrote before, to resolve this discussion, I think we should test
and address possible issues.

I will try to stop discussing now :-) and instead try to understand
better the actual issues. Would be great if others could join on this!

Thanks

Michael




We all need to be on the same page, grounded in reality, not fantasy,
where if we set a limit of 1024 or 2048, that you can actually index
vectors with that many dimensions and it actually works and scales.

On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
 wrote:

As I said earlier, a max limit limits usability.
It's not forcing users with small vectors to pay the performance penalty of big
vectors, it's literally preventing some users from using
Lucene/Solr/Elasticsearch at all.
As far as I know, the max limit is used to raise an exception, it's not used to 
initialise or optimise data structures (please correct me if I'm wrong).

Improving the algorithm performance is a separate discussion.
I don't see a correlation with the fact that indexing billions of whatever 
dimensioned vector is slow with a usability parameter.

What about potential users that need few high dimensional vectors?

As I said before, I am a big +1 for NOT just raising it blindly, but I believe
we need to remove the limit or size it in a way that it's not a problem for
both users and internal data structure optimizations, if any.


On Wed, 5 Apr 2023, 18:54 Robert Muir,  wrote:

I'd ask anyone voting +1 to raise this limit to at least try to index
a few million vectors with 756 or 1024, which is allowed today.

IMO based on how painful it is, it seems the limit is already too
high, I realize that will sound controversial but please at least try
it out!

voting +1 without at least doing this is really the
"weak/unscientifically minded" approach.

On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
 wrote:

Thanks for your feedback!

I agree, that it should not crash.

So far we did not experience crashes ourselves, but we did not index
millions of vectors.

I will try to reproduce the crash, maybe this will help us to move forward.

Thanks

Michael

On 05.04.23 at 18:30, Dawid Weiss wrote:

Can you describe your crash in more detail?

I can't. That experiment was a while ago and a quick test to see if I
could index rather large-ish USPTO (patent office) data as vectors.
Couldn't do it then.


How much RAM?

My indexing jobs run with rather smallish heaps to give space for I/O
buffers. Think 4-8GB at most. So yes, it could have been the problem.
I recall segment merging grew slower and slower and then simply
crashed. Lucene should work with low heap requirements, even if it
slows down. Throwing ram at the indexing/ segment merging problem
is... I don't know - not elegant?

Anyway. My main point was to remind folks about how Apache works -
code is merged in when there are no vetoes. If Rob (or anybody else)
remains unconvinced, he or she can block the change. (I didn't invent
those rules).

D.


Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Wechner




On 06.04.23 at 17:47, Robert Muir wrote:

Well, I'm asking people to actually try testing with such high dimensions.
Based on my own experience, I consider it unusable. It seems other
folks may have run into trouble too. If the project committers can't
even really use vectors with such high dimension counts, then it's not
in an OK state for users, and we shouldn't bump the limit.

I'm happy to discuss/compromise etc, but simply bumping the limit
without addressing the underlying usability/scalability is a real
no-go,


I agree that this needs to be addressed




  it is not really solving anything, nor is it giving users any
freedom or allowing them to do something they couldn't do before.
Because if it still doesn't work, it still doesn't work.


I disagree, because it *does work* with "smaller" document sets.

Currently we have to compile Lucene ourselves to not get the exception 
when using a model with vector dimension greater than 1024,

which is of course possible, but not really convenient.

As I wrote before, to resolve this discussion, I think we should test 
and address possible issues.


I will try to stop discussing now :-) and instead try to understand 
better the actual issues. Would be great if others could join on this!


Thanks

Michael





We all need to be on the same page, grounded in reality, not fantasy,
where if we set a limit of 1024 or 2048, that you can actually index
vectors with that many dimensions and it actually works and scales.

On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
 wrote:

As I said earlier, a max limit limits usability.
It's not forcing users with small vectors to pay the performance penalty of big
vectors, it's literally preventing some users from using
Lucene/Solr/Elasticsearch at all.
As far as I know, the max limit is used to raise an exception, it's not used to 
initialise or optimise data structures (please correct me if I'm wrong).

Improving the algorithm performance is a separate discussion.
I don't see a correlation with the fact that indexing billions of whatever 
dimensioned vector is slow with a usability parameter.

What about potential users that need few high dimensional vectors?

As I said before, I am a big +1 for NOT just raising it blindly, but I believe
we need to remove the limit or size it in a way that it's not a problem for
both users and internal data structure optimizations, if any.


On Wed, 5 Apr 2023, 18:54 Robert Muir,  wrote:

I'd ask anyone voting +1 to raise this limit to at least try to index
a few million vectors with 756 or 1024, which is allowed today.

IMO based on how painful it is, it seems the limit is already too
high, I realize that will sound controversial but please at least try
it out!

voting +1 without at least doing this is really the
"weak/unscientifically minded" approach.

On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
 wrote:

Thanks for your feedback!

I agree, that it should not crash.

So far we did not experience crashes ourselves, but we did not index
millions of vectors.

I will try to reproduce the crash, maybe this will help us to move forward.

Thanks

Michael

On 05.04.23 at 18:30, Dawid Weiss wrote:

Can you describe your crash in more detail?

I can't. That experiment was a while ago and a quick test to see if I
could index rather large-ish USPTO (patent office) data as vectors.
Couldn't do it then.


How much RAM?

My indexing jobs run with rather smallish heaps to give space for I/O
buffers. Think 4-8GB at most. So yes, it could have been the problem.
I recall segment merging grew slower and slower and then simply
crashed. Lucene should work with low heap requirements, even if it
slows down. Throwing ram at the indexing/ segment merging problem
is... I don't know - not elegant?

Anyway. My main point was to remind folks about how Apache works -
code is merged in when there are no vetoes. If Rob (or anybody else)
remains unconvinced, he or she can block the change. (I didn't invent
those rules).

D.















Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Wechner

Thanks!

I will try to run some tests to be on the safe side :-)

On 06.04.23 at 16:28, Michael Sokolov wrote:

yes, it makes a difference. It will take less time and CPU to do it
all in one go, producing a single segment (assuming the data does not
exceed the IndexWriter RAM buffer size). If you index a lot of little
segments and then force merge them it will take longer, because it has
to build the graphs for the little segments, and then for the big one
when merging, and it will eventually use the same amount of RAM to
build the big graph, although I don't believe it will have to load the
vectors en masse into RAM while merging.
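
A minimal sketch of the path being described, assuming the Lucene 9.x
vector API (KnnFloatVectorField, forceMerge); the directory path, field
name and buffer size here are made up for illustration:

    import java.io.IOException;
    import java.nio.file.Paths;
    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    class VectorIndexing {
      static void indexAll(List<float[]> vectors) throws IOException {
        // A large RAM buffer means one flush, one segment, one graph build;
        // a small buffer means many little segments whose graphs are built
        // per segment and then rebuilt on merge.
        IndexWriterConfig cfg = new IndexWriterConfig().setRAMBufferSizeMB(2048);
        try (IndexWriter writer =
            new IndexWriter(FSDirectory.open(Paths.get("/tmp/vectors")), cfg)) {
          for (float[] v : vectors) {
            Document doc = new Document();
            doc.add(new KnnFloatVectorField("vec", v, VectorSimilarityFunction.COSINE));
            writer.addDocument(doc);
          }
          writer.forceMerge(1); // rebuilds one merged graph if several segments exist
        }
      }
    }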

On Thu, Apr 6, 2023 at 10:20 AM Michael Wechner
 wrote:

thanks very much for these insights!

Does it make a difference re RAM when I do a batch import, for example
import 1,000 documents, close the IndexWriter and do a forceMerge, or
import 1 million documents at once?

I would expect so, or do I misunderstand this?

Thanks

Michael



On 06.04.23 at 16:11, Michael Sokolov wrote:

re: how does this HNSW stuff scale - I think people are calling out
indexing memory usage here, so let's discuss some facts. During
initial indexing we hold in RAM all the vector data and the graph
constructed from the new documents, but this is accounted for and
limited by the size of IndexWriter's buffer; the document vectors and
their graph will be flushed to disk when this fills up, and at search
time, they are not read in wholesale to RAM. There is potentially
unbounded RAM usage during merging though, because the entire merged
graph will be built in RAM. I lost track of how we handle the vector
data now, but at least in theory it should be fairly straightforward
to write the merged vector data in chunks using only limited RAM. So
how much RAM does the graph use? It uses numdocs*fanout VInts.
Actually it doesn't really scale with the vector dimension at all -
rather it scales with the graph fanout (M) parameter and with the
total number of documents. So I think this focus on limiting the
vector dimension is not helping to address the concern about RAM usage
while merging.
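
A rough worked example of that estimate (my arithmetic and parameter
choices, not figures from the thread):

    class GraphRam {
      public static void main(String[] args) {
        long numDocs = 8_000_000L;
        int fanout = 16;                // the HNSW "M" parameter
        long links = numDocs * fanout;  // numdocs * fanout VInts
        // VInts are typically 1-4 bytes, so roughly:
        System.out.println("~" + links / (1 << 20) + " MB at 1 byte/VInt, ~"
            + links * 4 / (1 << 20) + " MB at 4 bytes/VInt");
      }
    }

Note that, as the message says, this is independent of the vector
dimension; only the number of documents and the fanout matter.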

The vector dimension does have a strong role in the search, and
indexing time, but the impact is linear in the dimension and won't
exhaust any limited resource.

On Thu, Apr 6, 2023 at 5:48 AM Michael McCandless
 wrote:

We shouldn't accept weakly/not scientifically motivated vetos anyway right?

In fact we must accept all vetos by any committer as a veto, for a change to 
Lucene's source code, regardless of that committer's reasoning.  This is the 
power of Apache's model.

Of course we all can and will work together to convince one another (this is 
where the scientifically motivated part comes in) to change our votes, one way 
or another.


I'd ask anyone voting +1 to raise this limit to at least try to index a few 
million vectors with 756 or 1024, which is allowed today.

+1, if the current implementation really does not scale / needs more and more 
RAM for merging, let's understand what's going on here, first, before 
increasing limits.  I rescind my hasty +1 for now!

Mike McCandless

http://blog.mikemccandless.com


On Wed, Apr 5, 2023 at 11:22 AM Alessandro Benedetti  
wrote:

Ok, so what should we do then?
This space is moving fast, and in my opinion we should act fast to release and 
guarantee we attract as many users as possible.

At the same time I am not saying we should proceed blindly; if there's
concrete evidence for setting one limit rather than another, or that a certain
limit is detrimental to the project, I think that veto should be valid.

We shouldn't accept weakly/not scientifically motivated vetos anyway right?

The problem I see is that, more than voting, we should first decide this
limit, and I don't know how we can operate.
I am imagining something like a poll where each entry is a limit +
motivation, and PMC members vote/add entries?

Did anything similar happen in the past? How was the current limit added?


On Wed, 5 Apr 2023, 14:50 Dawid Weiss,  wrote:

Should we create a VOTE thread, where we propose some values with a
justification, and we vote?

Technically, a vote thread won't help much if there's no full consensus - a 
single veto will make the patch unacceptable for merging.
https://www.apache.org/foundation/voting.html#Veto

Dawid












Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Wechner

thanks very much for these insights!

Does it make a difference re RAM when I do a batch import, for example
import 1,000 documents, close the IndexWriter and do a forceMerge, or
import 1 million documents at once?


I would expect so, or do I misunderstand this?

Thanks

Michael



On 06.04.23 at 16:11, Michael Sokolov wrote:

re: how does this HNSW stuff scale - I think people are calling out
indexing memory usage here, so let's discuss some facts. During
initial indexing we hold in RAM all the vector data and the graph
constructed from the new documents, but this is accounted for and
limited by the size of IndexWriter's buffer; the document vectors and
their graph will be flushed to disk when this fills up, and at search
time, they are not read in wholesale to RAM. There is potentially
unbounded RAM usage during merging though, because the entire merged
graph will be built in RAM. I lost track of how we handle the vector
data now, but at least in theory it should be fairly straightforward
to write the merged vector data in chunks using only limited RAM. So
how much RAM does the graph use? It uses numdocs*fanout VInts.
Actually it doesn't really scale with the vector dimension at all -
rather it scales with the graph fanout (M) parameter and with the
total number of documents. So I think this focus on limiting the
vector dimension is not helping to address the concern about RAM usage
while merging.

The vector dimension does have a strong role in the search, and
indexing time, but the impact is linear in the dimension and won't
exhaust any limited resource.

On Thu, Apr 6, 2023 at 5:48 AM Michael McCandless
 wrote:

We shouldn't accept weakly/not scientifically motivated vetos anyway right?

In fact we must accept all vetos by any committer as a veto, for a change to 
Lucene's source code, regardless of that committer's reasoning.  This is the 
power of Apache's model.

Of course we all can and will work together to convince one another (this is 
where the scientifically motivated part comes in) to change our votes, one way 
or another.


I'd ask anyone voting +1 to raise this limit to at least try to index a few 
million vectors with 756 or 1024, which is allowed today.

+1, if the current implementation really does not scale / needs more and more 
RAM for merging, let's understand what's going on here, first, before 
increasing limits.  I rescind my hasty +1 for now!

Mike McCandless

http://blog.mikemccandless.com


On Wed, Apr 5, 2023 at 11:22 AM Alessandro Benedetti  
wrote:

Ok, so what should we do then?
This space is moving fast, and in my opinion we should act fast to release and 
guarantee we attract as many users as possible.

At the same time I am not saying we should proceed blindly; if there's
concrete evidence for setting one limit rather than another, or that a certain
limit is detrimental to the project, I think that veto should be valid.

We shouldn't accept weakly/not scientifically motivated vetos anyway right?

The problem I see is that, more than voting, we should first decide this
limit, and I don't know how we can operate.
I am imagining something like a poll where each entry is a limit +
motivation, and PMC members vote/add entries?

Did anything similar happen in the past? How was the current limit added?


On Wed, 5 Apr 2023, 14:50 Dawid Weiss,  wrote:



Should we create a VOTE thread, where we propose some values with a
justification, and we vote?


Technically, a vote thread won't help much if there's no full consensus - a 
single veto will make the patch unacceptable for merging.
https://www.apache.org/foundation/voting.html#Veto

Dawid









Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Wechner
I think we should focus on testing where the limits are and what might
cause them.


Let's get out of this fog :-)

Thanks

Michael



On 06.04.23 at 11:47, Michael McCandless wrote:
> We shouldn't accept weakly/not scientifically motivated vetos anyway 
right?


In fact we must accept all vetos by any committer as a veto, for a 
change to Lucene's source code, regardless of that committer's 
reasoning.  This is the power of Apache's model.


Of course we all can and will work together to convince one another 
(this is where the scientifically motivated part comes in) to change 
our votes, one way or another.


> I'd ask anyone voting +1 to raise this limit to at least try to 
index a few million vectors with 756 or 1024, which is allowed today.


+1, if the current implementation really does not scale / needs more 
and more RAM for merging, let's understand what's going on here, 
first, before increasing limits.  I rescind my hasty +1 for now!


Mike McCandless

http://blog.mikemccandless.com


On Wed, Apr 5, 2023 at 11:22 AM Alessandro Benedetti 
 wrote:


Ok, so what should we do then?
This space is moving fast, and in my opinion we should act fast to
release and guarantee we attract as many users as possible.

At the same time I am not saying we should proceed blindly; if
there's concrete evidence for setting one limit rather than another,
or that a certain limit is detrimental to the project, I think
that veto should be valid.

We shouldn't accept weakly/not scientifically motivated vetos
anyway right?

The problem I see is that, more than voting, we should first decide
this limit, and I don't know how we can operate.
I am imagining something like a poll where each entry is a limit +
motivation, and PMC members vote/add entries?

Did anything similar happen in the past? How was the current limit
added?


On Wed, 5 Apr 2023, 14:50 Dawid Weiss,  wrote:

Should we create a VOTE thread, where we propose some values
with a justification, and we vote?


Technically, a vote thread won't help much if there's no full
consensus - a single veto will make the patch unacceptable for
merging.
https://www.apache.org/foundation/voting.html#Veto

Dawid



Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-05 Thread Michael Wechner

Thanks for your feedback!

I agree, that it should not crash.

So far we did not experience crashes ourselves, but we did not index 
millions of vectors.


I will try to reproduce the crash, maybe this will help us to move forward.

Thanks

Michael

On 05.04.23 at 18:30, Dawid Weiss wrote:

Can you describe your crash in more detail?

I can't. That experiment was a while ago and a quick test to see if I
could index rather large-ish USPTO (patent office) data as vectors.
Couldn't do it then.


How much RAM?

My indexing jobs run with rather smallish heaps to give space for I/O
buffers. Think 4-8GB at most. So yes, it could have been the problem.
I recall segment merging grew slower and slower and then simply
crashed. Lucene should work with low heap requirements, even if it
slows down. Throwing ram at the indexing/ segment merging problem
is... I don't know - not elegant?

Anyway. My main point was to remind folks about how Apache works -
code is merged in when there are no vetoes. If Rob (or anybody else)
remains unconvinced, he or she can block the change. (I didn't invent
those rules).

D.








Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-05 Thread Michael Wechner

Hi Dawid

Can you describe your crash in more detail?

How many millions vectors exactly?
What was the vector dimension?
How much RAM?
etc.

Thanks

Michael



On 05.04.23 at 17:48, Dawid Weiss wrote:

Ok, so what should we do then?

I don't know, Alessandro. I just wanted to point out the fact that by
Apache rules a committer's veto to a code change counts as a no-go. It
does not specify any way to "override" such a veto, perhaps counting
on disagreeing parties to resolve conflicting points of view in a
civil manner so that the veto can be retracted (or a different solution
suggested).

I think Robert's point is not about a particular limit value but about
the algorithm itself - the current implementation does not scale. I
don't want to be an advocate for either side - I'm all for freedom of
choice but at the same time last time I tried indexing a few million
vectors, I couldn't get far before segment merging blew up with
OOMs...


Did anything similar happen in the past? How was the current limit added?

I honestly don't know, you'd have to git blame or look at the mailing
list archives of the original contribution.

Dawid








Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-05 Thread Michael Wechner



Am 05.04.23 um 12:34 schrieb Alessandro Benedetti:

Thanks Mike for the insight!

What would be the next steps then?
I see agreement but also the necessity of identifying a candidate MAX.

Should we create a VOTE thread, where we propose some values with a 
justification and we vote?



+1

Thanks

Michael





In this way we can create a pull request and merge relatively soon.

Cheers

On Tue, 4 Apr 2023, 14:47 Michael Wechner,  
wrote:


IIUC we all agree that the limit could be raised, but we need some
solid reasoning about what limit makes sense, i.e. why we set this
particular limit (e.g. 2048), right?

Thanks

Michael


Am 04.04.23 um 15:32 schrieb Michael McCandless:

> I am not in favor of just doubling it as suggested by some
people, I would ideally prefer a solution that remains there to a
decent extent, rather than having to modify it anytime someone
requires a higher limit.

The problem with this approach is it is a one-way door, once
released.  We would not be able to lower the limit again in the
future without possibly breaking some applications.

> For example, we don't limit the number of docs per index to an
arbitrary maximum of N: you push as many docs as you like, and if
they are too many for your system, you get terrible
performance/crashes/whatever.

Correction: we do check this limit and throw a specific exception
now: https://github.com/apache/lucene/issues/6905

+1 to raise the limit, but not remove it.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Apr 3, 2023 at 9:51 AM Alessandro Benedetti
 wrote:

... and what would be the next limit?
I guess we'll need to motivate it better than the 1024 one.
I appreciate the fact that a limit is pretty much wanted by
everyone but I suspect we'll need some solid foundation for
deciding the amount (and it should be high enough to avoid
continuous changes)

Cheers

On Sun, 2 Apr 2023, 07:29 Michael Wechner,
 wrote:

btw, what was the reasoning to set the current limit to 1024?

Thanks

Michael

Am 01.04.23 um 14:47 schrieb Michael Sokolov:

I'm also in favor of raising this limit. We do see some
datasets with higher than 1024 dims. I also think we
need to keep a limit. For example we currently need to
keep all the vectors in RAM while indexing and we want
to be able to support reasonable numbers of vectors in
an index segment. Also we don't know what innovations
might come down the road. Maybe someday we want to do
product quantization and enforce that (k, m) both fit in
a byte -- we wouldn't be able to do that if a vector's
dimension were to exceed 32K.

On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti
 wrote:

I am also curious what would be the worst-case
scenario if we removed the constant altogether (so
the limit automatically becomes the Java
Integer.MAX_VALUE).
i.e.
right now if you exceed the limit you get:

if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
  throw new IllegalArgumentException(
      "cannot index vectors with dimension greater than "
          + ByteVectorValues.MAX_DIMENSIONS);
}


in relation to:

These limits allow us to
better tune our data structures, prevent
overflows, help ensure we
have good test coverage, etc.

I agree 100% especially for typing stuff properly
and avoiding resource waste here and there, but I am
not entirely sure this is the case for the current
implementation i.e. do we have optimizations in
place that assume the max dimension to be 1024?
If I missed that (and I likely have), I of course
suggest the contribution should not just blindly
remove the limit, but do it appropriately.
I am not in favor of just doubling it as suggested
by some people, I would ideally prefer a solution
that remains there to a decent extent, rather than
having to modify it anytime someone requires a
higher limit.

Cheers
--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-04 Thread Michael Wechner
IIUC we all agree that the limit could be raised, but we need some solid 
reasoning about what limit makes sense, i.e. why we set this particular 
limit (e.g. 2048), right?


Thanks

Michael


Am 04.04.23 um 15:32 schrieb Michael McCandless:
> I am not in favor of just doubling it as suggested by some people, I 
would ideally prefer a solution that remains there to a decent extent, 
rather than having to modify it anytime someone requires a higher 
limit.


The problem with this approach is it is a one-way door, once 
released.  We would not be able to lower the limit again in the future 
without possibly breaking some applications.


> For example, we don't limit the number of docs per index to an 
arbitrary maximum of N: you push as many docs as you like, and if they 
are too many for your system, you get terrible 
performance/crashes/whatever.


Correction: we do check this limit and throw a specific exception now: 
https://github.com/apache/lucene/issues/6905
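
(A sketch of the behavior being referenced -- writer and doc are
hypothetical, and the exception message is paraphrased:)

// IndexWriter enforces a hard cap (IndexWriter.MAX_DOCS, slightly below
// Integer.MAX_VALUE) and throws instead of silently overflowing:
try {
  writer.addDocument(doc);  // index assumed to already be at the cap
} catch (IllegalArgumentException e) {
  // e.g. "number of documents in the index cannot exceed ..."
}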


+1 to raise the limit, but not remove it.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Apr 3, 2023 at 9:51 AM Alessandro Benedetti 
 wrote:


... and what would be the next limit?
I guess we'll need to motivate it better than the 1024 one.
I appreciate the fact that a limit is pretty much wanted by
everyone but I suspect we'll need some solid foundation for
deciding the amount (and it should be high enough to avoid
continuous changes)

Cheers

On Sun, 2 Apr 2023, 07:29 Michael Wechner,
 wrote:

btw, what was the reasoning to set the current limit to 1024?

Thanks

Michael

Am 01.04.23 um 14:47 schrieb Michael Sokolov:

I'm also in favor of raising this limit. We do see some
datasets with higher than 1024 dims. I also think we need to
keep a limit. For example we currently need to keep all the
vectors in RAM while indexing and we want to be able to
support reasonable numbers of vectors in an index segment.
Also we don't know what innovations might come down the road.
Maybe someday we want to do product quantization and enforce
that (k, m) both fit in a byte -- we wouldn't be able to do
that if a vector's dimension were to exceed 32K.

On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti
 wrote:

I am also curious what would be the worst-case scenario
if we removed the constant altogether (so the limit
automatically becomes the Java Integer.MAX_VALUE).
i.e.
right now if you exceed the limit you get:

if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
  throw new IllegalArgumentException(
      "cannot index vectors with dimension greater than "
          + ByteVectorValues.MAX_DIMENSIONS);
}


in relation to:

These limits allow us to
better tune our data structures, prevent overflows,
help ensure we
have good test coverage, etc.

I agree 100% especially for typing stuff properly and
avoiding resource waste here and there, but I am not
entirely sure this is the case for the current
implementation i.e. do we have optimizations in place
that assume the max dimension to be 1024?
If I missed that (and I likely have), I of course suggest
the contribution should not just blindly remove the
limit, but do it appropriately.
I am not in favor of just doubling it as suggested by
some people, I would ideally prefer a solution that
remains there to a decent extent, rather than having to
modify it anytime someone requires a higher limit.

Cheers
--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> |
Twitter <https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
Github <https://github.com/seaseltd>


    On Fri, 31 Mar 2023 at 16:12, Michael Wechner
 wrote:

OpenAI reduced their size to 1536 dimensions

https://openai.com/blog/new-and-improved-embedding-model

so 2048 would work :-)

but other services do provide also higher dimensions
with sometimes
slightly better accuracy

Thanks

  

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-02 Thread Michael Wechner

btw, what was the reasoning to set the current limit to 1024?

Thanks

Michael

Am 01.04.23 um 14:47 schrieb Michael Sokolov:
I'm also in favor of raising this limit. We do see some datasets with 
higher than 1024 dims. I also think we need to keep a limit. For 
example we currently need to keep all the vectors in RAM while 
indexing and we want to be able to support reasonable numbers of 
vectors in an index segment. Also we don't know what innovations might 
come down the road. Maybe someday we want to do product quantization 
and enforce that (k, m) both fit in a byte -- we wouldn't be able to 
do that if a vector's dimension were to exceed 32K.


On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti 
 wrote:


I am also curious what would be the worst-case scenario if we
removed the constant altogether (so the limit automatically
becomes the Java Integer.MAX_VALUE).
i.e.
right now if you exceed the limit you get:

if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
  throw new IllegalArgumentException(
      "cannot index vectors with dimension greater than "
          + ByteVectorValues.MAX_DIMENSIONS);
}


in relation to:

These limits allow us to
better tune our data structures, prevent overflows, help ensure we
have good test coverage, etc.

I agree 100% especially for typing stuff properly and avoiding
resource waste here and there, but I am not entirely sure this is
the case for the current implementation i.e. do we have
optimizations in place that assume the max dimension to be 1024?
If I missed that (and I likely have), I of course suggest the
contribution should not just blindly remove the limit, but do it
appropriately.
I am not in favor of just doubling it as suggested by some people,
I would ideally prefer a solution that remains there to a decent
extent, rather than having to modify it anytime someone
requires a higher limit.

Cheers
--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
Github <https://github.com/seaseltd>


On Fri, 31 Mar 2023 at 16:12, Michael Wechner
 wrote:

OpenAI reduced their size to 1536 dimensions

https://openai.com/blog/new-and-improved-embedding-model

so 2048 would work :-)

but other services do provide also higher dimensions with
sometimes
slightly better accuracy

Thanks

Michael


Am 31.03.23 um 14:45 schrieb Adrien Grand:
> I'm supportive of bumping the limit on the maximum dimension for
> vectors to something that is above what the majority of
users need,
> but I'd like to keep a limit. We have limits for other
things like the
> max number of docs per index, the max term length, the max
number of
> dimensions of points, etc. and there are a few things that
we don't
> have limits on that I wish we had limits on. These limits
allow us to
> better tune our data structures, prevent overflows, help
ensure we
> have good test coverage, etc.
>
> That said, these other limits we have in place are quite
high. E.g.
> the 32kB term limit, nobody would ever type a 32kB term in a
text box.
> Likewise for the max of 8 dimensions for points: a segment
cannot
> possibly have 2 splits per dimension on average if it
doesn't have
> 512*2^(8*2)=34M docs, a sizable dataset already, so more
dimensions
> than 8 would likely defeat the point of indexing. In
contrast, our
> limit on the number of dimensions of vectors seems to be
under what
> some users would like, and while I understand the
performance argument
> against bumping the limit, it doesn't feel to me like
something that
> would be so bad that we need to prevent users from using
numbers of
> dimensions in the low thousands, e.g. top-k KNN searches
would still
> look at a very small subset of the full dataset.
>
> So overall, my vote would be to bump the limit to 2048 as
    suggested by
    > Mayya on the issue that you linked.
>
> On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner
>  wrote:
>> Thanks Alessandro for

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-03-31 Thread Michael Wechner

OpenAI reduced their size to 1536 dimensions

https://openai.com/blog/new-and-improved-embedding-model

so 2048 would work :-)

but other services do provide also higher dimensions with sometimes 
slightly better accuracy


Thanks

Michael


Am 31.03.23 um 14:45 schrieb Adrien Grand:

I'm supportive of bumping the limit on the maximum dimension for
vectors to something that is above what the majority of users need,
but I'd like to keep a limit. We have limits for other things like the
max number of docs per index, the max term length, the max number of
dimensions of points, etc. and there are a few things that we don't
have limits on that I wish we had limits on. These limits allow us to
better tune our data structures, prevent overflows, help ensure we
have good test coverage, etc.

That said, these other limits we have in place are quite high. E.g.
the 32kB term limit, nobody would ever type a 32kB term in a text box.
Likewise for the max of 8 dimensions for points: a segment cannot
possibly have 2 splits per dimension on average if it doesn't have
512*2^(8*2) = 512*65,536 ≈ 34M docs (512 points per BKD leaf, and
2 splits per dimension across 8 dimensions give 2^16 leaves), a
sizable dataset already, so more dimensions than 8 would likely
defeat the point of indexing. In contrast, our
limit on the number of dimensions of vectors seems to be under what
some users would like, and while I understand the performance argument
against bumping the limit, it doesn't feel to me like something that
would be so bad that we need to prevent users from using numbers of
dimensions in the low thousands, e.g. top-k KNN searches would still
look at a very small subset of the full dataset.

So overall, my vote would be to bump the limit to 2048 as suggested by
Mayya on the issue that you linked.

On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner
 wrote:

Thanks Alessandro for summarizing the discussion below!

I understand that there is no clear reasoning regarding what the best embedding 
size is, whereas I think heuristic approaches like the one described at the 
following link can be helpful

https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter

Having said this, we see various embedding services providing higher dimensions 
than 1024, like for example OpenAI, Cohere and Aleph Alpha.

And it would be great if we could run benchmarks without having to recompile 
Lucene ourselves.

Therefore I would suggest either increasing the limit or, even better, 
removing the limit and adding a disclaimer that people should be aware of 
possible crashes etc.

Thanks

Michael




Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:


I've been monitoring various discussions on Pull Requests about changing the 
max number of dimensions allowed for Lucene HNSW vectors:

https://github.com/apache/lucene/pull/12191

https://github.com/apache/lucene/issues/11507


I would like to set up a discussion and potentially a vote about this.

I have seen some strong opposition from a few people but a majority of favor in 
this direction.


Motivation

We were discussing in the Solr slack channel with Ishan Chattopadhyaya, Marcus 
Eagan, and David Smiley about some neural search integrations in Solr: 
https://github.com/openai/chatgpt-retrieval-plugin


Proposal

No hard limit at all.

As for many other Lucene areas, users will be allowed to push the system to the 
limit of their resources and get terrible performances or crashes if they want.


What we are NOT discussing

- Quality and scalability of the HNSW algorithm

- dimensionality reduction

- strategies to fit in an arbitrary self-imposed limit


Benefits

- users can use the models they want to generate vectors

- removal of an arbitrary limit that blocks some integrations


Cons

  - if you go for vectors with high dimensions, there's no guarantee you get 
acceptable performance for your use case



I want to keep it simple: right now, in many Lucene areas, you can push the 
system to unacceptable performance / crashes.

For example, we don't limit the number of docs per index to an arbitrary 
maximum of N: you push as many docs as you like, and if they are too many for 
your system, you get terrible performance/crashes/whatever.


Limits caused by primitive Java types will stay there behind the scenes, and 
that's acceptable, but I would prefer not to have arbitrary hard-coded ones 
that may limit the software's usability and integration, which is extremely 
important for a library.


I strongly encourage people to add benefits and cons that I missed (I am sure 
I missed some of them, but wanted to keep it simple).


Cheers

--
Alessandro Benedetti
Director @ Sease Ltd.
Apache Lucene/Solr Committer
Apache Solr PMC Member

e-mail: a.benede...@sease.io


Sease - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io
LinkedIn | Twitter | Youtube | Github








Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-03-31 Thread Michael Wechner

Thanks Alessandro for summarizing the discussion below!

I understand that there is no clear reasoning regarding what the best 
embedding size is, whereas I think heuristic approaches like the one 
described at the following link can be helpful


https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter

Having said this, we see various embedding services providing higher 
dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.


And it would be great if we could run benchmarks without having to 
recompile Lucene ourselves.


Therefore I would suggest either increasing the limit or, even better, 
removing the limit and adding a disclaimer that people should be aware of 
possible crashes etc.


Thanks

Michael




Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:


I've been monitoring various discussions on Pull Requests about 
changing the max number of dimensions allowed for Lucene HNSW vectors:


https://github.com/apache/lucene/pull/12191

https://github.com/apache/lucene/issues/11507


I would like to set up a discussion and potentially a vote about this.

I have seen some strong opposition from a few people but a majority of 
favor in this direction.



*Motivation*

We were discussing in the Solr slack channel with Ishan 
Chattopadhyaya, Marcus Eagan, and David Smiley about some neural 
search integrations in Solr: 
https://github.com/openai/chatgpt-retrieval-plugin



*Proposal*

No hard limit at all.

As for many other Lucene areas, users will be allowed to push the 
system to the limit of their resources and get terrible performances 
or crashes if they want.



*What we are NOT discussing*

- Quality and scalability of the HNSW algorithm

- dimensionality reduction

- strategies to fit in an arbitrary self-imposed limit


*Benefits*

- users can use the models they want to generate vectors

- removal of an arbitrary limit that blocks some integrations


*Cons*

 - if you go for vectors with high dimensions, there's no guarantee 
you get acceptable performance for your use case



I want to keep it simple: right now, in many Lucene areas, you can push 
the system to unacceptable performance / crashes.


For example, we don't limit the number of docs per index to an 
arbitrary maximum of N: you push as many docs as you like, and if they 
are too many for your system, you get terrible 
performance/crashes/whatever.



Limits caused by primitive Java types will stay there behind the 
scenes, and that's acceptable, but I would prefer not to have arbitrary 
hard-coded ones that may limit the software's usability and integration, 
which is extremely important for a library.



I strongly encourage people to add benefits and cons that I missed (I 
am sure I missed some of them, but wanted to keep it simple).



Cheers

--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io
LinkedIn | Twitter | Youtube | Github



Re: Lucene 9.5.0 release

2023-01-23 Thread Michael Wechner

thanks :-)

Am 23.01.23 um 12:31 schrieb Alessandro Benedetti:

Yes Luca, doing it right now!

For Michael, it's just a few getters.

Cheers
--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter 
<https://twitter.com/seaseltd> | Youtube 
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github 
<https://github.com/seaseltd>



On Mon, 23 Jan 2023 at 11:21, Luca Cavanna  
wrote:


Hi all,
I meant to start the release today and I see this PR is not merged
yet: https://github.com/apache/lucene/pull/12029 . Alessandro, do
you still plan on merging it shortly?

Thanks
Luca

On Sat, Jan 21, 2023 at 11:41 AM Michael Wechner
 wrote:

I tried to understand the issue described on github, but
unfortunately do not really understand it.

Can you explain a little more?

Thanks

Michael



Am 21.01.23 um 11:00 schrieb Alessandro Benedetti:

Hi,
this would be nice to have in 9.5 :
https://github.com/apache/lucene/issues/12099

It's a minor change (adding getters to KnnQuery) but can be
beneficial in Apache Solr as soon as possible.
Planning to merge in a few hours if no objections.
--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
Github <https://github.com/seaseltd>


On Thu, 19 Jan 2023 at 14:38, Luca Cavanna
 <mailto:l...@elastic.co.invalid> wrote:

Thanks Robert for the help with the github milestone.

I am planning on cutting the release branch on Monday if
there are no objections.

Cheers
Luca

On Tue, Jan 17, 2023 at 7:08 PM Robert Muir
 wrote:

+1 to release, thank you for volunteering to be RM!

I went thru 9.5 section of CHANGES.txt and tagged all
the GH issues in
there with milestone too, if they didn't already have
it. It looks
even bigger now.

On Fri, Jan 13, 2023 at 4:54 AM Luca Cavanna
 wrote:
>
> Hi all,
> I'd like to propose that we release Lucene 9.5.0.
There is a decent amount of changes that would go
into it looking at the github milestone:
https://github.com/apache/lucene/milestone/4 . I'd
volunteer to be the release manager. There is one PR
open listed for the 9.5 milestone:
https://github.com/apache/lucene/pull/11873 . Is this
something that we do want to address before we
release? Is anybody aware of outstanding work that we
would like to include or known blocker issues that
are not listed in the 9.5 milestone?
>
> Cheers
> Luca
>
>
>
>







Re: Lucene 9.5.0 release

2023-01-21 Thread Michael Wechner
I tried to understand the issue described on github, but unfortunately 
do not really understand it.


Can you explain a little more?

Thanks

Michael



Am 21.01.23 um 11:00 schrieb Alessandro Benedetti:

Hi,
this would be nice to have in 9.5 :
https://github.com/apache/lucene/issues/12099

It's a minor change (adding getters to KnnQuery) but can be beneficial in 
Apache Solr as soon as possible.

Planning to merge in a few hours if no objections.
--
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io/
/

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io
LinkedIn | Twitter | Youtube | Github




On Thu, 19 Jan 2023 at 14:38, Luca Cavanna  
wrote:


Thanks Robert for the help with the github milestone.

I am planning on cutting the release branch on Monday if there are
no objections.

Cheers
Luca

On Tue, Jan 17, 2023 at 7:08 PM Robert Muir  wrote:

+1 to release, thank you for volunteering to be RM!

I went thru 9.5 section of CHANGES.txt and tagged all the GH
issues in
there with milestone too, if they didn't already have it. It looks
even bigger now.

On Fri, Jan 13, 2023 at 4:54 AM Luca Cavanna
 wrote:
>
> Hi all,
> I'd like to propose that we release Lucene 9.5.0. There is a
decent amount of changes that would go into it looking at the
github milestone: https://github.com/apache/lucene/milestone/4
. I'd volunteer to be the release manager. There is one PR
open listed for the 9.5 milestone:
https://github.com/apache/lucene/pull/11873 . Is this
something that we do want to address before we release? Is
anybody aware of outstanding work that we would like to
include or known blocker issues that are not listed in the 9.5
milestone?
>
> Cheers
> Luca
>
>
>
>




Re: Release Lucene 9.4.2

2022-11-09 Thread Michael Wechner

Thank you! +1 :-)

Am 09.11.22 um 16:38 schrieb Adrien Grand:

Hello all,

A bad integer overflow  
has been discovered in the KNN vectors format, which affects segments 
that have more than ~16M vectors. I'd like to do a bugfix release when 
the bug is fixed and we have a test for such large datasets 
of KNN vectors. I volunteer to be the RM for this release.


--
Adrien


Re: Raising the Value of MAX_DIMENSIONS of Vector Values

2022-10-20 Thread Michael Wechner

Hi Together

Any news on the MAX_DIMENSIONS discussion?

https://github.com/apache/lucene/issues/11507

I just implemented Cohere.ai embeddings and Cohere is offering

small: 1024
medium: 2048
large: 4096

whereas Cohere has a nice demo described at

https://txt.cohere.ai/building-a-search-based-discord-bot-with-language-models/

whereas I am not sure which model they are using for the demo.

Thanks

Michael


Am 09.08.22 um 21:56 schrieb Julie Tibshirani:
Thank you Marcus for raising this, it's an important topic! On the 
issue you filed, Mike pointed to the JIRA ticket where we've been 
discussing this (https://issues.apache.org/jira/browse/LUCENE-10471) 
and suggested commenting with the embedding models you've heard about 
from users. This seems like a good idea to me too -- looking forward 
to discussing more on that JIRA issue. (Unless we get caught in the 
middle of the migration -- then we'll discuss once it's been moved to 
GitHub!)


Julie

On Mon, Aug 8, 2022 at 10:05 PM Michael Wechner 
 wrote:


I agree that Lucene should support vector sizes depending on the
model one is choosing.

For example Weaviate seems to do this

https://weaviate.slack.com/archives/C017EG2SL3H/p1659981294040479

Thanks

Michael


Am 07.08.22 um 22:48 schrieb Marcus Eagan:

Hi Lucene Team,

In general, I have advised very strongly against our team at
MongoDB modifying the Lucene source, except in scenarios where we
have strong needs for a particular customization. Ultimately,
people can do what they would like to do.

That being said, we have a number of customers preparing to use
Lucene for dense vector search. There are many language models
that are optimized for > 1024 dimensions. I remember Michael
Wechner's email
<https://www.mail-archive.com/dev@lucene.apache.org/msg314281.html>
about one instance with OpenAI.

I just tried to test the OpenAI model
"text-similarity-davinci-001" with 12288 dimension


It seems that customers who attempt to use these models should
not be turned away. It could be sufficient to explain the issues.
The only ones I have identified are two expected ones, very
slow indexing throughput and high CPU usage, and a maybe less
well-defined risk of more numerical errors.

I opened an issue <https://github.com/apache/lucene/issues/1060>
and PR <https://github.com/apache/lucene/pull/1061> for the
discussion as well. I would appreciate guidance on where we think
the warning should go. I feel like burying in a Javadoc is a
less than ideal experience. It would be better to be a warning on
startup. In the PR, I increased the max limit by a factor of
twenty. We should let users use the system based on their needs
even if it was designed or optimized for the models they bring
because we need the feedback and the data from the world.

Is there something I'm overlooking from a risk standpoint?

Best,
-- 
Marcus Eagan






Re: call for 9.4.1 release (bug in vectors format)

2022-10-18 Thread Michael Wechner

+1 :-)

Thanks

Michael

Am 18.10.22 um 19:52 schrieb Julie Tibshirani:

Hi everyone,

We recently discovered a severe bug in the 9.4 release in the kNN 
vectors format: https://github.com/apache/lucene/issues/11858. 
Explaining the problem: when ingesting a lot of data, or when 
performing a force merge, segments can grow large. The format 
validation code accidentally uses an int instead of a long to compute 
the data size, so it can fail on these large segments. When format 
validation fails, the segment is essentially lost and unusable. For 
some client systems like Elasticsearch, it can send the whole index 
into a "failed" state, blocking further writes or searches.


I think this bug is sufficiently bad that we should perform a 9.4.1 
release as soon as possible. The fix is just an update to the 
read-side validation code, there won't be any effect on the data 
format. This means it is safe to merge the fix into the existing 9.4 
vectors format. The bug was introduced during the work to add 
quantization (https://github.com/apache/lucene/pull/1054) and does not 
affect versions before 9.4.


Let me know what you think! I could serve as release manager. (We 
should also follow up with a plan to prevent this from happening in 
the future -- maybe we need to regularly run larger-scale benchmarks?)


Julie
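
(A minimal sketch of the bug class described above -- the numbers and
variable names are illustrative, not the actual Lucene validation code:)

int numVectors = 20_000_000;   // a large force-merged segment
int bytesPerVector = 4 * 768;  // 768-dimensional float vectors
long wrong = numVectors * bytesPerVector;         // int multiply overflows first,
                                                  // then the bad value is widened
long right = (long) numVectors * bytesPerVector;  // widen before multiplying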






Re: Raising the Value of MAX_DIMENSIONS of Vector Values

2022-08-08 Thread Michael Wechner
I agree that Lucene should support vector sizes depending on the model 
one is choosing.


For example Weaviate seems to do this

https://weaviate.slack.com/archives/C017EG2SL3H/p1659981294040479

Thanks

Michael


Am 07.08.22 um 22:48 schrieb Marcus Eagan:

Hi Lucene Team,

In general, I have advised very strongly against our team at MongoDB 
modifying the Lucene source, except in scenarios where we have strong 
needs for a particular customization. Ultimately, people can do what 
they would like to do.


That being said, we have a number of customers preparing to use Lucene 
for dense vector search. There are many language models that are 
optimized for > 1024 dimensions. I remember Michael Wechner's email 
about one instance with OpenAI.


I just tried to test the OpenAI model
"text-similarity-davinci-001" with 12288 dimension


It seems that customers who attempt to use these models should not be 
turned away. It could be sufficient to explain the issues. The only 
ones I have identified are two expected ones, very slow indexing 
throughput and high CPU usage, and a maybe less well-defined risk of 
more numerical errors.


I opened an issue and 
a PR for the discussion as 
well. I would appreciate guidance on where we think the warning should 
go. I feel like burying it in a Javadoc is a less than ideal experience. 
It would be better to have a warning on startup. In the PR, I increased 
the max limit by a factor of twenty. We should let users use the 
system based on their needs even if it was designed or optimized for 
the models they bring because we need the feedback and the data from 
the world.


Is there something I'm overlooking from a risk standpoint?

Best,
--
Marcus Eagan



Re: Generate autocomplete predictions

2022-03-14 Thread Michael Wechner

Hi Adrien

Ok :-)

I think I will try to do a very rough prototype first, just to get a 
better idea and then use it for discussion in JIRA.


Thanks

Michael

Am 14.03.22 um 08:19 schrieb Adrien Grand:

Hey Michael,

I like discussing ideas in JIRA first, but sometimes attaching a rough 
prototype can help show how things tie together. I guess the thing you 
want to avoid is to spend hours on the prototype but otherwise either 
is fine.


Le dim. 13 mars 2022, 23:01, Michael Wechner 
 a écrit :


Hi Adrien

Thanks for your feedback!

 From a "project management" point of view how do I best do this?

Should I just create a Pull Request with a first prototype, or
discuss
the design first in Jira tickets?

Thanks

Michael



Am 13.03.22 um 21:52 schrieb Adrien Grand:
> Hi Michael,
>
> This sounds like a good fit for Lucene to me.
>
> On Fri, Mar 11, 2022 at 11:15 PM Michael Wechner
>  wrote:
>> Hi
>>
>> I recently implemented auto-suggest based on
>>
>> https://lucene.apache.org/core/9_0_0/suggest/index.html
>>
>> whereas I am currently managing the terms / predictions (e.g.
>> "autocompletion using lucene suggesters dev") contained by the
index
>> manually.
>>
>> I would like now to generate the terms / predictions more
automatically,
>> similar to what Google does
>>
>>

https://blog.google/products/search/how-google-autocomplete-predictions-work/
>>
>> Does Lucene provide code to analyze queries in order to
generate terms /
>> predictions for an auto-suggest index?
>>
>> If not, would it make sense to contribute this kind of
functionality to
>> Lucene or should this be rather a third-party library?
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>





Re: Generate autocomplete predictions

2022-03-13 Thread Michael Wechner

Hi Adrien

Thanks for your feedback!

From a "project management" point of view how do I best do this?

Should I just create a Pull Request with a first prototype, or discuss 
the design first in Jira tickets?


Thanks

Michael



Am 13.03.22 um 21:52 schrieb Adrien Grand:

Hi Michael,

This sounds like a good fit for Lucene to me.

On Fri, Mar 11, 2022 at 11:15 PM Michael Wechner
 wrote:

Hi

I recently implemented auto-suggest based on

https://lucene.apache.org/core/9_0_0/suggest/index.html

whereas I am currently managing the terms / predictions (e.g.
"autocompletion using lucene suggesters dev") contained by the index
manually.

I would like now to generate the terms / predictions more automatically,
similar to what Google does

https://blog.google/products/search/how-google-autocomplete-predictions-work/

Does Lucene provide code to analyze queries in order to generate terms /
predictions for an auto-suggest index?

If not, would it make sense to contribute this kind of functionality to
Lucene or should this be rather a third-party library?

Thanks

Michael










Generate autocomplete predictions

2022-03-11 Thread Michael Wechner

Hi

I recently implemented auto-suggest based on

https://lucene.apache.org/core/9_0_0/suggest/index.html

whereas I am currently managing the terms / predictions (e.g. 
"autocompletion using lucene suggesters dev") contained by the index 
manually.
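
(For context, a minimal sketch of such a manually fed suggester -- the
file name, entries and lookup key are made up for illustration:)

import java.io.FileInputStream;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.suggest.FileDictionary;
import org.apache.lucene.search.suggest.Lookup;
import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester;
import org.apache.lucene.store.FSDirectory;

AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(
    FSDirectory.open(Paths.get("suggest-index")), new StandardAnalyzer());
// predictions.txt: one "prediction<TAB>weight" entry per line, managed by hand
suggester.build(new FileDictionary(new FileInputStream("predictions.txt")));
for (Lookup.LookupResult result : suggester.lookup("autocompl", false, 5)) {
  System.out.println(result.key + " (weight=" + result.value + ")");
}
suggester.close();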


I would like now to generate the terms / predictions more automatically, 
similar to what Google does


https://blog.google/products/search/how-google-autocomplete-predictions-work/

Does Lucene provide code to analyze queries in order to generate terms / 
predictions for an auto-suggest index?


If not, would it make sense to contribute this kind of functionality to 
Lucene or should this be rather a third-party library?


Thanks

Michael




Re: Lucene 9.1 release soon?

2022-02-23 Thread Michael Wechner

I think this would be great :-) thank you very much for your efforts!

Michael

Am 24.02.22 um 00:28 schrieb Julie Tibshirani:

Hello everyone,

Would there be support for releasing Lucene 9.1 soon? It has been ~2.5 
months since 9.0 was released and we already have a long list of new 
features, optimizations, and bug fixes 
(https://github.com/apache/lucene/blob/branch_9x/lucene/CHANGES.txt).


If so, I am happy to take a shot at being release manager. I did not 
see any issues marked "blocker", but please let me know if there are any.


Julie






Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-20 Thread Michael Wechner
btw, I have done some tests now with the sentence-transformer models 
"all-roberta-large-v1" and "all-mpnet-base-v2"


https://huggingface.co/sentence-transformers/all-roberta-large-v1
https://huggingface.co/sentence-transformers/all-mpnet-base-v2

whereas also see https://www.sbert.net/docs/pretrained_models.html

With the following input/search question

"How old have you been last year?"

I receive the following cosine distances with "all-mpnet-base-v2" (768) 
for the previously indexed vectors (questions)


0.22234131087379294   How old are you this year?
0.2235891372002562    What was your age last year?
0.4337717812264763    How old are you?
0.4557796164007806    What is your age?

and with "all-roberta-large-v1" (1024)

0.25013378528376184   How old are you this year?
0.2715761666421139    What was your age last year?
0.4658360947506338    What is your age?
0.4859953687958164    How old are you?

So both models do not "understand" the question.
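
(For reference, a minimal sketch of how such pairwise scores can be
computed with Lucene's primitives -- embed() stands in for a hypothetical
call into the sentence-transformer model:)

import org.apache.lucene.index.VectorSimilarityFunction;

float[] query = embed("How old have you been last year?");  // hypothetical
float[] candidate = embed("What was your age last year?");  // hypothetical
// compare() returns a similarity score; the distances above would then
// be derived from the raw cosine, e.g. as 1 - cosine
float score = VectorSimilarityFunction.COSINE.compare(query, candidate);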

As Alessandro suggested a "well-curated fine-tuning step" might improve 
this, whereas I have not been able to try this yet.


Thanks

Michael

Am 14.02.22 um 22:02 schrieb Michael Wechner:

Hi Julie

Thanks again for your feedback!

I will do some more tests with "all-mpnet-base-v2" (768) and 
"all-roberta-large-v1" (1024), so 1024 is enough for me for the moment :-)


But yes, I could imagine, that eventually it might make sense to allow 
more dimensions than 1024.


Besides memory and "CPU", are there other limiting factors regarding more 
dimensions?


Thanks

Michael

Am 14.02.22 um 21:53 schrieb Julie Tibshirani:
Hello Michael, the max number of dimensions is currently hardcoded 
and can't be changed. I could see an argument for increasing the 
default a bit and would be happy to discuss if you'd like to file a 
JIRA issue. However, 12288 dimensions still seems high to me; this is 
much larger than most well-established embedding models and could 
require a lot of memory.


Julie

On Mon, Feb 14, 2022 at 12:08 PM Michael Wechner 
 wrote:


Hi Julie

Thanks very much for this link, which is very interesting!

Btw, do you have an idea how to increase the default max size of
1024?

https://lists.apache.org/thread/hyb6w5c4x5rjt34k3w7zqn3yp5wvf33o

Thanks

Michael



Am 14.02.22 um 17:45 schrieb Julie Tibshirani:

Hello Michael, I don't have personal experience with these
models, but I found this article insightful:

https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9.
It evaluates the OpenAI models against a variety of existing
models on tasks like sentence similarity and text retrieval.
Although the other models are cheaper and have fewer dimensions,
the OpenAI ones perform similarly or worse. This got me thinking
that they might not be a good cost/effectiveness trade-off,
especially the larger ones with 4096 or 12288 dimensions.

Julie

On Sun, Feb 13, 2022 at 1:55 AM Michael Wechner
 wrote:

Re the OpenAI embedding the following recent paper might be
of interest

https://arxiv.org/pdf/2201.10005.pdf

(Text and Code Embeddings by Contrastive Pre-Training, Jan
24, 2022)

Thanks

Michael

Am 13.02.22 um 00:14 schrieb Michael Wechner:

Here a concrete example where I combine OpenAI model
"text-similarity-ada-001" with Lucene vector search

INPUT sentence: "What is your age this year?"

Result sentences

1) How old are you this year?
   score '0.98860765'

2) What was your age last year?
   score '0.97811764'

3) What is your age?
   score '0.97094905'

4) How old are you?
   score '0.9600177'


Result 1 is great and result 2 looks similar, but is not
correct from an "understanding" point of view and results 3
and 4 are good again.

I understand "similarity" is not the same as
"understanding", but I hope it makes it clearer what I am
looking for :-)

Thanks

Michael



Am 12.02.22 um 22:38 schrieb Michael Wechner:

Hi Alessandro

I am mainly interested in detecting similarity, for
example whether the following two sentences are similar
resp. likely to mean the same thing

"How old are you?"
"What is your age?"

and that the following two sentences are not similar,
resp. do not mean the same thing

"How old are you this year?"
"How old have you been last year?"

But also performance or how OpenAI embeddings compare for
example with SBERT
(https:/

Re: How to Increase max vector size?

2022-02-17 Thread Michael Wechner

Not at the moment :-)

I am using Lucene's vector search for https://ukatie.com to detect 
duplicated questions. I am currently refactoring it such that 
you can connect Katie with your own similarity search implementation, 
and I have done a very first prototype of a connector for Weaviate


https://github.com/wyona/spring-boot-hello-world-rest/blob/master/src/main/java/org/wyona/webapp/controllers/v2/KatieMockupConnectorController.java

Weaviate itself is now supporting the OpenAI embeddings and I wanted to 
see how well this works together with Lucene, whereas I would like to 
make the embeddings configurable.
So far the Katie Lucene implementation supports the various sbert 
transformer models https://www.sbert.net/docs/pretrained_models.html and 
OpenAI text-similarity-ada-001


I will need some more time for the refactoring, but will make the Lucene 
connector available under the Apache license.


Thanks

Michael

Am 16.02.22 um 19:51 schrieb Michael Sokolov:

Fair enough - are you planning to offer such a service? ;) Sounds exciting!

-Mike

On Tue, Feb 15, 2022 at 6:00 PM Michael Wechner 
 wrote:


true :-) when you are the one controlling the input of vectors,
then a method to disable the maximum limit would be sufficient.

But I could imagine when you offer Lucene as a service where
people can for example configure their own "sentence embedding
models" and you would like to offer a different maximum limit than
the default of 1024, then I think a method to reset the maximum
limit would make sense. Examples could be a service of OpenAI or
vector search databases like for example Weaviate or Pinecone.

Thanks

Michael




Am 15.02.22 um 23:34 schrieb Michael Sokolov:

I don't think it makes sense to have a static variable maximum
that you can change by calling a method. What purpose would it
serve?

On Tue, Feb 15, 2022, 2:39 PM Michael Wechner
 wrote:

Hi Alessandro

No, I have not created a Jira ticket, but I would be happy to
create one, just let me know or please feel free to create one.

I understand the concerns about the limits in general and I
think it makes sense to have a default max dimensions limit,
but I could imagine it needs to be increased eventually and
being able to increase it programmatically and at your own
risk will help people using Lucene.

Thanks

Michael

Am 15.02.22 um 19:22 schrieb Alessandro Benedetti:

Hi Michael,
let's create a Jira ticket to use a higher value(if you
haven't already).
I would be happy to consider the patch/or do it myself but
after 10/03.
Once the pull request is ready (including the Javadoc
documentation that clearly states that if you go above X
it's at your own risk), we'll involve also Michael Sokolov
and the other committers familiar with this area of the code.

Cheers

--
Alessandro Benedetti
Apache Lucene/Solr PMC member and Committer
Director, R Software Engineer, Search Consultant

www.sease.io <http://www.sease.io>


On Sat, 12 Feb 2022 at 22:53, Michael Wechner
 wrote:

Hi

I just tried to test the OpenAI model
"text-similarity-davinci-001" with 12288 dimensions and
receive the following error

java.lang.IllegalArgumentException: vector numDimensions
must be <= VectorValues.MAX_DIMENSIONS (=1024); got 12288
    at

org.apache.lucene.document.FieldType.setVectorDimensionsAndSimilarityFunction(FieldType.java:381)
~[lucene-core-9.0.0.jar:9.0.0
0b18b3b965cedaf5eb129aa41243a44c83ca826d - jpountz -
2021-12-01 14:23:49]
    at

org.apache.lucene.document.KnnVectorField.createFieldType(KnnVectorField.java:69)
~[lucene-core-9.0.0.jar:9.0.0
0b18b3b965cedaf5eb129aa41243a44c83ca826d - jpountz -
2021-12-01 14:23:49]

IIUC I can not increase programmatically the max vector
size which is set inside
lucene/core/src/java/org/apache/lucene/index/VectorValues.java


  public static int MAX_DIMENSIONS = 1024;

right?

I guess I could rebuild Lucene with a greater size or
what are the possibilities to increase the max vector size?

Thanks

Michael
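
(A minimal sketch of the indexing call that trips this check -- the field
name and vector values are illustrative:)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;

float[] embedding = new float[12288];  // e.g. a text-similarity-davinci-001 vector
Document doc = new Document();
// throws IllegalArgumentException: vector numDimensions must be <= 1024
doc.add(new KnnVectorField("embedding", embedding, VectorSimilarityFunction.COSINE));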








Re: How to Increase max vector size?

2022-02-15 Thread Michael Wechner
true :-) when you are the one controlling the input of vectors, then a 
method to disable the maximum limit would be sufficient.


But I could imagine when you offer Lucene as a service where people can 
for example configure their own "sentence embedding models" and you 
would like to offer a different maximum limit than the default of 1024, 
then I think a method to reset the maximum limit would make sense. 
Examples could be a service of OpenAI or vector search databases like 
for example Weaviate or Pinecone.


Thanks

Michael




Am 15.02.22 um 23:34 schrieb Michael Sokolov:
I don't think it makes sense to have a static variable maximum that 
you can change by calling a method. What purpose would it serve?


On Tue, Feb 15, 2022, 2:39 PM Michael Wechner 
 wrote:


Hi Alessandro

No, I have not created a Jira ticket, but I would be happy to
create one, just let me know or please feel free to create one.

I understand the concerns about the limits in general and I think
it makes sense to have a default max dimensions limit, but I could
imagine it needs to be increased eventually and being able to
increase it programmatically and at your own risk will help people
using Lucene.

Thanks

Michael

Am 15.02.22 um 19:22 schrieb Alessandro Benedetti:

Hi Michael,
let's create a Jira ticket to use a higher value(if you haven't
already).
I would be happy to consider the patch/or do it myself but after
10/03.
Once the pull request is ready (including the Javadoc
documentation that clearly states that if you go above X it's at
your own risk), we'll involve also Michael Sokolov and the other
committers familiar with this area of the code.

Cheers

--
Alessandro Benedetti
Apache Lucene/Solr PMC member and Committer
Director, R Software Engineer, Search Consultant

www.sease.io <http://www.sease.io>


On Sat, 12 Feb 2022 at 22:53, Michael Wechner
 wrote:

Hi

I just tried to test the OpenAI model
"text-similarity-davinci-001" with 12288 dimensions and
receive the following error

java.lang.IllegalArgumentException: vector numDimensions must
be <= VectorValues.MAX_DIMENSIONS (=1024); got 12288
    at

org.apache.lucene.document.FieldType.setVectorDimensionsAndSimilarityFunction(FieldType.java:381)
~[lucene-core-9.0.0.jar:9.0.0
0b18b3b965cedaf5eb129aa41243a44c83ca826d - jpountz -
2021-12-01 14:23:49]
    at

org.apache.lucene.document.KnnVectorField.createFieldType(KnnVectorField.java:69)
~[lucene-core-9.0.0.jar:9.0.0
0b18b3b965cedaf5eb129aa41243a44c83ca826d - jpountz -
2021-12-01 14:23:49]

IIUC I can not increase programmatically the max vector size
which is set inside
lucene/core/src/java/org/apache/lucene/index/VectorValues.java

  public static int MAX_DIMENSIONS = 1024;

right?

I guess I could rebuild Lucene with a greater size or what
are the possibilities to increase the max vector size?

Thanks

Michael






Re: How to Increase max vector size?

2022-02-15 Thread Michael Wechner

Hi Alessandro

No, I have not created a Jira ticket, but I would be happy to create 
one, just let me know or please feel free to create one.


I understand the concerns about the limits in general and I think it 
makes sense to have a default max dimensions limit, but I could imagine 
it needs to be increased eventually and being able to increase it 
programmatically and at your own risk will help people using Lucene.


Thanks

Michael

Am 15.02.22 um 19:22 schrieb Alessandro Benedetti:

Hi Michael,
let's create a Jira ticket to use a higher value(if you haven't already).
I would be happy to consider the patch/or do it myself but after 10/03.
Once the pull request is ready (including the Javadoc documentation 
that clearly states that if you go above X it's at your own risk), 
we'll involve also Michael Sokolov and the other committers familiar 
with this area of the code.


Cheers

--
Alessandro Benedetti
Apache Lucene/Solr PMC member and Committer
Director, R Software Engineer, Search Consultant

www.sease.io <http://www.sease.io>


On Sat, 12 Feb 2022 at 22:53, Michael Wechner 
 wrote:


Hi

I just tried to test the OpenAI model
"text-similarity-davinci-001" with 12288 dimensions and receive
the following error

java.lang.IllegalArgumentException: vector numDimensions must be
<= VectorValues.MAX_DIMENSIONS (=1024); got 12288
    at

org.apache.lucene.document.FieldType.setVectorDimensionsAndSimilarityFunction(FieldType.java:381)
~[lucene-core-9.0.0.jar:9.0.0
0b18b3b965cedaf5eb129aa41243a44c83ca826d - jpountz - 2021-12-01
14:23:49]
    at

org.apache.lucene.document.KnnVectorField.createFieldType(KnnVectorField.java:69)
~[lucene-core-9.0.0.jar:9.0.0
0b18b3b965cedaf5eb129aa41243a44c83ca826d - jpountz - 2021-12-01
14:23:49]

IIUC I can not increase programmatically the max vector size which
is set inside
lucene/core/src/java/org/apache/lucene/index/VectorValues.java

  public static int MAX_DIMENSIONS = 1024;

right?

I guess I could rebuild Lucene with a greater size or what are the
possibilities to increase the max vector size?

Thanks

Michael




Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Michael Wechner



Am 15.02.22 um 19:48 schrieb Robert Muir:
Sure, but Lucene should be able to have limits. We have this 
discussion with every single limit we attempt to implement :)
There will always be extreme use cases using too many dimensions or 
whatever.
It is open source! I think if what you are doing is strange enough, 
you can modify the sources.


sure :-)



Personally, I'm concerned about increasing this limit: things are 
quite slow already with hundreds of dimensions.


In my particular use case the performance is not the most important, but 
rather the quality of the result.


But as Julie pointed out with 
https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9 
more dimensions do not necessarily create better results, at least it 
seems to be like this in the case of sentence embeddings.


I could imagine though, that there might be other use cases where more 
dimensions do make a difference, but then again we can of course wait 
until this actually happens


There seems to be no light at the end of the tunnel for the JDK vector 
api, I think OpenJDK will incubate this API until the sun supernovas 
and java is dead :)
It is frustrating, as that could give current implementation a needed 
performance boost on basically any hardware.


I guess you mean https://openjdk.java.net/jeps/338 right?




Also, I'm concerned about increasing limit while HNSW is the only 
implementation. I'd like us to keep the door open to alternative 
algorithms that might have better performance.


It would be great if Lucene would provide alternative algorithms in the 
future and one can choose the algorithm based on one's requirements


Thanks

Michael




On Tue, Feb 15, 2022 at 12:21 PM Michael Wechner 
 wrote:


I understand, but if Lucene itself allowed overriding the
default max size programmatically, then I think it should be clear
that you do this at your own risk :-)

Thanks for the links to your blog posts, which sound very interesting.

Thanks

Michael

Am 15.02.22 um 17:25 schrieb Alessandro Benedetti:

I believe it could make sense, but as Michael pointed out in the
Jira ticket related to the Solr integration, then we'll get
complaints like "I set it to 1.000.000 and my Solr instance
doesn't work anymore" (I kept everything super simple just to
simulate a realistic scenario).
So I tend to agree to keep it to 1024 at the moment and
potentially extend it (providing some benchmark on common machines
as a reference to justify the increase).

In terms of your original question, how are you
training/fine-tuning your models?
Using pre-trained language models probably won't help you that
much; on top of that, queries are short, so you may require a
well-curated fine-tuning step.
We have a series of blog posts on that, and one is coming soon:
https://sease.io/2021/12/using-bert-to-improve-search-relevance.html

https://sease.io/2022/01/tackling-vocabulary-mismatch-with-document-expansion.html

Cheers
--
Alessandro Benedetti
Apache Lucene/Solr PMC member and Committer
Director, R Software Engineer, Search Consultant

www.sease.io <http://www.sease.io>


On Tue, 15 Feb 2022 at 09:10, Michael Wechner
 wrote:

fair enough, but wouldn't it make sense if one could increase it
programmatically, e.g.

.setVectorMaxDimension(2028)

?

Thanks

Michael


Am 14.02.22 um 23:34 schrieb Michael Sokolov:
> I think we picked the 1024 number as something that seemed
so large
> nobody would ever want to exceed it! Obviously that was
naive. Still
> the limit serves as a cautionary point for users; if your
vectors are
> bigger than this, there is probably a better way to
accomplish what
> you are after (eg better off-line training to reduce
dimensionality).
> Is 1024 the magic number? Maybe not, but before increasing
I'd like to
> see some strong evidence that bigger vectors than that are
indeed
> useful as part of a search application using Lucene.
>
> -Mike
>
> On Mon, Feb 14, 2022 at 5:08 PM Julie Tibshirani
 wrote:
>> Sounds good, hope the testing goes well! Memory and CPU
(largely from more expensive vector distance calculations)
are indeed the main factors to consider.
>>
    >> Julie
>>
>> On Mon, Feb 14, 2022 at 1:02 PM Michael Wechner
 wrote:
>>> Hi Julie
>>>
>>> Thanks again for your feedback!
>>>
>>> I will do some more tests with "all-mpnet-base-v2" (768)
  

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Michael Wechner
I understand, but if Lucene itself allowed overriding the default 
max size programmatically, then I think it should be clear that you do 
this at your own risk :-)


Thanks for the links to your blog posts, which sound very interesting.

Thanks

Michael

Am 15.02.22 um 17:25 schrieb Alessandro Benedetti:
I believe it could make sense, but as Michael pointed out in the Jira 
ticket related to the Solr integration, then we'll get complaints like 
"I set it to 1.000.000 and my Solr instance doesn't work anymore" (I 
kept everything super simple just to simulate a realistic scenario).
So I tend to agree to keep it to 1024 at the moment and potentially 
extend it (providing some benchmark on common machines as a reference 
to justify the increase).


In terms of your original question, how are you 
training/fine-tuning your models?
Using pre-trained language models probably won't help you that much; 
on top of that, queries are short, so you may require a well-curated 
fine-tuning step.

We have a series of blog posts on that, and one is coming soon:
https://sease.io/2021/12/using-bert-to-improve-search-relevance.html
https://sease.io/2022/01/tackling-vocabulary-mismatch-with-document-expansion.html

Cheers
--
Alessandro Benedetti
Apache Lucene/Solr PMC member and Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io <http://www.sease.io>


On Tue, 15 Feb 2022 at 09:10, Michael Wechner 
 wrote:


fair enough, but wouldn't it make sense that one can increase it
programmatically, e.g.

.setVectorMaxDimension(2028)

?

Thanks

Michael


Am 14.02.22 um 23:34 schrieb Michael Sokolov:
> I think we picked the 1024 number as something that seemed so large
> nobody would ever want to exceed it! Obviously that was naive. Still
> the limit serves as a cautionary point for users; if your
vectors are
> bigger than this, there is probably a better way to accomplish what
> you are after (eg better off-line training to reduce
dimensionality).
> Is 1024 the magic number? Maybe not, but before increasing I'd
like to
> see some strong evidence that bigger vectors than that are indeed
> useful as part of a search application using Lucene.
>
> -Mike
>
> On Mon, Feb 14, 2022 at 5:08 PM Julie Tibshirani
 wrote:
>> Sounds good, hope the testing goes well! Memory and CPU
(largely from more expensive vector distance calculations) are
indeed the main factors to consider.
>>
    >> Julie
>>
>> On Mon, Feb 14, 2022 at 1:02 PM Michael Wechner
 wrote:
>>> Hi Julie
>>>
>>> Thanks again for your feedback!
>>>
>>> I will do some more tests with "all-mpnet-base-v2" (768) and
"all-roberta-large-v1" (1024), so 1024 is enough for me for the
moment :-)
>>>
>>> But yes, I could imagine that eventually it might make sense
to allow more dimensions than 1024.
>>>
>>> Besides memory and "CPU", are there other limiting factors re
more dimensions?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> Am 14.02.22 um 21:53 schrieb Julie Tibshirani:
>>>
>>> Hello Michael, the max number of dimensions is currently
hardcoded and can't be changed. I could see an argument for
increasing the default a bit and would be happy to discuss if
you'd like to file a JIRA issue. However 12288 dimensions still
seems high to me, this is much larger than most well-established
embedding models and could require a lot of memory.
>>>
>>> Julie
>>>
>>> On Mon, Feb 14, 2022 at 12:08 PM Michael Wechner
 wrote:
>>>> Hi Julie
>>>>
>>>> Thanks very much for this link, which is very interesting!
>>>>
>>>> Btw, do you have an idea how to increase the default max size
of 1024?
>>>>
>>>> https://lists.apache.org/thread/hyb6w5c4x5rjt34k3w7zqn3yp5wvf33o
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>>
>>>>
>>>> Am 14.02.22 um 17:45 schrieb Julie Tibshirani:
>>>>
>>>> Hello Michael, I don't have personal experience with these
models, but I found this article insightful:

https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9.
It evaluates the OpenAI models against a variety of existing
models on tasks like sentence similarity and text retrieval.

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Michael Wechner
fair enough, but wouldn't it make sense that one can increase it 
programmatically, e.g.


.setVectorMaxDimension(2028)

?

Thanks

Michael
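(To make the proposal concrete, here is a purely hypothetical sketch of how such an override could look on the application side. Neither setVectorMaxDimension nor any equivalent setter exists in Lucene today; the name is made up for illustration only.)

IndexWriterConfig config = new IndexWriterConfig(analyzer);
// Hypothetical setter: raise the max vector dimension, explicitly at your own risk.
config.setVectorMaxDimension(2028);
IndexWriter writer = new IndexWriter(directory, config);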


Am 14.02.22 um 23:34 schrieb Michael Sokolov:

I think we picked the 1024 number as something that seemed so large
nobody would ever want to exceed it! Obviously that was naive. Still
the limit serves as a cautionary point for users; if your vectors are
bigger than this, there is probably a better way to accomplish what
you are after (eg better off-line training to reduce dimensionality).
Is 1024 the magic number? Maybe not, but before increasing I'd like to
see some strong evidence that bigger vectors than that are indeed
useful as part of a search application using Lucene.

-Mike

On Mon, Feb 14, 2022 at 5:08 PM Julie Tibshirani  wrote:

Sounds good, hope the testing goes well! Memory and CPU (largely from more 
expensive vector distance calculations) are indeed the main factors to consider.

Julie

On Mon, Feb 14, 2022 at 1:02 PM Michael Wechner  
wrote:

Hi Julie

Thanks again for your feedback!

I will do some more tests with "all-mpnet-base-v2" (768) and 
"all-roberta-large-v1" (1024), so 1024 is enough for me for the moment :-)

But yes, I could imagine that eventually it might make sense to allow more 
dimensions than 1024.

Besides memory and "CPU", are there other limiting factors re more dimensions?

Thanks

Michael

Am 14.02.22 um 21:53 schrieb Julie Tibshirani:

Hello Michael, the max number of dimensions is currently hardcoded and can't be 
changed. I could see an argument for increasing the default a bit and would be 
happy to discuss if you'd like to file a JIRA issue. However 12288 dimensions 
still seems high to me, this is much larger than most well-established 
embedding models and could require a lot of memory.

Julie

On Mon, Feb 14, 2022 at 12:08 PM Michael Wechner  
wrote:

Hi Julie

Thanks very much for this link, which is very interesting!

Btw, do you have an idea how to increase the default max size of 1024?

https://lists.apache.org/thread/hyb6w5c4x5rjt34k3w7zqn3yp5wvf33o

Thanks

Michael



Am 14.02.22 um 17:45 schrieb Julie Tibshirani:

Hello Michael, I don't have personal experience with these models, but I found 
this article insightful: 
https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9.
 It evaluates the OpenAI models against a variety of existing models on tasks 
like sentence similarity and text retrieval. Although the other models are 
cheaper and have fewer dimensions, the OpenAI ones perform similarly or worse. 
This got me thinking that they might not be a good cost/ effectiveness 
trade-off, especially the larger ones with 4096 or 12288 dimensions.

Julie

On Sun, Feb 13, 2022 at 1:55 AM Michael Wechner  
wrote:

Re the OpenAI embedding the following recent paper might be of interest

https://arxiv.org/pdf/2201.10005.pdf

(Text and Code Embeddings by Contrastive Pre-Training, Jan 24, 2022)

Thanks

Michael

Am 13.02.22 um 00:14 schrieb Michael Wechner:

Here is a concrete example where I combine the OpenAI model "text-similarity-ada-001" 
with Lucene vector search

INPUT sentence: "What is your age this year?"

Result sentences

1) How old are you this year?
score '0.98860765'

2) What was your age last year?
score '0.97811764'

3) What is your age?
score '0.97094905'

4) How old are you?
score '0.9600177'


Result 1 is great and result 2 looks similar, but is not correct from an 
"understanding" point of view and results 3 and 4 are good again.

I understand "similarity" is not the same as "understanding", but I hope it 
makes it clearer what I am looking for :-)

Thanks

Michael



Am 12.02.22 um 22:38 schrieb Michael Wechner:

Hi Alessandro

I am mainly interested in detecting similarity, for example whether the 
following two sentences are similar, i.e. likely to mean the same thing

"How old are you?"
"What is your age?"

and that the following two sentences are not similar, i.e. do not mean the 
same thing

"How old are you this year?"
"How old have you been last year?"

But also performance or how OpenAI embeddings compare for example with SBERT 
(https://sbert.net/docs/usage/semantic_textual_similarity.html)

Thanks

Michael



Am 12.02.22 um 20:41 schrieb Alessandro Benedetti:

Hi Michael, experience to what extent?
We have been exploring the area for a while given we contributed the first 
neural search milestone to Apache Solr.
What is your curiosity? Performance? Relevance impact? How to integrate it?
Regards

On Fri, 11 Feb 2022, 22:38 Michael Wechner,  wrote:

Hi

Does anyone have experience using OpenAI embeddings in combination with Lucene 
vector search?

https://beta.openai.com/docs/guides/embeddings

for example comparing performance re vector size

https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Michael Wechner

Hi Julie

Thanks again for your feedback!

I will do some more tests with "all-mpnet-base-v2" (768) and 
"all-roberta-large-v1" (1024), so 1024 is enough for me for the moment :-)


But yes, I could imagine that eventually it might make sense to allow 
more dimensions than 1024.


Besides memory and "CPU", are there other limiting factors re more 
dimensions?


Thanks

Michael

Am 14.02.22 um 21:53 schrieb Julie Tibshirani:
Hello Michael, the max number of dimensions is currently hardcoded and 
can't be changed. I could see an argument for increasing the default a 
bit and would be happy to discuss if you'd like to file a JIRA issue. 
However 12288 dimensions still seems high to me, this is much larger 
than most well-established embedding models and could require a lot of 
memory.


Julie

On Mon, Feb 14, 2022 at 12:08 PM Michael Wechner 
 wrote:


Hi Julie

Thanks very much for this link, which is very interesting!

Btw, do you have an idea how to increase the default max size of 1024?

https://lists.apache.org/thread/hyb6w5c4x5rjt34k3w7zqn3yp5wvf33o

Thanks

Michael



Am 14.02.22 um 17:45 schrieb Julie Tibshirani:

Hello Michael, I don't have personal experience with these
models, but I found this article insightful:

https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9.
It evaluates the OpenAI models against a variety of existing
models on tasks like sentence similarity and text retrieval.
Although the other models are cheaper and have fewer dimensions,
the OpenAI ones perform similarly or worse. This got me thinking
that they might not be a good cost/ effectiveness trade-off,
especially the larger ones with 4096 or 12288 dimensions.

Julie

On Sun, Feb 13, 2022 at 1:55 AM Michael Wechner
 wrote:

Re the OpenAI embedding the following recent paper might be
of interest

https://arxiv.org/pdf/2201.10005.pdf

(Text and Code Embeddings by Contrastive Pre-Training, Jan
24, 2022)

Thanks

Michael

    Am 13.02.22 um 00:14 schrieb Michael Wechner:

Here is a concrete example where I combine the OpenAI model
"text-similarity-ada-001" with Lucene vector search

INPUT sentence: "What is your age this year?"

Result sentences

1) How old are you this year?
   score '0.98860765'

2) What was your age last year?
   score '0.97811764'

3) What is your age?
   score '0.97094905'

4) How old are you?
   score '0.9600177'


Result 1 is great and result 2 looks similar, but is not
correct from an "understanding" point of view and results 3
and 4 are good again.

I understand "similarity" is not the same as
"understanding", but I hope it makes it clearer what I am
looking for :-)

    Thanks

Michael



Am 12.02.22 um 22:38 schrieb Michael Wechner:

Hi Alessandro

I am mainly interested in detecting similarity, for example
whether the following two sentences are similar, i.e.
likely to mean the same thing

"How old are you?"
"What is your age?"

and that the following two sentences are not similar, i.e.
do not mean the same thing

"How old are you this year?"
"How old have you been last year?"

But also performance or how OpenAI embeddings compare for
example with SBERT
(https://sbert.net/docs/usage/semantic_textual_similarity.html)

Thanks

Michael



Am 12.02.22 um 20:41 schrieb Alessandro Benedetti:

Hi Michael, experience to what extent?
We have been exploring the area for a while given we
contributed the first neural search milestone to Apache Solr.
What is your curiosity? Performance? Relevance impact? How
to integrate it?
Regards

On Fri, 11 Feb 2022, 22:38 Michael Wechner,
 wrote:

Hi

Does anyone have experience using OpenAI embeddings in
combination with Lucene vector search?

https://beta.openai.com/docs/guides/embeddings

for example comparing performance re vector size

https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings

and

https://api.openai.com/v1/engines/text-similarity-davinci-001/embeddings

?

Thanks

Michael











Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Michael Wechner

Hi Julie

Thanks very much for this link, which is very interesting!

Btw, do you have an idea how to increase the default max size of 1024?

https://lists.apache.org/thread/hyb6w5c4x5rjt34k3w7zqn3yp5wvf33o

Thanks

Michael



Am 14.02.22 um 17:45 schrieb Julie Tibshirani:
Hello Michael, I don't have personal experience with these models, but 
I found this article insightful: 
https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9. 
It evaluates the OpenAI models against a variety of existing models on 
tasks like sentence similarity and text retrieval. Although the other 
models are cheaper and have fewer dimensions, the OpenAI ones perform 
similarly or worse. This got me thinking that they might not be a good 
cost/ effectiveness trade-off, especially the larger ones with 4096 
or 12288 dimensions.


Julie

On Sun, Feb 13, 2022 at 1:55 AM Michael Wechner 
 wrote:


Re the OpenAI embedding the following recent paper might be of
interest

https://arxiv.org/pdf/2201.10005.pdf

(Text and Code Embeddings by Contrastive Pre-Training, Jan 24, 2022)

Thanks

Michael

Am 13.02.22 um 00:14 schrieb Michael Wechner:

Here is a concrete example where I combine the OpenAI model
"text-similarity-ada-001" with Lucene vector search

INPUT sentence: "What is your age this year?"

Result sentences

1) How old are you this year?
   score '0.98860765'

2) What was your age last year?
   score '0.97811764'

3) What is your age?
   score '0.97094905'

4) How old are you?
   score '0.9600177'


Result 1 is great and result 2 looks similar, but is not correct
from an "understanding" point of view and results 3 and 4 are
good again.

I understand "similarity" is not the same as "understanding", but
I hope it makes it clearer what I am looking for :-)

Thanks

Michael



Am 12.02.22 um 22:38 schrieb Michael Wechner:

Hi Alessandro

I am mainly interested in detecting similarity, for example
whether the following two sentences are similar, i.e. likely to
mean the same thing

"How old are you?"
"What is your age?"

and that the following two sentences are not similar, i.e. do
not mean the same thing

"How old are you this year?"
"How old have you been last year?"

But also performance or how OpenAI embeddings compare for
example with SBERT
(https://sbert.net/docs/usage/semantic_textual_similarity.html)

Thanks

Michael



Am 12.02.22 um 20:41 schrieb Alessandro Benedetti:

Hi Michael, experience to what extent?
We have been exploring the area for a while given we
contributed the first neural search milestone to Apache Solr.
What is your curiosity? Performance? Relevance impact? How to
integrate it?
Regards

On Fri, 11 Feb 2022, 22:38 Michael Wechner,
 wrote:

Hi

Does anyone have experience using OpenAI embeddings in
combination with Lucene vector search?

https://beta.openai.com/docs/guides/embeddings

for example comparing performance re vector size

https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings

and

https://api.openai.com/v1/engines/text-similarity-davinci-001/embeddings

?

Thanks

Michael









Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-13 Thread Michael Wechner

Re the OpenAI embedding the following recent paper might be of interest

https://arxiv.org/pdf/2201.10005.pdf

(Text and Code Embeddings by Contrastive Pre-Training, Jan 24, 2022)

Thanks

Michael

Am 13.02.22 um 00:14 schrieb Michael Wechner:
Here is a concrete example where I combine the OpenAI model 
"text-similarity-ada-001" with Lucene vector search


INPUT sentence: "What is your age this year?"

Result sentences

1) How old are you this year?
   score '0.98860765'

2) What was your age last year?
   score '0.97811764'

3) What is your age?
   score '0.97094905'

4) How old are you?
   score '0.9600177'


Result 1 is great and result 2 looks similar, but is not correct from 
an "understanding" point of view and results 3 and 4 are good again.


I understand "similarity" is not the same as "understanding", but I 
hope it makes it clearer what I am looking for :-)


Thanks

Michael



Am 12.02.22 um 22:38 schrieb Michael Wechner:

Hi Alessandro

I am mainly interested in detecting similarity, for example whether 
the following two sentences are similar, i.e. likely to mean the same 
thing


"How old are you?"
"What is your age?"

and that the following two sentences are not similar, i.e. do not 
mean the same thing


"How old are you this year?"
"How old have you been last year?"

But also performance or how OpenAI embeddings compare for example 
with SBERT 
(https://sbert.net/docs/usage/semantic_textual_similarity.html)


Thanks

Michael



Am 12.02.22 um 20:41 schrieb Alessandro Benedetti:

Hi Michael, experience to what extent?
We have been exploring the area for a while given we contributed the 
first neural search milestone to Apache Solr.
What is your curiosity? Performance? Relevance impact? How to 
integrate it?

Regards

On Fri, 11 Feb 2022, 22:38 Michael Wechner, 
 wrote:


Hi

Does anyone have experience using OpenAI embeddings in
combination with Lucene vector search?

https://beta.openai.com/docs/guides/embeddings

for example comparing performance re vector size

https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings

and

https://api.openai.com/v1/engines/text-similarity-davinci-001/embeddings

?

Thanks

Michael







Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-12 Thread Michael Wechner
Here is a concrete example where I combine the OpenAI model 
"text-similarity-ada-001" with Lucene vector search


INPUT sentence: "What is your age this year?"

Result sentences

1) How old are you this year?
   score '0.98860765'

2) What was your age last year?
   score '0.97811764'

3) What is your age?
   score '0.97094905'

4) How old are you?
   score '0.9600177'


Result 1 is great and result 2 looks similar, but is not correct from an 
"understanding" point of view and results 3 and 4 are good again.


I understand "similarity" is not the same as "understanding", but I hope 
it makes it clearer what I am looking for :-)


Thanks

Michael
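For readers who want to reproduce this kind of experiment, a minimal sketch of the Lucene side follows (Lucene 9.x API). The embed() method is a stand-in for whatever produces the vectors, e.g. the OpenAI endpoint mentioned in this thread, and is assumed to return unit-length vectors, which DOT_PRODUCT similarity expects:

import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public class KnnSimilarityDemo {

  // Stand-in for the external embedding service (not shown here);
  // assumed to return a normalized (unit-length) float[].
  static float[] embed(String text) {
    throw new UnsupportedOperationException("call your embedding service here");
  }

  public static void main(String[] args) throws Exception {
    try (Directory dir = new ByteBuffersDirectory();
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      // Index each sentence together with its embedding.
      for (String s : new String[] {
          "How old are you this year?", "What was your age last year?",
          "What is your age?", "How old are you?"}) {
        Document doc = new Document();
        doc.add(new StoredField("text", s));
        doc.add(new KnnVectorField("vector", embed(s), VectorSimilarityFunction.DOT_PRODUCT));
        writer.addDocument(doc);
      }
      writer.commit();
      // Run an approximate kNN query and print each hit with its score.
      try (IndexReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(
            new KnnVectorQuery("vector", embed("What is your age this year?"), 4), 4);
        for (ScoreDoc sd : hits.scoreDocs) {
          System.out.println(searcher.doc(sd.doc).get("text") + " score '" + sd.score + "'");
        }
      }
    }
  }
}

The printed output then has the same shape as the result list above: one stored sentence per hit together with its score.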



Am 12.02.22 um 22:38 schrieb Michael Wechner:

Hi Alessandro

I am mainly interested in detecting similarity, for example whether 
the following two sentences are similar, i.e. likely to mean the same 
thing


"How old are you?"
"What is your age?"

and that the following two sentences are not similar, i.e. do not 
mean the same thing


"How old are you this year?"
"How old have you been last year?"

But also performance or how OpenAI embeddings compare for example with 
SBERT (https://sbert.net/docs/usage/semantic_textual_similarity.html)


Thanks

Michael



Am 12.02.22 um 20:41 schrieb Alessandro Benedetti:

Hi Michael, experience to what extent?
We have been exploring the area for a while given we contributed the 
first neural search milestone to Apache Solr.
What is your curiosity? Performance? Relevance impact? How to 
integrate it?

Regards

On Fri, 11 Feb 2022, 22:38 Michael Wechner, 
 wrote:


Hi

Does anyone have experience using OpenAI embeddings in
combination with Lucene vector search?

https://beta.openai.com/docs/guides/embeddings

for example comparing performance re vector size

https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings

and

https://api.openai.com/v1/engines/text-similarity-davinci-001/embeddings

?

Thanks

Michael





How to Increase max vector size?

2022-02-12 Thread Michael Wechner

Hi

I just tried to test the OpenAI model "text-similarity-davinci-001" with 
12288 dimensions and received the following error


java.lang.IllegalArgumentException: vector numDimensions must be <= VectorValues.MAX_DIMENSIONS (=1024); got 12288
    at org.apache.lucene.document.FieldType.setVectorDimensionsAndSimilarityFunction(FieldType.java:381) ~[lucene-core-9.0.0.jar:9.0.0 0b18b3b965cedaf5eb129aa41243a44c83ca826d - jpountz - 2021-12-01 14:23:49]
    at org.apache.lucene.document.KnnVectorField.createFieldType(KnnVectorField.java:69) ~[lucene-core-9.0.0.jar:9.0.0 0b18b3b965cedaf5eb129aa41243a44c83ca826d - jpountz - 2021-12-01 14:23:49]


IIUC I cannot programmatically increase the max vector size, which is 
set inside lucene/core/src/java/org/apache/lucene/index/VectorValues.java


  public static int MAX_DIMENSIONS = 1024;

right?

I guess I could rebuild Lucene with a greater size, or what are the 
possibilities to increase the max vector size?


Thanks

Michael
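One workaround, in the spirit of the off-line dimensionality reduction suggested elsewhere in this thread, is to project the vectors below the 1024 limit before indexing. A crude sketch using a fixed random Gaussian projection follows; this is only a stand-in for proper off-line training, so retrieval quality should be validated before relying on it:

import java.util.Random;

/** Crude random Gaussian projection, e.g. from 12288 down to 1024 dimensions (illustration only). */
public class RandomProjection {
  private final float[][] matrix;

  RandomProjection(int inDim, int outDim, long seed) {
    Random rnd = new Random(seed); // fixed seed: the same projection must be used for all vectors
    matrix = new float[outDim][inDim];
    for (float[] row : matrix)
      for (int i = 0; i < row.length; i++)
        row[i] = (float) (rnd.nextGaussian() / Math.sqrt(outDim));
  }

  float[] project(float[] v) {
    float[] out = new float[matrix.length];
    for (int o = 0; o < out.length; o++) {
      double sum = 0;
      for (int i = 0; i < v.length; i++) sum += matrix[o][i] * v[i];
      out[o] = (float) sum;
    }
    // Re-normalize so DOT_PRODUCT similarity still behaves like cosine similarity.
    double norm = 0;
    for (float x : out) norm += x * x;
    norm = Math.sqrt(norm);
    for (int o = 0; o < out.length; o++) out[o] /= norm;
    return out;
  }
}

Both the indexed vectors and every query vector must go through the same projection instance, hence the fixed seed.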



Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-12 Thread Michael Wechner

Hi Alessandro

I am mainly interested in detecting similarity, for example whether the 
following two sentences are similar, i.e. likely to mean the same thing


"How old are you?"
"What is your age?"

and that the following two sentences are not similar, i.e. do not mean 
the same thing


"How old are you this year?"
"How old have you been last year?"

But also performance or how OpenAI embeddings compare for example with 
SBERT (https://sbert.net/docs/usage/semantic_textual_similarity.html)


Thanks

Michael



Am 12.02.22 um 20:41 schrieb Alessandro Benedetti:

Hi Michael, experience to what extent?
We have been exploring the area for a while given we contributed the 
first neural search milestone to Apache Solr.
What is your curiosity? Performance? Relevance impact? How to 
integrate it?

Regards

On Fri, 11 Feb 2022, 22:38 Michael Wechner, 
 wrote:


Hi

Does anyone have experience using OpenAI embeddings in combination
with Lucene vector search?

https://beta.openai.com/docs/guides/embeddings

for example comparing performance re vector size

https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings

and

https://api.openai.com/v1/engines/text-similarity-davinci-001/embeddings

?

Thanks

Michael



Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-11 Thread Michael Wechner

Hi

Does anyone have experience using OpenAI embeddings in combination with 
Lucene vector search?


https://beta.openai.com/docs/guides/embeddings

for example comparing performance re vector size

https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings

and

https://api.openai.com/v1/engines/text-similarity-davinci-001/embeddings

?

Thanks

Michael
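For reference, a minimal sketch of calling one of these endpoints from Java follows. The request shape matches the 2022-era engines API linked above as far as I know; the OPENAI_API_KEY environment variable is a placeholder, and parsing the returned "embedding" array is left to your preferred JSON library:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: request an embedding for one sentence and print the raw JSON response.
public class OpenAiEmbeddingCall {
  public static void main(String[] args) throws Exception {
    String body = "{\"input\": \"How old are you?\"}";
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings"))
        .header("Content-Type", "application/json")
        .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}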

Re: Searching Lucene FAQ with Lucene

2021-12-21 Thread Michael Wechner




Am 21.12.21 um 18:49 schrieb Michael Sokolov:

interesting -- it always matches *something* I guess?


yes, but this is something I would like to improve: that it knows when 
it does not know :-)


I understand Lucene provides a score, but just defining a threshold 
doesn't really solve the problem, or do I misunderstand this?


It seems to me one has to implement some kind of "understanding / 
reasoning" in order to solve this. Or what would be your approach?



  It might be
helpful to show not only the answer, but also the question that was
matched?


yes, definitely; the Katie frontend already provides this 
functionality


https://ukatie.com/#/faq/9f206aec-5223-4e03-a2fc-c16e4b885ef8/en

but I have to enhance the Javascript client used at

https://lucene-faq.ukatie.com/

Thanks

Michael
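Regarding the score threshold mentioned above, a minimal sketch of a cut-off on the application side follows (the threshold value is an arbitrary placeholder and would need tuning per model; as discussed, a fixed threshold alone does not make the system know when it does not know):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class ScoreThreshold {
  // Keep only hits above a minimum score; an empty result then means "no confident answer".
  // Example usage: aboveThreshold(hits, 0.9f), with the value tuned per model and similarity.
  static List<ScoreDoc> aboveThreshold(TopDocs hits, float minScore) {
    List<ScoreDoc> confident = new ArrayList<>();
    for (ScoreDoc sd : hits.scoreDocs) {
      if (sd.score >= minScore) {
        confident.add(sd);
      }
    }
    return confident;
  }
}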



On Mon, Dec 20, 2021 at 5:05 AM Michael Wechner
 wrote:

Hi

I am working on a webapp called "Katie" in order to detect duplicated
questions

https://ukatie.com/

As a test case I have imported the Lucene FAQ

https://cwiki.apache.org/confluence/display/LUCENE/LuceneFAQ

to

https://ukatie.com/#/faq/9f206aec-5223-4e03-a2fc-c16e4b885ef8/en

and made them available at

https://lucene-faq.ukatie.com/

where the FAQs are loaded as JSON from the REST interface of Katie

https://ukatie.com/swagger-ui/?urls.primaryName=API%20V2#/faq-controller-v-2/getFAQUsingGET_1

and the Javascript can be found at

https://github.com/wyona/katie-4-faq

I am currently "experimenting" with different search algorithms, e.g.

Lucene only
SentenceBERT- Lucene Vector Search
SentenceBERT only
Weaviate

The goal is to find the right answer with "similar" questions, e.g.

- "Are there mailing lists?"
- "How can I ask questions re Lucene?"

independent of whether the question was trained/indexed and whether the
answer contains keywords of the question

The answer in this particular case is

https://ukatie.com/#/read-answer?domain-id=9f206aec-5223-4e03-a2fc-c16e4b885ef8=e19b6f48-62ac-427a-9d5e-d4e4eb110769

and another meaningful answer could be

https://ukatie.com/#/read-answer?domain-id=9f206aec-5223-4e03-a2fc-c16e4b885ef8=154d9aa7-29e6-457e-a2ad-315b1a67599f

There is still a lot to be improved :-) but it is a lot of fun to use
Lucene for this!

Any feedback is very welcome or if you want to know more about the
implementation details.

Thanks

Michael



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Searching Lucene FAQ with Lucene

2021-12-20 Thread Michael Wechner

Hi

I am working on a webapp called "Katie" in order to detect duplicated 
questions


https://ukatie.com/

As a test case I have imported the Lucene FAQ

https://cwiki.apache.org/confluence/display/LUCENE/LuceneFAQ

to

https://ukatie.com/#/faq/9f206aec-5223-4e03-a2fc-c16e4b885ef8/en

and made them available at

https://lucene-faq.ukatie.com/

where the FAQs are loaded as JSON from the REST interface of Katie

https://ukatie.com/swagger-ui/?urls.primaryName=API%20V2#/faq-controller-v-2/getFAQUsingGET_1

and the Javascript can be found at

https://github.com/wyona/katie-4-faq

I am currently "experimenting" with different search algorithms, e.g.

Lucene only
SentenceBERT- Lucene Vector Search
SentenceBERT only
Weaviate

The goal is to find the right answer with "similar" questions, e.g.

- "Are there mailing lists?"
- "How can I ask questions re Lucene?"

independent of whether the question was trained/indexed and whether the 
answer contains keywords of the question


The answer in this particular case is

https://ukatie.com/#/read-answer?domain-id=9f206aec-5223-4e03-a2fc-c16e4b885ef8=e19b6f48-62ac-427a-9d5e-d4e4eb110769

and another meaningful answer could be

https://ukatie.com/#/read-answer?domain-id=9f206aec-5223-4e03-a2fc-c16e4b885ef8=154d9aa7-29e6-457e-a2ad-315b1a67599f

There is still a lot to be improved :-) but it is a lot of fun to use 
Lucene for this!


Any feedback is very welcome or if you want to know more about the 
implementation details.


Thanks

Michael



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Haoyu (Patrick) Zhai as Lucene Committer

2021-12-19 Thread Michael Wechner

Hi Patrick/Haoyu, congratulations!

Am 19.12.21 um 21:05 schrieb Patrick Zhai:

Thanks everyone!

It's a great honor to become a Lucene committer. Thank you everyone 
for building such a friendly community, and a special thank you to those 
who have replied to emails, commented on issues, or reviewed PRs related 
to my work. It has been an enjoyable experience working with the Lucene 
community, and I'm looking forward to learning more about Lucene as well 
as contributing more to the community as a committer.


A little bit about myself: I'm currently living in Mountain View and 
working at Amazon Search, in the same team as Mike McCandless, Mike 
Sokolov and Greg. Besides digging into Lucene for some work-related 
projects, I am also very curious about how some fancy stuff is 
implemented inside Lucene, and that is probably one of the reasons that 
drove me to become a committer.
Besides programming, I enjoy playing video games a lot; I admire 
well-designed games and hope I can participate in game development 
one day. As for outdoor activities, I like skiing and traveling; due 
to COVID it's still hard to travel around, but I hope things will be 
better next year.


Thank you again!
Patrick

David Smiley  于2021年12月19日周日 09:14写道:

Congratulations Haoyu!

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sun, Dec 19, 2021 at 4:12 AM Dawid Weiss
 wrote:

Hello everyone!

Please welcome Haoyu Zhai as the latest Lucene committer. You
may also
know Haoyu as Patrick - this is perhaps his kind gesture to
those of
us whose tongues are less flexible in pronouncing difficult first
names. :)

It's a tradition to briefly introduce yourself to the group,
Patrick.
Welcome and thank you!

Dawid

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Article link at Lucene FAQ does not exist anymore

2021-12-15 Thread Michael Wechner

I have removed it now :-)

Am 22.11.21 um 18:18 schrieb Michael Wechner:

Hi

The QnA

https://cwiki.apache.org/confluence/display/lucene/lucenefaq#LuceneFAQ-HowcanIindexXMLdocuments?

is pointing to (See also this article Parsing, indexing, and searching 
XML with Digester and Lucene 
<http://www-106.ibm.com/developerworks/library/j-lucene/>.)


http://www-106.ibm.com/developerworks/library/j-lucene/

but this does not seem to exist anymore and one gets redirected to

https://developer.ibm.com/

I have found an old copy from 2003

https://web.archive.org/web/20030608074955/http://www-106.ibm.com/developerworks/library/j-lucene/

but I guess it does not really make sense to still link this, right?

Thanks

Michael


Re: Welcome Julie Tibshirani to the Lucene PMC

2021-12-01 Thread Michael Wechner

great to hear, congratulations, Julie!

Am 01.12.21 um 14:29 schrieb Mayya Sharipova:

Congratulations, Julie ! Very well deserved!

On Wed, Dec 1, 2021 at 2:45 PM Ignacio Vera  wrote:

Congratulations Julie!

On Wed, Dec 1, 2021 at 10:03 AM Alan Woodward
 wrote:

Congratulations and welcome!

- Alan

> On 30 Nov 2021, at 21:49, Adrien Grand 
wrote:
>
> I'm pleased to announce that Julie Tibshirani has accepted
an invitation to join the Lucene PMC!
>
> Congratulations Julie, and welcome aboard!
>
> --
> Adrien


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Article link at Lucene FAQ does not exist anymore

2021-11-22 Thread Michael Wechner

Hi

The QnA

https://cwiki.apache.org/confluence/display/lucene/lucenefaq#LuceneFAQ-HowcanIindexXMLdocuments?

is pointing to (See also this article Parsing, indexing, and searching 
XML with Digester and Lucene 
.)


http://www-106.ibm.com/developerworks/library/j-lucene/

but this does not seem to exist anymore and one gets redirected to

https://developer.ibm.com/

I have found an old copy from 2003

https://web.archive.org/web/20030608074955/http://www-106.ibm.com/developerworks/library/j-lucene/

but I guess it does not really make sense to still link this, right?

Thanks

Michael

Re: Time to write an open-source book?

2021-11-17 Thread Michael Wechner

I think this would be great and I would be very happy to contribute.

For example I am currently trying to understand how the autocomplete / 
auto suggest functionality of Lucene works and I could contribute my 
learnings.


All the best

Michael

Am 16.11.21 um 20:49 schrieb Dongyu Xu:

Hi Devs,

I'm finally motivated enough to start this thread, as I believe this is a
great thing to do for the Lucene community to continuously thrive as the
library has become so feature-rich but at the same time much more 
complex.


"What do you recommend to read for learning more Lucene?" -- A question I
was often asked by my friends and coworkers at Amazon Product Search . 
I'm

sure many of you have experienced the same. I always recommend Lucene In
Action 2nd Edition[1] which is a great book. However, it features 
*Lucene 3.0*

and we are at *Lucene 9.0* now! There is a huge gap.

Inspired by my recent experience with the Rust Book[2] and the Solr ref
guide[3], I believe it is possible for the Lucene community to 
collaborate

on writing a book/user guide just like how the software is built in the
open-source way!

Concretely, it will require first drawing the outline of the book with 
clear intentions for all sections. Then the effort should be able to 
scale, allowing individuals to work on different sections in parallel. 
Once built, the book should be a live artifact and evolve together with 
Lucene.

Thoughts?

[1] 
https://www.amazon.com/Lucene-Action-Second-Covers-Apache/dp/1933988177

[2] https://github.com/rust-lang/book
[3] https://github.com/apache/solr/tree/main/solr/solr-ref-guide



Thanks,
Tony


Re: VectorField renamed to KnnVectorField?

2021-11-02 Thread Michael Wechner

Hi Vigya

Great, thank you very much for these links!

All the best

Michael

Am 02.11.21 um 01:45 schrieb Vigya Sharma:

Hi Michael,

Glad you got it working. There is also a KNN vector search demo that 
was added not long ago. You might want to check it out. It has 
references, for example, to compute embeddings and build knn vector 
queries 
<https://github.com/apache/lucene/blob/main/lucene/demo/src/java/org/apache/lucene/demo/SearchFiles.java#L272-L292>, 
among other things.


  * Search Files:

https://github.com/apache/lucene/blob/main/lucene/demo/src/java/org/apache/lucene/demo/SearchFiles.java

<https://github.com/apache/lucene/blob/main/lucene/demo/src/java/org/apache/lucene/demo/SearchFiles.java>

  * Index Files:

https://github.com/apache/lucene/blob/main/lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java

<https://github.com/apache/lucene/blob/main/lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java>

  * knn demo files:

https://github.com/apache/lucene/tree/main/lucene/demo/src/java/org/apache/lucene/demo/knn

<https://github.com/apache/lucene/tree/main/lucene/demo/src/java/org/apache/lucene/demo/knn>


- Vigya


On Mon, Nov 1, 2021 at 2:44 PM Michael Wechner 
mailto:michael.wech...@wyona.com>> wrote:


I was able to update my code

-    FieldType vectorFieldType = VectorField.createHnswType(vector.length, VectorValues.SimilarityFunction.DOT_PRODUCT, 16, 500);
-    VectorField vectorField = new VectorField(VECTOR_FIELD, vector, vectorFieldType);
+    FieldType vectorFieldType = KnnVectorField.createFieldType(vector.length, VectorSimilarityFunction.DOT_PRODUCT);
+    KnnVectorField vectorField = new KnnVectorField(VECTOR_FIELD, vector, vectorFieldType);

and

-    return new TopDocScorer(this, context.reader().searchNearestVectors(field, vector, topK, fanout));
+    return new TopDocScorer(this, context.reader().searchNearestVectors(field, vector, topK, null));

the indexing and searching works again :-)

Thanks

Michael

    Am 01.11.21 um 18:53 schrieb Michael Wechner:

Hi

In May 2021 I have done a Vector Search implementation based on Lucene 
9.0.0-SNAPSHOT with the following code

FieldType vectorFieldType = VectorField.createHnswType(vector.length,
    VectorValues.SimilarityFunction.DOT_PRODUCT, 16, 500);
VectorField vectorField = new VectorField(VECTOR_FIELD, vector, vectorFieldType);
doc.add(vectorField);

and

class KnnWeight extends Weight {

    KnnWeight() {
        super(KnnQuery.this);
    }

    @Override
    public Scorer scorer(LeafReaderContext context) throws IOException {
        log.debug("Get scorer.");
        return new TopDocScorer(this,
            context.reader().searchNearestVectors(field, vector, topK, fanout));
    }

where fanout is of type "int"

I have now updated Lucene source and rebuilt 9.0.0-SNAPSHOT and get various 
compile errors.

I assume VectorField got renamed to KnnVectorField, right?

Does somebody maybe have some sample code showing how vector search is 
implemented with the most recent Lucene code?

Thanks

Michael




--
- Vigya




Re: VectorField renamed to KnnVectorField?

2021-11-01 Thread Michael Wechner

I was able to update my code

-    FieldType vectorFieldType = VectorField.createHnswType(vector.length, VectorValues.SimilarityFunction.DOT_PRODUCT, 16, 500);
-    VectorField vectorField = new VectorField(VECTOR_FIELD, vector, vectorFieldType);
+    FieldType vectorFieldType = KnnVectorField.createFieldType(vector.length, VectorSimilarityFunction.DOT_PRODUCT);
+    KnnVectorField vectorField = new KnnVectorField(VECTOR_FIELD, vector, vectorFieldType);

and

-    return new TopDocScorer(this, context.reader().searchNearestVectors(field, vector, topK, fanout));
+    return new TopDocScorer(this, context.reader().searchNearestVectors(field, vector, topK, null));


the indexing and searching works again :-)

Thanks

Michael

Am 01.11.21 um 18:53 schrieb Michael Wechner:

Hi

In May 2021 I have done a Vector Search implementation based on Lucene 
9.0.0-SNAPSHOT with the following code

FieldType vectorFieldType = VectorField.createHnswType(vector.length,
    VectorValues.SimilarityFunction.DOT_PRODUCT, 16, 500);
VectorField vectorField = new VectorField(VECTOR_FIELD, vector, vectorFieldType);
doc.add(vectorField);

and

class KnnWeight extends Weight {

    KnnWeight() {
        super(KnnQuery.this);
    }

    @Override
    public Scorer scorer(LeafReaderContext context) throws IOException {
        log.debug("Get scorer.");
        return new TopDocScorer(this,
            context.reader().searchNearestVectors(field, vector, topK, fanout));
    }

where fanout is of type "int"

I have now updated Lucene source and rebuilt 9.0.0-SNAPSHOT and get various 
compile errors.

I assume VectorField got renamed to KnnVectorField, right?

Does somebody maybe have some sample code showing how vector search is 
implemented with the most recent Lucene code?

Thanks

Michael




VectorField renamed to KnnVectorField?

2021-11-01 Thread Michael Wechner

Hi

In May 2021 I have done a Vector Search implementation based on Lucene 
9.0.0-SNAPSHOT with the following code

FieldType vectorFieldType = VectorField.createHnswType(vector.length,
    VectorValues.SimilarityFunction.DOT_PRODUCT, 16, 500);
VectorField vectorField = new VectorField(VECTOR_FIELD, vector, vectorFieldType);
doc.add(vectorField);

and

class KnnWeight extends Weight {

    KnnWeight() {
        super(KnnQuery.this);
    }

    @Override
    public Scorer scorer(LeafReaderContext context) throws IOException {
        log.debug("Get scorer.");
        return new TopDocScorer(this,
            context.reader().searchNearestVectors(field, vector, topK, fanout));
    }

where fanout is of type "int"

I have now updated Lucene source and rebuilt 9.0.0-SNAPSHOT and get various 
compile errors.

I assume VectorField got renamed to KnnVectorField, right?

Does somebody maybe have some sample code showing how vector search is 
implemented with the most recent Lucene code?

Thanks

Michael
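(As a pointer for anyone reading this thread later: current Lucene 9.x also offers a higher-level entry point than a hand-written Weight/Scorer, org.apache.lucene.search.KnnVectorQuery, which can be handed to an IndexSearcher directly. A minimal sketch, reusing the VECTOR_FIELD/vector/topK names from the snippet above:)

IndexSearcher searcher = new IndexSearcher(reader);
// Approximate kNN over the vector field; no custom Weight/Scorer needed.
TopDocs hits = searcher.search(new KnnVectorQuery(VECTOR_FIELD, vector, topK), topK);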



Re: Not able to subscribe to Developer Lists

2021-07-29 Thread Michael Wechner

Hi Praveen

I think you managed to subscribe:

https://lists.apache.org/list.html?dev@lucene.apache.org

otherwise I would not have received this email :-)

HTH

Michael

Am 29.07.21 um 09:15 schrieb Praveen Nishchal:

Hi Dev Community,

I have sent multiple emails to dev-subscr...@lucene.apache.org 
for subscription to the Developer Lists but no luck! Kindly help me.


Thanks,
Praveen







Re: Does Luke already support vector search or are there any plans to support vector search?

2021-07-17 Thread Michael Wechner

Hi Tomoko

Just noticed that you resolved the issue and also made some additional 
improvements :-)


Thanks a lot!

Michael

Am 14.07.21 um 07:52 schrieb Michael Wechner:
sure, I understand, but I just wanted to ask whether such a change 
actually makes sense.


I have created a Jira ticket

https://issues.apache.org/jira/browse/LUCENE-10024

and added the patch as an attachment. Let me know if you prefer a pull 
request.


Cheers

Michael

Am 14.07.21 um 03:43 schrieb Tomoko Uchida:

We don't accept patches by email... please open a Jira.


2021年7月14日(水) 5:58 Michael Wechner :

would the following patch make sense?

git diff lucene/luke/src/
diff --git a/lucene/luke/src/java/org/apache/lucene/luke/app/IndexHandler.java b/lucene/luke/src/java/org/apache/lucene/luke/app/IndexHandler.java
index f3fc635872b..ad13745eec8 100644
--- a/lucene/luke/src/java/org/apache/lucene/luke/app/IndexHandler.java
+++ b/lucene/luke/src/java/org/apache/lucene/luke/app/IndexHandler.java
@@ -18,6 +18,7 @@
 package org.apache.lucene.luke.app;
 
 import java.lang.invoke.MethodHandles;
+import java.nio.file.NoSuchFileException;
 import java.util.Objects;
 import org.apache.logging.log4j.Logger;
 import org.apache.lucene.index.IndexReader;
@@ -71,6 +72,10 @@ public final class IndexHandler extends AbstractHandler {
     IndexReader reader;
     try {
       reader = IndexUtils.openIndex(indexPath, dirImpl);
+    } catch (NoSuchFileException e) {
+      log.error("Error opening index", e);
+      throw new LukeException(
+          MessageUtils.getLocalizedMessage("openindex.message.index_path_does_not_exist", indexPath), e);
     } catch (Exception e) {
       log.error("Error opening index", e);
       throw new LukeException(
diff --git a/lucene/luke/src/resources/org/apache/lucene/luke/app/desktop/messages/messages.properties b/lucene/luke/src/resources/org/apache/lucene/luke/app/desktop/messages/messages.properties
index f9c8c45a0f4..30b43cf18b7 100644
--- a/lucene/luke/src/resources/org/apache/lucene/luke/app/desktop/messages/messages.properties
+++ b/lucene/luke/src/resources/org/apache/lucene/luke/app/desktop/messages/messages.properties
@@ -71,6 +71,7 @@ openindex.radio.keep_only_last_commit=Keep only last commit point
 openindex.radio.keep_all_commits=Keep all commit points
 openindex.message.index_path_not_selected=Please choose index path.
 openindex.message.index_path_invalid=Cannot open index path {0}. Not a valid lucene index directory or corrupted?
+openindex.message.index_path_does_not_exist=Cannot open index path {0}. No such directory!
 openindex.message.index_opened=Index successfully opened.
 openindex.message.index_opened_ro=Index successfully opened. (read-only)


Thanks

Michael



Am 13.07.21 um 22:43 schrieb Michael Wechner:

I analyzed the logs and the class/method

lucene/luke/src/java/org/apache/lucene/luke/models/util/IndexUtils.java#openIndex(String, 


String)

and realized that the problem was not the index itself, but that the
index directory/path did not exist anymore.

I forgot that I renamed the index directory, but Luke displayed in the
dropdown "Index Path" the previously opened directory paths.
So when I selected one which did not exist anymore, I received
the error message

"Not a valid lucene index directory or corrupted?"

and I wrongly assumed that the problem is because the index is a
vector search index.

So Luke is able to open the vector search index and displays the
correct number of indexed vectors :-)

Sorry for the noise!

Nevertheless it might make sense to enhance the error message, so that if
one tries to open a directory which does not exist, the error
message reads

"No such directory"

Or the dropdown "Index Path" could check whether the previously
opened directories still exist.

Thanks

Michael


Am 13.07.21 um 10:47 schrieb Michael Wechner:

thanks again for your feedback!

I will give it a try and get back if I should have more questions :-)

Thanks

Michael

Am 13.07.21 um 09:58 schrieb Tomoko Uchida:
I think beside the query it would be nice if Luke would display some
"stats" of the index, for example the various fields beside the actual
vector and also how many vectors are inside the index

It would be a good starting point, I think.


Can you give me a hint where in the code this check currently
happens?
(I guess where the error is happening about the corrupted index)
Actually I have few clues about where to start (haven't tried to read
indexes that include vector values with Luke).
The stack traces you might see should include full information to fix
or improve it.

Tomoko

2021年7月13日(火) 14:22 Michael Wechner :

Am 13.07.21 um 04:22 schrieb Tomoko Uchida:

There aren't any plans for that, and I'm not sure what is actually
expected of the GUI tool

yes, I understand, the input for the query would have to be an
embedding

Re: Does Luke already support vector search or are there any plans to support vector search?

2021-07-13 Thread Michael Wechner
sure, I understand, but I just wanted to ask whether such a change 
actually makes sense.


I have created a Jira ticket

https://issues.apache.org/jira/browse/LUCENE-10024

and added the patch as an attachment. Let me know if you prefer a pull request.

Cheers

Michael

Am 14.07.21 um 03:43 schrieb Tomoko Uchida:

We don't accept patches by email... please open a Jira.


2021年7月14日(水) 5:58 Michael Wechner :

would the following patch make sense?

git diff lucene/luke/src/
diff --git a/lucene/luke/src/java/org/apache/lucene/luke/app/IndexHandler.java b/lucene/luke/src/java/org/apache/lucene/luke/app/IndexHandler.java
index f3fc635872b..ad13745eec8 100644
--- a/lucene/luke/src/java/org/apache/lucene/luke/app/IndexHandler.java
+++ b/lucene/luke/src/java/org/apache/lucene/luke/app/IndexHandler.java
@@ -18,6 +18,7 @@
 package org.apache.lucene.luke.app;
 
 import java.lang.invoke.MethodHandles;
+import java.nio.file.NoSuchFileException;
 import java.util.Objects;
 import org.apache.logging.log4j.Logger;
 import org.apache.lucene.index.IndexReader;
@@ -71,6 +72,10 @@ public final class IndexHandler extends AbstractHandler {
     IndexReader reader;
     try {
       reader = IndexUtils.openIndex(indexPath, dirImpl);
+    } catch (NoSuchFileException e) {
+      log.error("Error opening index", e);
+      throw new LukeException(
+          MessageUtils.getLocalizedMessage("openindex.message.index_path_does_not_exist", indexPath), e);
     } catch (Exception e) {
       log.error("Error opening index", e);
       throw new LukeException(
diff --git a/lucene/luke/src/resources/org/apache/lucene/luke/app/desktop/messages/messages.properties b/lucene/luke/src/resources/org/apache/lucene/luke/app/desktop/messages/messages.properties
index f9c8c45a0f4..30b43cf18b7 100644
--- a/lucene/luke/src/resources/org/apache/lucene/luke/app/desktop/messages/messages.properties
+++ b/lucene/luke/src/resources/org/apache/lucene/luke/app/desktop/messages/messages.properties
@@ -71,6 +71,7 @@ openindex.radio.keep_only_last_commit=Keep only last commit point
 openindex.radio.keep_all_commits=Keep all commit points
 openindex.message.index_path_not_selected=Please choose index path.
 openindex.message.index_path_invalid=Cannot open index path {0}. Not a valid lucene index directory or corrupted?
+openindex.message.index_path_does_not_exist=Cannot open index path {0}. No such directory!
 openindex.message.index_opened=Index successfully opened.
 openindex.message.index_opened_ro=Index successfully opened. (read-only)

Thanks

Michael



Am 13.07.21 um 22:43 schrieb Michael Wechner:

I analyzed the logs and the class/method

lucene/luke/src/java/org/apache/lucene/luke/models/util/IndexUtils.java#openIndex(String,
String)

and realized that the problem was not the index itself, but that the
index directory/path did not exist anymore.

I forgot that I renamed the index directory, but Luke displayed in the
dropdown "Index Path" the previously opened directory paths.
So when I selected one which did not exist anymore, I received
the error message

"Not a valid lucene index directory or corrupted?"

and I wrongly assumed that the problem is because the index is a
vector search index.

So Luke is able to open the vector search index and displays the
correct number of indexed vectors :-)

Sorry for the noise!

Nevertheless it might make sense to enhance the error message, so that if
one tries to open a directory which does not exist, the error
message reads

"No such directory"

Or the dropdown "Index Path" could check whether the previously
opened directories still exist.

Thanks

Michael


Am 13.07.21 um 10:47 schrieb Michael Wechner:

thanks again for your feedback!

I will give it a try and get back if I should have more questions :-)

Thanks

Michael

Am 13.07.21 um 09:58 schrieb Tomoko Uchida:

I think beside the query it would be nice if Luke would display some
"stats" of the index, for example the various fields beside the actual
vector and also how many vectors are inside the index

It would be a good starting point, I think.


Can you give me a hint where in the code this check currently
happens?
(I guess where the error is happening about the corrupted index)

Actually I have few clues about where to start (haven't tried to read
indexes that include vector values with Luke).
The stack traces you might see should include full information to fix
or improve it.

Tomoko

2021年7月13日(火) 14:22 Michael Wechner :

Am 13.07.21 um 04:22 schrieb Tomoko Uchida:

There aren't any plans for that, and I'm not sure what is actually
expected of the GUI tool

yes, I understand, the input for the query would have to be an
embedding (a vector of, for example, 768 dimensions).

I currently see two possibilities to do this:

- Import/open the embedding from a file
- Connecting the regular search input with a service generating the
embedding, lik
