Re: Lucene 9.7 release

2023-06-09 Thread Uwe Schindler

Hi,

BTW, the APIJARs changed slightly because of this JDK API change: 
https://github.com/openjdk/jdk/commit/5fc9b5787dc4d7f00d2c59288bc8d840fdf5b495 
(it does not affect our code, but it was made 3 weeks ago). I hope 
nothing like this happens again. I updated the PR; no code changes were 
needed, as Lucene does not use those methods.


I'd like to update the APIJARS again shortly before the feature branch 
is created.


Uwe

On 09.06.2023 at 23:10, Uwe Schindler wrote:
Let me merge and backport the Java 21 mmap PR first. It has all the new 
source directories and APIJAR files.


To be safe, I will regenerate the Java 21 APIJAR with the newest JDK build. 
FYI, to regenerate it you need an environment variable pointing to JDK 21, 
as autoprovisioning doesn't work.


After that we can copy-paste the vector impl to the main/java21 folder 
and add vector classes to it.


Uwe


On 9 June 2023 at 22:30:09 CEST, Chris Hegarty wrote:


Hi,


On 9 Jun 2023, at 17:19, Uwe Schindler  wrote:

Hi,

if possible, I would like to get the Java 21 changes
(MemorySegments and Vector) into the release. I'd like to ask
Chris, who knows better how to proceed. If he suggests waiting
a week or two, I'd suggest we wait that long.

Chris Hegarty: Do you know whether the JDK 21 API is finalized?
From my understanding the final phases have started, so API
changes are unlikely. If there are bug fixes, they won't affect
public APIs or the incubator module, right?


Your understanding is correct. I do not expect any API changes at
this point.


The MMapDir changes are already tested all the time, vector API
needs the forward port to 21.


We are also doing some early testing with JDK 21 EA, and it would
be great to get the 21-version of Panama VectorUtils in. I can
help get this done.

Uwe, what has been done so far? If nothing, and that is still the
case tomorrow, I can start on it.

-Chris.


Uwe

On 09.06.2023 at 18:07, Adrien Grand wrote:

Hello all,

There is some good stuff that is scheduled for 9.7 already, I
found the following changes in the changelog that look
especially interesting:
 - Concurrent query rewrites for vector queries.
 - Speedups to vector indexing/search via integration of the
Panama vector API.
 - Reduced overhead of soft deletes.
 - Support for update by query.

I propose we start the process for a 9.7 release, and I
volunteer to be the release manager. I suggest the following
schedule:
 - Feature freeze on June 16th, one week from now. This is when
the 9.7 branch will be cut.
 - Open a vote on June 21st, which we'll possibly delay if
blockers get identified.

-- 
Adrien
-- 
Uwe Schindler

Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de




Fix for vector math precision

2023-06-09 Thread Jonathan Ellis
Hi all,

I ran into a bug where the cosine of a large vector taken with itself
returned NaN.  (Cosine of equal vectors should always be 1.)  I put
together a PR to do the internal math of the cosine function with double,
before returning the result as a float:
https://github.com/apache/lucene/pull/12281
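
To illustrate one way such a NaN can arise, here is a minimal self-contained sketch. It is not the code from the pull request, and it assumes the NaN comes from float intermediates overflowing to Infinity, which the message above does not state explicitly. The class and method names (CosineSketch, cosineWithFloatMath, cosineWithDoubleMath) are made up for this example.

```java
// Illustrative sketch only; names are hypothetical and this is not Lucene code.
final class CosineSketch {

  // Float accumulators: products and running sums can overflow to Infinity,
  // and Infinity / Infinity evaluates to NaN.
  static float cosineWithFloatMath(float[] a, float[] b) {
    float dot = 0f;
    float normA = 0f;
    float normB = 0f;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return (float) (dot / Math.sqrt(normA * normB));
  }

  // Double accumulators: the same formula, narrowed to float only at the end.
  static float cosineWithDoubleMath(float[] a, float[] b) {
    double dot = 0d;
    double normA = 0d;
    double normB = 0d;
    for (int i = 0; i < a.length; i++) {
      dot += (double) a[i] * b[i];
      normA += (double) a[i] * a[i];
      normB += (double) b[i] * b[i];
    }
    return (float) (dot / Math.sqrt(normA * normB));
  }
}
```

With components around 1e20f, each squared term already exceeds the float range, so the float variant divides Infinity by Infinity, while the double variant stays finite and yields a value very close to 1.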

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


Re: Lucene 9.7 release

2023-06-09 Thread Uwe Schindler
Let me merge and backport the Java 21 mmap PR first. It has all the new source 
directories and APIJAR files.

To be safe, I will regenerate the Java 21 APIJAR with the newest JDK build. FYI, 
to regenerate it you need an environment variable pointing to JDK 21, as 
autoprovisioning doesn't work.

After that we can copy-paste the vector impl to the main/java21 folder and add 
vector classes to it.

Uwe

On 9 June 2023 at 22:30:09 CEST, Chris Hegarty wrote:
>Hi,
>
>> On 9 Jun 2023, at 17:19, Uwe Schindler  wrote:
>> 
>> Hi,
>> 
>> if possible, I would like to get the Java 21 changes (MemorySegments and 
>> Vector) into the release. I'd like to ask Chris, who knows better how to 
>> proceed. If he suggests waiting a week or two, I'd suggest we wait that 
>> long.
>> 
>> Chris Hegarty: Do you know whether the JDK 21 API is finalized? From 
>> my understanding the final phases have started, so API changes are unlikely. 
>> If there are bug fixes, they won't affect public APIs or the incubator 
>> module, right?
>> 
>Your understanding is correct. I do not expect any API changes at this point.
>> The MMapDir changes are already tested all the time, vector API needs the 
>> forward port to 21.
>> 
>We are also doing some early testing with JDK 21 EA, and it would be great to 
>get the 21-version of Panama VectorUtils in. I can help get this done.
>
>Uwe, what has been done so far? If nothing, and that is still the case 
>tomorrow, I can start on it.
>
>-Chris.
>
>> Uwe
>> 
>> On 09.06.2023 at 18:07, Adrien Grand wrote:
>>> Hello all,
>>> 
>>> There is some good stuff that is scheduled for 9.7 already, I found the 
>>> following changes in the changelog that look especially interesting:
>>>  - Concurrent query rewrites for vector queries.
>>>  - Speedups to vector indexing/search via integration of the Panama vector 
>>> API.
>>>  - Reduced overhead of soft deletes.
>>>  - Support for update by query.
>>> 
>>> I propose we start the process for a 9.7 release, and I volunteer to be the 
>>> release manager. I suggest the following schedule:
>>>  - Feature freeze on June 16th, one week from now. This is when the 9.7 
>>> branch will be cut.
>>>  - Open a vote on June 21st, which we'll possibly delay if blockers get 
>>> identified.
>>> 
>>> -- 
>>> Adrien
>> -- 
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> https://www.thetaphi.de 
>> eMail: u...@thetaphi.de 

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Lucene 9.7 release

2023-06-09 Thread Chris Hegarty
Hi,

> On 9 Jun 2023, at 17:19, Uwe Schindler  wrote:
> 
> Hi,
> 
> if possible, I would like to get the Java 21 changes (MemorySegments and 
> Vector) into the release. I'd like to ask Chris, who knows better how to 
> proceed. If he suggests waiting a week or two, I'd suggest we wait that 
> long.
> 
> Chris Hegarty: Do you know whether the JDK 21 API is finalized? From my 
> understanding the final phases have started, so API changes are unlikely. If 
> there are bug fixes, they won't affect public APIs or the incubator module, 
> right?
> 
Your understanding is correct. I do not expect any API changes at this point.
> The MMapDir changes are already tested all the time, vector API needs the 
> forward port to 21.
> 
We are also doing some early testing with JDK 21 EA, and it would be great to 
get the 21-version of Panama VectorUtils in. I can help get this done.

Uwe, what has been done so far? If nothing, and that is still the case tomorrow, 
I can start on it.

-Chris.

> Uwe
> 
> On 09.06.2023 at 18:07, Adrien Grand wrote:
>> Hello all,
>> 
>> There is some good stuff that is scheduled for 9.7 already, I found the 
>> following changes in the changelog that look especially interesting:
>>  - Concurrent query rewrites for vector queries.
>>  - Speedups to vector indexing/search via integration of the Panama vector 
>> API.
>>  - Reduced overhead of soft deletes.
>>  - Support for update by query.
>> 
>> I propose we start the process for a 9.7 release, and I volunteer to be the 
>> release manager. I suggest the following schedule:
>>  - Feature freeze on June 16th, one week from now. This is when the 9.7 
>> branch will be cut.
>>  - Open a vote on June 21st, which we'll possibly delay if blockers get 
>> identified.
>> 
>> -- 
>> Adrien
> -- 
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de 
> eMail: u...@thetaphi.de 


Scorer#getMinScore()

2023-06-09 Thread Marc D'Mello
Hi all,

I was wondering why there is no Scorer#getMinScore() equivalent to
Scorer#getMaxScore().
I think it could potentially be useful for skipping when you have scoring
functions with a subtraction in it.

As a contrived example, say I wrote a SubtractionAndQuery(Query a, Query b)
that matched a conjunction of a and b but the score was a.score() -
b.score(). When creating a scorer, the best getMaxScore() function I could
create would look like this:

float getMaxScore(int upto) {
  return a.getMaxScore(upto);
}

However, this would not give me the tightest upper bound score possible as
I am completely neglecting the "b" term here. Something like this would be
better:

float getMaxScore(int upto) {
  return Math.max(a.getMaxScore(upto) - b.getMinScore(upto), 0);
}
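
To make the difference between the two bounds concrete, here is a minimal self-contained sketch. It does not use Lucene classes: BoundedScorer is a hypothetical stand-in interface, and getMinScore is exactly the method that does not exist on Lucene's Scorer today. The constant bounds are made-up numbers chosen only for illustration.

```java
// Hypothetical stand-in for a scorer; getMinScore is NOT part of Lucene's Scorer API.
interface BoundedScorer {
  float getMaxScore(int upTo);

  float getMinScore(int upTo);
}

final class SubtractionBoundSketch {

  // Helper for the example: a scorer whose bounds are the same for every block.
  static BoundedScorer constantBounds(float min, float max) {
    return new BoundedScorer() {
      @Override
      public float getMaxScore(int upTo) {
        return max;
      }

      @Override
      public float getMinScore(int upTo) {
        return min;
      }
    };
  }

  // Bound available today: ignores b entirely.
  static float looseMaxScore(BoundedScorer a, int upTo) {
    return a.getMaxScore(upTo);
  }

  // Tighter bound: a.score() - b.score() can never exceed max(a) - min(b).
  // Clamped at zero because scores must not be negative.
  static float tightMaxScore(BoundedScorer a, BoundedScorer b, int upTo) {
    return Math.max(a.getMaxScore(upTo) - b.getMinScore(upTo), 0f);
  }
}
```

For example, if a's scores lie in [1, 5] and b's scores lie in [2, 3], the loose bound is 5 while the tight bound is 5 - 2 = 3, so more blocks could be skipped against the same competitive threshold.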

So I was wondering if not including this API was by design (the same reason
why Lucene doesn't allow negative scores for queries) or if it was because
the added block level metadata required to store the min term scores would
be too much? I'm sure there are other issues I could be overlooking as
well.

Any answers would be greatly appreciated!

Thanks,
Marc


Re: Lucene 9.7 release

2023-06-09 Thread Michael Wechner

Thank you very much, Adrien!

On 09.06.23 at 18:20, Tomás Fernández Löbbe wrote:

+1
Thanks Adrien

On Fri, Jun 9, 2023 at 9:19 AM Michael McCandless wrote:


+1, thanks Adrien!

Mike McCandless

http://blog.mikemccandless.com


On Fri, Jun 9, 2023 at 12:11 PM Patrick Zhai wrote:

+1, thank you Adrien!

On Fri, Jun 9, 2023, 09:08 Adrien Grand  wrote:

Hello all,

There is some good stuff that is scheduled for 9.7
already, I found the following changes in the changelog
that look especially interesting:
 - Concurrent query rewrites for vector queries.
 - Speedups to vector indexing/search via integration of
the Panama vector API.
 - Reduced overhead of soft deletes.
 - Support for update by query.

I propose we start the process for a 9.7 release, and I
volunteer to be the release manager. I suggest the
following schedule:
 - Feature freeze on June 16th, one week from now. This is
when the 9.7 branch will be cut.
 - Open a vote on June 21st, which we'll possibly delay if
blockers get identified.

-- 
Adrien




Re: Lucene 9.7 release

2023-06-09 Thread Ignacio Vera
+1

On Fri, Jun 9, 2023 at 6:20 PM Tomás Fernández Löbbe wrote:

> +1
> Thanks Adrien
>
> On Fri, Jun 9, 2023 at 9:19 AM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> +1, thanks Adrien!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Jun 9, 2023 at 12:11 PM Patrick Zhai  wrote:
>>
>>> +1, thank you Adrien!
>>>
>>> On Fri, Jun 9, 2023, 09:08 Adrien Grand  wrote:
>>>
>>>> Hello all,
>>>>
>>>> There is some good stuff that is scheduled for 9.7 already, I found the
>>>> following changes in the changelog that look especially interesting:
>>>>  - Concurrent query rewrites for vector queries.
>>>>  - Speedups to vector indexing/search via integration of the Panama
>>>> vector API.
>>>>  - Reduced overhead of soft deletes.
>>>>  - Support for update by query.
>>>>
>>>> I propose we start the process for a 9.7 release, and I volunteer to be
>>>> the release manager. I suggest the following schedule:
>>>>  - Feature freeze on June 16th, one week from now. This is when the 9.7
>>>> branch will be cut.
>>>>  - Open a vote on June 21st, which we'll possibly delay if blockers get
>>>> identified.
>>>>
>>>> --
>>>> Adrien
>>>>
>>>


Re: Lucene 9.7 release

2023-06-09 Thread Tomás Fernández Löbbe
+1
Thanks Adrien

On Fri, Jun 9, 2023 at 9:19 AM Michael McCandless wrote:

> +1, thanks Adrien!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Jun 9, 2023 at 12:11 PM Patrick Zhai  wrote:
>
>> +1, thank you Adrien!
>>
>> On Fri, Jun 9, 2023, 09:08 Adrien Grand  wrote:
>>
>>> Hello all,
>>>
>>> There is some good stuff that is scheduled for 9.7 already, I found the
>>> following changes in the changelog that look especially interesting:
>>>  - Concurrent query rewrites for vector queries.
>>>  - Speedups to vector indexing/search via integration of the Panama
>>> vector API.
>>>  - Reduced overhead of soft deletes.
>>>  - Support for update by query.
>>>
>>> I propose we start the process for a 9.7 release, and I volunteer to be
>>> the release manager. I suggest the following schedule:
>>>  - Feature freeze on June 16th, one week from now. This is when the 9.7
>>> branch will be cut.
>>>  - Open a vote on June 21st, which we'll possibly delay if blockers get
>>> identified.
>>>
>>> --
>>> Adrien
>>>
>>


Re: Lucene 9.7 release

2023-06-09 Thread Uwe Schindler

Hi,

if possible, I would like to get the Java 21 changes (MemorySegments and 
Vector) into the release. I'd like to ask Chris, who knows better how to 
proceed. If he suggests waiting a week or two, I'd suggest we wait that 
long.


Chris Hegarty: Do you know whether the JDK 21 API is finalized? 
From my understanding the final phases have started, so API changes are 
unlikely. If there are bug fixes, they won't affect public APIs or the 
incubator module, right?


The MMapDir changes are already tested all the time, vector API needs 
the forward port to 21.


Uwe

On 09.06.2023 at 18:07, Adrien Grand wrote:

Hello all,

There is some good stuff that is scheduled for 9.7 already, I found 
the following changes in the changelog that look especially interesting:

 - Concurrent query rewrites for vector queries.
 - Speedups to vector indexing/search via integration of the Panama 
vector API.

 - Reduced overhead of soft deletes.
 - Support for update by query.

I propose we start the process for a 9.7 release, and I volunteer to 
be the release manager. I suggest the following schedule:
 - Feature freeze on June 16th, one week from now. This is when the 
9.7 branch will be cut.
 - Open a vote on June 21st, which we'll possibly delay if blockers 
get identified.


--
Adrien


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de


Re: Lucene 9.7 release

2023-06-09 Thread Michael McCandless
+1, thanks Adrien!

Mike McCandless

http://blog.mikemccandless.com


On Fri, Jun 9, 2023 at 12:11 PM Patrick Zhai  wrote:

> +1, thank you Adrien!
>
> On Fri, Jun 9, 2023, 09:08 Adrien Grand  wrote:
>
>> Hello all,
>>
>> There is some good stuff that is scheduled for 9.7 already, I found the
>> following changes in the changelog that look especially interesting:
>>  - Concurrent query rewrites for vector queries.
>>  - Speedups to vector indexing/search via integration of the Panama
>> vector API.
>>  - Reduced overhead of soft deletes.
>>  - Support for update by query.
>>
>> I propose we start the process for a 9.7 release, and I volunteer to be
>> the release manager. I suggest the following schedule:
>>  - Feature freeze on June 16th, one week from now. This is when the 9.7
>> branch will be cut.
>>  - Open a vote on June 21st, which we'll possibly delay if blockers get
>> identified.
>>
>> --
>> Adrien
>>
>


Re: Lucene 9.7 release

2023-06-09 Thread Patrick Zhai
+1, thank you Adrien!

On Fri, Jun 9, 2023, 09:08 Adrien Grand  wrote:

> Hello all,
>
> There is some good stuff that is scheduled for 9.7 already, I found the
> following changes in the changelog that look especially interesting:
>  - Concurrent query rewrites for vector queries.
>  - Speedups to vector indexing/search via integration of the Panama vector
> API.
>  - Reduced overhead of soft deletes.
>  - Support for update by query.
>
> I propose we start the process for a 9.7 release, and I volunteer to be
> the release manager. I suggest the following schedule:
>  - Feature freeze on June 16th, one week from now. This is when the 9.7
> branch will be cut.
>  - Open a vote on June 21st, which we'll possibly delay if blockers get
> identified.
>
> --
> Adrien
>


Lucene 9.7 release

2023-06-09 Thread Adrien Grand
Hello all,

There is some good stuff that is scheduled for 9.7 already, I found the
following changes in the changelog that look especially interesting:
 - Concurrent query rewrites for vector queries.
 - Speedups to vector indexing/search via integration of the Panama vector
API.
 - Reduced overhead of soft deletes.
 - Support for update by query.

I propose we start the process for a 9.7 release, and I volunteer to be the
release manager. I suggest the following schedule:
 - Feature freeze on June 16th, one week from now. This is when the 9.7
branch will be cut.
 - Open a vote on June 21st, which we'll possibly delay if blockers get
identified.

-- 
Adrien


Re: [lucene] 01/04: Introduced the Word2VecSynonymFilter (#12169)

2023-06-09 Thread Alan Woodward
Hey Alessandro, I just spotted this going into 9.x which introduces some 
breaking changes to the QueryBuilder API (specifically, moving TermAndBoost to 
its own class).  This will make upgrading 9.6 to 9.7 difficult as it means 
anything that extends QueryBuilder will need to change imports and recompile.

Can we at least keep QueryBuilder.TermAndBoost in the same place in 9.x?  I’m 
not sure it needs to move in main either, but we can discuss that later!
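
To show the kind of breakage I mean, here is a rough sketch with stand-in classes. BuilderSketch and DownstreamBuilder are hypothetical names, not the real QueryBuilder code, and the method shown is simplified for illustration. The subclass refers to the nested type by its unqualified name, which resolves through inheritance only while the type stays nested in the parent.

```java
// Stand-in for the 9.6 layout: the helper type is nested inside the builder.
// All names here are hypothetical and only illustrate the compatibility issue.
class BuilderSketch {

  public static class TermAndBoost {
    public final String term;
    public final float boost;

    public TermAndBoost(String term, float boost) {
      this.term = term;
      this.boost = boost;
    }
  }

  public String newSynonymQuery(TermAndBoost[] terms) {
    StringBuilder sb = new StringBuilder("Synonym(");
    for (int i = 0; i < terms.length; i++) {
      if (i > 0) {
        sb.append(' ');
      }
      sb.append(terms[i].term).append('^').append(terms[i].boost);
    }
    return sb.append(')').toString();
  }
}

// Downstream code written against the nested layout. The unqualified name
// TermAndBoost resolves to BuilderSketch.TermAndBoost via inheritance.
class DownstreamBuilder extends BuilderSketch {

  @Override
  public String newSynonymQuery(TermAndBoost[] terms) {
    return "custom:" + super.newSynonymQuery(terms);
  }
}
```

If TermAndBoost became a top-level class in another package, DownstreamBuilder would no longer compile without a new import, and an already compiled copy would still reference the old nested binary name (BuilderSketch$TermAndBoost) in its method signature.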

> On 30 May 2023, at 16:29, abenede...@apache.org wrote:
> 
> This is an automated email from the ASF dual-hosted git repository.
> 
> abenedetti pushed a commit to branch branch_9x
> in repository https://gitbox.apache.org/repos/asf/lucene.git
> 
> commit 64b48b89b501a89e303f6201f3a25ed0fb901f80
> Author: Daniele Antuzi 
> AuthorDate: Mon Apr 24 13:35:26 2023 +0200
> 
>   Introduced the Word2VecSynonymFilter (#12169)
> 
>   Co-authored-by: Alessandro Benedetti 
> ---
> .../lucene/analysis/tests/TestRandomChains.java|  25 
> lucene/analysis/common/src/java/module-info.java   |   2 +
> .../analysis/synonym/word2vec/Dl4jModelReader.java | 126 
> .../analysis/synonym/word2vec/Word2VecModel.java   |  95 
> .../synonym/word2vec/Word2VecSynonymFilter.java| 108 ++
> .../word2vec/Word2VecSynonymFilterFactory.java | 101 +
> .../synonym/word2vec/Word2VecSynonymProvider.java  | 104 ++
> .../word2vec/Word2VecSynonymProviderFactory.java   |  63 
> .../analysis/synonym/word2vec/package-info.java|  19 +++
> .../org.apache.lucene.analysis.TokenFilterFactory  |   1 +
> .../synonym/word2vec/TestDl4jModelReader.java  |  98 +
> .../word2vec/TestWord2VecSynonymFilter.java| 152 
> .../word2vec/TestWord2VecSynonymFilterFactory.java | 159 +
> .../word2vec/TestWord2VecSynonymProvider.java  | 132 +
> .../word2vec-corrupted-vector-dimension-model.zip  | Bin 0 -> 323 bytes
> .../synonym/word2vec/word2vec-empty-model.zip  | Bin 0 -> 195 bytes
> .../analysis/synonym/word2vec/word2vec-model.zip   | Bin 0 -> 439678 bytes
> .../java/org/apache/lucene/util/QueryBuilder.java  |  14 --
> .../java/org/apache/lucene/util/TermAndBoost.java  |  31 
> .../java/org/apache/lucene/util/TermAndVector.java |  72 ++
> .../apache/lucene/util/hnsw/HnswGraphBuilder.java  |   9 ++
> .../test/org/apache/lucene/index/TestKnnGraph.java |  11 +-
> .../tests/analysis/BaseTokenStreamTestCase.java| 149 ++-
> 23 files changed, 1448 insertions(+), 23 deletions(-)
> 
> diff --git 
> a/lucene/analysis.tests/src/test/org/apache/lucene/analysis/tests/TestRandomChains.java
>  
> b/lucene/analysis.tests/src/test/org/apache/lucene/analysis/tests/TestRandomChains.java
> index 8c245e7058c..988deaf99e5 100644
> --- 
> a/lucene/analysis.tests/src/test/org/apache/lucene/analysis/tests/TestRandomChains.java
> +++ 
> b/lucene/analysis.tests/src/test/org/apache/lucene/analysis/tests/TestRandomChains.java
> @@ -89,6 +89,8 @@ import org.apache.lucene.analysis.shingle.ShingleFilter;
> import org.apache.lucene.analysis.standard.StandardTokenizer;
> import org.apache.lucene.analysis.stempel.StempelStemmer;
> import org.apache.lucene.analysis.synonym.SynonymMap;
> +import org.apache.lucene.analysis.synonym.word2vec.Word2VecModel;
> +import org.apache.lucene.analysis.synonym.word2vec.Word2VecSynonymProvider;
> import org.apache.lucene.store.ByteBuffersDirectory;
> import org.apache.lucene.tests.analysis.BaseTokenStreamTestCase;
> import org.apache.lucene.tests.analysis.MockTokenFilter;
> @@ -99,8 +101,10 @@ import org.apache.lucene.tests.util.TestUtil;
> import org.apache.lucene.tests.util.automaton.AutomatonTestUtil;
> import org.apache.lucene.util.AttributeFactory;
> import org.apache.lucene.util.AttributeSource;
> +import org.apache.lucene.util.BytesRef;
> import org.apache.lucene.util.CharsRef;
> import org.apache.lucene.util.IgnoreRandomChains;
> +import org.apache.lucene.util.TermAndVector;
> import org.apache.lucene.util.Version;
> import org.apache.lucene.util.automaton.Automaton;
> import org.apache.lucene.util.automaton.CharacterRunAutomaton;
> @@ -415,6 +419,27 @@ public class TestRandomChains extends 
> BaseTokenStreamTestCase {
>  }
>}
>  });
> +  put(
> +  Word2VecSynonymProvider.class,
> +  random -> {
> +final int numEntries = atLeast(10);
> +final int vectorDimension = random.nextInt(99) + 1;
> +Word2VecModel model = new Word2VecModel(numEntries, 
> vectorDimension);
> +for (int j = 0; j < numEntries; j++) {
> +  String s = TestUtil.randomSimpleString(random, 10, 20);
> +  float[] vec = new float[vectorDimension];
> +  for (int i = 0; i < vectorDimension; i++) {
> +