Re: [VOTE] Release Lucene 9.11.0 RC1

2024-06-05 Thread Adrien Grand
+1 SUCCESS! [1:09:30.262027]

On Wed, Jun 5, 2024 at 4:15 PM Tomás Fernández Löbbe 
wrote:

> +1
>
> SUCCESS! [1:12:30.029470]
>
> On Wed, Jun 5, 2024 at 9:22 AM Bruno Roustant 
> wrote:
>
>> +1
>>
>> SUCCESS! [0:41:14.593265]
>>
>> Bruno
>>
>>>

-- 
Adrien


Re: [EXTERNAL] Re: Question about extending Similarity

2024-05-22 Thread Adrien Grand
Your similarity looks ok.

> My hunch is that I would need to create a specialized type of query, but
it’s not clear to me what it needs to be.

You are right, this requires a query. A similarity alone cannot do this.
You could create a two-phase iterator that reads the norm field and returns
false in matches() when the score doesn't match the length of the field.

If you want a more general form of this, you could also look into Lucene's
monitor module, since what you are doing here essentially amounts to indexing
conjunctive queries and matching sets of terms against them.
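
Going back to the two-phase iterator idea, here is a rough sketch of the shape
it could take. This is not a complete query implementation: it assumes the
wrapping Weight supplies the inner Scorer and the per-document norms, and the
exact comparison depends on how your similarity encodes scores and norms.

import java.io.IOException;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.TwoPhaseIterator;

final class FullMatchTwoPhaseIterator extends TwoPhaseIterator {
  private final Scorer innerScorer;
  private final NumericDocValues norms; // field length written at index time

  FullMatchTwoPhaseIterator(Scorer innerScorer, NumericDocValues norms) {
    super(innerScorer.iterator());
    this.innerScorer = innerScorer;
    this.norms = norms;
  }

  @Override
  public boolean matches() throws IOException {
    int doc = approximation.docID();
    if (norms.advanceExact(doc) == false) {
      return false; // no norm for this doc, treat as a non-match
    }
    // Only keep the hit when the accumulated score corresponds to the full
    // field length, i.e. every indexed term of the field was matched.
    return innerScorer.score() == (float) norms.longValue();
  }

  @Override
  public float matchCost() {
    return 2f; // one norm lookup plus one score computation
  }
}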

On Wed, May 22, 2024 at 8:50 PM Georgios Georgiadis
 wrote:

> Thanks, I got it by doing something like this:
>
> public class PartialSimilarity : DefaultSimilarity
> {
>     public override float Idf(long docFreq, long docCount)
>     {
>         return 1.0f;
>     }
>
>     public override float Tf(float freq)
>     {
>         return 1.0f;
>     }
>
>     public override float LengthNorm(FieldInvertState state)
>     {
>         int numTerms;
>         if (m_discountOverlaps)
>         {
>             numTerms = state.Length - state.NumOverlap;
>         }
>         else
>         {
>             numTerms = state.Length;
>         }
>         return (float)numTerms;
>     }
>
>     public override long ComputeNorm(FieldInvertState state)
>     {
>         float normValue = LengthNorm(state);
>         return (long)normValue;
>     }
>
>     public override float QueryNorm(float sumOfSquaredWeights)
>     {
>         return 1.0f;
>     }
>
>     public override float DecodeNormValue(long norm)
>     {
>         return 1.0f / (float)norm;
>     }
>
>     public override float Coord(int overlap, int maxOverlap)
>     {
>         return 1.0f;
>     }
> }
>
>
>
> A slightly different variation of this is the following:
>
> If it’s a partial match, how can I return a score of 0? i.e. if query is
> “A B C” and the field contains “B D”, then, I want to say that the score is
> 0. This requires knowledge of the sum of scores of all terms, which I am
> not sure how I can access.
>
> My hunch is that I would need to create a specialized type of query, but
> it’s not clear to me what it needs to be. Any suggestions?
>
> Best,
>
> Georgios
>
>
>
> *From:* Adrien Grand 
> *Sent:* Wednesday, May 22, 2024 12:20 AM
> *To:* dev@lucene.apache.org
> *Subject:* [EXTERNAL] Re: Question about extending Similarity
>
>
>
> Hi Georgios,
>
>
>
> This is possible. You need to create a similarity that stores the number
> of terms as a norm, and then produce scores that are equal to freq/norm at
> search time.
>
>
>
> On Tue, May 21, 2024 at 8:02 PM Georgios Georgiadis <
> georgios.georgia...@microsoft.com.invalid> wrote:
>
> Hi,
>
> I would like to extend Similarity to have the following functionality: if
> the query is “A B C” and a field contains “B C” then I would like to call
> that a “match” and return a score of 1 (2/2). If the query is “A B C” and
> the field contains “B D” then I would like to call that a partial match and
> give a score of 0.5 (1/2). Is this possible?
>
> Best,
>
> Georgios
>
>
>
>
> --
>
> Adrien
>


-- 
Adrien


Re: Maximum score estimation

2024-05-22 Thread Adrien Grand
Hi Mikhail,

You are correct: that should give a reasonable upper bound on scores for term
queries and for combinations of term queries via BooleanQuery.
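
For reference, here is a rough sketch of that pattern. The class and method
names are made up for illustration, and it assumes a plain IndexSearcher and a
query whose scorers support impacts:

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

final class MaxScoreEstimate {
  static float estimate(IndexSearcher searcher, Query query) throws IOException {
    Weight weight =
        searcher.createWeight(searcher.rewrite(query), ScoreMode.TOP_SCORES, 1f);
    float max = 0f;
    for (LeafReaderContext ctx : searcher.getIndexReader().leaves()) {
      Scorer scorer = weight.scorer(ctx);
      if (scorer == null) {
        continue; // query matches nothing in this segment
      }
      scorer.advanceShallow(0); // position impacts at the start of the segment
      // Upper bound of the score over the whole doc ID space of the segment.
      max = Math.max(max, scorer.getMaxScore(ctx.reader().maxDoc() - 1));
    }
    return max;
  }
}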

On Wed, May 22, 2024 at 6:57 PM Mikhail Khludnev  wrote:

> I'm trying to understand Impacts. Need help.
> https://github.com/apache/lucene/issues/5270#issuecomment-1223383919
> Does it mean
> advanceShallow(0)
> getMaxScore(maxDoc-1)
> give a good max score estimate, at least for a term query?
>
> On Fri, May 10, 2024 at 11:21 PM Mikhail Khludnev  wrote:
>
>> Hello Alessandro.
>> Glad to hear!
>> There's not much update from the previously published link: just a tiny
>> test. Guessing max tf doesn't seem really reliable.
>> However, I've got another idea:
>> Can't Impacts give us an exact max score like
>> https://lucene.apache.org/core/9_9_1/core/org/apache/lucene/search/Scorer.html#getMaxScore(int)?
>>
>> I don't know if it's possible and how to do it.
>>
>> On Thu, May 9, 2024 at 6:11 PM Alessandro Benedetti 
>> wrote:
>>
>>> Hi Mikhail,
>>> I was thinking again about this regarding Hybrid Search in Solr and the
>>> current
>>> https://solr.apache.org/guide/solr/latest/query-guide/function-queries.html#scale-function
>>> .
>>> Was there any progress on this? Any traction?
>>> Sooner or later I hope to get some funds to work on this, I keep you
>>> updated!
>>> I agree this would be useful in Learning To Rank and Hybrid Search in
>>> general.
>>> The current original score feature is unlikely to be useful if not
>>> normalised per an estimated maximum score.
>>>
>>> Cheers
>>> --
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benede...@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io 
>>> LinkedIn  | Twitter
>>>  | Youtube
>>>  | Github
>>> 
>>>
>>>
>>> On Mon, 13 Feb 2023 at 12:47, Mikhail Khludnev  wrote:
>>>
 Hello.
 Just FYI. I scratched a little prototype
 https://github.com/mkhludnev/likely/blob/main/src/test/java/org/apache/lucene/contrb/highly/TestLikelyReader.java#L53
 To estimate maximum possible score for the query against an index:
  - it creates a virtual index (LikelyReader), which
  - contains all terms from the original index with the same docCount
  - matching all of these terms in the first doc (docnum=0) with the
 maximum termFreq (estimating which is a separate question).
 So, if we search over this LikelyReader we get a score estimate, which
 can hardly be exceeded by the same query over the original index.
 I suppose this might be useful for LTR as a better alternative to the
 query score feature.

 On Tue, Dec 6, 2022 at 10:02 AM Mikhail Khludnev 
 wrote:

> Hello dev!
> Users are interested in the meaning of absolute value of the score,
> but we always reply that it's just relative value. Maximum score of 
> matched
> docs is not an answer.
> Ultimately we need to measure how much sense a query has in the index.
> e.g. [jet OR propulsion OR spider] query should be measured like
> nonsense, because the best matching docs have much lower scores than
> hypothetical (and assuming absent) doc matching [jet AND propulsion AND
> spider].
> Could it be a method that returns the maximum possible score if all
> query terms would match? Something like stubbing postings on a virtual
> all_matching doc with average stats like tf and field length and kicking
> scorers in? It reminds me of something about probabilistic retrieval, but not
> much. Is there anything like this already?
>
> --
> Sincerely yours
> Mikhail Khludnev
>


 --
 Sincerely yours
 Mikhail Khludnev

>>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


-- 
Adrien


Re: Question about extending Similarity

2024-05-22 Thread Adrien Grand
Hi Georgios,

This is possible. You need to create a similarity that stores the number of
terms as a norm, and then produce scores that are equal to freq/norm at
search time.
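
For what it's worth, a minimal sketch of that idea against the Java Similarity
API (the class name and the choice to cap freq at 1 are mine, for illustration):
each matched term then contributes 1/numTerms, so a BooleanQuery over the query
terms sums to matchedTerms/numTerms.

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.Similarity;

public class FractionOfQueryTermsSimilarity extends Similarity {

  @Override
  public long computeNorm(FieldInvertState state) {
    // Store the number of indexed terms of the field directly as the norm.
    return state.getLength();
  }

  @Override
  public SimScorer scorer(
      float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
    return new SimScorer() {
      @Override
      public float score(float freq, long norm) {
        // Cap freq at 1 so a repeated term does not count more than once,
        // then divide by the number of terms stored in the norm.
        return Math.min(freq, 1f) / Math.max(1L, norm);
      }
    };
  }
}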

On Tue, May 21, 2024 at 8:02 PM Georgios Georgiadis
 wrote:

> Hi,
>
> I would like to extend Similarity to have the following functionality: if
> the query is “A B C” and a field contains “B C” then I would like to call
> that a “match” and return a score of 1 (2/2). If the query is “A B C” and
> the field contains “B D” then I would like to call that a partial match and
> give a score of 0.5 (1/2). Is this possible?
>
> Best,
>
> Georgios
>


-- 
Adrien


Re: Lucene 9.11

2024-05-14 Thread Adrien Grand
+1 the 9.11 changelog looks great!

On Tue, May 14, 2024 at 4:50 PM Benjamin Trent 
wrote:

> Hey y'all,
>
> Looking at changes for 9.11, we are building a significant list. I propose
> we do a release in the next couple of weeks.
>
> While this email is a little early (I am about to go on vacation for a
> bit), I volunteer myself as release manager.
>
> Unless there are objections, I plan on kicking off the release process May
> 28th.
>
> Thanks!
>
> Ben
>


-- 
Adrien


Re: Any recommended issues to work on for a newcomer?

2024-05-13 Thread Adrien Grand
> Maybe Adrien Grand and others might also have some feedback :-)

I'd suggest the signature to look something like `TopDocs TopDocs#rrf(int
topN, int k, TopDocs[] hits)` to be consistent with `TopDocs#merge`.
Internally, it should look at `ScoreDoc#shardId` and `ScoreDoc#doc` to
figure out which hits map to the same document.
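
To make the idea concrete, here is a rough sketch of what such a method could
do. `rrf` is hypothetical (it does not exist in Lucene today); hits that share
the same (shardIndex, doc) pair across the input rankings are treated as the
same document and their reciprocal ranks are summed:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TotalHits;

final class ReciprocalRankFusion {
  static TopDocs rrf(int topN, int k, TopDocs[] hits) {
    Map<Long, ScoreDoc> fused = new HashMap<>();
    for (TopDocs ranking : hits) {
      for (int rank = 0; rank < ranking.scoreDocs.length; rank++) {
        ScoreDoc sd = ranking.scoreDocs[rank];
        // Identify a document by (shardIndex, doc) across the input rankings.
        long key = (((long) sd.shardIndex) << 32) | (sd.doc & 0xFFFFFFFFL);
        float contribution = 1f / (k + rank + 1); // reciprocal of the 1-based rank
        ScoreDoc existing = fused.get(key);
        if (existing == null) {
          fused.put(key, new ScoreDoc(sd.doc, contribution, sd.shardIndex));
        } else {
          existing.score += contribution;
        }
      }
    }
    List<ScoreDoc> merged = new ArrayList<>(fused.values());
    merged.sort((a, b) -> Float.compare(b.score, a.score)); // best fused score first
    ScoreDoc[] top =
        merged.subList(0, Math.min(topN, merged.size())).toArray(new ScoreDoc[0]);
    return new TopDocs(new TotalHits(merged.size(), TotalHits.Relation.EQUAL_TO), top);
  }
}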

> Back in the day, I was reasoning on this and I didn't think Lucene was
the right place for an interleaving algorithm, given that Reciprocal Rank
Fusion is affected by distribution and it's not supposed to work per node.

To me this is like `TopDocs#merge`. There are changes needed on the
application side to hook this call into the logic that combines hits that
come from multiple shards (multiple queries in the case of RRF), but Lucene
can still provide the merging logic.

On Mon, May 13, 2024 at 1:41 PM Michael Wechner 
wrote:

> Thanks for your feedback Alessandro!
>
> I am using Lucene independent of Solr or OpenSearch, Elasticsearch, but
> would like to combine different result sets using RRF, therefore think that
> Lucene itself could be a good place actually.
>
> Looking forward to your additional elaboration!
>
> Thanks
>
> Michael
>
>
>
>
> Am 13.05.2024 um 12:34 schrieb Alessandro Benedetti  >:
>
> This is not strictly related to Lucene, but I'll give a talk at Berlin
> Buzzwords on how I am implementing Reciprocal Rank Fusion in Apache Solr.
> I'll resume my work on the contribution next week and have more to share
> later.
>
> Back in the day, I was reasoning on this and I didn't think Lucene was the
> right place for an interleaving algorithm, given that Reciprocal Rank
> Fusion is affected by distribution and it's not supposed to work per node.
> I think I evaluated the possibility of doing it as a Lucene query or a
> Lucene component but then ended up with a different approach.
> I'll elaborate more when I go back to the task!
>
> Cheers
> --
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benede...@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Sat, 11 May 2024 at 09:10, Michael Wechner 
> wrote:
>
>> sure, no problem!
>>
>> Maybe Adrien Grand and others might also have some feedback :-)
>>
>> Thanks
>>
>> Michael
>>
>> Am 10.05.24 um 23:03 schrieb Chang Hank:
>>
>> Thank you for these useful resources, please allow me to spend some time
>> look into it.
>> I’ll let you know asap!!
>>
>> Thanks
>>
>> Hank
>>
>> On May 10, 2024, at 12:34 PM, Michael Wechner 
>>  wrote:
>>
>> also we might want to consider how this relates to
>>
>>
>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html
>>
>> In vector search reranking has become quite popular, e.g.
>>
>> https://docs.cohere.com/docs/reranking
>>
>> IIUC LangChain (python) for example adds the reranker as an argument to
>> the searcher/retriever
>>
>>
>> https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/
>>
>> So maybe the following might make sense as well
>>
>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
>> TopDocs topDocsVector = vectorSearcher.search(query, 50, new
>> CohereReranker());
>>
>> TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword,
>> topDocsVector);
>>
>> WDYT?
>>
>> Thanks
>>
>> Michael
>>
>>
>> Am 10.05.24 um 21:08 schrieb Michael Wechner:
>>
>> great, yes, let's get started :-)
>>
>> What about the following pseudo code, assuming that there might be
>> alternative ranking algorithms to RRF
>>
>> StoredFieldsKeyword storedFieldsKeyword =
>> indexReaderKeyword.storedFields();
>> StoredFieldsVector storedFieldsVector = indexReaderKeyword.storedFields();
>>
>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
>> TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50);
>>
>> Ranker ranker = new RRFRanker();
>> TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, topDocsVector);
>>
>> for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
>> Document docK = storedFieldsKeyword.document(sc

Re: Format metadata versioning vs. new named Formats

2024-04-12 Thread Adrien Grand
These are indeed two ways to change the on-disk layout of our file formats.

In general, I try to follow the following rules:
 - If the format is not bw-compatible (e.g. formats in lucene/codecs), do
the change in-place and bump both VERSION_START and VERSION_CURRENT to make
sure users get a proper error when opening old indexes with a new Lucene
version.
 - If the format is bw compatible:
   - If the change is substantial: create a new format.
   - For smaller changes, incrementing VERSION_CURRENT is an option, but we
should make sure we retain test coverage for the previous version.
See Lucene90RWPostingsFormat for an example of this: it passes an old
version to Lucene90BlockTreeTermsWriter, and TestLucene90PostingsFormat
tests this format.
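
As a small illustration of the metadata header option, this is the usual
pattern with CodecUtil. ExampleFormat and the VERSION_* values below are made
up; old indexes written before VERSION_START fail with a clear error, and
anything in [VERSION_START, VERSION_CURRENT] is accepted:

import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.store.IndexInput;

final class ExampleFormatVersions {
  static final String CODEC_NAME = "ExampleFormat"; // hypothetical format name
  static final int VERSION_START = 0;
  static final int VERSION_CURRENT = 1;

  static void writeHeader(DataOutput out, byte[] segmentId, String segmentSuffix)
      throws IOException {
    CodecUtil.writeIndexHeader(out, CODEC_NAME, VERSION_CURRENT, segmentId, segmentSuffix);
  }

  static int readHeader(IndexInput in, byte[] segmentId, String segmentSuffix)
      throws IOException {
    // Returns the version that was actually written, so readers can branch on it.
    return CodecUtil.checkIndexHeader(
        in, CODEC_NAME, VERSION_START, VERSION_CURRENT, segmentId, segmentSuffix);
  }
}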

On Fri, Apr 12, 2024 at 2:00 PM Benjamin Trent 
wrote:

> Hey y'all,
>
> I am confused about when we should supply a new format name (e.g.
> Lucene911... vs. Lucene99) versus using a new metadata header version
> (incrementing VERSION_CURRENT).
>
> Are there general rules to follow?
>
> At first glance, using a new Lucene format name prefix is functionally the
> same as adjusting the metadata header version. Older versions won't be able
> to read it. Newer versions will be able to read it and will be able to read
> older formats (both named and via metadata versioning).
>
> Thanks!
>
> Ben
>
>
>

-- 
Adrien


Re: Lucene 10

2024-03-20 Thread Adrien Grand
Thanks Mike and Dawid for the kind words, and thanks Patrick, Luca and Egor
for your interest in decoupling index geometry from search concurrency,
this would be a great release highlight if we can get it into Lucene 10!

I haven't seen pushback on the proposed schedule so I plan on proceeding
with this timeline in mind.

If you have changes that you would like to include in Lucene 10.0, please
add the 10.0 milestone to them. It's ok to be a bit ambitious at this stage
and optimistically mark some changes as scheduled for 10.0; we'll have
opportunities to remove items from this list when the date comes closer and
some issues are not getting proper traction. I'll take care of that.

On Mon, Mar 18, 2024 at 11:39 AM Dawid Weiss  wrote:

> [...] but Adrien I don't honestly believe anyone who is
>> paying attention thinks that is what you have been doing!
>
>
> +1. I wish I were procrastinating as productively!
>
> D.
>


-- 
Adrien


Lucene 10

2024-03-13 Thread Adrien Grand
Hello everyone!

It's been ~2.5 years since we released Lucene 9.0 (December 2021) and I'd
like us to start working towards Lucene 10.0. I'm volunteering for being
the release manager and propose the following timeline:
 - ~September 15th: main gets bumped to 11.x, branch_10x gets created
 - ~September 22nd: Do a last 9.x minor release.
 - ~October 1st: Release 10.0.

This may sound like a long notice period. My motivation is that there are a
few changes I have on my mind that are likely worthy of a major release,
and I plan on taking advantage of a date being set to stop procrastinating
and finally start moving these enhancements forward. These are not
blockers, only my wish list for Lucene 10.0, if they are not ready in time
we can have discussions about letting them slip until the next major.
 - Greater I/O concurrency. Can Lucene better utilize modern disks that are
plenty concurrent?
 - Decouple search concurrency from index geometry. Can Lucene better utilize
modern CPUs that are plenty concurrent?
 - "Sparse indexing" / "zone indexing" for sorted indexes. This is one of the
most efficient techniques that OLAP databases take advantage of to make
search fast. Let's bring it to Lucene.

This list isn't meant to be an exhaustive list of release highlights for
Lucene 10, feel free to add your own. There are also a number of cleanups
we may want to consider. I wanted to share this list for visibility though
in case you have thoughts on these enhancements and/or would like to help.

-- 
Adrien


Re: [Vote] Bump the Lucene main branch to Java 21

2024-02-23 Thread Adrien Grand
+1

On Fri, Feb 23, 2024 at 12:54 PM Uwe Schindler  wrote:
>
> Here is my +1
>
> Uwe
>
> > On 23.02.2024 at 12:24, Chris Hegarty wrote:
> > Hi,
> >
> > Since the discussion on bumping the Lucene main branch to Java 21 is 
> > winding down, let's hold a vote on this important change.
> >
> > Once bumped, the next major release of Lucene (whenever that will be) will 
> > require a version of Java greater than or equal to Java 21.
> >
> > The vote will be open for at least 72 hours (and allow some additional time 
> > for the weekend) i.e. until 2024-02-28 12:00 UTC.
> >
> > [ ] +1  approve
> > [ ] +0  no opinion
> > [ ] -1  disapprove (and reason why)
> >
> > Here is my +1
> >
> > -Chris.
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>


-- 
Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: (lucene) branch main updated: Fix bw index generation logic.

2024-02-20 Thread Adrien Grand
I had to fix a couple things for addBackcompatIndexes.py to work
properly. I pushed directly because it would have been a bit
cumbersome to run this script without pushing these changes first, but
I'd still appreciate a review if anyone is up for it.

On Tue, Feb 20, 2024 at 10:14 PM  wrote:
>
> This is an automated email from the ASF dual-hosted git repository.
>
> jpountz pushed a commit to branch main
> in repository https://gitbox.apache.org/repos/asf/lucene.git
>
>
> The following commit(s) were added to refs/heads/main by this push:
>  new 13d561af1d6 Fix bw index generation logic.
> 13d561af1d6 is described below
>
> commit 13d561af1d624f35f8a27a05490062ac2472e786
> Author: Adrien Grand 
> AuthorDate: Tue Feb 20 22:10:01 2024 +0100
>
> Fix bw index generation logic.
> ---
>  dev-tools/scripts/addBackcompatIndexes.py  | 13 +++-
>  .../BackwardsCompatibilityTestBase.java| 23 
> +++---
>  .../backward_index/TestGenerateBwcIndices.java |  2 ++
>  3 files changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/dev-tools/scripts/addBackcompatIndexes.py 
> b/dev-tools/scripts/addBackcompatIndexes.py
> index bbaf0b40630..7faacb8b8e3 100755
> --- a/dev-tools/scripts/addBackcompatIndexes.py
> +++ b/dev-tools/scripts/addBackcompatIndexes.py
> @@ -45,16 +45,13 @@ def create_and_add_index(source, indextype, 
> index_version, current_version, temp
>'emptyIndex': 'empty'
>  }[indextype]
>if indextype in ('cfs', 'nocfs'):
> -dirname = 'index.%s' % indextype
>  filename = '%s.%s-%s.zip' % (prefix, index_version, indextype)
>else:
> -dirname = indextype
>  filename = '%s.%s.zip' % (prefix, index_version)
>
>print('  creating %s...' % filename, end='', flush=True)
>module = 'backward-codecs'
>index_dir = os.path.join('lucene', module, 
> 'src/test/org/apache/lucene/backward_index')
> -  test_file = os.path.join(index_dir, filename)
>if os.path.exists(os.path.join(index_dir, filename)):
>  print('uptodate')
>  return
> @@ -76,24 +73,20 @@ def create_and_add_index(source, indextype, 
> index_version, current_version, temp
>  '-Dtests.codec=default'
>])
>base_dir = os.getcwd()
> -  bc_index_dir = os.path.join(temp_dir, dirname)
> -  bc_index_file = os.path.join(bc_index_dir, filename)
> +  bc_index_file = os.path.join(temp_dir, filename)
>
>if os.path.exists(bc_index_file):
>  print('alreadyexists')
>else:
> -if os.path.exists(bc_index_dir):
> -  shutil.rmtree(bc_index_dir)
>  os.chdir(source)
>  scriptutil.run('./gradlew %s' % gradle_args)
> -os.chdir(bc_index_dir)
> -scriptutil.run('zip %s *' % filename)
> +if not os.path.exists(bc_index_file):
> +  raise Exception("Expected file can't be found: %s" %bc_index_file)
>  print('done')
>
>print('  adding %s...' % filename, end='', flush=True)
>scriptutil.run('cp %s %s' % (bc_index_file, os.path.join(base_dir, 
> index_dir)))
>os.chdir(base_dir)
> -  scriptutil.run('rm -rf %s' % bc_index_dir)
>print('done')
>
>  def update_backcompat_tests(index_version, current_version):
> diff --git 
> a/lucene/backward-codecs/src/test/org/apache/lucene/backward_index/BackwardsCompatibilityTestBase.java
>  
> b/lucene/backward-codecs/src/test/org/apache/lucene/backward_index/BackwardsCompatibilityTestBase.java
> index 8df28d40dbc..b131bb9497b 100644
> --- 
> a/lucene/backward-codecs/src/test/org/apache/lucene/backward_index/BackwardsCompatibilityTestBase.java
> +++ 
> b/lucene/backward-codecs/src/test/org/apache/lucene/backward_index/BackwardsCompatibilityTestBase.java
> @@ -17,6 +17,7 @@
>  package org.apache.lucene.backward_index;
>
>  import com.carrotsearch.randomizedtesting.annotations.Name;
> +import java.io.FileOutputStream;
>  import java.io.IOException;
>  import java.io.InputStream;
>  import java.io.LineNumberReader;
> @@ -38,11 +39,17 @@ import java.util.function.Predicate;
>  import java.util.regex.Matcher;
>  import java.util.regex.Pattern;
>  import java.util.stream.Collectors;
> +import java.util.zip.ZipEntry;
> +import java.util.zip.ZipOutputStream;
>  import org.apache.lucene.codecs.Codec;
>  import org.apache.lucene.index.DirectoryReader;
>  import org.apache.lucene.index.LeafReaderContext;
>  import org.apache.lucene.index.SegmentReader;
>  import org.apache.lucene.store.Directory;
> +import org.apache.lucene.store.FSDirectory;
> +import org.apache.lucene.store.IOContext;
> +import org.apache.lucene.store.IndexInput;
> +import org.apache.lucene.store.OutputStreamDataOutput;
>  import org.apache.lucene.tests.util.LuceneTestCase;

[ANNOUNCE] Apache Lucene 9.10.0 released

2024-02-20 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.10.

Apache Lucene is a high-performance, full-featured search engine library
written entirely in Java. It is a technology suitable for nearly any
application that requires structured search, full-text search, faceting,
nearest-neighbor search on high-dimensionality vectors, spell correction or
query suggestions.

This release contains numerous features, optimizations, and improvements,
some of which are highlighted below. The release is available for immediate
download at:
  https://lucene.apache.org/core/downloads.html

Lucene 9.10 Release Highlights

New Features

 * Support for similarity-based vector searches, i.e. finding all nearest
neighbors whose similarity is greater than a configured threshold from a
query vector. See [Byte|Float]VectorSimilarityQuery.

 * Index sorting is now compatible with block joins. See
IndexWriterConfig#setParentField.

 * MMapDirectory now takes advantage of the now finalized JDK foreign
memory API internally when running on Java 22 (or later). This was only
supported with Java 19 to 21 until now.

 * SIMD vectorization now takes advantage of JDK vector incubator on Java
22. This was only supported with Java 20 or 21 until now.

Optimizations

 * Tail postings are now encoded using group-varint. This yielded speedups
on queries that match lots of terms that have short postings lists in
Lucene's nightly benchmarks.

 * Range queries on points now exit earlier when evaluating a segment that
has no matches. This will improve performance when intersected with other
queries that have a high up-front cost such as multi-term queries.

 * BooleanQueries that mix SHOULD and FILTER clauses now propagate minimum
competitive scores to the SHOULD clauses, yielding significant speedups for
top-k queries sorted by descending score.

 * IndexSearcher#count has been optimized on pure disjunctions of two term
queries.

... plus a multitude of helpful bug fixes!

Further details of changes are available in the change log available at:
http://lucene.apache.org/core/9_10_0/changes/Changes.html.

Please report any feedback to the mailing lists (
http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also applies to Maven access.

-- 
Adrien


Welcome Zhang Chao as Lucene committer

2024-02-20 Thread Adrien Grand
I'm pleased to announce that Zhang Chao has accepted the PMC's
invitation to become a committer.

Chao, the tradition is that new committers introduce themselves with a
brief bio.

Congratulations and welcome!

-- 
Adrien


Re: Announcing githubsearch!

2024-02-20 Thread Adrien Grand
Very cool, thank you Mike!

On Mon, Feb 19, 2024 at 5:40 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hi Team,
>
> ~1.5 years ago (August 2022) we migrated our Lucene issue tracking from
> Jira to GitHub. Thank you Tomoko for all the hard work doing such a
> complex, multi-phased, high-fidelity migration!
>
> I finally finished also migrating jirasearch to GitHub:
> githubsearch.mikemccandless.com. It was tricky because GitHub issues/PRs
> are fundamentally more complex than Jira's data model, and the GitHub REST
> API is also quite rich / heavily normalized. All of the source code for
> githubsearch lives here
> .
> The UI remains its barebones self ;)
>
> Githubsearch
> 
> is dog food for us: it showcases Lucene (currently 9.8.0), and many of its
> fun features like infix autosuggest, block join queries (each comment is a
> sub-document on the issue/PR), DrillSideways faceting, near-real-time
> indexing/searching, synonyms (try “oome
> ”),
> expressions, non-relevance and blended-relevance sort, etc.  (This old
> blog post
> 
>  goes
> into detail.)  Plus, it’s meta-fun to use Lucene to search its own issues,
> to help us be more productive in improving Lucene!  Nicely recursive.
>
> In addition to good ol’ searching by text, githubsearch
>  has some new/fun features:
>
>- Drill down to just PRs or issues
>- Filter by “review requested” for a given user: poor Adrien has 8
>(open) now
>
> 
>(sorry)! Or see your mentions (Robert is mentioned in 27 open
>issues/PRs
>
> ).
>Or PRs that you reviewed (Uwe has reviewed 9 still-open PRs
>
> ).
>Or issues and PRs where a user has had any involvement at all (Dawid
>has interacted on 197 issues/PRs
>
> 
>).
>- Find still-open PRs that were created by a New Contributor
>
> 
>(an author who has no changes merged into our repository) or
>Contributor
>
> 
>(non-committer who has had some changes merged into our repository) or
>Member
>
> 
>- Here are the uber-stale (last touched more than a month ago) open
>PRs by outside contributors
>
> .
>We should ideally keep this at 0, but it’s 83 now!
>- “Link to this search” to get a short-er, more permanent URL (it is
>NOT a URL shortener, though!)
>- Save named searches you frequently run (they just save to local
>cookie state on that one browser)
>
> I’m sure there are exciting bugs, feedback/patches welcome!  If you see
> problems, please reply to this email or file an issue here
> .
>
> Note that jirasearch 
> remains running, to search Solr, Tika and Infra issues.
>
> Happy Searching,
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>


-- 
Adrien


[RESULT][VOTE] Release Lucene 9.10.0 RC1

2024-02-20 Thread Adrien Grand
It's been >72h since the vote was initiated and the result is:

+1  10  (8 binding)
 0  0
-1  0

This vote has PASSED.

On Mon, Feb 19, 2024 at 12:33 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> +1
>
> SUCCESS! [0:19:57.370204]
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Feb 19, 2024 at 6:26 AM Chris Hegarty
>  wrote:
>
>>
>> +1   SUCCESS! [1:14:49.683559]
>>
>> -Chris.
>>
>> > On 15 Feb 2024, at 21:08, Uwe Schindler  wrote:
>> >
>> > Hi,
>> > I used Stefan Vodita's Hack to make the Smoketester run on a large list
>> of JDKs: https://github.com/apache/lucene/pull/13108
>> > See the console of running Java 11, Java 17, Java 19, Java 20, Java 21.
>> Due to limitations of Gradle I wasn't able to do the smoker checks on Java
>> 22 release candidate, but as there are no changes to 9.x branch I assume
>> that everything also works in Java 22. If anybody else has time to run a
>> test project with Java 22 using mmap and vectors it would be great!
>> > Log file:
>> https://jenkins.thetaphi.de/job/Lucene-Release-Tester-v2/3/console
>> > Result was:
>> > SUCCESS! [2:42:55.968473]
>> >
>> > Here is my +1 (binding).
>> > Uwe
>> >
>> >> On 15.02.2024 at 12:50, Uwe Schindler wrote:
>> >> Hi,
>> >> I ran the default smoke tester with Java 11 and Java 17 on Policeman
>> Jenkins; all looks fine:
>> https://jenkins.thetaphi.de/job/Lucene-Release-Tester/32/console
>> >> SUCCESS! [1:04:45.740708]
>> >> I only have one problem. Now that Java 21 LTS is out and more and more
>> people use it, it would be good to also run the smoke tester with Java 21.
>> I tried that locally by just passing the home dir of Java 21 instead of
>> Java 17, but that failed due to some check in the smoke tester.
>> >> I will work this evening on patching the smoke tester to also allow it to
>> pass Java 21. Maybe the best would be to pass multiple Java versions as a
>> comma-separated list, just the default one must be Java 11 (the baseline).
>> This would allow me to spin Policeman Jenkins with Java 11, Java 17, Java
>> 19, Java 20, Java 21 and Java 22-rc1. Takes a while but would make sure all
>> works in the officially MR-JAR supported releases + LTS.
>> >> What do you think?
>> >> I will give my +1 later when I checked the options and also looked
>> into the downloaded artifacts.
>> >> Uwe
>> >> On 14.02.2024 at 20:28, Adrien Grand wrote:
>> >>> Please vote for release candidate 1 for Lucene 9.10.0
>> >>>
>> >>> The artifacts can be downloaded from:
>> >>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.10.0-RC1-rev-695c0ac84508438302cd346a812cfa2fdc5a10df
>> >>>
>> >>> You can run the smoke tester directly with this command:
>> >>>
>> >>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>> >>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.10.0-RC1-rev-695c0ac84508438302cd346a812cfa2fdc5a10df
>> >>>
>> >>> The vote will be open for at least 72 hours i.e. until 2024-02-17
>> 20:00 UTC.
>> >>>
>> >>> [ ] +1  approve
>> >>> [ ] +0  no opinion
>> >>> [ ] -1  disapprove (and reason why)
>> >>>
>> >>> Here is my +1
>> >>>
>> >>> --
>> >>> Adrien
>> >> --
>> >> Uwe Schindler
>> >> Achterdiek 19, D-28357 Bremen
>> >> https://www.thetaphi.de
>> >> eMail: u...@thetaphi.de
>> > --
>> > Uwe Schindler
>> > Achterdiek 19, D-28357 Bremen
>> > https://www.thetaphi.de
>> > eMail: u...@thetaphi.de
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>

-- 
Adrien


Re: [JENKINS] Lucene » Lucene-NightlyTests-9.x - Build # 825 - Still Unstable!

2024-02-15 Thread Adrien Grand
I removed 8.12 from the versions.txt file since it hasn't been released.

On Thu, Feb 15, 2024 at 7:38 AM Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> Build:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-9.x/825/
>
> 6 tests failed.
> FAILED:
> org.apache.lucene.backward_index.TestBinaryBackwardsCompatibility.testReadNMinusTwoCommit
> {Lucene-Version:8.12.0; Pattern: unsupported.%1$s-nocfs.zip}
>
> Error Message:
> java.lang.AssertionError: Index name 8.12.0 not found:
> unsupported.8.12.0-nocfs.zip
>
> Stack Trace:
> java.lang.AssertionError: Index name 8.12.0 not found:
> unsupported.8.12.0-nocfs.zip
> at
> __randomizedtesting.SeedInfo.seed([ED00AE45564CD44F:1E4C9941B6E4BF8F]:0)
> at junit@4.13.1/org.junit.Assert.fail(Assert.java:89)
> at junit@4.13.1/org.junit.Assert.assertTrue(Assert.java:42)
> at junit@4.13.1/org.junit.Assert.assertNotNull(Assert.java:713)
> at
> org.apache.lucene.backward_index.BackwardsCompatibilityTestBase.setUp(BackwardsCompatibilityTestBase.java:137)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:980)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> at org.apache.lucene.test_framework@9.11.0-SNAPSHOT
> /org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> at org.apache.lucene.test_framework@9.11.0-SNAPSHOT
> /org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at org.apache.lucene.test_framework@9.11.0-SNAPSHOT
> /org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at org.apache.lucene.test_framework@9.11.0-SNAPSHOT
> /org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at org.apache.lucene.test_framework@9.11.0-SNAPSHOT
> /org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at junit@4.13.1
> /org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
> at org.apache.lucene.test_framework@9.11.0-SNAPSHOT
> /org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at org.apache.lucene.test_framework@9.11.0-SNAPSHOT
> /org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at randomizedtesting.runner@2.8.1
> /com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at randomizedtesting.runner@2.8.1
> 

[VOTE] Release Lucene 9.10.0 RC1

2024-02-14 Thread Adrien Grand
Please vote for release candidate 1 for Lucene 9.10.0

The artifacts can be downloaded from:
https://dist.apache.org/repos/dist/dev/lucene/lucene-9.10.0-RC1-rev-695c0ac84508438302cd346a812cfa2fdc5a10df

You can run the smoke tester directly with this command:

python3 -u dev-tools/scripts/smokeTestRelease.py \
https://dist.apache.org/repos/dist/dev/lucene/lucene-9.10.0-RC1-rev-695c0ac84508438302cd346a812cfa2fdc5a10df

The vote will be open for at least 72 hours i.e. until 2024-02-17 20:00 UTC.

[ ] +1  approve
[ ] +0  no opinion
[ ] -1  disapprove (and reason why)

Here is my +1

-- 
Adrien


Re: Lucene 9.10

2024-02-13 Thread Adrien Grand
I started a draft for release notes, feel free to modify or add more
release highlights.

https://cwiki.apache.org/confluence/display/LUCENE/Release+notes+9.10

On Thu, Feb 8, 2024 at 11:49 AM Uwe Schindler  wrote:

> Hi Adrien,
>
> as discussed in the PR, I will merge the MMapDir and Panama Vector for JDK
> 22 later today or at latest tomorrow. I need to first download the RC
> version of JDK that is going to be released today and do the usual API
> consistency checks (checking no late API changes appeared).
>
> So next Wednesday is perfectly fine.
>
> Uwe
> On 07.02.2024 at 15:57, Adrien Grand wrote:
>
> Hello all,
>
> It's been 2 months since we released 9.9 and we accumulated a good number
> of changes, so I'd like to propose that we release 9.10.0.
>
> If there are no objections, I volunteer to be the release manager and
> suggest cutting the branch next Monday (February 12th) and starting the
> release process on Wednesday, one week from now (February 14th).
>
> +Uwe Schindler  I remember that there are JDK22-related
> changes that you'd like to get into 9.10, feel free to let me know if this
> timeline doesn't work for you.
>
> --
> Adrien
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>

-- 
Adrien


Re: (lucene) branch branch_9_10 created (now 695c0ac8450)

2024-02-12 Thread Adrien Grand
You're so quick Uwe, thank you!

On Mon, Feb 12, 2024 at 2:49 PM Uwe Schindler  wrote:

> Hi Adrien,
>
> Thanks for creating the branch. I activated Policeman Jenkins tests for it.
>
> Uwe
>
> On 12.02.2024 at 14:30, jpou...@apache.org wrote:
> > This is an automated email from the ASF dual-hosted git repository.
> >
> > jpountz pushed a change to branch branch_9_10
> > in repository https://gitbox.apache.org/repos/asf/lucene.git
> >
> >
> >at 695c0ac8450 Add the missing Version field for 8.11.3. (#13093)
> >
> > No new revisions were added by this update.
> >
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-- 
Adrien


New branch and feature freeze for Lucene 9.10.0

2024-02-12 Thread Adrien Grand
NOTICE:

Branch branch_9_10 has been cut and versions updated to 9.11 on stable
branch.

Please observe the normal rules:

* No new features may be committed to the branch.
* Documentation patches, build patches and serious bug fixes may be
  committed to the branch. However, you should submit all patches you
  want to commit as pull requests first to give others the chance to review
  and possibly vote against them. Keep in mind that it is our
  main intention to keep the branch as stable as possible.
* All patches that are intended for the branch should first be committed
  to the unstable branch, merged into the stable branch, and then into
  the current release branch.
* Normal unstable and stable branch development may continue as usual.
  However, if you plan to commit a big change to the unstable branch
  while the branch feature freeze is in effect, think twice: can't the
  addition wait a couple more days? Merges of bug fixes into the branch
  may become more difficult.
* Only Github issues with Milestone 9.10
  and priority "Blocker" will delay a release candidate build.

We have one such blocker currently:
https://github.com/apache/lucene/issues/13094.

-- 
Adrien


Lucene 9.10

2024-02-07 Thread Adrien Grand
Hello all,

It's been 2 months since we released 9.9 and we accumulated a good number
of changes, so I'd like to propose that we release 9.10.0.

If there are no objections, I volunteer to be the release manager and
suggest cutting the branch next Monday (February 12th) and starting the
release process on Wednesday, one week from now (February 14th).

+Uwe Schindler  I remember that there are JDK22-related
changes that you'd like to get into 9.10, feel free to let me know if this
timeline doesn't work for you.

-- 
Adrien


Re: Computing weight.count() cheaply in the face of deletes?

2024-02-06 Thread Adrien Grand
Good point, I opened an issue to discuss this:
https://github.com/apache/lucene/issues/13084.

Did we actually use a sparse bit set to encode deleted docs before? I don't
recall that.

On Tue, Feb 6, 2024 at 2:42 PM Uwe Schindler  wrote:

> Hi,
>
> A SparseBitset impl for DELETES would be fine if the model in Lucene would
> encode deleted docs (it did that in earlier times). As deletes are sparse
> (deletes are in most cases <40%), this would help to make the iterator
> cheaper.
> Uwe
>
> On 06.02.2024 at 09:01, Adrien Grand wrote:
>
> Hey Michael,
>
> You are right, iterating all deletes with nextClearBit() would run in
> O(maxDoc). I am coming from the other direction, where I'm expecting the
> number of deletes to be more in the order of 1%-5% of the doc ID space, so
> a separate int[] would use lots of heap and probably not help that much
> compared with nextClearBit(). My mental model is that the two most common
> use-cases are append-only workloads, where there are no deletes at all, and
> update workloads, which would commonly have several percents of deleted
> docs. It's not clear to me how common it is to have very few deletes.
>
> On Tue, Feb 6, 2024 at 7:03 AM Michael Froh  wrote:
>
>> Thanks Adrien!
>>
>> My thinking with a separate iterator was that nextClearBit() is
>> relatively expensive (O(maxDoc) to traverse everything, I think). The
>> solution I was imagining would involve an index-time change to output, say,
>> an int[] of deleted docIDs if the number is sufficiently small (like maybe
>> less than 1000). Then the livedocs interface could optionally return a
>> cheap deleted docs iterator (i.e. only if the number of deleted docs is
>> less than the threshold). Technically, the cost would be O(1), since we set
>> a constant bound on the effort and fail otherwise. :)
>>
>> I think 1000 doc value lookups would be cheap, but I don't know if the
>> guarantee is cheap enough to make it into Weight#count.
>>
>> That said, I'm going to see if iterating with nextClearBit() is
>> sufficiently cheap. Hmm... precomputing that int[] for deleted docIDs on
>> refresh could be an option too.
>>
>> Thanks again,
>> Froh
>>
>> On Fri, Feb 2, 2024 at 11:38 PM Adrien Grand  wrote:
>>
>>> Hi Michael,
>>>
>>> Indeed, only MatchAllDocsQuery knows how to produce a count when there
>>> are deletes.
>>>
>>> Your idea sounds good to me, do you actually need a side car iterator
>>> for deletes, or could you use a nextClearBit() operation on the bit set?
>>>
>>> I don't think we can fold it into Weight#count since there is an
>>> expectation that it is negligible compared with the cost of a naive count,
>>> but we may be able to do it in IndexSearcher#count or on the OpenSearch
>>> side.
>>>
>>> On Fri, Feb 2, 2024, 23:50, Michael Froh  wrote:
>>>
>>>> Hi,
>>>>
>>>> On OpenSearch, we've been taking advantage of the various O(1)
>>>> Weight#count() implementations to quickly compute various aggregations
>>>> without needing to iterate over all the matching documents (at least when
>>>> the top-level query is functionally a match-all at the segment level). Of
>>>> course, from what I've seen, every clever Weight#count()
>>>> implementation falls apart (returns -1) in the face of deletes.
>>>>
>>>> I was thinking that we could still handle small numbers of deletes
>>>> efficiently if only we could get a DocIdSetIterator for deleted docs.
>>>>
>>>> Like suppose you're doing a date histogram aggregation, you could get
>>>> the counts for each bucket from the points tree (ignoring deletes), then
>>>> iterate through the deleted docs and decrement their contribution from the
>>>> relevant bucket (determined based on a docvalues lookup). Assuming the
>>>> number of deleted docs is small, it should be cheap, right?
>>>>
>>>> The current LiveDocs implementation is just a FixedBitSet, so AFAIK
>>>> it's not great for iteration. I'm imagining adding a supplementary "deleted
>>>> docs iterator" that could sit next to the FixedBitSet if and only if the
>>>> number of deletes is "small". Is there a better way that I should be
>>>> thinking about this?
>>>>
>>>> Thanks,
>>>> Froh
>>>>
>>>
>
> --
> Adrien
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>

-- 
Adrien


Re: Computing weight.count() cheaply in the face of deletes?

2024-02-06 Thread Adrien Grand
Hey Michael,

You are right, iterating all deletes with nextClearBit() would run in
O(maxDoc). I am coming from the other direction, where I'm expecting the
number of deletes to be more in the order of 1%-5% of the doc ID space, so
a separate int[] would use lots of heap and probably not help that much
compared with nextClearBit(). My mental model is that the two most common
use-cases are append-only workloads, where there are no deletes at all, and
update workloads, which would commonly have several percents of deleted
docs. It's not clear to me how common it is to have very few deletes.

On Tue, Feb 6, 2024 at 7:03 AM Michael Froh  wrote:

> Thanks Adrien!
>
> My thinking with a separate iterator was that nextClearBit() is relatively
> expensive (O(maxDoc) to traverse everything, I think). The solution I was
> imagining would involve an index-time change to output, say, an int[] of
> deleted docIDs if the number is sufficiently small (like maybe less than
> 1000). Then the livedocs interface could optionally return a cheap deleted
> docs iterator (i.e. only if the number of deleted docs is less than the
> threshold). Technically, the cost would be O(1), since we set a constant
> bound on the effort and fail otherwise. :)
>
> I think 1000 doc value lookups would be cheap, but I don't know if the
> guarantee is cheap enough to make it into Weight#count.
>
> That said, I'm going to see if iterating with nextClearBit() is
> sufficiently cheap. Hmm... precomputing that int[] for deleted docIDs on
> refresh could be an option too.
>
> Thanks again,
> Froh
>
> On Fri, Feb 2, 2024 at 11:38 PM Adrien Grand  wrote:
>
>> Hi Michael,
>>
>> Indeed, only MatchAllDocsQuery knows how to produce a count when there
>> are deletes.
>>
>> Your idea sounds good to me, do you actually need a side car iterator for
>> deletes, or could you use a nextClearBit() operation on the bit set?
>>
>> I don't think we can fold it into Weight#count since there is an
>> expectation that it is negligible compared with the cost of a naive count,
>> but we may be able to do it in IndexSearcher#count or on the OpenSearch
>> side.
>>
>> On Fri, Feb 2, 2024, 23:50, Michael Froh  wrote:
>>
>>> Hi,
>>>
>>> On OpenSearch, we've been taking advantage of the various O(1)
>>> Weight#count() implementations to quickly compute various aggregations
>>> without needing to iterate over all the matching documents (at least when
>>> the top-level query is functionally a match-all at the segment level). Of
>>> course, from what I've seen, every clever Weight#count()
>>> implementation falls apart (returns -1) in the face of deletes.
>>>
>>> I was thinking that we could still handle small numbers of deletes
>>> efficiently if only we could get a DocIdSetIterator for deleted docs.
>>>
>>> Like suppose you're doing a date histogram aggregation, you could get
>>> the counts for each bucket from the points tree (ignoring deletes), then
>>> iterate through the deleted docs and decrement their contribution from the
>>> relevant bucket (determined based on a docvalues lookup). Assuming the
>>> number of deleted docs is small, it should be cheap, right?
>>>
>>> The current LiveDocs implementation is just a FixedBitSet, so AFAIK it's
>>> not great for iteration. I'm imagining adding a supplementary "deleted docs
>>> iterator" that could sit next to the FixedBitSet if and only if the number
>>> of deletes is "small". Is there a better way that I should be thinking
>>> about this?
>>>
>>> Thanks,
>>> Froh
>>>
>>

-- 
Adrien


Re: Computing weight.count() cheaply in the face of deletes?

2024-02-02 Thread Adrien Grand
Hi Michael,

Indeed, only MatchAllDocsQuery knows how to produce a count when there are
deletes.

Your idea sounds good to me, do you actually need a side car iterator for
deletes, or could you use a nextClearBit() operation on the bit set?

I don't think we can fold it into Weight#count since there is an
expectation that it is negligible compared with the cost of a naive count,
but we may be able to do it in IndexSearcher#count or on the OpenSearch
side.
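
To illustrate what I mean, here is a rough sketch that walks the live-docs bits
to correct a count that ignores deletes (e.g. one obtained from Weight#count or
a points tree). The names, and the Matcher callback standing in for a doc-values
lookup, are made up for illustration; this is O(maxDoc), so it only pays off
when deletes are few:

import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.Bits;

final class DeleteAwareCount {
  static int correctForDeletes(
      LeafReader reader, int countIgnoringDeletes, Matcher matchesDeletedDoc)
      throws IOException {
    Bits liveDocs = reader.getLiveDocs();
    if (liveDocs == null) {
      return countIgnoringDeletes; // no deletes in this segment
    }
    int count = countIgnoringDeletes;
    for (int doc = 0; doc < reader.maxDoc(); doc++) {
      if (liveDocs.get(doc) == false && matchesDeletedDoc.matches(doc)) {
        count--; // this deleted doc was included in the delete-ignoring count
      }
    }
    return count;
  }

  interface Matcher {
    boolean matches(int doc) throws IOException; // e.g. a doc-values lookup
  }
}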

On Fri, Feb 2, 2024, 23:50, Michael Froh  wrote:

> Hi,
>
> On OpenSearch, we've been taking advantage of the various O(1)
> Weight#count() implementations to quickly compute various aggregations
> without needing to iterate over all the matching documents (at least when
> the top-level query is functionally a match-all at the segment level). Of
> course, from what I've seen, every clever Weight#count()
> implementation falls apart (returns -1) in the face of deletes.
>
> I was thinking that we could still handle small numbers of deletes
> efficiently if only we could get a DocIdSetIterator for deleted docs.
>
> Like suppose you're doing a date histogram aggregation, you could get the
> counts for each bucket from the points tree (ignoring deletes), then
> iterate through the deleted docs and decrement their contribution from the
> relevant bucket (determined based on a docvalues lookup). Assuming the
> number of deleted docs is small, it should be cheap, right?
>
> The current LiveDocs implementation is just a FixedBitSet, so AFAIK it's
> not great for iteration. I'm imagining adding a supplementary "deleted docs
> iterator" that could sit next to the FixedBitSet if and only if the number
> of deletes is "small". Is there a better way that I should be thinking
> about this?
>
> Thanks,
> Froh
>


Re: [VOTE] Release Lucene 9.9.2 RC1

2024-01-26 Thread Adrien Grand
+1

SUCCESS! [1:00:39.059480]

On Fri, Jan 26, 2024 at 7:54 AM Ignacio Vera  wrote:

> +1
>
> SUCCESS! [0:54:32.772088]
>
> On Thu, Jan 25, 2024 at 11:23 PM Uwe Schindler  wrote:
>
>> Hi,
>>
>> +1 to release.
>>
>> Tested smoketester with Java 11 and 17; results:
>> https://jenkins.thetaphi.de/job/Lucene-Release-Tester/31/console
>>
>> Uwe
>>
>> On 25.01.2024 at 12:57, Chris Hegarty wrote:
>> > Please vote for release candidate 1 for Lucene 9.9.2
>> >
>> > The artifacts can be downloaded from:
>> >
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.2-RC1-rev-a2939784c4ca60bc28bf488b5479c02fc2e5e22c
>> >
>> > You can run the smoke tester directly with this command:
>> >
>> > python3 -u dev-tools/scripts/smokeTestRelease.py \
>> >
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.2-RC1-rev-a2939784c4ca60bc28bf488b5479c02fc2e5e22c
>> >
>> > The vote will be open for 96 hours ( allowing some additional time for
>> weekend span) i.e. until 2024-01-29 12:00 UTC.
>> >
>> > [ ] +1  approve
>> > [ ] +0  no opinion
>> > [ ] -1  disapprove (and reason why)
>> >
>> > Here is my +1
>> >
>> > Draft release notes can be found at
>> https://cwiki.apache.org/confluence/display/LUCENE/ReleaseNote9_9_2
>> >
>> > -Chris.
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>> --
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> https://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>

-- 
Adrien


Re: Welcome Stefan Vodita as Lucene committer

2024-01-18 Thread Adrien Grand
Welcome Stefan!

On Thu, Jan 18, 2024 at 6:10 PM Patrick Zhai  wrote:

> Welcome and Congrats, Stefan.
>
> Patrick
>
> On Thu, Jan 18, 2024, 08:45 Chris Hegarty
>  wrote:
>
>> Welcome Stefan.
>>
>> -Chris.
>>
>> > On 18 Jan 2024, at 15:53, Michael McCandless 
>> wrote:
>> >
>> > Hi Team,
>> >
>> > I'm pleased to announce that Stefan Vodita has accepted the Lucene
>> PMC's invitation to become a committer!
>> >
>> > Stefan, the tradition is that new committers introduce themselves with
>> a brief bio.
>> >
>> > Congratulations, welcome, and thank you for all your improvements to
>> Lucene and our community,
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>

-- 
Adrien


Re: Lucene v9.9.1: org.apache.lucene.search.ScoreMode

2024-01-12 Thread Adrien Grand
There have been a few similar reports of the ScoreMode import issue at
Elastic with Lucene 9.9.1. It looks like an Intellij-specific issue, which
can be addressed by upgrading to the latest version. (I'm not really an
Intellij user myself so I don't know more about the problem.)

On Sun, Jan 7, 2024 at 5:08 PM Guo Feng  wrote:

> Hi.
>
> I suspect that the reason for this error may be that BytesRefHash#sort is
> called more than
> once on a BytesRefHash instance. This is fine before 9.9.0, but it won't
> work after
> https://github.com/apache/lucene/pull/12784.
>
> On 2024/01/07 13:41:33 Nazerke S wrote:
> > I re-run the test in a terminal and getting this: (seems not a dependency
> > issue)
> >
> > (TEST-TestScoreJoinQPNoScore.testRandomJoin-seed#[BD516D11246BE886]) [n:
> c:
> > s: r: x: t:] o.a.s.SolrTestCaseJ4 ###Ending testRandomJoin
> >
> >> java.lang.AssertionError
> >
> >> at
> > __randomizedtesting.SeedInfo.seed([BD516D11246BE886:C4DBA1A5364C4666]:0)
> >
> >> at
> > org.apache.lucene.util.BytesRefHash.compact(BytesRefHash.java:135)
> >
> >> at
> > org.apache.lucene.util.BytesRefHash.sort(BytesRefHash.java:147)
> >
> >> at
> > org.apache.lucene.search.join.TermsQuery.(TermsQuery.java:68)
> >
> >> at
> >
> org.apache.lucene.search.join.TermsIncludingScoreQuery.createWeight(TermsIncludingScoreQuery.java:133)
> >
> >> at
> >
> org.apache.solr.search.join.ScoreJoinQParserPlugin$SameCoreJoinQuery.createWeight(ScoreJoinQParserPlugin.java:196)
> >
> >> at
> >
> org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:900)
> >
> >> at
> >
> org.apache.lucene.search.ConstantScoreQuery.createWeight(ConstantScoreQuery.java:136)
> >
> >> at
> >
> org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:900)
> >
> >> at
> > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:554)
> >
> >> at
> >
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:275)
> >
> >> at
> >
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1878)
> >
> >> at
> >
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1695)
> >
> >> at
> >
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:710)
> >
> >> at
> >
> org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1698)
> >
> >> at
> >
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:423)
> >
> >> at
> >
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:467)
> >
> >> at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:226)
> >
> >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2884)
> >
> >> at
> org.apache.solr.util.TestHarness.query(TestHarness.java:353)
> >
> >> at
> org.apache.solr.util.TestHarness.query(TestHarness.java:333)
> >
> >> at
> >
> org.apache.solr.search.join.TestScoreJoinQPNoScore.testRandomJoin(TestScoreJoinQPNoScore.java:361)
> >
> >> at
> >
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
> >
> >> at
> java.base/java.lang.reflect.Method.invoke(Method.java:578)
> >
> >   2> NOTE: reproduce with: gradlew test --tests
> > TestScoreJoinQPNoScore.testRandomJoin -Dtests.seed=BD516D11246BE886
> > -Dtests.locale=th-Thai-TH -Dtests.timezone=Europe/Berlin
> > -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> >
> > On Sun, Jan 7, 2024 at 1:35 PM Dawid Weiss 
> wrote:
> >
> > >
> > > Can you check whether it's a dependency graph problem somehow, maybe
> (does
> > > it compile outside of intellij?). Can you trim down the project to a
> > > reproducible scenario so that we can look at it?
> > >
> > > D.
> > >
> > > On Sun, Jan 7, 2024 at 5:53 AM Nazerke S 
> wrote:
> > >
> > >> One of the functions takes ScoreMode as an argument and this
> ScoreMode is
> > >> not found from dependencies.  In Intellij, seeing 'cannot resolve
> symbol
> > >> ScoreMode'.
> > >>
> > >>
> > >> @Override
> > >>
> > >> public Weight createWeight(IndexSearcher searcher,
> > >> org.apache.lucene.search.ScoreMode scoreMode, float boost) {...}
> > >>
> > >>
> > >> Tried 'import org.apache.lucene.search.ScoreMode' but not found either
> > >> way.
> > >>
> > >> On Sun, Jan 7, 2024 at 6:39 AM Marcus Eagan 
> > >> wrote:
> > >>
> > >>> It’s there for sure, but that doesn’t mean there is no problem. Could
> > >>> you share what you are seeing in more detail given the class
> certainly
> > >>> exists?
> > >>>
> > >>> Marcus Eagan
> > >>>
> > >>>
> > >>>
> > >>> On Sat, Jan 6, 2024 at 14:05 Chris Hegarty
> > >>>  wrote:
> > >>>
> >  Hi,
> > 
> >  I see no 

Re: [JENKINS] Lucene-main-Linux (64bit/hotspot/jdk-19) - Build # 45856 - Unstable!

2023-12-20 Thread Adrien Grand
I don't fully understand it yet. I opened an issue:
https://github.com/apache/lucene/issues/12957.

On Tue, Dec 19, 2023 at 6:02 PM Adrien Grand  wrote:

> This looks like a real bug with the default codec when the prefix compares
> greater than every indexed term. I'll look into it tomorrow if nobody beats
> me to it.
>
> On Tue, Dec 19, 2023 at 12:35 PM Policeman Jenkins Server <
> jenk...@thetaphi.de> wrote:
>
>> Build: https://jenkins.thetaphi.de/job/Lucene-main-Linux/45856/
>> Java: 64bit/hotspot/jdk-19 -XX:+UseCompressedOops -XX:+UseSerialGC
>>
>> 1 tests failed.
>> FAILED:  org.apache.lucene.index.TestTerms.testTermMinMaxRandom
>>
>> Error Message:
>> java.lang.AssertionError
>>
>> Stack Trace:
>> java.lang.AssertionError
>> at
>> __randomizedtesting.SeedInfo.seed([CBF65306049672F4:8785DC72680AA991]:0)
>> at
>> org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.getState(IntersectTermsEnum.java:245)
>> at
>> org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.seekToStartTerm(IntersectTermsEnum.java:288)
>> at
>> org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.(IntersectTermsEnum.java:126)
>> at
>> org.apache.lucene.codecs.lucene90.blocktree.FieldReader.intersect(FieldReader.java:223)
>> at
>> org.apache.lucene.index.CheckIndex.checkTermsIntersect(CheckIndex.java:2374)
>> at
>> org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:2327)
>> at
>> org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:2529)
>> at
>> org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:1067)
>> at
>> org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:783)
>> at
>> org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:550)
>> at
>> org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:340)
>> at
>> org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:909)
>> at
>> org.apache.lucene.index.TestTerms.testTermMinMaxRandom(TestTerms.java:85)
>> at
>> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>> at java.base/java.lang.reflect.Method.invoke(Method.java:578)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
>> at
>> org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
>> at
>> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>> at
>> org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
>> at
>> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
>> at
>> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
>> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>> at
>> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at
>> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
>> at
>> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
>> at
>> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
>> at
>> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
>> at
>> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>> at
>> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>>

Re: [JENKINS] Lucene-main-Linux (64bit/hotspot/jdk-19) - Build # 45856 - Unstable!

2023-12-19 Thread Adrien Grand
This looks like a real bug with the default codec when the prefix compares
greater than every indexed term. I'll look into it tomorrow if nobody beats
me to it.

On Tue, Dec 19, 2023 at 12:35 PM Policeman Jenkins Server <
jenk...@thetaphi.de> wrote:

> Build: https://jenkins.thetaphi.de/job/Lucene-main-Linux/45856/
> Java: 64bit/hotspot/jdk-19 -XX:+UseCompressedOops -XX:+UseSerialGC
>
> 1 tests failed.
> FAILED:  org.apache.lucene.index.TestTerms.testTermMinMaxRandom
>
> Error Message:
> java.lang.AssertionError
>
> Stack Trace:
> java.lang.AssertionError
> at
> __randomizedtesting.SeedInfo.seed([CBF65306049672F4:8785DC72680AA991]:0)
> at
> org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.getState(IntersectTermsEnum.java:245)
> at
> org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.seekToStartTerm(IntersectTermsEnum.java:288)
> at
> org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.(IntersectTermsEnum.java:126)
> at
> org.apache.lucene.codecs.lucene90.blocktree.FieldReader.intersect(FieldReader.java:223)
> at
> org.apache.lucene.index.CheckIndex.checkTermsIntersect(CheckIndex.java:2374)
> at
> org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:2327)
> at
> org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:2529)
> at
> org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:1067)
> at
> org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:783)
> at
> org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:550)
> at
> org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:340)
> at
> org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:909)
> at
> org.apache.lucene.index.TestTerms.testTermMinMaxRandom(TestTerms.java:85)
> at
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
> at java.base/java.lang.reflect.Method.invoke(Method.java:578)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> at
> org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
> at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at
> 

Re: UTF-8 well-formedness for SimpleTextCodec

2023-12-19 Thread Adrien Grand
Hey Michael,

Writing well-formed UTF-8 with the SimpleText format sounds desirable indeed,
e.g. your PR makes sense. I don't think we would want to be heroic about
it, but if we can serialize the same information easily, then it sounds
like something we should do. Thanks for improving SimpleTextCodec!
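
As a rough illustration of the approach described in the quoted message below,
the general shape of such a change is to print the bytes through a plain-text
representation rather than writing them raw (a hypothetical helper for
illustration, not the exact code of the PR):

import java.util.Arrays;

class SegmentIdAsText {
  // Render an arbitrary byte[] (e.g. a segment id) as plain ASCII so the
  // SimpleText output stays well-formed UTF-8 whatever the byte values are.
  static String toWritableText(byte[] id) {
    return Arrays.toString(id);  // e.g. "[102, -54, 3]"
  }
}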

On Mon, Dec 18, 2023 at 6:01 PM Michael Froh  wrote:

> Hi there,
>
> I was recently writing up a short Lucene file format tutorial (
> https://msfroh.github.io/lucene-university/docs/DirectoryFileContents.html),
> using SimpleTextCodec for educational purposes.
>
> I found that SimpleTextSegmentInfo tries to output the segment ID as raw
> bytes, which will often result in malformed UTF-8 output. I wrote a little
> fix to output as the text representation of a byte array (
> https://github.com/apache/lucene/pull/12897). I noticed that it's a
> similar sort of thing with binary doc values (where the bytes get written
> directly).
>
> Is there any general desire for SImpleTextCodec to output well-formed
> UTF-8 where possible?
>
> Thanks,
> Froh
>


-- 
Adrien


Re: [VOTE] Release Lucene 9.9.1 RC1

2023-12-14 Thread Adrien Grand
+1 SUCCESS! [1:41:08.997307]

Thanks Chris for taking care of this release.

On Thu, Dec 14, 2023 at 4:40 PM Michael Sokolov  wrote:

>
> +1
>
> SUCCESS! [0:50:50.776559]
>
> Note: we did get some test fails on the mailing list this morning, but I
> believe they are not real bugs and will be resolved by tightening up our
> test assumptions
>
> On Thu, Dec 14, 2023 at 7:08 AM Guo Feng  wrote:
>
>> +1
>>
>> SUCCESS! [3:38:43.833896]
>>
>> On 2023/12/14 10:44:18 Michael McCandless wrote:
>> > +1
>> >
>> > SUCCESS! [0:14:52.296147]
>> >
>> >
>> > I also cracked a bit of rust off our Monster tests and all but one
>> passed:
>> > https://github.com/apache/lucene/pull/12942
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Wed, Dec 13, 2023 at 4:24 PM Benjamin Trent 
>> > wrote:
>> >
>> > > SUCCESS! [1:06:02.232333]
>> > >
>> > > + 1!
>> > >
>> > > On Wed, Dec 13, 2023 at 3:26 PM Greg Miller 
>> wrote:
>> > >
>> > >> SUCCESS! [2:27:01.875939]
>> > >>
>> > >> +1
>> > >>
>> > >> Thanks!
>> > >> -Greg
>> > >>
>> > >> On Wed, Dec 13, 2023 at 3:58 AM Chris Hegarty
>> > >>  wrote:
>> > >>
>> > >>> And (short) release note:
>> > >>>
>> > >>>
>> https://cwiki.apache.org/confluence/display/LUCENE/ReleaseNote9_9_1
>> > >>>
>> > >>> -Chris.
>> > >>>
>> > >>> > On 13 Dec 2023, at 11:55, Chris Hegarty <
>> > >>> christopher.hega...@elastic.co> wrote:
>> > >>> >
>> > >>> > Hi,
>> > >>> >
>> > >>> > Please vote for release candidate 1 for Lucene 9.9.1
>> > >>> >
>> > >>> > The artifacts can be downloaded from:
>> > >>> >
>> > >>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.1-RC1-rev-eee32cbf5e072a8c9d459c349549094230038308
>> > >>> >
>> > >>> > You can run the smoke tester directly with this command:
>> > >>> >
>> > >>> > python3 -u dev-tools/scripts/smokeTestRelease.py \
>> > >>> >
>> > >>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.1-RC1-rev-eee32cbf5e072a8c9d459c349549094230038308
>> > >>> >
>> > >>> > The vote will be open for at least 72 hours i.e. until 2023-12-16
>> > >>> 12:00 UTC.
>> > >>> >
>> > >>> > [ ] +1  approve
>> > >>> > [ ] +0  no opinion
>> > >>> > [ ] -1  disapprove (and reason why)
>> > >>> >
>> > >>> > Here is my +1
>> > >>> >
>> > >>> > -Chris.
>> > >>> >
>> > >>>
>> > >>>
>> > >>>
>> -
>> > >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > >>> For additional commands, e-mail: dev-h...@lucene.apache.org
>> > >>>
>> > >>>
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>

-- 
Adrien


Re: [JENKINS] Lucene » Lucene-NightlyTests-main - Build # 1209 - Unstable!

2023-12-11 Thread Adrien Grand
Woops, sorry for suggesting this change in the first place! I didn't know
we had this validation for points, but not for postings.

On Fri, Dec 8, 2023 at 2:16 PM Michael McCandless 
wrote:

> OK I reverted the "optimization" to not pull FieldInfo for a field when
> getting Points values from SlowCompositeCodecReaderWrapper!  Clearly it was
> not safe ;)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Dec 8, 2023 at 8:06 AM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Uh oh -- I'll dig.  We may need to put back the FieldInfo check before
>> pulling points.  Tricky!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Dec 8, 2023 at 3:55 AM Apache Jenkins Server <
>> jenk...@builds.apache.org> wrote:
>>
>>> Build:
>>> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/1209/
>>>
>>> 3 tests failed.
>>> FAILED:  org.apache.lucene.index.TestPointValues.testSparsePoints
>>>
>>> Error Message:
>>> java.lang.IllegalStateException: this writer hit an unrecoverable error;
>>> cannot merge
>>>
>>> Stack Trace:
>>> java.lang.IllegalStateException: this writer hit an unrecoverable error;
>>> cannot merge
>>> at
>>> __randomizedtesting.SeedInfo.seed([ADA30A2081CE6DA4:A05414293C35A568]:0)
>>> at
>>> org.apache.lucene.index.IndexWriter.hasPendingMerges(IndexWriter.java:2425)
>>> at
>>> org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.hasPendingMerges(IndexWriter.java:6527)
>>> at
>>> org.apache.lucene.index.ConcurrentMergeScheduler.maybeStall(ConcurrentMergeScheduler.java:589)
>>> at
>>> org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:540)
>>> at
>>> org.apache.lucene.index.IndexWriter.executeMerge(IndexWriter.java:2315)
>>> at
>>> org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2310)
>>> at
>>> org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5985)
>>> at
>>> org.apache.lucene.index.IndexWriter.flushNextBuffer(IndexWriter.java:3606)
>>> at
>>> org.apache.lucene.tests.index.RandomIndexWriter.flushAllBuffersSequentially(RandomIndexWriter.java:263)
>>> at
>>> org.apache.lucene.tests.index.RandomIndexWriter.maybeFlushOrCommit(RandomIndexWriter.java:235)
>>> at
>>> org.apache.lucene.tests.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:226)
>>> at
>>> org.apache.lucene.index.TestPointValues.testSparsePoints(TestPointValues.java:697)
>>> at
>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>>> Method)
>>> at
>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>>> at
>>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
>>> at
>>> org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
>>> at
>>> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>>> at
>>> org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
>>> at
>>> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
>>> at
>>> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
>>> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>>> at
>>> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>>> at
>>> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
>>> at
>>> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
>>> at
>>> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
>>> at
>>> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
>>> at
>>> 

Re: [VOTE] Release Lucene 9.9.0 RC1

2023-11-30 Thread Adrien Grand
My expectation is that we will do a 9.x minor at about the same time as
10.0 anyway; this is what we have done in the past for new majors. This
will give an opportunity to make sure we have deprecation warnings for all
breaking changes in 10.0.

On Thu, Nov 30, 2023 at 10:43, Chris Hegarty
 wrote:

> For clarity, consider this vote cancelled. A new vote has been started on
> an RC2 build.
>
> On 30 Nov 2023, at 16:22, Greg Miller  wrote:
>
> If we're spinning a new RC, I'd like to ask this group if it would make
> sense to pull this very small method deprecation in:
> https://github.com/apache/lucene/pull/12854
>
> If there's a chance we don't release a 9.10 and go directly to 10.0, this
> would be our last opportunity to mark it deprecated on a 9.x version so we
> can actually remove it in 10.0. It's really minor though, so I don't want
> to create churn, but if we can get it into 9.9 without much issue, it would
> be nice. If folks agree, I can get it merged onto 9.9.
>
>
> Thanks for raising the issue. I don’t have a strong opinion on whether or
> not to do the deprecation in this release, and since you say that it is
> minor, then I don’t see that it necessitates another respin.
>
> Since I had already started an RC2 build, then I just continued with it
> (and since the above issue is not yet reviewed ). If others feel like the
> deprecation should absolutely be in, then we can do an RC3.
>
> -Chris.
>
> Cheers,
> -Greg
>
> On Thu, Nov 30, 2023 at 7:58 AM Michael Sokolov 
> wrote:
>
>> for the sake of posterity, I did get a successful smoketest:
>>
>> SUCCESS! [1:00:06.512261]
>>
>> but +0 to release I guess since it's moot...
>>
>> On Thu, Nov 30, 2023 at 10:38 AM Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>>> On Thu, Nov 30, 2023 at 9:56 AM Chris Hegarty
>>>  wrote:
>>>
>>> P.S. I’m less sure about this, but the RC 2 starts a 72hr voting time
 again? (Just so I know what TTL to put on that)

>>>
>>> Yeah a new 72 hour clock starts with each new RC :)
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>
>


Re: [VOTE] Release Lucene 9.9.0 RC1

2023-11-30 Thread Adrien Grand
Yet another bug due to ghost fields. :( Thanks for fixing! For reference, I
checked how postings work on SlowCompositeCodecReaderWrapper, since they
are prone to ghost fields as well, and they seem to be ok.

I worry that it could actually occur in practice when enabling recursive
graph bisection, so I would prefer to respin.

On Thu, Nov 30, 2023 at 6:01 AM Luca Cavanna 
wrote:

> SUCCESS! [0:33:10.432870]
>
> +1
>
> On Thu, Nov 30, 2023 at 2:59 PM Chris Hegarty
>  wrote:
>
>> Hi Mike,
>>
>> On 30 Nov 2023, at 11:41, Michael McCandless 
>> wrote:
>>
>> +1 to release.
>>
>> I hit a corner-case test failure and opened a PR to fix it:
>> https://github.com/apache/lucene/pull/12859
>>
>>
>> Good find!  It looks like the fix for this issue is well in hand - great.
>>
>> I don't think this should block the release? -- it looks exotic.
>>
>>
>> I’m not sure how likely this bug is to show in real (non-test) scenarios,
>> but it does look kinda “exotic” to me too. So unless there are counter
>> arguments, I do not see it as critical, and therefore not needing a respin.
>>
>> -Chris.
>>
>>
>> Thanks Chris!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Nov 30, 2023 at 1:16 AM Patrick Zhai  wrote:
>>
>>> SUCCESS! [1:03:54.880200]
>>>
>>> +1. Thank you Chris!
>>>
>>> On Wed, Nov 29, 2023 at 8:45 PM Nhat Nguyen
>>>  wrote:
>>>
 SUCCESS! [1:11:30.037919]

 +1. Thanks, Chris!

 On Wed, Nov 29, 2023 at 8:53 AM Chris Hegarty
  wrote:

> Hi,
>
> Please vote for release candidate 1 for Lucene 9.9.0
>
> The artifacts can be downloaded from:
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.0-RC1-rev-92a5e5b02e0e083126c4122f2b7a02426c21a037
>
> You can run the smoke tester directly with this command:
>
> python3 -u dev-tools/scripts/smokeTestRelease.py \
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.0-RC1-rev-92a5e5b02e0e083126c4122f2b7a02426c21a037
>
> The vote will be open for at least 72 hours, and given the weekend in
> between, let’s keep it open until 2023-12-04 12:00 UTC.
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
> Here is my +1
>
> Draft release highlights can be viewed here (comments and feedback
> welcome):
> https://cwiki.apache.org/confluence/display/LUCENE/ReleaseNote9_9_0
>
> -Chris.
>

>>

-- 
Adrien


Re: Lucene 9.9.0 Release

2023-11-27 Thread Adrien Grand
Thanks Chris for checking.

I had been too optimistic about #12180; I'll push it to 9.10.

Fingers crossed that #12699 fixes the performance drop.


On Mon, Nov 27, 2023 at 07:17, Chris Hegarty 
wrote:

> Hi Adrien,
>
> Comments inline.
>
> On 21 Nov 2023, at 12:31, Adrien Grand  wrote:
>
> +1 9.9 has plenty of great changes indeed! Thanks for volunteering as
> RM, Chris.
>
> It would be good to try and fix the PKLookup regression that was
> introduced since 9.8:
> http://people.apache.org/~mikemccand/lucenebench/PKLookup.html. Is it
> just about getting #12699 <https://github.com/apache/lucene/pull/12699>
> merged?
>
>
> I see that this is not yet merged. It looks like it is awaiting final
> review.
>
>
> Separately, I have a PR that does a small change to the file format of
> postings and skip lists. It's certainly not a blocker for 9.9, but it would
> be convenient to get it into 9.9 since we already changed file formats for
> the switch from PFOR to FOR. Does someone have time to take a look? #12810
> <https://github.com/apache/lucene/pull/12810>
>
>
> I see that this has been merged, and later reverted because of some test
> instability. The new issue tracking this work is #12810
> <https://github.com/apache/lucene/pull/12810> [1]. Are we still expecting
> this to be resolved in 9.9.0 ?
>
> -Chris.
>
> [1] https://github.com/apache/lucene/pull/12810
>
>
> On Tue, Nov 21, 2023 at 11:16 AM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> +1
>>
>> Thank you for volunteering as RC Chris!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Tue, Nov 21, 2023 at 4:52 AM Chris Hegarty
>>  wrote:
>>
>>> Hi,
>>>
>>> It's been a while since the 9.8.0 release and we’ve accumulated quite a
>>> few changes. I’d like to propose that we release 9.9.0.
>>>
>>> If there's no objections, I volunteer to be the release manager and will
>>> cut the feature branch a week from now, 12:00 28th Nov UTC.
>>>
>>> -Chris.
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>
>
> --
> Adrien
>
>
>


Re: Lucene 9.9.0 Release

2023-11-21 Thread Adrien Grand
+1 9.9 has plenty of great changes indeed! Thanks for volunteering as RM,
Chris.

It would be good to try and fix the PKLookup regression that was introduced
since 9.8: http://people.apache.org/~mikemccand/lucenebench/PKLookup.html.
Is it just about getting #12699
 merged?

Separately, I have a PR that does a small change to the file format of
postings and skip lists. It's certainly not a blocker for 9.9, but it would
be convenient to get it into 9.9 since we already changed file formats for
the switch from PFOR to FOR. Does someone have time to take a look? #12810


On Tue, Nov 21, 2023 at 11:16 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> +1
>
> Thank you for volunteering as RC Chris!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Nov 21, 2023 at 4:52 AM Chris Hegarty
>  wrote:
>
>> Hi,
>>
>> It's been a while since the 9.8.0 release and we’ve accumulated quite a
>> few changes. I’d like to propose that we release 9.9.0.
>>
>> If there's no objections, I volunteer to be the release manager and will
>> cut the feature branch a week from now, 12:00 28th Nov UTC.
>>
>> -Chris.
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>

-- 
Adrien


Heads up: reindex main/9.x indices

2023-11-20 Thread Adrien Grand
Hello all,

The 9.9 file format was just updated to encode tail postings using
group-vint instead of vint[1], so you need to reindex all indices generated
from the main and branch_9x branches. As always, indexes created from a
proper Lucene release are still compatible.

[1] https://github.com/apache/lucene/pull/12782
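
For background, group-vint (group-varint) packs four integers behind one flag
byte that records each value's byte length, instead of giving every value its
own per-byte continuation bit as vInt does. A simplified sketch of a group
encoder, for illustration only (the actual on-disk layout used by Lucene may
differ):

import java.io.ByteArrayOutputStream;

class GroupVarintSketch {
  // Encode four ints: one flag byte (2 bits per value = byte length - 1),
  // followed by each value's bytes in little-endian order.
  static byte[] encodeGroup(int[] values) {  // expects values.length == 4
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int flags = 0;
    byte[][] bodies = new byte[4][];
    for (int i = 0; i < 4; i++) {
      int v = values[i];
      int len = 1;
      if ((v & 0xFFFFFF00) != 0) len = 2;
      if ((v & 0xFFFF0000) != 0) len = 3;
      if ((v & 0xFF000000) != 0) len = 4;
      flags |= (len - 1) << (i * 2);
      byte[] body = new byte[len];
      for (int j = 0; j < len; j++) {
        body[j] = (byte) (v >>> (8 * j));
      }
      bodies[i] = body;
    }
    out.write(flags);
    for (byte[] body : bodies) {
      out.write(body, 0, body.length);
    }
    return out.toByteArray();
  }
}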

-- 
Adrien


Re: [JENKINS] Lucene » Lucene-Check-main - Build # 10678 - Unstable!

2023-11-20 Thread Adrien Grand
A one-in-a-million-runs test failure. I pushed a fix:
https://github.com/apache/lucene/commit/194a500323531b66124577167006115c34dfde54
.

On Sun, Nov 19, 2023 at 10:00 PM Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> Build:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main/10678/
>
> 1 tests failed.
> FAILED:  org.apache.lucene.util.TestPagedBytes.testDataInputOutput2
>
> Error Message:
> java.lang.IllegalArgumentException: bound must be positive
>
> Stack Trace:
> java.lang.IllegalArgumentException: bound must be positive
> at
> __randomizedtesting.SeedInfo.seed([65D9DDAC3D23F916:EB18D32A6DB237D6]:0)
> at java.base/java.util.Random.nextInt(Random.java:322)
> at
> com.carrotsearch.randomizedtesting.Xoroshiro128PlusRandom.nextInt(Xoroshiro128PlusRandom.java:73)
> at
> com.carrotsearch.randomizedtesting.AssertingRandom.nextInt(AssertingRandom.java:87)
> at
> org.apache.lucene.util.TestPagedBytes.testDataInputOutput2(TestPagedBytes.java:156)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
> at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> at
> org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
> at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at
> org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> 

Re: SPLADE implementation

2023-11-15 Thread Adrien Grand
Say your model produces a set of weighted terms:
 - At index time, for each (term, weight) pair, you add a "new
FeatureField(fieldName, term, weight)" field to your document.
 - At search time, for each (term, weight) pair, you add a "new
BooleanClause(FeatureField.newLinearQuery(fieldName, term, weight))" to
your BooleanQuery.
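
For example, a minimal sketch of the above (the field name "splade" and the
Map holding the model's weighted terms are just placeholders for
illustration):

import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

class SpladeFeatureFieldSketch {
  // Index time: one FeatureField per (term, weight) pair produced by the model.
  static Document toDocument(Map<String, Float> weightedTerms) {
    Document doc = new Document();
    for (Map.Entry<String, Float> e : weightedTerms.entrySet()) {
      doc.add(new FeatureField("splade", e.getKey(), e.getValue()));
    }
    return doc;
  }

  // Search time: one linear FeatureField query per (term, weight) pair,
  // combined as SHOULD clauses so the per-term scores add up.
  static Query toQuery(Map<String, Float> weightedTerms) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (Map.Entry<String, Float> e : weightedTerms.entrySet()) {
      builder.add(new BooleanClause(
          FeatureField.newLinearQuery("splade", e.getKey(), e.getValue()),
          BooleanClause.Occur.SHOULD));
    }
    return builder.build();
  }
}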

On Wed, Nov 15, 2023 at 11:08 AM Michael Wechner 
wrote:

> Hi Adrien
>
> Ah ok, I did not realize this, thanks for pointing this out!
>
> I don't quite understand though, how you would implement the "SPLADE"
> approach using FeatureField from the documentation at
>
>
> https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/document/FeatureField.html
>
> For example when indexing a document or doing a query and I use some
> language model (e.g. BERT) to do the term expansion, how
> do I then make use of FeatureField exactly?
>
> I tried to find some code examples, but couldn't, do you maybe have some
> pointers?
>
> Thanks
>
> Michael
>
>
> Am 15.11.23 um 10:34 schrieb Adrien Grand:
>
> Hi Michael,
>
> What functionality are you missing? Lucene already supports
> indexing/querying weighted terms using FeatureField.
>
> On Wed, Nov 15, 2023 at 10:03 AM Michael Wechner <
> michael.wech...@wyona.com> wrote:
>
>> Hi
>>
>> I have found the following issue re a possible SPLADE implementation
>>
>> https://github.com/apache/lucene/issues/11799
>>
>> Is somebody still working on this?
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>
> --
> Adrien
>
>
>

-- 
Adrien


Re: SPLADE implementation

2023-11-15 Thread Adrien Grand
Hi Michael,

What functionality are you missing? Lucene already supports
indexing/querying weighted terms using FeatureField.

On Wed, Nov 15, 2023 at 10:03 AM Michael Wechner 
wrote:

> Hi
>
> I have found the following issue re a possible SPLADE implementation
>
> https://github.com/apache/lucene/issues/11799
>
> Is somebody still working on this?
>
> Thanks
>
> Michael
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-- 
Adrien


Re: [JENKINS] Lucene-9.x-Linux (64bit/openj9/jdk-17.0.8) - Build # 14028 - Unstable!

2023-11-14 Thread Adrien Grand
What a fantastic test! It found another real bug. I opened
https://github.com/apache/lucene/pull/12807.

On Mon, Nov 13, 2023 at 10:44 PM Policeman Jenkins Server <
jenk...@thetaphi.de> wrote:

> Build: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/14028/
> Java: 64bit/openj9/jdk-17.0.8 -XX:-UseCompressedOops -Xgcpolicy:optthruput
>
> 1 tests failed.
> FAILED:
> org.apache.lucene.codecs.lucene99.TestLucene99HnswQuantizedVectorsFormat.testRandomExceptions
>
> Error Message:
> java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are
> still 2 open files: {_31_Lucene99HnswScalarQuantizedVectorsFormat_0.vec=1,
> _31_Lucene99HnswScalarQuantizedVectorsFormat_0.veq=1}
>
> Stack Trace:
> java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are
> still 2 open files: {_31_Lucene99HnswScalarQuantizedVectorsFormat_0.vec=1,
> _31_Lucene99HnswScalarQuantizedVectorsFormat_0.veq=1}
> at
> __randomizedtesting.SeedInfo.seed([315CAE8CFF8FB78B:5973C0486181EE2B]:0)
> at
> app//org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:876)
> at
> app//org.apache.lucene.tests.index.BaseIndexFileFormatTestCase.testRandomExceptions(BaseIndexFileFormatTestCase.java:728)
> at
> app//org.apache.lucene.tests.index.BaseKnnVectorsFormatTestCase.testRandomExceptions(BaseKnnVectorsFormatTestCase.java:69)
> at
> java.base@17.0.8.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> at
> java.base@17.0.8.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
> at
> java.base@17.0.8.1/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at
> java.base@17.0.8.1/java.lang.reflect.Method.invoke(Method.java:568)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> at
> app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> at
> app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at
> app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at
> app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at
> app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at
> app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> at
> app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
> at
> app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at
> app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> app//org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at
> app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> 

Re: Welcome Patrick Zhai to the Lucene PMC

2023-11-10 Thread Adrien Grand
Welcome Patrick!

On Fri, Nov 10, 2023 at 21:18, Greg Miller  wrote:

> Congrats and welcome Patrick!
>
> On Fri, Nov 10, 2023 at 12:05 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> I'm happy to announce that Patrick Zhai has accepted an invitation to
>> join the Lucene Project Management Committee (PMC)!
>>
>> Congratulations Patrick, thank you for all your hard work improving
>> Lucene's community and source code, and welcome aboard!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>


Re: Apach Solr Exercise 1 Index the Techproducts Data step not working

2023-11-04 Thread Adrien Grand
Hi Qizhi,

I am moving your question to the Solr users list.

On Sat, Nov 4, 2023 at 01:58, Qizhi Zheng 
wrote:

> Hello,
>
>
>
> I am trying to run the Solr Tutorial Exercise 1 Index Techproducts Data in
> Windows 10. I typed the exact same command following this link:
>
>
> https://solr.apache.org/guide/solr/latest/getting-started/tutorial-techproducts.html
>
>
>
> java -jar -Dc=techproducts -Dauto example\exampledocs\post.jar
> example\exampledocs\*
>
>
>
> But I always got the following error messages:
>
>
>
>   Error: Could not find or load main class
> org.apache.solr.util.SimplePostTool.
>
>
>
> My Java CLASSPATH has been configured with solr-core-9.8.0.jar as
> following:
>
>
>
>   CLASSPATH =   C:\Solr\solr-9.4.0\dist\solr-core-9.4.0.jar
>
>
>
> I have googled many stack overflow answers but no luck.
>
>
>
> Does anyone know what is wrong with this error? Are there any other jar
> libraries that I need to configure in my CLASSPATH in order to run this
> command?
>
>
>
> The first step, “Launch Solr in SolrCloud Mode”, in the above link
> was working very well. Why doesn’t the same Windows command line work on my
> computer?
>
>
>
> Where is the library for org.apache.solr.util.SimplePostTool? Someone said
> that SimplePostTool was only used in Solr old version. Is that true?
>
>
>
> It is very hard to follow the Solr command line to design the schema and
> indexing for Solr. Is there any GUI interface on the client side for
> implementing the Solr server design?
>
>
>
> Thank you very much
>
>
>
> qizhi
>
>
>
>
>


Re: [JENKINS] Lucene-main-Linux (64bit/hotspot/jdk-20) - Build # 45223 - Unstable!

2023-11-01 Thread Adrien Grand
I pushed a fix:
https://github.com/apache/lucene/commit/66324f763fc7fb0d8e7cd6f334e5438f0171c84e
.

On Thu, Oct 26, 2023 at 4:35 PM Policeman Jenkins Server <
jenk...@thetaphi.de> wrote:

> Build: https://jenkins.thetaphi.de/job/Lucene-main-Linux/45223/
> Java: 64bit/hotspot/jdk-20 -XX:-UseCompressedOops -XX:+UseSerialGC
>
> 1 tests failed.
> FAILED:  org.apache.lucene.index.TestDeletionPolicy.testOpenPriorSnapshot
>
> Error Message:
> java.lang.AssertionError
>
> Stack Trace:
> java.lang.AssertionError
> at
> __randomizedtesting.SeedInfo.seed([B82869D77CFECED5:D1F1A059633E0B0D]:0)
> at org.junit.Assert.fail(Assert.java:87)
> at org.junit.Assert.assertTrue(Assert.java:42)
> at org.junit.Assert.assertTrue(Assert.java:53)
> at
> org.apache.lucene.index.TestDeletionPolicy.testOpenPriorSnapshot(TestDeletionPolicy.java:524)
> at
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
> at java.base/java.lang.reflect.Method.invoke(Method.java:578)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> at
> org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
> at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at
> org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
> at java.base/java.lang.Thread.run(Thread.java:1623)
>
> -
> To unsubscribe, e-mail: 

Re: [JENKINS] Lucene-9.x-Linux (64bit/openj9/jdk-17.0.5) - Build # 13732 - Unstable!

2023-10-31 Thread Adrien Grand
I pushed a fix for these failures:
https://github.com/apache/lucene/commit/85f5d3bb0bf84fed46ca4c093c1aa084e4a43873

On Fri, Oct 27, 2023 at 9:55 AM Policeman Jenkins Server <
jenk...@thetaphi.de> wrote:

> Build: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/13732/
> Java: 64bit/openj9/jdk-17.0.5 -XX:-UseCompressedOops -Xgcpolicy:gencon
>
> 1 tests failed.
> FAILED:  org.apache.lucene.index.TestIndexWriter.testHasUncommittedChanges
>
> Error Message:
> java.lang.AssertionError
>
> Stack Trace:
> java.lang.AssertionError
> at
> __randomizedtesting.SeedInfo.seed([63AADDD55C51D4C2:45E1C0A475266832]:0)
> at app//org.junit.Assert.fail(Assert.java:87)
> at app//org.junit.Assert.assertTrue(Assert.java:42)
> at app//org.junit.Assert.assertFalse(Assert.java:65)
> at app//org.junit.Assert.assertFalse(Assert.java:75)
> at
> app//org.apache.lucene.index.TestIndexWriter.testHasUncommittedChanges(TestIndexWriter.java:2400)
> at 
> java.base@17.0.5/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> at java.base@17.0.5
> /jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
> at java.base@17.0.5
> /jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base@17.0.5
> /java.lang.reflect.Method.invoke(Method.java:568)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> at
> app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> at
> app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at
> app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at
> app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at
> app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at
> app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> at
> app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
> at
> app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at
> app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at
> app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> app//org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at
> app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at
> app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at
> app//org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
> at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at
> 

Re: [JENKINS] Lucene » Lucene-NightlyTests-9.x - Build # 720 - Unstable!

2023-10-26 Thread Adrien Grand
For reference, Simon pushed a fix for these TestIndexWriter.classMethod
failures:
https://github.com/apache/lucene/commit/01acb1c37b2826339d95681251dacd7e2a929be9

On Tue, Oct 24, 2023 at 11:12 AM Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> Build:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-9.x/720/
>
> 1 tests failed.
> FAILED:  org.apache.lucene.index.TestIndexWriter.classMethod
>
> Error Message:
> java.lang.RuntimeException: file handle leaks:
> [FileChannel(/home/jenkins/jenkins-slave/workspace/Lucene/Lucene-NightlyTests-9.x/checkout/lucene/core/build/tmp/tests-tmp/lucene.index.TestIndexWriter_CB61703B06C96C2B-001/index-MMapDirectory-005/write.lock)]
>
> Stack Trace:
> java.lang.RuntimeException: file handle leaks:
> [FileChannel(/home/jenkins/jenkins-slave/workspace/Lucene/Lucene-NightlyTests-9.x/checkout/lucene/core/build/tmp/tests-tmp/lucene.index.TestIndexWriter_CB61703B06C96C2B-001/index-MMapDirectory-005/write.lock)]
> at __randomizedtesting.SeedInfo.seed([CB61703B06C96C2B]:0)
> at org.apache.lucene.tests.mockfile.LeakFS.onClose(LeakFS.java:63)
> at
> org.apache.lucene.tests.mockfile.FilterFileSystem.close(FilterFileSystem.java:69)
> at
> org.apache.lucene.tests.mockfile.FilterFileSystem.close(FilterFileSystem.java:70)
> at
> org.apache.lucene.tests.mockfile.FilterFileSystem.close(FilterFileSystem.java:70)
> at
> org.apache.lucene.tests.util.TestRuleTemporaryFilesCleanup.afterAlways(TestRuleTemporaryFilesCleanup.java:223)
> at
> com.carrotsearch.randomizedtesting.rules.TestRuleAdapter$1.afterAlways(TestRuleAdapter.java:31)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:43)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at
> org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at
> org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at
> org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.lang.Exception
> at org.apache.lucene.tests.mockfile.LeakFS.onOpen(LeakFS.java:46)
> at
> org.apache.lucene.tests.mockfile.HandleTrackingFS.callOpenHook(HandleTrackingFS.java:82)
> at
> org.apache.lucene.tests.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:202)
> at
> org.apache.lucene.tests.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:171)
> at
> org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newFileChannel(FilterFileSystemProvider.java:206)
> at
> java.base/java.nio.channels.FileChannel.open(FileChannel.java:292)
> at
> java.base/java.nio.channels.FileChannel.open(FileChannel.java:345)
> at
> org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:112)
> at
> org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:43)
> at
> org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:44)
> at
> org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:106)
> at
> org.apache.lucene.tests.store.MockDirectoryWrapper.obtainLock(MockDirectoryWrapper.java:1095)
> at org.apache.lucene.index.IndexWriter.(IndexWriter.java:950)
> at
> org.apache.lucene.index.TestIndexWriter.testCarryOverHasBlocks(TestIndexWriter.java:1772)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)

Re: [JENKINS] Lucene-9.x-Linux (64bit/openj9/jdk-17.0.5) - Build # 13705 - Unstable!

2023-10-25 Thread Adrien Grand
This is mine; I'm looking into it.

On Wed, Oct 25, 2023 at 7:54 PM Policeman Jenkins Server <
jenk...@thetaphi.de> wrote:

> Build: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/13705/
> Java: 64bit/openj9/jdk-17.0.5 -XX:-UseCompressedOops -Xgcpolicy:metronome
>
> 1 tests failed.
> FAILED:
> org.apache.lucene.index.TestStressIndexing.testStressIndexAndSearching
>
> Error Message:
> java.io.IOException: cannot createOutput after crash
>
> Stack Trace:
> java.io.IOException: cannot createOutput after crash
> at
> __randomizedtesting.SeedInfo.seed([FA4E433E09843EB6:1D271B47446E5F4A]:0)
> at
> app//org.apache.lucene.tests.store.MockDirectoryWrapper.createOutput(MockDirectoryWrapper.java:700)
> at
> app//org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:43)
> at
> app//org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:41)
> at
> app//org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.(Lucene90CompressingStoredFieldsWriter.java:130)
> at
> app//org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat.fieldsWriter(Lucene90CompressingStoredFieldsFormat.java:140)
> at
> app//org.apache.lucene.codecs.lucene90.Lucene90StoredFieldsFormat.fieldsWriter(Lucene90StoredFieldsFormat.java:154)
> at
> app//org.apache.lucene.tests.codecs.asserting.AssertingStoredFieldsFormat.fieldsWriter(AssertingStoredFieldsFormat.java:49)
> at
> app//org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:233)
> at
> app//org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:273)
> at
> app//org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:110)
> at
> app//org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5173)
> at
> app//org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4706)
> at
> app//org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6461)
> at
> app//org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:38)
> at
> app//org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3720)
> at
> app//org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4070)
> at
> app//org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1327)
> at
> app//org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1365)
> at
> app//org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:979)
> at
> app//org.apache.lucene.index.TestStressIndexing.testStressIndexAndSearching(TestStressIndexing.java:171)
> at 
> java.base@17.0.5/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> at java.base@17.0.5
> /jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
> at java.base@17.0.5
> /jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base@17.0.5
> /java.lang.reflect.Method.invoke(Method.java:568)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> at
> app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> at
> app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at
> app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at
> app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at
> app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at
> app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at
> app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> at
> app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> at
> app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> at
> 

Welcome Guo Feng to the Lucene PMC

2023-10-24 Thread Adrien Grand
I'm pleased to announce that Guo Feng has accepted an invitation to join
the Lucene PMC!

Congratulations Feng, and welcome aboard!

-- 
Adrien


Welcome Luca Cavanna to the Lucene PMC

2023-10-19 Thread Adrien Grand
I'm pleased to announce that Luca Cavanna has accepted an invitation to
join the Lucene PMC!

Congratulations Luca, and welcome aboard!

-- 
Adrien


Re: PackedInts functionalities

2023-10-17 Thread Adrien Grand
+1 to what Mikhail wrote, this is e.g. how postings work: instead of
interleaving doc IDs and frequencies, they always store a block of 128 doc
IDs followed by a block of 128 frequencies.

For reference, bit packing feels space-inefficient for this kind of data. I
would expect docFreqs to have a zipfian distribution, so you would end up
using a number of bits per docFreq that is driven by the highest docFreq in
the block while most values might be very low. Do you need random-access
into these doc freqs and postings start offsets or will you decode data for
an entire block every time anyway?
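
To make the layout concrete, here is a rough, self-contained sketch in plain
Java (deliberately not using the PackedInts classes; the 4-bit/6-bit widths and
the values are just the numbers from Tony's example): all docFreqs of a block
are packed first at one width, then all postingsStartOffsets at another, and
random access becomes a simple bit-offset computation.

class BlockPackingSketch {

  // Append each value to the bit stream using bitsPerValue bits per value.
  static void pack(long[] values, int bitsPerValue, java.util.BitSet out, int[] pos) {
    for (long v : values) {
      for (int b = 0; b < bitsPerValue; b++) {
        if (((v >>> b) & 1L) != 0) {
          out.set(pos[0] + b);
        }
      }
      pos[0] += bitsPerValue;
    }
  }

  // Read back one bitsPerValue-wide value starting at bit offset `start`.
  static long unpack(java.util.BitSet in, int start, int bitsPerValue) {
    long v = 0;
    for (int b = 0; b < bitsPerValue; b++) {
      if (in.get(start + b)) {
        v |= 1L << b;
      }
    }
    return v;
  }

  public static void main(String[] args) {
    long[] docFreqs = {3, 1, 7, 2};         // 4 bits each in this toy example
    long[] startOffsets = {10, 25, 40, 63}; // 6 bits each in this toy example
    java.util.BitSet bits = new java.util.BitSet();
    int[] pos = {0};
    pack(docFreqs, 4, bits, pos);     // block of docFreqs first...
    pack(startOffsets, 6, bits, pos); // ...then a block of start offsets
    int numTerms = docFreqs.length;
    int term = 2;
    // Random access: term i's docFreq starts at bit i*4, its offset at bit 4*numTerms + i*6.
    System.out.println(unpack(bits, term * 4, 4));                // 7
    System.out.println(unpack(bits, 4 * numTerms + term * 6, 6)); // 40
  }
}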


On Tue, Oct 17, 2023 at 8:39 AM Mikhail Khludnev  wrote:

> Hello Tony
> Is it possible to write a block of docfreqs and then a block of
> postingoffsets?
> Or why not write them as 10-bit integers and then split to quad and sextet
> in the posting format code?
>
> On Mon, Oct 16, 2023 at 11:50 PM Dongyu Xu  wrote:
>
>> Hi devs,
>>
>> As I was working on https://github.com/apache/lucene/issues/12513 I
>> needed to compress positive integers which are used to locate postings etc.
>>
>> To put it concretely, I will need to pack a few values per term
>> contiguously and those values can have different bit-width. For example,
>> consider that we need to encode docFreq and postingsStartOffset per term
>> and docFreq takes 4 bits and the postingsStartOffset takes 6 bits. We
>> expect to write the following for two terms.
>>
>> ```
>> Term1 |  Term2
>>
>> docFreq(4bit) | postingsStartOffset(6bit) | docFreq(4bit) |
>> postingsStartOffset(6bit)
>>
>> ```
>>
>> On the read path, I expect to locate the offset for a term first, followed
>> by reading two values that have different bit-widths.
>>
>> In the spirit of not re-inventing necessarily, I tried to explore the
>> existing PackedInts util classes and I believe there is no support for this
>> at the moment. The biggest gap I found is that the existing classes expect
>> to write/read values of same bit-width.
>>
>> I'm writing to get feedback from yall to see if I missed anything.
>>
>> Cheers,
>> Tony X
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


-- 
Adrien


Re: Weird HNSW merge performance result

2023-10-10 Thread Adrien Grand
Regarding building time, did you configure a SerialMergeScheduler?
Otherwise merges run in separate threads, which would explain the speedup
as adding vectors to the graph gets more and more expensive as the size of
the graph increases.
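
If not, one way to level the playing field is to pin merges to the indexing
thread, so that merge time is included in the indexing time you measure. A
minimal sketch of the writer setup (a fragment; the directory, analyzer and
buffer size are placeholders rather than what luceneutil configures):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

Directory dir = new ByteBuffersDirectory();
IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
cfg.setMergeScheduler(new SerialMergeScheduler()); // merges run inline, on the indexing thread
cfg.setRAMBufferSizeMB(50);                        // small buffer, so several segments get flushed and merged
IndexWriter writer = new IndexWriter(dir, cfg);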

Le mer. 11 oct. 2023, 05:07, Patrick Zhai  a écrit :

> Hi folks,
> I was running the HNSW benchmark today and found some weird results. Want
> to share it here and see whether people have any ideas.
>
> The set up is:
> the 384 dimension vector that's available in luceneutil, 100k documents.
> And lucene main branch.
> max_conn=64, fanout=0, beam_width=250
>
> I first tried with the default setting where we use a 1994MB writer
> buffer, so with 100k documents, there will be no merge happening and I will
> have 1 segment at the end.
> This gives me 0.755 recall and 101113ms index building time.
>
> Then I tried with 50MB writer buffer and then forcemerge at the last, and
> with 100k documents, I'll get several segments (the final index is around
> 300MB so I guess 5 or 6) before merge, and then merge them into 1 at last.
> This gives me 0.692 recall but it took only 81562ms (including 34394ms
> doing the merge) to index.
> I have also tried disabling the initialize from graph feature (such that
> when we merge we always rebuild the whole graph), or change the random
> seed, but still get the similar result.
>
> I'm wondering:
> 1. Why recall drops that much in the later setup?
> 2. Why index time is way better? I think we still need to rebuild the
> whole graph, or maybe it's just because we're using more off-heap memory
> (and less heap) when merge (do we?)?
>
> Best
> Patrick
>


Re: LeafCollector#finish idempotency?

2023-10-09 Thread Adrien Grand
Hi Greg,

I agree that LeafCollector implementations should be able to assume that
finish() only gets called once. The test framework already makes this
assumption:
https://github.com/apache/lucene/blob/dfff1e635805ffc61dd6029a8060e2635bfcbdb9/lucene/test-framework/src/java/org/apache/lucene/tests/search/AssertingLeafCollector.java#L95-L100
.
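
Purely to illustrate that contract (a sketch, not code from the test
framework), a leaf collector that does its per-segment work in finish() can
simply assume, and assert, a single call:

import java.io.IOException;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;

class CountingLeafCollector implements LeafCollector {
  private int count;
  private boolean finished;

  @Override
  public void setScorer(Scorable scorer) throws IOException {}

  @Override
  public void collect(int doc) throws IOException {
    count++;
  }

  @Override
  public void finish() throws IOException {
    assert finished == false : "finish() must only be called once per leaf";
    finished = true;
    // publish `count` here; a second call would double-publish the segment's count
  }
}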

On Mon, Oct 9, 2023 at 5:38 PM Greg Miller  wrote:

> Hey folks-
>
> I'm curious if anyone has thoughts around idempotency concerns related to
> the LeafCollector#finish API added in GH#12380
> . My expectation would be
> that LeafCollector implementations should be able to assume #finish will
> only get called once. In fact, it looks like FacetsCollector is already
> making that assumption.
>
> Is this in line with other folks' expectations? If so, I'm going to, 1)
> address a small bug related to drill-sideways that results in #finish being
> called multiple times on one of the collectors, and 2) propose some
> additional javadoc on LeafCollector#finish clarifying this.
>
> Make sense?
>
> Cheers,
> -Greg
>


-- 
Adrien


Re: ConjunctionDISI nextDoc can return immediately when NO_MORE_DOCS

2023-10-01 Thread Adrien Grand
This is a good approach indeed, Lucene does this too.

Le dim. 1 oct. 2023, 19:33, Walter Underwood  a
écrit :

> At Infoseek, the engine checked the terms in frequency order, with the
> most rare term first. If the conjunction reached zero matches at any point,
> it stopped checking.
>
> This might be a related but more general approach.
>
> That was almost 30 years ago, so any patents are long-expired.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Oct 1, 2023, at 10:12 AM, Adrien Grand  wrote:
>
> I agree that it would save work in that case, but this query should be
> very fast anyway.
>
> On the other hand, if term1, term2 and term3 have 10M matches each, the
> conjunction will need to check if the current candidate match is
> NO_MORE_DOCS millions of times even though this would only happen once.
>
> In general it's better to have less overhead for expensive queries and
> more overhead for cheap queries than the other way around.
>
> Le dim. 1 oct. 2023, 17:35, YouPeng Yang  a
> écrit :
>
>> Hi Adrien
>> Suppose a conjunction query like (term1 AND term2 AND term3). If term1 does
>> not exist, then the loop execution may cause unnecessary overhead. (Sorry, I
>> have not yet found out whether there is any filtering work before the
>> doNext() call.)
>>
>> Best Regard
>>
>> Adrien Grand  于2023年10月1日周日 22:30写道:
>>
>>> Hello,
>>>
>>> This change would be correct, but it would only save work when the
>>> conjunction is exhausted, and add overhead otherwise?
>>>
>>> Le sam. 30 sept. 2023, 16:20, YouPeng Yang 
>>> a écrit :
>>>
>>>> Hi
>>>>   I am reading the code of the ConjunctionDISI class, specifically the
>>>> nextDoc method. Suppose that the sub-DISI is empty in lead1/lead2: shouldn't
>>>> it be able to return immediately when the input doc == NO_MORE_DOCS?
>>>>
>>>>
>>>> private int doNext(int doc) throws IOException {
>>>> advanceHead:
>>>> for (; ; ) {
>>>> assert doc == lead1.docID();
>>>> // if doc == NO_MORE_DOCS, return immediately
>>>> if(doc==NO_MORE_DOCS){
>>>> return NO_MORE_DOCS;
>>>> }
>>>> // find agreement between the two iterators with the lower costs
>>>> // we special case them because they do not need the
>>>> // 'other.docID() < doc' check that the 'others' iterators need
>>>> final int next2 = lead2.advance(doc);
>>>> if (next2 != doc) {
>>>> doc = lead1.advance(next2);
>>>> if(doc==NO_MORE_DOCS){
>>>> return NO_MORE_DOCS;
>>>> }
>>>> if (next2 != doc) {
>>>> continue;
>>>> }
>>>> }
>>>> // ... rest of the method omitted ...
>>>> }
>>>>
>>>
>


Re: ConjunctionDISI nextDoc can return immediately when NO_MORE_DOCS

2023-10-01 Thread Adrien Grand
I agree that it would save work in that case, but this query should be very
fast anyway.

On the other hand, if term1, term2 and term3 have 10M matches each, the
conjunction will need to check if the current candidate match is
NO_MORE_DOCS millions of times even though this would only happen once.

In general it's better to have less overhead for expensive queries and more
overhead for cheap queries than the other way around.

Le dim. 1 oct. 2023, 17:35, YouPeng Yang  a
écrit :

> Hi Adrien
> Suppose a conjunction query like (term1 AND term2 AND term3). If term1 does
> not exist, then the loop execution may cause unnecessary overhead. (Sorry, I
> have not yet found out whether there is any filtering work before the
> doNext() call.)
>
> Best Regard
>
> Adrien Grand  于2023年10月1日周日 22:30写道:
>
>> Hello,
>>
>> This change would be correct, but it would only save work when the
>> conjunction is exhausted, and add overhead otherwise?
>>
>> Le sam. 30 sept. 2023, 16:20, YouPeng Yang  a
>> écrit :
>>
>>> Hi
>>>   I am reading the code of the ConjunctionDISI class, specifically the
>>> nextDoc method. Suppose that the sub-DISI is empty in lead1/lead2: shouldn't
>>> it be able to return immediately when the input doc == NO_MORE_DOCS?
>>>
>>>
>>> private int doNext(int doc) throws IOException {
>>> advanceHead:
>>> for (; ; ) {
>>> assert doc == lead1.docID();
>>> // if doc == NO_MORE_DOCS, return immediately
>>> if(doc==NO_MORE_DOCS){
>>> return NO_MORE_DOCS;
>>> }
>>> // find agreement between the two iterators with the lower costs
>>> // we special case them because they do not need the
>>> // 'other.docID() < doc' check that the 'others' iterators need
>>> final int next2 = lead2.advance(doc);
>>> if (next2 != doc) {
>>> doc = lead1.advance(next2);
>>> if(doc==NO_MORE_DOCS){
>>> return NO_MORE_DOCS;
>>> }
>>> if (next2 != doc) {
>>> continue;
>>> }
>>> }
>>> // ... rest of the method omitted ...
>>> }
>>>
>>


Re: ConjunctionDISI nextDoc can return immediately when NO_MORE_DOCS

2023-10-01 Thread Adrien Grand
Hello,

This change would be correct, but it would only save work when the
conjunction is exhausted, and add overhead otherwise?

Le sam. 30 sept. 2023, 16:20, YouPeng Yang  a
écrit :

> Hi
>   I am reading the code of the ConjunctionDISI class, specifically the
> nextDoc method. Suppose that the sub-DISI is empty in lead1/lead2: shouldn't
> it be able to return immediately when the input doc == NO_MORE_DOCS?
>
>
> private int doNext(int doc) throws IOException {
> advanceHead:
> for (; ; ) {
> assert doc == lead1.docID();
> // if doc == NO_MORE_DOCS, return immediately
> if(doc==NO_MORE_DOCS){
> return NO_MORE_DOCS;
> }
> // find agreement between the two iterators with the lower costs
> // we special case them because they do not need the
> // 'other.docID() < doc' check that the 'others' iterators need
> final int next2 = lead2.advance(doc);
> if (next2 != doc) {
> doc = lead1.advance(next2);
> if(doc==NO_MORE_DOCS){
> return NO_MORE_DOCS;
> }
> if (next2 != doc) {
> continue;
> }
> }
> // ... rest of the method omitted ...
> }
>


Re: Solr upgrade to Lucene 9.8.0 question

2023-09-28 Thread Adrien Grand
Hi Alex,

I believe that your analysis is correct.

> is it expected that the 'finish' method is idempotent?

I don't expect `finish()` to be idempotent. It should not get called
multiple times per segment either, only once and when collection runs
successfully. Do you have a Lucene test case that reproduces this double
calling of finish()?

I'm sorry this change broke Solr. I remember that Solr had post-collection
hooks, which felt like another case for adding this new API, but I
overlooked that it could break Solr by introducing a clash given that Solr
uses SimpleCollector.

Maybe we should think of deprecating SimpleCollector in Lucene and
recommend going with Collector directly. SimpleCollector is mostly a
backward compatibility layer with the old (9 years old) collector API; we
already moved some of Lucene's main collectors to the new API, e.g.
TopScoreDocCollector and TopFieldCollector. Let's move other collectors
too, e.g. FacetsCollector and friends?
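
For anyone hitting the same clash in the meantime, here is a bare-bones sketch
of what implementing Collector directly looks like (names and the empty method
bodies are placeholders): a fresh LeafCollector is handed out per segment, so a
per-leaf finish() can never be confused with a collector-level post-collection
hook.

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;
import org.apache.lucene.search.ScoreMode;

class PerLeafCollector implements Collector {

  @Override
  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    // A new LeafCollector per segment, unlike SimpleCollector which returns `this`.
    return new LeafCollector() {
      @Override
      public void setScorer(Scorable scorer) throws IOException {}

      @Override
      public void collect(int doc) throws IOException {
        // per-segment collection work goes here
      }

      @Override
      public void finish() throws IOException {
        // per-segment post-collection work; called once for this leaf only
      }
    };
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES;
  }
}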



On Thu, Sep 28, 2023 at 12:31 AM Alex Deparvu  wrote:

> Hi,
>
> I am working on getting Solr upgraded to Lucene 9.8 [0] and I wanted to
> raise visibility on an issue I ran into.
>
> I believe PR#12380 [1] introduced a change that calls `finish()` on the
> LeafCollector [2]. The trouble is on Solr side we have a few collectors
> that extend SimpleCollector. (to be more precise there is a
> DelegatingCollector in between but that does not change things).
> SimpleCollector returns `this` on the `getLeafCollector` call, so now
> there are 2 calls to the `finish()` method on the same collector instance
> (one as a leaf, one at the end).
>
> One example I am working with is CollapsingQParserPlugin$OrdScoreCollector
> [3] where I am seeing a few tests fail because calling `finish` twice will
> mess up the results. I don't know yet if there are others.
>
> My first question is to validate this analysis with someone that knows
> this code (and perhaps the Solr code too), and ideally also take a quick
> look at my fix [4].
>
> Second is related to PR#12380. is it expected that the 'finish' method is
> idempotent? per my tests this seems to be called twice now in some cases
> and it will be the case for any implementation extending SimpleCollector.
> also given that SimpleCollector's getLeafCollector method is final, there
> is almost no room for passing some state wrt. this being a leaf vs not.
>
>
> thanks,
> alex
>
>
> [0] https://github.com/apache/solr/pull/1958
> [1] https://github.com/apache/lucene/pull/12380
> [2]
> https://github.com/apache/lucene/blob/releases/lucene/9.8.0/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L779
> [3]
> https://github.com/apache/solr/blob/f0fcd300c896b858ae83235ecdb0a109eaea5cea/solr/core/src/java/org/apache/solr/search/CollapsingQParserPlugin.java#L594
> [4]
> https://github.com/apache/solr/commit/fc8a2ffe8951f31aa3b65fac2adc9eaa3fee6258
>
>

-- 
Adrien


Re: Can the BooleanQuery execution be optimized with same term queries

2023-09-23 Thread Adrien Grand
Thanks for letting me know, I'm glad you like them!


Le ven. 22 sept. 2023, 16:36, YouPeng Yang  a
écrit :

> Hi Adrien
>Glad to have your opinion.I am reading your excellent articles  on
> elastic blog.
>
> Best regards
>
>
> Adrien Grand  于2023年9月19日周二 21:32写道:
>
>> Hi Yang,
>>
>> It would be legal for Lucene to perform such optimizations indeed.
>>
>> On Tue, Sep 19, 2023 at 3:27 PM YouPeng Yang 
>> wrote:
>> >
>> > Hi All
>> >
>> >  Sorry to bother you.The happiest thing is  studying the Lucene source
>> codes,thank you for all the  great works .
>> >
>> >
>> >   About BooleanQuery: I ran into a question about its execution. Although
>> BooleanQuery#rewrite does some work to remove duplicate FILTER and SHOULD
>> clauses, the same term query can still be executed several times.
>> >
>> >   I copied the test code in the TestBooleanQuery to confirm my
>> assumption.
>> >
>> >   Unit Test Code as follows:
>> >
>> >
>> >
>> > BooleanQuery.Builder qBuilder = new BooleanQuery.Builder();
>> >
>> > qBuilder = new BooleanQuery.Builder();
>> >
>> > qBuilder.add(new TermQuery(new Term("field", "b")), Occur.FILTER);
>> >
>> > qBuilder.add(new TermQuery(new Term("field", "a")), Occur.SHOULD);
>> >
>> > qBuilder.add(new TermQuery(new Term("field", "d")), Occur.SHOULD);
>> >
>> > BooleanQuery.Builder nestQuery  = new BooleanQuery.Builder();
>> >
>> > nestQuery.add(new TermQuery(new Term("field", "b")), Occur.FILTER);
>> >
>> > nestQuery.add(new TermQuery(new Term("field", "a")), Occur.SHOULD);
>> >
>> > nestQuery.add(new TermQuery(new Term("field", "d")), Occur.SHOULD);
>> >
>> > qBuilder.add(nestQuery.build(),Occur.SHOULD);
>> >
>> > qBuilder.setMinimumNumberShouldMatch(1);
>> >
>> > BooleanQuery q = qBuilder.build();
>> >
>> > q = qBuilder.build();
>> >
>> > assertSameScoresWithoutFilters(searcher, q);
>> >
>> >
>> > In this test, the top boolean query(qBuilder) contains 4 clauses(3
>> simple term-query ,1 nested boolean query that contains the same 3
>> term-query).
>> >
>> > The underlying execution is that all the 6 term query were executed(see
>> TermQuery.Termweight#getTermsEnum()).
>> >
>> > Apparently and theoretically, these executions could be merged to reduce
>> the time, right?
>> >
>> >
>> > So,is it possible or necessary  that Lucene merge the execution to
>> optimize the query performance, even though I know the optimization may be
>> difficult.
>> >
>> >
>> >
>>
>>
>> --
>> Adrien
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>


Re: [VOTE] Release Lucene 9.8.0 RC1

2023-09-22 Thread Adrien Grand
+1 SUCCESS! [0:54:58.932481]

On Fri, Sep 22, 2023 at 4:18 PM Uwe Schindler  wrote:
>
> Hi,
>
> I verified the release with the usual tools and my workflow:
>
> Policeman Jenkins ran smoketester for me with Java 11 and Java 17:
> https://jenkins.thetaphi.de/job/Lucene-Release-Tester/28/console
>
> SUCCESS! [1:10:15.704228]
>
> In addition I checked the changes entries and ran Luke with Java 21 GA
> (released two days ago). All fine!
>
> +1 to release!
>
> Am 22.09.2023 um 07:48 schrieb Patrick Zhai:
> > Please vote for release candidate 1 for Lucene 9.8.0
> >
> > The artifacts can be downloaded from:
> > https://dist.apache.org/repos/dist/dev/lucene/lucene-9.8.0-RC1-rev-d914b3722bd5b8ef31ccf7e8ddc638a87fd648db
> >
> > You can run the smoke tester directly with this command:
> >
> > python3 -u dev-tools/scripts/smokeTestRelease.py \
> > https://dist.apache.org/repos/dist/dev/lucene/lucene-9.8.0-RC1-rev-d914b3722bd5b8ef31ccf7e8ddc638a87fd648db
> >
> > The vote will be open for at least 72 hours, as there's a weekend, the
> > vote will last until 2023-09-27 06:00 UTC.
> >
> > [ ] +1  approve
> > [ ] +0  no opinion
> > [ ] -1  disapprove (and reason why)
> >
> > Here is my +1 (non-binding)
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>


-- 
Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Sitemap to get latest reference manual to rank in Google/Bing?

2023-09-21 Thread Adrien Grand
Hi Walter,

You emailed the Lucene dev list (dev@lucene.a.o) but I think you meant
to ask this question to the Solr list (dev@solr.a.o).

On Wed, Sep 20, 2023 at 8:59 PM Walter Underwood  wrote:
>
> When I get web search results that include the Solr Reference Guide, I often 
> get older versions (6.6, 7.4) in the results. I would prefer to always get 
> the latest reference (https://solr.apache.org/guide/solr/latest/index.html).
>
> I think we can list the URLs for that in a sitemap.xml file with a higher 
> priority to suggest to the crawlers that these are the preferred pages.
>
> I don’t see a sitemap.xml or sitemap.xml.gz at https://solr.apache.org.
>
> Should we prefer the latest manual? How do we build/deploy a sitemap? See: 
> https://www.sitemaps.org/
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>


-- 
Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene 9.8 Release

2023-09-21 Thread Adrien Grand
Thanks Patrick. I expanded a bit on the optimization section to
highlight the sort of speedup that nightly benchmarks reported, and
moved this section first as I suspect that users would be especially
interested in these speedups.

Out of curiosity, do you know when you plan on creating a release candidate?

On Thu, Sep 21, 2023 at 7:40 AM Patrick Zhai  wrote:
>
> Hi all,
> Here's the draft release note: 
> https://cwiki.apache.org/confluence/display/LUCENE/Draft+Release+Notes+9.8
>
> Please feel free to edit if you feel like to add anything
>
> Best
> Patrick
>
> On Tue, Sep 19, 2023 at 12:05 AM Adrien Grand  wrote:
>>
>> Thanks Patrick, this PR is now merged.
>>
>> On Tue, Sep 19, 2023 at 6:22 AM Patrick Zhai  wrote:
>> >
>> > Update:
>> > Will wait https://github.com/apache/lucene/pull/12568 to be merged to cut 
>> > the branch
>> >
>> >
>> > On Mon, Sep 18, 2023 at 11:00 AM Michael Sokolov  
>> > wrote:
>> >>
>> >> +1 for a release soon, and thanks for volunteering, Patrick!
>> >>
>> >> On Tue, Sep 12, 2023 at 2:08 AM Patrick Zhai  wrote:
>> >> >
>> >> > Hi all,
>> >> > It's been a while since the last release and we have quite a few good 
>> >> > changes including new APIs, improvements and bug fixes. Should we 
>> >> > release the 9.8?
>> >> >
>> >> > If there's no objections I volunteer to be the release manager and will 
>> >> > cut the feature branch a week from now, which is Sep. 18th PST.
>> >> >
>> >> > Best
>> >> > Patrick
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>>
>>
>> --
>> Adrien
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>


-- 
Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Can the BooleanQuery execution be optimized with same term queries

2023-09-19 Thread Adrien Grand
Hi Yang,

It would be legal for Lucene to perform such optimizations indeed.

On Tue, Sep 19, 2023 at 3:27 PM YouPeng Yang  wrote:
>
> Hi All
>
>  Sorry to bother you.The happiest thing is  studying the Lucene source 
> codes,thank you for all the  great works .
>
>
>   About BooleanQuery: I ran into a question about its execution. Although
> BooleanQuery#rewrite does some work to remove duplicate FILTER and SHOULD
> clauses, the same term query can still be executed several times.
>
>   I copied the test code in the TestBooleanQuery to confirm my assumption.
>
>   Unit Test Code as follows:
>
>
>
> BooleanQuery.Builder qBuilder = new BooleanQuery.Builder();
>
> qBuilder = new BooleanQuery.Builder();
>
> qBuilder.add(new TermQuery(new Term("field", "b")), Occur.FILTER);
>
> qBuilder.add(new TermQuery(new Term("field", "a")), Occur.SHOULD);
>
> qBuilder.add(new TermQuery(new Term("field", "d")), Occur.SHOULD);
>
> BooleanQuery.Builder nestQuery  = new BooleanQuery.Builder();
>
> nestQuery.add(new TermQuery(new Term("field", "b")), Occur.FILTER);
>
> nestQuery.add(new TermQuery(new Term("field", "a")), Occur.SHOULD);
>
> nestQuery.add(new TermQuery(new Term("field", "d")), Occur.SHOULD);
>
> qBuilder.add(nestQuery.build(),Occur.SHOULD);
>
> qBuilder.setMinimumNumberShouldMatch(1);
>
> BooleanQuery q = qBuilder.build();
>
> q = qBuilder.build();
>
> assertSameScoresWithoutFilters(searcher, q);
>
>
> In this test, the top boolean query(qBuilder) contains 4 clauses(3 simple 
> term-query ,1 nested boolean query that contains the same 3 term-query).
>
> The underlying execution is that all the 6 term query were executed(see 
> TermQuery.Termweight#getTermsEnum()).
>
> Apparently and theoretically, these executions could be merged to reduce the
> time, right?
>
>
> So,is it possible or necessary  that Lucene merge the execution to optimize 
> the query performance, even though I know the optimization may be difficult.
>
>
>


-- 
Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [lucene] branch branch_9x updated: Fix issues with BP tests and the security manager. (#12568)

2023-09-19 Thread Adrien Grand
Tricky problem that only gets detected with Java 11! It should be
fixed now on main and branch_9x.

Patrick, I think you should feel free to cut the branch, if there's
any other problem I will still be able to backport fixes to the newly
created branch.

On Tue, Sep 19, 2023 at 9:52 AM Uwe Schindler  wrote:
>
> I know where it comes from. The javadoc comment has a "<" sign.
>
> I would also fix this in main.
>
> Am 19.09.2023 um 09:48 schrieb Uwe Schindler:
> > Looks like Java 11 can't compile this, see
> > https://github.com/apache/lucene/actions/runs/6232257025/job/16915121779#step:5:452
> >
> > /home/runner/work/lucene/lucene/lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java:78:
> > error: bad use of '>'
> >
> >  * p -> new ForkJoinWorkerThread(p) {}, null,
> > random().nextBoolean());
> > > Task :lucene:misc:compileJava FAILED
> >   ^
> > Note:
> > /home/runner/work/lucene/lucene/lucene/misc/src/java/org/apache/lucene/misc/util/fst/UpToTwoPositiveIntOutputs.java
> > uses or overrides a deprecated API.
> > Note: Recompile with -Xlint:deprecation for details.
> > 1 error
> > Note: Some input files use or override a deprecated API.
> >
> > Not sure what's wrong, I think the problem is with the anonymous
> > subclassing Maybe brackets around the whole "new ForkJoin()
> > {}" helps?
> >
> > Uwe
> >
> > Am 19.09.2023 um 09:04 schrieb jpou...@apache.org:
> >> This is an automated email from the ASF dual-hosted git repository.
> >>
> >> jpountz pushed a commit to branch branch_9x
> >> in repository https://gitbox.apache.org/repos/asf/lucene.git
> >>
> >>
> >> The following commit(s) were added to refs/heads/branch_9x by this push:
> >>   new c241ab006c4 Fix issues with BP tests and the security
> >> manager. (#12568)
> >> c241ab006c4 is described below
> >>
> >> commit c241ab006c4be918207adc69bb34fa72a48286f3
> >> Author: Adrien Grand 
> >> AuthorDate: Tue Sep 19 08:55:48 2023 +0200
> >>
> >>  Fix issues with BP tests and the security manager. (#12568)
> >>   The default ForkJoinPool implementation uses a thread
> >> factory that removes all
> >>  permissions on threads, so we need to create our own to avoid
> >> tests failing
> >>  with FS-based directories.
> >> ---
> >> .../src/java/org/apache/lucene/misc/index/BPIndexReorderer.java | 4 +++-
> >> .../test/org/apache/lucene/misc/index/TestBPIndexReorderer.java | 7
> >> ++-
> >>   2 files changed, 9 insertions(+), 2 deletions(-)
> >>
> >> diff --git
> >> a/lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java
> >> b/lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java
> >>
> >> index 7482e7a06ed..b8dadc3f6a0 100644
> >> ---
> >> a/lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java
> >> +++
> >> b/lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java
> >> @@ -74,7 +74,9 @@ import
> >> org.apache.lucene.util.OfflineSorter.BufferSize;
> >>*
> >>* Directory targetDir = FSDirectory.open(targetPath);
> >>* BPIndexReorderer reorderer = new BPIndexReorderer();
> >> - * reorderer.setForkJoinPool(ForkJoinPool.commonPool());
> >> + * ForkJoinPool pool = new
> >> ForkJoinPool(Runtime.getRuntime().availableProcessors(),
> >> + * p -> new ForkJoinWorkerThread(p) {}, null,
> >> random().nextBoolean());
> >> + * reorderer.setForkJoinPool(pool);
> >>* reorderer.setFields(Collections.singleton("body"));
> >>* CodecReader reorderedReaderView =
> >> reorderer.reorder(SlowCodecReaderWrapper.wrap(reader), targetDir);
> >>* try (IndexWriter w = new IndexWriter(targetDir, new
> >> IndexWriterConfig().setOpenMode(OpenMode.CREATE))) {
> >> diff --git
> >> a/lucene/misc/src/test/org/apache/lucene/misc/index/TestBPIndexReorderer.java
> >> b/lucene/misc/src/test/org/apache/lucene/misc/index/TestBPIndexReorderer.java
> >>
> >> index 4b6a9a85037..13d6989ff74 100644
> >> ---
> >> a/lucene/misc/src/test/org/apache/lucene/misc/index/TestBPIndexReorderer.java
> >> +++
> >> b/lucene/misc/src/test/org/apache/lucene/misc/index/TestBPIndexReorderer.java
> >> @@ -21,6 +21,7 @@ import static
> >> org.apache.lucene.misc.index.BPIndexReorderer.fastLog2;
> >>   i

Re: Lucene 9.8 Release

2023-09-19 Thread Adrien Grand
Thanks Patrick, this PR is now merged.

On Tue, Sep 19, 2023 at 6:22 AM Patrick Zhai  wrote:
>
> Update:
> Will wait https://github.com/apache/lucene/pull/12568 to be merged to cut the 
> branch
>
>
> On Mon, Sep 18, 2023 at 11:00 AM Michael Sokolov  wrote:
>>
>> +1 for a release soon, and thanks for volunteering, Patrick!
>>
>> On Tue, Sep 12, 2023 at 2:08 AM Patrick Zhai  wrote:
>> >
>> > Hi all,
>> > It's been a while since the last release and we have quite a few good 
>> > changes including new APIs, improvements and bug fixes. Should we release 
>> > the 9.8?
>> >
>> > If there's no objections I volunteer to be the release manager and will 
>> > cut the feature branch a week from now, which is Sep. 18th PST.
>> >
>> > Best
>> > Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>


-- 
Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene-MMAPv2-Windows (64bit/hotspot/jdk-21-rc) - Build # 801 - Still Unstable!

2023-09-18 Thread Adrien Grand
OK I just did that too, see https://github.com/apache/lucene/pull/12568.
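
For reference, the gist of the snippet added there, as a sketch (the
parallelism and asyncMode values are arbitrary here):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;

ForkJoinPool pool =
    new ForkJoinPool(
        Runtime.getRuntime().availableProcessors(),
        p -> new ForkJoinWorkerThread(p) {}, // anonymous subclass keeps the default thread permissions
        null,   // no uncaught exception handler
        false); // asyncMode
reorderer.setForkJoinPool(pool); // as in the BPIndexReorderer javadoc example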

On Mon, Sep 18, 2023 at 6:32 PM Uwe Schindler  wrote:
>
> It may still be a good idea to show an example how to pass a
> ForkJoinPool to the sorter that does not limit permissions (just
> examples for the educated reader).
>
> Uwe
>
> Am 18.09.2023 um 18:18 schrieb Adrien Grand:
> > Thanks Uwe for digging. The fork-join pool is optional, I will change
> > the test to use a ByteBuffersDirectory.
> >
> > On Mon, Sep 18, 2023 at 6:15 PM Uwe Schindler  wrote:
> >> Hi,
> >>
> >> this issue is a real one. The problem is: The default ForkJoin thread pool 
> >> runs all tasks with zero permissions if a security manager is present. As 
> >> the MMap Jenkins enforces usage of MMapDirectory for all tests (it passes 
> >> -Dtests.directory=MMapDirectory), all disk IO fails.
> >>
> >> This will be a big issue for Elasticsearch/Opensearch/Solr if we use the 
> >> default thread pool. If this is a test only issue, we should fix it:
> >>
> >> use non-FS-based directory
> >> use our own thread pool
> >>
> >> If this issue is in 9.8 branch we have to fix it!
> >>
> >> Uwe
> >>
> >> Am 18.09.2023 um 17:59 schrieb Policeman Jenkins Server:
> >>
> >> Build: https://jenkins.thetaphi.de/job/Lucene-MMAPv2-Windows/801/
> >> Java: 64bit/hotspot/jdk-21-rc -XX:-UseCompressedOops -XX:+UseG1GC
> >>
> >> 1 tests failed.
> >> FAILED:  
> >> org.apache.lucene.misc.index.TestBPIndexReorderer.testSingleTermWithForkJoinPool
> >>
> >> Error Message:
> >> java.security.AccessControlException: access denied 
> >> ("java.io.FilePermission" 
> >> "C:\Users\jenkins\workspace\Lucene-MMAPv2-Windows\lucene\misc\build\tmp\tests-tmp\lucene.misc.index.TestBPIndexReorderer_4B02FABB1F62D832-001\index-MMapDirectory-003\forward-index_sort_5.tmp"
> >>  "write")
> >>
> >> Stack Trace:
> >> java.security.AccessControlException: access denied 
> >> ("java.io.FilePermission" 
> >> "C:\Users\jenkins\workspace\Lucene-MMAPv2-Windows\lucene\misc\build\tmp\tests-tmp\lucene.misc.index.TestBPIndexReorderer_4B02FABB1F62D832-001\index-MMapDirectory-003\forward-index_sort_5.tmp"
> >>  "write")
> >> at __randomizedtesting.SeedInfo.seed([4B02FABB1F62D832:77694EDC9D6E8956]:0)
> >> at 
> >> java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:488)
> >> at 
> >> java.base/java.security.AccessController.checkPermission(AccessController.java:1071)
> >> at 
> >> java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:411)
> >> at java.base/java.lang.SecurityManager.checkWrite(SecurityManager.java:833)
> >> at 
> >> java.base/sun.nio.fs.WindowsChannelFactory.open(WindowsChannelFactory.java:302)
> >> at 
> >> java.base/sun.nio.fs.WindowsChannelFactory.newFileChannel(WindowsChannelFactory.java:168)
> >> at 
> >> java.base/sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:229)
> >> at 
> >> java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:482)
> >> at 
> >> org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newOutputStream(FilterFileSystemProvider.java:198)
> >> at 
> >> org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newOutputStream(FilterFileSystemProvider.java:198)
> >> at 
> >> org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.HandleTrackingFS.newOutputStream(HandleTrackingFS.java:132)
> >> at 
> >> org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.HandleTrackingFS.newOutputStream(HandleTrackingFS.java:132)
> >> at 
> >> org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newOutputStream(FilterFileSystemProvider.java:198)
> >> at java.base/java.nio.file.Files.newOutputStream(Files.java:227)
> >> at 
> >> org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.store.FSDirectory$FSIndexOutput.(FSDirectory.java:394)
> >> at 
> >> org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.store.FSDirectory.createTempOutput(FSDirectory.java:234)
> >> at 
> >> org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.t

Re: [JENKINS] Lucene-MMAPv2-Windows (64bit/hotspot/jdk-21-rc) - Build # 801 - Still Unstable!

2023-09-18 Thread Adrien Grand
Thanks Uwe for digging. The fork-join pool is optional, I will change
the test to use a ByteBuffersDirectory.

On Mon, Sep 18, 2023 at 6:15 PM Uwe Schindler  wrote:
>
> Hi,
>
> this issue is a real one. The problem is: The default ForkJoin thread pool 
> runs all tasks with zero permissions if a security manager is present. As the 
> MMap Jenkins enforces usage of MMapDirectory for all tests (it passes 
> -Dtests.directory=MMapDirectory), all disk IO fails.
>
> This will be a big issue for Elasticsearch/Opensearch/Solr if we use the 
> default thread pool. If this is a test only issue, we should fix it:
>
> use non-FS-based directory
> use our own thread pool
>
> If this issue is in 9.8 branch we have to fix it!
>
> Uwe
>
> Am 18.09.2023 um 17:59 schrieb Policeman Jenkins Server:
>
> Build: https://jenkins.thetaphi.de/job/Lucene-MMAPv2-Windows/801/
> Java: 64bit/hotspot/jdk-21-rc -XX:-UseCompressedOops -XX:+UseG1GC
>
> 1 tests failed.
> FAILED:  
> org.apache.lucene.misc.index.TestBPIndexReorderer.testSingleTermWithForkJoinPool
>
> Error Message:
> java.security.AccessControlException: access denied ("java.io.FilePermission" 
> "C:\Users\jenkins\workspace\Lucene-MMAPv2-Windows\lucene\misc\build\tmp\tests-tmp\lucene.misc.index.TestBPIndexReorderer_4B02FABB1F62D832-001\index-MMapDirectory-003\forward-index_sort_5.tmp"
>  "write")
>
> Stack Trace:
> java.security.AccessControlException: access denied ("java.io.FilePermission" 
> "C:\Users\jenkins\workspace\Lucene-MMAPv2-Windows\lucene\misc\build\tmp\tests-tmp\lucene.misc.index.TestBPIndexReorderer_4B02FABB1F62D832-001\index-MMapDirectory-003\forward-index_sort_5.tmp"
>  "write")
> at __randomizedtesting.SeedInfo.seed([4B02FABB1F62D832:77694EDC9D6E8956]:0)
> at 
> java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:488)
> at 
> java.base/java.security.AccessController.checkPermission(AccessController.java:1071)
> at 
> java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:411)
> at java.base/java.lang.SecurityManager.checkWrite(SecurityManager.java:833)
> at 
> java.base/sun.nio.fs.WindowsChannelFactory.open(WindowsChannelFactory.java:302)
> at 
> java.base/sun.nio.fs.WindowsChannelFactory.newFileChannel(WindowsChannelFactory.java:168)
> at 
> java.base/sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:229)
> at 
> java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:482)
> at 
> org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newOutputStream(FilterFileSystemProvider.java:198)
> at 
> org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newOutputStream(FilterFileSystemProvider.java:198)
> at 
> org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.HandleTrackingFS.newOutputStream(HandleTrackingFS.java:132)
> at 
> org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.HandleTrackingFS.newOutputStream(HandleTrackingFS.java:132)
> at 
> org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newOutputStream(FilterFileSystemProvider.java:198)
> at java.base/java.nio.file.Files.newOutputStream(Files.java:227)
> at 
> org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.store.FSDirectory$FSIndexOutput.(FSDirectory.java:394)
> at 
> org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.store.FSDirectory.createTempOutput(FSDirectory.java:234)
> at 
> org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.store.MockDirectoryWrapper.createTempOutput(MockDirectoryWrapper.java:752)
> at 
> org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.store.TrackingDirectoryWrapper.createTempOutput(TrackingDirectoryWrapper.java:49)
> at 
> org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.store.TrackingDirectoryWrapper.createTempOutput(TrackingDirectoryWrapper.java:49)
> at 
> org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.util.OfflineSorter$SortPartitionTask.call(OfflineSorter.java:623)
> at 
> org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.util.OfflineSorter$SortPartitionTask.call(OfflineSorter.java:610)
> at 
> java.base/java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1456)
> at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
> at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
> at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
> at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
> at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
>
>
> -
> To unsubscribe, e-mail: builds-unsubscr...@lucene.apache.org
> For additional 

Re: Lucene 9.8 Release

2023-09-12 Thread Adrien Grand
Thanks Patrick for volunteering as release manager!

Le mar. 12 sept. 2023, 08:07, Patrick Zhai  a écrit :

> Hi all,
> It's been a while since the last release and we have quite a few good
> changes including new APIs, improvements and bug fixes. Should we release
> the 9.8?
>
> If there's no objections I volunteer to be the release manager and will
> cut the feature branch a week from now, which is Sep. 18th PST.
>
> Best
> Patrick
>


Re: Enabling concurrent search only for certain queries

2023-07-19 Thread Adrien Grand
Hi Alexander,

It sounds likely that it will always be possible to pass an Executor
to IndexSearcher's constructor. So this sounds like a safe bet.

On Wed, Jul 19, 2023 at 7:22 AM Alexander Lukyanchikov
 wrote:
>
> Hi Adrien,
>
> Yes, that can be done. I just wanted to make sure my understanding is correct 
> and that's how the future API is going to look like before we do this 
> refactoring. Thank you.
>
> --
> Regards,
> Alex
>
>
> On Tue, Jul 18, 2023 at 3:26 PM Adrien Grand  wrote:
>>
>> Hi Alexander,
>>
>> You mentioned that your current implementation relies on a single 
>> IndexSearcher. Could you have two instead? One that configures an executor 
>> for long running queries and another one that doesn't?
>>
>> For reference, IndexSearchers are cheap to create, it would be ok to create 
>> one per query if that helps.
>>
>>
>> Le mar. 18 juil. 2023, 23:59, Alexander Lukyanchikov 
>>  a écrit :
>>>
>>> Hi everyone,
>>> We performed testing of the concurrent rewrite for knn vector queries in 
>>> Lucene 9.7 and the results look great, we see up to x9 improvement on large 
>>> datasets.
>>>
>>> Our current implementation for intra-query concurrency relies on a single 
>>> IndexSearcher per index which is always configured with an executor. The 
>>> intention is to execute only heavy / long running queries in concurrent 
>>> mode, so we use either Collector or CollectorManager API to control this 
>>> behavior. But the concurrent rewrite in KnnVectorQuery is effectively 
>>> always enabled if the IndexSearcher is configured with an executor, so we 
>>> need to find another way to turn it on and off when needed.
>>>
>>> Knowing that IndexSearcher#search(Query, Collector) is going to be removed 
>>> eventually, and a similar change was implemented for DrillSideays, my 
>>> understanding is that the long-term plan is to rely only on the presence of 
>>> the executor in IndexSearcher to select the sequential/concurrent code 
>>> path. Is this correct, or would people be open to introducing an additional 
>>> flag (e.g. in IndexSearch#search) to be able to override the default 
>>> behavior?
>>>
>>> --
>>> Regards,
>>> Alex
>>>


-- 
Adrien

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Enabling concurrent search only for certain queries

2023-07-18 Thread Adrien Grand
Hi Alexander,

You mentioned that your current implementation relies on a single
IndexSearcher. Could you have two instead? One that configures an executor
for long running queries and another one that doesn't?

For reference, IndexSearchers are cheap to create, it would be ok to create
one per query if that helps.
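
Something along these lines, as a sketch (the reader variable, pool size and
the cheap/heavy routing are whatever fits your setup):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.search.IndexSearcher;

ExecutorService executor = Executors.newFixedThreadPool(8);
IndexSearcher sequentialSearcher = new IndexSearcher(reader);           // cheap queries
IndexSearcher concurrentSearcher = new IndexSearcher(reader, executor); // heavy / long-running queries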


Le mar. 18 juil. 2023, 23:59, Alexander Lukyanchikov <
alexanderlukyanchi...@gmail.com> a écrit :

> Hi everyone,
> We performed testing of the concurrent rewrite for knn vector queries in
> Lucene 9.7 and the results look great, we see up to x9 improvement on large
> datasets.
>
> Our current implementation for intra-query concurrency relies on a single
> IndexSearcher per index which is always configured with an executor. The
> intention is to execute only heavy / long running queries in concurrent
> mode, so we use either Collector or CollectorManager API to control this
> behavior. But the concurrent rewrite in KnnVectorQuery is effectively
> always enabled if the IndexSearcher is configured with an executor, so we
> need to find another way to turn it on and off when needed.
>
> Knowing that IndexSearcher#search(Query, Collector) is going to be removed
>  eventually, and a similar
> change  was implemented for
> DrillSideays, my understanding is that the long-term plan is to rely only
> on the presence of the executor in IndexSearcher to select the
> sequential/concurrent code path. Is this correct, or would people be open
> to introducing an additional flag (e.g. in IndexSearch#search) to be able
> to override the default behavior?
>
> --
> Regards,
> Alex
>
>


Re: [JENKINS] Lucene-9.x-Linux (64bit/hotspot/jdk-17.0.5) - Build # 11322 - Unstable!

2023-06-27 Thread Adrien Grand
I opened a PR at https://github.com/apache/lucene/pull/12400 with a
fix, I tried to explain in the PR description why AssertingScorer has
this check. Even though it's not documented in BulkScorer#score, I
think it's a good check to keep.

On Wed, Jun 28, 2023 at 6:25 AM Adrien Grand  wrote:
>
> Thanks Patrick, I will look into it this morning.
>
> Le mer. 28 juin 2023, 06:20, Patrick Zhai  a écrit :
>>
>> Yeah I think that's the commit, I'm definitely not an expert on scorer as 
>> well so maybe @jpou...@gmail.com could you take a look?
>>
>> Patrick
>>
>> On Tue, Jun 27, 2023 at 5:34 AM Michael McCandless 
>>  wrote:
>>>
>>> Thanks for digging Patrick!
>>>
>>> I sort of think MaxScoreBulkScorer should be returning NO_MORE_DOCS in this 
>>> case?  But I'm far from an expert.  This may be related to the recent 
>>> MAXScore improvements for disjunctions?  
>>> (https://github.com/apache/lucene/commit/8703e449cee0693e50a7922a86c1cbc7dcf95d13)
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Tue, Jun 27, 2023 at 2:34 AM Patrick Zhai  wrote:
>>>>
>>>> The exception was thrown because TimeLimitingBulkScorer passed in a "max" 
>>>> which is larger than the maxDoc in the segment. And then 
>>>> MaxScoreBulkScorer directly returns the rangeEnd as the next estimation 
>>>> here and finally makes AssertingBulkScorer unhappy because it expects a 
>>>> NO_MORE_DOC in case that the "max" or "next" is larger than maxDoc. (here)
>>>>
>>>> I'm not super sure what's the right fix, seems to me neither 
>>>> TimeLimitingBulkScorer nor MaxScoreBulkScorer has violated the contract 
>>>> (as we never state in javadoc guarantee that if there's no more doc the 
>>>> method will return NO_MORE_DOC), so perhaps we should just let 
>>>> AssertingBulkScorer tolerate the case?
>>>>
>>>> Patrick
>>>>
>>>> On Mon, Jun 26, 2023 at 10:54 PM Policeman Jenkins Server 
>>>>  wrote:
>>>>>
>>>>> Build: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/11322/
>>>>> Java: 64bit/hotspot/jdk-17.0.5 -XX:-UseCompressedOops -XX:+UseSerialGC
>>>>>
>>>>> 1 tests failed.
>>>>> FAILED:  org.apache.lucene.expressions.TestExpressionSorts.testQueries
>>>>>
>>>>> Error Message:
>>>>> java.lang.AssertionError
>>>>>
>>>>> Stack Trace:
>>>>> java.lang.AssertionError
>>>>> at 
>>>>> __randomizedtesting.SeedInfo.seed([9D337074B96D1F8C:C1BDBCAFA304AA22]:0)
>>>>> at 
>>>>> org.apache.lucene.test_framework@9.8.0-SNAPSHOT/org.apache.lucene.tests.search.AssertingBulkScorer.score(AssertingBulkScorer.java:105)
>>>>> at 
>>>>> org.apache.lucene.core@9.8.0-SNAPSHOT/org.apache.lucene.search.TimeLimitingBulkScorer.score(TimeLimitingBulkScorer.java:82)
>>>>> at 
>>>>> org.apache.lucene.core@9.8.0-SNAPSHOT/org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38)
>>>>> at 
>>>>> org.apache.lucene.core@9.8.0-SNAPSHOT/org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:776)
>>>>> at 
>>>>> org.apache.lucene.test_framework@9.8.0-SNAPSHOT/org.apache.lucene.tests.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:78)
>>>>> at 
>>>>> org.apache.lucene.core@9.8.0-SNAPSHOT/org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:694)
>>>>> at 
>>>>> org.apache.lucene.core@9.8.0-SNAPSHOT/org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:688)
>>>>> at 
>>>>> org.apache.lucene.core@9.8.0-SNAPSHOT/org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:668)
>>>>> at 
>>>>> org.apache.lucene.core@9.8.0-SNAPSHOT/org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:571)
>>>>> at 
>>>>> org.apache.lucene.expressions.TestExpressionSorts.assertQuery(TestExpressionSorts.java:119)
>>>>> at 
>>>>> org.apache.lucene.expressions.TestExpressionSorts.assertQuery(TestExpressionSorts.java:113)
>>>>> at 
>>>>> org.apache.lucene.expressions.TestExpressionSorts.testQueries(TestExpressionSorts.java:92)
>>>>>  

Re: [JENKINS] Lucene-9.x-Linux (64bit/hotspot/jdk-17.0.5) - Build # 11322 - Unstable!

2023-06-27 Thread Adrien Grand
Thanks Patrick, I will look into it this morning.

Le mer. 28 juin 2023, 06:20, Patrick Zhai  a écrit :

> Yeah I think that's the commit, I'm definitely not an expert on scorer as
> well so maybe @jpou...@gmail.com  could you take a
> look?
>
> Patrick
>
> On Tue, Jun 27, 2023 at 5:34 AM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Thanks for digging Patrick!
>>
>> I sort of think MaxScoreBulkScorer should be returning NO_MORE_DOCS in
>> this case?  But I'm far from an expert.  This may be related to the recent
>> MAXScore improvements for disjunctions?  (
>> https://github.com/apache/lucene/commit/8703e449cee0693e50a7922a86c1cbc7dcf95d13
>> )
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Tue, Jun 27, 2023 at 2:34 AM Patrick Zhai  wrote:
>>
>>> The exception was thrown because TimeLimitingBulkScorer passed in a
>>> "max" which is larger than the maxDoc in the segment. And then
>>> MaxScoreBulkScorer directly returns the rangeEnd as the next estimation
>>> here
>>> 
>>>  and
>>> finally makes AssertingBulkScorer unhappy because it expects a NO_MORE_DOC
>>> in case that the "max" or "next" is larger than maxDoc. (here
>>> 
>>> )
>>>
>>> I'm not super sure what's the right fix, seems to me neither
>>> TimeLimitingBulkScorer nor MaxScoreBulkScorer has violated the contract (as
>>> we never state in javadoc guarantee that if there's no more doc the method
>>> will return NO_MORE_DOC), so perhaps we should just let AssertingBulkScorer
>>> tolerate the case?
>>>
>>> Patrick
>>>
>>> On Mon, Jun 26, 2023 at 10:54 PM Policeman Jenkins Server <
>>> jenk...@thetaphi.de> wrote:
>>>
 Build: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/11322/
 Java: 64bit/hotspot/jdk-17.0.5 -XX:-UseCompressedOops -XX:+UseSerialGC

 1 tests failed.
 FAILED:  org.apache.lucene.expressions.TestExpressionSorts.testQueries

 Error Message:
 java.lang.AssertionError

 Stack Trace:
 java.lang.AssertionError
 at
 __randomizedtesting.SeedInfo.seed([9D337074B96D1F8C:C1BDBCAFA304AA22]:0)
 at org.apache.lucene.test_framework@9.8.0-SNAPSHOT
 /org.apache.lucene.tests.search.AssertingBulkScorer.score(AssertingBulkScorer.java:105)
 at org.apache.lucene.core@9.8.0-SNAPSHOT
 /org.apache.lucene.search.TimeLimitingBulkScorer.score(TimeLimitingBulkScorer.java:82)
 at org.apache.lucene.core@9.8.0-SNAPSHOT
 /org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38)
 at org.apache.lucene.core@9.8.0-SNAPSHOT
 /org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:776)
 at org.apache.lucene.test_framework@9.8.0-SNAPSHOT
 /org.apache.lucene.tests.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:78)
 at org.apache.lucene.core@9.8.0-SNAPSHOT
 /org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:694)
 at org.apache.lucene.core@9.8.0-SNAPSHOT
 /org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:688)
 at org.apache.lucene.core@9.8.0-SNAPSHOT
 /org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:668)
 at org.apache.lucene.core@9.8.0-SNAPSHOT
 /org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:571)
 at
 org.apache.lucene.expressions.TestExpressionSorts.assertQuery(TestExpressionSorts.java:119)
 at
 org.apache.lucene.expressions.TestExpressionSorts.assertQuery(TestExpressionSorts.java:113)
 at
 org.apache.lucene.expressions.TestExpressionSorts.testQueries(TestExpressionSorts.java:92)
 at
 java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
 Method)
 at
 java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
 at
 java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.base/java.lang.reflect.Method.invoke(Method.java:568)
 at randomizedtesting.runner@2.8.1
 /com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
 at randomizedtesting.runner@2.8.1
 /com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
 at randomizedtesting.runner@2.8.1
 /com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
 at randomizedtesting.runner@2.8.1
 /com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
 at org.apache.lucene.test_framework@9.8.0-SNAPSHOT
 

[ANNOUNCE] Apache Lucene 9.7.0 released

2023-06-26 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.7.0.

Apache Lucene is a high-performance, full-featured search engine library
written entirely in Java. It is a technology suitable for nearly any
application that requires structured search, full-text search, faceting,
nearest-neighbor search across high-dimensionality vectors, spell
correction or query suggestions.

This release contains numerous bug fixes, optimizations, and improvements,
some of which are highlighted below. The release is available for immediate
download at:

  

### Lucene 9.7.0 Release Highlights:

 New features

 * The new IndexWriter#updateDocuments(Query, Iterable) allows updating
multiple documents that match a query at the same time.

 * Function queries can now compute similarity scores between kNN vectors.

 Optimizations

 * KNN indexing and querying can now take advantage of vectorization for
distance computation between vectors. To enable this, use exactly Java 20
or 21, and pass --add-modules jdk.incubator.vector as a command-line
parameter to the Java program (an example invocation is sketched after
this list).

 * KNN queries now run concurrently if the IndexSearcher has been created
with an executor.

 * Queries sorted by field are now able to dynamically prune hits only
using the after value. This yields major speedups when paginating deeply.

 * Reduced merge-time overhead of computing the number of soft deletes.
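
As an illustration of the flag above, an invocation could look like the
following (jar names, classpath layout and main class are placeholders, not
real artifacts):

  java --add-modules jdk.incubator.vector \
       -cp lucene-core-9.7.0.jar:my-app.jar \
       com.example.MyKnnApp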

 Changes in runtime behavior

 * KNN vectors are now disallowed to have non-finite values such as NaN or
±Infinity.

 Bug fixes

 * Backward reading is no longer an adversarial case for
BufferedIndexInput, used by NIOFSDirectory and SimpleFSDirectory. This
addresses a performance bug when performing terms dictionary lookups with
either of these directories.

 * GraphTokenStreamFiniteStrings#articulationPointsRecurse may no longer
overflow the stack.

 * ... plus a number of helpful bug fixes!

Please read CHANGES.txt for a full list of new features and changes:

  

-- 
Adrien


[RESULT] [VOTE] Release Lucene 9.7.0 RC1

2023-06-25 Thread Adrien Grand
It's been >72h since the vote was initiated and the result is:

+1  7  (5 binding)
 0  1
-1  0

This vote has PASSED.

Thanks all for voting, and in particular Uwe for doing more manual testing
with JDK21.

On Sun, Jun 25, 2023 at 12:33 AM Patrick Zhai  wrote:

> SUCCESS! [0:53:17.495903]
>
> +1 (non-binding), thank you Adrien!
>
> Patrick
>
> On Sat, Jun 24, 2023 at 3:00 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> +1
>>
>> SUCCESS! [0:16:13.144051]
>>
>> Mike
>>
>> On Fri, Jun 23, 2023, 11:48 PM Gautam Worah 
>> wrote:
>>
>>> SUCCESS! [0:32:53.769993]
>>>
>>> +1 (non-binding)
>>>
>>> Regards,
>>> Gautam Worah.
>>>
>>>
>>> On Fri, Jun 23, 2023 at 3:50 PM Mayya Sharipova
>>>  wrote:
>>>
>>>> Thank you  Adrien!
>>>>
>>>> SUCCESS! [0:59:16.681584]
>>>> +1
>>>>
>>>> On Fri, Jun 23, 2023 at 3:35 AM Uwe Schindler  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> SUCCESS! [1:04:57.975885]
>>>>> https://jenkins.thetaphi.de/job/Lucene-Release-Tester/27/console
>>>>>
>>>>> Smoke tester ran with Java 11 and Java 17. Unfortunately there's still
>>>>> no support in the smoke tester to run it with a set of arbitrary JDKs (some
>>>>> limited conformance tests with gradle should be executed to not make it
>>>>> take forever). We should open an issue for that; I would have created a PR
>>>>> already but my Python knowledge is minimal and my brain only supports
>>>>> copy-paste!
>>>>>
>>>>> I verified in addition the following:
>>>>>
>>>>>- Changes for completeness; I also updated the release notes
>>>>>(function query support for vectors was missing)
>>>>>- I regenerated the JDK 21 API signatures with latest JDK21 EA
>>>>>build 28, no changes - all fine.
>>>>>- I started Luke with Java 21, MMapDirectory was using memory
>>>>>segments.
>>>>>- I did not specifically test Java 20/21 vector support (see
>>>>>smoketester issue above).
>>>>>
>>>>> +1 to release!
>>>>>
>>>>> Uwe
>>>>> Am 21.06.2023 um 16:36 schrieb Adrien Grand:
>>>>>
>>>>> Please vote for release candidate 1 for Lucene 9.7.0
>>>>>
>>>>> The artifacts can be downloaded from:
>>>>>
>>>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.7.0-RC1-rev-ccf4b198ec328095d45d2746189dc8ca633e8bcf
>>>>>
>>>>> You can run the smoke tester directly with this command:
>>>>>
>>>>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>>>>
>>>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.7.0-RC1-rev-ccf4b198ec328095d45d2746189dc8ca633e8bcf
>>>>>
>>>>> The vote will be open for at least 72 hours i.e. until 2023-06-24
>>>>> 15:00 UTC.
>>>>>
>>>>> [ ] +1  approve
>>>>> [ ] +0  no opinion
>>>>> [ ] -1  disapprove (and reason why)
>>>>>
>>>>> Here is my +1
>>>>>
>>>>> --
>>>>> Adrien
>>>>>
>>>>> --
>>>>> Uwe Schindler
>>>>> Achterdiek 19, D-28357 Bremen
>>>>> https://www.thetaphi.de
>>>>> eMail: u...@thetaphi.de
>>>>>
>>>>>

-- 
Adrien


[VOTE] Release Lucene 9.7.0 RC1

2023-06-21 Thread Adrien Grand
Please vote for release candidate 1 for Lucene 9.7.0

The artifacts can be downloaded from:
https://dist.apache.org/repos/dist/dev/lucene/lucene-9.7.0-RC1-rev-ccf4b198ec328095d45d2746189dc8ca633e8bcf

You can run the smoke tester directly with this command:

python3 -u dev-tools/scripts/smokeTestRelease.py \
https://dist.apache.org/repos/dist/dev/lucene/lucene-9.7.0-RC1-rev-ccf4b198ec328095d45d2746189dc8ca633e8bcf

The vote will be open for at least 72 hours i.e. until 2023-06-24 15:00 UTC.

[ ] +1  approve
[ ] +0  no opinion
[ ] -1  disapprove (and reason why)

Here is my +1

-- 
Adrien


Re: Scorer#getMinScore()

2023-06-21 Thread Adrien Grand
Your guesses sound right to me:
 - A query that does subtractions could yield negative scores, which are
not supported.
 - We'd need to store the least competitive impacts for each block of
postings, which would double the amount of CPU and space we spend on
impacts, while min scores would likely be much less frequently useful than
max scores?

On Fri, Jun 9, 2023 at 10:10 PM Marc D'Mello  wrote:

> Hi all,
>
> I was wondering why there is no Scorer#getMinScore() equivalent to
> Scorer#getMaxScore() (here
> ).
> I think it could potentially be useful for skipping when you have scoring
> functions with a subtraction in it.
>
> As a contrived example, say I wrote a SubtractionAndQuery(Query a, Query
> b) that matched a conjunction of a and b but the score was a.score() -
> b.score(). When creating a scorer, the best getMaxScore() function I could
> create would look like this:
>
> float getMaxScore(int upto) {
>   return a.getMaxScore(upto);
> }
>
> However, this would not give me the tightest upper bound score possible as
> I am completely neglecting the "b" term here. Something like this would be
> better:
>
> float getMaxScore(int upto) {
>   return Math.max(a.getMaxScore(upto) - b.getMinScore(upto), 0);
> }
>
> So I was wondering if not including this API was by design (the same
> reason why Lucene doesn't allow negative scores for queries) or if it was
> because the added block level metadata required to store the min term
> scores would be too much? I'm sure there's some other issues I could be
> overlooking as well.
>
> Any answers would be greatly appreciated!
>
> Thanks,
> Marc
>


-- 
Adrien


Draft of release notes for 9.7

2023-06-21 Thread Adrien Grand
Hello all,

I put up a draft of release notes for 9.7, am I missing important changes?

https://cwiki.apache.org/confluence/display/LUCENE/Release+notes+9.7

-- 
Adrien


Re: Richer Aggregations in Lucene

2023-06-20 Thread Adrien Grand
Hey Shradha,

Such a contribution would be welcome. There is no good reason not to
support richer aggregations in Lucene. One thing that I have found
interesting with faceting/aggregations is that every implementation seems
to make different trade-offs, e.g.
 - Lucene's faceting historically required adding side-car data, but we
seem to want to make it work more and more with regular doc values instead
of the side-car index?
 - Both Lucene's faceting module and Solr (I think) load the set of matches
into a bitset first, and then compute facets against this bitset while
Elasticsearch computes aggregations within the collector.
 - Both Elasticsearch and Solr have composable aggregations, e.g. break
down by category, and then within each category by brand, but Lucene's
facets don't support this.

If you're going to build a new one, I have some suggestions:
 - Let's avoid dependencies on side-car indexes?
 - I don't think we should load matches into an int[] or BitSet. It takes
too much memory. However it's also true that collecting docs one-by-one
makes some things slower. Maybe we should look into doing
something in-between like batching computation of aggregations? This could
still allow taking advantage of e.g. vectorization if computing, say, the
average of a field.
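
To make the batching idea above a bit more concrete, here is a rough, purely
illustrative sketch; none of this is an existing Lucene API, and the "price"
field name is an assumption. It buffers matching doc IDs per segment and
reads a numeric doc-values field one block at a time, which is the kind of
tight loop that could later benefit from vectorization:

  import java.io.IOException;
  import org.apache.lucene.index.DocValues;
  import org.apache.lucene.index.LeafReaderContext;
  import org.apache.lucene.index.NumericDocValues;
  import org.apache.lucene.search.ScoreMode;
  import org.apache.lucene.search.SimpleCollector;

  final class BatchedSumCollector extends SimpleCollector {
    private final int[] buffer = new int[1024]; // one batch of matching doc ids
    private int bufferSize = 0;
    private NumericDocValues values;
    private long sum = 0;
    private long count = 0;

    @Override
    protected void doSetNextReader(LeafReaderContext context) throws IOException {
      flush(); // finish the previous segment's pending batch first
      values = DocValues.getNumeric(context.reader(), "price");
    }

    @Override
    public void collect(int doc) throws IOException {
      buffer[bufferSize++] = doc;
      if (bufferSize == buffer.length) {
        flush(); // process a whole block at once instead of doc-by-doc
      }
    }

    private void flush() throws IOException {
      for (int i = 0; i < bufferSize; i++) {
        if (values.advanceExact(buffer[i])) {
          sum += values.longValue();
          count++;
        }
      }
      bufferSize = 0;
    }

    // To be called once after the search to flush the last partial batch.
    double averageAndFlush() throws IOException {
      flush();
      return count == 0 ? 0 : (double) sum / count;
    }

    @Override
    public ScoreMode scoreMode() {
      return ScoreMode.COMPLETE_NO_SCORES;
    }
  }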


On Fri, Jun 16, 2023 at 4:14 PM Shradha Shankar 
wrote:

> Hi Lucene devs,
>
> I work on product search at Amazon, where we use Lucene faceting
> to compute aggregations. There's a few functionalities I'm missing with
> faceting. For example, faceting will always aggregate all the way up to the
> dimension and it can't compute multiple aggregations in one pass of the
> match-set.
>
> Lucene-based search engines (like Elastic or OpenSearch) have feature-rich
> aggregation engines which allow different collection modes and give the
> user
> more control over the granularity of the scopes for which aggregations are
> computed.
>
> Are there historical reasons not to have this type of aggregation engine
> directly in Lucene? If it seems like a worthwhile idea to pursue, I've
> experimented a bit with how we could fulfill these needs in Lucene and I
> can
> open an issue/PR.
>
> Thanks,
> Shradha
>


-- 
Adrien


Welcome Chris Hegarty to the Lucene PMC

2023-06-19 Thread Adrien Grand
I'm pleased to announce that Chris Hegarty has accepted an invitation to
join the Lucene PMC!

Congratulations Chris, and welcome aboard!

-- 
Adrien


New branch and feature freeze for Lucene 9.7.0

2023-06-16 Thread Adrien Grand
NOTICE:

Branch branch_9_7 has been cut and versions updated to 9.8 on stable branch.

Please observe the normal rules:

* No new features may be committed to the branch.
* Documentation patches, build patches and serious bug fixes may be
  committed to the branch. However, you should submit all patches you
  want to commit as pull requests first to give others the chance to review
  and possibly vote against them. Keep in mind that it is our
  main intention to keep the branch as stable as possible.
* All patches that are intended for the branch should first be committed
  to the unstable branch, merged into the stable branch, and then into
  the current release branch.
* Normal unstable and stable branch development may continue as usual.
  However, if you plan to commit a big change to the unstable branch
  while the branch feature freeze is in effect, think twice: can't the
  addition wait a couple more days? Merges of bug fixes into the branch
  may become more difficult.
* Only Github issues with Milestone 9.7
  and priority "Blocker" will delay a release candidate build.

-- 
Adrien


Re: Lucene 9.7 release

2023-06-12 Thread Adrien Grand
Hi Alessandro,

It's ok to merge changes before feature freeze, currently planned for
Friday. From a quick look, this is a new feature rather than a bug fix, so
if it's not ready by Friday it could wait until the next minor?

On Mon, Jun 12, 2023 at 6:15 PM Alessandro Benedetti 
wrote:

> Hi,
> we are finalizing https://github.com/apache/lucene/pull/12253, we got
> some last-minute valuable review comments and we would like to apply the
> suggestions and bring them in 9.x .
> Cheers
> --
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benede...@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Fri, 9 Jun 2023 at 23:53, Uwe Schindler  wrote:
>
>> Hi,
>>
>> BTW, there was a slight change in APIJARs caused by this API change:
>> https://github.com/openjdk/jdk/commit/5fc9b5787dc4d7f00d2c59288bc8d840fdf5b495
>> (this does not affect our code, but it was done 3 weeks ago). I hope
>> something like this won't happen. I updated the PR, no code changes needed
>> as those methods were not used by Lucene.
>>
>> I'd like to update the APIJARS again shortly before the feature branch is
>> created.
>>
>> Uwe
>> Am 09.06.2023 um 23:10 schrieb Uwe Schindler:
>>
>> Let me merge and backport the java 21 map PR first. It has all new source
>> directories and APIJAR files.
>>
>> For safety I will regenerate the 21 APIJAR with newest jdk build. Fyi, to
>> regenerate you need to have an environment variable with jdk21 as
>> autoprovisioning doesn't work.
>>
>> After that we can copy-paste the vector impl to the main/java21 folder
>> and add vector classes to it.
>>
>> Uwe
>>
>>
>> Am 9. Juni 2023 22:30:09 MESZ schrieb Chris Hegarty
>> 
>> :
>>
>>> Hi,
>>>
>>> On 9 Jun 2023, at 17:19, Uwe Schindler 
>>>  wrote:
>>>
>>> Hi,
>>>
>>> if possible I would like to get the Java 21 changes (MemorySegments and
>>> Vector) into the release. I'd like to ask Chris who has better knowledge
>>> how to proceed. If he suggests to wait maybe a week or 2, I'd suggest to
>>> wait that time.
>>>
>>> Chris Hegarty: Do you know if the API of JDK 21 is finalized or not?
>>> From my understanding the final phases have started, so API changes are
>>> unlikely. If there are bug fixes they won't affect public APIs or the
>>> incubator module, right?
>>>
>>> Your understanding is correct. I do not expect any API changes at this
>>> point.
>>>
>>> The MMapDir changes are already tested all the time, vector API needs
>>> the forward port to 21.
>>>
>>> We are also doing some early testing with JDK 21 EA, and it would be
>>> great to get the 21-version of Panama VectorUtils in. I can help get this
>>> done.
>>>
>>> Uwe, what has been done so far? If nothing, and that is still the case
>>> tomorrow, I can start on it.
>>>
>>> -Chris.
>>>
>>> Uwe
>>> Am 09.06.2023 um 18:07 schrieb Adrien Grand:
>>>
>>> Hello all,
>>>
>>> There is some good stuff that is scheduled for 9.7 already, I found the
>>> following changes in the changelog that look especially interesting:
>>>  - Concurrent query rewrites for vector queries.
>>>  - Speedups to vector indexing/search via integration of the Panama
>>> vector API.
>>>  - Reduced overhead of soft deletes.
>>>  - Support for update by query.
>>>
>>> I propose we start the process for a 9.7 release, and I volunteer to be
>>> the release manager. I suggest the following schedule:
>>>  - Feature freeze on June 16th, one week from now. This is when the 9.7
>>> branch will be cut.
>>>  - Open a vote on June 21st, which we'll possibly delay if blockers get
>>> identified.
>>>
>>> --
>>> Adrien
>>>
>>> --
>>> Uwe Schindler
>>> Achterdiek 19, D-28357 Bremen
>>> https://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>>
>>>
>>> --
>> Uwe Schindler
>> Achterdiek 19, 28357 Bremen
>> https://www.thetaphi.de
>>
>> --
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> https://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>

-- 
Adrien


Lucene 9.7 release

2023-06-09 Thread Adrien Grand
Hello all,

There is some good stuff that is scheduled for 9.7 already, I found the
following changes in the changelog that look especially interesting:
 - Concurrent query rewrites for vector queries.
 - Speedups to vector indexing/search via integration of the Panama vector
API.
 - Reduced overhead of soft deletes.
 - Support for update by query.

I propose we start the process for a 9.7 release, and I volunteer to be the
release manager. I suggest the following schedule:
 - Feature freeze on June 16th, one week from now. This is when the 9.7
branch will be cut.
 - Open a vote on June 21st, which we'll possibly delay if blockers get
identified.

-- 
Adrien


Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-08 Thread Adrien Grand
As Dawid pointed out earlier on this thread, this is the rule for
Apache projects: a single -1 vote on a code change is a veto and
cannot be overridden. Furthermore, Robert is one of the people on this
project who worked the most on debugging subtle bugs, making Lucene
more robust and improving our test framework, so I'm listening when he
voices quality concerns.

The argument against removing/raising the limit that resonates with me
the most is that it is a one-way door. As MikeS highlighted earlier on
this thread, implementations may want to take advantage of the fact
that there is a limit at some point too. This is why I don't want to
remove the limit and would prefer a slight increase, such as 2048 as
suggested in the original issue, which would enable most of the things
that users who have been asking about raising the limit would like to
do.

I agree that the merge-time memory usage and slow indexing rate are
not great. But it's still possible to index multi-million vector
datasets with a 4GB heap without hitting OOMEs regardless of the
number of dimensions, and the feedback I'm seeing is that many users
are still interested in indexing multi-million vector datasets despite
the slow indexing rate. I wish we could do better, and vector indexing
is certainly more expert than text indexing, but it still is usable in
my opinion. I understand how giving Lucene more information about
vectors prior to indexing (e.g. clustering information as Jim pointed
out) could help make merging faster and more memory-efficient, but I
would really like to avoid making it a requirement for indexing
vectors as it also makes this feature much harder to use.

On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
 wrote:
>
> I am very attentive to listening to opinions, but I am unconvinced here and I am 
> not sure that a single person's opinion should be allowed to be detrimental to 
> such an important project.
>
> The limit as far as I know is literally just raising an exception.
> Removing it won't alter in any way the current performance for users in low 
> dimensional space.
> Removing it will just enable more users to use Lucene.
>
> If new users in certain situations will be unhappy with the performance, they 
> may contribute improvements.
> This is how you make progress.
>
> If it's a reputation thing, trust me that not allowing users to play with 
> high dimensional space will equally damage it.
>
> To me it's really a no brainer.
> Removing the limit and enable people to use high dimensional vectors will 
> take minutes.
> Improving the hnsw implementation can take months.
> Pick one to begin with...
>
> And there's no-one paying me here, no company interest whatsoever, actually I 
> pay people to contribute, I am just convinced it's a good idea.
>
>
> On Sat, 8 Apr 2023, 18:57 Robert Muir,  wrote:
>>
>> I disagree with your categorization. I put in plenty of work and
>> experienced plenty of pain myself, writing tests and fighting these
>> issues, after i saw that, two releases in a row, vector indexing fell
>> over and hit integer overflows etc on small datasets:
>>
>> https://github.com/apache/lucene/pull/11905
>>
>> Attacking me isn't helping the situation.
>>
>> PS: when i said the "one guy who wrote the code" I didn't mean it in
>> any kind of demeaning fashion really. I meant to describe the current
>> state of usability with respect to indexing a few million docs with
>> high dimensions. You can scroll up the thread and see that at least
>> one other committer on the project experienced similar pain as me.
>> Then, think about users who aren't committers trying to use the
>> functionality!
>>
>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov  wrote:
>> >
>> > What you said about increasing dimensions requiring a bigger ram buffer on 
>> > merge is wrong. That's the point I was trying to make. Your concerns about 
>> > merge costs are not wrong, but your conclusion that we need to limit 
>> > dimensions is not justified.
>> >
>> > You complain that hnsw sucks, that it doesn't scale, but when I show it scales 
>> > linearly with dimension you just ignore that and complain about something 
>> > entirely different.
>> >
>> > You demand that people run all kinds of tests to prove you wrong but when 
>> > they do, you don't listen and you won't put in the work yourself or 
>> > complain that it's too hard.
>> >
>> > Then you complain about people not meeting you half way. Wow
>> >
>> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir  wrote:
>> >>
>> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
>> >>  wrote:
>> >> >
>> >> > What exactly do you consider reasonable?
>> >>
>> >> Let's begin a real discussion by being HONEST about the current
>> >> status. Please put politically correct or your own company's wishes
>> >> aside, we know it's not in a good state.
>> >>
>> >> Current status is the one guy who wrote the code can set a
>> >> multi-gigabyte ram buffer and index a small dataset with 1024
>> >> dimensions in HOURS (i 

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-03-31 Thread Adrien Grand
I'm supportive of bumping the limit on the maximum dimension for
vectors to something that is above what the majority of users need,
but I'd like to keep a limit. We have limits for other things like the
max number of docs per index, the max term length, the max number of
dimensions of points, etc. and there are a few things that we don't
have limits on that I wish we had limits on. These limits allow us to
better tune our data structures, prevent overflows, help ensure we
have good test coverage, etc.

That said, these other limits we have in place are quite high. E.g.
the 32kB term limit, nobody would ever type a 32kB term in a text box.
Likewise for the max of 8 dimensions for points: a segment cannot
possibly have 2 splits per dimension on average if it doesn't have
512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
than 8 would likely defeat the point of indexing. In contrast, our
limit on the number of dimensions of vectors seems to be under what
some users would like, and while I understand the performance argument
against bumping the limit, it doesn't feel to me like something that
would be so bad that we need to prevent users from using numbers of
dimensions in the low thousands, e.g. top-k KNN searches would still
look at a very small subset of the full dataset.

So overall, my vote would be to bump the limit to 2048 as suggested by
Mayya on the issue that you linked.

On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner
 wrote:
>
> Thanks Alessandro for summarizing the discussion below!
>
> I understand that there is no clear reasoning re what the best embedding 
> size is, though I think heuristic approaches like the one described at the 
> following link can be helpful
>
> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>
> Having said this, we see various embedding services providing higher 
> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>
> And it would be great if we could run benchmarks without having to recompile 
> Lucene ourselves.
>
> Therefore I would suggest either increasing the limit or, even better, 
> removing the limit and adding a disclaimer that people should be aware of 
> possible crashes etc.
>
> Thanks
>
> Michael
>
>
>
>
> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>
>
> I've been monitoring various discussions on Pull Requests about changing the 
> max number of dimensions allowed for Lucene HNSW vectors:
>
> https://github.com/apache/lucene/pull/12191
>
> https://github.com/apache/lucene/issues/11507
>
>
> I would like to set up a discussion and potentially a vote about this.
>
> I have seen some strong opposition from a few people but a majority of favor 
> in this direction.
>
>
> Motivation
>
> We were discussing in the Solr slack channel with Ishan Chattopadhyaya, 
> Marcus Eagan, and David Smiley about some neural search integrations in Solr: 
> https://github.com/openai/chatgpt-retrieval-plugin
>
>
> Proposal
>
> No hard limit at all.
>
> As for many other Lucene areas, users will be allowed to push the system to 
> the limit of their resources and get terrible performance or crashes if they 
> want.
>
>
> What we are NOT discussing
>
> - Quality and scalability of the HNSW algorithm
>
> - dimensionality reduction
>
> - strategies to fit in an arbitrary self-imposed limit
>
>
> Benefits
>
> - users can use the models they want to generate vectors
>
> - removal of an arbitrary limit that blocks some integrations
>
>
> Cons
>
>  - if you go for vectors with high dimensions, there's no guarantee you get 
> acceptable performance for your use case
>
>
>
> I want to keep it simple: right now, in many Lucene areas, you can push the 
> system to unacceptable performance/crashes.
>
> For example, we don't limit the number of docs per index to an arbitrary 
> maximum of N; you push as many docs as you like and if they are too many for 
> your system, you get terrible performance/crashes/whatever.
>
>
> Limits caused by primitive java types will stay there behind the scene, and 
> that's acceptable, but I would prefer to not have arbitrary hard-coded ones 
> that may limit the software usability and integration which is extremely 
> important for a library.
>
>
> I strongly encourage people to add benefits and cons, that I missed (I am 
> sure I missed some of them, but wanted to keep it simple)
>
>
> Cheers
>
> --
> Alessandro Benedetti
> Director @ Sease Ltd.
> Apache Lucene/Solr Committer
> Apache Solr PMC Member
>
> e-mail: a.benede...@sease.io
>
>
> Sease - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io
> LinkedIn | Twitter | Youtube | Github
>
>


-- 
Adrien




Re: [JENKINS] Lucene-MMAPv2-Linux (64bit/openj9/jdk-17.0.5) - Build # 644 - Unstable!

2023-03-15 Thread Adrien Grand
We have seen this issue a few times over the past months. I would
default to assuming a bug in J9. Do we have a contact on the J9 team
that we should make aware of this?

On Sat, Mar 11, 2023 at 12:33 PM Policeman Jenkins Server
 wrote:
>
> Build: https://jenkins.thetaphi.de/job/Lucene-MMAPv2-Linux/644/
> Java: 64bit/openj9/jdk-17.0.5 -XX:-UseCompressedOops -Xgcpolicy:metronome
>
> 1 tests failed.
> FAILED:  
> org.apache.lucene.index.TestDocumentsWriterDeleteQueue.testAdvanceReferencesOriginal
>
> Error Message:
> java.lang.AssertionError: expected null, but was:
>
> Stack Trace:
> java.lang.AssertionError: expected null, but was:
> at 
> __randomizedtesting.SeedInfo.seed([961DEC0B9DF87C04:F037B61A4B9C829F]:0)
> at app//org.junit.Assert.fail(Assert.java:89)
> at app//org.junit.Assert.failNotNull(Assert.java:756)
> at app//org.junit.Assert.assertNull(Assert.java:738)
> at app//org.junit.Assert.assertNull(Assert.java:748)
> at 
> app//org.apache.lucene.index.TestDocumentsWriterDeleteQueue.testAdvanceReferencesOriginal(TestDocumentsWriterDeleteQueue.java:42)
> at 
> java.base@17.0.5/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> at 
> java.base@17.0.5/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
> at 
> java.base@17.0.5/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base@17.0.5/java.lang.reflect.Method.invoke(Method.java:568)
> at 
> app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at 
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at 
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> at 
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> at 
> app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> at 
> app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at 
> app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at 
> app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at 
> app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at 
> app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at 
> app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> at 
> app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> at 
> app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> at 
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
> at 
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
> at 
> app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
> at 
> app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at 
> app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at 
> app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at 
> app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at 
> app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> app//org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at 
> app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at 
> app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at 
> app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at 
> 

Re: Lucene PMC Chair Greg Miller

2023-03-07 Thread Adrien Grand
Thank you Bruno and Greg!

Le lun. 6 mars 2023, 18:15, Bruno Roustant  a écrit :

> Hello Lucene developers,
>
> Lucene Program Management Committee has elected a new chair, Greg Miller,
> and the Board has approved.
>
> Greg, thank you for stepping up, and congratulations!
>
>
> - Bruno
>


Re: [JENKINS] Lucene-9.x-MacOSX (64bit/hotspot/jdk-11.0.15) - Build # 1806 - Failure!

2023-02-07 Thread Adrien Grand
It's an interesting failure: it fell through the cracks because
Javadoc 17 leniently accepts the bad reference (int instead of float
in the constructor signature) and automatically fixes it (the produced
HTML says "float") while Javadoc 11 rejects the bad reference.

On Tue, Feb 7, 2023 at 10:24 AM Adrien Grand  wrote:
>
> I'm looking into it.
>
> On Tue, Feb 7, 2023 at 8:08 AM Policeman Jenkins Server
>  wrote:
> >
> > Build: https://jenkins.thetaphi.de/job/Lucene-9.x-MacOSX/1806/
> > Java: 64bit/hotspot/jdk-11.0.15 -XX:-UseCompressedOops -XX:+UseParallelGC
> >
> > No tests ran.
> >
>
>
>
> --
> Adrien



-- 
Adrien




Re: [JENKINS] Lucene-9.x-MacOSX (64bit/hotspot/jdk-11.0.15) - Build # 1806 - Failure!

2023-02-07 Thread Adrien Grand
I'm looking into it.

On Tue, Feb 7, 2023 at 8:08 AM Policeman Jenkins Server
 wrote:
>
> Build: https://jenkins.thetaphi.de/job/Lucene-9.x-MacOSX/1806/
> Java: 64bit/hotspot/jdk-11.0.15 -XX:-UseCompressedOops -XX:+UseParallelGC
>
> No tests ran.
>



-- 
Adrien




Welcome Ben Trent as Lucene committer

2023-01-27 Thread Adrien Grand
I'm pleased to announce that Ben Trent has accepted the PMC's
invitation to become a committer.

Ben, the tradition is that new committers introduce themselves with a
brief bio.

Congratulations and welcome!

-- 
Adrien




Re: Lucene 9.5 release notes draft

2023-01-26 Thread Adrien Grand
Thanks Luca, the release notes look good to me.

On Thu, Jan 26, 2023 at 10:11 AM Luca Cavanna  wrote:
>
> Hi all,
> I published a draft of the release notes for Lucene 9.5 here: 
> https://cwiki.apache.org/confluence/display/LUCENE/Release+Notes+9.5
>
> Could you please review it? Feel free to make suggestions/edits directly in 
> Confluence.
>
> Thanks
> Luca



-- 
Adrien




Re: [VOTE] Release Lucene 9.5.0 RC1

2023-01-26 Thread Adrien Grand
+1

I lost my console so I no longer have the time that smoketester took
but it passed. Changes look good to me too.

On Thu, Jan 26, 2023 at 10:23 AM Ignacio Vera  wrote:
>
> +1
>
> SUCCESS! [0:44:15.998020]
>
>
> On Thu, Jan 26, 2023 at 9:19 AM Jan Høydahl  wrote:
>>
>> +1
>>
>> SUCCESS! [0:36:32.191785]
>>
>> Jan
>>
>> 25. jan. 2023 kl. 19:43 skrev Luca Cavanna :
>>
>> Please vote for release candidate 1 for Lucene 9.5.0
>>
>> The artifacts can be downloaded from:
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.5.0-RC1-rev-13803aa6ea7fee91f798cfeded4296182ac43a21
>>
>> You can run the smoke tester directly with this command:
>>
>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.5.0-RC1-rev-13803aa6ea7fee91f798cfeded4296182ac43a21
>>
>> The vote will be open for at least 72 hours i.e. until 2023-01-28 19:00 UTC.
>>
>> [ ] +1  approve
>> [ ] +0  no opinion
>> [ ] -1  disapprove (and reason why)
>>
>> Here is my +1
>>
>>


-- 
Adrien




Re: Lucene 9.5.0 release

2023-01-23 Thread Adrien Grand
We did a major cleanup to the vector API in 9.5 but there are a few things
that still annoy me a bit that are worth fixing in my opinion:
 - VectorValues, the API for float vectors, still exposes a binaryValue()
API. We should remove it and only expose floats in the API?
 - Byte vectors should be represented as byte[] instead of BytesRef in
ByteVectorValues, KnnByteVectorField and KnnByteVectorQuery.
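
For context, a minimal sketch of the byte[]-based shape being proposed; the
class names follow the existing ones mentioned above, but the exact
constructor signatures are an assumption until the change actually lands:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.KnnByteVectorField;
  import org.apache.lucene.index.VectorSimilarityFunction;
  import org.apache.lucene.search.KnnByteVectorQuery;

  class ByteVectorSketch {
    Document newDoc() {
      Document doc = new Document();
      // A raw byte[] on the indexing side instead of a BytesRef wrapper:
      doc.add(new KnnByteVectorField("embedding", new byte[] {12, -4, 33, 7},
          VectorSimilarityFunction.DOT_PRODUCT));
      return doc;
    }

    KnnByteVectorQuery newQuery() {
      // ... and a plain byte[] target on the query side, asking for the top 10 hits:
      return new KnnByteVectorQuery("embedding", new byte[] {10, -2, 30, 9}, 10);
    }
  }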

If these two changes are low-hanging fruit, maybe we can fold them in
order to avoid pushing another breaking change to the API later on. I'll
give it a try later today if no one beats me to it.

On Mon, Jan 23, 2023 at 11:21 AM Luca Cavanna 
wrote:

> Hi all,
> I meant to start the release today and I see this PR is not merged yet:
> https://github.com/apache/lucene/pull/12029 . Alessandro, do you still
> plan on merging it shortly?
>
> Thanks
> Luca
>
> On Sat, Jan 21, 2023 at 11:41 AM Michael Wechner <
> michael.wech...@wyona.com> wrote:
>
>> I tried to understand the issue described on github, but unfortunately do
>> not really understand it.
>>
>> Can you explain a little more?
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> Am 21.01.23 um 11:00 schrieb Alessandro Benedetti:
>>
>> Hi,
>> this would be nice to have in 9.5 :
>> https://github.com/apache/lucene/issues/12099
>>
>> It's a minor (adding getters to KnnQuery) but can be beneficial in Apache
>> Solr as soon as possible.
>> Planning to merge in a few hours if no objections.
>> --
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benede...@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io 
>> LinkedIn  | Twitter
>>  | Youtube
>>  | Github
>> 
>>
>>
>> On Thu, 19 Jan 2023 at 14:38, Luca Cavanna 
>>  wrote:
>>
>>> Thanks Robert for the help with the github milestone.
>>>
>>> I am planning on cutting the release branch on Monday if there are no
>>> objections.
>>>
>>> Cheers
>>> Luca
>>>
>>> On Tue, Jan 17, 2023 at 7:08 PM Robert Muir  wrote:
>>>
 +1 to release, thank you for volunteering to be RM!

 I went thru 9.5 section of CHANGES.txt and tagged all the GH issues in
 there with milestone too, if they didn't already have it. It looks
 even bigger now.

 On Fri, Jan 13, 2023 at 4:54 AM Luca Cavanna 
 wrote:
 >
 > Hi all,
 > I'd like to propose that we release Lucene 9.5.0. There is a decent
 amount of changes that would go into it looking at the github milestone:
 https://github.com/apache/lucene/milestone/4 . I'd volunteer to be the
 release manager. There is one PR open listed for the 9.5 milestone:
 https://github.com/apache/lucene/pull/11873 . Is this something that
 we do want to address before we release? Is anybody aware of outstanding
 work that we would like to include or known blocker issues that are not
 listed in the 9.5 milestone?
 >
 > Cheers
 > Luca
 >
 >
 >
 >



>>

-- 
Adrien


Re: Lucene 9.5.0 release

2023-01-13 Thread Adrien Grand
+1 to doing a 9.5 release, it's overdue

Le ven. 13 janv. 2023, 10:54, Luca Cavanna  a écrit :

> Hi all,
> I'd like to propose that we release Lucene 9.5.0. There is a decent amount
> of changes that would go into it looking at the github milestone:
> https://github.com/apache/lucene/milestone/4 . I'd volunteer to be the
> release manager. There is one PR open listed for the 9.5 milestone:
> https://github.com/apache/lucene/pull/11873 . Is this something that we
> do want to address before we release? Is anybody aware of outstanding work
> that we would like to include or known blocker issues that are not listed
> in the 9.5 milestone?
>
> Cheers
> Luca
>
>
>
>
>


Re: Request for naming help

2023-01-01 Thread Adrien Grand
Sorry Marc, I had missed your message. This is what I meant indeed.

On Fri, Dec 30, 2022 at 4:36 PM Greg Miller  wrote:
>
> OK, great! Thanks Marc. I plan on merging the PR today.
>
> Cheers,
> -Greg
>
> On Thu, Dec 29, 2022 at 3:23 PM Marc D'Mello  wrote:
>>
>> Hi Greg,
>>
>> I'm also OK merging as is since this is a new feature and doesn't affect any 
>> of the current functionality. I also think there are no glaring issues with 
>> the API in its current state. However, I do think that merging the range and 
>> rangeonrange functionality makes sense and I like Adrien's suggestion of 
>> providing factory methods. I think if we merge in its current state we 
>> should create a new issue to refactor the range and rangeonrange faceting 
>> package into one and follow the RangeFieldQuery model more closely.
>>
>> On Thu, Dec 29, 2022 at 2:58 PM Greg Miller  wrote:
>>>
>>> Hey Marc-
>>>
>>> I don't want to speak for Adrien as he might have something different in 
>>> mind, but I think that's more-or-less the idea. I'm not sure the factory 
>>> methods belong on the LongRange/DoubleRange classes, or if separate classes 
>>> should be created for this purpose (which is more how I thought of it)?
>>>
>>> To do this cleanly though, I'd really like us to try to consolidate all the 
>>> "range related" faceting functionality into one java package and 
>>> consolidate the API a bit. As part of this, I think we can be a little 
>>> smarter about not duplicating the "range" classes themselves.
>>>
>>> All this said, given that I think your "range on range" faceting PR is 
>>> ready to be merged as it currently exists, and has been through a number of 
>>> iteration already, I'm OK if we want to merge that work as it stands and 
>>> follow up with revisiting the API/naming/etc. as a future project. What do 
>>> you think?
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Tue, Dec 13, 2022 at 7:23 PM Marc D'Mello  wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm a bit unsure about what is being suggested. Is the idea to rename 
>>>> range#LongRange and rangeonrange#LongRange to LongFieldFacets and 
>>>> LongRangeFacets respectively and stick the static getters in there? In 
>>>> that case, I also think that the idea makes a lot of sense and that it 
>>>> would match our current range query API much better.
>>>>
>>>> In addition, looking at document#LongRange, there are queries like 
>>>> newContainsQuery() and newWithinQuery() that we can probably mimic to 
>>>> avoid exposing RangeFieldQuery.QueryType to the user.
>>>>
>>>> On Tue, Dec 13, 2022 at 5:04 PM Greg Miller  wrote:
>>>>>
>>>>> Thanks for the suggestion Adrien. I like this idea! Marc- what do you 
>>>>> think?
>>>>>
>>>>> We might need to rework the package structure under the facets module to 
>>>>> make this clean, but that might not be a terrible thing anyway. The 
>>>>> existing sub-packages will make it challenging to get the visibility 
>>>>> right. I think it would be ideal to flatten the package so we can reduce 
>>>>> visibility of the class definitions and only expose the factory methods.
>>>>>
>>>>> Cheers,
>>>>> -Greg
>>>>>
>>>>> On Tue, Dec 13, 2022 at 01:18 Adrien Grand  wrote:
>>>>>>
>>>>>> I wonder if the facets actually require a different name, since they
>>>>>> look to me like a generalization of range facets for range fields,
>>>>>> while we previously only supported range facets on numeric fields. We
>>>>>> could keep calling them range facets?
>>>>>>
>>>>>> Maybe we could use the same model we used for queries by not exposing
>>>>>> query classes to users and providing factory methods, e.g. we could
>>>>>> have something like:
>>>>>>
>>>>>> public class LongFieldFacets {
>>>>>>
>>>>>>   public static Facets getRangeFacetCounts(String field,
>>>>>> FacetsCollector hits, LongRange... ranges) {
>>>>>> return new LongRangeFacetCounts(...);
>>>>>>   }
>>>>>>
>>>>>> }
>>>>>>
>>>>>> public class LongRangeFacets {

Re: Request for naming help

2022-12-13 Thread Adrien Grand
I wonder if the facets actually require a different name, since they
look to me like a generalization of range facets for range fields,
while we previously only supported range facets on numeric fields. We
could keep calling them range facets?

Maybe we could use the same model we used for queries by not exposing
query classes to users and providing factory methods, e.g. we could
have something like:

public class LongFieldFacets {

  public static Facets getRangeFacetCounts(String field,
      FacetsCollector hits, LongRange... ranges) {
    return new LongRangeFacetCounts(...);
  }

}

public class LongRangeFacets {

  // same function name
  public static Facets getRangeFacetCounts(String field,
      FacetsCollector hits, RangeFieldQuery.QueryType queryType,
      LongRange... ranges) {
    return new LongRangeOnRangeFacetCounts(...);
  }

}

We'd still need to give a name for these classes, but the name would
be less important since these class names would be only for ourselves.
Users would never see them and refer to this new functionality as
range facets on range fields?
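
To show how that factory-method model would look from the caller side, here is
a hypothetical fragment; LongFieldFacets and LongRangeFacets do not exist yet,
the LongRange used here is the faceting one, and `searcher`, `query` and the
field names are assumptions made up for the example:

  FacetsCollector fc = new FacetsCollector();
  FacetsCollector.search(searcher, query, 10, fc);

  // Plain numeric field: classic range facet counts.
  Facets priceCounts = LongFieldFacets.getRangeFacetCounts("price", fc,
      new LongRange("cheap", 0, true, 100, false));

  // Range field: same entry-point name, plus the relation to apply.
  Facets priceRangeCounts = LongRangeFacets.getRangeFacetCounts("priceRange", fc,
      RangeFieldQuery.QueryType.INTERSECTS,
      new LongRange("cheap", 0, true, 100, false));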

On Mon, Dec 12, 2022 at 10:11 PM Gus Heck  wrote:
>
> In that case, maybe "Range Logic Faceting" ?
>
> Relation seems too broad and too overloaded elsewhere, makes me think of 
> RDBMS, related-ness, joins and such via word associations.
>
> On Mon, Dec 12, 2022 at 3:27 PM Greg Miller  wrote:
>>
>> Thanks for the suggestion! I like the descriptiveness of it. My only 
>> hesitation is that it supports more than range intersection based on the 
>> provided QueryType instance (e.g., within, contains). I _imagine_ that 
>> intersection will be most common, but I don’t really know of course. I 
>> thought about generalizing your suggestion to something like “Range Relation 
>> Faceting,” but fear that would be confusing.
>>
>> Thanks again!
>>
>> Cheers,
>> -Greg
>>
>> On Mon, Dec 12, 2022 at 10:19 Gus Heck  wrote:
>>>
>>> Maybe "Range Intersect Faceting"?
>>>
>>> On Mon, Dec 12, 2022 at 1:11 PM Greg Miller  wrote:

 Folks-

 Naming is hard! (But you all know that already).

 Marc D'Mello and I have been working on a new faceting implementation 
 that's meant to complement Lucene's existing range-relation queries (e.g., 
 LongRange#newIntersectsQuery, DoubleRange#newContainsQuery, 
 LongRangeDocValuesField#newSlowIntersectsQuery, etc.). Well, I should say 
 Marc is working on the change and I'm just providing nit-picky feedback on 
 his PR, which is here: https://github.com/apache/lucene/pull/11901. The 
 general idea of this feature is to allow users to get facet counts for 
 these sorts of range-relation filters before they're applied. For example, 
 if a user is indexing ranges with their documents, they may have a set of 
 query-ranges they want to facet on, based on some range relationship 
 (e.g., intersection, contains, etc.).

 As a concrete example, imagine that documents contain a price range (maybe 
 a document represents some e-commerce product but the price varies based 
 on some configuration options), and a user wants to build a price range 
 filter that applies filtering based on whether-or-not the two ranges 
 intersect (i.e., DoubleRange#newIntersectsQuery to apply a price range 
 filter). This user wants faceting capabilities over the different price 
 ranges they want to make available, so they need a way to facet over a 
 list of provided query-ranges, based on the "intersect" relationship with 
 the doc-encoded ranges. That's what Marc's "RangeOnRange" faceting is 
 trying to accomplish.

 In my opinion, the PR is really close to being ready (thanks again Marc!), 
 but I'm wondering if we can come up with a more descriptive name. As it 
 currently stands, the feature is termed "RangeOnRange Faceting," which 
 feels just a bit wonky to me. That said, I can't really come up with 
 anything better.

 ** Does anyone have suggestions on a better name? **

 Any / all suggestions appreciated! (And of course, any other input on the 
 PR is welcome if anyone is interested).

 Cheers,
 -Greg
>>>
>>>
>>>
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)



-- 
Adrien




Re: [lucene] branch main updated: More refactoring work, and fix a distance calculation.

2022-11-24 Thread Adrien Grand
Karl, this commit has been failing precommit because it introduced
dead code. I just pushed a fix.


On Thu, Nov 24, 2022 at 10:47 AM  wrote:
>
> This is an automated email from the ASF dual-hosted git repository.
>
> kwright pushed a commit to branch main
> in repository https://gitbox.apache.org/repos/asf/lucene.git
>
>
> The following commit(s) were added to refs/heads/main by this push:
>  new 839dfb5a2dc More refactoring work, and fix a distance calculation.
> 839dfb5a2dc is described below
>
> commit 839dfb5a2dc46c4b2d16d9db5ea9f31ca1e8d907
> Author: Karl David Wright 
> AuthorDate: Wed Nov 23 23:36:15 2022 -0500
>
> More refactoring work, and fix a distance calculation.
> ---
>  .../lucene/spatial3d/geom/GeoDegeneratePath.java   | 32 ++---
>  .../lucene/spatial3d/geom/GeoStandardPath.java | 54 
> --
>  .../apache/lucene/spatial3d/geom/TestGeoPath.java  | 12 +++--
>  3 files changed, 62 insertions(+), 36 deletions(-)
>
> diff --git 
> a/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoDegeneratePath.java
>  
> b/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoDegeneratePath.java
> index 524451ac68a..d1a452ca566 100644
> --- 
> a/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoDegeneratePath.java
> +++ 
> b/lucene/spatial3d/src/java/org/apache/lucene/spatial3d/geom/GeoDegeneratePath.java
> @@ -282,7 +282,7 @@ class GeoDegeneratePath extends GeoBasePath {
>  minDistance = newDistance;
>}
>  }
> -return minDistance;
> +return distanceStyle.fromAggregationForm(minDistance);
>}
>
>@Override
> @@ -468,6 +468,15 @@ class GeoDegeneratePath extends GeoBasePath {
>return this.point.isIdentical(x, y, z);
>  }
>
> +public boolean isWithinSection(final double x, final double y, final 
> double z) {
> +  for (final Membership cutoffPlane : cutoffPlanes) {
> +if (!cutoffPlane.isWithin(x, y, z)) {
> +  return false;
> +}
> +  }
> +  return true;
> +}
> +
>  /**
>   * Compute interior path distance.
>   *
> @@ -502,7 +511,7 @@ class GeoDegeneratePath extends GeoBasePath {
>return Double.POSITIVE_INFINITY;
>  }
>}
> -  return distanceStyle.computeDistance(this.point, x, y, z);
> +  return 
> distanceStyle.toAggregationForm(distanceStyle.computeDistance(this.point, x, 
> y, z));
>  }
>
>  /**
> @@ -516,7 +525,7 @@ class GeoDegeneratePath extends GeoBasePath {
>   */
>  public double outsideDistance(
>  final DistanceStyle distanceStyle, final double x, final double y, 
> final double z) {
> -  return distanceStyle.computeDistance(this.point, x, y, z);
> +  return 
> distanceStyle.toAggregationForm(distanceStyle.computeDistance(this.point, x, 
> y, z));
>  }
>
>  /**
> @@ -578,7 +587,7 @@ class GeoDegeneratePath extends GeoBasePath {
>
>  @Override
>  public String toString() {
> -  return point.toString();
> +  return "SegmentEndpoint: " + point;
>  }
>}
>
> @@ -659,6 +668,10 @@ class GeoDegeneratePath extends GeoBasePath {
>&& normalizedConnectingPlane.evaluateIsZero(x, y, z);
>  }
>
> +public boolean isWithinSection(final double x, final double y, final 
> double z) {
> +  return startCutoffPlane.isWithin(x, y, z) && 
> endCutoffPlane.isWithin(x, y, z);
> +}
> +
>  /**
>   * Compute path center distance (distance from path to current point).
>   *
> @@ -671,7 +684,7 @@ class GeoDegeneratePath extends GeoBasePath {
>  public double pathCenterDistance(
>  final DistanceStyle distanceStyle, final double x, final double y, 
> final double z) {
>// First, if this point is outside the endplanes of the segment, 
> return POSITIVE_INFINITY.
> -  if (!startCutoffPlane.isWithin(x, y, z) || !endCutoffPlane.isWithin(x, 
> y, z)) {
> +  if (!isWithinSection(x, y, z)) {
>  return Double.POSITIVE_INFINITY;
>}
>// (1) Compute normalizedPerpPlane.  If degenerate, then there is no 
> such plane, which means
> @@ -710,7 +723,7 @@ class GeoDegeneratePath extends GeoBasePath {
>"Can't find world intersection for point x=" + x + " y=" + y + 
> " z=" + z);
>  }
>}
> -  return distanceStyle.computeDistance(thePoint, x, y, z);
> +  return 
> distanceStyle.toAggregationForm(distanceStyle.computeDistance(thePoint, x, y, 
> z));
>  }
>
>  /**
> @@ -726,7 +739,7 @@ class GeoDegeneratePath extends GeoBasePath {
>  public double nearestPathDistance(
>  final DistanceStyle distanceStyle, final double x, final double y, 
> final double z) {
>// First, if this point is outside the endplanes of the segment, 
> return POSITIVE_INFINITY.
> -  if (!startCutoffPlane.isWithin(x, y, z) || !endCutoffPlane.isWithin(x, 
> y, z)) {
> +  if (!isWithinSection(x, y, z)) {
>  return Double.POSITIVE_INFINITY;
>   

[ANNOUNCE] Apache Lucene 9.4.2 released

2022-11-23 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.4.2

Apache Lucene is a high-performance, full-featured search engine library
written entirely in Java. It is a technology suitable for nearly any
application that requires structured search, full-text search, faceting,
nearest-neighbor search on high-dimensionality vectors, spell correction or
query suggestions.

This patch release contains an important fix for a bug affecting version
9.4.1. The release is available for immediate download at:
  https://lucene.apache.org/core/downloads.html

Lucene 9.4.2 Release Highlights

Bug fixes
 - Fixed integer overflow when opening segments containing more than ~16M
KNN vectors.
 - Fixed cost computation of BitSets created via DocIdSetBuilder, such as
for multi-term queries. This may improve performance of multi-term queries.

Enhancements
 - CheckIndex now verifies the consistency of KNN vectors more thoroughly.

Further details of changes are available in the change log available at:
https://lucene.apache.org/core/9_4_2/changes/Changes.html.

Please report any feedback to the mailing lists (
http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation now uses a content distribution
network (CDN) for distributing releases.

-- 
Adrien


Re: Main website not building

2022-11-22 Thread Adrien Grand
Thanks Uwe!

On Tue, Nov 22, 2022 at 6:54 PM Uwe Schindler  wrote:

> Hi I fixed this.
>
> This was caused because, due to merging from main->production, the fix of
> .asf.yaml was merged, too (this was caused by repairing the problems from
> lucenepy with a duplicate master branch). So the production branch was
> building but was deployed to staging.
>
> I tried to exclude .asf.yaml from any merging, but there is no way to put a
> "sticky" bit on it. If anybody has an idea how to tell git to never touch
> .asf.yaml while merging, tell me!
>
> Uwe
> Am 22.11.2022 um 18:37 schrieb Adrien Grand:
>
> Hello,
>
> I've managed to make changes to the website for 9.4.2 and they are
> correctly reflected on lucene.staged.apache.org. However pushing to the
> `production` branch doesn't seem to trigger a build on
> https://ci2.apache.org/#/builders/3 and I'm not seeing the production
> website getting updated either.
>
> Is someone familiar with how the build hooks with git pushes and could
> give me pointers to debug why pushes to the production branch are not
> triggering builds?
>
> --
> Adrien
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>

-- 
Adrien


Main website not building

2022-11-22 Thread Adrien Grand
Hello,

I've managed to make changes to the website for 9.4.2 and they are
correctly reflected on lucene.staged.apache.org. However pushing to the
`production` branch doesn't seem to trigger a build on
https://ci2.apache.org/#/builders/3 and I'm not seeing the production
website getting updated either.

Is someone familiar with how the build hooks with git pushes and could give
me pointers to debug why pushes to the production branch are not triggering
builds?

-- 
Adrien


[RESULT] [VOTE] Release Lucene 9.4.2 RC1

2022-11-21 Thread Adrien Grand
It's been >72h since the vote was initiated and the result is:

+1  8  (8 binding)
 0  0
-1  0

This vote has PASSED.

On Sat, Nov 19, 2022 at 3:22 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> +1
>
> SUCCESS! [0:27:27.923430]
>
> I also see the same GPG warning as Mike S but it's likely a local gpg
> problem for me too ;)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Nov 18, 2022 at 10:35 AM Michael Sokolov 
> wrote:
>
>> (I don't really believe the .asc files are broken; probably a local
>> gpg problem I don't understand)
>>
>> SUCCESS! [0:44:08.338731]
>> +1 from me
>>
>> On Fri, Nov 18, 2022 at 10:18 AM Uwe Schindler  wrote:
>> >
>> > Hi,
>> >
>> > the second build succeeded. I really think it was another job running
>> at the same time that also tried to communicate with GPG and used another home
>> dir.
>> >
>> > Log: https://jenkins.thetaphi.de/job/Lucene-Release-Tester/25/console
>> >
>> > SUCCESS! [1:43:46.817984]
>> > Finished: SUCCESS
>> >
>> > After jenkins finished the job it killed all child processes and all
>> agents are gone.
>> >
>> > In the meantime I also did some manual checks: Running Luke from
>> windows with whitespace in dir worked and I was able to open my test index.
>> I also started with Java 19 and --enable-preview and the Luke log showed
>> that it uses the new MMapDire impl.
>> >
>> > I correct my previous vote: ++1 to release. 
>> >
>> > Uwe
>> >
>> > Am 18.11.2022 um 16:06 schrieb Uwe Schindler:
>> >
>> > I had also seen this message. My guess: Another build was running in
>> Jenkins that also spawned an agent with different home dir! I think Robert
>> already talked about this. We should kill the agents before/after we have
>> used them.
>> >
>> > Uwe
>> >
>> > Am 18.11.2022 um 15:47 schrieb Adrien Grand:
>> >
>> > Reading Uwe's error message more carefully, I had first assumed that
>> the GPG failure was due to the lack of an ultimately trusted signature, but
>> it seems like it's due to "can't connect to the agent: IPC connect call
>> failed" actually, which suggests an issue with the GPG agent?
>> >
>> > On Fri, Nov 18, 2022 at 3:00 PM Michael Sokolov 
>> wrote:
>> >>
>> >> I got this message when initially downloading the artifacts:
>> >>
>> >> Downloading
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db/lucene/lucene-9.4.2-src.tgz.asc
>> >> File:
>> /tmp/smoke_lucene_9.4.2_858d9b437047a577fa9457089afff43eefa461db/lucene.lucene-9.4.2-src.tgz.gpg.verify.log
>> >> verify trust
>> >>   GPG: gpg: WARNING: This key is not certified with a trusted
>> signature!
>> >>
>> >> is it related?
>> >>
>> >> On Fri, Nov 18, 2022 at 8:43 AM Uwe Schindler  wrote:
>> >> >
>> >> > The problem is: it has been working like this for years - the 9.4.1
>> release worked fine. No change!
>> >> >
>> >> > And I can't configure this because GPG uses its own home directory
>> setup by smoke tester (see paths below). So it should not look anywhere
>> else? In addition "gpg: no ultimately trusted keys found" is just a
>> warning, it should not cause gpg to exit.
>> >> >
>> >> > Also, why does it only happen at the time of Maven? It checks
>> signatures before, too. This is why I restarted the build:
>> https://jenkins.thetaphi.de/job/Lucene-Release-Tester/25/console (still
>> running)
>> >> >
>> >> > Uwe
>> >> >
>> >> > Am 18.11.2022 um 14:21 schrieb Adrien Grand:
>> >> >
>> >> > Uwe, the error message suggests that Policeman Jenkins is not
>> ultimately trusting any of the keys. Does it work if you configure it to
>> ultimately trust your "Uwe Schindler (CODE SIGNING KEY) <
>> uschind...@apache.org>" key (which I assume you would be ok with)?
>> >> >
>> >> > On Fri, Nov 18, 2022 at 2:18 PM Uwe Schindler 
>> wrote:
>> >> >>
>> >> >> I am restarting the build, maybe it was some hiccup. Interestingly
>> it only failed for the Maven dependencies. P.S.: Why does it import the key
>> file over and over? It would be enough to do this once at the beginning of
>> the smoker.

Re: [VOTE] Release Lucene 9.4.2 RC1

2022-11-18 Thread Adrien Grand
Reading Uwe's error message more carefully, I had first assumed that the
GPG failure was due to the lack of an ultimately trusted signature, but it
seems like it's due to "can't connect to the agent: IPC connect call
failed" actually, which suggests an issue with the GPG agent?

On Fri, Nov 18, 2022 at 3:00 PM Michael Sokolov  wrote:

> I got this message when initially downloading the artifacts:
>
> Downloading
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db/lucene/lucene-9.4.2-src.tgz.asc
> File:
> /tmp/smoke_lucene_9.4.2_858d9b437047a577fa9457089afff43eefa461db/lucene.lucene-9.4.2-src.tgz.gpg.verify.log
> verify trust
>   GPG: gpg: WARNING: This key is not certified with a trusted
> signature!
>
> is it related?
>
> On Fri, Nov 18, 2022 at 8:43 AM Uwe Schindler  wrote:
> >
> > The problem is: it has been working like this for years - the 9.4.1
> release worked fine. No change!
> >
> > And I can't configure this because GPG uses its own home directory setup
> by smoke tester (see paths below). So it should not look anywhere else? In
> addition "gpg: no ultimately trusted keys found" is just a warning, it
> should not cause gpg to exit.
> >
> > Also, why does it only happen at the time of Maven? It checks signatures
> before, too. This is why I restarted the build:
> https://jenkins.thetaphi.de/job/Lucene-Release-Tester/25/console (still
> running)
> >
> > Uwe
> >
> > Am 18.11.2022 um 14:21 schrieb Adrien Grand:
> >
> > Uwe, the error message suggests that Policeman Jenkins is not ultimately
> trusting any of the keys. Does it work if you configure it to ultimately
> trust your "Uwe Schindler (CODE SIGNING KEY) " key
> (which I assume you would be ok with)?
> >
> > On Fri, Nov 18, 2022 at 2:18 PM Uwe Schindler  wrote:
> >>
> >> I am restarting the build, maybe it was some hiccup. Interestingly it
> only failed for the Maven dependencies. P.S.: Why does it import the key
> file over and over? It would be enough to do this once at the beginning of
> the smoker.
> >>
> >> Uwe
> >>
> >> Am 18.11.2022 um 14:12 schrieb Uwe Schindler:
> >>
> >> Hi,
> >>
> >> I get a failure because your key is somehow rejected by GPG (Ubuntu
> 22.04):
> >>
> >> https://jenkins.thetaphi.de/job/Lucene-Release-Tester/24/console
> >>
> >> verify maven artifact sigs command "gpg --homedir
> /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg --import
> /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/KEYS" failed:
> gpg: keybox '/home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg/pubring.kbx' created
> gpg: /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg/trustdb.gpg: trustdb created
> gpg: key B83EA82A0AFCEE7C: public key "Yonik Seeley <yo...@apache.org>" imported
> gpg: can't connect to the agent: IPC connect call failed
> gpg: key E48025ED13E57FFC: public key "Upayavira <u...@odoko.co.uk>" imported
> [...]
> gpg: key 051A0FAF76BC6507: public key "Adrien Grand (CODE SIGNING KEY) " imported
> [...]
> gpg: key 32423B0E264B5CBA: public key "Julie Tibshirani (New code signing key) " imported
> gpg: Total number processed: 62
> gpg: imported: 62
> gpg: no ultimately trusted keys found
> >> It looks like it succeeds for others? No idea why. Maybe Ubuntu 22.04
> has a too-new GPG or it needs to use gpg2?
> >>
> >> -1 to release until this is sorted out.
> >>
> >> Uwe
> >>
> >> Am 17.11.2022 um 15:18 schrieb Adrien Grand:
> >>
> >> Please vote for release candidate 1 for Lucene 9.4.2
> >>
> >> The artifacts can be downloaded from:
> >>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db
> >>
> >> You can run the smoke tester directly with this command:
> >>
> >> python3 -u dev-tools/scripts/smokeTestRelease.py \
> >>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db
> >>
> >> The vote will be open for at least 72 hours i.e. until 2022-11-20 15:00
> UTC.
> >>
> >> [ ] +1  approve
> >> [ ] +0  no opinion
> >> [ ] -1  disapprove (and reason why)
> >>
> >> Here is my +1.
> >>
> >> --
> >> Adrien
> >>
> >> --
> >> Uwe Schindler
> >> Achterdiek 19, D-28357 Bremen
> >> https://www.thetaphi.de
> >> eMail: u...@thetaphi.de
> >>
> >> --
> >> Uwe Schindler
> >> Achterdiek 19, D-28357 Bremen
> >> https://www.thetaphi.de
> >> eMail: u...@thetaphi.de
> >
> >
> >
> > --
> > Adrien
> >
> > --
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de
> > eMail: u...@thetaphi.de
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-- 
Adrien


Re: [VOTE] Release Lucene 9.4.2 RC1

2022-11-18 Thread Adrien Grand
Uwe, the error message suggests that Policeman Jenkins is not ultimately
trusting any of the keys. Does it work if you configure it to ultimately
trust your "Uwe Schindler (CODE SIGNING KEY) " key
(which I assume you would be ok with)?
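
For reference, a minimal sketch of marking a key as ultimately trusted in a
dedicated homedir (the fingerprint and homedir below are placeholders, not
values from this thread):

    # <FINGERPRINT> stands for the full 40-character fingerprint of the key;
    # trust level 6 means "ultimate" in gpg's ownertrust format
    echo "<FINGERPRINT>:6:" | gpg --homedir /path/to/smoketmp/lucene.gpg --import-ownertrust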

On Fri, Nov 18, 2022 at 2:18 PM Uwe Schindler  wrote:

> I am restarting the build, maybe it was some hiccup. Interestingly it only
> failed for the Maven dependencies. P.S.: Why does it import the key file
> over and over? It would be enough to do this once at the beginning of the
> smoker.
>
> Uwe
> Am 18.11.2022 um 14:12 schrieb Uwe Schindler:
>
> Hi,
>
> I get a failure because your key is somehow rejected by GPG (Ubuntu 22.04):
>
> https://jenkins.thetaphi.de/job/Lucene-Release-Tester/24/console
>
> verify maven artifact sigs command "gpg --homedir
> /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg --import
> /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/KEYS" failed:
> gpg: keybox '/home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg/pubring.kbx' created
> gpg: /home/jenkins/workspace/Lucene-Release-Tester/smoketmp/lucene.gpg/trustdb.gpg: trustdb created
> gpg: key B83EA82A0AFCEE7C: public key "Yonik Seeley " imported
> gpg: can't connect to the agent: IPC connect call failed
> gpg: key E48025ED13E57FFC: public key "Upayavira " imported
> [...]
> gpg: key 051A0FAF76BC6507: public key "Adrien Grand (CODE SIGNING KEY) " imported
> [...]
> gpg: key 32423B0E264B5CBA: public key "Julie Tibshirani (New code signing key) " imported
> gpg: Total number processed: 62
> gpg: imported: 62
> gpg: no ultimately trusted keys found
> It looks like it succeeds for others? No idea why. Maybe Ubuntu 22.04 has
> a too-new GPG or it needs to use gpg2?
>
> -1 to release until this is sorted out.
>
> Uwe
> Am 17.11.2022 um 15:18 schrieb Adrien Grand:
>
> Please vote for release candidate 1 for Lucene 9.4.2
>
> The artifacts can be downloaded from:
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db
>
> You can run the smoke tester directly with this command:
>
> python3 -u dev-tools/scripts/smokeTestRelease.py \
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.4.2-RC1-rev-858d9b437047a577fa9457089afff43eefa461db
>
> The vote will be open for at least 72 hours i.e. until 2022-11-20 15:00
> UTC.
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
> Here is my +1.
>
> --
> Adrien
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>

-- 
Adrien

