Re: Lucene 10

2024-03-18 Thread Luca Cavanna
Hey Patrick,
your help on search concurrency will be much appreciated :) I have a
very hacky branch that I'd like to use as a base for discussing
the issues I found and the adjustments they needed. Lots to do there. I will
ping you once I put up a draft PR.

Cheers
Luca

On Fri, Mar 15, 2024 at 9:55 PM Patrick Zhai  wrote:

> Thanks Adrien +1 to the timelines.
>
> I'm also willing to work on / review the "Decouple search concurrency from
> index geometry" task;
> that's a very nice one to have for latency-sensitive applications
> (rather than having to tune the
> merge policy case by case). But I cannot guarantee anything yet, so if
> others are also
> working on it I'm happy to share the ideas / efforts (if any).
>
> Patrick
>
> On Thu, Mar 14, 2024 at 12:09 PM Michael Sokolov 
> wrote:
>
>> timing makes sense to me. +1 for having a deadline to reduce
>> procrastination, but Adrien I don't honestly believe anyone who is
>> paying attention thinks that is what you have been doing!
>>
>> On Wed, Mar 13, 2024 at 10:40 AM Adrien Grand  wrote:
>> >
>> > Hello everyone!
>> >
>> > It's been ~2.5 years since we released Lucene 9.0 (December 2021) and
>> I'd like us to start working towards Lucene 10.0. I'm volunteering for
>> being the release manager and propose the following timeline:
>> >  - ~September 15th: main gets bumped to 11.x, branch_10x gets created
>> >  - ~September 22nd: Do a last 9.x minor release.
>> >  - ~October 1st: Release 10.0.
>> >
>> > This may sound like a long notice period. My motivation is that there
>> are a few changes I have on my mind that are likely worthy of a major
>> release, and I plan on taking advantage of a date being set to stop
>> procrastinating and finally start moving these enhancements forward. These
>> are not blockers, only my wish list for Lucene 10.0; if they are not ready
>> in time, we can discuss letting them slip until the next
>> major.
>> >  - Greater I/O concurrency. Can Lucene better utilize modern disks that
>> are plenty concurrent?
>> >  - Decouple search concurrency from index geometry. Can Lucene better
>> utilize modern CPUs that are plenty concurrent?
>> >  - "Sparse indexing" / "zone indexing" for sorted indexes. This is one
>> of the most efficient techniques that OLAP databases take advantage of to
>> make search fast. Let's bring it to Lucene.
>> >
>> > This list isn't meant to be an exhaustive list of release highlights
>> for Lucene 10, feel free to add your own. There are also a number of
>> cleanups we may want to consider. I wanted to share this list for
>> visibility though in case you have thoughts on these enhancements and/or
>> would like to help.
>> >
>> > --
>> > Adrien
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
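The "sparse indexing" / "zone indexing" technique Adrien mentions keeps tiny per-block summaries (typically the min/max of the sort key) so whole blocks of a sorted index can be skipped at query time. A toy Python sketch of the idea (block size and function names are illustrative, not Lucene's API):

```python
def build_zone_map(sorted_values, block_size=4):
    # one (min, max) summary per block of a column sorted by the index sort
    return [(block[0], block[-1])
            for block in (sorted_values[i:i + block_size]
                          for i in range(0, len(sorted_values), block_size))]

def candidate_blocks(zone_map, lo, hi):
    # a block is skipped when its [min, max] range cannot overlap [lo, hi]
    return [i for i, (bmin, bmax) in enumerate(zone_map)
            if bmax >= lo and bmin <= hi]

values = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23]
zm = build_zone_map(values)
print(zm)                            # [(1, 7), (9, 15), (17, 23)]
print(candidate_blocks(zm, 10, 14))  # [1]: only the middle block needs scanning
```

The range query only has to scan blocks whose summaries overlap it, which is where the OLAP-style speedup comes from.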


Welcome Ben Trent to the Lucene PMC

2024-02-22 Thread Luca Cavanna
I'm pleased to announce that Ben Trent has accepted an invitation to join
the Lucene PMC!

Congratulations Ben, and welcome aboard!


Cheers
Luca


Re: Welcome Zhang Chao as Lucene committer

2024-02-20 Thread Luca Cavanna
Congrats and welcome

On Tue, Feb 20, 2024 at 6:28 PM Adrien Grand  wrote:

> I'm pleased to announce that Zhang Chao has accepted the PMC's
> invitation to become a committer.
>
> Chao, the tradition is that new committers introduce themselves with a
> brief bio.
>
> Congratulations and welcome!
>
> --
> Adrien
>


Re: [VOTE] Release Lucene 9.10.0 RC1

2024-02-15 Thread Luca Cavanna
+1

SUCCESS! [0:44:04.729177]

On Thu, Feb 15, 2024 at 8:10 PM Tomás Fernández Löbbe 
wrote:

> +1
>
> SUCCESS! [1:22:54.621515]
>
> On Thu, Feb 15, 2024 at 7:13 AM Robert Muir  wrote:
>
>> On Thu, Feb 15, 2024 at 9:54 AM Uwe Schindler  wrote:
>> >
>> > Hi,
>> >
>> > My Python knowledge is too limited to fix the build script to allow testing
>> the smoker with arbitrary JAVA_HOME directories next to the baseline
>> (Java 11). With lots of copy-paste I can make it run on Java 21 in addition
>> to 17, but that seems too inflexible.
>> >
>> > Mike McCandless: If you could help me make it more flexible, that would be
>> good. I can open an issue, unless you already have an easy solution. I am
>> thinking of the following:
>> >
>> > JAVA_HOME must be Java 11 (in 9.x)
>> > At the moment you can pass "--test-java17 ", but this one is also
>> checked to be really Java 17 (by parsing strings from its version output).
>> I'd like to pass "--test-alternative-java " multiple times and it
>> would just run all those as part of smoking; maybe the version number can
>> be extracted to be printed out.
>> >
>> > To me this is a hopeless task with Python.
>> >
>> > Uwe
>> >
>> > Am 15.02.2024 um 12:50 schrieb Uwe Schindler:
>> >
>>
>> I opened https://github.com/apache/lucene/issues/13107 as I have
>> struggled with the smoke tester's Java 21 support too. Java is moving
>> faster these days; we should make it easier to maintain the script.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
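Uwe's wish for a repeatable "--test-alternative-java" option instead of per-version flags maps naturally onto argparse's append action. A minimal sketch (option and attribute names are illustrative, not the actual smokeTestRelease.py code):

```python
import argparse

# Sketch: accept any number of alternative JVMs to smoke test, instead of a
# single version-checked --test-java17 flag (names here are hypothetical).
parser = argparse.ArgumentParser(description="smoke tester option sketch")
parser.add_argument("--test-alternative-java", action="append", default=[],
                    dest="alt_javas", metavar="JAVA_HOME",
                    help="extra JAVA_HOME to run the smoke tests with; repeatable")

args = parser.parse_args(["--test-alternative-java", "/opt/jdk-17",
                          "--test-alternative-java", "/opt/jdk-21"])
print(args.alt_javas)  # ['/opt/jdk-17', '/opt/jdk-21']
```

Each occurrence of the flag appends another JAVA_HOME to the list, so the script could simply loop over `args.alt_javas` and run the smoke tests once per JVM, printing whatever version string each one reports.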


Re: Welcome Stefan Vodita as Lucene committer

2024-01-18 Thread Luca Cavanna
Congratulations and welcome!

On Thu, Jan 18, 2024 at 5:21 PM Dawid Weiss  wrote:

>
> Welcome, Stefan!
> Dawid
>
> On Thu, Jan 18, 2024 at 4:54 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hi Team,
>>
>> I'm pleased to announce that Stefan Vodita has accepted the Lucene PMC's
>> invitation to become a committer!
>>
>> Stefan, the tradition is that new committers introduce themselves with a
>> brief bio.
>>
>> Congratulations, welcome, and thank you for all your improvements to
>> Lucene and our community,
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>


Re: The need for a Lucene 9.9.1 release

2023-12-09 Thread Luca Cavanna
I believe your assessment that it is "only" a read problem is correct. I
can see how using the "corruption" wording may have caused confusion. It is
a severe bug though that affects multi-term queries, and I thought it a
good idea to patch it, given that folks have reproduced it and found the
root cause. I agree on adding tests that cover it and not rushing anything
out, yet people upgrading to 9.9.0 are affected by it, and that seems bad.
Thanks for the feedback.




On Sat, Dec 9, 2023 at 12:31 PM Robert Muir  wrote:

> I don't understand the use of the word corruption; isn't it just a bug in
> intersect() that only affects wildcards etc.? E.g. it's not gonna merge
> into new segments or impact written data in any way.
>
> And I don't think we should rush out some bugfix release without any
> test for this?
>
> On Sat, Dec 9, 2023 at 5:30 AM Luca Cavanna  wrote:
> >
> > Based on the discussions in
> https://github.com/apache/lucene/issues/12895 , it seems like reverting
> the change that caused the corruption on read is the quickest fix, so that
> we can speed up releasing 9.9.1. I opened a PR for that:
> https://github.com/apache/lucene/pull/12899. Is there additional testing
> that needs to be done to ensure that this is enough to address the
> corruption?
> >
> > Regarding a fix for the JVM SIGSEGV crash, how far are we from a fix
> that protects Lucene from it? Should we wait for that to be included in
> 9.9.1? Asking because the corruption above looks like it needs to be
> addressed rather quickly. It would be great to include both, but I don't
> know how long that delays 9.9.1.
> >
> > Cheers
> > Luca
> >
> >
> >
> > On Sat, Dec 9, 2023 at 11:13 AM Chris Hegarty
>  wrote:
> >>
> >> Oh, and I’m happy to be Release Manager for 9.9.1 (given my recent
> experience on 9.9.0)
> >>
> >> -Chris.
> >>
> >> > On 9 Dec 2023, at 09:09, Chris Hegarty <
> christopher.hega...@elastic.co> wrote:
> >> >
> >> > Hi,
> >> >
> >> > We’ve encountered two very serious issues with the recent Lucene 9.9.0
> release, both of which (even if taken by themselves) would warrant a 9.9.1.
> The issues are:
> >> >
> >> > 1. https://github.com/apache/lucene/issues/12895 - Corruption read
> on term dictionaries in Lucene 9.9
> >> >
> >> > 2. https://github.com/apache/lucene/issues/12898 - JVM SIGSEGV crash
> when compiling computeCommonPrefixLengthAndBuildHistogram Lucene 9.9.0
> >> >
> >> > There is still a little investigation and work left to bring these
> issues to a point where we’re comfortable with proposing a solution. I
> would be hopeful that we’ll get there by early next week. If so, then a
> Lucene 9.9.1 release can be proposed.
> >> >
> >> > Thanks,
> >> > -Chris.
> >>
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: The need for a Lucene 9.9.1 release

2023-12-09 Thread Luca Cavanna
Based on the discussions in https://github.com/apache/lucene/issues/12895 ,
it seems like reverting the change that caused the corruption on read is
the quickest fix, so that we can speed up releasing 9.9.1. I opened a PR
for that: https://github.com/apache/lucene/pull/12899. Is there additional
testing that needs to be done to ensure that this is enough to address the
corruption?

Regarding a fix for the JVM SIGSEGV crash, how far are we from a fix that
protects Lucene from it? Should we wait for that to be included in 9.9.1?
Asking because the corruption above looks like it needs to be addressed
rather quickly. It would be great to include both, but I don't know how
long that delays 9.9.1.

Cheers
Luca



On Sat, Dec 9, 2023 at 11:13 AM Chris Hegarty
 wrote:

> Oh, and I’m happy to be Release Manager for 9.9.1 (given my recent
> experience on 9.9.0)
>
> -Chris.
>
> > On 9 Dec 2023, at 09:09, Chris Hegarty 
> wrote:
> >
> > Hi,
> >
> > We’ve encountered two very serious issues with the recent Lucene 9.9.0
> release, both of which (even if taken by themselves) would warrant a 9.9.1.
> The issues are:
> >
> > 1. https://github.com/apache/lucene/issues/12895 - Corruption read on
> term dictionaries in Lucene 9.9
> >
> > 2. https://github.com/apache/lucene/issues/12898 - JVM SIGSEGV crash
> when compiling computeCommonPrefixLengthAndBuildHistogram Lucene 9.9.0
> >
> > There is still a little investigation and work left to bring these
> issues to a point where we’re comfortable with proposing a solution. I
> would be hopeful that we’ll get there by early next week. If so, then a
> Lucene 9.9.1 release can be proposed.
> >
> > Thanks,
> > -Chris.
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: [VOTE] Release Lucene 9.9.0 RC2

2023-12-01 Thread Luca Cavanna
SUCCESS! [0:34:53.150902]


+1


On Fri, Dec 1, 2023 at 9:06 AM Ignacio Vera  wrote:

> SUCCESS! [1:20:23.570231]
>
>
> +1
>
> On Fri, Dec 1, 2023 at 6:55 AM Shyamsunder Mutcha 
> wrote:
>
>> SUCCESS! [0:38:41.054860]
>> +1
>>
>> On Thu, Nov 30, 2023 at 9:59 PM Nhat Nguyen
>>  wrote:
>>
>>> SUCCESS! [1:22:43.808415]
>>>
>>> +1
>>>
>>> On Thu, Nov 30, 2023 at 6:09 PM Christian Moen  wrote:
>>>
 SUCCESS! [1:49:26.873909]

 +1

 On Fri, Dec 1, 2023 at 3:09 AM Chris Hegarty
  wrote:

> Please vote for release candidate 2 for Lucene 9.9.0
>
>
> The artifacts can be downloaded from:
>
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.0-RC2-rev-06070c0dceba07f0d33104192d9ac98ca16fc500
>
>
> You can run the smoke tester directly with this command:
>
>
> python3 -u dev-tools/scripts/smokeTestRelease.py \
>
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.0-RC2-rev-06070c0dceba07f0d33104192d9ac98ca16fc500
>
>
> The vote will be open for at least 72 hours, and given the weekend in
> between, let’s keep it open until 2023-12-04 12:00 UTC.
>
> [ ] +1  approve
>
> [ ] +0  no opinion
>
> [ ] -1  disapprove (and reason why)
>
>
> Here is my +1
>
>
> -Chris.
>
>


Re: GitHub issues vs PRs vs Lucene's CHANGES.txt

2023-11-30 Thread Luca Cavanna
Sounds like we could automate assigning the milestone, given that it is a
commonly forgotten step, based on the section of CHANGES where the PR gets
added?

I am pretty sure that I have forgotten to add entries to CHANGES too. Maybe
that could be suggested in GitHub: whenever there's a PR that does not touch
CHANGES.txt, more often than not it's a mistake?

I am wondering if it still makes sense to have to track changes associated
with versions in both milestones as well as the CHANGES.txt file. There is
some duplication there. Could the CHANGES file be generated from the
milestone, if it was set correctly, with the description of each change taken
from the title of the PR? Sorry if I am bringing up something that has been
discussed before.
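Generating CHANGES entries from a milestone, as speculated above, could amount to formatting each merged PR's number, title, and authors. A rough sketch, with inline sample data standing in for what the GitHub API would return:

```python
def changes_entries(prs):
    # prs: merged PRs for a milestone (e.g. fetched from the GitHub API);
    # format each one in Lucene's "* GITHUB#...: title (authors)" style
    return [f"* GITHUB#{pr['number']}: {pr['title']} ({pr['authors']})"
            for pr in prs]

sample = [{"number": 12748,
           "title": "Specialize arc store for continuous label in FST.",
           "authors": "Guo Feng, Zhang Chao"}]
print("\n".join(changes_entries(sample)))
```

Attribution and the split into "New Features" / "Improvements" / "Bug Fixes" sections would still need labels or conventions on the PRs themselves, which is where this gets harder than the formatting.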

On Thu, Nov 30, 2023 at 11:03 PM Dongyu Xu  wrote:

> Hopefully this is relevant.
>
> There are useful tools like git-cliff​ for automating changelog
> generation.
>
> https://github.com/orhun/git-cliff
>
> Tony X
> --
> *From:* Michael McCandless 
> *Sent:* Thursday, November 30, 2023 4:30 AM
> *To:* dev@lucene.apache.org 
> *Subject:* Re: GitHub issues vs PRs vs Lucene's CHANGES.txt
>
> Well, I created a starting tool to at least help us keep the
> what-should-be-identical-yet-is-nearly-impossible-for-us-to-achieve
> sections in CHANGES.txt in sync:
> https://github.com/apache/lucene/pull/12860
>
> Right now it finds a number of mostly minor differences in the 9.9.0
> sections in main vs branch_9_9:
>
> NOTE: resolving branch_9_9 -->
> https://raw.githubusercontent.com/apache/lucene/branch_9_9/lucene/CHANGES.txt
> NOTE: resolving main -->
> https://raw.githubusercontent.com/apache/lucene/main/lucene/CHANGES.txt
> 15a16,18
> > * GITHUB#12646, GITHUB#12690: Move FST#addNode to FSTCompiler to avoid a
> circular dependency
> >   between FST and FSTCompiler (Anh Dung Bui)
> >
> 27,30c30
> < * GITHUB#12646, GITHUB#12690: Move FST#addNode to FSTCompiler to avoid a
> circular dependency
> <   between FST and FSTCompiler (Anh Dung Bui)
> <
> < * GITHUB#12709 Consolidate FSTStore and BytesStore in FST. Created
> FSTReader which contains the common methods
> ---
> > * GITHUB#12709: Consolidate FSTStore and BytesStore in FST. Created
> FSTReader which contains the common methods
> 33,34d32
> < * GITHUB#12735: Remove FSTCompiler#getTermCount() and
> FSTCompiler.UnCompiledNode#inputCount (Anh Dung Bui)
> <
> 37a36,37
> > * GITHUB#12735: Remove FSTCompiler#getTermCount() and
> FSTCompiler.UnCompiledNode#inputCount (Anh Dung Bui)
> >
> 166a167,168
> > * GITHUB#12748: Specialize arc store for continuous label in FST. (Guo
> Feng, Zhang Chao)
> >
> 173,177d174
> < * GITHUB#12748: Specialize arc store for continuous label in FST. (Guo
> Feng, Chao Zhang)
> <
> < * GITHUB#12825, GITHUB#12834: Hunspell: improved dictionary loading
> performance, allowed in-memory entry sorting.
> <   (Peter Gromov)
> <
> 185,186d181
> <
> < * GITHUB#12552: Make FSTPostingsFormat load FSTs off-heap. (Tony X)
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Nov 29, 2023 at 6:01 AM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
> Oh, and that the CHANGES.txt entries in e.g. 9.9.0 section match on 9.x
> and main branches... I think that one we have some automation to catch?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Nov 29, 2023 at 5:58 AM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
> Hi Team,
>
> I see Chris is tagging issues that were left open after their linked PRs
> were merged (thanks!).
>
> Is there something in our release tooling that cross-checks all the weakly
> linked metadata today: Milestone marked (or more often: not) on an issue vs
> commits to the respective branches vs location in Lucene's CHANGES.txt vs
> open/closed issue matching the linked PRs?
>
> It seems like some simple automation here could catch mistakes.  E.g. I'm
> uncertain I properly moved all the FST related CHANGES.txt entries to the
> right places.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
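Mike's sync checker boils down to fetching CHANGES.txt from each branch and diffing the matching version section. The core idea can be sketched offline like this (the section-header format is simplified, not the real file's):

```python
import difflib

def section(changes_text, version):
    # return the lines belonging to one version's section (simplified headers)
    out, inside = [], False
    for line in changes_text.splitlines():
        if line.startswith("======="):
            inside = version in line
            continue
        if inside:
            out.append(line)
    return out

main = """======= Lucene 9.9.0 =======
* GITHUB#12646: Move FST#addNode to FSTCompiler
======= Lucene 9.8.0 =======
* older entry
"""
branch_9_9 = """======= Lucene 9.9.0 =======
* GITHUB#12646 Move FST#addNode to FSTCompiler
======= Lucene 9.8.0 =======
* older entry
"""
for line in difflib.unified_diff(section(branch_9_9, "9.9.0"),
                                 section(main, "9.9.0"),
                                 "branch_9_9", "main", lineterm=""):
    print(line)
```

Run against the real raw.githubusercontent.com URLs, this surfaces exactly the kind of minor drift shown in the diff output above (a missing colon, an entry placed in a different spot).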
>


Re: [VOTE] Release Lucene 9.9.0 RC1

2023-11-30 Thread Luca Cavanna
SUCCESS! [0:33:10.432870]

+1

On Thu, Nov 30, 2023 at 2:59 PM Chris Hegarty
 wrote:

> Hi Mike,
>
> On 30 Nov 2023, at 11:41, Michael McCandless 
> wrote:
>
> +1 to release.
>
> I hit a corner-case test failure and opened a PR to fix it:
> https://github.com/apache/lucene/pull/12859
>
>
> Good find!  It looks like the fix for this issue is well in hand - great.
>
> I don't think this should block the release? -- it looks exotic.
>
>
> I’m not sure how likely this bug is to show in real (non-test) scenarios,
> but it does look kinda “exotic” to me too. So unless there are counter
> arguments, I do not see it as critical, and therefore not needing a respin.
>
> -Chris.
>
>
> Thanks Chris!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Nov 30, 2023 at 1:16 AM Patrick Zhai  wrote:
>
>> SUCCESS! [1:03:54.880200]
>>
>> +1. Thank you Chris!
>>
>> On Wed, Nov 29, 2023 at 8:45 PM Nhat Nguyen
>>  wrote:
>>
>>> SUCCESS! [1:11:30.037919]
>>>
>>> +1. Thanks, Chris!
>>>
>>> On Wed, Nov 29, 2023 at 8:53 AM Chris Hegarty
>>>  wrote:
>>>
 Hi,

 Please vote for release candidate 1 for Lucene 9.9.0

 The artifacts can be downloaded from:

 https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.0-RC1-rev-92a5e5b02e0e083126c4122f2b7a02426c21a037

 You can run the smoke tester directly with this command:

 python3 -u dev-tools/scripts/smokeTestRelease.py \

 https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.0-RC1-rev-92a5e5b02e0e083126c4122f2b7a02426c21a037

 The vote will be open for at least 72 hours, and given the weekend in
 between, let’s keep it open until 2023-12-04 12:00 UTC.

 [ ] +1  approve
 [ ] +0  no opinion
 [ ] -1  disapprove (and reason why)

 Here is my +1

 Draft release highlights can be viewed here (comments and feedback
 welcome):
 https://cwiki.apache.org/confluence/display/LUCENE/ReleaseNote9_9_0

 -Chris.

>>>
>


Re: Welcome Patrick Zhai to the Lucene PMC

2023-11-13 Thread Luca Cavanna
Congrats Patrick!

On Sun, Nov 12, 2023 at 7:14 PM Patrick Zhai  wrote:

> Thank you everyone!
>
> On Sun, Nov 12, 2023, 09:34 Dawid Weiss  wrote:
>
>>
>>
>> Congratulations and welcome, Patrick!
>>
>> Dawid
>>
>> On Fri, Nov 10, 2023 at 9:05 PM Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>>> I'm happy to announce that Patrick Zhai has accepted an invitation to
>>> join the Lucene Project Management Committee (PMC)!
>>>
>>> Congratulations Patrick, thank you for all your hard work improving
>>> Lucene's community and source code, and welcome aboard!
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>


Re: Welcome Guo Feng to the Lucene PMC

2023-10-25 Thread Luca Cavanna
Congrats and welcome!

On Wed, Oct 25, 2023 at 7:42 PM Dawid Weiss  wrote:

>
> Congratulations and welcome, Feng!
> Dawid
>
> On Tue, Oct 24, 2023 at 7:04 PM Adrien Grand  wrote:
>
>> I'm pleased to announce that Guo Feng has accepted an invitation to join
>> the Lucene PMC!
>>
>> Congratulations Feng, and welcome aboard!
>>
>> --
>> Adrien
>>
>


Re: Welcome Luca Cavanna to the Lucene PMC

2023-10-23 Thread Luca Cavanna
Thanks all, I am thrilled to join the PMC!

On Mon, Oct 23, 2023 at 2:17 AM Nhat Nguyen 
wrote:

> Congratulations, Luca!
>
> On Sun, Oct 22, 2023 at 5:08 PM Anshum Gupta 
> wrote:
>
>> Congratulations and welcome, Luca!
>>
>> On Thu, Oct 19, 2023 at 10:51 PM Adrien Grand  wrote:
>>
>>> I'm pleased to announce that Luca Cavanna has accepted an invitation to
>>> join the Lucene PMC!
>>>
>>> Congratulations Luca, and welcome aboard!
>>>
>>> --
>>> Adrien
>>>
>>
>>
>> --
>> Anshum Gupta
>>
>


Re: github milestones vs. releases mystery

2023-09-28 Thread Luca Cavanna
I opened https://github.com/apache/lucene/pull/12607 to add the missing
step to the release wizard script.

On Thu, Sep 28, 2023 at 8:43 PM Luca Cavanna  wrote:

> Creating the github release is a step that has been missed before. I know
> that I have when I was the release manager for 9.5. I think that it's not
> part of the release script that we follow for a Lucene release, which needs
> updating. The script does include closing the current milestone.
>
> I just created the missing 9.7.0 release, although the release date will
> be off, sadly. I will look at what needs to be done to update the release
> script so that we don't miss it in the future.
>
>
> On Thu, Sep 28, 2023 at 7:23 PM Houston Putman  wrote:
>
>> Making a release in github is quite easy. You can do it from the release
>> git tag, so it's "retroactive". (we can do it for 9.7.0 right now)
>>
>> For the release wizard, the Solr Operator has a section to do this:
>> https://github.com/apache/solr-operator/blob/main/hack/release/wizard/releaseWizard.yaml#L1392-L1400
>>
>> I'm surprised it's not in the Lucene one already, since I think Solr has
>> it as well. But it's easy to add nonetheless.
>>
>> - Houston
>>
>> On Thu, Sep 28, 2023 at 1:17 PM Christine Poerschke (BLOOMBERG/ LONDON) <
>> cpoersc...@bloomberg.net> wrote:
>>
>>> Hello Everyone,
>>>
>>> I just semi-randomly noticed that
>>> https://github.com/apache/lucene/releases shows 9.6.0 as the latest
>>> release i.e. not 9.7.0 but on
>>> https://github.com/apache/lucene/milestones the 9.7.0 milestone is
>>> marked as closed, as expected.
>>>
>>>
>>> https://github.com/apache/lucene/blob/releases/lucene/9.7.0/dev-tools/scripts/releaseWizard.yaml#L1517-L1524
>>> looks to be the relevant section of the release wizard.
>>>
>>> Is anyone else also surprised or puzzled by this and/or has any insights
>>> on what (if anything) to do for 9.7.0 retrospectively and 9.8.0 and others
>>> in future?
>>>
>>> Thanks,
>>> Christine
>>>
>>


Re: github milestones vs. releases mystery

2023-09-28 Thread Luca Cavanna
Creating the github release is a step that has been missed before. I know
that I have when I was the release manager for 9.5. I think that it's not
part of the release script that we follow for a Lucene release, which needs
updating. The script does include closing the current milestone.

I just created the missing 9.7.0 release, although the release date will be
off, sadly. I will look at what needs to be done to update the release
script so that we don't miss it in the future.


On Thu, Sep 28, 2023 at 7:23 PM Houston Putman  wrote:

> Making a release in github is quite easy. You can do it from the release
> git tag, so it's "retroactive". (we can do it for 9.7.0 right now)
>
> For the release wizard, the Solr Operator has a section to do this:
> https://github.com/apache/solr-operator/blob/main/hack/release/wizard/releaseWizard.yaml#L1392-L1400
>
> I'm surprised it's not in the Lucene one already, since I think Solr has
> it as well. But it's easy to add nonetheless.
>
> - Houston
>
> On Thu, Sep 28, 2023 at 1:17 PM Christine Poerschke (BLOOMBERG/ LONDON) <
> cpoersc...@bloomberg.net> wrote:
>
>> Hello Everyone,
>>
>> I just semi-randomly noticed that
>> https://github.com/apache/lucene/releases shows 9.6.0 as the latest
>> release i.e. not 9.7.0 but on https://github.com/apache/lucene/milestones
>> the 9.7.0 milestone is marked as closed, as expected.
>>
>>
>> https://github.com/apache/lucene/blob/releases/lucene/9.7.0/dev-tools/scripts/releaseWizard.yaml#L1517-L1524
>> looks to be the relevant section of the release wizard.
>>
>> Is anyone else also surprised or puzzled by this and/or has any insights
>> on what (if anything) to do for 9.7.0 retrospectively and 9.8.0 and others
>> in future?
>>
>> Thanks,
>> Christine
>>
>


Re: [JENKINS] Lucene » Lucene-NightlyTests-main - Build # 1132 - Unstable!

2023-09-25 Thread Luca Cavanna
I opened https://github.com/apache/lucene/pull/12588 as one way to address
this.

On Mon, Sep 25, 2023 at 2:07 PM Luca Cavanna  wrote:

> This is caused by https://github.com/apache/lucene/pull/12183 , yet I
> don't think there is anything wrong with the PR itself.
>
> Looking at the test and QueryUtils, we create a new searcher for each leaf
> reader, and each searcher gets a separate executor. That would be ok as
> long as the executors are shutdown promptly, but that is not the case here
> as they get terminated via a close listener associated with each index
> reader, which I believe is called only later in the AfterClass.
> I am thinking that we end up creating too many executors and that is why
> we in turn end up creating too many threads. I am looking into fixing this.
>
> On Fri, Sep 22, 2023 at 9:30 AM Dawid Weiss  wrote:
>
>>
>> This failed because the thread limit has been exhausted.
>>
>> [3241.014s][warning][os,thread] Failed to start thread "Unknown thread" - 
>> pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 
>> 0k, detached.
>> [3241.015s][warning][os,thread] Failed to start the native thread for 
>> java.lang.Thread "LuceneTestCase-7564-thread-1
>>
>> org.apache.lucene.search.TestSimpleExplanationsWithFillerDocs > testP7 FAILED
>> java.lang.OutOfMemoryError: unable to create native thread: possibly out 
>> of memory or process/resource limits reached
>> at 
>> __randomizedtesting.SeedInfo.seed([B5DBA9A0960A3ECC:1F73384DEAF58B95]:0)
>> at java.base/java.lang.Thread.start0(Native Method)
>> at java.base/java.lang.Thread.start(Thread.java:798)
>> at 
>> java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
>> at 
>> java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
>> at 
>> org.apache.lucene.search.TaskExecutor.invokeAll(TaskExecutor.java:73)
>> at org.apache.lucene.index.TermStates.build(TermStates.java:119)
>> at 
>> org.apache.lucene.search.PhraseQuery$1.getStats(PhraseQuery.java:458)
>> at org.apache.lucene.search.PhraseWeight.(PhraseWeight.java:44)
>> at 
>> org.apache.lucene.search.PhraseQuery$1.(PhraseQuery.java:439)
>> at 
>> org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:439)
>> at 
>> org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:893)
>> at 
>> org.apache.lucene.tests.search.AssertingIndexSearcher.createWeight(AssertingIndexSearcher.java:62)
>> at 
>> org.apache.lucene.search.BooleanWeight.(BooleanWeight.java:59)
>> at 
>> org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:245)
>> at 
>> org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:893)
>> at 
>> org.apache.lucene.tests.search.AssertingIndexSearcher.createWeight(AssertingIndexSearcher.java:62)
>> at 
>> org.apache.lucene.tests.search.QueryUtils$4.doSetNextReader(QueryUtils.java:617)
>> at 
>> org.apache.lucene.search.SimpleCollector.getLeafCollector(SimpleCollector.java:31)
>> at 
>> org.apache.lucene.search.FilterCollector.getLeafCollector(FilterCollector.java:38)
>> at 
>> org.apache.lucene.tests.search.AssertingCollector.getLeafCollector(AssertingCollector.java:54)
>> at 
>> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:748)
>> at 
>> org.apache.lucene.tests.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:79)
>> at 
>> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:547)
>> at 
>> org.apache.lucene.tests.search.QueryUtils.checkFirstSkipTo(QueryUtils.java:549)
>> at 
>> org.apache.lucene.tests.search.QueryUtils.check(QueryUtils.java:138)
>> at 
>> org.apache.lucene.tests.search.QueryUtils.check(QueryUtils.java:131)
>> at 
>> org.apache.lucene.tests.search.CheckHits.checkHitCollector(CheckHits.java:106)
>> at 
>> org.apache.lucene.tests.search.BaseExplanationTestCase.qtest(BaseExplanationTestCase.java:110)
>> at 
>> org.apache.lucene.search.TestSimpleExplanationsWithFillerDocs.qtest(TestSimpleExplanationsWithFillerDocs.java:116)
>> at 
>> org.apache.lucene.search.TestSimpleExplanations.testP7(TestSimpleExplanations.java:87)
>> at 
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
>> Method)
>> at ...

Re: [JENKINS] Lucene » Lucene-NightlyTests-main - Build # 1132 - Unstable!

2023-09-25 Thread Luca Cavanna
This is caused by https://github.com/apache/lucene/pull/12183 , yet I don't
think there is anything wrong with the PR itself.

Looking at the test and QueryUtils, we create a new searcher for each leaf
reader, and each searcher gets a separate executor. That would be ok as
long as the executors are shut down promptly, but that is not the case here
as they get terminated via a close listener associated with each index
reader, which I believe is called only later in the AfterClass.
I am thinking that we end up creating too many executors and that is why we
in turn end up creating too many threads. I am looking into fixing this.
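The failure mode Luca describes (one executor per searcher, shut down only by a reader close listener long after the search) versus a single promptly-released pool can be illustrated with Python's concurrent.futures as an analogy (not the Lucene test code itself):

```python
from concurrent.futures import ThreadPoolExecutor

# Anti-pattern from the failing test: many pools whose threads linger until
# some later cleanup hook (the AfterClass in the test) finally shuts them down.
lazy_pools = [ThreadPoolExecutor(max_workers=2) for _ in range(3)]
for pool in lazy_pools:
    pool.shutdown(wait=True)  # happens far too late; threads pile up meanwhile

# Preferred: one shared pool, released promptly once the work is done.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda x: x * x, range(4)))
print(results)  # [0, 1, 4, 9]
```

With per-searcher pools the live thread count grows with the number of searchers created before cleanup runs, which is how the nightly test exhausted the process thread limit.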

On Fri, Sep 22, 2023 at 9:30 AM Dawid Weiss  wrote:

>
> This failed because the thread limit has been exhausted.
>
> [3241.014s][warning][os,thread] Failed to start thread "Unknown thread" - 
> pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 
> 0k, detached.
> [3241.015s][warning][os,thread] Failed to start the native thread for 
> java.lang.Thread "LuceneTestCase-7564-thread-1
>
> org.apache.lucene.search.TestSimpleExplanationsWithFillerDocs > testP7 FAILED
> java.lang.OutOfMemoryError: unable to create native thread: possibly out 
> of memory or process/resource limits reached
> at 
> __randomizedtesting.SeedInfo.seed([B5DBA9A0960A3ECC:1F73384DEAF58B95]:0)
> at java.base/java.lang.Thread.start0(Native Method)
> at java.base/java.lang.Thread.start(Thread.java:798)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
> at 
> org.apache.lucene.search.TaskExecutor.invokeAll(TaskExecutor.java:73)
> at org.apache.lucene.index.TermStates.build(TermStates.java:119)
> at 
> org.apache.lucene.search.PhraseQuery$1.getStats(PhraseQuery.java:458)
> at org.apache.lucene.search.PhraseWeight.(PhraseWeight.java:44)
> at org.apache.lucene.search.PhraseQuery$1.<init>(PhraseQuery.java:439)
> at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:439)
> at org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:893)
> at org.apache.lucene.tests.search.AssertingIndexSearcher.createWeight(AssertingIndexSearcher.java:62)
> at org.apache.lucene.search.BooleanWeight.<init>(BooleanWeight.java:59)
> at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:245)
> at org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:893)
> at org.apache.lucene.tests.search.AssertingIndexSearcher.createWeight(AssertingIndexSearcher.java:62)
> at org.apache.lucene.tests.search.QueryUtils$4.doSetNextReader(QueryUtils.java:617)
> at org.apache.lucene.search.SimpleCollector.getLeafCollector(SimpleCollector.java:31)
> at org.apache.lucene.search.FilterCollector.getLeafCollector(FilterCollector.java:38)
> at org.apache.lucene.tests.search.AssertingCollector.getLeafCollector(AssertingCollector.java:54)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:748)
> at org.apache.lucene.tests.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:79)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:547)
> at org.apache.lucene.tests.search.QueryUtils.checkFirstSkipTo(QueryUtils.java:549)
> at org.apache.lucene.tests.search.QueryUtils.check(QueryUtils.java:138)
> at org.apache.lucene.tests.search.QueryUtils.check(QueryUtils.java:131)
> at org.apache.lucene.tests.search.CheckHits.checkHitCollector(CheckHits.java:106)
> at org.apache.lucene.tests.search.BaseExplanationTestCase.qtest(BaseExplanationTestCase.java:110)
> at org.apache.lucene.search.TestSimpleExplanationsWithFillerDocs.qtest(TestSimpleExplanationsWithFillerDocs.java:116)
> at org.apache.lucene.search.TestSimpleExplanations.testP7(TestSimpleExplanations.java:87)
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> at 
> 

Re: Custom SliceExecutor and slices computation in IndexSearcher

2023-06-06 Thread Luca Cavanna
Thanks Sorabh,
I will have a look in the coming days.

On Tue, Jun 6, 2023 at 12:04 AM SorabhApache  wrote:

> Hi Luca,
> I looked into moving the slice computation to SliceExecutor and using that
> in the default case as well. This way the package private constructor with
> SliceExecutor can be exposed and utilized by different extensions to
> customize the slice computation and execution as well. I have created a
> GitHub issue[1] and PR[2] to share the changes and will look forward to the
> feedback.
>
> [1]: https://github.com/apache/lucene/issues/12347
> [2]: https://github.com/apache/lucene/pull/12348
>
> Thanks,
> Sorabh
>
> On Sat, May 27, 2023 at 1:09 AM SorabhApache  wrote:
>
>> Hi Luca,
>> Thanks for the suggestion. Let me explore more on it and I will get back.
>> But I agree that making the slices non-final would require consumers to
>> ensure slices are not mutated after construction time, which is not preferred.
>>
>> Thanks,
>> Sorabh
>>
>>
>>
>> On Tue, May 23, 2023 at 3:14 AM Luca Cavanna  wrote:
>>
>>> Hi Sorabh,
>>> thanks for explaining. I see what you mean, it does get awkward to
>>> customize how slices are created. If the plan is to make SliceExecutor
>>> public and extensible, would it make sense to figure out what its public
>>> methods should be, and include the slice creation in there so it is
>>> detached from the IndexSearcher? I think that in practice there is a
>>> correlation between how slices are created and how they are executed (e.g.
>>> you may want to limit the number of slices created and execute each one on
>>> a separate thread, or possibly have more slices than threads, and reuse the
>>> same thread to execute multiple slices). Moving the slice creation to the
>>> component responsible for their execution would give the desired
>>> flexibility for users to customize their logic without having to poke with
>>> IndexSearcher constructors? What do you think? I personally prefer this
>>> option over adding a function argument to the existing searcher
>>> constructor, or making the slices non-final which affects the inter-segment
>>> concurrency design (slices should not be mutable?).
>>>
>>> Cheers
>>> Luca
>>>
>>> On Fri, May 19, 2023 at 9:02 AM SorabhApache  wrote:
>>>
>>>> Hi Luca,
>>>> Thanks for your reply. Sharing an example below for clarity.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> public class CustomIndexSearcher extends IndexSearcher {
>>>>   public CustomIndexSearcher(IndexReader reader, Executor executor, int maxSliceCount) {
>>>>     super(reader, executor);
>>>>   }
>>>>
>>>>   @Override
>>>>   protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
>>>>     // cannot use maxSliceCount here to control custom logic as this is
>>>>     // called from constructor of super class
>>>>     // I want to use parameter[s] in the constructor input to control
>>>>     // this slice computation
>>>>   }
>>>> }
>>>>
>>>> Yes, the SliceExecutor class will become public. Will also need a
>>>> constructor in IndexSearcher which can take an implementation from the
>>>> extensions.
>>>>
>>>> Thanks,
>>>> Sorabh
>>>>
>>>>
>>>> On Thu, May 18, 2023 at 11:40 PM Luca Cavanna 
>>>> wrote:
>>>>
>>>>> Hi Sorabh,
>>>>> You'll want to override the protected slices method to include your
>>>>> custom logic for creating the leaf slices. Your IndexSearcher extension 
>>>>> can
>>>>> also retrieve the slices through the getSlices public method. I don't
>>>>> understand what makes the additional constructor necessary, could you
>>>>> clarify that for me?
>>>>>
>>>>> One thing that may make sense to do is making the SliceExecutor
>>>>> extensible. Currently it is package private, and I can see how users may
>>>>> want to provide their own implementation when it comes to handling
>>>>> rejections, executing on the caller thread in certain scenarios. Possibly
>>>>> even the task creation, and the coordination of their execution could be
>>>>> moved to the SliceExecutor too.
>>>>>
>>>>> Cheers
>>>>> Luca
>

Re: Custom SliceExecutor and slices computation in IndexSearcher

2023-05-23 Thread Luca Cavanna
Hi Sorabh,
thanks for explaining. I see what you mean, it does get awkward to
customize how slices are created. If the plan is to make SliceExecutor
public and extensible, would it make sense to figure out what its public
methods should be, and include the slice creation in there so it is
detached from the IndexSearcher? I think that in practice there is a
correlation between how slices are created and how they are executed (e.g.
you may want to limit the number of slices created and execute each one on
a separate thread, or possibly have more slices than threads, and reuse the
same thread to execute multiple slices). Moving the slice creation to the
component responsible for their execution would give the desired
flexibility for users to customize their logic without having to poke with
IndexSearcher constructors? What do you think? I personally prefer this
option over adding a function argument to the existing searcher
constructor, or making the slices non-final which affects the inter-segment
concurrency design (slices should not be mutable?).
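As a rough illustration of the design discussed above — the component that executes slices also owning the slicing policy — here is a self-contained plain-Java sketch. The names (`SliceRunner`, `slices`, `execute`) are hypothetical and not Lucene's actual API; the point is only that slicing policy and execution strategy live in one extensible component:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch: an executor-like component that both creates and runs slices,
// so a custom subclass can coordinate the two (e.g. cap slices at pool size).
class SliceRunner {
    private final int maxSlices;

    SliceRunner(int maxSlices) {
        this.maxSlices = maxSlices;
    }

    // Slicing policy: round-robin leaves into at most maxSlices groups.
    <T> List<List<T>> slices(List<T> leaves) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < leaves.size(); i++) {
            if (out.size() < maxSlices && out.size() <= i) {
                out.add(new ArrayList<>());
            }
            out.get(i % maxSlices).add(leaves.get(i));
        }
        return out;
    }

    // Execution strategy lives next to the slicing policy; sequential here
    // for brevity, but this is where threading/backpressure would go.
    <T> void execute(List<List<T>> slices, Consumer<List<T>> task) {
        for (List<T> slice : slices) {
            task.accept(slice);
        }
    }
}

public class SliceRunnerDemo {
    public static void main(String[] args) {
        SliceRunner runner = new SliceRunner(2);
        List<List<Integer>> slices = runner.slices(List.of(10, 20, 30));
        // Two slices: [10, 30] and [20]
        runner.execute(slices, System.out::println);
    }
}
```

Because the searcher would only hand the leaves to this component, overriding the slicing policy no longer requires touching searcher constructors.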

Cheers
Luca

On Fri, May 19, 2023 at 9:02 AM SorabhApache  wrote:

> Hi Luca,
> Thanks for your reply. Sharing an example below for clarity.
>
>
>
>
>
>
>
>
> public class CustomIndexSearcher extends IndexSearcher {
>   public CustomIndexSearcher(IndexReader reader, Executor executor, int maxSliceCount) {
>     super(reader, executor);
>   }
>
>   @Override
>   protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
>     // cannot use maxSliceCount here to control custom logic as this is
>     // called from constructor of super class
>     // I want to use parameter[s] in the constructor input to control
>     // this slice computation
>   }
> }
>
> Yes, the SliceExecutor class will become public. Will also need a
> constructor in IndexSearcher which can take an implementation from the
> extensions.
>
> Thanks,
> Sorabh
>
>
> On Thu, May 18, 2023 at 11:40 PM Luca Cavanna 
> wrote:
>
>> Hi Sorabh,
>> You'll want to override the protected slices method to include your
>> custom logic for creating the leaf slices. Your IndexSearcher extension can
>> also retrieve the slices through the getSlices public method. I don't
>> understand what makes the additional constructor necessary, could you
>> clarify that for me?
>>
>> One thing that may make sense to do is making the SliceExecutor
>> extensible. Currently it is package private, and I can see how users may
>> want to provide their own implementation when it comes to handling
>> rejections, executing on the caller thread in certain scenarios. Possibly
>> even the task creation, and the coordination of their execution could be
>> moved to the SliceExecutor too.
>>
>> Cheers
>> Luca
>>
>> On Fri, May 19, 2023, 03:27 SorabhApache  wrote:
>>
>>> Hi All,
>>>
>>> For concurrent segment search, lucene uses the *slices* method to
>>> compute the number of work units which can be processed concurrently.
>>>
>>> a) It calculates *slices* in the constructor of *IndexSearcher*
>>> <https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L239>
>>> with default thresholds for document count and segment counts.
>>> b) Provides an implementation of *SliceExecutor* (i.e.
>>> QueueSizeBasedExecutor)
>>> <https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L1008>
>>> based on executor type which applies the backpressure in concurrent
>>> execution based on a limiting factor of 1.5 times the passed in threadpool
>>> maxPoolSize.
>>>
>>> In OpenSearch, we have a search threadpool which serves the search
>>> request to all the lucene indices (or OpenSearch shards) assigned to a
>>> node. Each node can get the requests to some or all the indices on that
>>> node.
>>> I am exploring a mechanism such that I can dynamically control the max
>>> slices for each lucene index search request. For example: search requests
>>> to some indices on that node to have max 4 slices each and others to have 2
>>> slices each. Then the threadpool shared to execute these slices does not
>>> have any limiting factor. In this model the top level search threadpool
>>> will limit the number of active search requests which will limit the number
>>> of work units in the SliceExecutor threadpool.
>>>
>>> For this the derived implementation of IndexSearcher can get an input
>>> value in the constructor to control the slice count computa

Re: Custom SliceExecutor and slices computation in IndexSearcher

2023-05-19 Thread Luca Cavanna
Hi Sorabh,
You'll want to override the protected slices method to include your custom
logic for creating the leaf slices. Your IndexSearcher extension can also
retrieve the slices through the getSlices public method. I don't understand
what makes the additional constructor necessary, could you clarify that for
me?

One thing that may make sense to do is making the SliceExecutor extensible.
Currently it is package private, and I can see how users may want to
provide their own implementation when it comes to handling rejections,
executing on the caller thread in certain scenarios. Possibly even the task
creation, and the coordination of their execution could be moved to the
SliceExecutor too.

Cheers
Luca

On Fri, May 19, 2023, 03:27 SorabhApache  wrote:

> Hi All,
>
> For concurrent segment search, lucene uses the *slices* method to compute
> the number of work units which can be processed concurrently.
>
> a) It calculates *slices* in the constructor of *IndexSearcher*
> 
> with default thresholds for document count and segment counts.
> b) Provides an implementation of *SliceExecutor* (i.e.
> QueueSizeBasedExecutor)
> 
> based on executor type which applies the backpressure in concurrent
> execution based on a limiting factor of 1.5 times the passed in threadpool
> maxPoolSize.
>
> In OpenSearch, we have a search threadpool which serves the search request
> to all the lucene indices (or OpenSearch shards) assigned to a node. Each
> node can get the requests to some or all the indices on that node.
> I am exploring a mechanism such that I can dynamically control the max
> slices for each lucene index search request. For example: search requests
> to some indices on that node to have max 4 slices each and others to have 2
> slices each. Then the threadpool shared to execute these slices does not
> have any limiting factor. In this model the top level search threadpool
> will limit the number of active search requests which will limit the number
> of work units in the SliceExecutor threadpool.
>
> For this the derived implementation of IndexSearcher can get an input
> value in the constructor to control the slice count computation. Even
> though the slices method is protected, it gets called from the constructor
> of the base IndexSearcher class, which prevents the derived class from
> using the passed-in input.
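The constructor-ordering pitfall described above is general Java behavior, not specific to Lucene: when a base-class constructor calls an overridable method, subclass fields have not been assigned yet. A minimal self-contained sketch (hypothetical class names) shows why the subclass parameter is invisible at that point:

```java
import java.util.Arrays;

// Base calls an overridable method from its constructor, mirroring how a
// base searcher computes its slices before a subclass constructor body runs.
class Base {
    protected final int[] slices;

    Base() {
        this.slices = computeSlices(); // runs before Derived's fields are assigned
    }

    protected int[] computeSlices() {
        return new int[] {1};
    }
}

class Derived extends Base {
    private final int maxSliceCount;

    Derived(int maxSliceCount) {
        super(); // computeSlices() is invoked here, while maxSliceCount is still 0
        this.maxSliceCount = maxSliceCount;
    }

    @Override
    protected int[] computeSlices() {
        // Sees the default value 0, not the constructor argument.
        return new int[] {maxSliceCount};
    }
}

public class ConstructorPitfallDemo {
    public static void main(String[] args) {
        Derived d = new Derived(4);
        System.out.println(Arrays.toString(d.slices)); // prints [0], not [4]
    }
}
```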
>
> To achieve this I can think of the following ways (in order of preference)
> and would like to submit a pull request for it. But I wanted to get some
> feedback if option 1 looks fine or take some other approach.
>
> 1. Provide another constructor in IndexSearcher which takes in 4 input
> parameters:
>   protected IndexSearcher(IndexReaderContext context, Executor executor,
> SliceExecutor sliceExecutor, Function<List<LeafReaderContext>, LeafSlice[]>
> sliceProvider)
>
> 2. Make the `leafSlices` member protected and non final. After it is
> initialized by the IndexSearcher (using default mechanism in lucene), the
> derived implementation can again update it if need be (like based on some
> input parameter to its own constructor). Also make the constructor with
> SliceExecutor input protected such that derived implementation can provide
> its own implementation of SliceExecutor. This mechanism will have redundant
> computation of leafSlices.
>
>
> Thanks,
> Sorabh
>
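Option 1 above follows a standard pattern for sidestepping the constructor-ordering problem: pass the slicing behavior into the base constructor as a function, so callers close over their own parameters instead of overriding a method the base constructor calls. A stripped-down sketch with hypothetical names (plain Java, not Lucene's actual signatures):

```java
import java.util.List;
import java.util.function.Function;

// The base class receives the slice computation as a constructor argument,
// so the provided function can capture caller state (e.g. a max slice count)
// and is fully usable at construction time.
class Searcher {
    final List<List<Integer>> leafSlices;

    Searcher(List<Integer> leaves,
             Function<List<Integer>, List<List<Integer>>> sliceProvider) {
        this.leafSlices = sliceProvider.apply(leaves);
    }
}

public class SliceProviderDemo {
    public static void main(String[] args) {
        int maxSliceCount = 2; // parameter the provider closes over
        Searcher searcher = new Searcher(
            List.of(1, 2, 3, 4),
            leaves -> List.of(
                leaves.subList(0, leaves.size() / maxSliceCount),
                leaves.subList(leaves.size() / maxSliceCount, leaves.size())));
        System.out.println(searcher.leafSlices); // prints [[1, 2], [3, 4]]
    }
}
```

Unlike option 2, the slices can stay final and are computed exactly once.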


Release wizard: generate asciidoc guide

2023-01-31 Thread Luca Cavanna
Hi all,
I was wondering what the " 6 - Generate Asciidoc guide for this
release" step of the release wizard does. It failed for me yesterday
and asking around I was not sure if it is a required step. Would love
to clarify this and make adjustments if needed, so the next release
manager does not have the same question.

Cheers
Luca

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[ANNOUNCE] Apache Lucene 9.5.0 released

2023-01-30 Thread Luca Cavanna
The Lucene PMC is pleased to announce the release of Apache Lucene 9.5.0.

Apache Lucene is a high-performance, full-featured search engine library
written entirely in Java. It is a technology suitable for nearly any
application that requires structured search, full-text search, faceting,
nearest-neighbor search across high-dimensionality vectors, spell
correction or query suggestions.

This release contains numerous bug fixes, optimizations, and improvements,
some of which are highlighted below. The release is available for immediate
download at:

  

### Lucene 9.5.0 Release Highlights:

 New features

 * Added KnnByteVectorField and KnnByteVectorQuery that are specialized for
indexing and querying byte-sized vectors. Deprecated KnnVectorField,
KnnVectorQuery and LeafReader#getVectorValues in favour of the newly
introduced KnnFloatVectorField, KnnFloatVectorQuery and
LeafReader#getFloatVectorValues that are specialized for float vectors.
 * Added IntField, LongField, FloatField and DoubleField: easy to use
numeric fields that perform well both for filtering and sorting.
 * Support for Java 19 foreign memory access ("project Panama") was enabled
by default removing the need to provide the "--enable-preview" flag.
 * Added ByteWritesTrackingDirectoryWrapper to expose metrics for bytes
merged, flushed, and overall write amplification factor.

 Optimizations

* Improved storage efficiency of connections in the HNSW graph used for
vector search
* Added new stored fields and term vectors interfaces:
IndexReader#storedFields and IndexReader#termVectors. These do not rely
upon ThreadLocal storage for each index segment, which can greatly reduce
RAM requirements when there are many threads and/or segments.
* Several improvements were made to
IndexSortSortedNumericDocValuesRangeQuery including query execution
optimization with points for descending sorts and BoundedDocIdSetIterator
construction sped up using bkd binary search.

 Other

* Moved DocValuesNumbersQuery from sandbox to
NumericDocValuesField#newSlowSetQuery
* Fix exponential runtime for nested BooleanQuery#rewrite with non scoring
clauses

Please read CHANGES.txt for a full list of new features and changes:

  


Re: [VOTE] Release Lucene 9.5.0 RC1

2023-01-30 Thread Luca Cavanna
It's been >72h since the vote was initiated and the result is:

+1  7  (4 binding)
 0  0
-1  0

This vote has PASSED

On Thu, Jan 26, 2023 at 8:09 PM Alessandro Benedetti 
wrote:

> +1
>
> SUCCESS! [1:18:35.364317]
> --
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benede...@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Thu, 26 Jan 2023 at 10:23, Ignacio Vera  wrote:
>
>> +1
>>
>> SUCCESS! [0:44:15.998020]
>>
>> On Thu, Jan 26, 2023 at 9:19 AM Jan Høydahl 
>> wrote:
>>
>>> +1
>>>
>>> SUCCESS! [0:36:32.191785]
>>>
>>> Jan
>>>
>>> 25. jan. 2023 kl. 19:43 skrev Luca Cavanna :
>>>
>>> Please vote for release candidate 1 for Lucene 9.5.0
>>>
>>> The artifacts can be downloaded from:
>>>
>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.5.0-RC1-rev-13803aa6ea7fee91f798cfeded4296182ac43a21
>>>
>>> You can run the smoke tester directly with this command:
>>>
>>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>>
>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.5.0-RC1-rev-13803aa6ea7fee91f798cfeded4296182ac43a21
>>>
>>> The vote will be open for at least 72 hours i.e. until 2023-01-28 19:00
>>> UTC.
>>>
>>> [ ] +1  approve
>>> [ ] +0  no opinion
>>> [ ] -1  disapprove (and reason why)
>>>
>>> Here is my +1
>>>
>>>
>>>


Re: Welcome Ben Trent as Lucene committer

2023-01-27 Thread Luca Cavanna
Welcome Ben, congratulations!

On Fri, Jan 27, 2023 at 4:27 PM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Welcome and congratulations, Ben!
>
> On Fri, Jan 27, 2023 at 8:48 PM Adrien Grand  wrote:
> >
> > I'm pleased to announce that Ben Trent has accepted the PMC's
> > invitation to become a committer.
> >
> > Ben, the tradition is that new committers introduce themselves with a
> > brief bio.
> >
> > Congratulations and welcome!
> >
> > --
> > Adrien
> >
>


Lucene 9.5 release notes draft

2023-01-26 Thread Luca Cavanna
Hi all,
I published a draft of the release notes for Lucene 9.5 here:
https://cwiki.apache.org/confluence/display/LUCENE/Release+Notes+9.5

Could you please review it? Feel free to make suggestions/edits directly in
Confluence.

Thanks
Luca


Re: [JENKINS] Lucene-9.x-Linux (64bit/hotspot/jdk-18) - Build # 8135 - Unstable!

2023-01-25 Thread Luca Cavanna
Forgot to reply, but this was fixed by
https://github.com/apache/lucene/pull/12110 .

On Tue, Jan 24, 2023 at 9:45 AM Luca Cavanna  wrote:

> This one reproduces on 9x as well as on main. It may very well be related
> to the changes I made yesterday. Looking into it.
>
> On Tue, Jan 24, 2023 at 4:11 AM Policeman Jenkins Server <
> jenk...@thetaphi.de> wrote:
>
>> Build: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/8135/
>> Java: 64bit/hotspot/jdk-18 -XX:-UseCompressedOops -XX:+UseParallelGC
>>
>> 1 tests failed.
>> FAILED:
>> org.apache.lucene.util.hnsw.TestHnswByteVectorGraph.testSortedAndUnsortedIndicesReturnSameResults
>>
>> Error Message:
>> java.lang.AssertionError: expected:<[199, 162, 217, 214, 16]> but was:<[199, 162, 217, 214, 96]>
>>
>> Stack Trace:
>> java.lang.AssertionError: expected:<[199, 162, 217, 214, 16]> but was:<[199, 162, 217, 214, 96]>
>> at __randomizedtesting.SeedInfo.seed([AC1FE49367DE9C2B:E76BA1C072C906E2]:0)
>> at org.junit.Assert.fail(Assert.java:89)
>> at org.junit.Assert.failNotEquals(Assert.java:835)
>> at org.junit.Assert.assertEquals(Assert.java:120)
>> at org.junit.Assert.assertEquals(Assert.java:146)
>> at org.apache.lucene.util.hnsw.HnswGraphTestCase.testSortedAndUnsortedIndicesReturnSameResults(HnswGraphTestCase.java:244)
>> at org.apache.lucene.util.hnsw.TestHnswByteVectorGraph.testSortedAndUnsortedIndicesReturnSameResults(TestHnswByteVectorGraph.java:36)
>> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>> at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
>> at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
>> at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>> at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
>> at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
>> at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
>> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
>> at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
>> at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
>> at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
>> at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
>> at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>> at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at com.carrotsearch.randomizedtesting.rules.StatementAdapt

[VOTE] Release Lucene 9.5.0 RC1

2023-01-25 Thread Luca Cavanna
Please vote for release candidate 1 for Lucene 9.5.0

The artifacts can be downloaded from:
https://dist.apache.org/repos/dist/dev/lucene/lucene-9.5.0-RC1-rev-13803aa6ea7fee91f798cfeded4296182ac43a21

You can run the smoke tester directly with this command:

python3 -u dev-tools/scripts/smokeTestRelease.py \
https://dist.apache.org/repos/dist/dev/lucene/lucene-9.5.0-RC1-rev-13803aa6ea7fee91f798cfeded4296182ac43a21

The vote will be open for at least 72 hours i.e. until 2023-01-28 19:00 UTC.

[ ] +1  approve
[ ] +0  no opinion
[ ] -1  disapprove (and reason why)

Here is my +1


Re: [JENKINS] Lucene » Lucene-Check-9.5 - Build # 1 - Failure!

2023-01-25 Thread Luca Cavanna
Hopefully the next run will work, I missed an underscore in the branch name
when configuring the job :)

On Wed, Jan 25, 2023 at 4:38 PM Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> Build: https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-9.5/1/
>
> No tests ran.
>
> Build Log:
> [...truncated 19 lines...]
> ERROR: Couldn't find any revision to build. Verify the repository and
> branch configuration for this job.
> Archiving artifacts
> Recording test results
> ERROR: Step ‘Publish JUnit test result report’ failed: No test report
> files were found. Configuration error?
> Email was triggered for: Failure - Any
> Sending email for trigger: Failure - Any
>


New branch and feature freeze for Lucene 9.5.0

2023-01-25 Thread Luca Cavanna
NOTICE:

Branch branch_9_5 has been cut and versions updated to 9.6 on stable branch.

Please observe the normal rules:

* No new features may be committed to the branch.
* Documentation patches, build patches and serious bug fixes may be
  committed to the branch. However, you should submit all patches you
  want to commit as pull requests first to give others the chance to review
  and possibly vote against them. Keep in mind that it is our
  main intention to keep the branch as stable as possible.
* All patches that are intended for the branch should first be committed
  to the unstable branch, merged into the stable branch, and then into
  the current release branch.
* Normal unstable and stable branch development may continue as usual.
  However, if you plan to commit a big change to the unstable branch
  while the branch feature freeze is in effect, think twice: can't the
  addition wait a couple more days? Merges of bug fixes into the branch
  may become more difficult.
* Only Github issues with Milestone 9.5
  and priority "Blocker" will delay a release candidate build.


Re: Lucene 9.5.0 release

2023-01-25 Thread Luca Cavanna
Hi all,
we made the changes that Adrien suggested above and addressed two recent
test failures around vector queries. We should be good to go, I am going to
start the release process now.

On Mon, Jan 23, 2023 at 5:04 PM Michael Wechner 
wrote:

> thanks :-)
>
> Am 23.01.23 um 12:31 schrieb Alessandro Benedetti:
>
> Yes Luca, doing it right now!
>
> For Michael, it's just few getters.
>
> Cheers
> --
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benede...@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Mon, 23 Jan 2023 at 11:21, Luca Cavanna 
>  wrote:
>
>> Hi all,
>> I meant to start the release today and I see this PR is not merged yet:
>> https://github.com/apache/lucene/pull/12029 . Alessandro, do you still
>> plan on merging it shortly?
>>
>> Thanks
>> Luca
>>
>> On Sat, Jan 21, 2023 at 11:41 AM Michael Wechner <
>> michael.wech...@wyona.com> wrote:
>>
>>> I tried to understand the issue described on github, but unfortunately
>>> do not really understand it.
>>>
>>> Can you explain a little more?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>>
>>> Am 21.01.23 um 11:00 schrieb Alessandro Benedetti:
>>>
>>> Hi,
>>> this would be nice to have in 9.5 :
>>> https://github.com/apache/lucene/issues/12099
>>>
>>> It's a minor (adding getters to KnnQuery) but can be beneficial in
>>> Apache Solr as soon as possible.
>>> Planning to merge in a few hours if no objections.
>>> --
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benede...@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>> <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>> <https://github.com/seaseltd>
>>>
>>>
>>> On Thu, 19 Jan 2023 at 14:38, Luca Cavanna 
>>>  wrote:
>>>
>>>> Thanks Robert for the help with the github milestone.
>>>>
>>>> I am planning on cutting the release branch on Monday if there are no
>>>> objections.
>>>>
>>>> Cheers
>>>> Luca
>>>>
>>>> On Tue, Jan 17, 2023 at 7:08 PM Robert Muir  wrote:
>>>>
>>>>> +1 to release, thank you for volunteering to be RM!
>>>>>
>>>>> I went thru 9.5 section of CHANGES.txt and tagged all the GH issues in
>>>>> there with milestone too, if they didn't already have it. It looks
>>>>> even bigger now.
>>>>>
>>>>> On Fri, Jan 13, 2023 at 4:54 AM Luca Cavanna 
>>>>> wrote:
>>>>> >
>>>>> > Hi all,
>>>>> > I'd like to propose that we release Lucene 9.5.0. There is a decent
>>>>> amount of changes that would go into it looking at the github milestone:
>>>>> https://github.com/apache/lucene/milestone/4 . I'd volunteer to be
>>>>> the release manager. There is one PR open listed for the 9.5 milestone:
>>>>> https://github.com/apache/lucene/pull/11873 . Is this something that
>>>>> we do want to address before we release? Is anybody aware of outstanding
>>>>> work that we would like to include or known blocker issues that are not
>>>>> listed in the 9.5 milestone?
>>>>> >
>>>>> > Cheers
>>>>> > Luca
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>
>


Re: [JENKINS] Lucene-9.x-Linux (64bit/hotspot/jdk-18) - Build # 8135 - Unstable!

2023-01-24 Thread Luca Cavanna
This one reproduces on 9x as well as on main. It may very well be related
to the changes I made yesterday. Looking into it.

On Tue, Jan 24, 2023 at 4:11 AM Policeman Jenkins Server <
jenk...@thetaphi.de> wrote:

> Build: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/8135/
> Java: 64bit/hotspot/jdk-18 -XX:-UseCompressedOops -XX:+UseParallelGC
>
> 1 tests failed.
> FAILED:
> org.apache.lucene.util.hnsw.TestHnswByteVectorGraph.testSortedAndUnsortedIndicesReturnSameResults
>
> Error Message:
> java.lang.AssertionError: expected:<[199, 162, 217, 214, 16]> but was:<[199, 162, 217, 214, 96]>
>
> Stack Trace:
> java.lang.AssertionError: expected:<[199, 162, 217, 214, 16]> but was:<[199, 162, 217, 214, 96]>
> at __randomizedtesting.SeedInfo.seed([AC1FE49367DE9C2B:E76BA1C072C906E2]:0)
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:120)
> at org.junit.Assert.assertEquals(Assert.java:146)
> at org.apache.lucene.util.hnsw.HnswGraphTestCase.testSortedAndUnsortedIndicesReturnSameResults(HnswGraphTestCase.java:244)
> at org.apache.lucene.util.hnsw.TestHnswByteVectorGraph.testSortedAndUnsortedIndicesReturnSameResults(TestHnswByteVectorGraph.java:36)
> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
> at java.base/java.lang.reflect.Method.invoke(Method.java:577)
> at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
> at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
> at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
> at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at 
> 

Re: Lucene 9.5.0 release

2023-01-23 Thread Luca Cavanna
Hi Ishan,
thanks for asking. I am currently working on moving the byte vectors API
away from BytesRef, as Adrien suggested. I would aim for cutting the
branch tomorrow; would that work for you too?

Cheers
Luca

On Mon, Jan 23, 2023 at 2:23 PM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Hi Luca,
> Thanks for volunteering.
> Approximately, when do you plan to cut the release branch? If there's
> time, can I include https://issues.apache.org/jira/browse/LUCENE-9302 for
> 9.5 still? It is not merged yet.
> Thanks,
> Ishan
>
> On Mon, Jan 23, 2023 at 5:31 PM Alessandro Benedetti 
> wrote:
>
>> Done on main and cherry-picked on 9.x, thanks Luca for your patience!
>> --
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benede...@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
>>
>> On Mon, 23 Jan 2023 at 12:31, Alessandro Benedetti 
>> wrote:
>>
>>> Yes Luca, doing it right now!
>>>
>>> For Michael, it's just few getters.
>>>
>>> Cheers
>>> --
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benede...@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>> <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>> <https://github.com/seaseltd>
>>>
>>>
>>> On Mon, 23 Jan 2023 at 11:21, Luca Cavanna 
>>> wrote:
>>>
>>>> Hi all,
>>>> I meant to start the release today and I see this PR is not merged yet:
>>>> https://github.com/apache/lucene/pull/12029 . Alessandro, do you still
>>>> plan on merging it shortly?
>>>>
>>>> Thanks
>>>> Luca
>>>>
>>>> On Sat, Jan 21, 2023 at 11:41 AM Michael Wechner <
>>>> michael.wech...@wyona.com> wrote:
>>>>
>>>>> I tried to understand the issue described on github, but unfortunately
>>>>> do not really understand it.
>>>>>
>>>>> Can you explain a little more?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>>
>>>>> Am 21.01.23 um 11:00 schrieb Alessandro Benedetti:
>>>>>
>>>>> Hi,
>>>>> this would be nice to have in 9.5 :
>>>>> https://github.com/apache/lucene/issues/12099
>>>>>
>>>>> It's a minor (adding getters to KnnQuery) but can be beneficial in
>>>>> Apache Solr as soon as possible.
>>>>> Planning to merge in a few hours if no objections.
>>>>> --
>>>>> *Alessandro Benedetti*
>>>>> Director @ Sease Ltd.
>>>>> *Apache Lucene/Solr Committer*
>>>>> *Apache Solr PMC Member*
>>>>>
>>>>> e-mail: a.benede...@sease.io
>>>>>
>>>>>
>>>>> *Sease* - Information Retrieval Applied
>>>>> Consulting | Training | Open Source
>>>>>
>>>>> Website: Sease.io <http://sease.io/>
>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>> <https://github.com/seaseltd>
>>>>>
>>>>>
>>>>> On Thu, 19 Jan 2023 at 14:38, Luca Cavanna 
>>>>>  wrote:
>>>>>
>>>>>> Thanks Robert for the help with the github milestone.
>>>>>>
>>>>>> I am planning on cutting the release branch on Monday if there are no objections.

Re: Lucene 9.5.0 release

2023-01-23 Thread Luca Cavanna
Hi all,
I meant to start the release today and I see this PR is not merged yet:
https://github.com/apache/lucene/pull/12029 . Alessandro, do you still plan
on merging it shortly?

Thanks
Luca

On Sat, Jan 21, 2023 at 11:41 AM Michael Wechner 
wrote:

> I tried to understand the issue described on github, but unfortunately do
> not really understand it.
>
> Can you explain a little more?
>
> Thanks
>
> Michael
>
>
>
> Am 21.01.23 um 11:00 schrieb Alessandro Benedetti:
>
> Hi,
> this would be nice to have in 9.5 :
> https://github.com/apache/lucene/issues/12099
>
> It's a minor (adding getters to KnnQuery) but can be beneficial in Apache
> Solr as soon as possible.
> Planning to merge in a few hours if no objections.
> --
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benede...@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Thu, 19 Jan 2023 at 14:38, Luca Cavanna 
>  wrote:
>
>> Thanks Robert for the help with the github milestone.
>>
>> I am planning on cutting the release branch on Monday if there are no
>> objections.
>>
>> Cheers
>> Luca
>>
>> On Tue, Jan 17, 2023 at 7:08 PM Robert Muir  wrote:
>>
>>> +1 to release, thank you for volunteering to be RM!
>>>
>>> I went thru 9.5 section of CHANGES.txt and tagged all the GH issues in
>>> there with milestone too, if they didn't already have it. It looks
>>> even bigger now.
>>>
>>> On Fri, Jan 13, 2023 at 4:54 AM Luca Cavanna  wrote:
>>> >
>>> > Hi all,
>>> > I'd like to propose that we release Lucene 9.5.0. There is a decent
>>> amount of changes that would go into it looking at the github milestone:
>>> https://github.com/apache/lucene/milestone/4 . I'd volunteer to be the
>>> release manager. There is one PR open listed for the 9.5 milestone:
>>> https://github.com/apache/lucene/pull/11873 . Is this something that we
>>> do want to address before we release? Is anybody aware of outstanding work
>>> that we would like to include or known blocker issues that are not listed
>>> in the 9.5 milestone?
>>> >
>>> > Cheers
>>> > Luca
>>> >
>>> >
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>
>


Re: Lucene 9.5.0 release

2023-01-19 Thread Luca Cavanna
Thanks Robert for the help with the github milestone.

I am planning on cutting the release branch on Monday if there are no
objections.

Cheers
Luca

On Tue, Jan 17, 2023 at 7:08 PM Robert Muir  wrote:

> +1 to release, thank you for volunteering to be RM!
>
> I went thru 9.5 section of CHANGES.txt and tagged all the GH issues in
> there with milestone too, if they didn't already have it. It looks
> even bigger now.
>
> On Fri, Jan 13, 2023 at 4:54 AM Luca Cavanna  wrote:
> >
> > Hi all,
> > I'd like to propose that we release Lucene 9.5.0. There is a decent
> amount of changes that would go into it looking at the github milestone:
> https://github.com/apache/lucene/milestone/4 . I'd volunteer to be the
> release manager. There is one PR open listed for the 9.5 milestone:
> https://github.com/apache/lucene/pull/11873 . Is this something that we
> do want to address before we release? Is anybody aware of outstanding work
> that we would like to include or known blocker issues that are not listed
> in the 9.5 milestone?
> >
> > Cheers
> > Luca
> >
> >
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Lucene 9.5.0 release

2023-01-13 Thread Luca Cavanna
Hi all,
I'd like to propose that we release Lucene 9.5.0. There is a decent amount
of changes that would go into it looking at the github milestone:
https://github.com/apache/lucene/milestone/4 . I'd volunteer to be the
release manager. There is one PR open listed for the 9.5 milestone:
https://github.com/apache/lucene/pull/11873 . Is this something that we do
want to address before we release? Is anybody aware of outstanding work
that we would like to include or known blocker issues that are not listed
in the 9.5 milestone?

Cheers
Luca


Re: [JENKINS] Lucene » Lucene-Check-9.x - Build # 3700 - Failure!

2022-11-24 Thread Luca Cavanna
I opened https://github.com/apache/lucene/pull/11975 to address this
compile error.

On Thu, Nov 24, 2022 at 6:35 PM Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> Build: https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-9.x/3700/
>
> All tests passed
>
> Build Log:
> [...truncated 1349 lines...]
> BUILD FAILED in 18m 58s
> 768 actionable tasks: 768 executed
> Build step 'Invoke Gradle script' changed build result to FAILURE
> Build step 'Invoke Gradle script' marked build as failure
> Archiving artifacts
> Recording test results
> [Checks API] No suitable checks publisher found.
> Email was triggered for: Failure - Any
> Sending email for trigger: Failure - Any
>
> -
> To unsubscribe, e-mail: builds-unsubscr...@lucene.apache.org
> For additional commands, e-mail: builds-h...@lucene.apache.org


Re: Welcome Luca Cavanna as Lucene committer

2022-10-06 Thread Luca Cavanna
Thanks all for the warm welcome, I am thrilled to become a Lucene
committer, thanks for the opportunity.

A bit about me: I have been working at Elastic for a bit longer than 9
years, where I contributed to many different areas of Elasticsearch. I am
currently the area tech lead for the Search team within Elasticsearch.
I started using Lucene as well as contributing to it even before my
involvement with Elasticsearch. It's such a powerful piece of software,
with a great community, and it's fantastic to be a part of it.
I am originally from the north of Italy, but I have lived in Amsterdam
(Netherlands) for many years now. I have three boys (6, almost 3, and 1 year
old). When I get a chance I ride my motorbike and play guitar.

Looking forward to contributing more!

Cheers
Luca


On Thu, Oct 6, 2022 at 11:41 AM Ignacio Vera  wrote:

> Welcome Luca!
>
> On Thu, Oct 6, 2022 at 11:31 AM Bruno Roustant 
> wrote:
>
>> Welcome!
>>
>> Le jeu. 6 oct. 2022 à 11:20, Michael Sokolov  a
>> écrit :
>>
>>> Welcome Luca!
>>>
>>> On Thu, Oct 6, 2022, 1:05 AM 陆徐刚  wrote:
>>>
>>>> Welcome!
>>>>
>>>> Xugang
>>>>
>>>> https://github.com/LuXugang
>>>>
>>>> On Oct 6, 2022, at 13:59, Mikhail Khludnev  wrote:
>>>>
>>>> 
>>>> Welcome, Luca.
>>>>
>>>> On Wed, Oct 5, 2022 at 8:04 PM Adrien Grand  wrote:
>>>>
>>>>> I'm pleased to announce that Luca Cavanna has accepted the PMC's
>>>>> invitation to become a committer.
>>>>>
>>>>> Luca, the tradition is that new committers introduce themselves with a
>>>>> brief bio.
>>>>>
>>>>> Congratulations and welcome!
>>>>>
>>>>> --
>>>>> Adrien
>>>>>
>>>>
>>>>
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
>>>>
>>>>


[jira] [Comment Edited] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-31 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852876#comment-16852876
 ] 

Luca Cavanna edited comment on LUCENE-8796 at 5/31/19 10:08 AM:


I updated the PR and addressed all the comments, here are the latest benchmark 
results (with bitset optimization disabled on both ends):
{noformat}
Report after iter 19:
TaskQPS baseline  StdDevQPS my_modified_version  
StdDevPct diff
 MedTerm 1510.74  (6.8%) 1457.20  (8.4%)   
-3.5% ( -17% -   12%)
  Fuzzy1   70.49  (8.5%)   68.11  (9.8%)   
-3.4% ( -19% -   16%)
OrHighNotMed  650.57  (5.8%)  629.81  (6.0%)   
-3.2% ( -14% -9%)
   OrHighLow  447.13  (4.2%)  433.05  (4.5%)   
-3.2% ( -11% -5%)
OrNotHighMed  623.22  (6.3%)  605.19  (6.1%)   
-2.9% ( -14% -   10%)
OrHighNotLow  720.89  (7.0%)  701.26  (7.9%)   
-2.7% ( -16% -   13%)
   OrNotHighHigh  558.43  (6.3%)  544.82  (4.9%)   
-2.4% ( -12% -9%)
 LowTerm 1279.34  (4.9%) 1248.60  (5.2%)   
-2.4% ( -11% -8%)
  AndHighLow  690.75  (4.0%)  675.22  (5.3%)   
-2.2% ( -11% -7%)
   LowPhrase  358.90  (2.3%)  351.28  (4.0%)   
-2.1% (  -8% -4%)
PKLookup  139.97  (3.0%)  137.32  (3.5%)   
-1.9% (  -8% -4%)
OrNotHighLow  728.48  (6.8%)  714.79  (6.5%)   
-1.9% ( -14% -   12%)
HighTerm 1222.38  (6.3%) 1199.77  (7.1%)   
-1.8% ( -14% -   12%)
 AndHighHigh   58.93  (6.2%)   58.01  (5.8%)   
-1.6% ( -12% -   11%)
 Prefix3  152.21  (4.5%)  150.00  (5.0%)   
-1.5% ( -10% -8%)
   IntNRQConjMedTerm   79.15 (10.7%)   78.06 (10.5%)   
-1.4% ( -20% -   22%)
   HighTermDayOfYearSort   95.28  (5.1%)   94.10  (7.8%)   
-1.2% ( -13% -   12%)
Wildcard   64.23  (2.3%)   63.45  (2.3%)   
-1.2% (  -5% -3%)
 MedSpanNear   81.15  (2.2%)   80.19  (2.8%)   
-1.2% (  -6% -3%)
HighSpanNear   10.20  (3.9%)   10.08  (4.2%)   
-1.2% (  -8% -7%)
HighIntervalsOrdered4.07  (1.8%)4.03  (2.2%)   
-1.1% (  -4% -2%)
 LowSpanNear   41.62  (3.1%)   41.20  (3.6%)   
-1.0% (  -7% -5%)
   IntNRQConjLowTerm   20.36  (4.1%)   20.15  (4.5%)   
-1.0% (  -9% -7%)
  IntNRQConjHighTerm   64.84  (9.6%)   64.21  (9.4%)   
-1.0% ( -18% -   19%)
  AndHighMed  229.08  (2.8%)  227.00  (2.5%)   
-0.9% (  -6% -4%)
   MedPhrase   18.73  (1.5%)   18.57  (2.3%)   
-0.8% (  -4% -2%)
 LowSloppyPhrase  124.52  (2.3%)  123.48  (2.6%)   
-0.8% (  -5% -4%)
 Respell   69.26  (3.0%)   68.68  (2.9%)   
-0.8% (  -6% -5%)
  HighPhrase   12.98  (1.6%)   12.88  (2.2%)   
-0.7% (  -4% -3%)
   PrefixConjLowTerm   42.11  (2.6%)   41.81  (3.0%)   
-0.7% (  -6% -5%)
   OrHighNotHigh  680.34  (6.1%)  676.16  (7.6%)   
-0.6% ( -13% -   13%)
 MedSloppyPhrase   34.06  (4.9%)   33.89  (4.5%)   
-0.5% (  -9% -9%)
  IntNRQ   89.97 (12.4%)   89.62 (12.0%)   
-0.4% ( -22% -   27%)
HighSloppyPhrase8.28  (4.0%)8.25  (3.9%)   
-0.3% (  -7% -7%)
 WildcardConjLowTerm   36.35  (2.7%)   36.26  (2.7%)   
-0.3% (  -5% -5%)
  OrHighHigh   27.89  (2.6%)   27.85  (3.1%)   
-0.1% (  -5% -5%)
  Fuzzy2   44.19  (3.8%)   44.17  (3.1%)   
-0.1% (  -6% -7%)
   OrHighMed   90.42  (2.8%)   90.57  (2.8%)
0.2% (  -5% -6%)
   PrefixConjMedTerm   45.56  (2.8%)   45.79  (2.9%)
0.5% (  -5% -6%)
WildcardConjHighTerm   33.08  (2.6%)   33.47  (3.0%)
1.2% (  -4% -6%)
  PrefixConjHighTerm   83.65  (2.6%)   86.23  (3.7%)
3.1% (  -3% -9%)
   HighTermMonthSort  130.35 (15.8%)  135.08 (12.1%)
3.6% ( -20% -   37%)
 WildcardConjMedTerm   99.19  (3.6%)  103.37  (4.1%)
4.2% (  -3% -   12%)
{noformat}


was (Author: lucacavanna):
I updated the PR and addressed all the comments, here are the latest benchmark 
results:

{noformat}
Report after iter 19:
TaskQPS baseline  StdDevQPS

[jira] [Commented] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-31 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852876#comment-16852876
 ] 

Luca Cavanna commented on LUCENE-8796:
--

I updated the PR and addressed all the comments, here are the latest benchmark 
results:

{noformat}
Report after iter 19:
TaskQPS baseline  StdDevQPS my_modified_version  
StdDevPct diff
 MedTerm 1510.74  (6.8%) 1457.20  (8.4%)   
-3.5% ( -17% -   12%)
  Fuzzy1   70.49  (8.5%)   68.11  (9.8%)   
-3.4% ( -19% -   16%)
OrHighNotMed  650.57  (5.8%)  629.81  (6.0%)   
-3.2% ( -14% -9%)
   OrHighLow  447.13  (4.2%)  433.05  (4.5%)   
-3.2% ( -11% -5%)
OrNotHighMed  623.22  (6.3%)  605.19  (6.1%)   
-2.9% ( -14% -   10%)
OrHighNotLow  720.89  (7.0%)  701.26  (7.9%)   
-2.7% ( -16% -   13%)
   OrNotHighHigh  558.43  (6.3%)  544.82  (4.9%)   
-2.4% ( -12% -9%)
 LowTerm 1279.34  (4.9%) 1248.60  (5.2%)   
-2.4% ( -11% -8%)
  AndHighLow  690.75  (4.0%)  675.22  (5.3%)   
-2.2% ( -11% -7%)
   LowPhrase  358.90  (2.3%)  351.28  (4.0%)   
-2.1% (  -8% -4%)
PKLookup  139.97  (3.0%)  137.32  (3.5%)   
-1.9% (  -8% -4%)
OrNotHighLow  728.48  (6.8%)  714.79  (6.5%)   
-1.9% ( -14% -   12%)
HighTerm 1222.38  (6.3%) 1199.77  (7.1%)   
-1.8% ( -14% -   12%)
 AndHighHigh   58.93  (6.2%)   58.01  (5.8%)   
-1.6% ( -12% -   11%)
 Prefix3  152.21  (4.5%)  150.00  (5.0%)   
-1.5% ( -10% -8%)
   IntNRQConjMedTerm   79.15 (10.7%)   78.06 (10.5%)   
-1.4% ( -20% -   22%)
   HighTermDayOfYearSort   95.28  (5.1%)   94.10  (7.8%)   
-1.2% ( -13% -   12%)
Wildcard   64.23  (2.3%)   63.45  (2.3%)   
-1.2% (  -5% -3%)
 MedSpanNear   81.15  (2.2%)   80.19  (2.8%)   
-1.2% (  -6% -3%)
HighSpanNear   10.20  (3.9%)   10.08  (4.2%)   
-1.2% (  -8% -7%)
HighIntervalsOrdered4.07  (1.8%)4.03  (2.2%)   
-1.1% (  -4% -2%)
 LowSpanNear   41.62  (3.1%)   41.20  (3.6%)   
-1.0% (  -7% -5%)
   IntNRQConjLowTerm   20.36  (4.1%)   20.15  (4.5%)   
-1.0% (  -9% -7%)
  IntNRQConjHighTerm   64.84  (9.6%)   64.21  (9.4%)   
-1.0% ( -18% -   19%)
  AndHighMed  229.08  (2.8%)  227.00  (2.5%)   
-0.9% (  -6% -4%)
   MedPhrase   18.73  (1.5%)   18.57  (2.3%)   
-0.8% (  -4% -2%)
 LowSloppyPhrase  124.52  (2.3%)  123.48  (2.6%)   
-0.8% (  -5% -4%)
 Respell   69.26  (3.0%)   68.68  (2.9%)   
-0.8% (  -6% -5%)
  HighPhrase   12.98  (1.6%)   12.88  (2.2%)   
-0.7% (  -4% -3%)
   PrefixConjLowTerm   42.11  (2.6%)   41.81  (3.0%)   
-0.7% (  -6% -5%)
   OrHighNotHigh  680.34  (6.1%)  676.16  (7.6%)   
-0.6% ( -13% -   13%)
 MedSloppyPhrase   34.06  (4.9%)   33.89  (4.5%)   
-0.5% (  -9% -9%)
  IntNRQ   89.97 (12.4%)   89.62 (12.0%)   
-0.4% ( -22% -   27%)
HighSloppyPhrase8.28  (4.0%)8.25  (3.9%)   
-0.3% (  -7% -7%)
 WildcardConjLowTerm   36.35  (2.7%)   36.26  (2.7%)   
-0.3% (  -5% -5%)
  OrHighHigh   27.89  (2.6%)   27.85  (3.1%)   
-0.1% (  -5% -5%)
  Fuzzy2   44.19  (3.8%)   44.17  (3.1%)   
-0.1% (  -6% -7%)
   OrHighMed   90.42  (2.8%)   90.57  (2.8%)
0.2% (  -5% -6%)
   PrefixConjMedTerm   45.56  (2.8%)   45.79  (2.9%)
0.5% (  -5% -6%)
WildcardConjHighTerm   33.08  (2.6%)   33.47  (3.0%)
1.2% (  -4% -6%)
  PrefixConjHighTerm   83.65  (2.6%)   86.23  (3.7%)
3.1% (  -3% -9%)
   HighTermMonthSort  130.35 (15.8%)  135.08 (12.1%)
3.6% ( -20% -   37%)
 WildcardConjMedTerm   99.19  (3.6%)  103.37  (4.1%)
4.2% (  -3% -   12%)
{noformat}
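For reading these luceneutil reports: the Pct diff column is the relative QPS change of the modified version against the baseline. A hypothetical helper (not luceneutil's actual code) showing the arithmetic:

```java
// Relative QPS change, as reported in the "Pct diff" column of the
// luceneutil tables above. PctDiff and pctDiff are illustrative names.
public class PctDiff {
    static double pctDiff(double baselineQps, double candidateQps) {
        return 100.0 * (candidateQps - baselineQps) / baselineQps;
    }

    public static void main(String[] args) {
        // e.g. the MedTerm row above: 1510.74 QPS baseline vs 1457.20 modified
        System.out.printf("%.1f%%%n", pctDiff(1510.74, 1457.20)); // prints -3.5%
    }
}
```

The parenthesized percentages next to each QPS value are standard deviations across iterations, which is why small Pct diff values within the noise range are not treated as regressions.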

> Use exponential search in IntArrayDocIdSet advance method
> -
>
> Key: LUCENE-8796
> URL: https://issues.apache.org/jira/browse/LUCENE-8796
> Project: Lucene - Core
>  Is

[jira] [Comment Edited] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-09 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836467#comment-16836467
 ] 

Luca Cavanna edited comment on LUCENE-8796 at 5/9/19 3:21 PM:
--

I have updated the PR after applying Yonik's suggestion and re-ran benchmarks a 
few times. The run with the least noise had these results (note that I disabled 
the bitset optimization on both sides):
{noformat}
Report after iter 19:
TaskQPS baseline  StdDevQPS my_modified_version  
StdDevPct diff
HighTerm 1575.07  (5.9%) 1541.27  (6.9%)   
-2.1% ( -14% -   11%)
 MedTerm 1363.22  (6.5%) 1337.03  (7.0%)   
-1.9% ( -14% -   12%)
 LowTerm 1441.86  (4.2%) 1420.77  (5.2%)   
-1.5% ( -10% -8%)
   IntNRQConjMedTerm  280.55  (4.0%)  277.64  (4.1%)   
-1.0% (  -8% -7%)
   MedPhrase  153.84  (3.5%)  152.44  (3.3%)   
-0.9% (  -7% -6%)
 Prefix3  224.92  (4.0%)  223.13  (3.7%)   
-0.8% (  -8% -7%)
HighSloppyPhrase   19.70  (3.7%)   19.56  (4.5%)   
-0.7% (  -8% -7%)
 MedSloppyPhrase   18.23  (4.3%)   18.11  (4.7%)   
-0.7% (  -9% -8%)
OrNotHighMed  586.33  (3.4%)  582.47  (4.9%)   
-0.7% (  -8% -7%)
 LowSloppyPhrase   18.56  (3.6%)   18.46  (3.9%)   
-0.5% (  -7% -7%)
  HighPhrase   22.64  (2.7%)   22.54  (3.0%)   
-0.4% (  -6% -5%)
   LowPhrase  144.10  (3.8%)  143.55  (3.3%)   
-0.4% (  -7% -6%)
  AndHighLow  539.26  (3.7%)  537.25  (3.2%)   
-0.4% (  -7% -6%)
PKLookup  132.96  (3.0%)  132.48  (4.6%)   
-0.4% (  -7% -7%)
   OrHighMed  115.79  (2.7%)  115.49  (3.5%)   
-0.3% (  -6% -6%)
  PrefixConjHighTerm   36.98  (2.8%)   36.93  (3.4%)   
-0.1% (  -6% -6%)
WildcardConjHighTerm   45.79  (3.0%)   45.73  (3.1%)   
-0.1% (  -6% -6%)
   OrHighLow  448.91  (3.7%)  448.70  (6.3%)   
-0.0% (  -9% -   10%)
Wildcard   78.89  (3.2%)   78.95  (3.6%)
0.1% (  -6% -7%)
  IntNRQConjHighTerm   78.35  (2.3%)   78.48  (2.4%)
0.2% (  -4% -4%)
  IntNRQ  100.56  (2.7%)  100.84  (2.8%)
0.3% (  -5% -5%)
OrHighNotLow  732.45  (2.8%)  734.56  (5.3%)
0.3% (  -7% -8%)
   OrHighNotHigh  544.87  (2.8%)  546.47  (4.6%)
0.3% (  -6% -7%)
   IntNRQConjLowTerm  249.20  (4.2%)  249.99  (3.8%)
0.3% (  -7% -8%)
 Respell   73.05  (3.1%)   73.28  (3.4%)
0.3% (  -6% -7%)
  OrHighHigh   35.56  (3.0%)   35.68  (4.2%)
0.3% (  -6% -7%)
OrNotHighLow  695.41  (4.8%)  697.88  (6.5%)
0.4% ( -10% -   12%)
 MedSpanNear   59.99  (3.8%)   60.30  (4.0%)
0.5% (  -7% -8%)
  AndHighMed  190.02  (3.1%)  191.04  (3.6%)
0.5% (  -5% -7%)
 LowSpanNear   12.73  (3.9%)   12.81  (4.2%)
0.6% (  -7% -8%)
   HighTermDayOfYearSort   88.42  (7.0%)   89.09  (7.1%)
0.8% ( -12% -   15%)
   PrefixConjLowTerm   54.95  (3.7%)   55.43  (3.8%)
0.9% (  -6% -8%)
OrHighNotMed  628.44  (3.4%)  634.02  (6.1%)
0.9% (  -8% -   10%)
HighSpanNear   28.86  (3.2%)   29.11  (3.5%)
0.9% (  -5% -7%)
 WildcardConjMedTerm   72.48  (3.4%)   73.19  (4.8%)
1.0% (  -7% -9%)
  Fuzzy2   49.17  (9.9%)   49.68 (11.7%)
1.0% ( -18% -   25%)
 AndHighHigh   63.44  (3.8%)   64.11  (3.8%)
1.1% (  -6% -9%)
  Fuzzy1   79.43  (9.9%)   80.55  (9.7%)
1.4% ( -16% -   23%)
   OrNotHighHigh  574.89  (3.6%)  584.43  (5.5%)
1.7% (  -7% -   11%)
   PrefixConjMedTerm   79.00  (3.2%)   80.50  (3.6%)
1.9% (  -4% -8%)
 WildcardConjLowTerm   90.67  (2.9%)   92.49  (3.7%)
2.0% (  -4% -8%)
   HighTermMonthSort   86.13 (11.8%)   88.79 (12.4%)
3.1% ( -18% -   30%)
{noformat}
I also ran benchmarks with the bitset optimization in place on both ends:

{noformat}
 Report after iter 19:
 TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff
 IntNRQ 63.46 (24.6%) 62.28 (24.2%) -1.9% ( -40% - 62

[jira] [Comment Edited] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-09 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835542#comment-16835542
 ] 

Luca Cavanna edited comment on LUCENE-8796 at 5/9/19 3:22 PM:
--

I have made the change and played with luceneutil to run some benchmarks. I 
opened a PR here: [https://github.com/apache/lucene-solr/pull/667] .

Luceneutil does not currently benchmark the queries that should be affected by 
this change, hence I added benchmarks for numeric range queries, prefix queries 
and wildcard queries in conjunction with term queries (low, medium and high 
frequency). See the changes I made to my luceneutil fork: 
[https://github.com/mikemccand/luceneutil/compare/master...javanna:conjunctions]
 .  Also, for the benchmarks I temporarily modified DocIdSetBuilder#grow to 
never call upgradeToBitSet (on both baseline and modified version), so that the 
updated code is exercised as much as possible during the benchmarks run, 
otherwise in many cases we would use bitsets instead and the changed code would 
not be exercised at all.
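The advance strategy the issue title describes can be sketched as follows: gallop forward from the current position in doubling steps until the target is bracketed, then binary-search the bracketed window. This is an illustrative standalone version under assumed names (ExponentialAdvance, advance), not Lucene's actual IntArrayDocIdSet code:

```java
import java.util.Arrays;

// Exponential (galloping) search over a sorted doc-id array: find the index
// of the first doc >= target, starting the search at `from`. Cost is
// O(log d) where d is the distance advanced, instead of O(log n) for a
// full binary search from scratch.
public class ExponentialAdvance {
    public static int advance(int[] docs, int from, int target) {
        int bound = 1;
        // Gallop: double the step until docs[from + bound] >= target
        // or we run off the end of the array.
        while (from + bound < docs.length && docs[from + bound] < target) {
            bound <<= 1;
        }
        // The answer now lies in [from + bound/2, from + bound].
        int lo = from + (bound >> 1);
        int hi = Math.min(docs.length, from + bound + 1);
        int idx = Arrays.binarySearch(docs, lo, hi, target);
        if (idx < 0) {
            idx = -idx - 1; // negative result encodes the insertion point
        }
        return idx; // may equal docs.length, meaning the iterator is exhausted
    }

    public static void main(String[] args) {
        int[] docs = {1, 3, 7, 12, 40, 41, 100};
        System.out.println(advance(docs, 0, 8));   // 3 (docs[3] == 12)
        System.out.println(advance(docs, 2, 41));  // 5 (exact match)
        System.out.println(advance(docs, 0, 500)); // 7 (exhausted)
    }
}
```

This is why the change mainly benefits conjunctions of a dense clause with a sparse one (the Conj tasks added to the benchmarks above): each advance typically moves only a short distance, which galloping finds much faster than repeated full binary searches.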

I ran the wikimedium10m benchmarks a few times, here is probably the run with 
the least noise, results show a little improvement for some queries, and no 
regressions in general:

{noformat}
 Report after iter 19:
 TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff
 WildcardConjMedTerm 75.49 (2.2%) 72.79 (2.0%) -3.6% ( -7% - 0%)
 OrHighNotMed 607.01 (5.7%) 593.10 (4.4%) -2.3% ( -11% - 8%)
 WildcardConjHighTerm 64.00 (1.7%) 62.55 (1.4%) -2.3% ( -5% - 0%)
 Fuzzy2 20.14 (3.4%) 19.72 (4.6%) -2.1% ( -9% - 6%)
 HighTerm 1174.41 (4.7%) 1150.11 (4.2%) -2.1% ( -10% - 7%)
 OrHighLow 483.40 (5.1%) 473.69 (6.9%) -2.0% ( -13% - 10%)
 OrNotHighLow 526.75 (3.6%) 516.47 (3.6%) -2.0% ( -8% - 5%)
 OrNotHighHigh 600.38 (4.9%) 590.21 (3.7%) -1.7% ( -9% - 7%)
 HighTermMonthSort 110.05 (11.7%) 108.58 (11.5%) -1.3% ( -21% - 24%)
 OrHighMed 107.83 (2.6%) 106.48 (4.7%) -1.3% ( -8% - 6%)
 PrefixConjMedTerm 56.98 (2.5%) 56.33 (1.7%) -1.1% ( -5% - 3%)
 AndHighLow 432.27 (3.6%) 427.46 (3.2%) -1.1% ( -7% - 5%)
 PrefixConjLowTerm 44.43 (2.8%) 43.98 (1.8%) -1.0% ( -5% - 3%)
 MedTerm 1409.97 (5.5%) 1396.33 (4.9%) -1.0% ( -10% - 9%)
 HighSloppyPhrase 11.98 (4.3%) 11.87 (5.1%) -0.9% ( -9% - 8%)
 OrNotHighMed 614.19 (4.6%) 608.74 (3.8%) -0.9% ( -8% - 7%)
 Respell 58.11 (2.4%) 57.61 (2.4%) -0.9% ( -5% - 3%)
 LowTerm 1342.33 (4.8%) 1330.86 (4.0%) -0.9% ( -9% - 8%)
 PrefixConjHighTerm 68.50 (2.9%) 67.93 (1.8%) -0.8% ( -5% - 3%)
 OrHighNotHigh 566.30 (5.2%) 561.88 (4.5%) -0.8% ( -9% - 9%)
 WildcardConjLowTerm 32.75 (2.5%) 32.56 (2.1%) -0.6% ( -5% - 4%)
 PKLookup 131.80 (2.4%) 131.28 (2.3%) -0.4% ( -5% - 4%)
 OrHighHigh 29.90 (3.4%) 29.79 (5.3%) -0.4% ( -8% - 8%)
 OrHighNotLow 497.65 (6.6%) 495.84 (5.2%) -0.4% ( -11% - 12%)
 AndHighMed 175.08 (3.5%) 174.58 (3.0%) -0.3% ( -6% - 6%)
 LowSpanNear 15.17 (1.8%) 15.13 (2.5%) -0.2% ( -4% - 4%)
 Fuzzy1 71.14 (5.9%) 70.97 (6.3%) -0.2% ( -11% - 12%)
 LowSloppyPhrase 35.23 (2.0%) 35.16 (2.6%) -0.2% ( -4% - 4%)
 LowPhrase 74.10 (1.7%) 73.98 (1.8%) -0.2% ( -3% - 3%)
 HighPhrase 34.18 (2.1%) 34.13 (2.0%) -0.1% ( -4% - 3%)
 Prefix3 45.33 (2.3%) 45.28 (2.1%) -0.1% ( -4% - 4%)
 MedPhrase 28.30 (2.1%) 28.27 (1.7%) -0.1% ( -3% - 3%)
 MedSloppyPhrase 6.80 (3.6%) 6.80 (3.2%) -0.0% ( -6% - 6%)
 AndHighHigh 53.79 (3.9%) 53.79 (4.0%) -0.0% ( -7% - 8%)
 MedSpanNear 61.78 (2.2%) 61.83 (1.7%) 0.1% ( -3% - 4%)
 Wildcard 37.83 (2.5%) 37.91 (1.7%) 0.2% ( -3% - 4%)
 IntNRQConjHighTerm 20.17 (3.8%) 20.24 (4.9%) 0.3% ( -8% - 9%)
 HighTermDayOfYearSort 53.55 (7.8%) 53.76 (7.3%) 0.4% ( -13% - 16%)
 HighSpanNear 5.39 (2.6%) 5.42 (2.6%) 0.5% ( -4% - 5%)
 IntNRQConjLowTerm 19.69 (4.3%) 19.86 (4.3%) 0.9% ( -7% - 9%)
 IntNRQConjMedTerm 15.93 (4.5%) 16.12 (5.4%) 1.2% ( -8% - 11%)
 IntNRQ 114.28 (10.3%) 116.41 (14.0%) 1.9% ( -20% - 29%)
 {noformat}
 


was (Author: lucacavanna):
I have made the change and played with luceneutil to run some benchmarks. I 
opened a PR here: [https://github.com/apache/lucene-solr/pull/667] .

Luceneutil does not currently benchmark the queries that should be affected by 
this change, hence I added benchmarks for numeric range queries, prefix queries 
and wildcard queries in conjunction with term queries (low, medium and high 
frequency). See the changes I made to my luceneutil fork: 
[https://github.com/mikemccand/luceneutil/compare/master...javanna:conjunctions]
 .  Also, for the benchmarks I temporarily modified DocIdSetBuilder#grow to 
never call upgradeToBitSet (on both baseline and modified version), so that the 
updated code is exercised as much as possible during the benchmarks run, 
otherwise in many cases we would use bitsets instead and the changed code would 
not be exercised at all.

I ran the wikimedium10m benchmarks a few times, here is probably the run with 
the least noise, results show a little improvement for some queries, and no 
regressions

[jira] [Comment Edited] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-09 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836467#comment-16836467
 ] 

Luca Cavanna edited comment on LUCENE-8796 at 5/9/19 3:22 PM:
--

I have updated the PR after applying Yonik's suggestion and re-ran benchmarks a 
few times. The run with the least noise had these results (note that I disabled 
the bitset optimization on both sides):
{noformat}
Report after iter 19:
TaskQPS baseline  StdDevQPS my_modified_version  
StdDevPct diff
HighTerm 1575.07  (5.9%) 1541.27  (6.9%)   
-2.1% ( -14% -   11%)
 MedTerm 1363.22  (6.5%) 1337.03  (7.0%)   
-1.9% ( -14% -   12%)
 LowTerm 1441.86  (4.2%) 1420.77  (5.2%)   
-1.5% ( -10% -8%)
   IntNRQConjMedTerm  280.55  (4.0%)  277.64  (4.1%)   
-1.0% (  -8% -7%)
   MedPhrase  153.84  (3.5%)  152.44  (3.3%)   
-0.9% (  -7% -6%)
 Prefix3  224.92  (4.0%)  223.13  (3.7%)   
-0.8% (  -8% -7%)
HighSloppyPhrase   19.70  (3.7%)   19.56  (4.5%)   
-0.7% (  -8% -7%)
 MedSloppyPhrase   18.23  (4.3%)   18.11  (4.7%)   
-0.7% (  -9% -8%)
OrNotHighMed  586.33  (3.4%)  582.47  (4.9%)   
-0.7% (  -8% -7%)
 LowSloppyPhrase   18.56  (3.6%)   18.46  (3.9%)   
-0.5% (  -7% -7%)
HighPhrase 22.64 (2.7%) 22.54 (3.0%) -0.4% ( -6% - 5%)
LowPhrase 144.10 (3.8%) 143.55 (3.3%) -0.4% ( -7% - 6%)
AndHighLow 539.26 (3.7%) 537.25 (3.2%) -0.4% ( -7% - 6%)
PKLookup 132.96 (3.0%) 132.48 (4.6%) -0.4% ( -7% - 7%)
OrHighMed 115.79 (2.7%) 115.49 (3.5%) -0.3% ( -6% - 6%)
PrefixConjHighTerm 36.98 (2.8%) 36.93 (3.4%) -0.1% ( -6% - 6%)
WildcardConjHighTerm 45.79 (3.0%) 45.73 (3.1%) -0.1% ( -6% - 6%)
OrHighLow 448.91 (3.7%) 448.70 (6.3%) -0.0% ( -9% - 10%)
Wildcard 78.89 (3.2%) 78.95 (3.6%) 0.1% ( -6% - 7%)
IntNRQConjHighTerm 78.35 (2.3%) 78.48 (2.4%) 0.2% ( -4% - 4%)
IntNRQ 100.56 (2.7%) 100.84 (2.8%) 0.3% ( -5% - 5%)
OrHighNotLow 732.45 (2.8%) 734.56 (5.3%) 0.3% ( -7% - 8%)
OrHighNotHigh 544.87 (2.8%) 546.47 (4.6%) 0.3% ( -6% - 7%)
IntNRQConjLowTerm 249.20 (4.2%) 249.99 (3.8%) 0.3% ( -7% - 8%)
Respell 73.05 (3.1%) 73.28 (3.4%) 0.3% ( -6% - 7%)
OrHighHigh 35.56 (3.0%) 35.68 (4.2%) 0.3% ( -6% - 7%)
OrNotHighLow 695.41 (4.8%) 697.88 (6.5%) 0.4% ( -10% - 12%)
MedSpanNear 59.99 (3.8%) 60.30 (4.0%) 0.5% ( -7% - 8%)
AndHighMed 190.02 (3.1%) 191.04 (3.6%) 0.5% ( -5% - 7%)
LowSpanNear 12.73 (3.9%) 12.81 (4.2%) 0.6% ( -7% - 8%)
HighTermDayOfYearSort 88.42 (7.0%) 89.09 (7.1%) 0.8% ( -12% - 15%)
PrefixConjLowTerm 54.95 (3.7%) 55.43 (3.8%) 0.9% ( -6% - 8%)
OrHighNotMed 628.44 (3.4%) 634.02 (6.1%) 0.9% ( -8% - 10%)
HighSpanNear 28.86 (3.2%) 29.11 (3.5%) 0.9% ( -5% - 7%)
WildcardConjMedTerm 72.48 (3.4%) 73.19 (4.8%) 1.0% ( -7% - 9%)
Fuzzy2 49.17 (9.9%) 49.68 (11.7%) 1.0% ( -18% - 25%)
AndHighHigh 63.44 (3.8%) 64.11 (3.8%) 1.1% ( -6% - 9%)
Fuzzy1 79.43 (9.9%) 80.55 (9.7%) 1.4% ( -16% - 23%)
OrNotHighHigh 574.89 (3.6%) 584.43 (5.5%) 1.7% ( -7% - 11%)
PrefixConjMedTerm 79.00 (3.2%) 80.50 (3.6%) 1.9% ( -4% - 8%)
WildcardConjLowTerm 90.67 (2.9%) 92.49 (3.7%) 2.0% ( -4% - 8%)
HighTermMonthSort 86.13 (11.8%) 88.79 (12.4%) 3.1% ( -18% - 30%)
{noformat}
I also ran benchmarks with the bitset optimization in place on both ends:

{noformat}
Report after iter 19:
Task QPS baseline StdDev QPS my_modified_version StdDev Pct diff
  IntNRQ

[jira] [Comment Edited] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-09 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835542#comment-16835542
 ] 

Luca Cavanna edited comment on LUCENE-8796 at 5/9/19 3:21 PM:
--

I have made the change and played with luceneutil to run some benchmarks. I opened a PR here: [https://github.com/apache/lucene-solr/pull/667].

Luceneutil does not currently benchmark the queries that should be affected by this change, so I added benchmarks for numeric range queries, prefix queries and wildcard queries in conjunction with term queries (low, medium and high frequency). See the changes I made to my luceneutil fork: [https://github.com/mikemccand/luceneutil/compare/master...javanna:conjunctions]. Also, for the benchmarks I temporarily modified DocIdSetBuilder#grow to never call upgradeToBitSet (on both the baseline and the modified version), so that the updated code is exercised as much as possible during the benchmark runs; otherwise, in many cases bitsets would be used instead and the changed code would not be exercised at all.

I ran the wikimedium10m benchmarks a few times; here is probably the run with the least noise. The results show a small improvement for some queries and no regressions in general:
  

{noformat}
 Report after iter 19:
 Task QPS baseline StdDev QPS my_modified_version StdDev Pct diff
 WildcardConjMedTerm 75.49 (2.2%) 72.79 (2.0%) -3.6% ( -7% - 0%)
 OrHighNotMed 607.01 (5.7%) 593.10 (4.4%) -2.3% ( -11% - 8%)
 WildcardConjHighTerm 64.00 (1.7%) 62.55 (1.4%) -2.3% ( -5% - 0%)
 Fuzzy2 20.14 (3.4%) 19.72 (4.6%) -2.1% ( -9% - 6%)
 HighTerm 1174.41 (4.7%) 1150.11 (4.2%) -2.1% ( -10% - 7%)
 OrHighLow 483.40 (5.1%) 473.69 (6.9%) -2.0% ( -13% - 10%)
 OrNotHighLow 526.75 (3.6%) 516.47 (3.6%) -2.0% ( -8% - 5%)
 OrNotHighHigh 600.38 (4.9%) 590.21 (3.7%) -1.7% ( -9% - 7%)
 HighTermMonthSort 110.05 (11.7%) 108.58 (11.5%) -1.3% ( -21% - 24%)
 OrHighMed 107.83 (2.6%) 106.48 (4.7%) -1.3% ( -8% - 6%)
 PrefixConjMedTerm 56.98 (2.5%) 56.33 (1.7%) -1.1% ( -5% - 3%)
 AndHighLow 432.27 (3.6%) 427.46 (3.2%) -1.1% ( -7% - 5%)
 PrefixConjLowTerm 44.43 (2.8%) 43.98 (1.8%) -1.0% ( -5% - 3%)
 MedTerm 1409.97 (5.5%) 1396.33 (4.9%) -1.0% ( -10% - 9%)
 HighSloppyPhrase 11.98 (4.3%) 11.87 (5.1%) -0.9% ( -9% - 8%)
 OrNotHighMed 614.19 (4.6%) 608.74 (3.8%) -0.9% ( -8% - 7%)
 Respell 58.11 (2.4%) 57.61 (2.4%) -0.9% ( -5% - 3%)
 LowTerm 1342.33 (4.8%) 1330.86 (4.0%) -0.9% ( -9% - 8%)
 PrefixConjHighTerm 68.50 (2.9%) 67.93 (1.8%) -0.8% ( -5% - 3%)
 OrHighNotHigh 566.30 (5.2%) 561.88 (4.5%) -0.8% ( -9% - 9%)
 WildcardConjLowTerm 32.75 (2.5%) 32.56 (2.1%) -0.6% ( -5% - 4%)
 PKLookup 131.80 (2.4%) 131.28 (2.3%) -0.4% ( -5% - 4%)
 OrHighHigh 29.90 (3.4%) 29.79 (5.3%) -0.4% ( -8% - 8%)
 OrHighNotLow 497.65 (6.6%) 495.84 (5.2%) -0.4% ( -11% - 12%)
 AndHighMed 175.08 (3.5%) 174.58 (3.0%) -0.3% ( -6% - 6%)
 LowSpanNear 15.17 (1.8%) 15.13 (2.5%) -0.2% ( -4% - 4%)
 Fuzzy1 71.14 (5.9%) 70.97 (6.3%) -0.2% ( -11% - 12%)
 LowSloppyPhrase 35.23 (2.0%) 35.16 (2.6%) -0.2% ( -4% - 4%)
 LowPhrase 74.10 (1.7%) 73.98 (1.8%) -0.2% ( -3% - 3%)
 HighPhrase 34.18 (2.1%) 34.13 (2.0%) -0.1% ( -4% - 3%)
 Prefix3 45.33 (2.3%) 45.28 (2.1%) -0.1% ( -4% - 4%)
 MedPhrase 28.30 (2.1%) 28.27 (1.7%) -0.1% ( -3% - 3%)
 MedSloppyPhrase 6.80 (3.6%) 6.80 (3.2%) -0.0% ( -6% - 6%)
 AndHighHigh 53.79 (3.9%) 53.79 (4.0%) -0.0% ( -7% - 8%)
 MedSpanNear 61.78 (2.2%) 61.83 (1.7%) 0.1% ( -3% - 4%)
 Wildcard 37.83 (2.5%) 37.91 (1.7%) 0.2% ( -3% - 4%)
 IntNRQConjHighTerm 20.17 (3.8%) 20.24 (4.9%) 0.3% ( -8% - 9%)
 HighTermDayOfYearSort 53.55 (7.8%) 53.76 (7.3%) 0.4% ( -13% - 16%)
 HighSpanNear 5.39 (2.6%) 5.42 (2.6%) 0.5% ( -4% - 5%)
 IntNRQConjLowTerm 19.69 (4.3%) 19.86 (4.3%) 0.9% ( -7% - 9%)
 IntNRQConjMedTerm 15.93 (4.5%) 16.12 (5.4%) 1.2% ( -8% - 11%)
 IntNRQ 114.28 (10.3%) 116.41 (14.0%) 1.9% ( -20% - 29%)

{noformat}

 


was (Author: lucacavanna):
I have made the change and played with luceneutil to run some benchmarks. I opened a PR here: https://github.com/apache/lucene-solr/pull/667 .

Luceneutil does not currently benchmark the queries that should be affected by this change, so I added benchmarks for numeric range queries, prefix queries and wildcard queries in conjunction with term queries (low, medium and high frequency). See the changes I made to my luceneutil fork: [https://github.com/mikemccand/luceneutil/compare/master...javanna:conjunctions]. Also, for the benchmarks I temporarily modified DocIdSetBuilder#grow to never call upgradeToBitSet (on both the baseline and the modified version), so that the updated code is exercised as much as possible during the benchmark runs; otherwise, in many cases bitsets would be used instead and the changed code would not be exercised at all.

I ran the wikimedium10m benchmarks a few times, here is probably the run with 
the least noise, results show a little improvement for some queries

[jira] [Commented] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-09 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836467#comment-16836467
 ] 

Luca Cavanna commented on LUCENE-8796:
--

I have updated the PR after applying Yonik's suggestion and re-run benchmarks a 
few times. The run with the least noise had these results (note that I disabled 
the bitset optimization on both sides):

{noformat}
Report after iter 19:
Task QPS baseline StdDev QPS my_modified_version StdDev Pct diff
HighTerm 1575.07 (5.9%) 1541.27 (6.9%) -2.1% ( -14% - 11%)
MedTerm 1363.22 (6.5%) 1337.03 (7.0%) -1.9% ( -14% - 12%)
LowTerm 1441.86 (4.2%) 1420.77 (5.2%) -1.5% ( -10% - 8%)
IntNRQConjMedTerm 280.55 (4.0%) 277.64 (4.1%) -1.0% ( -8% - 7%)
MedPhrase 153.84 (3.5%) 152.44 (3.3%) -0.9% ( -7% - 6%)
Prefix3 224.92 (4.0%) 223.13 (3.7%) -0.8% ( -8% - 7%)
HighSloppyPhrase 19.70 (3.7%) 19.56 (4.5%) -0.7% ( -8% - 7%)
MedSloppyPhrase 18.23 (4.3%) 18.11 (4.7%) -0.7% ( -9% - 8%)
OrNotHighMed 586.33 (3.4%) 582.47 (4.9%) -0.7% ( -8% - 7%)
LowSloppyPhrase 18.56 (3.6%) 18.46 (3.9%) -0.5% ( -7% - 7%)
HighPhrase 22.64 (2.7%) 22.54 (3.0%) -0.4% ( -6% - 5%)
LowPhrase 144.10 (3.8%) 143.55 (3.3%) -0.4% ( -7% - 6%)
AndHighLow 539.26 (3.7%) 537.25 (3.2%) -0.4% ( -7% - 6%)
PKLookup 132.96 (3.0%) 132.48 (4.6%) -0.4% ( -7% - 7%)
OrHighMed 115.79 (2.7%) 115.49 (3.5%) -0.3% ( -6% - 6%)
PrefixConjHighTerm 36.98 (2.8%) 36.93 (3.4%) -0.1% ( -6% - 6%)
WildcardConjHighTerm 45.79 (3.0%) 45.73 (3.1%) -0.1% ( -6% - 6%)
OrHighLow 448.91 (3.7%) 448.70 (6.3%) -0.0% ( -9% - 10%)
Wildcard 78.89 (3.2%) 78.95 (3.6%) 0.1% ( -6% - 7%)
IntNRQConjHighTerm 78.35 (2.3%) 78.48 (2.4%) 0.2% ( -4% - 4%)
IntNRQ 100.56 (2.7%) 100.84 (2.8%) 0.3% ( -5% - 5%)
OrHighNotLow 732.45 (2.8%) 734.56 (5.3%) 0.3% ( -7% - 8%)
OrHighNotHigh 544.87 (2.8%) 546.47 (4.6%) 0.3% ( -6% - 7%)
IntNRQConjLowTerm 249.20 (4.2%) 249.99 (3.8%) 0.3% ( -7% - 8%)
Respell 73.05 (3.1%) 73.28 (3.4%) 0.3% ( -6% - 7%)
OrHighHigh 35.56 (3.0%) 35.68 (4.2%) 0.3% ( -6% - 7%)
OrNotHighLow 695.41 (4.8%) 697.88 (6.5%) 0.4% ( -10% - 12%)
MedSpanNear 59.99 (3.8%) 60.30 (4.0%) 0.5% ( -7% - 8%)
AndHighMed 190.02 (3.1%) 191.04 (3.6%) 0.5% ( -5% - 7%)
LowSpanNear 12.73 (3.9%) 12.81 (4.2%) 0.6% ( -7% - 8%)
HighTermDayOfYearSort 88.42 (7.0%) 89.09 (7.1%) 0.8% ( -12% - 15%)
PrefixConjLowTerm 54.95 (3.7%) 55.43 (3.8%) 0.9% ( -6% - 8%)
OrHighNotMed 628.44 (3.4%) 634.02 (6.1%) 0.9% ( -8% - 10%)
HighSpanNear 28.86 (3.2%) 29.11 (3.5%) 0.9% ( -5% - 7%)
WildcardConjMedTerm 72.48 (3.4%) 73.19 (4.8%) 1.0% ( -7% - 9%)
Fuzzy2 49.17 (9.9%) 49.68 (11.7%) 1.0% ( -18% - 25%)
AndHighHigh 63.44 (3.8%) 64.11 (3.8%) 1.1% ( -6% - 9%)
Fuzzy1 79.43 (9.9%) 80.55 (9.7%) 1.4% ( -16% - 23%)
OrNotHighHigh 574.89 (3.6%) 584.43 (5.5%) 1.7% ( -7% - 11%)
PrefixConjMedTerm 79.00 (3.2%) 80.50 (3.6%) 1.9% ( -4% - 8%)
WildcardConjLowTerm 90.67 (2.9%) 92.49 (3.7%) 2.0% ( -4% - 8%)
HighTermMonthSort 86.13 (11.8%) 88.79 (12.4%) 3.1% ( -18% - 30%)
{noformat}

I also ran benchmarks with the bitset optimization in place on both ends:

{noformat}
Report after iter 19:
Task QPS baseline StdDev QPS my_modified_version StdDev Pct diff
IntNRQ 63.46 (24.6%) 62.28 (24.2%) -1.9% ( -40

[jira] [Commented] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-08 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835628#comment-16835628
 ] 

Luca Cavanna commented on LUCENE-8796:
--

You are right [~ysee...@gmail.com], I will make that change and re-run the benchmarks.

> Use exponential search in IntArrayDocIdSet advance method
> -
>
> Key: LUCENE-8796
> URL: https://issues.apache.org/jira/browse/LUCENE-8796
> Project: Lucene - Core
>  Issue Type: Improvement
>        Reporter: Luca Cavanna
>Priority: Minor
>
> Chatting with [~jpountz] , he suggested to improve IntArrayDocIdSet by making 
> its advance method use exponential search instead of binary search. This 
> should help performance of queries including conjunctions: given that 
> ConjunctionDISI uses leap frog, it advances through doc ids in small steps, 
> hence exponential search should be faster when advancing on average compared 
> to binary search.
>  
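The leapfrog pattern described in the quoted issue can be sketched as follows. This is an illustrative model of what ConjunctionDISI does over two sorted doc id lists, not Lucene's actual code; the `advance` helper is the operation that LUCENE-8796 speeds up with exponential search:

```java
import java.util.ArrayList;
import java.util.List;

public final class LeapFrog {
  // Intersect two sorted doc id lists by repeatedly advancing
  // whichever iterator is behind to the other one's current doc.
  static List<Integer> intersect(int[] a, int[] b) {
    List<Integer> matches = new ArrayList<>();
    int i = 0, j = 0;
    while (i < a.length && j < b.length) {
      if (a[i] == b[j]) {        // both iterators agree: a match
        matches.add(a[i]);
        i++;
        j++;
      } else if (a[i] < b[j]) {  // 'a' is behind: leap it forward
        i = advance(a, i, b[j]);
      } else {                   // 'b' is behind: leap it forward
        j = advance(b, j, a[i]);
      }
    }
    return matches;
  }

  // Linear stand-in for advance(); the point of LUCENE-8796 is that,
  // because targets tend to be nearby, an exponential search is a
  // better fit here than a full binary search over the whole array.
  static int advance(int[] docs, int from, int target) {
    while (from < docs.length && docs[from] < target) {
      from++;
    }
    return from;
  }
}
```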



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-08 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835542#comment-16835542
 ] 

Luca Cavanna commented on LUCENE-8796:
--

I have made the change and played with luceneutil to run some benchmarks. I opened a PR here: https://github.com/apache/lucene-solr/pull/667 .

Luceneutil does not currently benchmark the queries that should be affected by this change, so I added benchmarks for numeric range queries, prefix queries and wildcard queries in conjunction with term queries (low, medium and high frequency). See the changes I made to my luceneutil fork: [https://github.com/mikemccand/luceneutil/compare/master...javanna:conjunctions]. Also, for the benchmarks I temporarily modified DocIdSetBuilder#grow to never call upgradeToBitSet (on both the baseline and the modified version), so that the updated code is exercised as much as possible during the benchmark runs; otherwise, in many cases bitsets would be used instead and the changed code would not be exercised at all.

I ran the wikimedium10m benchmarks a few times; here is probably the run with the least noise. The results show a small improvement for some queries and no regressions in general:
 
Report after iter 19:
 Task QPS baseline StdDev QPS my_modified_version StdDev Pct diff
 WildcardConjMedTerm 75.49 (2.2%) 72.79 (2.0%) -3.6% ( -7% - 0%)
 OrHighNotMed 607.01 (5.7%) 593.10 (4.4%) -2.3% ( -11% - 8%)
 WildcardConjHighTerm 64.00 (1.7%) 62.55 (1.4%) -2.3% ( -5% - 0%)
 Fuzzy2 20.14 (3.4%) 19.72 (4.6%) -2.1% ( -9% - 6%)
 HighTerm 1174.41 (4.7%) 1150.11 (4.2%) -2.1% ( -10% - 7%)
 OrHighLow 483.40 (5.1%) 473.69 (6.9%) -2.0% ( -13% - 10%)
 OrNotHighLow 526.75 (3.6%) 516.47 (3.6%) -2.0% ( -8% - 5%)
 OrNotHighHigh 600.38 (4.9%) 590.21 (3.7%) -1.7% ( -9% - 7%)
 HighTermMonthSort 110.05 (11.7%) 108.58 (11.5%) -1.3% ( -21% - 24%)
 OrHighMed 107.83 (2.6%) 106.48 (4.7%) -1.3% ( -8% - 6%)
 PrefixConjMedTerm 56.98 (2.5%) 56.33 (1.7%) -1.1% ( -5% - 3%)
 AndHighLow 432.27 (3.6%) 427.46 (3.2%) -1.1% ( -7% - 5%)
 PrefixConjLowTerm 44.43 (2.8%) 43.98 (1.8%) -1.0% ( -5% - 3%)
 MedTerm 1409.97 (5.5%) 1396.33 (4.9%) -1.0% ( -10% - 9%)
 HighSloppyPhrase 11.98 (4.3%) 11.87 (5.1%) -0.9% ( -9% - 8%)
 OrNotHighMed 614.19 (4.6%) 608.74 (3.8%) -0.9% ( -8% - 7%)
 Respell 58.11 (2.4%) 57.61 (2.4%) -0.9% ( -5% - 3%)
 LowTerm 1342.33 (4.8%) 1330.86 (4.0%) -0.9% ( -9% - 8%)
 PrefixConjHighTerm 68.50 (2.9%) 67.93 (1.8%) -0.8% ( -5% - 3%)
 OrHighNotHigh 566.30 (5.2%) 561.88 (4.5%) -0.8% ( -9% - 9%)
 WildcardConjLowTerm 32.75 (2.5%) 32.56 (2.1%) -0.6% ( -5% - 4%)
 PKLookup 131.80 (2.4%) 131.28 (2.3%) -0.4% ( -5% - 4%)
 OrHighHigh 29.90 (3.4%) 29.79 (5.3%) -0.4% ( -8% - 8%)
 OrHighNotLow 497.65 (6.6%) 495.84 (5.2%) -0.4% ( -11% - 12%)
 AndHighMed 175.08 (3.5%) 174.58 (3.0%) -0.3% ( -6% - 6%)
 LowSpanNear 15.17 (1.8%) 15.13 (2.5%) -0.2% ( -4% - 4%)
 Fuzzy1 71.14 (5.9%) 70.97 (6.3%) -0.2% ( -11% - 12%)
 LowSloppyPhrase 35.23 (2.0%) 35.16 (2.6%) -0.2% ( -4% - 4%)
 LowPhrase 74.10 (1.7%) 73.98 (1.8%) -0.2% ( -3% - 3%)
 HighPhrase 34.18 (2.1%) 34.13 (2.0%) -0.1% ( -4% - 3%)
 Prefix3 45.33 (2.3%) 45.28 (2.1%) -0.1% ( -4% - 4%)
 MedPhrase 28.30 (2.1%) 28.27 (1.7%) -0.1% ( -3% - 3%)
 MedSloppyPhrase 6.80 (3.6%) 6.80 (3.2%) -0.0% ( -6% - 6%)
 AndHighHigh 53.79 (3.9%) 53.79 (4.0%) -0.0% ( -7% - 8%)
 MedSpanNear 61.78 (2.2%) 61.83 (1.7%) 0.1% ( -3% - 4%)
 Wildcard 37.83 (2.5%) 37.91 (1.7%) 0.2% ( -3% - 4%)
 IntNRQConjHighTerm 20.17 (3.8%) 20.24 (4.9%) 0.3% ( -8% - 9%)
 HighTermDayOfYearSort 53.55 (7.8%) 53.76 (7.3%) 0.4% ( -13% - 16%)
 HighSpanNear 5.39 (2.6%) 5.42 (2.6%) 0.5% ( -4% - 5%)
 IntNRQConjLowTerm 19.69 (4.3%) 19.86 (4.3%) 0.9% ( -7% - 9%)
 IntNRQConjMedTerm 15.93 (4.5%) 16.12 (5.4%) 1.2% ( -8% - 11%)
 IntNRQ 114.28 (10.3%) 116.41 (14.0%) 1.9% ( -20% - 29%)

 

 

> Use exponential search in IntArrayDocIdSet advance method
> -
>
> Key: LUCENE-8796
> URL: https://issues.apache.org/jira/browse/LUCENE-8796
> Project: Lucene - Core
>  Issue Type: Improvement
>        Reporter: Luca Cavanna
>Priority: Minor
>
> Chatting with [~jpountz] , he suggested to improve IntArrayDocIdSet by making 
> its advance method use exponential search instead of binary search. This 
> should help performance of queries including conjunctions: given that 
> ConjunctionDISI uses leap frog, it advances through doc ids in small steps, 
> hence exponential search should be faster when advancing on average compared 
> to binary search.
>  






[jira] [Created] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-08 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-8796:


 Summary: Use exponential search in IntArrayDocIdSet advance method
 Key: LUCENE-8796
 URL: https://issues.apache.org/jira/browse/LUCENE-8796
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Luca Cavanna


Chatting with [~jpountz], he suggested improving IntArrayDocIdSet by making its advance method use exponential search instead of binary search. This should help the performance of queries that include conjunctions: since ConjunctionDISI uses leapfrog, it advances through doc ids in small steps, so exponential search should on average be faster than binary search when advancing.
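A minimal sketch of such an advance method, assuming a sorted array of unique doc ids (class and method names are illustrative, not the actual Lucene patch): grow the window exponentially from the current position, then binary-search only inside the last doubled window.

```java
import java.util.Arrays;

public final class ExponentialSearch {
  // Returns the index of the first element >= target, scanning from
  // 'from'; returns 'length' if no such element exists. Assumes 'docs'
  // is sorted and holds unique doc ids in positions [0, length).
  public static int advance(int[] docs, int length, int from, int target) {
    // Grow the search window exponentially: 1, 2, 4, 8, ...
    int bound = 1;
    while (from + bound < length && docs[from + bound] < target) {
      bound += bound;
    }
    // The answer lies in (from + bound/2, from + bound]; binary-search
    // only that window instead of the whole remaining array.
    int lo = from + (bound >> 1);
    int hi = Math.min(from + bound + 1, length);
    int found = Arrays.binarySearch(docs, lo, hi, target);
    return found >= 0 ? found : -1 - found;
  }
}
```

When the caller advances in small steps, as leapfrogging conjunctions do, the doubling loop terminates after a few iterations, so the cost depends on the distance to the target rather than on the array length.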

 






[jira] [Resolved] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter

2019-02-26 Thread Luca Cavanna (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna resolved LUCENE-5718.
--
Resolution: Duplicate

This will be addressed with LUCENE-3401, which is being worked on.

> More flexible compound queries (containing mtq) support in postings 
> highlighter
> ---
>
> Key: LUCENE-5718
> URL: https://issues.apache.org/jira/browse/LUCENE-5718
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 4.8.1
>    Reporter: Luca Cavanna
>Priority: Major
> Attachments: LUCENE-5718.patch
>
>
> The postings highlighter currently pulls the automata from multi term queries 
> and doesn't require calling rewrite to make highlighting work. In order to do 
> so it also needs to check whether the query is a compound one and eventually 
> extract its subqueries. This is currently done in the MultiTermHighlighting 
> class and works well but has two potential problems:
> 1) not all the possible compound queries are necessarily supported as we need 
> to go over each of them one by one (see LUCENE-5717) and this requires 
> keeping the "switch" up-to-date if new queries gets added to lucene
> 2) it doesn't support custom compound queries but only the set of queries 
> available out-of-the-box
> I've been thinking about how this can be improved and one of the ideas I came 
> up with is to introduce a generic way to retrieve the subqueries from 
> compound queries, like for instance have a new abstract base class with a 
> getLeaves or getSubQueries method and have all the compound queries extend 
> it. What this method would do is return a flat array of all the leaf queries 
> that the compound query is made of. 
> Not sure whether this would be needed in other places in lucene, but it 
> doesn't seem like a small change and it would definitely affect (or benefit?) 
> more than just the postings highlighter support for multi term queries.
> In particular the second problem (custom queries) seems hard to solve without 
> a way to expose this info directly from the query though, unless we want to 
> make the MultiTermHighlighting#extractAutomata method extensible in some way.
> Would like to hear what people think and work on this as soon as we 
> identified which direction we want to take.






[jira] [Commented] (LUCENE-8664) Add equals/hashcode to TotalHits

2019-01-29 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755016#comment-16755016
 ] 

Luca Cavanna commented on LUCENE-8664:
--

I am not using TotalHits in a map. I would benefit from the equals method for comparisons in tests. For instance, in Elasticsearch we return the Lucene TotalHits to users as part of bigger objects that have their own equals method. We end up wrapping TotalHits in another internal class that has its own equals/hashcode (among others). Having equals/hashcode built into Lucene would remove the need for a wrapper class, and would make equality comparisons a one-liner, especially when comparing multiple instances of objects holding TotalHits. This is obviously a minor thing, but I would not consider it a bug to treat two different TotalHits instances that have the same value and relation as equal. I was chatting with [~jim.ferenczi] about this and we thought we should propose adding it to Lucene. Happy to close this if you think it should not be done.

> Add equals/hashcode to TotalHits
> 
>
> Key: LUCENE-8664
> URL: https://issues.apache.org/jira/browse/LUCENE-8664
> Project: Lucene - Core
>  Issue Type: Improvement
>        Reporter: Luca Cavanna
>Priority: Minor
>
> I think it would be convenient to add equals/hashcode methods to the 
> TotalHits class. I opened a PR here: 
> [https://github.com/apache/lucene-solr/pull/552] .






[jira] [Created] (LUCENE-8664) Add equals/hashcode to TotalHits

2019-01-29 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-8664:


 Summary: Add equals/hashcode to TotalHits
 Key: LUCENE-8664
 URL: https://issues.apache.org/jira/browse/LUCENE-8664
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Luca Cavanna


I think it would be convenient to add equals/hashcode methods to the TotalHits 
class. I opened a PR here: [https://github.com/apache/lucene-solr/pull/552] .






[jira] [Created] (LUCENE-8591) LegacyBM25Similarity doesn't expose getDiscountOverlaps

2018-12-05 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-8591:


 Summary: LegacyBM25Similarity doesn't expose getDiscountOverlaps
 Key: LUCENE-8591
 URL: https://issues.apache.org/jira/browse/LUCENE-8591
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Luca Cavanna
Assignee: Luca Cavanna


When I worked on LUCENE-8563 I intended to expose all the needed public methods 
that BM25Similarity exposes, but I forgot to add getDiscountOverlaps.






[jira] [Commented] (LUCENE-8591) LegacyBM25Similarity doesn't expose getDiscountOverlaps

2018-12-05 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710228#comment-16710228
 ] 

Luca Cavanna commented on LUCENE-8591:
--

I just opened [https://github.com/apache/lucene-solr/pull/514] 

> LegacyBM25Similarity doesn't expose getDiscountOverlaps
> ---
>
> Key: LUCENE-8591
> URL: https://issues.apache.org/jira/browse/LUCENE-8591
> Project: Lucene - Core
>  Issue Type: Improvement
>        Reporter: Luca Cavanna
>    Assignee: Luca Cavanna
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I worked on LUCENE-8563 I intended to expose all the needed public 
> methods that BM25Similarity exposes, but I forgot to add getDiscountOverlaps.






[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-29 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703382#comment-16703382
 ] 

Luca Cavanna commented on LUCENE-8563:
--

I updated the PR according to the latest comments, and deprecated the newly 
introduced similarity like Robert suggested.

> Remove k1+1 from the numerator of  BM25Similarity
> -
>
> Key: LUCENE-8563
> URL: https://issues.apache.org/jira/browse/LUCENE-8563
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify 
> ordering. It is often omitted and I found out that the "The Probabilistic 
> Relevance Framework: BM25 and Beyond" paper by Robertson (BM25's author) and 
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose 
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which 
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
>  numerator of the saturation function. This is the same for all
>  terms, and therefore does not affect the ranking produced.
>  The reason for including it was to make the final formula
>  more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score 
> contributions (eg. via oal.document.FeatureField) would be a bit easier to 
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery 
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%) 
> rather than a term whose IDF is 3/(k1 + 1).
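Since (k1 + 1) multiplies every term's contribution by the same constant, dropping it rescales scores without changing their order. A toy check with illustrative values (this is not Lucene's actual BM25Similarity, and the norm value here is made up):

```java
public final class Bm25Factor {
  static final float K1 = 1.2f;

  // BM25 term score with the constant (k1 + 1) factor in the numerator.
  static float withFactor(float idf, float tf, float norm) {
    return idf * (K1 + 1) * tf / (tf + norm);
  }

  // The same score with the constant factor dropped.
  static float withoutFactor(float idf, float tf, float norm) {
    return idf * tf / (tf + norm);
  }
}
```

Because the two formulas differ only by the constant multiplier (k1 + 1), any two documents compare the same way under either one, which is why removing it cannot affect ranking.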






[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-28 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702294#comment-16702294
 ] 

Luca Cavanna commented on LUCENE-8563:
--

I opened [https://github.com/apache/lucene-solr/pull/511] . 

> Remove k1+1 from the numerator of  BM25Similarity
> -
>
> Key: LUCENE-8563
> URL: https://issues.apache.org/jira/browse/LUCENE-8563
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify 
> ordering. It is often omitted and I found out that the "The Probabilistic 
> Relevance Framework: BM25 and Beyond" paper by Robertson (BM25's author) and 
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose 
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which 
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
>  numerator of the saturation function. This is the same for all
>  terms, and therefore does not affect the ranking produced.
>  The reason for including it was to make the final formula
>  more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score 
> contributions (eg. via oal.document.FeatureField) would be a bit easier to 
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery 
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%) 
> rather than a term whose IDF is 3/(k1 + 1).






[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-14 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686370#comment-16686370
 ] 

Luca Cavanna commented on LUCENE-8563:
--

Hi folks, I would like to work on this issue.

> Remove k1+1 from the numerator of  BM25Similarity
> -
>
> Key: LUCENE-8563
> URL: https://issues.apache.org/jira/browse/LUCENE-8563
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify 
> ordering. It is often omitted and I found out that the "The Probabilistic 
> Relevance Framework: BM25 and Beyond" paper by Robertson (BM25's author) and 
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose 
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which 
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
>  numerator of the saturation function. This is the same for all
>  terms, and therefore does not affect the ranking produced.
>  The reason for including it was to make the final formula
>  more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score 
> contributions (eg. via oal.document.FeatureField) would be a bit easier to 
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery 
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%) 
> rather than a term whose IDF is 3/(k1 + 1).






[jira] [Created] (LUCENE-6485) Add a custom separator break iterator

2015-05-15 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-6485:


 Summary: Add a custom separator break iterator
 Key: LUCENE-6485
 URL: https://issues.apache.org/jira/browse/LUCENE-6485
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Luca Cavanna


Lucene currently includes a WholeBreakIterator, used to highlight entire fields with the postings highlighter without breaking their content into sentences.

I would like to contribute a CustomSeparatorBreakIterator that breaks when a custom char separator is found in the text. This can be used, for instance, when wanting to highlight entire fields value by value. One can subclass PostingsHighlighter and have getMultiValueSeparator return a control character, like U+ , then use the custom break iterator to break on U+ so that one snippet per value is generated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6485) Add a custom separator break iterator

2015-05-15 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-6485:
-
Attachment: LUCENE-6485.patch

Patch attached

 Add a custom separator break iterator
 -

 Key: LUCENE-6485
 URL: https://issues.apache.org/jira/browse/LUCENE-6485
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Luca Cavanna
 Attachments: LUCENE-6485.patch


 Lucene currently includes a WholeBreakIterator used to highlight entire 
 fields using the postings highlighter, without breaking their content into 
 sentences.
 I would like to contribute a CustomSeparatorBreakIterator that breaks when a 
 custom char separator is found in the text. This can be used for instance 
 when wanting to highlight entire fields, value per value. One can subclass 
 PostingsHighlighter and have getMultiValueSeparator return a control 
 character, like U+ , then use the custom break iterator to break on 
 U+ so that one snippet per value will be generated.






[jira] [Commented] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter

2015-05-15 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546095#comment-14546095
 ] 

Luca Cavanna commented on LUCENE-5718:
--

Was wondering if there is interest in this patch, or whether there are better solutions by now for the custom compound queries use case. I can revive the patch if needed; let me know what you think.

 More flexible compound queries (containing mtq) support in postings 
 highlighter
 ---

 Key: LUCENE-5718
 URL: https://issues.apache.org/jira/browse/LUCENE-5718
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.8.1
Reporter: Luca Cavanna
 Attachments: LUCENE-5718.patch


 The postings highlighter currently pulls the automata from multi term queries 
 and doesn't require calling rewrite to make highlighting work. In order to do 
 so it also needs to check whether the query is a compound one and eventually 
 extract its subqueries. This is currently done in the MultiTermHighlighting 
 class and works well but has two potential problems:
 1) not all the possible compound queries are necessarily supported as we need 
 to go over each of them one by one (see LUCENE-5717) and this requires 
 keeping the switch up-to-date if new queries get added to Lucene
 2) it doesn't support custom compound queries but only the set of queries 
 available out-of-the-box
 I've been thinking about how this can be improved and one of the ideas I came 
 up with is to introduce a generic way to retrieve the subqueries from 
 compound queries, like for instance have a new abstract base class with a 
 getLeaves or getSubQueries method and have all the compound queries extend 
 it. What this method would do is return a flat array of all the leaf queries 
 that the compound query is made of. 
 Not sure whether this would be needed in other places in lucene, but it 
 doesn't seem like a small change and it would definitely affect (or benefit?) 
 more than just the postings highlighter support for multi term queries.
 In particular the second problem (custom queries) seems hard to solve without 
 a way to expose this info directly from the query though, unless we want to 
 make the MultiTermHighlighting#extractAutomata method extensible in some way.
 Would like to hear what people think and work on this as soon as we 
 identified which direction we want to take.






[jira] [Updated] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter

2014-06-01 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-5718:
-

Attachment: LUCENE-5718.patch

I like your extractQueries idea. I gave it a shot, patch attached.

The main difference compared to extractTerms is that it adds the query itself 
to the list by default instead of throwing UnsupportedOperationException. Also, 
I think this one doesn't necessarily require calling rewrite (not totally sure 
though). I overrode the extractQueries method for all the queries that contain 
one or more sub-queries; let's see if that's too many or if I missed any... you 
tell me ;)
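The extractQueries idea can be sketched in plain Java with stand-in classes (not the real Lucene API): by default a query adds itself to the output list, and compound queries override the method to recurse into their sub-queries, so no UnsupportedOperationException is ever thrown:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins for Lucene classes, not the real API.
abstract class Query {
    /** By default a query is a leaf: it adds itself. Compound queries
        override this to recurse into their sub-queries instead. */
    public void extractQueries(List<Query> out) {
        out.add(this);
    }
}

class TermQuery extends Query {
    final String term;
    TermQuery(String term) { this.term = term; }
    @Override public String toString() { return term; }
}

class BooleanQuery extends Query {
    final List<Query> clauses = new ArrayList<>();
    BooleanQuery add(Query q) { clauses.add(q); return this; }
    @Override public void extractQueries(List<Query> out) {
        for (Query q : clauses) {
            q.extractQueries(out); // flatten nested compounds recursively
        }
    }
}

public class ExtractQueriesDemo {
    public static List<Query> flatten(Query root) {
        List<Query> out = new ArrayList<>();
        root.extractQueries(out);
        return out;
    }

    public static void main(String[] args) {
        Query q = new BooleanQuery()
            .add(new TermQuery("fiets"))
            .add(new BooleanQuery().add(new TermQuery("fietsen")));
        System.out.println(flatten(q)); // only the leaf queries remain
    }
}
```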

 More flexible compound queries (containing mtq) support in postings 
 highlighter
 ---

 Key: LUCENE-5718
 URL: https://issues.apache.org/jira/browse/LUCENE-5718
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.8.1
Reporter: Luca Cavanna
 Attachments: LUCENE-5718.patch


 The postings highlighter currently pulls the automata from multi term queries 
 and doesn't require calling rewrite to make highlighting work. In order to do 
 so it also needs to check whether the query is a compound one and eventually 
 extract its subqueries. This is currently done in the MultiTermHighlighting 
 class and works well but has two potential problems:
 1) not all the possible compound queries are necessarily supported as we need 
 to go over each of them one by one (see LUCENE-5717) and this requires 
 keeping the switch up-to-date if new queries get added to Lucene
 2) it doesn't support custom compound queries but only the set of queries 
 available out-of-the-box
 I've been thinking about how this can be improved and one of the ideas I came 
 up with is to introduce a generic way to retrieve the subqueries from 
 compound queries, like for instance have a new abstract base class with a 
 getLeaves or getSubQueries method and have all the compound queries extend 
 it. What this method would do is return a flat array of all the leaf queries 
 that the compound query is made of. 
 Not sure whether this would be needed in other places in lucene, but it 
 doesn't seem like a small change and it would definitely affect (or benefit?) 
 more than just the postings highlighter support for multi term queries.
 In particular the second problem (custom queries) seems hard to solve without 
 a way to expose this info directly from the query though, unless we want to 
 make the MultiTermHighlighting#extractAutomata method extensible in some way.
 Would like to hear what people think and work on this as soon as we 
 identified which direction we want to take.






[jira] [Commented] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter

2014-06-01 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014994#comment-14014994
 ] 

Luca Cavanna commented on LUCENE-5718:
--

Also worth mentioning that my patch addresses only the compound queries 
usecase. It leaves the automaton related work for the different multi term 
queries as it is (in MultiTermHighlighting).

 More flexible compound queries (containing mtq) support in postings 
 highlighter
 ---

 Key: LUCENE-5718
 URL: https://issues.apache.org/jira/browse/LUCENE-5718
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.8.1
Reporter: Luca Cavanna
 Attachments: LUCENE-5718.patch


 The postings highlighter currently pulls the automata from multi term queries 
 and doesn't require calling rewrite to make highlighting work. In order to do 
 so it also needs to check whether the query is a compound one and eventually 
 extract its subqueries. This is currently done in the MultiTermHighlighting 
 class and works well but has two potential problems:
 1) not all the possible compound queries are necessarily supported as we need 
 to go over each of them one by one (see LUCENE-5717) and this requires 
 keeping the switch up-to-date if new queries get added to Lucene
 2) it doesn't support custom compound queries but only the set of queries 
 available out-of-the-box
 I've been thinking about how this can be improved and one of the ideas I came 
 up with is to introduce a generic way to retrieve the subqueries from 
 compound queries, like for instance have a new abstract base class with a 
 getLeaves or getSubQueries method and have all the compound queries extend 
 it. What this method would do is return a flat array of all the leaf queries 
 that the compound query is made of. 
 Not sure whether this would be needed in other places in lucene, but it 
 doesn't seem like a small change and it would definitely affect (or benefit?) 
 more than just the postings highlighter support for multi term queries.
 In particular the second problem (custom queries) seems hard to solve without 
 a way to expose this info directly from the query though, unless we want to 
 make the MultiTermHighlighting#extractAutomata method extensible in some way.
 Would like to hear what people think and work on this as soon as we 
 identified which direction we want to take.






[jira] [Created] (LUCENE-5717) Postings highlighter support for multi term queries within filtered and constant score queries

2014-05-30 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-5717:


 Summary: Postings highlighter support for multi term queries 
within filtered and constant score queries
 Key: LUCENE-5717
 URL: https://issues.apache.org/jira/browse/LUCENE-5717
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.8.1
Reporter: Luca Cavanna


The automata extraction that is done to make multi term queries work with the 
postings highlighter does support boolean queries but it should also support 
other compound queries like filtered and constant score.






[jira] [Updated] (LUCENE-5717) Postings highlighter support for multi term queries within filtered and constant score queries

2014-05-30 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-5717:
-

Attachment: LUCENE-5717.patch

First patch attached. At this time there's no generic way to retrieve 
sub-queries from compound queries, thus I could only add two more ifs to the 
existing extractAutomata method. Maybe it's worth discussing if there's a way 
to make this more generic in a separate issue. Also not sure if there are 
other compound queries that I missed.
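The "two more ifs" shape can be sketched in plain Java with stand-in classes (the real ones live in org.apache.lucene.search): each known wrapper type gets its own instanceof branch, and any type missing from the switch is silently treated as a leaf, which is exactly the maintenance hazard the follow-up issue is about:

```java
// Minimal stand-ins, not the real Lucene classes.
class FilteredQuery { final Object inner; FilteredQuery(Object q) { inner = q; } }
class ConstantScoreQuery { final Object inner; ConstantScoreQuery(Object q) { inner = q; } }

public class UnwrapDemo {
    /** One branch per known compound type: the switch this patch grows
        with filtered and constant-score queries. */
    public static Object unwrap(Object query) {
        if (query instanceof FilteredQuery) {
            return unwrap(((FilteredQuery) query).inner);
        } else if (query instanceof ConstantScoreQuery) {
            return unwrap(((ConstantScoreQuery) query).inner);
        }
        return query; // leaf, or an unsupported compound type
    }

    public static void main(String[] args) {
        Object leaf = "prefix:fiets*";
        System.out.println(unwrap(new ConstantScoreQuery(new FilteredQuery(leaf))));
    }
}
```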

 Postings highlighter support for multi term queries within filtered and 
 constant score queries
 --

 Key: LUCENE-5717
 URL: https://issues.apache.org/jira/browse/LUCENE-5717
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.8.1
Reporter: Luca Cavanna
 Attachments: LUCENE-5717.patch


 The automata extraction that is done to make multi term queries work with the 
 postings highlighter does support boolean queries but it should also support 
 other compound queries like filtered and constant score.






[jira] [Created] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter

2014-05-30 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-5718:


 Summary: More flexible compound queries (containing mtq) support 
in postings highlighter
 Key: LUCENE-5718
 URL: https://issues.apache.org/jira/browse/LUCENE-5718
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.8.1
Reporter: Luca Cavanna


The postings highlighter currently pulls the automata from multi term queries 
and doesn't require calling rewrite to make highlighting work. In order to do 
so it also needs to check whether the query is a compound one and eventually 
extract its subqueries. This is currently done in the MultiTermHighlighting 
class and works well but has two potential problems:

1) not all the possible compound queries are necessarily supported as we need 
to go over each of them one by one (see LUCENE-5717) and this requires keeping 
the switch up-to-date if new queries get added to Lucene
2) it doesn't support custom compound queries but only the set of queries 
available out-of-the-box

I've been thinking about how this can be improved and one of the ideas I came 
up with is to introduce a generic way to retrieve the subqueries from compound 
queries, like for instance have a new abstract base class with a getLeaves or 
getSubQueries method and have all the compound queries extend it. What this 
method would do is return a flat array of all the leaf queries that the 
compound query is made of. 

Not sure whether this would be needed in other places in lucene, but it doesn't 
seem like a small change and it would definitely affect (or benefit?) more than 
just the postings highlighter support for multi term queries.

In particular the second problem (custom queries) seems hard to solve without a 
way to expose this info directly from the query though, unless we want to make 
the MultiTermHighlighting#extractAutomata method extensible in some way.

Would like to hear what people think and work on this as soon as we identified 
which direction we want to take.
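The abstract-base-class idea can be sketched in plain Java; the names here (CompoundQuery, getSubQueries, DisjunctionQuery) are hypothetical, not existing Lucene API. Highlighting code would then only need one type check instead of a per-type switch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical base class for all compound queries.
abstract class CompoundQuery {
    /** The direct sub-queries this compound query is made of. */
    public abstract List<Object> getSubQueries();
}

// One illustrative compound query; any custom query could extend the base too.
class DisjunctionQuery extends CompoundQuery {
    private final List<Object> subs;
    DisjunctionQuery(Object... subs) { this.subs = Arrays.asList(subs); }
    @Override public List<Object> getSubQueries() { return subs; }
}

public class FlattenDemo {
    /** Flattens any CompoundQuery into the flat array of leaf queries
        that the issue description asks for. */
    public static List<Object> leaves(Object query) {
        List<Object> out = new ArrayList<>();
        if (query instanceof CompoundQuery) {
            for (Object sub : ((CompoundQuery) query).getSubQueries()) {
                out.addAll(leaves(sub));
            }
        } else {
            out.add(query);
        }
        return out;
    }

    public static void main(String[] args) {
        Object q = new DisjunctionQuery("a", new DisjunctionQuery("b", "c"));
        System.out.println(leaves(q)); // prints [a, b, c]
    }
}
```

Because custom compound queries would extend the same base class, they would be supported without touching the highlighter.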






[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-09-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765400#comment-13765400
 ] 

Luca Cavanna commented on LUCENE-4906:
--

How about committing this? Would be great to have it with the next release!

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-4906.patch, LUCENE-4906.patch


 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-09-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765726#comment-13765726
 ] 

Luca Cavanna commented on LUCENE-4906:
--

Thanks Mike!

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 5.0, 4.6

 Attachments: LUCENE-4906.patch, LUCENE-4906.patch, LUCENE-4906.patch


 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.




[jira] [Commented] (LUCENE-5057) Hunspell stemmer generates multiple tokens

2013-09-01 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13755677#comment-13755677
 ] 

Luca Cavanna commented on LUCENE-5057:
--

Hi Lukas,
can you share your findings? In my case it seemed to be a dictionary problem, 
but I'm curious to hear what you experienced.




 Hunspell stemmer generates multiple tokens
 --

 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna
Assignee: Adrien Grand

 The hunspell stemmer seems to be generating multiple tokens: the original 
 token plus the available stems.
 It might be a good thing in some cases but it seems to be a different 
 behaviour compared to the other stemmers and causes problems as well. I would 
 rather have an option to decide whether it should output only the available 
 stems, or the stems plus the original token. I'm not sure though if it's 
 possible to have only a single stem indexed, which would be even better in my 
 opinion. When I look at how snowball works only one token is indexed, the 
 stem, and that works great. Probably there's something I'm missing in how 
 hunspell works.
 Here is my issue: I have a query composed of multiple terms, which is 
 analyzed using stemming and a boolean query is generated out of it. All fine 
 when adding all clauses as should (OR operator), but if I add all clauses as 
 must (AND operator), then I can get back only the documents that contain the 
 stem originating from the exact same original word.
 Example for the Dutch language I'm working with: "fiets" (means bicycle in 
 Dutch), its plural is "fietsen".
 If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
 "fiets" I get only "fiets" indexed.
 When I query for "fietsen whatever" I get the following boolean query: 
 field:fiets field:fietsen field:whatever.
 If I apply the AND operator and use must clauses for each subquery, then I 
 can only find the documents that originally contained "fietsen", not the ones 
 that originally contained "fiets", which is not really what stemming is about.
 Any thoughts on this? I also wonder if it can be a dictionary issue, since I 
 see that different words that have the word "fiets" as root don't get the 
 same stems, and using the AND operator at query time is a big issue.
 I would love to contribute on this and looking forward to your feedback.
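The AND-vs-OR problem above can be reproduced with plain sets, assuming the index-time tokens described in the issue (original token plus stems); document names and values here are taken from the example, everything else is illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StemAndDemo {
    // Index-time tokens per document, assuming the stemmer keeps the
    // original token plus its stems, as the issue describes.
    static final Map<String, Set<String>> INDEX = Map.of(
        "doc-fietsen", Set.of("fietsen", "fiets"),
        "doc-fiets", Set.of("fiets"));

    /** MUST semantics: every analyzed query term has to be present. */
    static boolean matchesAll(String doc, List<String> terms) {
        return INDEX.get(doc).containsAll(terms);
    }

    public static void main(String[] args) {
        // The query "fietsen" analyzes to both the original and the stem;
        // with AND, the doc that only contained "fiets" no longer matches.
        List<String> analyzed = List.of("fiets", "fietsen");
        System.out.println(matchesAll("doc-fietsen", analyzed)); // prints true
        System.out.println(matchesAll("doc-fiets", analyzed));   // prints false
    }
}
```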




[jira] [Commented] (LUCENE-5181) Passage knows its own docID

2013-08-22 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13747493#comment-13747493
 ] 

Luca Cavanna commented on LUCENE-5181:
--

True, having the doc ID would be useful there. Why not add it directly to 
the Passage, to be able to know which document the Passage comes from?

 Passage knows its own docID
 ---

 Key: LUCENE-5181
 URL: https://issues.apache.org/jira/browse/LUCENE-5181
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.4
Reporter: Jon Stewart
Priority: Minor

 The new PostingsHighlight package allows for retrieval of term matches from a 
 query if one creates a class that extends PassageFormatter and overrides 
 format(). However, class Passage does not have a docID field, nor is this 
 provided via PassageFormatter.format(). Therefore, it's very difficult to 
 know which Document contains a given Passage.
 It would suffice for PassageFormatter.format() to be passed the docID as a 
 parameter. From the code in PostingsHighlight, this seems like it would be 
 easy.




[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-08-14 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13740107#comment-13740107
 ] 

Luca Cavanna commented on LUCENE-4906:
--

No problem, thanks Mike!

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-4906.patch, LUCENE-4906.patch


 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.




[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-08-11 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736240#comment-13736240
 ] 

Luca Cavanna commented on LUCENE-4906:
--

Hi Mike,
I definitely agree that highlighting api should be simple and the postings 
highlighter is probably the only one that's really easy to use.

On the other hand, I think it's good to make explicit that if you use a 
Formatter<YourObject>, YourObject is what you're going to get back from the 
highlighter. People using the string version wouldn't notice the change, while 
advanced users would have to extend the base class and get type safety too, 
which in my opinion makes it clearer and easier. Using Object feels to me a 
little old-fashioned and bogus, but again that's probably me :)

I do trust your experience though. If you think the object version is better 
that's fine with me. What I care about is that this improvement gets committed 
soon, since it's a really useful one ;)

Thanks a lot for sharing your ideas

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-4906.patch, LUCENE-4906.patch


 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.




[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-08-11 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736242#comment-13736242
 ] 

Luca Cavanna commented on LUCENE-4906:
--

One more thing: re-reading Robert's previous comments, I also find interesting 
the idea he had about changing the API to return a proper object instead of the 
Map<String, String[]>, or the String[] for the simplest methods. I wonder if 
it's worth addressing this as well in this issue, or if the current API is 
clear enough in your opinion. Any thoughts?

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-4906.patch, LUCENE-4906.patch


 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.




[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-08-09 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734623#comment-13734623
 ] 

Luca Cavanna commented on LUCENE-4906:
--

Hi Mike,
I had a look at your patch, looks good to me. Being able to get back arbitrary 
objects is a great improvement.

The only thing I would love to improve here is the need to cast the returned 
Objects to the type that the custom PassageFormatter uses.

We could work around this using generics, but the fact that the 
PassageFormatter can vary per field makes it harder. The only way I see to work 
around this is to prevent the PassageFormatter from returning different types 
of objects per field. That would mean that even though every field can have its 
own PassageFormatter, they all must return the same type. It kinda makes sense 
to me since I wouldn't want to have heterogeneous types in the Map<Integer, 
Object>, but that is something that's currently possible. What do you think?

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-4906.patch


 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.




[jira] [Updated] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-08-09 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-4906:
-

Attachment: LUCENE-4906.patch

I don't see why adding generics would complicate or limit the API. To me it 
would make it simpler and nicer (not a big change in terms of the API itself 
though).

Attaching a patch with my thoughts to make more concrete what I had in mind, 
regardless of whether it will be integrated or not.

It's backwards compatible (even though the class is marked experimental): we 
have an abstract postings highlighter class that does most of the work and 
returns arbitrary objects (uses generics in order to do so). The 
PostingsHighlighter is its natural extension that returns String snippets.
 
I updated Mike's new test according to my changes. It should make it easier to 
understand what's needed to work with arbitrary objects in terms of code using 
this approach.

I like that if you want to extend the abstract one you have to declare what 
type the formatter is supposed to return, which makes the contract explicit 
and avoids any cast.

Limitations with this approach: 
1) as mentioned before (to me it's more of a benefit) there cannot be 
heterogeneous types returned by the same highlighter instance.
2) generics don't play well with arrays, thus all the highlight methods that 
returned arrays are still in the subclass that returns string snippets to keep 
backwards compatibility. Moving them to the base class would most likely 
require returning List<FormattedPassage> instead (not backward compatible).

I haven't updated the javadoc yet, but if you like my approach I can go ahead 
with it.

I would love to hear what you guys think about it. Generics can be scary... but 
useful sometimes too ;)
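The generics-based design can be sketched in plain Java: an abstract highlighter typed on what its formatter produces, plus a String-returning subclass for backwards compatibility. Class names and the `<b>` markup are illustrative, not the committed Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

// Formatter typed on what it renders each passage to.
abstract class PassageFormatter<T> {
    public abstract T format(String passageText);
}

abstract class AbstractHighlighter<T> {
    protected abstract PassageFormatter<T> getFormatter();

    /** Formats each passage with the typed formatter: no casts needed. */
    public List<T> highlight(List<String> passages) {
        List<T> out = new ArrayList<>();
        for (String p : passages) {
            out.add(getFormatter().format(p));
        }
        return out;
    }
}

// Backwards-compatible default: renders plain String snippets.
class StringHighlighter extends AbstractHighlighter<String> {
    @Override protected PassageFormatter<String> getFormatter() {
        return new PassageFormatter<String>() {
            @Override public String format(String passageText) {
                return "<b>" + passageText + "</b>";
            }
        };
    }
}

public class GenericsDemo {
    public static void main(String[] args) {
        // A server-side subclass could instead use AbstractHighlighter<JsonObject>.
        System.out.println(new StringHighlighter().highlight(List.of("fiets")));
    }
}
```

Because the type parameter is fixed per highlighter instance, heterogeneous return types per field are ruled out by construction, which is limitation 1) above.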

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-4906.patch, LUCENE-4906.patch


 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.




[jira] [Commented] (LUCENE-5057) Hunspell stemmer generates multiple tokens

2013-07-19 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713487#comment-13713487
 ] 

Luca Cavanna commented on LUCENE-5057:
--

Thanks Adrien for looking into this, nice explanation!

 Hunspell stemmer generates multiple tokens
 --

 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna
Assignee: Adrien Grand

 The hunspell stemmer seems to be generating multiple tokens: the original 
 token plus the available stems.
 It might be a good thing in some cases but it seems to be a different 
 behaviour compared to the other stemmers and causes problems as well. I would 
 rather have an option to decide whether it should output only the available 
 stems, or the stems plus the original token. I'm not sure though if it's 
 possible to have only a single stem indexed, which would be even better in my 
 opinion. When I look at how snowball works only one token is indexed, the 
 stem, and that works great. Probably there's something I'm missing in how 
 hunspell works.
 Here is my issue: I have a query composed of multiple terms, which is 
 analyzed using stemming and a boolean query is generated out of it. All fine 
 when adding all clauses as should (OR operator), but if I add all clauses as 
 must (AND operator), then I can get back only the documents that contain the 
 stem originated by the exactly same original word.
 Example for the dutch language I'm working with: "fiets" (means bicycle in 
 dutch), its plural is "fietsen".
 If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
 "fiets" I get only "fiets" indexed.
 When I query for "fietsen whatever" I get the following boolean query: 
 field:fiets field:fietsen field:whatever.
 If I apply the AND operator and use must clauses for each subquery, then I 
 can only find the documents that originally contained "fietsen", not the ones 
 that originally contained "fiets", which is not really what stemming is about.
 Any thoughts on this? I also wonder if it can be a dictionary issue, since I 
 see that different words that have the word "fiets" as root don't get the 
 same stems, and using the AND operator at query time is a big issue.
 I would love to contribute on this and looking forward to your feedback.
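The failure mode described in the issue can be shown with a tiny simulation. This is a toy sketch under stated assumptions: the two-entry "dictionary", the analyze method, and the match check are all illustrative stand-ins for the hunspell filter and Lucene's boolean query, not the real implementations.

```java
// Toy simulation of the problem: a hunspell-like analyzer that emits the
// original token plus its stem, and a conjunctive (MUST/AND) match over
// the analyzed query terms. All names and data here are illustrative.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class HunspellAndSketch {

    // original token -> stem (stand-in for a hunspell dictionary entry)
    static final Map<String, String> STEMS = Map.of("fietsen", "fiets");

    // Emits the original token plus its stem, like the hunspell filter does.
    static Set<String> analyze(String token) {
        Set<String> out = new HashSet<>();
        out.add(token);
        String stem = STEMS.get(token);
        if (stem != null) {
            out.add(stem);
        }
        return out;
    }

    // A MUST (AND) query matches only if the document has every query term.
    static boolean matchesAllTerms(Set<String> indexedTerms, Set<String> queryTerms) {
        return indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        Set<String> docWithFietsen = analyze("fietsen"); // indexes {fietsen, fiets}
        Set<String> docWithFiets = analyze("fiets");     // indexes only {fiets}

        // Query "fietsen" analyzes to {fietsen, fiets}; with MUST clauses
        // the original token "fietsen" is also required.
        Set<String> queryTerms = analyze("fietsen");

        System.out.println(matchesAllTerms(docWithFietsen, queryTerms)); // true
        System.out.println(matchesAllTerms(docWithFiets, queryTerms));   // false
    }
}
```

The second check is the complaint in the issue: the document that originally contained "fiets" is lost under AND, because the query also requires the un-stemmed term "fietsen".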




[jira] [Created] (LUCENE-5057) Hunspell stemmer generates multiple tokens (original + stems)

2013-06-14 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-5057:


 Summary: Hunspell stemmer generates multiple tokens (original + 
stems)
 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna


The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming and a boolean query is generated out of it. All fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originated by the exactly same original word.

Example for the dutch language I'm working with: "fiets" (means bicycle in 
dutch), its plural is "fietsen".

If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
"fiets" I get only "fiets" indexed.

When I query for "fietsen whatever" I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained "fietsen", not the ones that 
originally contained "fiets", which is not really what stemming is about.

Any thoughts on this? I would work out a patch, I'd just need some help 
deciding the name of the option and what the default behaviour should be.




[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens

2013-06-14 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-5057:
-

Summary: Hunspell stemmer generates multiple tokens  (was: Hunspell stemmer 
generates multiple tokens (original + stems))

 Hunspell stemmer generates multiple tokens
 --

 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna

 The hunspell stemmer seems to be generating multiple tokens: the original 
 token plus the available stems.
 It might be a good thing in some cases but it seems to be a different 
 behaviour compared to the other stemmers and causes problems as well. I would 
 rather have an option to decide whether it should output only the available 
 stems, or the stems plus the original token.
 Here is my issue: I have a query composed of multiple terms, which is 
 analyzed using stemming and a boolean query is generated out of it. All fine 
 when adding all clauses as should (OR operator), but if I add all clauses as 
 must (AND operator), then I can get back only the documents that contain the 
 stem originated by the exactly same original word.
 Example for the dutch language I'm working with: "fiets" (means bicycle in 
 dutch), its plural is "fietsen".
 If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
 "fiets" I get only "fiets" indexed.
 When I query for "fietsen whatever" I get the following boolean query: 
 field:fiets field:fietsen field:whatever.
 If I apply the AND operator and use must clauses for each subquery, then I 
 can only find the documents that originally contained "fietsen", not the ones 
 that originally contained "fiets", which is not really what stemming is about.
 Any thoughts on this? I would work out a patch, I'd just need some help 
 deciding the name of the option and what the default behaviour should be.




[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens

2013-06-14 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-5057:
-

Description: 
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token. I'm not sure though if it's possible to have 
only a single stem indexed.

When I look at how snowball works, only one token is indexed, the stem, and 
that works great.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming and a boolean query is generated out of it. All fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originated by the exactly same original word.

Example for the dutch language I'm working with: "fiets" (means bicycle in 
dutch), its plural is "fietsen".

If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
"fiets" I get only "fiets" indexed.

When I query for "fietsen whatever" I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained "fietsen", not the ones that 
originally contained "fiets", which is not really what stemming is about.

Any thoughts on this? I would work out a patch, I'd just need some help 
deciding the name of the option and what the default behaviour should be.

  was:
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming and a boolean query is generated out of it. All fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originated by the exactly same original word.

Example for the dutch language I'm working with: "fiets" (means bicycle in 
dutch), its plural is "fietsen".

If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
"fiets" I get only "fiets" indexed.

When I query for "fietsen whatever" I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained "fietsen", not the ones that 
originally contained "fiets", which is not really what stemming is about.

Any thoughts on this? I would work out a patch, I'd just need some help 
deciding the name of the option and what the default behaviour should be.


 Hunspell stemmer generates multiple tokens
 --

 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna

 The hunspell stemmer seems to be generating multiple tokens: the original 
 token plus the available stems.
 It might be a good thing in some cases but it seems to be a different 
 behaviour compared to the other stemmers and causes problems as well. I would 
 rather have an option to decide whether it should output only the available 
 stems, or the stems plus the original token. I'm not sure though if it's 
 possible to have only a single stem indexed.
 When I look at how snowball works, only one token is indexed, the stem, and 
 that works great.
 Here is my issue: I have a query composed of multiple terms, which is 
 analyzed using stemming and a boolean query is generated out of it. All fine 
 when adding all clauses as should (OR operator), but if I add all clauses as 
 must (AND operator), then I can get back only the documents that contain the 
 stem originated by the exactly same original word.
 Example for the dutch language I'm working with: "fiets" (means bicycle in 
 dutch), its plural is "fietsen".
 If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
 "fiets" I get only "fiets" indexed.
 When I query for "fietsen whatever" I get the following boolean query: 
 field:fiets field:fietsen field:whatever.
 If I apply the AND operator and use must clauses for each subquery, then I 
 can only find the documents that originally contained

[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens

2013-06-14 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-5057:
-

Description: 
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token. I'm not sure though if it's possible to have 
only a single stem indexed, which would be even better in my opinion. When I 
look at how snowball works only one token is indexed, the stem, and that works 
great. Probably there's something I'm missing in how hunspell works.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming and a boolean query is generated out of it. All fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originated by the exactly same original word.

Example for the dutch language I'm working with: "fiets" (means bicycle in 
dutch), its plural is "fietsen".

If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
"fiets" I get only "fiets" indexed.

When I query for "fietsen whatever" I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained "fietsen", not the ones that 
originally contained "fiets", which is not really what stemming is about.

Any thoughts on this? I wonder if it can be a dictionary issue, since I see 
that different words that have the word "fiets" as root don't get the same 
stems, and using the AND operator at query time is a big issue.



  was:
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token. I'm not sure though if it's possible to have 
only a single stem indexed.

When I look at how snowball works, only one token is indexed, the stem, and 
that works great.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming and a boolean query is generated out of it. All fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originated by the exactly same original word.

Example for the dutch language I'm working with: "fiets" (means bicycle in 
dutch), its plural is "fietsen".

If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
"fiets" I get only "fiets" indexed.

When I query for "fietsen whatever" I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained "fietsen", not the ones that 
originally contained "fiets", which is not really what stemming is about.

Any thoughts on this? I would work out a patch, I'd just need some help 
deciding the name of the option and what the default behaviour should be.


 Hunspell stemmer generates multiple tokens
 --

 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna

 The hunspell stemmer seems to be generating multiple tokens: the original 
 token plus the available stems.
 It might be a good thing in some cases but it seems to be a different 
 behaviour compared to the other stemmers and causes problems as well. I would 
 rather have an option to decide whether it should output only the available 
 stems, or the stems plus the original token. I'm not sure though if it's 
 possible to have only a single stem indexed, which would be even better in my 
 opinion. When I look at how snowball works only one token is indexed, the 
 stem, and that works great. Probably there's something I'm missing in how 
 hunspell works.
 Here is my issue: I have a query composed of multiple terms, which is 
 analyzed using stemming and a boolean query is generated out of it. All fine 
 when adding all clauses as should (OR operator), but if I add all clauses as 
 must (AND operator), then I can get back only the documents that contain the 
 stem originated by the exactly same original word.
 Example

[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens

2013-06-14 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-5057:
-

Description: 
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token. I'm not sure though if it's possible to have 
only a single stem indexed, which would be even better in my opinion. When I 
look at how snowball works only one token is indexed, the stem, and that works 
great. Probably there's something I'm missing in how hunspell works.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming and a boolean query is generated out of it. All fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originated by the exactly same original word.

Example for the dutch language I'm working with: "fiets" (means bicycle in 
dutch), its plural is "fietsen".

If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
"fiets" I get only "fiets" indexed.

When I query for "fietsen whatever" I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained "fietsen", not the ones that 
originally contained "fiets", which is not really what stemming is about.

Any thoughts on this? I also wonder if it can be a dictionary issue, since I 
see that different words that have the word "fiets" as root don't get the same 
stems, and using the AND operator at query time is a big issue.

I would love to contribute on this and looking forward to your feedback.



  was:
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token. I'm not sure though if it's possible to have 
only a single stem indexed, which would be even better in my opinion. When I 
look at how snowball works only one token is indexed, the stem, and that works 
great. Probably there's something I'm missing in how hunspell works.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming and a boolean query is generated out of it. All fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originated by the exactly same original word.

Example for the dutch language I'm working with: "fiets" (means bicycle in 
dutch), its plural is "fietsen".

If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index 
"fiets" I get only "fiets" indexed.

When I query for "fietsen whatever" I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained "fietsen", not the ones that 
originally contained "fiets", which is not really what stemming is about.

Any thoughts on this? I wonder if it can be a dictionary issue, since I see 
that different words that have the word "fiets" as root don't get the same 
stems, and using the AND operator at query time is a big issue.




 Hunspell stemmer generates multiple tokens
 --

 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna

 The hunspell stemmer seems to be generating multiple tokens: the original 
 token plus the available stems.
 It might be a good thing in some cases but it seems to be a different 
 behaviour compared to the other stemmers and causes problems as well. I would 
 rather have an option to decide whether it should output only the available 
 stems, or the stems plus the original token. I'm not sure though if it's 
 possible to have only a single stem indexed, which would be even better in my 
 opinion. When I look at how snowball works only one token is indexed, the 
 stem, and that works great. Probably there's something I'm missing in how 
 hunspell works.
 Here is my issue: I have a query composed of multiple terms, which is 
 analyzed using stemming and a boolean query

[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-04-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13629961#comment-13629961
 ] 

Luca Cavanna commented on LUCENE-4906:
--

Sounds interesting. Is anybody working on this already? I'd like to volunteer. 
What do you have in mind exactly? Now the format method returns a string. What 
would you like to see as output instead?

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless

 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.




[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-04-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13629973#comment-13629973
 ] 

Luca Cavanna commented on LUCENE-4906:
--

I see! If you couldn't make it how can I make it? :)

But the idea is that you could have some kind of object as output instead of a 
string, for example an array of tokens plus maybe some more information?

It would avoid having to parse the string output again and somehow re-analyze 
the text, so that we could provide a usable snippet as output directly?

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless

 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.




[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-04-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630227#comment-13630227
 ] 

Luca Cavanna commented on LUCENE-4906:
--

Sure, I hadn't seen that issue yet but I was about to propose the same looking 
at the code.

Thanks for your insight!
I thought about generics too, but then we'd have to be really careful otherwise 
the generics policeman jumps in :) 

I'll play around with some ideas and post the results here as soon as I have 
something.

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless

 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.




[jira] [Updated] (LUCENE-4896) PostingsHighlighter should use a interface of PassageFormatter instead of a class

2013-04-12 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-4896:
-

Attachment: LUCENE-4896.patch

No problem. It just felt weird to write an abstract class that looked to me 
like an interface.

Should be better now. I also made append protected.

 PostingsHighlighter should use a interface of PassageFormatter instead of a 
 class
 -

 Key: LUCENE-4896
 URL: https://issues.apache.org/jira/browse/LUCENE-4896
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
 Environment: NA
Reporter: Sebastien Dionne
  Labels: newdev
 Attachments: LUCENE-4896.patch, LUCENE-4896.patch


 In my project I need a custom PassageFormatter to use with 
 PostingsHighlighter. I extended PassageFormatter to override format(...), 
 but if I do that, I don't have access to the private variables. So instead 
 of changing the scope to protected, it would be more useful to use an 
 interface for PassageFormatter, 
 like public DefaultPassageFormatter implements PassageFormatter.
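The refactoring requested above can be sketched as follows. The names mirror the issue, but the bodies are illustrative stand-ins, not Lucene's actual highlighter code.

```java
// Sketch of the requested shape: PassageFormatter as an interface with a
// default implementation, so custom formatters don't depend on another
// class's private fields. Illustrative only; not Lucene's actual code.
public class FormatterInterfaceSketch {

    interface PassageFormatter {
        String format(String passageText);
    }

    static class DefaultPassageFormatter implements PassageFormatter {
        // protected so subclasses can reuse them, addressing the complaint
        // about inaccessible private variables
        protected final String preTag;
        protected final String postTag;

        DefaultPassageFormatter(String preTag, String postTag) {
            this.preTag = preTag;
            this.postTag = postTag;
        }

        public String format(String passageText) {
            return preTag + passageText + postTag;
        }
    }

    public static void main(String[] args) {
        PassageFormatter f = new DefaultPassageFormatter("<b>", "</b>");
        System.out.println(f.format("snippet")); // <b>snippet</b>
    }
}
```

As the later comments note, Lucene ultimately kept an abstract class rather than an interface; this sketch just shows the shape the issue asked for.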




[jira] [Commented] (LUCENE-4896) PostingsHighlighter should use a interface of PassageFormatter instead of a class

2013-04-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630392#comment-13630392
 ] 

Luca Cavanna commented on LUCENE-4896:
--

Just read your last comment, going to add another patch :)

 PostingsHighlighter should use a interface of PassageFormatter instead of a 
 class
 -

 Key: LUCENE-4896
 URL: https://issues.apache.org/jira/browse/LUCENE-4896
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
 Environment: NA
Reporter: Sebastien Dionne
  Labels: newdev
 Attachments: LUCENE-4896.patch, LUCENE-4896.patch


 In my project I need a custom PassageFormatter to use with 
 PostingsHighlighter. I extended PassageFormatter to override format(...), 
 but if I do that, I don't have access to the private variables. So instead 
 of changing the scope to protected, it would be more useful to use an 
 interface for PassageFormatter, 
 like public DefaultPassageFormatter implements PassageFormatter.




[jira] [Updated] (LUCENE-4896) PostingsHighlighter should use a interface of PassageFormatter instead of a class

2013-04-12 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-4896:
-

Attachment: LUCENE-4896.patch

Here it is.
The lucene.experimental was already there, I left it only in the abstract 
class. 

Fixed format javadocs too.

 PostingsHighlighter should use a interface of PassageFormatter instead of a 
 class
 -

 Key: LUCENE-4896
 URL: https://issues.apache.org/jira/browse/LUCENE-4896
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
 Environment: NA
Reporter: Sebastien Dionne
  Labels: newdev
 Attachments: LUCENE-4896.patch, LUCENE-4896.patch, LUCENE-4896.patch


 In my project I need a custom PassageFormatter to use with 
 PostingsHighlighter. I extended PassageFormatter to override format(...), 
 but if I do that, I don't have access to the private variables. So instead 
 of changing the scope to protected, it would be more useful to use an 
 interface for PassageFormatter, 
 like public DefaultPassageFormatter implements PassageFormatter.




[jira] [Commented] (LUCENE-4896) PostingsHighlighter should use a interface of PassageFormatter instead of a class

2013-04-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630441#comment-13630441
 ] 

Luca Cavanna commented on LUCENE-4896:
--

I see what you mean Robert, thanks a lot for your explanation. I would have 
probably ended up with an interface + abstract class then ;)

Let's see what I can come up with for LUCENE-4906...





 PostingsHighlighter should use a interface of PassageFormatter instead of a 
 class
 -

 Key: LUCENE-4896
 URL: https://issues.apache.org/jira/browse/LUCENE-4896
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
 Environment: NA
Reporter: Sebastien Dionne
  Labels: newdev
 Attachments: LUCENE-4896.patch, LUCENE-4896.patch, LUCENE-4896.patch


 In my project I need a custom PassageFormatter to use with 
 PostingsHighlighter.  I extended PassageFormatter  to override format(...)
 but if I do that, I don't have access to the private variables.  So instead 
 of changing the scope to protected, it would be more useful to use an 
 interface for PassageFormatter,
 like public DefaultPassageFormatter implements PassageFormatter.




[jira] [Commented] (LUCENE-4825) PostingsHighlighter support for positional queries

2013-03-13 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13600918#comment-13600918
 ] 

Luca Cavanna commented on LUCENE-4825:
--

Hey Robert,
sorry but I don't quite understand why it would become an orange? :)

I mean, the PostingsHighlighter does (among others) two great things:
1) reads offsets from the postings list, as its name says
2) summarizes the content giving nice sentences as output

I think the two above features are a great improvement and pretty much what 
everybody would like to have!

I'm proposing to add support for positional queries, as a third optional 
feature. We would need to read the spans from the positional queries in order 
to highlight only the proper terms, otherwise the output is wrong from a user 
perspective. Would this make it that slower? I don't mean to reanalyze the 
text...

Don't get me wrong, you're probably right, but I would like to understand more. 

You're saying that instead of adding 3) to 2) and 1) we should have another 
highlighter that does 1) 2) and 3)?





 PostingsHighlighter support for positional queries
 --

 Key: LUCENE-4825
 URL: https://issues.apache.org/jira/browse/LUCENE-4825
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
Reporter: Luca Cavanna

 I've been playing around with the brand new PostingsHighlighter. I'm really 
 happy with the result in terms of quality of the snippets and performance.
 On the other hand, I noticed it doesn't support positional queries. If you 
 make a span query, for example, all the single terms will be highlighted, 
 even though they haven't contributed to the match. That reminds me of the 
 difference between the QueryTermScorer and the QueryScorer (using the 
 standard Highlighter).
 I've been trying to adapt what the QueryScorer does, especially the 
 extraction of the query terms together with their positions (what 
 WeightedSpanTermExtractor does). Next step would be to take that information 
 into account within the formatter and highlight only the terms that actually 
 contributed to the match. I'm not quite ready yet with a patch to contribute 
 this back, but I certainly intend to do so. That's why I opened the issue and 
 in the meantime I would like to hear what you guys think about it and  
 discuss how best we can fix it. I think it would be a big improvement for 
 this new highlighter, which is already great!




[jira] [Created] (LUCENE-4825) PostingsHighlighter support for positional queries

2013-03-12 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-4825:


 Summary: PostingsHighlighter support for positional queries
 Key: LUCENE-4825
 URL: https://issues.apache.org/jira/browse/LUCENE-4825
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
Reporter: Luca Cavanna


I've been playing around with the brand new PostingsHighlighter. I'm really 
happy with the result in terms of quality of the snippets and performance.
On the other hand, I noticed it doesn't support positional queries. If you make 
a span query, for example, all the single terms will be highlighted, even 
though they haven't contributed to the match. That reminds me of the difference 
between the QueryTermScorer and the QueryScorer (using the standard 
Highlighter).

I've been trying to adapt what the QueryScorer does, especially the extraction 
of the query terms together with their positions (what 
WeightedSpanTermExtractor does). Next step would be to take that information 
into account within the formatter and highlight only the terms that actually 
contributed to the match. I'm not quite ready yet with a patch to contribute 
this back, but I certainly intend to do so. That's why I opened the issue and 
in the meantime I would like to hear what you guys think about it and  discuss 
how best we can fix it. I think it would be a big improvement for this new 
highlighter, which is already great!






[jira] [Commented] (LUCENE-4825) PostingsHighlighter support for positional queries

2013-03-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13600572#comment-13600572
 ] 

Luca Cavanna commented on LUCENE-4825:
--

Thanks for your inputs Robert!

I see your point, even though from a user perspective I'd rather see only the 
complete phrase highlighted if I make a phrase query, not every single term. I 
think we can currently achieve this only like the old highlighter does, am I 
right? 
Maybe we can make this pluggable and have different implementations?




 PostingsHighlighter support for positional queries
 --

 Key: LUCENE-4825
 URL: https://issues.apache.org/jira/browse/LUCENE-4825
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
Reporter: Luca Cavanna

 I've been playing around with the brand new PostingsHighlighter. I'm really 
 happy with the result in terms of quality of the snippets and performance.
 On the other hand, I noticed it doesn't support positional queries. If you 
 make a span query, for example, all the single terms will be highlighted, 
 even though they haven't contributed to the match. That reminds me of the 
 difference between the QueryTermScorer and the QueryScorer (using the 
 standard Highlighter).
 I've been trying to adapt what the QueryScorer does, especially the 
 extraction of the query terms together with their positions (what 
 WeightedSpanTermExtractor does). Next step would be to take that information 
 into account within the formatter and highlight only the terms that actually 
 contributed to the match. I'm not quite ready yet with a patch to contribute 
 this back, but I certainly intend to do so. That's why I opened the issue and 
 in the meantime I would like to hear what you guys think about it and  
 discuss how best we can fix it. I think it would be a big improvement for 
 this new highlighter, which is already great!




[jira] [Commented] (LUCENE-3976) Improve error messages for unsupported Hunspell formats

2012-06-01 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13287260#comment-13287260
 ] 

Luca Cavanna commented on LUCENE-3976:
--

Hi Chris, 
I agree with you. On the other hand with the affix rule mentioned, before 
LUCENE-4019 we had an AOE, so the additional catch would have been useful just 
to throw a nicer error message like Error while parsing the affix file. That 
one has been solved at its source, for now I don't see any other possible 
errors but I'm sure there are some, maybe plenty since we support only a subset 
of the formats and features.
It was just a way to introduce a generic error message, but I totally agree that 
the right approach would be fixing everything at the source.

 Improve error messages for unsupported Hunspell formats
 ---

 Key: LUCENE-3976
 URL: https://issues.apache.org/jira/browse/LUCENE-3976
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3976.patch, LUCENE-3976.patch


 Our hunspell implementation is never going to be able to support the huge 
 variety of formats that are out there, especially since our impl is based on 
 papers written on the topic rather than being a pure port.
 Recently we ran into the following suffix rule:
 {noformat}SFX CA 0 /CaCp{noformat}
 Due to the missing regex conditional, an AOE was being thrown, which made it 
 difficult to diagnose the problem.
 We should instead try to provide better error messages showing what we were 
 unable to parse.




[jira] [Updated] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-05-31 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-4019:
-

Attachment: LUCENE-4019.patch

Hi Chris, 
thanks for your feedback. Here is a new patch containing a new option to 
enable/disable strict affix parsing; it is enabled by default. I updated the 
HunspellStemFilterFactory too, in order to expose the new option to Solr.

 Parsing Hunspell affix rules without regexp condition
 -

 Key: LUCENE-4019
 URL: https://issues.apache.org/jira/browse/LUCENE-4019
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna
Assignee: Chris Male
 Attachments: LUCENE-4019.patch, LUCENE-4019.patch


 We found out that some recent Dutch hunspell dictionaries contain suffix or 
 prefix rules like the following:
 {code} 
 SFX Na N 1
 SFX Na 0 ste
 {code}
 The rule on the second line doesn't contain the 5th parameter, which should 
 be the condition (a regexp usually). You can usually see a '.' as condition, 
 meaning always (for every character). As explained in LUCENE-3976 the 
 readAffix method throws error. I wonder if we should treat the missing value 
 as a kind of default value, like '.'.  On the other hand I haven't found any 
 information about this within the spec. Any thoughts?




[jira] [Updated] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-05-31 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-4019:
-

Attachment: LUCENE-4019.patch

Yeah, sorry for my mistakes, I corrected them.
And I added the line number to the ParseException.
Let me know if there's something more I can do!

 Parsing Hunspell affix rules without regexp condition
 -

 Key: LUCENE-4019
 URL: https://issues.apache.org/jira/browse/LUCENE-4019
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna
Assignee: Chris Male
 Attachments: LUCENE-4019.patch, LUCENE-4019.patch, LUCENE-4019.patch


 We found out that some recent Dutch hunspell dictionaries contain suffix or 
 prefix rules like the following:
 {code} 
 SFX Na N 1
 SFX Na 0 ste
 {code}
 The rule on the second line doesn't contain the 5th parameter, which should 
 be the condition (a regexp usually). You can usually see a '.' as condition, 
 meaning always (for every character). As explained in LUCENE-3976 the 
 readAffix method throws error. I wonder if we should treat the missing value 
 as a kind of default value, like '.'.  On the other hand I haven't found any 
 information about this within the spec. Any thoughts?




[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-05-07 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13269516#comment-13269516
 ] 

Luca Cavanna commented on LUCENE-4019:
--

Thank you Robert for the explanation!
In this specific case it's hard to understand the differences between hunspell 
and Lucene, since Lucene doesn't even parse the affix file.
I've been in contact with the authors of those Dutch dictionaries, as well as 
with the hunspell author. It turned out that those affix rules are wrong and 
hunspell actually ignores them. I think it's better to ignore them in Lucene 
too, rather than throwing an exception, which makes it impossible to use those 
dictionaries at all.

 Parsing Hunspell affix rules without regexp condition
 -

 Key: LUCENE-4019
 URL: https://issues.apache.org/jira/browse/LUCENE-4019
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna

 We found out that some recent Dutch hunspell dictionaries contain suffix or 
 prefix rules like the following:
 {code} 
 SFX Na N 1
 SFX Na 0 ste
 {code}
 The rule on the second line doesn't contain the 5th parameter, which should 
 be the condition (a regexp usually). You can usually see a '.' as condition, 
 meaning always (for every character). As explained in LUCENE-3976 the 
 readAffix method throws error. I wonder if we should treat the missing value 
 as a kind of default value, like '.'.  On the other hand I haven't found any 
 information about this within the spec. Any thoughts?




[jira] [Updated] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-05-07 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-4019:
-

Attachment: LUCENE-4019.patch

Small patch: affix rules with fewer than 5 elements are now ignored. I added a 
specific test with a new affix file containing an example of a rule shorter than 
it should be. Let me know if you prefer to add a warning when a rule is 
skipped. Hunspell does that only with a specific command line option.
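The skip logic described above can be sketched roughly like this. It is an illustrative, self-contained sketch with invented names, not the actual Lucene patch, and it deliberately ignores the distinction between the header line and rule lines:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: skip Hunspell affix rules with fewer than 5 fields. */
public class AffixRuleSketch {
    /** Keeps rules that have all 5 expected fields; shorter ones are skipped. */
    public static List<String[]> parseAffixLines(List<String> lines) {
        List<String[]> rules = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            // A full rule: FLAG name stripping affix condition (5 fields).
            if (fields.length < 5) {
                continue; // malformed rule: ignore it, as hunspell itself does
            }
            rules.add(fields);
        }
        return rules;
    }

    public static void main(String[] args) {
        // "SFX Na 0 ste" is missing the 5th (condition) field, so it is skipped;
        // "SFX Na 0 ste ." carries the '.' condition and is kept.
        List<String> lines = List.of("SFX Na 0 ste", "SFX Na 0 ste .");
        System.out.println(parseAffixLines(lines).size());
    }
}
```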

 Parsing Hunspell affix rules without regexp condition
 -

 Key: LUCENE-4019
 URL: https://issues.apache.org/jira/browse/LUCENE-4019
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna
 Attachments: LUCENE-4019.patch


 We found out that some recent Dutch hunspell dictionaries contain suffix or 
 prefix rules like the following:
 {code} 
 SFX Na N 1
 SFX Na 0 ste
 {code}
 The rule on the second line doesn't contain the 5th parameter, which should 
 be the condition (a regexp usually). You can usually see a '.' as condition, 
 meaning always (for every character). As explained in LUCENE-3976 the 
 readAffix method throws error. I wonder if we should treat the missing value 
 as a kind of default value, like '.'.  On the other hand I haven't found any 
 information about this within the spec. Any thoughts?




[jira] [Commented] (LUCENE-3976) Improve error messages for unsupported Hunspell formats

2012-05-07 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13269520#comment-13269520
 ] 

Luca Cavanna commented on LUCENE-3976:
--

The specific case of affix rule with less than 5 elements has been addressed in 
LUCENE-4019. Please ignore my first patch here since it's related to that 
specific case which is now handled in a different way in LUCENE-4019.
I'm looking into improve error messages anyway, in a more generic way if 
possible.

 Improve error messages for unsupported Hunspell formats
 ---

 Key: LUCENE-3976
 URL: https://issues.apache.org/jira/browse/LUCENE-3976
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3976.patch


 Our hunspell implementation is never going to be able to support the huge 
 variety of formats that are out there, especially since our impl is based on 
 papers written on the topic rather than being a pure port.
 Recently we ran into the following suffix rule:
 {noformat}SFX CA 0 /CaCp{noformat}
 Due to the missing regex conditional, an AOE was being thrown, which made it 
 difficult to diagnose the problem.
 We should instead try to provide better error messages showing what we were 
 unable to parse.




[jira] [Issue Comment Edited] (LUCENE-3976) Improve error messages for unsupported Hunspell formats

2012-05-07 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13269520#comment-13269520
 ] 

Luca Cavanna edited comment on LUCENE-3976 at 5/7/12 11:03 AM:
---

The specific case of affix rule with less than 5 elements has been addressed in 
LUCENE-4019. Please ignore my first patch here since it's related to that 
specific case which is now handled in a different way in LUCENE-4019.
I'm looking into improving error messages anyway, possibly in a more generic 
way.

  was (Author: lucacavanna):
The specific case of affix rule with less than 5 elements has been 
addressed in LUCENE-4019. Please ignore my first patch here since it's related 
to that specific case which is now handled in a different way in LUCENE-4019.
I'm looking into improve error messages anyway, in a more generic way if 
possible.
  
 Improve error messages for unsupported Hunspell formats
 ---

 Key: LUCENE-3976
 URL: https://issues.apache.org/jira/browse/LUCENE-3976
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3976.patch


 Our hunspell implementation is never going to be able to support the huge 
 variety of formats that are out there, especially since our impl is based on 
 papers written on the topic rather than being a pure port.
 Recently we ran into the following suffix rule:
 {noformat}SFX CA 0 /CaCp{noformat}
 Due to the missing regex conditional, an AOE was being thrown, which made it 
 difficult to diagnose the problem.
 We should instead try to provide better error messages showing what we were 
 unable to parse.




[jira] [Issue Comment Edited] (LUCENE-3976) Improve error messages for unsupported Hunspell formats

2012-05-07 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13269520#comment-13269520
 ] 

Luca Cavanna edited comment on LUCENE-3976 at 5/7/12 11:04 AM:
---

The specific case of affix rule with less than 5 elements has been addressed in 
LUCENE-4019. Please ignore my first patch here since it's related to that 
specific case which is now handled in a different way in LUCENE-4019.
I'm looking into improving error messages anyway, possibly in a generic way.

  was (Author: lucacavanna):
The specific case of affix rule with less than 5 elements has been 
addressed in LUCENE-4019. Please ignore my first patch here since it's related 
to that specific case which is now handled in a different way in LUCENE-4019.
I'm looking into improving error messages anyway, possibly in a more generic 
way.
  
 Improve error messages for unsupported Hunspell formats
 ---

 Key: LUCENE-3976
 URL: https://issues.apache.org/jira/browse/LUCENE-3976
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3976.patch


 Our hunspell implementation is never going to be able to support the huge 
 variety of formats that are out there, especially since our impl is based on 
 papers written on the topic rather than being a pure port.
 Recently we ran into the following suffix rule:
 {noformat}SFX CA 0 /CaCp{noformat}
 Due to the missing regex conditional, an AOE was being thrown, which made it 
 difficult to diagnose the problem.
 We should instead try to provide better error messages showing what we were 
 unable to parse.




[jira] [Updated] (LUCENE-3976) Improve error messages for unsupported Hunspell formats

2012-05-07 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-3976:
-

Attachment: LUCENE-3976.patch

The patch tries to address unexpected errors while parsing affix files and 
dictionaries. I just added an outer try/catch with a generic "Error while 
parsing the affix/dictionary file" message, which in my opinion is better than 
just letting some unchecked exception escape. Let me know if there's something 
else we can improve in the meantime.
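The outer catch idea, combined with the line number later added to the ParseException, can be sketched like this. All names are hypothetical; this is not the actual patch, just a self-contained illustration of the error-reporting approach:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.text.ParseException;

/**
 * Illustrative sketch: wrap per-line affix parsing so an unexpected runtime
 * error surfaces as a ParseException carrying the offending line number,
 * instead of a bare ArrayIndexOutOfBoundsException.
 */
public class AffixErrorSketch {
    public static void parseAffix(BufferedReader reader) throws IOException, ParseException {
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            try {
                parseLine(line); // hypothetical per-line parser
            } catch (RuntimeException e) {
                // Friendlier, generic message with the line that failed
                throw new ParseException(
                        "Error while parsing the affix file: " + line, lineNumber);
            }
        }
    }

    private static void parseLine(String line) {
        String[] fields = line.trim().split("\\s+");
        String condition = fields[4]; // throws if the condition field is missing
    }

    public static void main(String[] args) throws Exception {
        try {
            parseAffix(new BufferedReader(new StringReader("SFX Na 0 ste")));
        } catch (ParseException e) {
            System.out.println(e.getMessage() + " (line " + e.getErrorOffset() + ")");
        }
    }
}
```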

 Improve error messages for unsupported Hunspell formats
 ---

 Key: LUCENE-3976
 URL: https://issues.apache.org/jira/browse/LUCENE-3976
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3976.patch, LUCENE-3976.patch


 Our hunspell implementation is never going to be able to support the huge 
 variety of formats that are out there, especially since our impl is based on 
 papers written on the topic rather than being a pure port.
 Recently we ran into the following suffix rule:
 {noformat}SFX CA 0 /CaCp{noformat}
 Due to the missing regex conditional, an AOE was being thrown, which made it 
 difficult to diagnose the problem.
 We should instead try to provide better error messages showing what we were 
 unable to parse.




[jira] [Commented] (SOLR-3189) Removing a field using TemplateTransformer

2012-05-07 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13269561#comment-13269561
 ] 

Luca Cavanna commented on SOLR-3189:


Guys, don't you think this feature could be useful? Is there something I can do 
to convince you? :)

 Removing a field using TemplateTransformer
 --

 Key: SOLR-3189
 URL: https://issues.apache.org/jira/browse/SOLR-3189
 Project: Solr
  Issue Type: Improvement
  Components: contrib - DataImportHandler
Reporter: Luca Cavanna
Priority: Minor
 Attachments: SOLR-3189.patch


 While importing documents through DataImportHandler I need to remove some 
 fields from the final SolrDocument before it's submitted to Solr.
 My usecase: the import query returns an A column which I use to fill in the B 
 field on the Solr instance. My Solr schema contains both the A and B fields, 
 so they are both filled in through dih. I'd like to force the deletion of A 
 from the generated SolrDocument since I need a value only on the B field and 
 want to leave the A field empty. The only way I found is using 
 ScriptTransformer, so I thought it could be useful to add this feature to the 
 TemplateTransformer.




[jira] [Created] (SOLR-3443) Optimize hunspell dictionary loading with multiple cores

2012-05-07 Thread Luca Cavanna (JIRA)
Luca Cavanna created SOLR-3443:
--

 Summary: Optimize hunspell dictionary loading with multiple cores
 Key: SOLR-3443
 URL: https://issues.apache.org/jira/browse/SOLR-3443
 Project: Solr
  Issue Type: Improvement
Reporter: Luca Cavanna


The Hunspell dictionary is currently loaded into memory. Each core using 
hunspell loads its own dictionary, no matter if all the cores are using the 
same dictionary files. As a result, the same dictionary is loaded into memory 
multiple times, once for each core. I think we should share those dictionaries 
between all cores in order to optimize the memory usage. In fact, let's say a 
dictionary takes 20MB into memory (this is what I detected), if you have 20 
cores you are going to use 400MB only for dictionaries, which doesn't seem a 
good idea to me.




[jira] [Commented] (SOLR-3443) Optimize hunspell dictionary loading with multiple cores

2012-05-07 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13269573#comment-13269573
 ] 

Luca Cavanna commented on SOLR-3443:


The first thing I have in mind is a static map containing all loaded 
dictionaries with some kind of unique identifier, so that the same dictionary 
can be reused between cores.
But my question is: is there a mechanism to share objects between cores in Solr? 
Is this the first time someone needs to share something between multiple cores?
I'd like to hear your thoughts!
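The static-map idea can be sketched as below. This is a minimal, self-contained illustration with hypothetical names (the key, loader, and dictionary type are invented), not Solr's actual core-sharing mechanism:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/**
 * Illustrative sketch: cache loaded dictionaries by a unique key (e.g. the
 * affix + dictionary file paths) so multiple cores share one in-memory copy
 * instead of each loading its own.
 */
public class DictionaryCacheSketch {
    private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

    public static Object getOrLoad(String key, Supplier<Object> loader) {
        // computeIfAbsent loads the dictionary at most once per key;
        // later cores asking for the same key reuse the cached instance
        return CACHE.computeIfAbsent(key, k -> loader.get());
    }

    public static void main(String[] args) {
        Object first = getOrLoad("nl_NL.aff|nl_NL.dic", Object::new);
        Object second = getOrLoad("nl_NL.aff|nl_NL.dic", Object::new);
        System.out.println(first == second); // same instance: loaded only once
    }
}
```

One open question such a cache raises is eviction: entries keyed this way live for the JVM's lifetime unless cores unregister them on close.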

 Optimize hunspell dictionary loading with multiple cores
 

 Key: SOLR-3443
 URL: https://issues.apache.org/jira/browse/SOLR-3443
 Project: Solr
  Issue Type: Improvement
Reporter: Luca Cavanna

 The Hunspell dictionary is currently loaded into memory. Each core using 
 hunspell loads its own dictionary, no matter if all the cores are using the 
 same dictionary files. As a result, the same dictionary is loaded into memory 
 multiple times, once for each core. I think we should share those 
 dictionaries between all cores in order to optimize the memory usage. In 
 fact, let's say a dictionary takes 20MB into memory (this is what I 
 detected), if you have 20 cores you are going to use 400MB only for 
 dictionaries, which doesn't seem a good idea to me.



