from:"Bruno Roustant"

Re: Join module dependency

2024-05-19 Thread Bruno Roustant

It seems it would be nice to have hppc in the join and spatial modules.

On Sun, May 19, 2024 at 14:46, Dawid Weiss wrote:

>
> I don't think there are any rules other than common sense - Lucene is used
> in so many different environments and subprojects that any dependency will
> eventually create a headache and a conflict. So no dependencies at all is
> great, if it can be achieved. If it's just one or two classes we care about
> then perhaps just moving them over is sufficient? Since we already have a
> HPPC dependency in another module, perhaps it's ok to propagate it to other
> modules? I'm not really sure how much impact it'll have downstream.
>
> D.
>
> On Sat, May 18, 2024 at 5:26 PM Bruno Roustant 
> wrote:
>
>> The facet module has a dependency on com.carrotsearch:hppc.
>>
>> Is it possible to add the same dependency to the join module? What is
>> the rule?
>>
>> Thanks
>>
>> Bruno
>>
>

Join module dependency

2024-05-18 Thread Bruno Roustant

The facet module has a dependency on com.carrotsearch:hppc.

Is it possible to add the same dependency to the join module? What is the
rule?

Thanks

Bruno

How much is ja.dict.UserDictionary used?

2024-05-18 Thread Bruno Roustant

Hi,

While looking at the various usages of Map with Integer keys, I found
ja.dict.UserDictionary with its lookup() method where there is a *TODO: can
we avoid this treemap/toIndexArray?*

I could propose something, but I would like to know how much it is used,
and if it is worth improving it.
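
To illustrate why a Map with Integer keys can be worth replacing, here is a
minimal sketch of a primitive int-to-int open-addressing map. It is not HPPC's
implementation and not what UserDictionary does today; the class name, the
reserved EMPTY key and the hash mixing constant are all assumptions made for
the example. The point is only that keys and values live in plain int arrays,
so no Integer boxing happens on put or get.

```java
import java.util.Arrays;

/** Illustrative int-to-int hash map with linear probing (hypothetical, not HPPC code). */
final class IntIntMapSketch {
  // Reserved key marking a free slot; an assumption of this sketch.
  private static final int EMPTY = Integer.MIN_VALUE;

  private int[] keys;
  private int[] values;
  private int size;

  IntIntMapSketch(int expectedSize) {
    // Power-of-two capacity so that the slot can be computed with a bit mask.
    int capacity = Integer.highestOneBit(Math.max(4, expectedSize * 2 - 1)) << 1;
    keys = new int[capacity];
    values = new int[capacity];
    Arrays.fill(keys, EMPTY);
  }

  void put(int key, int value) {
    if (key == EMPTY) {
      throw new IllegalArgumentException("reserved key");
    }
    if ((size + 1) * 2 > keys.length) {
      grow(); // keep the load factor at or below 0.5
    }
    int slot = findSlot(key);
    if (keys[slot] == EMPTY) {
      keys[slot] = key;
      size++;
    }
    values[slot] = value;
  }

  int getOrDefault(int key, int defaultValue) {
    if (key == EMPTY) {
      return defaultValue;
    }
    int slot = findSlot(key);
    return keys[slot] == key ? values[slot] : defaultValue;
  }

  int size() {
    return size;
  }

  /** Returns the slot holding the key, or the first free slot on its probe path. */
  private int findSlot(int key) {
    int mask = keys.length - 1;
    int slot = mix(key) & mask;
    while (keys[slot] != EMPTY && keys[slot] != key) {
      slot = (slot + 1) & mask;
    }
    return slot;
  }

  private static int mix(int k) {
    int h = k * 0x9E3779B9;
    return h ^ (h >>> 16);
  }

  private void grow() {
    int[] oldKeys = keys;
    int[] oldValues = values;
    keys = new int[oldKeys.length * 2];
    values = new int[keys.length];
    Arrays.fill(keys, EMPTY);
    size = 0;
    for (int i = 0; i < oldKeys.length; i++) {
      if (oldKeys[i] != EMPTY) {
        put(oldKeys[i], oldValues[i]);
      }
    }
  }
}
```

In real code one would use the HPPC collections directly rather than a
hand-written map like this one.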

Thanks

Bruno

Re: New Lucene PMC Chair: Chris Hegarty

2024-01-21 Thread Bruno Roustant

Thank you Chris, congrats!
And of course thank you Greg for the past year!

On Sat, Jan 20, 2024 at 01:15, Greg Miller wrote:

> Hello Lucene developers-
>
> I wanted to let you know that the Lucene PMC has elected a new Chair—Chris
> Hegarty—and the board has approved the appointment. It's been an honor to
> fill this role for the past year, but it's time to pass the torch to
> someone new.
>
> Chris- thank you for stepping up for this role and congratulations!
>
> Cheers,
> -Greg
>

Re: Welcome Stefan Vodita as Lucene committer

2024-01-21 Thread Bruno Roustant

Congrats Stefan!

On Sat, Jan 20, 2024 at 08:26, Michael Wechner wrote:

> Hi Stefan, thank you very much for your contributions and helping to
> improve Lucene!
>
> All the best
>
> Michael
>
> On 19 Jan 2024 at 20:03, Stefan Vodita wrote:
>
> Thank you all! It's an honor to join the project as a committer.
>
> I'm originally from a small town in southern Romania, so I'm really looking
> forward to seeing #12172 resolved, since both the characters in question
> (ș, ț) are supposed to show up in my name.
>
> In university, I had professors who contributed to open software and I was
> lucky enough to be given a taste of the open source world. I had become a
> teaching assistant for a few of the courses (Data Structures, Control
> Theory), and it had crossed my mind to stay at the university. Then I got an
> offer to come work at Amazon, in Ireland. They gave me a list of teams I
> could join that only had the names of the teams - I thought Search Engine
> Tech sounded the coolest. I was right! That's how I first learned about
> Lucene and started working with/on it. It's a privilege, Lucene is an amazing
> piece of software and I'm proud to be contributing.
>
> Outside programming, I like history and philosophy. I've been a voracious
> reader basically since I learned how to read. Recently, I've been going down
> a spiral of increasingly obscure books, but nothing has topped Dostoevsky's
> classic, The Brothers Karamazov. Knowing books also happens to be useful for
> thinking up faceting examples, so that's a plus.
> When I was in middle-school, I half-willingly went through 4 years of
> classical
> guitar training and was left with a life-long desire to be a good musician
> despite my inconsistent practice habits. Practice will have to wait until I
> finish up the next PR - looking forward to many more in the future!
>
> Cheers,
> Stefan
>
> On Thu, 18 Jan 2024 at 15:56, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hi Team,
>>
>> I'm pleased to announce that Stefan Vodita has accepted the Lucene PMC's
>> invitation to become a committer!
>>
>> Stefan, the tradition is that new committers introduce themselves with a
>> brief bio.
>>
>> Congratulations, welcome, and thank you for all your improvements to
>> Lucene and our community,
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>
>

Re: Welcome Luca Cavanna to the Lucene PMC

2023-10-20 Thread Bruno Roustant

Welcome, congratulations!

On Fri, Oct 20, 2023 at 10:02, Dawid Weiss wrote:

>
> Congratulations, Luca!
>
> On Fri, Oct 20, 2023 at 7:51 AM Adrien Grand  wrote:
>
>> I'm pleased to announce that Luca Cavanna has accepted an invitation to
>> join the Lucene PMC!
>>
>> Congratulations Luca, and welcome aboard!
>>
>> --
>> Adrien
>>
>

Re: Branchless binary search in Java?

2023-08-01 Thread Bruno Roustant

>
> Wow, this looks very relevant to Lucene!  Could this index be used for
> faster implementation of our skip lists?  Even though they are static
> (computed once at segment-write time) vs dynamic/online that these learned
> indices are also able to handle, it looks like learned indices are still
> better than simple binary search, and quite a bit more compact, for static
> cases.


I do hope this index can be used for skip lists!

> Our "best linear fit" approximation to compress monotonic longs
> (DirectMonotonicWriter/Reader) looks like a simple example of these learned
> indices too.


Yes, this is the same linear approximation technique. With the concept of
an optimal (minimal) sequence of segments and a hierarchy of segment
layers, the paper gives a very compact yet precise approximation.
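
As a rough illustration of the "best linear fit" idea for monotonic longs,
here is a minimal sketch. It is not the actual DirectMonotonicWriter code
(the class and field names are made up, and it handles a single block only):
each value is predicted from its index with an average slope, and only the
non-negative residual is stored, which usually needs far fewer bits than the
raw value.

```java
/** Illustrative linear-fit encoding of a monotonic sequence (hypothetical names). */
final class MonotonicLinearFitSketch {
  final long base;           // offset chosen so that all residuals are >= 0
  final float slope;         // average increase per index
  final long[] residuals;    // value[i] - prediction(i) - base, all non-negative
  final int bitsPerResidual; // bits needed for the largest residual

  MonotonicLinearFitSketch(long[] values) {
    int n = values.length;
    slope = n <= 1 ? 0f : (float) ((double) (values[n - 1] - values[0]) / (n - 1));
    residuals = new long[n];
    long minResidual = Long.MAX_VALUE;
    for (int i = 0; i < n; i++) {
      // Distance between the actual value and the straight-line prediction.
      residuals[i] = values[i] - (long) (slope * (double) i);
      minResidual = Math.min(minResidual, residuals[i]);
    }
    long maxResidual = 0;
    for (int i = 0; i < n; i++) {
      residuals[i] -= minResidual; // shift so that the smallest residual is 0
      maxResidual = Math.max(maxResidual, residuals[i]);
    }
    base = n == 0 ? 0 : minResidual;
    bitsPerResidual = 64 - Long.numberOfLeadingZeros(maxResidual);
  }

  /** Rebuilds the original value from the prediction plus the stored residual. */
  long get(int i) {
    return base + (long) (slope * (double) i) + residuals[i];
  }
}
```

The learned index in the paper goes further by using several segments with a
guaranteed maximum error and by stacking them in layers; the sketch above only
shows the single-segment approximation.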

Re: Branchless binary search in Java?

2023-07-30 Thread Bruno Roustant

Interesting coincidence, I'm currently working on a learned index on sorted
keys that can advantageously replace binary search.
It is very compact (additional space of 2% of the sorted key array, e.g.
40KB for 200MB of keys), and it is between 2x and 3x faster than binary
search for the rank/indexOf methods. By design there are nearly no
branches: the index of a key is approximated by using hierarchical linear
segment computation.
The PGM-Index paper is here: https://pgm.di.unipi.it/
And my implementation is here, just submitted to HPPC:
https://github.com/carrotsearch/hppc/pull/39
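
To give an idea of the principle, here is a minimal sketch with a single
linear segment. It is neither the PGM-Index algorithm nor the code in the pull
request (the class name and the way the error bound is computed are
assumptions for the example): the position of a key is predicted with a
linear function, and the exact position is then searched only inside a small
window bounded by the maximum prediction error measured at build time.

```java
import java.util.Arrays;

/** Illustrative one-segment learned index over sorted distinct keys (hypothetical). */
final class LinearLearnedIndexSketch {
  private final long[] keys;  // sorted, distinct
  private final double slope; // positions per key unit
  private final int epsilon;  // largest |predicted - actual| over all keys

  LinearLearnedIndexSketch(long[] sortedKeys) {
    keys = sortedKeys;
    int n = keys.length;
    slope = (n <= 1 || keys[n - 1] == keys[0])
        ? 0
        : (double) (n - 1) / (double) (keys[n - 1] - keys[0]);
    int maxError = 0;
    for (int i = 0; i < n; i++) {
      maxError = Math.max(maxError, Math.abs(predict(keys[i]) - i));
    }
    epsilon = maxError;
  }

  /** Approximate position of the key, clamped to the valid index range. */
  private int predict(long key) {
    if (keys.length == 0) {
      return 0;
    }
    // Note: key - keys[0] may overflow for extreme key ranges; ignored in this sketch.
    double p = slope * (double) (key - keys[0]);
    return (int) Math.max(0, Math.min(keys.length - 1, p));
  }

  /** Returns the index of the key, or -1 if the key is absent. */
  int indexOf(long key) {
    int p = predict(key);
    int from = Math.max(0, p - epsilon);
    int to = Math.min(keys.length, p + epsilon + 1);
    // Only the small window around the prediction needs to be searched.
    int i = Arrays.binarySearch(keys, from, to, key);
    return i >= 0 ? i : -1;
  }
}
```

With one segment the error window can be large on skewed data; the
hierarchical segments of the paper are what keep it small.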

Bruno

On Fri, Jul 28, 2023 at 13:04, Dawid Weiss wrote:

>
> Actually this is exactly the same for Java:
>>
>
> Yup, I know (we all know by now, I guess). People (including me) evidently
> crave this low, iron-level control, while at the same time mostly try to
> dodge writing any software in languages that are designed to be close to
> the hardware. There is a love-hate relationship there that I often find
> amusing.
>
> D.
>
>>

Re: Welcome Chris Hegarty to the Lucene PMC

2023-06-21 Thread Bruno Roustant

Welcome Chris!

On Wed, Jun 21, 2023 at 13:43, Chris Hegarty wrote:

> Thank you all for the warm welcome. Happy to be included in this very
> talented group of individuals :-)
>
> -Chris.
>
> On 21 Jun 2023, at 09:31, Uwe Schindler  wrote:
>
> Welcome Chris. 
>
> Uwe
>
>
> On 19 June 2023 at 11:52:50 CEST, Adrien Grand wrote:
>
>> I'm pleased to announce that Chris Hegarty has accepted an invitation to
>> join the Lucene PMC!
>>
>> Congratulations Chris, and welcome aboard!
>>
>> --
>> Adrien
>>
> --
> Uwe Schindler
> Achterdiek 19, 28357 Bremen
> https://www.thetaphi.de
>
>
>

Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-22 Thread Bruno Roustant

I vote for option 3, with follow-up work to add a simple extension codec in
the "codecs" package that (1) is not backward compatible and (2) has a higher
or configurable limit. That way users can use this codec directly without
writing any additional code.

Re: Dimensions Limit for KNN vectors - Next Steps

2023-05-10 Thread Bruno Roustant

*Proposed option:* Move the max dimension limit to a lower level, into an
HNSW-specific implementation. Once there, this limit would not bind any other
potential vector engine alternative/evolution.

*Motivation:* There seem to be contradictory performance interpretations
about the current HNSW implementation. Some consider its performance ok,
some not, and it depends on the target data set and use-case. Increasing
the max dimension limit where it is currently (in top level
FloatVectorValues) would not allow potential alternatives (e.g. for other
use-cases) to be based on a lower limit.
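
A minimal sketch of what this could look like, with purely hypothetical names
(none of these types exist in Lucene, and the 2048 value is only illustrative):
the limit is declared by each vector engine instead of by the shared top-level
class, so one engine can keep 1024 while another chooses a different bound.

```java
/** Hypothetical abstraction: each vector engine declares its own dimension limit. */
interface VectorEngineSketch {
  int maxDimensions();

  /** Rejects dimensions outside the range supported by this engine. */
  default void checkDimension(int dimension) {
    if (dimension <= 0 || dimension > maxDimensions()) {
      throw new IllegalArgumentException(
          "vector dimension " + dimension + " is outside (0, " + maxDimensions() + "]");
    }
  }
}

/** Stand-in for the HNSW engine, keeping the limit discussed in this thread. */
final class HnswEngineSketch implements VectorEngineSketch {
  @Override
  public int maxDimensions() {
    return 1024;
  }
}

/** Stand-in for some alternative engine with a different, illustrative limit. */
final class AlternativeEngineSketch implements VectorEngineSketch {
  @Override
  public int maxDimensions() {
    return 2048;
  }
}
```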

Bruno

Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-09 Thread Bruno Roustant

I agree with Robert Muir that an increase of the 1024 limit as it is
currently in FloatVectorValues or ByteVectorValues would bind the API: we
could not decrease it afterwards, even if we needed to change the vector
engine.

Would it be possible to move the limit definition to a HNSW specific
implementation, where it would only bind HNSW?
I don't know this area of code well. It seems to me the FloatVectorValues
implementation is unfortunately not HNSW specific. Is this on purpose? We
should be able to replace the vector engine, no?

On Sat, May 6, 2023 at 22:44, Michael Wechner wrote:

> there is already a pull request for Elasticsearch which is also
> mentioning the max size 1024
>
> https://github.com/openai/chatgpt-retrieval-plugin/pull/83
>
>
>
> On 06 May 2023 at 19:00, Michael Wechner wrote:
> > Hi Together
> >
> > I recently setup ChatGPT retrieval plugin locally
> >
> > https://github.com/openai/chatgpt-retrieval-plugin
> >
> > I think it would be nice to consider to submit a Lucene implementation
> > for this plugin
> >
> > https://github.com/openai/chatgpt-retrieval-plugin#future-directions
> >
> > The plugin is using by default OpenAI's model "text-embedding-ada-002"
> > with 1536 dimensions
> >
> > https://openai.com/blog/new-and-improved-embedding-model
> >
> > but which means one won't be able to use it out-of-the-box with Lucene.
> >
> > Similar request here
> >
> >
> https://learn.microsoft.com/en-us/answers/questions/1192796/open-ai-text-embedding-dimensions
> >
> >
> > I understand we just recently had a lengthy discussion about
> > increasing the max dimension and whatever one thinks of OpenAI, fact
> > is, that it has a huge impact and I think it would be nice that Lucene
> > could be part of this "revolution". All we have to do is increase the
> > limit from 1024 to 1536 or even 2048 for example.
> >
> > Since the performance seems to be linear with the vector dimension and
> > several members have done performance tests successfully and 1024
> > seems to have been chosen as max dimension quite arbitrarily in the
> > first place, I think it should not be a problem to increase the max
> > dimension by a factor 1.5 or 2.
> >
> > WDYT?
> >
> > Thanks
> >
> > Michael
> >
> >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
>

Lucene PMC Chair Greg Miller

2023-03-06 Thread Bruno Roustant

Hello Lucene developers,

Lucene Program Management Committee has elected a new chair, Greg Miller,
and the Board has approved.

Greg, thank you for stepping up, and congratulations!


- Bruno

Re: Lucene 9.5.0 release

2023-01-16 Thread Bruno Roustant

+1
Thanks Luca!

On Mon, Jan 16, 2023 at 16:04, Ignacio Vera wrote:

> +1
>
> On Mon, Jan 16, 2023 at 12:58 PM Alan Woodward 
> wrote:
>
>> +1, thanks Luca!
>>
>> On 13 Jan 2023, at 09:54, Luca Cavanna  wrote:
>>
>> Hi all,
>> I'd like to propose that we release Lucene 9.5.0. There is a decent
>> amount of changes that would go into it looking at the github milestone:
>> https://github.com/apache/lucene/milestone/4 . I'd volunteer to be the
>> release manager. There is one PR open listed for the 9.5 milestone:
>> https://github.com/apache/lucene/pull/11873 . Is this something that we
>> do want to address before we release? Is anybody aware of outstanding work
>> that we would like to include or known blocker issues that are not listed
>> in the 9.5 milestone?
>>
>> Cheers
>> Luca
>>
>>
>>
>>
>>
>>

Re: Welcome Luca Cavanna as Lucene committer

2022-10-06 Thread Bruno Roustant

Welcome!

On Thu, Oct 6, 2022 at 11:20, Michael Sokolov wrote:

> Welcome Luca!
>
> On Thu, Oct 6, 2022, 1:05 AM 陆徐刚  wrote:
>
>> Welcome！
>>
>> Xugang
>>
>> https://github.com/LuXugang
>>
>> On Oct 6, 2022, at 13:59, Mikhail Khludnev  wrote:
>>
>> 
>> Welcome, Luca.
>>
>> On Wed, Oct 5, 2022 at 8:04 PM Adrien Grand  wrote:
>>
>>> I'm pleased to announce that Luca Cavanna has accepted the PMC's
>>> invitation to become a committer.
>>>
>>> Luca, the tradition is that new committers introduce themselves with a
>>> brief bio.
>>>
>>> Congratulations and welcome!
>>>
>>> --
>>> Adrien
>>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>>

Re: MergeTrigger consistency in MergePolicy "find merges"

2022-06-21 Thread Bruno Roustant

Ok, thanks for all the details. I understand MergeTrigger is not present on
purpose.

On Mon, Jun 20, 2022 at 16:17, Adrien Grand wrote:

> Some comments on JIRA suggest that this is expected, because natural
> merges can have a variety of triggers while forced merges are always called
> by the app. I guess you could argue that MERGE_FINISHED is a different
> trigger, but are there use-cases for doing things differently in
> findForcedMerges depending on the merge trigger?
>
>
> https://issues.apache.org/jira/browse/LUCENE-4472?focusedCommentId=13476920=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13476920
> .
>
> On Mon, Jun 20, 2022 at 3:26 PM Bruno Roustant 
> wrote:
>
>> I agree this AlwaysForceMergePolicy is not working correctly. It's just a
>> test I did to easily understand how MergeTrigger.MERGE_FINISHED was working.
>>
>> Anyway my question is only about the MergeTrigger not present in the call
>> to findForcedMerges(), to know if it is expected or inconsistent with the
>> other find merges methods.
>>
>>
>> On Mon, Jun 20, 2022 at 14:26, Adrien Grand wrote:
>>
>>> Wouldn't this be a bug in the AlwaysForceMergePolicy, which should
>>> return no merges if there is already a single segment with no deletes?
>>>
>>> On Mon, Jun 20, 2022 at 1:30 PM Bruno Roustant 
>>> wrote:
>>>
>>>> If I use a simple "AlwaysForceMergePolicy" in a test, I can see that
>>>> when I run IndexWriter.forceMerge(), the first call to
>>>> AlwaysForceMergePolicy.findForcedMerges() is done for the
>>>> MergeTrigger.EXPLICIT. But then, at IndexWriter.merge() line 4531,
>>>> MergePolicy.findForcedMerges() is called with MergeTrigger.MERGE_FINISHED
>>>> to merge the segments produced by the output of the first explicit forced
>>>> merge, and so on. For this degenerated AlwaysForceMergePolicy, the test
>>>> runs merges in an infinite loop.
>>>>
>>>> On Mon, Jun 20, 2022 at 11:11, Adrien Grand wrote:
>>>>
>>>>> You seem to imply that `forceMerge` runs a cascaded merge where the
>>>>> first merge creates some new segments that become inputs to a second 
>>>>> merge.
>>>>> Have you considered running a single merge? We had a discussion about
>>>>> cascaded forced merges and TieredMergePolicy last year and ended up
>>>>> changing `findForcedMerges` to never run cascaded merges:
>>>>> https://issues.apache.org/jira/browse/LUCENE-7020.
>>>>>
>>>>> On Mon, Jun 20, 2022 at 10:31 AM Bruno Roustant <
>>>>> bruno.roust...@gmail.com> wrote:
>>>>>
>>>>>> MergePolicy "find merges" methods take a MergeTrigger as parameter,
>>>>>> except findForcedMerges() and findForcedDeletesMerges().
>>>>>> In my use-case, I could leverage a MergeTrigger in
>>>>>> findForcedMerges(), which can be EXPLICIT or MERGE_FINISHED, to
>>>>>> differentiate the merge selection between the initial explicit call and 
>>>>>> the
>>>>>> subsequent calls triggered after the first merges.
>>>>>>
>>>>>> Should we add a MergeTrigger parameter to all MergePolicy "find
>>>>>> merges" methods for consistency?
>>>>>> If so, is it an internal or public API? (should this change stay in
>>>>>> the main branch only)
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Adrien
>>>>>
>>>>
>>>
>>> --
>>> Adrien
>>>
>>
>
> --
> Adrien
>

Re: MergeTrigger consistency in MergePolicy "find merges"

2022-06-20 Thread Bruno Roustant

I agree this AlwaysForceMergePolicy is not working correctly. It's just a
test I did to easily understand how MergeTrigger.MERGE_FINISHED was working.

Anyway my question is only about the MergeTrigger not present in the call
to findForcedMerges(), to know if it is expected or inconsistent with the
other find merges methods.


On Mon, Jun 20, 2022 at 14:26, Adrien Grand wrote:

> Wouldn't this be a bug in the AlwaysForceMergePolicy, which should return
> no merges if there is already a single segment with no deletes?
>
> On Mon, Jun 20, 2022 at 1:30 PM Bruno Roustant 
> wrote:
>
>> If I use a simple "AlwaysForceMergePolicy" in a test, I can see that when
>> I run IndexWriter.forceMerge(), the first call to
>> AlwaysForceMergePolicy.findForcedMerges() is done for the
>> MergeTrigger.EXPLICIT. But then, at IndexWriter.merge() line 4531,
>> MergePolicy.findForcedMerges() is called with MergeTrigger.MERGE_FINISHED
>> to merge the segments produced by the output of the first explicit forced
>> merge, and so on. For this degenerated AlwaysForceMergePolicy, the test
>> runs merges in an infinite loop.
>>
>> On Mon, Jun 20, 2022 at 11:11, Adrien Grand wrote:
>>
>>> You seem to imply that `forceMerge` runs a cascaded merge where the
>>> first merge creates some new segments that become inputs to a second merge.
>>> Have you considered running a single merge? We had a discussion about
>>> cascaded forced merges and TieredMergePolicy last year and ended up
>>> changing `findForcedMerges` to never run cascaded merges:
>>> https://issues.apache.org/jira/browse/LUCENE-7020.
>>>
>>> On Mon, Jun 20, 2022 at 10:31 AM Bruno Roustant <
>>> bruno.roust...@gmail.com> wrote:
>>>
>>>> MergePolicy "find merges" methods take a MergeTrigger as parameter,
>>>> except findForcedMerges() and findForcedDeletesMerges().
>>>> In my use-case, I could leverage a MergeTrigger in findForcedMerges(),
>>>> which can be EXPLICIT or MERGE_FINISHED, to differentiate the merge
>>>> selection between the initial explicit call and the subsequent calls
>>>> triggered after the first merges.
>>>>
>>>> Should we add a MergeTrigger parameter to all MergePolicy "find merges"
>>>> methods for consistency?
>>>> If so, is it an internal or public API? (should this change stay in the
>>>> main branch only)
>>>>
>>>
>>>
>>> --
>>> Adrien
>>>
>>
>
> --
> Adrien
>

Re: MergeTrigger consistency in MergePolicy "find merges"

2022-06-20 Thread Bruno Roustant

If I use a simple "AlwaysForceMergePolicy" in a test, I can see that when I
run IndexWriter.forceMerge(), the first call to
AlwaysForceMergePolicy.findForcedMerges() is done for the
MergeTrigger.EXPLICIT. But then, at IndexWriter.merge() line 4531,
MergePolicy.findForcedMerges() is called with MergeTrigger.MERGE_FINISHED
to merge the segments produced by the output of the first explicit forced
merge, and so on. For this degenerated AlwaysForceMergePolicy, the test
runs merges in an infinite loop.
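
The loop can be reproduced with a toy simulation. This is only a sketch under
simplifying assumptions: it does not use the Lucene API, every name in it is
hypothetical, and each merge is assumed to collapse all segments into one. The
policy is asked again after every finished merge, so a policy that always
answers "merge" never stops, while a policy that looks at the segment count or
at the trigger stops after the first merge.

```java
/** Toy model of the forced-merge cascade (hypothetical names, not the Lucene API). */
final class ForcedMergeLoopSketch {
  /** The two trigger values discussed in this thread. */
  enum Trigger { EXPLICIT, MERGE_FINISHED }

  /** Simplified merge policy: says whether it wants another forced merge. */
  interface Policy {
    boolean wantsMerge(int segmentCount, Trigger trigger);
  }

  /**
   * Counts the merges that run before the policy stops asking for more.
   * Returns maxRounds if the policy never stops (the infinite loop case).
   */
  static int countMerges(int segmentCount, Policy policy, int maxRounds) {
    int merges = 0;
    Trigger trigger = Trigger.EXPLICIT; // first call comes from the explicit forceMerge
    while (merges < maxRounds && policy.wantsMerge(segmentCount, trigger)) {
      segmentCount = 1;                 // assumption: the merge produces a single segment
      merges++;
      trigger = Trigger.MERGE_FINISHED; // later calls are triggered by a finished merge
    }
    return merges;
  }
}
```

For example, `countMerges(5, (n, t) -> true, 10)` hits the cap of 10 rounds,
whereas `(n, t) -> n > 1` or `(n, t) -> t == Trigger.EXPLICIT` both stop after
a single merge. The last variant is the kind of selection a trigger parameter
would make possible.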

On Mon, Jun 20, 2022 at 11:11, Adrien Grand wrote:

> You seem to imply that `forceMerge` runs a cascaded merge where the first
> merge creates some new segments that become inputs to a second merge. Have
> you considered running a single merge? We had a discussion about cascaded
> forced merges and TieredMergePolicy last year and ended up changing
> `findForcedMerges` to never run cascaded merges:
> https://issues.apache.org/jira/browse/LUCENE-7020.
>
> On Mon, Jun 20, 2022 at 10:31 AM Bruno Roustant 
> wrote:
>
>> MergePolicy "find merges" methods take a MergeTrigger as parameter,
>> except findForcedMerges() and findForcedDeletesMerges().
>> In my use-case, I could leverage a MergeTrigger in findForcedMerges(),
>> which can be EXPLICIT or MERGE_FINISHED, to differentiate the merge
>> selection between the initial explicit call and the subsequent calls
>> triggered after the first merges.
>>
>> Should we add a MergeTrigger parameter to all MergePolicy "find merges"
>> methods for consistency?
>> If so, is it an internal or public API? (should this change stay in the
>> main branch only)
>>
>
>
> --
> Adrien
>

MergeTrigger consistency in MergePolicy "find merges"

2022-06-20 Thread Bruno Roustant

MergePolicy "find merges" methods take a MergeTrigger as parameter, except
findForcedMerges() and findForcedDeletesMerges().
In my use-case, I could leverage a MergeTrigger in findForcedMerges(),
which can be EXPLICIT or MERGE_FINISHED, to differentiate the merge
selection between the initial explicit call and the subsequent calls
triggered after the first merges.

Should we add a MergeTrigger parameter to all MergePolicy "find merges"
methods for consistency?
If so, is it an internal or public API? (should this change stay in the
main branch only)

Re: Welcome Greg Miller to the Lucene PMC

2022-06-07 Thread Bruno Roustant

Welcome Greg!

On Tue, Jun 7, 2022 at 08:37, Adrien Grand wrote:

> I'm pleased to announce that Greg Miller has accepted an invitation to
> join the Lucene PMC!
>
> Congratulations Greg, and welcome aboard!
>
> --
> Adrien
>

Re: [VOTE] Migration to GitHub issue from Jira (LUCENE-10557)

2022-06-07 Thread Bruno Roustant

+0 (PMC)

While I like the simplification, I'm a little concerned by the risk of
disruption in history.

On Tue, Jun 7, 2022 at 05:07, Tomoko Uchida wrote:

> I'm sorry there was a mistake in the important date. This is the
> corrected version.
>
> ==
> this vote received 13 ballots in total (including +1, +0, and -1) so
> far, this does not reach the quorum of 15. I'll extend the term to
> 2022-06-13 16:00 UTC.
>
> This is a friendly reminder note in case you have missed it in my first
> post.
>
> *IMPORTANT NOTE*
> I set a local protocol for this vote.
> There are 95 committers on this project [3] - the vote will be
> effective if it successfully gains more than 15% of voters (>= 15)
> from committers (including PMC members). This means, that although
> only PMC member votes are counted for the final result, the votes from
> all committers are important to make the vote result effective.
>
> If there are less than 15 votes at 2022-06-06 16:00 UTC, I will expand
> the term to 2022-06-13 16:00 UTC. If this fails to get sufficient
> voters after the expanded time limit, I'll cancel this vote regardless
> of the result.
>
> Thanks,
> Tomoko
>
> On Tue, Jun 7, 2022 at 12:03, Tomoko Uchida wrote:
> >
> > Hi all,
> > this vote received 13 ballots in total (including +1, +0, and -1) so
> > far, this does not reach the quorum of 15. I'll extend the term to
> > 2022-06-06 16:00 UTC.
> >
> > This is a friendly reminder note in case you have missed it in my first
> post.
> >
> > *IMPORTANT NOTE*
> > I set a local protocol for this vote.
> > There are 95 committers on this project [3] - the vote will be
> > effective if it successfully gains more than 15% of voters (>= 15)
> > from committers (including PMC members). This means, that although
> > only PMC member votes are counted for the final result, the votes from
> > all committers are important to make the vote result effective.
> >
> > If there are less than 15 votes at 2022-06-06 16:00 UTC, I will expand
> > the term to 2022-06-13 16:00 UTC. If this fails to get sufficient
> > voters after the expanded time limit, I'll cancel this vote regardless
> > of the result.
> >
> > Thanks,
> > Tomoko
> >
> > On Wed, Jun 1, 2022 at 6:19, Alessandro Benedetti wrote:
> > >
> > > +1(committer, non PMC)
> > >
> > > Lately I kinda feel having to create the Jira, after I detailed a
> contribution in the pull request, is just a boilerplate activity of copying
> and pasting and tagging again.
> > > I would be happy to reduce this burden.
> > > I left other details in the discussion thread.
> > >
> > > Cheers
> > >
> > >
> > > On Tue, 31 May 2022, 21:19 Jason Gerlowski, 
> wrote:
> > >>
> > >> +1 (PMC)
> > >>
> > >> I understand concerns about handing governance over to a 3rd party,
> but letting that drive our decision-making here feels like optimizing for a
> rare case that might never occur.  I'd much rather optimize for making
> things easiest for contributors, and then accommodate any "Github ToS ban,
> sanctions, etc." situations if and when they crop up on a case by case
> basis.
> > >>
> > >> Best,
> > >>
> > >> Jason
> > >>
> > >> On Tue, May 31, 2022 at 10:09 AM Gus Heck  wrote:
> > >>>
> > >>> -1 I think the disruption and bifurcation of where to find history
> is not worth it. I also noticed a comment in the lucene issue for migration
> with summaries by date range, status, affects version,  etc. sub-area,
> exactly the sort of thing I expect to be much more difficult to obtain from
> github. What I would find interesting is a deep integration of the two
> systems so that initiation and basic commenting could be handled on github,
> but transmitted to Jira where full metadata and reporting/tracking could be
> maintained.
> > >>>
> > >>> On Tue, May 31, 2022 at 12:17 AM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
> > 
> >  -1
> > 
> >  On Tue, 31 May, 2022, 4:06 am Xi Chen,
>  wrote:
> > >
> > > +1 from me (committer, non-PMC)
> > >
> > > Thanks Tomoko for starting the discussion and organizing / leading
> this effort!
> > >
> > > Best,
> > > Zach
> > >
> > > On May 30, 2022, at 2:56 PM, Houston Putman 
> wrote:
> > >
> > > 
> > > +1 Approve (PMC)
> > >
> > > Thanks so much for doing all of the work for this Tomoko!
> > >
> > > - Houston
> > >
> > > On Mon, May 30, 2022 at 5:38 PM David Smiley 
> wrote:
> > >>
> > >> +1 Approve (PMC)
> > >>
> > >> ~ David Smiley
> > >> Apache Lucene/Solr Search Developer
> > >> http://www.linkedin.com/in/davidwsmiley
> > >>
> > >>
> > >> On Mon, May 30, 2022 at 11:40 AM Tomoko Uchida <
> tomoko.uchida.1...@gmail.com> wrote:
> > >>>
> > >>> Hi everyone!
> > >>>
> > >>> As we had previous discussion thread [1], I propose migration to
> GitHub issue from Jira.
> > >>> It'd be technically possible (see [2] for details) and I think
> it'd be good for the project - not only for welcoming new developers who

Re: Lucene PMC Chair Bruno Roustant

2022-03-25 Thread Bruno Roustant

Thanks All,
If we extrapolate from the first release 9.1.0 of 2022, it's going to be a
great year!

Bruno

On Thu, Mar 24, 2022 at 16:35, Houston Putman wrote:

> Congrats Bruno, and thanks Michael for doing such an incredible job!
>
> - Houston
>
> On Thu, Mar 24, 2022 at 10:45 AM Alessandro Benedetti <
> a.benede...@sease.io> wrote:
>
>> Thanks Michael for the amazing work last year!
>> Welcome Bruno, I am sure you'll do great!
>> Cheers
>>
>> On Thu, 24 Mar 2022, 09:23 Uwe Schindler,  wrote:
>>
>>> Hi,
>>>
>>> Thanks Michael for all the hard work last year.
>>> Welcome Bruno!
>>>
>>> Uwe
>>>
>>> -
>>> Uwe Schindler
>>> Achterdiek 19, D-28357 Bremen
>>> https://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>>
>>> > -Original Message-
>>> > From: Michael Sokolov 
>>> > Sent: Wednesday, March 23, 2022 2:03 PM
>>> > To: Lucene Dev 
>>> > Subject: Lucene PMC Chair Bruno Roustant
>>> >
>>> > Hello, Lucene developers. Lucene Program Management Committee has
>>> > elected a new chair, Bruno Roustant, and the Board has approved.
>>> > Bruno, thank you for stepping up, and congratulations!
>>> >
>>> > -Mike
>>> >

Re: Lucene 9.1 release soon?

2022-03-01 Thread Bruno Roustant

+1 Thanks Julie

On Fri, Feb 25, 2022 at 13:58, Michael Sokolov wrote:

> +1 thanks for volunteering
>
> On Thu, Feb 24, 2022, 5:41 AM Mayya Sharipova
>  wrote:
>
>> + 1
>>
>> On Thu, Feb 24, 2022 at 11:28 AM Ignacio Vera  wrote:
>>
>>> +1
>>>
>>> On Thu, Feb 24, 2022 at 9:05 AM Adrien Grand  wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Feb 24, 2022 at 8:43 AM Michael Wechner wrote:
>>>> >
>>>> > I think this would be great :-) thank you very much for your efforts!
>>>> >
>>>> > Michael
>>>> >
>>>> > On 24 Feb 2022 at 00:28, Julie Tibshirani wrote:
>>>> > > Hello everyone,
>>>> > >
>>>> > > Would there be support for releasing Lucene 9.1 soon? It has been ~2.5
>>>> > > months since 9.0 was released and we already have a long list of new
>>>> > > features, optimizations, and bug fixes
>>>> > > (https://github.com/apache/lucene/blob/branch_9x/lucene/CHANGES.txt).
>>>> > >
>>>> > > If so, I am happy to take a shot at being release manager. I did not
>>>> > > see any issues marked "blocker", but please let me know if there are any.
>>>> > >
>>>> > > Julie
>>>> >
>>>> >
>>>> > -
>>>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>>>> >
>>>>
>>>>
>>>> --
>>>> Adrien
>>>>
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Welcome Guo Feng as Lucene committer

2022-01-26 Thread Bruno Roustant

Welcome!

On Tue, Jan 25, 2022 at 22:38, Vigya Sharma wrote:

> Congratulations Feng!
>
> On Tue, Jan 25, 2022 at 1:12 PM Julie Tibshirani 
> wrote:
>
>> Welcome!!
>>
>> On Tue, Jan 25, 2022 at 11:00 AM Marcus Eagan 
>> wrote:
>>
>>> Congratulations Feng!
>>>
>>> On Tue, Jan 25, 2022 at 10:51 AM Anshum Gupta 
>>> wrote:
>>>
>>>> Congratulations and welcome, Feng!
>>>>
>>>> On Tue, Jan 25, 2022 at 1:09 AM Adrien Grand wrote:
>>>>
>>>>> I'm pleased to announce that Guo Feng has accepted the PMC's
>>>>> invitation to become a committer.
>>>>>
>>>>> Feng, the tradition is that new committers introduce themselves with a
>>>>> brief bio.
>>>>>
>>>>> Congratulations and welcome!
>>>>>
>>>>> --
>>>>> Adrien
>>>>>
>>>>> -
>>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>>
>>>>>
>>>>
>>>> --
>>>> Anshum Gupta
>>>>
>>> --
>>> Marcus Eagan
>>>
>>>
>
> --
> warm regards,
> Vigya
>

Re: [VOTE] Release Lucene/Solr 8.11.0 RC1

2021-11-11 Thread Bruno Roustant

+1
SUCCESS! [1:17:35.209577]

On Thu, Nov 11, 2021 at 18:30, Julie Tibshirani wrote:

> +1 (nonbinding)
> SUCCESS! [1:04:58.967300]
>
> On Thu, Nov 11, 2021 at 7:12 AM David Smiley  wrote:
>
>> +1
>> SUCCESS! [0:57:23.948714]
>>
>

Re: Welcome Zach Chen as Lucene committer

2021-04-20 Thread Bruno Roustant

Welcome Zach!

On Tue, Apr 20, 2021 at 10:59, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Congrats, Zach! Thanks for your contributions, looking forward to more!
>
> On Tue, 20 Apr, 2021, 2:26 pm Alan Woodward,  wrote:
>
>> Congratulations and welcome!
>>
>> > On 19 Apr 2021, at 15:13, Adrien Grand  wrote:
>> >
>> > I'm pleased to announce that Zach Chen has accepted the PMC's
>> invitation to become a committer.
>> >
>> > Zach, the tradition is that new committers introduce themselves with a
>> brief bio.
>> >
>> > Congratulations and welcome!
>> >
>> > --
>> > Adrien
>>
>>

Re: [VOTE] Release Lucene/Solr 8.8.2 RC1

2021-04-08 Thread Bruno Roustant

+1 (binding)

Ran the smoke tester successfully.

Bruno

On Thu, Apr 8, 2021 at 04:38, Anshum Gupta wrote:

> +1 (binding)
>
> Ran a sample indexing/search app and browsed through the admin UI.
>
> Smoketester is happy!
>
> SUCCESS! [1:05:05.761354]
>
>
> On Tue, Apr 6, 2021 at 3:45 PM Mike Drob  wrote:
>
>> Please vote for release candidate 1 for Lucene/Solr 8.8.2
>>
>> The artifacts can be downloaded from:
>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.8.2-RC1-reva92a05e195b775b30ca410bc0a26e8e79e7b3bfb
>>
>> You can run the smoke tester directly with this command:
>>
>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.8.2-RC1-reva92a05e195b775b30ca410bc0a26e8e79e7b3bfb
>>
>> The vote will be open until 2021-04-12 00:00 UTC. I will tally votes on
>> Monday morning.
>>
>> [ ] +1  approve
>> [ ] +0  no opinion
>> [ ] -1  disapprove (and reason why)
>>
>> Here is my +1
>>
>
>
> --
> Anshum Gupta
>

Re: Welcome Peter Gromov as Lucene committer

2021-04-07 Thread Bruno Roustant

Welcome Peter!

On Wed, Apr 7, 2021 at 09:11, Peter Gromov wrote:

> Thanks for the honor!
>
> (BTW I'm still not recognized by Github as having write access, and can't
> merge my pull requests :))
>
> > Peter, the tradition is that new committers introduce themselves with a
> brief bio.
>
> Okay, time for some bragging :) I've been working at JetBrains for some 17
> years, most of them on IntelliJ platform ,
> mainly supporting various languages and their infrastructure, analyzing
> snapshots and improving performance. Aiming to catch more bugs before they
> hit production, I've introduced property-based testing to IntelliJ by
> creating a small library called jetCheck
> . Recently I've switched to the
> Grazie  project and
> now I do some rule-based computational linguistics there and enhance the
> IDE support for English. As Grazie needs LanguageTool
>  and Hunspell, I've also spent some time
> rewriting the latter in Java (here in Lucene), and optimizing them both. In
> my free time, I like mountain hiking (Munich/Germany is a great location
> for that!), and some amateur piano/harmonica playing/singing
> .
>
>>

Re: Lucene and Solr repositories mirrored, main branch ready

2021-03-11 Thread Bruno Roustant

Thank you Dawid!

On Thu, Mar 11, 2021 at 02:28, Michael Sokolov wrote:

> Big thank you, Dawid, and Jan and others for taking the bull by the horns!
>
> On Wed, Mar 10, 2021, 3:14 PM Dawid Weiss  wrote:
>
>> > Just tested out the main branch of the new repo, packaged, started,
>> loaded data, searched from the UI. All looks great.
>>
>> Thank you, great to know!
>>
>> Dawid
>>

Re: [DISCUSS] Sunset the general@l.a.o mailing list?

2021-03-01 Thread Bruno Roustant

+1

On Sun, Feb 28, 2021 at 22:23, Andi Vajda wrote:

>
> On Sun, 28 Feb 2021, Jan Høydahl wrote:
>
> > Hi
> >
> > The general@ list is not being used for practically anything. I see
> some
> > user questions there and we announce releases. It may have had more
> > purpose when there were 5 sub projects in Lucene. Now it is more
> confusing
> > users and they do not get timely replies. The list has 1088 subscribers.
> >
> > I propose to discontinue the list, i.e. make it Read-Only and remove it
> > from the web page. Anyone who would miss it?
>
> I've been sending periodic PyLucene release votes there in order not to
> spam
> lucene-dev but I guess I can use lucene-dev instead ?
>
> Andi..
>

Re: 8.8 Release

2020-12-19 Thread Bruno Roustant

+1 Thanks for volunteering

On Fri, Dec 18, 2020 at 01:41, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Sure, Houston. I'll wait another week. Have a good new year and merry
> Christmas!
>
> On Fri, 18 Dec, 2020, 5:58 am Timothy Potter, 
> wrote:
>
>> Great point Houston! +1 on waiting until a week into January
>>
>> On Thu, Dec 17, 2020 at 4:46 PM Houston Putman 
>> wrote:
>>
>>> Thanks for volunteering Ishan.
>>>
>>> I think it might be a good idea to wait to cut and release 8.8 at least
>>> a week into January. Many people are going to be away during the holiday
>>> season, and particularly the last week of the year. Pushing into January
>>> just gives more people a chance to look at the release and be involved.
>>>
>>> - Houston
>>>
>>> On Fri, Dec 11, 2020 at 3:26 PM Noble Paul  wrote:
>>>
 Thanks Ishan for volunteering

 On Fri, Dec 11, 2020 at 5:07 AM Christine Poerschke (BLOOMBERG/
 LONDON)  wrote:
 >
 > With a view towards including it in the release, I'd appreciate code
 review input on
 >
 > https://github.com/apache/lucene-solr/pull/1992 for
 >
 > https://issues.apache.org/jira/browse/SOLR-14939 (JSON facets: range
 faceting to support cache=false parameter)
 >
 > if anyone has some time next week perhaps?
 >
 > Thanks in advance!
 >
 > Christine
 >
 > From: dev@lucene.apache.org At: 12/10/20 18:01:58
 > To: dev@lucene.apache.org
 > Subject: Re: 8.8 Release
 >
 > +1
 >
 > Joel Bernstein
 > http://joelsolr.blogspot.com/
 >
 >
 > On Thu, Dec 10, 2020 at 11:23 AM David Smiley 
 wrote:
 >>
 >> Thanks for volunteering!
 >>
 >> On Thu, Dec 10, 2020 at 11:11 AM Ishan Chattopadhyaya <
 ichattopadhy...@gmail.com> wrote:
 >>>
 >>> Hi Devs,
 >>> There are lots of changes accumulated and some underway. I wish to
 volunteer for a 8.8 release, if there are no objections. I'm planning to
 build the RC in three weeks, i.e. 31 December (and cut the branch about 3-4
 days before that). Please let me know if someone has any concerns.
 >>> Thanks and regards,
 >>> Ishan
 >>>
 >> --
 >> Sent from Gmail Mobile
 >
 >


 --
 -
 Noble Paul


Re: Welcome Julie Tibshirani as Lucene/Solr committer

2020-11-19 Thread Bruno Roustant

Congrats Julie!

On Thu, Nov 19, 2020 at 11:38, Alessandro Benedetti wrote:

> Welcome onboard Julie!
> --
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> www.sease.io
>
>
> On Thu, 19 Nov 2020 at 03:23, Tomás Fernández Löbbe 
> wrote:
>
>> Welcome Julie!
>>
>> On Wed, Nov 18, 2020 at 6:59 PM Ilan Ginzburg  wrote:
>>
>>> Welcome Julie and congrats!
>>>
>>> On Thu, Nov 19, 2020 at 3:51 AM Julie Tibshirani 
>>> wrote:
>>>
 Thank you for the warm welcome! It’s a big honor for me -- I’ve been a
 Lucene fan since the start of my software career. I’m excited to contribute
 to such a great project.

 I’m a developer at Elastic focused on core search features. My
 professional background is in information retrieval and data systems. I
 also have an interest in statistical computing and machine learning
 software. I’m originally from Canada but have lived in the SF Bay Area for
 many years now. Some of my favorite things…
 * Color: purple
 * Album: Siamese Dream
 * Java keyword: final

 Julie

 On Wed, Nov 18, 2020 at 6:33 PM Ishan Chattopadhyaya <
 ichattopadhy...@gmail.com> wrote:

> Welcome Julie!
>
> On Thu, 19 Nov, 2020, 12:10 am Erick Erickson, <
> erickerick...@gmail.com> wrote:
>
>> Welcome Julie!
>>
>> > On Nov 18, 2020, at 1:21 PM, Alexandre Rafalovitch <
>> arafa...@gmail.com> wrote:
>> >
>> > Juliet from the house of Elasticsearch meets an interesting,
>> relevancy-aware committer from the house of Solr.
>> >
>> > Such a romantic beginning. Not sure I want to know the end of that
>> heroine's journey.
>> >
>> > :-)
>> >
>> > On Wed., Nov. 18, 2020, 12:59 p.m. Dawid Weiss, <
>> dawid.we...@gmail.com> wrote:
>> >
>> > Congratulations and welcome, Julie.
>> >
>> > I think juliet is not a bad nick at all, you just need to who -all
>> | grep "romeo"... :)
>> >
>> > Dawid
>> >
>> > On Wed, Nov 18, 2020 at 4:08 PM Michael Sokolov 
>> wrote:
>> > I'm pleased to announce that Julie Tibshirani has accepted the PMC's
>> > invitation to become a committer.
>> >
>> > Julie, the tradition is that new committers introduce themselves
>> with
>> > a brief bio.
>> >
>> > I think we may still be sorting out the details of your Apache
>> account
>> > (julie@ may have been taken?), but as soon as that has been sorted
>> out
>> >  and karma has been granted, you can use your new powers to add
>> > yourself to the committers section of the Who We Are page on the
>> > website: 
>> >
>> > Congratulations and welcome!
>> >
>> > Mike Sokolov
>> >
>> >
>> >
>>
>>

Re: Payloads for each term

2020-10-26 Thread Bruno Roustant

Hi Ankur,
Indeed payloads are the standard way to solve this problem. For light
queries with a few top N results that should be efficient. For multi-term
queries that could become penalizing if you need to access the payloads of
too many terms.
Also, there is an experimental PostingsFormat called
SharedTermsUniformSplit (class named STUniformSplitPostingsFormat) that
would allow you to effectively share the overlapping terms in the index
while having 50 fields. This would solve the index bloat issue, but would
not fully solve the seeks issue. You might want to benchmark this approach
too.
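
As a rough illustration of what a per-term payload could carry here, below is a minimal sketch of packing a sparse feature vector into bytes and reading it back. It only shows one possible byte layout; it is not Lucene API, and the names `encode_features`, `decode_features` and `NUM_DIMENSIONS` are hypothetical. The layout (one unsigned byte for the dimension id, then a big-endian float32 value) is an assumption chosen because the thread mentions about 50 dimensions.

```python
import struct

# Assumed upper bound on dimension ids, taken from the "~50" mentioned in the thread.
NUM_DIMENSIONS = 50

# One entry = 1 unsigned byte (dimension id) + 1 big-endian float32 (value) = 5 bytes.
ENTRY_FORMAT = ">Bf"


def encode_features(features):
    """Pack a sparse {dimension_id: value} mapping into a compact byte string."""
    out = bytearray()
    for dim in sorted(features):
        if not 0 <= dim < NUM_DIMENSIONS:
            raise ValueError(f"dimension id out of range: {dim}")
        out += struct.pack(ENTRY_FORMAT, dim, features[dim])
    return bytes(out)


def decode_features(payload):
    """Unpack the byte string produced by encode_features back into a dict."""
    return {dim: value for dim, value in struct.iter_unpack(ENTRY_FORMAT, payload)}


# Only the dimensions that are present cost any bytes (5 bytes each).
payload = encode_features({3: 0.5, 17: 2.0})
features = decode_features(payload)
```

With this layout a term that has 2 of the ~50 features set costs 10 bytes, which is the kind of trade-off to measure against 50 separate fields when benchmarking.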

Bruno

On Fri, Oct 23, 2020 at 02:48, Ankur Goel wrote:

> Hi Lucene Devs,
>I have a need to store a sparse feature vector on a per term
> basis. The total number of possible dimensions are small (~50) and known at
> indexing time. The feature values will be used in scoring along with corpus
> statistics. It looks like payloads were
> created for this exact same purpose but some workaround is needed to
> minimize the performance penalty as mentioned on the wiki.
>
> An alternative is to override *term frequency* to be a *pointer* in a 
> *Map Feature_Vector>* serialized and stored in *BinaryDocValues*. At query
> time, the matching *docId *will be used to advance the pointer to the
> starting offset of this map*. *The term frequency will be used to perform
> lookup into the serialized map to retrieve the* Feature_Vector. *That's
> my current plan but I haven't benchmarked it.
>
> The problem that I am trying to solve is to *reduce the index bloat* and
> *eliminate* *unnecessary seeks* as currently these ~50 dimensions are
> stored as separate fields in the index with very high term overlap and
> Lucene does not share Terms dictionary across different fields. This itself
> can be a new feature for Lucene but will require lots of work I imagine.
>
> Any ideas are welcome :-)
>
> Thanks
> -Ankur
>

Re: Code Analysis during CI?

2020-09-09 Thread Bruno Roustant

+1 for analysis within the PR workflow.

On Fri, Sep 4, 2020 at 06:38, David Smiley wrote:

> Sounds great to me!  I'm really glad to hear it works with the PR
> workflow, and only on the files touched in the PR.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Thu, Sep 3, 2020 at 8:03 PM Tom DuBuisson  wrote:
>
>> Tomás,
>> Oof, thanks for the note on TOS.  I fixed the link.  The tool can be
>> configured and I'm happy to make things work better for your use case.
>> Muse is free for public repos and will remain free for open source
>> indefinitely.  You can try it and remove it any time - github is in charge
>> of access control and provides you as the repository owner with control via
>> the website.
>>
>> On Thu, Sep 3, 2020 at 4:37 PM Tomás Fernández Löbbe <
>> tomasflo...@gmail.com> wrote:
>>
>>> Thanks Tom. I think this could be very useful as long as it can be
>>> configurable. (The "terms of use" here [1] link to "google.com", so I
>>> couldn't check that, but they claim it's free for public repos, so...). We
>>> could always try it and remove it if we don't like it? What do others think?
>>>
>>>
>>> [1] https://github.com/apps/muse-dev
>>>
>>> On Thu, Sep 3, 2020 at 3:06 PM Tom DuBuisson  wrote:
>>>
 Hello Lucene/Solr folks,

 During Lucene development CI is used for build and unit tests to gate
 merges.  The CI doesn't yet include any analysis tools though, but their
 use has been discussed [1].  I fixed some issues flagged by Facebook's
 Infer and was prompted to bring up the topic here [2].

 The recent PR fixed some low-hanging fruit that was reported when I ran
 Muse [3] - a github app that is a platform for static analysis tools.
  Muse's platform bundles the most useful analysis tools, all open source
 with many of them developed by FANG, and triggers analysis on PRs
 then delivers results as comments.

 Because of the PR-centric workflow you only see issues related to the
 changes in the pull request.  This means that even a project where tools
 give a daunting list of issues can still have quiet day-to-day operation.
 Muse also has options to configure individual tools and turn tools or
 warnings off entirely.  If there are concerns in addition to noise and
 added mental tax on development then I'd really like to hear those 
 thoughts.

 Would you be up for running Muse on the lucene-solr repo?  Let me know,
 and I hope to hear your thoughts on analysis tools either way.

 -Tom

 [1] https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-8847
 [2] https://issues.apache.org/jira/projects/SOLR/issues/SOLR-14819
 [3] Muse result on Lucene:
 https://console.muse.dev/result/TomMD/lucene-solr/01EH5WXS6C1RH1NFYHP6ATXTZ9?tab=results
 Muse app link: https://github.com/apps/muse-dev
 [4] https://github.com/TomMD/lucene-solr/pulls
 [5] Example of muse commenting on an issue
 https://github.com/TomMD/shiro/pull/2

Re: 8.6 release

2020-07-16 Thread Bruno Roustant

Thanks!
The Release Wizard is clearly a great help. I'm going to open a Jira issue
to fix some glitches (links to update, some git commands to improve, maybe
more explanation on some specific steps, etc.).

On Wed, Jul 15, 2020 at 18:42, Erick Erickson wrote:

> +1
>
> > On Jul 15, 2020, at 11:06 AM, Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
> >
> > Thanks for the release, Bruno. And congrats on a great job with it. :-)
> >
> > On Fri, Jul 10, 2020 at 1:47 AM Shawn Heisey 
> wrote:
> > On 7/9/2020 1:24 AM, Ishan Chattopadhyaya wrote:
> > > I wish to send out something like this to coincide with the release
> > > announcement.
> > > Please review:
> > >
> https://docs.google.com/document/d/1SdlZVXYgaeZVgL3Xs0ZFHs_RyvdNqNcx3xkEpjdkobU/edit?usp=sharing
> > >
> > > Unfortunately, my cwiki/confluence Apache login isn't working (and
> reset
> > > password isn't working either). If someone can provide me edit access
> to
> > > "ichattopadhyaya", then I can put together the confluence page as
> proposed.
> >
> > Confluence is LDAP-enabled, which means that you need to use your Apache
> > id and password on it.  Resetting that password is done on
> > https://id.apache.org/ and not on Confluence.
> >
> > According to whimsy, your apache id is ishan. You would need to use that
> > ID to log into Confluence.  As for the "ichattopadhyaya" login, if you
> > have anything important on it, you may be able to get INFRA to merge it
> > with your Apache login, similar to what they can do on Jira.  To find
> > out, you'll need to check with them.  I do not know what is possible.
> >
> > Thanks,
> > Shawn
> >
> >
>
>

[ANNOUNCE] Apache Solr 8.6.0 released

2020-07-15 Thread Bruno Roustant

The Lucene PMC is pleased to announce the release of Apache Solr 8.6.0.


Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search, dynamic clustering, database
integration, rich document handling, and geospatial search. Solr is highly
scalable, providing fault tolerant distributed search and indexing, and
powers the search and navigation features of many of the world's largest
internet sites.


Solr 8.6.0 is available for immediate download at:


  


### Solr 8.6.0 Release Highlights:


 * Cross-Collection Join Queries: Join queries can now work
cross-collection, even when shared or when spanning nodes.

 * Search: Performance improvement for some types of queries when exact hit
count isn't needed by using BlockMax WAND algorithm.

 * Streaming Expression: Percentiles and standard deviation aggregations
added to stats, facet and time series.  Streaming expressions added to
/export handler.  Drill Streaming Expression for efficient and accurate
high cardinality aggregation.

 * Package manager: Support for cluster (CoreContainer) level plugins.

 * Health Check: HealthCheckHandler can now require that all cores are
healthy before returning OK.

 * Zookeeper read API: A read API at /api/cluster/zk/* to fetch raw ZK data
and view contents of a ZK directory.

 * Admin UI: New panel with security info in admin UI's dashboard.

 * Query DSL: Support for {param:ref} and {bool: {excludeTags:""}}

 * Ref Guide: Major redesign of Solr's documentation.


Please read CHANGES.txt for a full list of new features and changes:


  


Solr 8.6.0 also includes features, optimizations  and bugfixes in the
corresponding Apache Lucene release:


  


Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases. It is possible that the mirror you are using may not
have replicated the release yet. If that is the case, please try another
mirror. This also applies to Maven access.

[ANNOUNCE] Apache Lucene 8.6.0 released

2020-07-15 Thread Bruno Roustant

The Lucene PMC is pleased to announce the release of Apache Lucene 8.6.0.


Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for nearly
any application that requires full-text search, especially cross-platform.


This release contains numerous bug fixes, optimizations, and improvements,
some of which are highlighted below. The release is available for immediate
download at:


  


### Lucene 8.6.0 Release Highlights:


 * API change in: SimpleFSDirectory, IndexWriterConfig, MergeScheduler,
SortFields, SimpleBindings, QueryVisitor, DocValues, CodecUtil.

 * New: IndexWriter merge-on-commit feature to selectively merge small
segments on commit, subject to a configurable timeout, to improve search
performance by reducing the number of small segments for searching.

 * New: Grouping by range based on DoubleValueSource and LongValueSource.

 * Optimizations: BKD trees and index, DoubleValuesSource/QueryValueSource,
UsageTrackingQueryingCachingPolicy, FST, Geometry queries, Points,
UniformSplit.

 * Others: Ukrainian analyzer, checksums verification, resource leaks fixes.


Please read CHANGES.txt for a full list of new features and changes:


  


Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases. It is possible that the mirror you are using may not
have replicated the release yet. If that is the case, please try another
mirror. This also applies to Maven access.

[VOTE] Release Lucene/Solr 8.6.0 RC1

2020-07-08 Thread Bruno Roustant

 Please vote for release candidate 1 for Lucene/Solr 8.6.0

The artifacts can be downloaded from:

https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.6.0-RC1-reva9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc

You can run the smoke tester directly with this command:

python3 -u dev-tools/scripts/smokeTestRelease.py \
https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.6.0-RC1-reva9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc

The vote will be open for at least 72 hours i.e. until 2020-07-11 09:00 UTC.

[ ] +1  approve
[ ] +0  no opinion
[ ] -1  disapprove (and reason why)

Here is my +1
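
The closing time above follows from the 72-hour minimum voting window. A small sketch of that arithmetic is shown below; the opening time of 2020-07-08 09:00 UTC is an assumption inferred from the stated deadline, and `vote_close` is a hypothetical helper, not part of any release tooling.

```python
from datetime import datetime, timedelta, timezone


def vote_close(opened_at, minimum_hours=72):
    """Return the earliest time a vote may close, given when it opened."""
    return opened_at + timedelta(hours=minimum_hours)


# Assumed opening time: the day this vote email was sent, at 09:00 UTC.
opened = datetime(2020, 7, 8, 9, 0, tzinfo=timezone.utc)
closes = vote_close(opened)
print(closes.isoformat())  # 2020-07-11T09:00:00+00:00
```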

Re: 8.6 release

2020-07-07 Thread Bruno Roustant

Thank you for taking care of the unresolved issues. Now it's clean and I'll
start the RC process today (with the help of the Great Release Wizard).

For those who want to have a look at the draft release notes (or edit them):
Lucene:
https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=158865929&draftShareId=e7da54c5-8e1c-4228-b9f7-ff03ab882080&src=shareui&src.shareui.timestamp=1594107478794
Solr:
https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=158865919&draftShareId=9cd1341c-bf4d-4714-b71b-08a685053f4d&src=shareui&src.shareui.timestamp=1594039624215

On Mon, Jul 6, 2020 at 20:58, Eric Pugh wrote:

> I just resolved SOLR-14422.
>
> On Jul 6, 2020, at 1:36 PM, Tomás Fernández Löbbe 
> wrote:
>
> Just resolved SOLR-14590.
>
> On Mon, Jul 6, 2020 at 4:22 AM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
>
>> I'll take a look today, Bruno. Thanks.
>>
>> On Mon, 6 Jul, 2020, 4:32 pm Bruno Roustant, 
>> wrote:
>>
>>> Hi all,
>>>
>>> 8.6 RC is planned tomorrow but there are still 9 Jira issues unresolved
>>> for 8.6 (+ private ones?)
>>>
>>> Please review and update their status.
>>>
>>> 3 blockers
>>> SOLR-14599 Introduce cluster level plugins through packages
>>> SOLR-14593 Package store API to disable file upload over HTTP
>>> SOLR-14580 CloudSolrClient cannot be initialized using 'zkHosts' builder
>>>
>>> Other
>>> SOLR-14590 Add support for FeatureField in Solr
>>> SOLR-14516 NPE during Realtime GET
>>> SOLR-14422 Solr 8.5 Admin UI shows Angular placeholders on first load /
>>> refresh
>>> SOLR-14398 package store PUT should be idempotent
>>> SOLR-14311 Shared schema should not have access to core level classes
>>> LUCENE-9356 Add tests for corruptions caused by byte flips
>>>
>>> On Sun, Jul 5, 2020 at 08:10, David Smiley wrote:
>>>
>>>> Pertaining to the highlighter performance regression:
>>>> https://issues.apache.org/jira/browse/SOLR-14628
>>>> It's a simple change in a default setting, that is furthermore
>>>> consistent with how the behavior was prior to Solr 8.5
>>>>
>>>> I'm hoping this can make it into the release?  See the PR.
>>>>
>>>> ~ David
>>>>
>>>>
>>>> On Wed, Jun 24, 2020 at 3:05 PM David Smiley 
>>>> wrote:
>>>>
>>>>> Thanks for starting this discussion, Cassandra.
>>>>>
>>>>> I reviewed the issues I was involved with and I don't quite see
>>>>> something worth noting.
>>>>>
>>>>> I plan to add a note about a change in defaults within
>>>>> UnifiedHighlighter that could be a significant perf regression.  This
>>>>> wasn't introduced in 8.6 but introduced in 8.5 and it's significant enough
>>>>> to bring attention to.  I could add it in 8.5's section but then add a
>>>>> short pointer to it in 8.6.
>>>>>
>>>>> ~ David
>>>>>
>>>>>
>>>>> On Wed, Jun 24, 2020 at 2:52 PM Cassandra Targett <
>>>>> casstarg...@gmail.com> wrote:
>>>>>
>>>>>> I started looking at the Ref Guide for 8.6 to get it ready, and
>>>>>> notice there are no Upgrade Notes in `solr-upgrade-notes.adoc` for 8.6. 
>>>>>> Is
>>>>>> it really true that none are needed at all?
>>>>>>
>>>>>> I’ll add what I usually do about new features/changes that maybe
>>>>>> wouldn’t normally make the old Upgrade Notes section, I just find it
>>>>>> surprising that there weren’t any devs who thought any of the 100 or so
>>>>>> Solr changes warrant any user caveats.
>>>>>> On Jun 17, 2020, 12:27 PM -0500, Tomás Fernández Löbbe <
>>>>>> tomasflo...@gmail.com>, wrote:
>>>>>>
>>>>>> +1. Thanks Bruno
>>>>>>
>>>>>> On Wed, Jun 17, 2020 at 6:22 AM Mike Drob  wrote:
>>>>>>
>>>>>>> +1

Re: 8.6 release

2020-07-06 Thread Bruno Roustant

Hi all,

8.6 RC is planned tomorrow but there are still 9 Jira issues unresolved for
8.6 (+ private ones?)

Please review and update their status.

3 blockers
SOLR-14599 Introduce cluster level plugins through packages
SOLR-14593 Package store API to disable file upload over HTTP
SOLR-14580 CloudSolrClient cannot be initialized using 'zkHosts' builder

Other
SOLR-14590 Add support for FeatureField in Solr
SOLR-14516 NPE during Realtime GET
SOLR-14422 Solr 8.5 Admin UI shows Angular placeholders on first load /
refresh
SOLR-14398 package store PUT should be idempotent
SOLR-14311 Shared schema should not have access to core level classes
LUCENE-9356 Add tests for corruptions caused by byte flips

On Sun, Jul 5, 2020 at 08:10, David Smiley wrote:

> Pertaining to the highlighter performance regression:
> https://issues.apache.org/jira/browse/SOLR-14628
> It's a simple change in a default setting, that is furthermore consistent
> with how the behavior was prior to Solr 8.5
>
> I'm hoping this can make it into the release?  See the PR.
>
> ~ David
>
>
> On Wed, Jun 24, 2020 at 3:05 PM David Smiley 
> wrote:
>
>> Thanks for starting this discussion, Cassandra.
>>
>> I reviewed the issues I was involved with and I don't quite see something
>> worth noting.
>>
>> I plan to add a note about a change in defaults within UnifiedHighlighter
>> that could be a significant perf regression.  This wasn't introduced in 8.6
>> but introduced in 8.5 and it's significant enough to bring attention to.  I
>> could add it in 8.5's section but then add a short pointer to it in 8.6.
>>
>> ~ David
>>
>>
>> On Wed, Jun 24, 2020 at 2:52 PM Cassandra Targett 
>> wrote:
>>
>>> I started looking at the Ref Guide for 8.6 to get it ready, and notice
>>> there are no Upgrade Notes in `solr-upgrade-notes.adoc` for 8.6. Is it
>>> really true that none are needed at all?
>>>
>>> I’ll add what I usually do about new features/changes that maybe
>>> wouldn’t normally make the old Upgrade Notes section, I just find it
>>> surprising that there weren’t any devs who thought any of the 100 or so
>>> Solr changes warrant any user caveats.
>>> On Jun 17, 2020, 12:27 PM -0500, Tomás Fernández Löbbe <
>>> tomasflo...@gmail.com>, wrote:
>>>
>>> +1. Thanks Bruno
>>>
>>> On Wed, Jun 17, 2020 at 6:22 AM Mike Drob  wrote:
>>>
>>>> +1
>>>>
>>>> The release wizard python script should be sufficient for everything.
>>>> If you run into any issues with it, let me know, I used it for 8.5.2 and
>>>> think I understand it pretty well.
>>>>
>>>> On Tue, Jun 16, 2020 at 8:31 AM Bruno Roustant <
>>>> bruno.roust...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> It’s been a while since we released Lucene/Solr 8.5.
>>>>> I’d like to volunteer to be a release manager for an 8.6 release. If
>>>>> there's agreement, then I plan to cut the release branch two weeks from today,
>>>>> on June 30th, and then to build the first RC two days later.
>>>>>
>>>>> This will be my first time as release manager so I'll probably need
>>>>> some guidance. Currently I have two resource links on this subject:
>>>>> https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo
>>>>>
>>>>> https://github.com/apache/lucene-solr/tree/master/dev-tools/scripts#releasewizardpy
>>>>> If you have more, please share with me.
>>>>>
>>>>> Bruno
>>>>>
>>>>

Re: 8.6 release

2020-07-02 Thread Bruno Roustant

Here are the draft release notes.
I tried to keep them concise, but please tell me if I missed something
important.

Solr 8.6.0 Release Highlights:

   - Health Check:
   HealthCheckHandler can now require that all cores are healthy before
   returning OK.
   - Zookeeper read API:
   A read API at /api/cluster/zk/* to fetch raw ZK data and view contents
   of a ZK directory.
   - Admin UI:
   New panel with security info in admin UI's dashboard.
   - Streaming Expression:
   Percentiles and standard deviation aggregations added to stats, facet
   and time series.
   Streaming expressions added to /export handler.
   Drill Streaming Expression for efficient and accurate high cardinality
   aggregation.
   - Cross-Collection Join Queries:
   Join queries can now work cross-collection, even when shared or when
   spanning nodes.

Lucene 8.6.0 Release Highlights:

   - API change in:
   SimpleFSDirectory, IndexWriterConfig, MergeScheduler, SortFields,
   SimpleBindings, QueryVisitor, DocValues, CodecUtil.
   - New:
   IndexWriter merge-on-commit feature to selectively merge small segments
   on commit, subject to a configurable timeout, to improve search performance
   by reducing the number of small segments for searching.
   Grouping by range based on DoubleValueSource and LongValueSource.
   - Optimizations:
   BKD trees and index, DoubleValuesSource/QueryValueSource,
   UsageTrackingQueryingCachingPolicy, FST, Geometry queries,
   Points, UniformSplit
   - Others:
   Ukrainian analyzer, checksums verification, resource leaks fixes




On Tue, Jun 30, 2020 at 19:38, Bruno Roustant wrote:

> Erick:
> AFAIK yes, from now on a commit in branch_8x will not go to the 8.6 branch.
>
> On Tue, Jun 30, 2020 at 17:59, Erick Erickson wrote:
>
>> Bruno:
>>
>> Just to double check, anything committed to branch_8x from here on won’t
>> affect the 8.6 release unless explicitly backported, correct?
>>
>> I may be close to upgrading Zookeeper to 3.6.1, and very much do NOT want
>> it in the 8.6 release as it should bake longer than 2 weeks even…
>>
>> Thanks for managing this release!
>>
>> Erick
>>
>> > On Jun 30, 2020, at 11:34 AM, Bruno Roustant 
>> wrote:
>> >
>> > For the RC, I prefer to let the latest and greatest commits bake a
>> week rather than only 2 days. Some of them have important impact and were
>> added very recently.
>> > So I plan a RC on July 7, if the smoke tester is fixed before that time.
>> >
>> > Bruno
>> >
>> > On Tue, Jun 30, 2020 at 16:48, Uwe Schindler wrote:
>> > Hi,
>> >
>> >
>> >
>> > I enabled builds for 8.6 on Policeman Jenkins:
>> >
>> > https://jenkins.thetaphi.de/job/Lucene-Solr-8.6-Linux/
>> >
>> > https://jenkins.thetaphi.de/job/Lucene-Solr-8.6-MacOSX/
>> >
>> > https://jenkins.thetaphi.de/job/Lucene-Solr-8.6-Windows/
>> >
>> >
>> >
>> > Uwe
>> >
>> >
>> >
>> > -
>> >
>> > Uwe Schindler
>> >
>> > Achterdiek 19, D-28357 Bremen
>> >
>> > https://www.thetaphi.de
>> >
>> > eMail: u...@thetaphi.de
>> >
>> >
>> >
>> > From: Bruno Roustant 
>> > Sent: Tuesday, June 30, 2020 3:02 PM
>> > To: dev@lucene.apache.org
>> > Subject: Re: 8.6 release
>> >
>> >
>> >
>> > [new branch]  0a1f68fafd6711304bbd7372567a359bcf36aab4 -> branch_8_6
>> >
>> >
>> > On Tue, Jun 30, 2020 at 14:59, Bruno Roustant wrote:
>> >
>> > I'm creating the branch_8_6 with the release wizard.
>> >
>> >
>> >
>> > On Tue, Jun 30, 2020 at 12:37, Ishan Chattopadhyaya <
>> ichattopadhy...@gmail.com> wrote:
>> >
>> > This is done and merged. Thanks.
>> >
>> >
>> >
>> > On Tue, Jun 30, 2020 at 11:52 AM Ishan Chattopadhyaya <
>> ichattopadhy...@gmail.com> wrote:
>> >
>> > Hi Bruno,
>> >
>> > I'd like to get SOLR-14599 in. It is for the package manager CLI
>> support for cluster level (core container) level plugins. I think it is
>> important to have this in 8.6 for two reasons:
>> >
>> >
>> >
>> > (a) it will unblock Marcus Eagan/Ke Zhenxu who are working on a new
>> Solr UI package and would like to have their package released for early
>> feedback
>> >
>> > (b) Earlier we can release this, more feedback we can get before this
>> is released in 9.0.
>> >
>> > (c) This is an isolated change

Re: 8.6 release

2020-06-30 Thread Bruno Roustant

Erick:
AFAIK yes, from now on a commit in branch_8x will not go to the 8.6 branch.

On Tue, Jun 30, 2020 at 17:59, Erick Erickson wrote:

> Bruno:
>
> Just to double check, anything committed to branch_8x from here on won’t
> affect the 8.6 release unless explicitly backported, correct?
>
> I may be close to upgrading Zookeeper to 3.6.1, and very much do NOT want
> it in the 8.6 release as it should bake longer than 2 weeks even…
>
> Thanks for managing this release!
>
> Erick
>
> > On Jun 30, 2020, at 11:34 AM, Bruno Roustant 
> wrote:
> >
> > For the RC, I prefer to let the latest and greatest commits bake a
> week rather than only 2 days. Some of them have important impact and were
> added very recently.
> > So I plan a RC on July 7, if the smoke tester is fixed before that time.
> >
> > Bruno
> >
> > Le mar. 30 juin 2020 à 16:48, Uwe Schindler  a écrit :
> > Hi,
> >
> >
> >
> > I enabled builds for 8.6 on Policeman Jenkins:
> >
> > https://jenkins.thetaphi.de/job/Lucene-Solr-8.6-Linux/
> >
> > https://jenkins.thetaphi.de/job/Lucene-Solr-8.6-MacOSX/
> >
> > https://jenkins.thetaphi.de/job/Lucene-Solr-8.6-Windows/
> >
> >
> >
> > Uwe
> >
> >
> >
> > -
> >
> > Uwe Schindler
> >
> > Achterdiek 19, D-28357 Bremen
> >
> > https://www.thetaphi.de
> >
> > eMail: u...@thetaphi.de
> >
> >
> >
> > From: Bruno Roustant 
> > Sent: Tuesday, June 30, 2020 3:02 PM
> > To: dev@lucene.apache.org
> > Subject: Re: 8.6 release
> >
> >
> >
> > [new branch]  0a1f68fafd6711304bbd7372567a359bcf36aab4 -> branch_8_6
> >
> >
> > Le mar. 30 juin 2020 à 14:59, Bruno Roustant 
> a écrit :
> >
> > I'm creating the branch_8_6 with the release wizard.
> >
> >
> >
> > Le mar. 30 juin 2020 à 12:37, Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> a écrit :
> >
> > This is done and merged. Thanks.
> >
> >
> >
> > On Tue, Jun 30, 2020 at 11:52 AM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
> >
> > Hi Bruno,
> >
> > I'd like to get SOLR-14599 in. It is for the package manager CLI support
> for cluster-level (core container level) plugins. I think it is important
> to have this in 8.6 for three reasons:
> >
> >
> >
> > (a) It will unblock Marcus Eagan/Ke Zhenxu, who are working on a new Solr
> UI package and would like to have their package released for early feedback.
> >
> > (b) The earlier we can release this, the more feedback we can get before
> this is released in 9.0.
> >
> > (c) This is an isolated change to the CLI for the package manager
> (experimental), so there is very low risk to the stability of the release.
> >
> >
> >
> > I should be done with this issue by eod today. In case you have no
> objection, I would like to merge this issue after you cut the branch today
> (and before you spin the RC).
> >
> >
> >
> > Regards,
> >
> > Ishan
> >
> >
> >
> > On Tue, Jun 30, 2020 at 6:20 AM Joel Bernstein 
> wrote:
> >
> > Hi Bruno,
> >
> >
> >
> > Andrzej and I decided that SOLR-14537 is headed to master to bake for a
> while and won't make it into the 8.6 release. So please feel free to cut
> the branch when ready.
> >
> >
> >
> >
> >
> > Joel Bernstein
> >
> > http://joelsolr.blogspot.com/
> >
> >
> >
> >
> >
> > On Mon, Jun 29, 2020 at 6:13 AM Andrzej Białecki  wrote:
> >
> > I would like to include SOLR-14537 in 8.6 (it’s already tagged), the
> patch is ready and I’m just waiting for Joel to finish performance testing.
> >
> >
> >
> >
> > On 27 Jun 2020, at 04:59, Tomás Fernández Löbbe 
> wrote:
> >
> >
> >
> > I tagged SOLR-14590 for 8.6, The PR is ready for review and I plan to
> merge it soon
> >
> >
> >
> > On Fri, Jun 26, 2020 at 12:54 PM Andrzej Białecki  wrote:
> >
> > Jan,
> >
> >
> >
> > I just removed SOLR-14182 from 8.6, this needs proper back-compat shims
> and testing, and I don’t have enough time to get it done properly for 8.6.
> >
> >
> >
> >
> > On 26 Jun 2020, at 13:37, Jan Høydahl  wrote:
> >
> >
> >
> > Unresolved Solr issues tagged with 8.6:
> >
> >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SOLR%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D

Re: 8.6 release

2020-06-30 Thread Bruno Roustant

For the RC, I prefer to let the latest and greatest commits bake for a week
rather than only 2 days. Some of them have an important impact and were added
very recently.
So I plan an RC on July 7, if the smoke tester is fixed before that time.

Bruno

On Tue, Jun 30, 2020 at 4:48 PM Uwe Schindler wrote:

> Hi,
>
>
>
> I enabled builds for 8.6 on Policeman Jenkins:
>
> https://jenkins.thetaphi.de/job/Lucene-Solr-8.6-Linux/
>
> https://jenkins.thetaphi.de/job/Lucene-Solr-8.6-MacOSX/
>
> https://jenkins.thetaphi.de/job/Lucene-Solr-8.6-Windows/
>
>
>
> Uwe
>
>
>
> -
>
> Uwe Schindler
>
> Achterdiek 19, D-28357 Bremen
>
> https://www.thetaphi.de
>
> eMail: u...@thetaphi.de
>
>
>
> *From:* Bruno Roustant 
> *Sent:* Tuesday, June 30, 2020 3:02 PM
> *To:* dev@lucene.apache.org
> *Subject:* Re: 8.6 release
>
>
>
> [new branch]  0a1f68fafd6711304bbd7372567a359bcf36aab4 -> branch_8_6
>
>
>
> Le mar. 30 juin 2020 à 14:59, Bruno Roustant  a
> écrit :
>
> I'm creating the branch_8_6 with the release wizard.
>
>
>
> Le mar. 30 juin 2020 à 12:37, Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> a écrit :
>
> This is done and merged. Thanks.
>
>
>
> On Tue, Jun 30, 2020 at 11:52 AM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
>
> Hi Bruno,
>
> I'd like to get SOLR-14599 in. It is for the package manager CLI support
> for cluster-level (core container level) plugins. I think it is important
> to have this in 8.6 for three reasons:
>
>
>
> (a) It will unblock Marcus Eagan/Ke Zhenxu, who are working on a new Solr
> UI package and would like to have their package released for early feedback.
>
> (b) The earlier we can release this, the more feedback we can get before
> this is released in 9.0.
>
> (c) This is an isolated change to the CLI for the package manager
> (experimental), so there is very low risk to the stability of the release.
>
>
>
> I should be done with this issue by eod today. In case you have no
> objection, I would like to merge this issue after you cut the branch today
> (and before you spin the RC).
>
>
>
> Regards,
>
> Ishan
>
>
>
> On Tue, Jun 30, 2020 at 6:20 AM Joel Bernstein  wrote:
>
> Hi Bruno,
>
>
>
> Andrzej and I decided that SOLR-14537 is headed to master to bake for a
> while and won't make it into the 8.6 release. So please feel free to cut
> the branch when ready.
>
>
>
>
> Joel Bernstein
>
> http://joelsolr.blogspot.com/
>
>
>
>
>
> On Mon, Jun 29, 2020 at 6:13 AM Andrzej Białecki  wrote:
>
> I would like to include SOLR-14537 in 8.6 (it’s already tagged), the patch
> is ready and I’m just waiting for Joel to finish performance testing.
>
>
>
> On 27 Jun 2020, at 04:59, Tomás Fernández Löbbe 
> wrote:
>
>
>
> I tagged SOLR-14590 for 8.6, The PR is ready for review and I plan to
> merge it soon
>
>
>
> On Fri, Jun 26, 2020 at 12:54 PM Andrzej Białecki  wrote:
>
> Jan,
>
>
>
> I just removed SOLR-14182 from 8.6, this needs proper back-compat shims
> and testing, and I don’t have enough time to get it done properly for 8.6.
>
>
>
> On 26 Jun 2020, at 13:37, Jan Høydahl  wrote:
>
>
>
> Unresolved Solr issues tagged with 8.6:
>
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SOLR%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%208.6
> <https://issues.apache.org/jira/issues/?jql=project%20=%20SOLR%20AND%20resolution%20=%20Unresolved%20AND%20fixVersion%20=%208.6>
>
>
>
>
> SOLR-14593   Package store API to disable file upload over HTTP                   Blocker
> SOLR-14580   CloudSolrClient cannot be initialized using 'zkHosts' builder        Blocker
> SOLR-14516   NPE during Realtime GET                                              Major
> SOLR-14502   increase bin/solr's post kill sleep                                  Minor
> SOLR-14398   package store PUT should be idempotent                               Trivial
> SOLR-14311   Shared schema should not have access to core level classes           Major
> SOLR-14182   Move metric reporters config from solr.xml to ZK cluster properties  Major
> SOLR-14066   Deprecate DIH                                                        Blocker
> SOLR-14022   Deprecate CDCR from Solr in 8.x                                      Blocker
>
>
>
> Plus two private JIRA issues.
>
>
>
> Jan
>
>
>
> 26. jun. 2020 kl. 12:06 skrev Bruno Roustant :
>
>
>
> So the plan is to cut the release branch next Tuesday, June 30th. If you
> anticipate a problem with the date, please reply.
>
>
>
> Is there any JIRA issue that must be committed before the release is made
> and that has not a

New branch and feature freeze for Lucene/Solr 8.6.0

2020-06-30 Thread Bruno Roustant

NOTICE:

Branch branch_8_6 has been cut and versions updated to 8.7 on the stable branch.

Please observe the normal rules:

* No new features may be committed to the branch.

* Documentation patches, build patches and serious bug fixes may be
  committed to the branch. However, you should submit all patches you
  want to commit to Jira first to give others the chance to review
  and possibly vote against the patch. Keep in mind that it is our
  main intention to keep the branch as stable as possible.

* All patches that are intended for the branch should first be committed
  to the unstable branch, merged into the stable branch, and then into
  the current release branch.

* Normal unstable and stable branch development may continue as usual.
  However, if you plan to commit a big change to the unstable branch
  while the branch feature freeze is in effect, think twice: can't the
  addition wait a couple more days? Merges of bug fixes into the branch
  may become more difficult.

* Only Jira issues with Fix version 8.6 and priority "Blocker" will delay
  a release candidate build.
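
As a rough illustration of the "unstable, then stable, then release branch"
rule above, here is a minimal sketch. It assumes the fix is already committed
on master (the unstable branch) and that it is carried to branch_8x and then
branch_8_6 with git cherry-pick; the notice itself only says "merged", so the
use of cherry-pick is an assumption. The function name backport_plan and the
commit hash abc1234 are hypothetical. The script only prints the commands and
does not run git.

```shell
#!/bin/sh
# Sketch only: print the git commands that would carry one fix from the
# unstable branch (master) to the stable branch and then the release branch.
# Assumption: the fix is already committed on master as commit $1.
backport_plan() {
  sha=$1
  # Order matters: stable branch first, release branch last.
  for branch in branch_8x branch_8_6; do
    echo "git checkout $branch"
    # -x records the original commit hash in the new commit message.
    echo "git cherry-pick -x $sha"
  done
}

# "abc1234" is a placeholder, not a real commit.
backport_plan abc1234
```

Printing the plan instead of executing it keeps the sketch safe to run
anywhere; the printed lines can be reviewed and then run by hand.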

Re: 8.6 release

2020-06-30 Thread Bruno Roustant

[new branch]  0a1f68fafd6711304bbd7372567a359bcf36aab4 -> branch_8_6

On Tue, Jun 30, 2020 at 2:59 PM Bruno Roustant wrote:

> I'm creating the branch_8_6 with the release wizard.
>
> Le mar. 30 juin 2020 à 12:37, Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> a écrit :
>
>> This is done and merged. Thanks.
>>
>> On Tue, Jun 30, 2020 at 11:52 AM Ishan Chattopadhyaya <
>> ichattopadhy...@gmail.com> wrote:
>>
>>> Hi Bruno,
>>> I'd like to get SOLR-14599 in. It is for the package manager CLI support
>>> for cluster-level (core container level) plugins. I think it is important
>>> to have this in 8.6 for three reasons:
>>>
>>> (a) It will unblock Marcus Eagan/Ke Zhenxu, who are working on a new Solr
>>> UI package and would like to have their package released for early feedback.
>>> (b) The earlier we can release this, the more feedback we can get before
>>> this is released in 9.0.
>>> (c) This is an isolated change to the CLI for the package manager
>>> (experimental), so there is very low risk to the stability of the release.
>>>
>>> I should be done with this issue by eod today. In case you have no
>>> objection, I would like to merge this issue after you cut the branch today
>>> (and before you spin the RC).
>>>
>>> Regards,
>>> Ishan
>>>
>>> On Tue, Jun 30, 2020 at 6:20 AM Joel Bernstein 
>>> wrote:
>>>
>>>> Hi Bruno,
>>>>
>>>> Andrzej and I decided that SOLR-14537 is headed to master to bake for a
>>>> while and won't make it into the 8.6 release. So please feel free to cut
>>>> the branch when ready.
>>>>
>>>>
>>>> Joel Bernstein
>>>> http://joelsolr.blogspot.com/
>>>>
>>>>
>>>> On Mon, Jun 29, 2020 at 6:13 AM Andrzej Białecki  wrote:
>>>>
>>>>> I would like to include SOLR-14537 in 8.6 (it’s already tagged), the
>>>>> patch is ready and I’m just waiting for Joel to finish performance 
>>>>> testing.
>>>>>
>>>>> On 27 Jun 2020, at 04:59, Tomás Fernández Löbbe 
>>>>> wrote:
>>>>>
>>>>> I tagged SOLR-14590 for 8.6, The PR is ready for review and I plan to
>>>>> merge it soon
>>>>>
>>>>> On Fri, Jun 26, 2020 at 12:54 PM Andrzej Białecki 
>>>>> wrote:
>>>>>
>>>>>> Jan,
>>>>>>
>>>>>> I just removed SOLR-14182 from 8.6, this needs proper back-compat
>>>>>> shims and testing, and I don’t have enough time to get it done properly 
>>>>>> for
>>>>>> 8.6.
>>>>>>
>>>>>> On 26 Jun 2020, at 13:37, Jan Høydahl  wrote:
>>>>>>
>>>>>> Unresolved Solr issues tagged with 8.6:
>>>>>>
>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SOLR%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%208.6
>>>>>> <https://issues.apache.org/jira/issues/?jql=project%20=%20SOLR%20AND%20resolution%20=%20Unresolved%20AND%20fixVersion%20=%208.6>
>>>>>>
>>>>>>
>>>>>> SOLR-14593   Package store API to disable file upload over HTTP                   Blocker
>>>>>> SOLR-14580   CloudSolrClient cannot be initialized using 'zkHosts' builder        Blocker
>>>>>> SOLR-14516   NPE during Realtime GET                                              Major
>>>>>> SOLR-14502   increase bin/solr's post kill sleep                                  Minor
>>>>>> SOLR-14398   package store PUT should be idempotent                               Trivial
>>>>>> SOLR-14311   Shared schema should not have access to core level classes           Major
>>>>>> SOLR-14182   Move metric reporters config from solr.xml to ZK cluster properties  Major
>>>>>> SOLR-14066   Deprecate DIH                                                        Blocker
>>>>>> SOLR-14022   Deprecate CDCR from Solr in 8.x                                      Blocker
>>>>>>
>>>>>> Plus two private JIRA issues.
>>>>>>
>>>>>> Jan
>>>>>>
>>>>>> 26. jun. 2020 kl. 12:06 skrev Bruno Roustant <
>>>>>> bruno.roust...@gmail.com>:
>>>>>>

Re: 8.6 release

2020-06-30 Thread Bruno Roustant

I'm creating the branch_8_6 with the release wizard.

On Tue, Jun 30, 2020 at 12:37 PM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> This is done and merged. Thanks.
>
> On Tue, Jun 30, 2020 at 11:52 AM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
>
>> Hi Bruno,
>> I'd like to get SOLR-14599 in. It is for the package manager CLI support
>> for cluster-level (core container level) plugins. I think it is important
>> to have this in 8.6 for three reasons:
>>
>> (a) It will unblock Marcus Eagan/Ke Zhenxu, who are working on a new Solr
>> UI package and would like to have their package released for early feedback.
>> (b) The earlier we can release this, the more feedback we can get before
>> this is released in 9.0.
>> (c) This is an isolated change to the CLI for the package manager
>> (experimental), so there is very low risk to the stability of the release.
>>
>> I should be done with this issue by eod today. In case you have no
>> objection, I would like to merge this issue after you cut the branch today
>> (and before you spin the RC).
>>
>> Regards,
>> Ishan
>>
>> On Tue, Jun 30, 2020 at 6:20 AM Joel Bernstein 
>> wrote:
>>
>>> Hi Bruno,
>>>
>>> Andrzej and I decided that SOLR-14537 is headed to master to bake for a
>>> while and won't make it into the 8.6 release. So please feel free to cut
>>> the branch when ready.
>>>
>>>
>>> Joel Bernstein
>>> http://joelsolr.blogspot.com/
>>>
>>>
>>> On Mon, Jun 29, 2020 at 6:13 AM Andrzej Białecki  wrote:
>>>
>>>> I would like to include SOLR-14537 in 8.6 (it’s already tagged), the
>>>> patch is ready and I’m just waiting for Joel to finish performance testing.
>>>>
>>>> On 27 Jun 2020, at 04:59, Tomás Fernández Löbbe 
>>>> wrote:
>>>>
>>>> I tagged SOLR-14590 for 8.6, The PR is ready for review and I plan to
>>>> merge it soon
>>>>
>>>> On Fri, Jun 26, 2020 at 12:54 PM Andrzej Białecki 
>>>> wrote:
>>>>
>>>>> Jan,
>>>>>
>>>>> I just removed SOLR-14182 from 8.6, this needs proper back-compat
>>>>> shims and testing, and I don’t have enough time to get it done properly 
>>>>> for
>>>>> 8.6.
>>>>>
>>>>> On 26 Jun 2020, at 13:37, Jan Høydahl  wrote:
>>>>>
>>>>> Unresolved Solr issues tagged with 8.6:
>>>>>
>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SOLR%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%208.6
>>>>> <https://issues.apache.org/jira/issues/?jql=project%20=%20SOLR%20AND%20resolution%20=%20Unresolved%20AND%20fixVersion%20=%208.6>
>>>>>
>>>>>
>>>>> SOLR-14593   Package store API to disable file upload over HTTP                   Blocker
>>>>> SOLR-14580   CloudSolrClient cannot be initialized using 'zkHosts' builder        Blocker
>>>>> SOLR-14516   NPE during Realtime GET                                              Major
>>>>> SOLR-14502   increase bin/solr's post kill sleep                                  Minor
>>>>> SOLR-14398   package store PUT should be idempotent                               Trivial
>>>>> SOLR-14311   Shared schema should not have access to core level classes           Major
>>>>> SOLR-14182   Move metric reporters config from solr.xml to ZK cluster properties  Major
>>>>> SOLR-14066   Deprecate DIH                                                        Blocker
>>>>> SOLR-14022   Deprecate CDCR from Solr in 8.x                                      Blocker
>>>>>
>>>>> Plus two private JIRA issues.
>>>>>
>>>>> Jan
>>>>>
>>>>> On 26 Jun 2020, at 12:06, Bruno Roustant wrote:
>>>>>
>>>>> So the plan is to cut the release branch next Tuesday, June 30th. If
>>>>> you anticipate a problem with the date, please reply.
>>>>>
>>>>> Is there any JIRA issue that must be committed before the release is
>>>>> made and that does not yet have the appropriate "Fix Version"?
>>>>>
>>>>> Currently there are 3 unresolved issues flagged as Fix Version = 8.6:
>>>>> Add tests for corruptions caused by byte flips LUCENE-9356
>>>>> <https://iss

Re: PGP key to sign the 8.6 branch

2020-06-30 Thread Bruno Roustant

I uploaded my key (6AD29C0A) to keyserver.ubuntu.com, pgp.surfnet.nl and
hkps.pool.sks-keyservers.net and it can be retrieved now:

gpg --verbose --keyserver keyserver.ubuntu.com --recv-keys 6AD29C0A
gpg: data source: http://162.213.33.9:11371
gpg: pub  rsa4096/377C3BA26AD29C0A 2020-06-26  Bruno Roustant <broust...@apache.org>
gpg: key 377C3BA26AD29C0A: "Bruno Roustant " not changed
gpg: Total number processed: 1
gpg:  unchanged: 1


I still cannot connect to pgp.mit.edu
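
For anyone else hitting the same keyserver trouble, here is a minimal sketch
of the two things that came up in this thread: the short key ID (6AD29C0A) is
the last 8 hex digits of the long one (377C3BA26AD29C0A), and a key can be
requested from several keyservers in turn until one answers. The function
names short_key_id and fetch_key are hypothetical; only the gpg options
--keyserver and --recv-keys, already used above, are assumed.

```shell
#!/bin/sh
# Sketch only. Print the short key ID: the last 8 hex digits of a long
# key ID, with an optional leading "0x" removed first.
short_key_id() {
  printf '%s' "${1#0x}" | tail -c 8
}

# Try each keyserver in turn and stop at the first one that returns the key.
# Usage: fetch_key KEY_ID SERVER [SERVER...]
fetch_key() {
  key_id=$1
  shift
  for server in "$@"; do
    if gpg --keyserver "$server" --recv-keys "$key_id"; then
      echo "fetched $key_id from $server"
      return 0
    fi
  done
  echo "could not fetch $key_id from any keyserver" >&2
  return 1
}

# Prints 6AD29C0A
short_key_id 0x377C3BA26AD29C0A
echo

# Example call, commented out because it needs gpg and network access:
# fetch_key 377C3BA26AD29C0A keyserver.ubuntu.com pgp.surfnet.nl pgp.mit.edu
```

Listing the servers as arguments keeps the order of preference visible at the
call site, so a slow or unreachable server can simply be moved to the end.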

On Mon, Jun 29, 2020 at 4:52 PM David Smiley wrote:

> I've been trying to get Bruno's key and have had great difficulty.
> I can find his key with the web interface:
> https://pgp.mit.edu/pks/lookup?search=broustant%40apache.org=vindex
>
> But at the CLI I can't find it:
>
> This fails:
>
> gpg --keyserver pgp.mit.edu --search-keys broust...@apache.org
>
> gpg: searching for "broust...@apache.org" from hkp server pgp.mit.edu
>
> gpg: key "broust...@apache.org" not found on keyserver
>
> And so does:
>
> gpg --keyserver pgp.mit.edu -v --recv-keys 0x377C3BA26AD29C0A
>
> gpg: requesting key 6AD29C0A from hkp server pgp.mit.edu
>
> gpg: keyserver timed out
>
> gpg: keyserver receive failed: Keyserver error
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Mon, Jun 29, 2020 at 10:33 AM Robert Muir  wrote:
>
>> I have had problems with gpg for the last few hours too. pgp.mit.edu has
>> been slow/not working even for my own key.
>> But if I use an alternative server it works better.
>>
>> May not help you, as your key (6AD29C0A?) doesn't seem to exist on any of
>> the other servers yet.
>>
>> $ gpg --verbose --keyserver pgp.mit.edu --recv-keys 322D7ECA
>> gpg: keyserver receive failed: No keyserver available
>> $ gpg --verbose --keyserver keyserver.ubuntu.com --recv-keys 322D7ECA
>> gpg: data source: http://162.213.33.9:11371
>> gpg: key 817AE1DD322D7ECA: number of dropped non-self-signatures: 6
>> gpg: pub  rsa4096/817AE1DD322D7ECA 2009-11-05  Robert Muir (Code Signing
>> Key) 
>> gpg: key 817AE1DD322D7ECA: "Robert Muir (Code Signing Key) <
>> rm...@apache.org>" not changed
>> gpg: Total number processed: 1
>> gpg:  unchanged: 1
>>
>> On Mon, Jun 29, 2020 at 10:09 AM Bruno Roustant 
>> wrote:
>>
>>> Hi
>>>
>>> I've been reading the PGP/GPG key part of the ReleaseTodo doc.
>>> https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo
>>> I created a 4K-bit key (with my apache.org email) and I uploaded it to
>>> the MIT key server pgp.mit.edu last Thursday.
>>>
>>> But there is a line in the doc that says my key should be signed by
>>> another committer. I asked David Smiley, but it seems he is having
>>> difficulty retrieving my key from the server.
>>> Could someone help us understand the issue?
>>>
>>> Thanks!
>>>
>>> Bruno
>>>
>>

PGP key to sign the 8.6 branch

2020-06-29 Thread Bruno Roustant

Hi

I've been reading the PGP/GPG key part of the ReleaseTodo doc.
https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo
I created a 4K-bit key (with my apache.org email) and I uploaded it to the MIT
key server pgp.mit.edu last Thursday.

But there is a line in the doc that says my key should be signed by another
committer. I asked David Smiley, but it seems he is having difficulty
retrieving my key from the server.
Could someone help us understand the issue?

Thanks!

Bruno

Re: 8.6 release

2020-06-26 Thread Bruno Roustant

So the plan is to cut the release branch next Tuesday, June 30th. If you
anticipate a problem with the date, please reply.

Is there any JIRA issue that must be committed before the release is made
and that does not yet have the appropriate "Fix Version"?

Currently there are 3 unresolved issues flagged as Fix Version = 8.6:
* LUCENE-9356: Add tests for corruptions caused by byte flips
  <https://issues.apache.org/jira/browse/LUCENE-9356>
* LUCENE-9191: Fix linefiledocs compression or replace in tests
  <https://issues.apache.org/jira/browse/LUCENE-9191>
* LUCENE-8962: Can we merge small segments during refresh, for faster searching?
  <https://issues.apache.org/jira/browse/LUCENE-8962>


On Wed, Jun 24, 2020 at 9:05 PM David Smiley wrote:

> Thanks for starting this discussion, Cassandra.
>
> I reviewed the issues I was involved with and I don't quite see something
> worth noting.
>
> I plan to add a note about a change in defaults within UnifiedHighlighter
> that could be a significant perf regression.  This wasn't introduced in 8.6
> but introduced in 8.5 and it's significant enough to bring attention to.  I
> could add it in 8.5's section but then add a short pointer to it in 8.6.
>
> ~ David
>
>
> On Wed, Jun 24, 2020 at 2:52 PM Cassandra Targett 
> wrote:
>
>> I started looking at the Ref Guide for 8.6 to get it ready, and noticed
>> there are no Upgrade Notes in `solr-upgrade-notes.adoc` for 8.6. Is it
>> really true that none are needed at all?
>>
>> I’ll add what I usually do about new features/changes that maybe wouldn’t
>> normally make the old Upgrade Notes section; I just find it surprising that
>> there weren’t any devs who thought any of the 100 or so Solr changes
>> warranted any user caveats.
>> On Jun 17, 2020, 12:27 PM -0500, Tomás Fernández Löbbe <
>> tomasflo...@gmail.com>, wrote:
>>
>> +1. Thanks Bruno
>>
>> On Wed, Jun 17, 2020 at 6:22 AM Mike Drob  wrote:
>>
>>> +1
>>>
>>> The release wizard python script should be sufficient for everything. If
>>> you run into any issues with it, let me know, I used it for 8.5.2 and think
>>> I understand it pretty well.
>>>
>>> On Tue, Jun 16, 2020 at 8:31 AM Bruno Roustant 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> It’s been a while since we released Lucene/Solr 8.5.
>>>> I’d like to volunteer to be a release manager for an 8.6 release. If
>>>> there's agreement, then I plan to cut the release branch two weeks from today,
>>>> on June 30th, and then to build the first RC two days later.
>>>>
>>>> This will be my first time as release manager so I'll probably need
>>>> some guidance. Currently I have two resource links on this subject:
>>>> https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo
>>>>
>>>> https://github.com/apache/lucene-solr/tree/master/dev-tools/scripts#releasewizardpy
>>>> If you have more, please share with me.
>>>>
>>>> Bruno
>>>>
>>>

Re: Welcome Ilan Ginzburg as Lucene/Solr committer

2020-06-21 Thread Bruno Roustant

Congrats Ilan!

On Sun, Jun 21, 2020 at 5:10 PM Yonik Seeley wrote:

> Congrats Ilan!
> -Yonik
>
>
> On Sun, Jun 21, 2020 at 5:44 AM Noble Paul  wrote:
>
>> Hi all,
>>
>> Please join me in welcoming Ilan Ginzburg as the latest Lucene/Solr
>> committer.
>> Ilan, it's tradition for you to introduce yourself with a brief bio.
>>
>> Congratulations and Welcome!
>> Noble
>>
>

Re: [VOTE] Lucene logo contest

2020-06-16 Thread Bruno Roustant

C - current logo
not PMC

On Tue, Jun 16, 2020 at 9:38 PM Erik Hatcher wrote:

> C - current logo
>
> On Jun 15, 2020, at 6:08 PM, Ryan Ernst  wrote:
>
> Dear Lucene and Solr developers!
>
> In February a contest was started to design a new logo for Lucene [1].
> That contest concluded, and I am now (admittedly a little late!) calling a
> vote.
>
> The entries are labeled as follows:
>
> A. Submitted by Dustin Haver [2]
>
> B. Submitted by Stamatis Zampetakis [3] Note that this has several
> variants. Within the linked entry there are 7 patterns and 7 color
> palettes. Any vote for B should contain the pattern number, like B1 or B3.
> If a B variant wins, we will have a followup vote on the color palette.
>
> C. The current Lucene logo [4]
>
> Please vote for one of the three (or nine depending on your perspective!)
> above choices. Note that anyone in the Lucene+Solr community is invited to
> express their opinion, though only Lucene+Solr PMC cast binding votes
> (indicate non-binding votes in your reply, please). This vote will close
> one week from today, Mon, June 22, 2020.
>
> Thanks!
>
> [1] https://issues.apache.org/jira/browse/LUCENE-9221
> [2]
> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
> [3]
> https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
> [4]
> https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png
>
>
>

Re: Please look at and comment on SOLR-11973 (fail compilation on warnings)

2020-06-16 Thread Bruno Roustant

+1

On Tue, Jun 16, 2020 at 8:23 AM David Smiley wrote:

> +1 thanks
> ~ David
>
>
> On Fri, Jun 12, 2020 at 10:11 AM Erick Erickson 
> wrote:
>
>> Short form:
>>
>> In a week or so, I propose to start failing compilations on master for
>> compiler warnings (exclusive of deprecations). If you have a problem with
>> that, speak up or hold your peace ;)
>>
>> Erick
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>

Re: 8.6 release

2020-06-16 Thread Bruno Roustant

Atri, yes sure. More eyes to verify the steps.

On Tue, Jun 16, 2020 at 4:38 PM Atri Sharma wrote:

> Bruno,
>
> If you would want, I am willing to help you out in doing the 8.6 release.
> Will help learn the process as well.
>
> On Tue, 16 Jun 2020 at 19:01, Bruno Roustant 
> wrote:
>
>> Hi all,
>>
>> It’s been a while since we released Lucene/Solr 8.5.
>> I’d like to volunteer to be a release manager for an 8.6 release. If
>> there's agreement, then I plan to cut the release branch two weeks from today,
>> on June 30th, and then to build the first RC two days later.
>>
>> This will be my first time as release manager so I'll probably need some
>> guidance. Currently I have two resource links on this subject:
>> https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo
>>
>> https://github.com/apache/lucene-solr/tree/master/dev-tools/scripts#releasewizardpy
>> If you have more, please share with me.
>>
>> Bruno
>>
> --
> Regards,
>
> Atri
> Apache Concerted
>

8.6 release

2020-06-16 Thread Bruno Roustant

Hi all,

It’s been a while since we released Lucene/Solr 8.5.
I’d like to volunteer to be a release manager for an 8.6 release. If
there's agreement, then I plan to cut the release branch two weeks from today,
on June 30th, and then to build the first RC two days later.

This will be my first time as release manager so I'll probably need some
guidance. Currently I have two resource links on this subject:
https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo
https://github.com/apache/lucene-solr/tree/master/dev-tools/scripts#releasewizardpy
If you have more, please share with me.

Bruno

Re: Welcome Mayya Sharipova as Lucene/Solr committer

2020-06-09 Thread Bruno Roustant

Welcome Mayya, congratulations!

On Tue, Jun 9, 2020 at 9:10 AM Tomoko Uchida wrote:

> Hello Mayya,
> congratulations and welcome!
>
> Tomoko
>
>
> 2020年6月9日(火) 15:59 Adrien Grand :
>
>> Welcome, Mayya!
>>
>> On Mon, Jun 8, 2020 at 6:58 PM jim ferenczi  wrote:
>>
>>> Hi all,
>>>
>>> Please join me in welcoming Mayya Sharipova as the latest Lucene/Solr
>>> committer.
>>> Mayya, it's tradition for you to introduce yourself with a brief bio.
>>>
>>> Congratulations and Welcome!
>>>
>>> Jim
>>>
>>
>>
>> --
>> Adrien
>>
>

Re: Welcome Alessandro Benedetti as a Lucene/Solr committer

2020-03-30 Thread Bruno Roustant

Welcome Alessandro!

On Fri, Mar 27, 2020 at 5:02 PM Christine Poerschke (BLOOMBERG/ LONDON) <
cpoersc...@bloomberg.net> wrote:

> Welcome Alessandro!
>
> Christine
>
> From: dev@lucene.apache.org At: 03/18/20 19:25:48
> Cc: dev@lucene.apache.org
> Subject: Re: Welcome Alessandro Benedetti as a Lucene/Solr committer
>
> Thanks everyone for the warm welcome!
> I already know most of you but for all the others here's my brief bio :)
>
> I am Italian (possibly the only other Italian here besides Tommaso) and
> I have been living in the UK for the last 7 years.
> I am currently based in London.
> I started working with Apache Solr back in 2010 (and a few months later
> with Apache Lucene), my first project was a search API that translated the
> Verity query language to Lucene syntax, at the time I was a junior software
> engineer with a background in Information Retrieval research at Roma3
> university.
> Since then I have explored a lot of different use cases for Apache
> Lucene/Solr and I spent more and more time studying and working with the
> internals, across various companies and positions.
> My favourite projects in my career have been the design and implementation
> of a Semantic Search engine called Sensify (when I was working in a small
> and cohesive R&D team in Zaizi, with Spanish friends and colleagues from
> Seville), the Apache Solr Learning To Rank plugin from Bloomberg (and
> integrations/applications) and the Rated Ranking Evaluator project (an Open
> Source library for Search Quality Evaluation we contributed back to the
> community).
> In 2016 I founded my own company, Sease where we try to build a bridge
> between Academia and the industry through Open Source software in the
> domain of Information Retrieval.
>
> As David mentioned my main areas of contribution in Apache Lucene/Solr
> have been the More Like This, the Learning To Rank plugin, Synonyms
> expansion and the Suggester component.
> I have a lot of ideas on my to-do list, so stay tuned: we'll have a lot to
> discuss and innovate!
>
> It is a pleasure to join this group and I am sure we'll do great things
> together :)
>
> Cheers
>
>
> --
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> www.sease.io
>
>
> On Wed, 18 Mar 2020 at 13:00, David Smiley 
> wrote:
>
>> Hi all,
>>
>> Please join me in welcoming Alessandro Benedetti as the latest
>> Lucene/Solr committer!
>>
>> Alessandro has been contributing to Lucene and Solr in areas such as More
>> Like This, Synonym boosting, and Suggesters, and other areas for years.
>> Furthermore he's been a help to many users on the solr-user mailing list
>> and has helped others through his blog posts and presentations about
>> search.  We look forward to his future contributions.
>>
>> Congratulations and welcome!  It is a tradition to introduce yourself
>> with a brief bio, Alessandro.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>
>

Re: CHANGES.txt and issue categorization

2020-03-05 Thread Bruno Roustant

+1 to move these entries. And I agree with the categories definitions.

On Wed, Mar 4, 2020 at 10:24 AM Adrien Grand wrote:

> +1 to move these entries.
>
> On Wed, Mar 4, 2020 at 4:27 AM David Smiley 
> wrote:
>
>> I'll simply move these items around tomorrow this time, unless I hear
>> feedback to the contrary.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Mon, Mar 2, 2020 at 1:07 PM David Smiley 
>> wrote:
>>
>>> I'd like us to reflect on how we categorize issues in CHANGES.txt.  We
>>> have these categories:
>>> (Lucene) 'API Changes', 'New Features', 'Improvements', 'Optimizations',
>>> 'Bug Fixes', 'Other'
>>> (Solr) 'New Features', 'Improvements', 'Optimizations', 'Bug Fixes',
>>> 'Other Changes'
>>> (I lifted these from dev-tools/scripts/addVersion.py line 215)
>>>
>>> In particular, I'm often surprised at how some of us categorize New
>>> Features or Improvements that should better be categorized as something
>>> else.  I think the root cause of these problems may be that we don't have
>>> JIRA categories that directly align.  Furthermore, our dev practices will
>>> typically result in a CHANGES.txt being added out of band from the
>>> code-review process, and thus no peer-review on ideal placement.
>>> Furthermore the message itself is often not code reviewed but should be.
>>> Perhaps we can simply get in the habit of adding a JIRA comment (or GH code
>>> review) what we propose the category & issue summary should be.
>>>
>>> Here is my attempt at a definition for _some_ of these categories.  I
>>> don't pretend to think we all agree 100% but it's up for discussion:
>>> 
>>> * New Features:  A user-visible new capability.  Usually opt-in.
>>>
>>> * Improvements:  A user-visible improvement to an existing capability
>>> that somehow expands its ability or improves its behavior.  Not a
>>> refactoring, not an optimization.
>>>
>>> * Optimizations: Something is now more efficient.  Usually automatic
>>> (not opt-in).
>>>
>>> * Other:  Anything else: Refactorings, tests, build, docs, etc.  And
>>> adding log statements.
>>> 
>>>
>>> I recommend the following changes to Lucene 8.5:
>>>
>>> These are "Improvements" that I think are better categorized as
>>> "Optimizations"
>>> * LUCENE-9211: Add compression for Binary doc value fields. (Mark
>>> Harwood)
>>> * LUCENE-4702: Better compression of terms dictionaries. (Adrien Grand)
>>> * LUCENE-9228: Sort dvUpdates in the term order before applying if they
>>> all update a
>>>   single field to the same value. This optimization can reduce the flush
>>> time by around
>>>   20% for the docValues update use cases. (Nhat Nguyen, Adrien Grand,
>>> Simon Willnauer)
>>> * LUCENE-9245: Reduce AutomatonTermsEnum memory usage. (Bruno Roustant,
>>> Robert Muir)
>>> * LUCENE-9237: Faster UniformSplit intersect TermsEnum. (Bruno Roustant)
>>>
>>> These "Improvements" I think are better categorized as "Other":
>>> * LUCENE-9109: Backport some changes from master (except StackWalker) to
>>> improve
>>>   TestSecurityManager (Uwe Schindler)
>>> * LUCENE-9110: Backport refactored stack analysis in tests to use
>>> generalized
>>>   LuceneTestCase methods (Uwe Schindler)
>>> * LUCENE-9141: Simplify LatLonShapeXQuery API by adding a new abstract
>>> class called LatLonGeometry. Queries are
>>>   executed with input objects that extend such interface. (Ignacio Vera)
>>> * LUCENE-9194: Simplify XYShapeXQuery API by adding a new abstract class
>>> called XYGeometry. Queries are
>>>   executed with input objects that extend such interface. (Ignacio Vera)
>>>
>>> Maybe this "Other" item should be  "Optimization"? (not sure):
>>> * LUCENE-9068: FuzzyQuery builds its Automaton up-front (Alan Woodward,
>>> Mike Drob)
>>>
>>> Solr:
>>>
>>> "New Features" that maybe should be "Improvements":
>>>  * SOLR-13892: New "top-level" docValues join implementation (Jason
>>> Gerlowski, Joel Bernstein)
>>>  * SOLR-14242: HdfsDirectory now supports indexing geo-points, ranges or
>>> shapes. (Adrien Grand)
>>>
>>> "Improvements" that maybe should be "Optimizations":
>>> * SOLR-13808: filter in BoolQParser and {"bool":{"filter":..}} in Query
>>> DSL are cached by default (Mikhail Khludnev)
>>>
>>> "Improvements" that maybe should be "Other":
>>> * SOLR-14114: Add WARN to Solr log that embedded ZK is not supported in
>>> production (janhoy)
>>>
>>> Thoughts?
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>
>
> --
> Adrien
>

Re: Commit / Code Review Policy

2019-11-29 Thread Bruno Roustant

I like this new version. It clarifies the review, commit and CHANGES
practices. As a beginner in this process, I find it helpful.

I appreciate the idea of having a "risk" section that lists risky areas and
says a few words about each, so that contributors can announce in reviews
when their change might impact one of them.

Le ven. 29 nov. 2019 à 16:06, David Smiley  a
écrit :

> The commit policy / guideline document is basically 95% there and I don't
> want to wait longer to get input.
> https://cwiki.apache.org/confluence/display/LUCENE/Commit+Policy+-+DRAFT
>
> If you log-in, you can comment on the document in-line as Jan has already
> done.  Such feedback is good for details.  For more substantive or high
> level feedback, this email thread probably makes more sense.
>
> The policy/guideline document insists on reviews but gives broad
> exceptions and defines a very low bar for reviews -- basically
> mere "approval" from *anyone*, who need not even have looked at the
> code.  Yet this is a higher bar than today.
>
> Also, I hope this is not controversial but I want the same definition of
> minor/trivial matters to be used for (A) when a JIRA issue is not needed
> and (B) not bothering with a CHANGES.txt entry.  I observe that
> today we seemingly have a JIRA issue for *everything*; I find that
> onerous, and it is yet another barrier for contributors of such small matters.
> For example https://issues.apache.org/jira/browse/SOLR-13926 which just
> adds javadocs.  Also I think we add too many items to CHANGES.txt... lots
> of people read this and it's a collective waste of our time IMO to mention
> that some test was fixed.
>
> All feedback is very welcome!
>
> ~ David
>

Re: Welcome Bruno Roustant as Lucene/Solr committer

2019-11-23 Thread Bruno Roustant

Thank you all for this warm welcome.

A couple of words about me:
I live in France with my family, in Grenoble near the Alps. Since my very
first work experience (2000), I have been developing around Search and
Java: federated search, result clustering, more like this, etc. When I
joined Salesforce in 2013, I had the opportunity to start using
Lucene-Solr, exploring the beast as a user at the beginning, and then
developing plugins. Since 2016 we have had stronger performance goals, for
lots of fields and security constraints, and this again gave me the
opportunity to dive into the mechanics, the APIs, the queries. What a nice
surprise to discover this strong project with refined interfaces, carefully
crafted algorithms, and a high test quality bar! Soon another opportunity
made me work with David Smiley, who convincingly advocated the benefits of
open-sourcing. It was a succession of opportunities that led me here,
becoming a committer, part of this project!
I like data structure and algorithm challenges. I have experience in
performance/memory optimization. I like team work and I’m discovering the
open-source power :)

Thanks again, I can’t wait to contribute more with you.

Le sam. 23 nov. 2019 à 21:51, Rahul Yadav  a écrit :

> Apologies, sure, will do.
>
> Regards
> Rahul
>
> On Sat, Nov 23, 2019 at 8:46 PM David Smiley 
> wrote:
>
>> Please start a new thread instead of replying to this thread.
>>
>> On Sat, Nov 23, 2019 at 3:14 PM Rahul Yadav  wrote:
>>
>>> Hi All,
>>>
>>> I have just joined here (new dev).
>>> I have had some experience in Information Retrieval and Search and would
>>> like to contribute to the community.
>>> As I am just setting up, it would be helpful if there is any
>>> starter-level task/bug that I can work on.
>>>
>>>
>>> Regards
>>> Rahul
>>> https://www.linkedin.com/in/rahul-y-26b6b1142/
>>>
>>> On Sat, Nov 23, 2019 at 6:18 PM Namgyu Kim  wrote:
>>>
>>>> Congratulations and welcome, Bruno! :D
>>>>
>>>> On Sun, Nov 24, 2019 at 2:16 AM Ishan Chattopadhyaya <
>>>> ichattopadhy...@gmail.com> wrote:
>>>>
>>>>> Welcome Bruno!
>>>>>
>>>>> On Sat, 23 Nov, 2019, 10:35 PM David Smiley, 
>>>>> wrote:
>>>>>
>>>>>> Congratulations and welcome Bruno!  We always need more eyes on the
>>>>>> low level Lucene bits.
>>>>>>
>>>>>> ~ David Smiley
>>>>>> Apache Lucene/Solr Search Developer
>>>>>> http://www.linkedin.com/in/davidwsmiley
>>>>>>
>>>>>>
>>>>>> On Sat, Nov 23, 2019 at 3:29 AM Adrien Grand 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Please join me in welcoming Bruno Roustant as the latest Lucene/Solr
>>>>>>> committer!
>>>>>>>
>>>>>>> It didn't take many JIRA issues for Bruno to demonstrate good
>>>>>>> understanding of the lower level bits of Lucene by writing a new
>>>>>>> postings format and more recently exploring ideas that ended up
>>>>>>> speeding up FSTs while decreasing their memory usage at the same
>>>>>>> time.
>>>>>>> We are very happy that Bruno accepted the PMC's invitation to join.
>>>>>>>
>>>>>>> Congratulations and welcome, Bruno! It's a tradition to introduce
>>>>>>> yourself with a brief bio.
>>>>>>>
>>>>>>> --
>>>>>>> Adrien
>>>>>>>
>>>>>>> -
>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>>>>
>>>>>>> --
>> Sent from Gmail Mobile
>>
>

[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-06 Thread Bruno Roustant (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924178#comment-16924178
 ] 

Bruno Roustant edited comment on LUCENE-8920 at 9/6/19 11:57 AM:
-

I'd love to work on that, but I'm pretty busy so I can't start immediately. If 
you can start on it soon I'll be happy to help and review.

I'll try to think more about the subject. Where should I post my remarks/ideas? 
Here in the thread or in an attached doc?

Some additional thoughts:
 * Find a threshold T1 to decide when direct-addressing is best (N / (max 
label - min label) >= T1). E.g. with T1 = 50% the worst case is 2x memory, right? 
(although there is the var-length encoding difference...). Did you try that? 
What is the perf?
 * Find a threshold T2 to decide whether a list is better (N < T2) or whether 
open-addressing is more appropriate.
 * If N is close to 2^p, the probability that open-addressing aborts (can't 
store a label in fewer than L tries) is high. Do we double the array size 
(2^(p+1)) or can we take 1.5x2^p to save memory? (my intuition is the second, 
but it needs some testing of the load factor)
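
To make the sizing question in the last bullet concrete, here is a small arithmetic sketch. The class and method names are hypothetical (nothing here is Lucene API) and N = 60 is an arbitrary example value. It only computes the load factor for the three candidate table sizes; it says nothing about the actual abort probability, which would have to be measured.

```java
/** Hypothetical helper, not part of Lucene: compares candidate open-addressing table sizes. */
final class TableSizingSketch {

  /** Smallest power of two that is >= n, for n >= 1. */
  static int nextPowerOfTwo(int n) {
    int size = 1;
    while (size < n) {
      size <<= 1;
    }
    return size;
  }

  /** Fraction of slots occupied when numLabels labels are stored in tableSize slots. */
  static double loadFactor(int numLabels, int tableSize) {
    return (double) numLabels / tableSize;
  }

  public static void main(String[] args) {
    int numLabels = 60; // example value, close to 2^6 = 64
    int base = nextPowerOfTwo(numLabels);
    int[] candidates = {base, base + base / 2, base * 2}; // 2^p, 1.5 x 2^p, 2^(p+1)
    for (int size : candidates) {
      System.out.println(size + " slots -> load factor " + loadFactor(numLabels, size));
    }
  }
}
```

For N = 60 the candidates are 64, 96 and 128 slots, i.e. load factors of roughly 0.94, 0.63 and 0.47. Note that a 1.5x2^p table is not a power of two, so slot selection would need a modulo instead of a bit mask.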



> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes range) 
> which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, i.e. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-06 Thread Bruno Roustant (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924178#comment-16924178
 ] 

Bruno Roustant commented on LUCENE-8920:


I'd love to work on that, but I'm pretty busy so I can't start immediately. If 
you can start on it soon I'll be happy to help and review.

I'll try to think more about the subject. Where should I post my remarks/ideas? 
Here in the thread or in an attached doc?

Some additional thoughts:
 * Find a threshold T1 to decide when direct-addressing is best (N / (max 
label - min label) >= T1). E.g. with T1 = 50% the worst case is 2x memory, right? 
(although there is the var-length encoding difference...). Did you try that? 
What is the perf?
 * Find a threshold T2 to decide whether a list is better (N < T2) or whether 
open-addressing is more appropriate.
 * If N is close to 2^p, the probability that open-addressing aborts (can't 
store a label in fewer than L tries) is high. Do we double the array size 
(2^(p+1)) or can we take 1.5x2^p to save memory? (my intuition is the second, 
but it needs some testing of the load factor)
 * I think the var-length List and fixed-length Binary-Search options could be 
merged to always have a var-length List that can be binary searched with low 
impact on perf. This is a piece of work in itself, but it can help reduce the 
FST memory and thus free some bytes for the faster options.


[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-05 Thread Bruno Roustant (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923428#comment-16923428
 ] 

Bruno Roustant edited comment on LUCENE-8920 at 9/5/19 8:46 PM:


[~sokolov] There may be another option to speed up FST arc lookup while 
limiting the memory increase.

The Direct-Addressing option looks up a label with a single direct access, and 
costs up to (num labels x 4 x num bytes to encode) bytes.

The Label-List option is the opposite: a lookup needs N/2 label comparisons on 
average, and costs (num labels x var bytes to encode) bytes.

Another option is to use open-addressing. A lookup would take <= L comparisons, 
where we can fix L < log(N)/2 (to be faster than binary search), and would cost 
< (num labels x 2 x num bytes to encode) bytes.

The idea is to have an array of size 2^p such that 2^(p-1) < N < 2^p. We hash the 
labels and store them in the array using the open-addressing idea: if a slot is 
occupied, try the next one. If we can’t store a label in fewer than L tries, 
then abort and fall back to the Label-List or Binary-Search option. At lookup 
we hash the input label and know that we need fewer than L tries to compare.

This is another speed/memory compromise: faster than binary search (constant 
L), with at least 2x less memory than Direct-Addressing.

On the Binary-Search side, it could be possible to support variable-length 
encoding, by finding the first byte starting a label based on the bit used to 
encode the var-length additional bytes.
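
The open-addressing idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the FST implementation: the class name, the hash function, linear probing, and the use of -1 as the empty marker (so labels are assumed non-negative and distinct) are all hypothetical choices, and the table stores bare labels where a real encoding would store arc data.

```java
import java.util.Arrays;

/** Hypothetical sketch, not Lucene code: labels hashed into a 2^p array with a bounded number of probes. */
final class BoundedProbeLabelTable {
  private static final int EMPTY = -1; // assumes labels are non-negative

  private final int[] slots;
  private final int mask;
  private final int maxTries;

  private BoundedProbeLabelTable(int[] slots, int maxTries) {
    this.slots = slots;
    this.mask = slots.length - 1;
    this.maxTries = maxTries;
  }

  /** Builds the table, or returns null if a label needs more than maxTries probes (caller falls back). */
  static BoundedProbeLabelTable tryBuild(int[] labels, int maxTries) {
    int size = 1;
    while (size <= labels.length) {
      size <<= 1; // smallest power of two strictly greater than N
    }
    int[] slots = new int[size];
    Arrays.fill(slots, EMPTY);
    for (int label : labels) {
      boolean placed = false;
      for (int i = 0; i < maxTries && !placed; i++) {
        int slot = (hash(label) + i) & (size - 1);
        if (slots[slot] == EMPTY) {
          slots[slot] = label;
          placed = true;
        }
      }
      if (!placed) {
        return null; // abort: use the list or binary-search encoding instead
      }
    }
    return new BoundedProbeLabelTable(slots, maxTries);
  }

  /** Returns the slot holding the label, or -1 if absent, after at most maxTries comparisons. */
  int find(int label) {
    for (int i = 0; i < maxTries; i++) {
      int slot = (hash(label) + i) & mask;
      if (slots[slot] == label) {
        return slot;
      }
      if (slots[slot] == EMPTY) {
        return -1; // an empty slot ends the probe sequence
      }
    }
    return -1;
  }

  private static int hash(int label) {
    int h = label * 0x9E3779B9; // arbitrary multiplicative mix
    return h ^ (h >>> 16);
  }

  public static void main(String[] args) {
    int[] labels = {3, 17, 42, 99};
    BoundedProbeLabelTable table = tryBuild(labels, 4);
    System.out.println(table == null ? "fall back to list or binary search" : "slot of 42: " + table.find(42));
  }
}
```

The design point is that the probe bound is fixed at build time: a node that cannot respect it is simply encoded another way, so lookups never pay for long collision chains.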




[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-05 Thread Bruno Roustant (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923463#comment-16923463
 ] 

Bruno Roustant commented on LUCENE-8920:


In some cases, selected by heuristics, Direct-Addressing is the right choice, 
for example when num labels / (max label - min label) >= 75%.
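
As a minimal sketch of that density test: the class and method names below are hypothetical, and treating a node whose range is zero as dense is an assumption added to avoid dividing by zero.

```java
/** Hypothetical helper, not part of Lucene: the density test from the comment above. */
final class DirectAddressingHeuristic {

  /**
   * Returns true when numLabels / (maxLabel - minLabel) >= minDensity.
   * A node whose labels are all equal (range 0) is treated as dense.
   */
  static boolean isDenseEnough(int numLabels, int minLabel, int maxLabel, double minDensity) {
    int range = maxLabel - minLabel;
    if (range <= 0) {
      return true;
    }
    return (double) numLabels / range >= minDensity;
  }

  public static void main(String[] args) {
    // 26 labels 'a'..'z': every value in the range is used.
    System.out.println(isDenseEnough(26, 'a', 'z', 0.75));
    // 3 labels spread over a range of 100.
    System.out.println(isDenseEnough(3, 0, 100, 0.75));
  }
}
```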


[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-05 Thread Bruno Roustant (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923428#comment-16923428
 ] 

Bruno Roustant commented on LUCENE-8920:


[~sokolov] There may be another option to speed up FST arc lookup while 
limiting the memory increase.

The Direct-Addressing option looks up a label with a single direct access, and 
costs up to (num labels x 4 x num bytes to encode) bytes.

The Label-List option is the opposite: a lookup needs N/2 label comparisons on 
average, and costs (num labels x var bytes to encode) bytes.

Another option is to use open-addressing. A lookup would take <= L comparisons, 
where we can fix L < log(N)/2 (to be faster than binary search), and would cost 
< (num labels x 2 x num bytes to encode) bytes.

The idea is to have an array of size 2^p such that 2^(p-1) < N < 2^p. We hash the 
labels and store them in the array using the open-addressing idea: if a slot is 
occupied, try the next one. If we can’t store a label in fewer than L tries, 
then abort and fall back to the Label-List or Binary-Search option. At lookup 
we hash the input label and know that we need fewer than L tries to compare.

This is another speed/memory compromise: faster than binary search (constant 
L), with at least 2x less memory than Direct-Addressing.

It is also possible to combine open-addressing and variable-length encoding, by 
finding the first byte starting a label based on the bit used to encode the 
var-length additional bytes.


[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-09-04 Thread Bruno Roustant (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922609#comment-16922609
 ] 

Bruno Roustant commented on LUCENE-8753:


Ok, I followed your advice to include the "shared terms" extension (subpackage) 
in the same PR #828. I'm going to close the two previous ones.

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the attached pdf explains the technique visually, in more detail)
>  The principle is to split the list of terms into blocks and use an FST to 
> access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number 
> of terms), with an allowed delta variation (10%) within which we compare the 
> terms and select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more 
> compact and speed up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark, 
> comparing UniformSplit with BlockTree. Find it in the first comment and also 
> attached for better formatting.
> Although the precise percentages vary between runs, three main points:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as 
> BlockTree does.
> This initial version passes all Lucene tests. Use “ant test 
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we 
> have already exercised this PostingsFormat extensibility to create a 
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley
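
The block selection described in the issue can be pictured with the sketch below. It is not the UniformSplit code: the names are hypothetical, terms are plain Strings rather than bytes, the delta is expressed as a number of terms, the minimal distinguishing prefix is assumed to be "common prefix with the preceding term, plus one character", and ties are assumed to go to the earliest candidate.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not the UniformSplit implementation. */
final class BlockSplitSketch {

  /** Length of the shortest prefix of term that distinguishes it from previousTerm (previousTerm < term). */
  static int minimalDistinguishingPrefixLength(String previousTerm, String term) {
    int max = Math.min(previousTerm.length(), term.length());
    int i = 0;
    while (i < max && previousTerm.charAt(i) == term.charAt(i)) {
      i++;
    }
    return i + 1;
  }

  /**
   * Picks the index of the first term of the next block: among the candidates within
   * targetBlockSize +/- delta terms of blockStart, the one with the shortest distinguishing prefix.
   * Returns -1 when no candidate remains, i.e. the remaining terms form the last block.
   */
  static int chooseNextBlockStart(List<String> sortedTerms, int blockStart, int targetBlockSize, int delta) {
    int lo = Math.max(blockStart + 1, blockStart + targetBlockSize - delta);
    int hi = Math.min(sortedTerms.size() - 1, blockStart + targetBlockSize + delta);
    int best = -1;
    int bestLength = Integer.MAX_VALUE;
    for (int i = lo; i <= hi; i++) {
      int length = minimalDistinguishingPrefixLength(sortedTerms.get(i - 1), sortedTerms.get(i));
      if (length < bestLength) {
        bestLength = length;
        best = i;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<String> terms = Arrays.asList("apple", "apply", "apricot", "banana", "band", "bandit", "cat");
    int next = chooseNextBlockStart(terms, 0, 3, 1);
    System.out.println("next block starts at: " + terms.get(next));
  }
}
```

A shorter distinguishing prefix means a shorter key to store in the FST for that block, which is presumably why the split point is allowed to move within the delta.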




[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-08-13 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906219#comment-16906219
 ] 

Bruno Roustant commented on LUCENE-8753:


New [PR 828|https://github.com/apache/lucene-solr/pull/828] to have this 
PostingsFormat inside codecs/uniformsplit with no code elsewhere. I added 
package javadoc and lucene.experimental annotation.

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>    Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8906) Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState

2019-07-19 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1682#comment-1682
 ] 

Bruno Roustant commented on LUCENE-8906:


PR added

> Lucene50PostingsReader.postings() casts BlockTermState param to private 
> IntBlockTermState
> -
>
> Key: LUCENE-8906
> URL: https://issues.apache.org/jira/browse/LUCENE-8906
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>    Reporter: Bruno Roustant
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Lucene50PostingsReader is the public API that offers the postings() method to 
> read the postings. Any PostingFormat can use it (as well as 
> Lucene50PostingsWriter) to read/write postings.
> But the postings() method asks for a (public) BlockTermState param which is 
> internally cast to the private IntBlockTermState. This BlockTermState is 
> provided by Lucene50PostingsReader.newTermState().
> public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, 
> PostingsEnum reuse, int flags)
> This actually makes it impossible for a custom PostingsFormat that customizes 
> the block file structure to use this postings() method by providing its own 
> (Int)BlockTermState, because it cannot access the FP fields of the 
> IntBlockTermState returned by PostingsReaderBase.newTermState().
> Proposed change:
>  * Either make IntBlockTermState public, as well as its fields.
>  * Or replace it by an interface in the postings() method. In this case the 
> IntBlockTermState fields currently accessed directly would be replaced by 
> getters/setters.




[jira] [Commented] (LUCENE-8921) IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq

2019-07-19 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1630#comment-1630
 ] 

Bruno Roustant commented on LUCENE-8921:


PR added

> IndexSearcher.termStatistics should not require TermStates but docFreq and 
> totalTermFreq
> 
>
> Key: LUCENE-8921
> URL: https://issues.apache.org/jira/browse/LUCENE-8921
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 8.1
>    Reporter: Bruno Roustant
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> IndexSearcher.termStatistics(Term term, TermStates context) is the way to 
> create a TermStatistics. It requires a TermStates param although it only 
> cares about the docFreq and totalTermFreq.
>  
> For customizations that want to create TermStatistics based on docFreq and 
> totalTermFreq, but that do not have TermStates available, this method forces 
> them to create a TermStates instance (which is not very lightweight) only to 
> pass two ints.
> termStatistics could be modified to the following signature:
> termStatistics(Term term, int docFreq, int totalTermFreq)
> Since it would change the API, it could be done in master for the next major 
> release.




[jira] [Commented] (LUCENE-8921) IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq

2019-07-17 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886889#comment-16886889
 ] 

Bruno Roustant commented on LUCENE-8921:


Yes, sure. I could work on a PR for 8.2.

> IndexSearcher.termStatistics should not require TermStates but docFreq and 
> totalTermFreq
> 
>
> Key: LUCENE-8921
> URL: https://issues.apache.org/jira/browse/LUCENE-8921
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 8.1
>    Reporter: Bruno Roustant
>Priority: Major
> Fix For: master (9.0)
>
>




[jira] [Created] (LUCENE-8921) IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq

2019-07-17 Thread Bruno Roustant (JIRA)

Bruno Roustant created LUCENE-8921:
--

 Summary: IndexSearcher.termStatistics should not require 
TermStates but docFreq and totalTermFreq
 Key: LUCENE-8921
 URL: https://issues.apache.org/jira/browse/LUCENE-8921
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Affects Versions: 8.1
Reporter: Bruno Roustant
 Fix For: master (9.0)


IndexSearcher.termStatistics(Term term, TermStates context) is the way to 
create a TermStatistics. It requires a TermStates param although it only cares 
about the docFreq and totalTermFreq.

 

For customizations that want to create TermStatistics based on docFreq and 
totalTermFreq, but that do not have TermStates available, this method forces 
them to create a TermStates instance (which is not very lightweight) only to 
pass two ints.

termStatistics could be modified to the following signature:

termStatistics(Term term, int docFreq, int totalTermFreq)

Since it would change the API, it could be done in master for the next major 
release.
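
To make the proposed signature change concrete, here is a minimal, self-contained sketch. It does not use the Lucene classes: Stats and States are hypothetical stand-ins for TermStatistics and TermStates, and the parameter types follow the "two ints" wording of this issue rather than any released API.

```java
// Stand-in types for illustration only; they are NOT the Lucene classes.
final class TermStatsSketch {

  /** Hypothetical stand-in for the statistics object handed to scoring. */
  static final class Stats {
    final String term;
    final int docFreq;
    final int totalTermFreq;

    Stats(String term, int docFreq, int totalTermFreq) {
      this.term = term;
      this.docFreq = docFreq;
      this.totalTermFreq = totalTermFreq;
    }
  }

  /** Hypothetical stand-in for the heavier per-segment state holder. */
  static final class States {
    private final int[] docFreqPerSegment;
    private final int[] totalTermFreqPerSegment;

    States(int[] docFreqPerSegment, int[] totalTermFreqPerSegment) {
      this.docFreqPerSegment = docFreqPerSegment;
      this.totalTermFreqPerSegment = totalTermFreqPerSegment;
    }

    int docFreq() {
      int sum = 0;
      for (int df : docFreqPerSegment) sum += df;
      return sum;
    }

    int totalTermFreq() {
      int sum = 0;
      for (int ttf : totalTermFreqPerSegment) sum += ttf;
      return sum;
    }
  }

  /** Current shape: the caller must build a States instance first. */
  static Stats termStatistics(String term, States states) {
    return termStatistics(term, states.docFreq(), states.totalTermFreq());
  }

  /** Proposed shape: the caller passes the two numbers it already has. */
  static Stats termStatistics(String term, int docFreq, int totalTermFreq) {
    return new Stats(term, docFreq, totalTermFreq);
  }
}
```

A customization that already knows docFreq and totalTermFreq can call the second overload directly, while existing callers holding a States object keep working through the first one.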




[jira] [Updated] (SOLR-11866) Support efficient subset matching in query elevation rules

2019-07-12 Thread Bruno Roustant (JIRA)



 [ 
https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11866:
--
Attachment: (was: SOLR-11866.patch)

> Support efficient subset matching in query elevation rules
> --
>
> Key: SOLR-11866
> URL: https://issues.apache.org/jira/browse/SOLR-11866
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 8.0
>    Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Leverages the SOLR-11865 refactoring by introducing a 
> SubsetMatchElevationProvider in QueryElevationComponent. This provider calls 
> a new util class TrieSubsetMatcher to efficiently match all query elevation 
> rules whose term subset is contained in the current query's list of terms.
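
As an illustration of subset matching with a trie, here is a minimal sketch. It is not the TrieSubsetMatcher from the patch: the class name SubsetTrieSketch, its methods, and the choice to sort terms lexicographically are all assumptions made for this example. Each rule's terms are inserted in sorted order, and matching walks the sorted query terms, descending wherever a child exists.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch, not the TrieSubsetMatcher from the patch.
final class SubsetTrieSketch<R> {

  private static final class Node<R> {
    final Map<String, Node<R>> children = new HashMap<>();
    final List<R> rules = new ArrayList<>();
  }

  private final Node<R> root = new Node<>();

  /** Registers a rule under its set of terms (sorted, duplicates dropped). */
  void addRule(Collection<String> terms, R rule) {
    Node<R> node = root;
    for (String term : new TreeSet<>(terms)) {
      node = node.children.computeIfAbsent(term, t -> new Node<>());
    }
    node.rules.add(rule);
  }

  /** Returns every rule whose term set is a subset of the query terms. */
  List<R> match(Collection<String> queryTerms) {
    List<String> sorted = new ArrayList<>(new TreeSet<>(queryTerms));
    List<R> matches = new ArrayList<>();
    collect(root, sorted, 0, matches);
    return matches;
  }

  private void collect(Node<R> node, List<String> sorted, int from, List<R> matches) {
    matches.addAll(node.rules);
    // Only later query terms can extend the path, because rule terms are sorted too.
    for (int i = from; i < sorted.size(); i++) {
      Node<R> child = node.children.get(sorted.get(i));
      if (child != null) {
        collect(child, sorted, i + 1, matches);
      }
    }
  }
}
```

With a rule registered for {apple, pie}, the query "hot apple pie" matches it, while the query "pie" alone does not. Note that in this sketch a rule with an empty term set matches every query.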




[jira] [Updated] (SOLR-11866) Support efficient subset matching in query elevation rules

2019-07-12 Thread Bruno Roustant (JIRA)



 [ 
https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11866:
--
Attachment: (was: 
0001-New-SubsetMatchElevationProvider-in-QueryElevationCo.patch)

> Support efficient subset matching in query elevation rules
> --
>
> Key: SOLR-11866
> URL: https://issues.apache.org/jira/browse/SOLR-11866
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 8.0
>    Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Attachments: SOLR-11866.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




[jira] [Commented] (SOLR-11866) Support efficient subset matching in query elevation rules

2019-07-12 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883819#comment-16883819
 ] 

Bruno Roustant commented on SOLR-11866:
---

Also, the documentation will need to be updated to explain the new 
match="subset" param in the elevation rule (in addition to match="exact").

> Support efficient subset matching in query elevation rules
> --
>
> Key: SOLR-11866
> URL: https://issues.apache.org/jira/browse/SOLR-11866
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>    Affects Versions: 8.0
>Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Attachments: 
> 0001-New-SubsetMatchElevationProvider-in-QueryElevationCo.patch, 
> SOLR-11866.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




[jira] [Commented] (SOLR-11866) Support efficient subset matching in query elevation rules

2019-07-12 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883816#comment-16883816
 ] 

Bruno Roustant commented on SOLR-11866:
---

I have updated this with PR #780 (https://github.com/apache/lucene-solr/pull/780). 
Should I remove the obsolete patch files from this Jira issue?

> Support efficient subset matching in query elevation rules
> --
>
> Key: SOLR-11866
> URL: https://issues.apache.org/jira/browse/SOLR-11866
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 8.0
>    Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Attachments: 
> 0001-New-SubsetMatchElevationProvider-in-QueryElevationCo.patch, 
> SOLR-11866.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-07-09 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881046#comment-16881046
 ] 

Bruno Roustant commented on LUCENE-8753:


I have created a related Jira issue, LUCENE-8906 
(Lucene50PostingsReader.postings() casts BlockTermState param to private 
IntBlockTermState), to move the PR review forward.

If we find a solution for this issue, then the UniformSplit posting format will 
be fully isolated in a separate package in codecs, with no remaining intrusion 
elsewhere.

The goal is to have it as an additional, optional posting format (not to replace 
BlockTree) for the following use-cases: customizable by extension, shared-terms 
extension available, low on-heap memory footprint, best efficiency when dealing 
with small to medium indexes.

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>    Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (The attached PDF explains the technique visually in more detail.)
>  The principle is to split the list of terms into blocks and use an FST to 
> access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number 
> of terms), with an allowed delta variation (10%) to compare the terms and 
> select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more 
> compact and speed up the loading/scanning.
> The performance measured with the luceneutil benchmark is interesting, 
> comparing UniformSplit with BlockTree. Find it in the first comment and also 
> attached for better formatting.
> Although the precise percentages vary between runs, three main points stand 
> out:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as 
> BlockTree does.
> This initial version passes all Lucene tests. Use “ant test 
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we 
> have already exercised this PostingsFormat extensibility to create a 
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley
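
The block selection and the seek-floor lookup described in the issue can be sketched roughly as follows. This is only an illustration under several assumptions: it works on Java strings rather than bytes, it uses a TreeMap as a stand-in for the FST, the delta is an absolute number of terms instead of a percentage, and all names (UniformSplitSketch, mdpLength, split, blockStartFor) are hypothetical.

```java
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch; a TreeMap stands in for the FST.
final class UniformSplitSketch {

  /** Length of the shortest prefix of term that distinguishes it from the previous (smaller) term. */
  static int mdpLength(String previous, String term) {
    int common = 0;
    int max = Math.min(previous.length(), term.length());
    while (common < max && previous.charAt(common) == term.charAt(common)) {
      common++;
    }
    return Math.min(common + 1, term.length());
  }

  /**
   * Splits sorted, distinct terms into blocks. Returns block key -> index of the
   * block's first term. Requires targetSize - delta >= 1.
   */
  static TreeMap<String, Integer> split(List<String> sortedTerms, int targetSize, int delta) {
    TreeMap<String, Integer> blocks = new TreeMap<>();
    int start = 0;
    while (start < sortedTerms.size()) {
      String first = sortedTerms.get(start);
      // The first block gets the empty key; the others use the MDP of their first term.
      String key = start == 0
          ? ""
          : first.substring(0, mdpLength(sortedTerms.get(start - 1), first));
      blocks.put(key, start);

      // Candidate positions for the next block start: target size plus or minus delta.
      int lo = start + targetSize - delta;
      int hi = Math.min(start + targetSize + delta, sortedTerms.size() - 1);
      if (lo > hi) {
        break; // the remaining terms all belong to the current block
      }
      int best = lo;
      for (int i = lo + 1; i <= hi; i++) {
        int candidate = mdpLength(sortedTerms.get(i - 1), sortedTerms.get(i));
        int current = mdpLength(sortedTerms.get(best - 1), sortedTerms.get(best));
        if (candidate < current) {
          best = i; // a shorter distinguishing prefix gives a shorter block key
        }
      }
      start = best;
    }
    return blocks;
  }

  /** Seek-floor: the block to scan is the one with the greatest key <= term. */
  static int blockStartFor(TreeMap<String, Integer> blocks, String term) {
    return blocks.floorEntry(term).getValue();
  }
}
```

Because each block key is a prefix of the block's first term and is strictly greater than the last term of the previous block, a floor lookup on the keys always lands on the only block that can contain the searched term.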




[jira] [Commented] (LUCENE-8906) Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState

2019-07-09 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881037#comment-16881037
 ] 

Bruno Roustant commented on LUCENE-8906:


This issue has been encountered in LUCENE-8753 (Uniform Split posting format).

> Lucene50PostingsReader.postings() casts BlockTermState param to private 
> IntBlockTermState
> -
>
> Key: LUCENE-8906
> URL: https://issues.apache.org/jira/browse/LUCENE-8906
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>    Reporter: Bruno Roustant
>Priority: Major
>




[jira] [Created] (LUCENE-8906) Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState

2019-07-09 Thread Bruno Roustant (JIRA)

Bruno Roustant created LUCENE-8906:
--

 Summary: Lucene50PostingsReader.postings() casts BlockTermState 
param to private IntBlockTermState
 Key: LUCENE-8906
 URL: https://issues.apache.org/jira/browse/LUCENE-8906
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Reporter: Bruno Roustant


Lucene50PostingsReader is the public API that offers the postings() method to 
read the postings. Any PostingsFormat can use it (as well as 
Lucene50PostingsWriter) to read/write postings.

But the postings() method asks for a (public) BlockTermState param which is 
internally cast to the private IntBlockTermState. This BlockTermState is 
provided by Lucene50PostingsReader.newTermState().

public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, 
PostingsEnum reuse, int flags)

This makes it impossible for a custom PostingsFormat that customizes the block 
file structure to use this postings() method by providing its own 
(Int)BlockTermState, because it cannot access the FP (file pointer) fields of 
the IntBlockTermState returned by PostingsReaderBase.newTermState().

Proposed change:
 * Either make IntBlockTermState public, as well as its fields.
 * Or replace it with an interface in the postings() method. In this case the 
IntBlockTermState fields currently accessed directly would be accessed through 
getters/setters.
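
The second option can be sketched as follows. This is a simplified, self-contained illustration rather than Lucene code: the class names mirror the issue text, but DocStartPointer, CustomTermState, and both postingsWith... methods are hypothetical, and only one file pointer field is modeled.

```java
// Stand-in types for illustration only; this is not Lucene code.
final class TermStateSketch {

  /** Public base type accepted by the postings() signature. */
  static class BlockTermState {
    int docFreq;
  }

  /** Hypothetical interface exposing one file pointer through a getter/setter. */
  interface DocStartPointer {
    long getDocStartFP();

    void setDocStartFP(long fp);
  }

  /** Stand-in for the reader's private state class. */
  private static final class IntBlockTermState extends BlockTermState implements DocStartPointer {
    private long docStartFP;

    @Override
    public long getDocStartFP() {
      return docStartFP;
    }

    @Override
    public void setDocStartFP(long fp) {
      docStartFP = fp;
    }
  }

  /** A state class owned by a custom postings format. */
  static final class CustomTermState extends BlockTermState implements DocStartPointer {
    private long docStartFP;

    @Override
    public long getDocStartFP() {
      return docStartFP;
    }

    @Override
    public void setDocStartFP(long fp) {
      docStartFP = fp;
    }
  }

  /** Current shape: the hard cast rejects any state that is not the private class. */
  static long postingsWithCast(BlockTermState state) {
    return ((IntBlockTermState) state).docStartFP;
  }

  /** Proposed shape: any state that exposes the pointer is accepted. */
  static long postingsWithInterface(DocStartPointer state) {
    return state.getDocStartFP();
  }
}
```

Passing a CustomTermState to postingsWithCast fails with a ClassCastException at runtime, while postingsWithInterface reads the pointer the custom format has set. The first option of the proposal (making the class and its fields public) would avoid the interface at the cost of exposing mutable fields.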




[jira] [Created] (LUCENE-8836) Optimize DocValues TermsDict to continue scanning from the last position when possible

2019-06-06 Thread Bruno Roustant (JIRA)

Bruno Roustant created LUCENE-8836:
--

 Summary: Optimize DocValues TermsDict to continue scanning from 
the last position when possible
 Key: LUCENE-8836
 URL: https://issues.apache.org/jira/browse/LUCENE-8836
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Bruno Roustant


Lucene80DocValuesProducer.TermsDict is used to look up either a term or a 
term ordinal.

Currently it does not have the optimization that FSTEnum has: the ability to 
continue a sequential scan from where the last lookup ended in the IndexInput. 
For sparse lookups (when searching for only a few terms or ordinals) this is 
not an issue. But for multiple lookups in a row, this optimization could avoid 
re-scanning all the terms from the block start (since they are delta encoded).

This patch proposes the optimization.

To estimate the gain, we ran 3 Lucene tests while counting the seeks and the 
term reads in the IndexInput, with and without the optimization:

TestLucene70DocValuesFormat - the optimization saves 24% seeks and 15% term 
reads.
TestDocValuesQueries - the optimization adds 0.7% seeks and 0.003% term reads.
TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% seeks and 
82% term reads.

In some cases, when scanning many terms in lexicographical order, the 
optimization saves a lot. In other cases, when only looking for a few sparse 
terms, the optimization brings no improvement, but no penalty either. It seems 
worthwhile to always have it.
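
The idea can be illustrated with a small in-memory model. This is not the TermsDict code: DeltaBlockSketch, its fields, and the string-based prefix encoding are assumptions made for this sketch. Terms are stored per block as (shared prefix length, suffix) relative to the previous term, so reaching an ordinal normally means decoding from the block start. The resume flag lets a lookup continue from the last decoded position when the target is in the same block and not behind it.

```java
import java.util.List;

// Hypothetical in-memory model of a delta-encoded terms dictionary.
final class DeltaBlockSketch {
  private final int[] prefixLengths; // chars shared with the previous term (0 at block start)
  private final String[] suffixes;   // remaining chars of each term
  private final int blockSize;
  private final boolean resumeFromLastPosition;

  private int currentOrd = -1;       // ordinal of the last decoded term, -1 if none
  private String currentTerm = "";
  int decodedTerms = 0;              // counts term decodes, to compare both strategies

  DeltaBlockSketch(List<String> sortedTerms, int blockSize, boolean resumeFromLastPosition) {
    this.blockSize = blockSize;
    this.resumeFromLastPosition = resumeFromLastPosition;
    this.prefixLengths = new int[sortedTerms.size()];
    this.suffixes = new String[sortedTerms.size()];
    for (int ord = 0; ord < sortedTerms.size(); ord++) {
      String term = sortedTerms.get(ord);
      int shared = 0;
      if (ord % blockSize != 0) { // the first term of a block is stored in full
        String previous = sortedTerms.get(ord - 1);
        int max = Math.min(previous.length(), term.length());
        while (shared < max && previous.charAt(shared) == term.charAt(shared)) {
          shared++;
        }
      }
      prefixLengths[ord] = shared;
      suffixes[ord] = term.substring(shared);
    }
  }

  /** Returns the term with the given ordinal, decoding as few terms as the strategy allows. */
  String lookupOrd(int ord) {
    int blockStart = (ord / blockSize) * blockSize;
    boolean canResume = resumeFromLastPosition && currentOrd >= blockStart && currentOrd <= ord;
    if (!canResume) {
      // Seek back to the block start and decode from there.
      currentOrd = blockStart - 1;
      currentTerm = "";
    }
    while (currentOrd < ord) {
      currentOrd++;
      currentTerm = currentTerm.substring(0, prefixLengths[currentOrd]) + suffixes[currentOrd];
      decodedTerms++;
    }
    return currentTerm;
  }
}
```

For sequential lookups over 8 terms in blocks of 4, the version without resume decodes 20 terms in total, whereas the resuming version decodes each term once, which is the kind of saving the test counts in the issue measure on the real IndexInput.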




[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-05-14 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839429#comment-16839429
 ] 

Bruno Roustant commented on LUCENE-8753:


Beyond the performance aspects, we developed UniformSplit to be extensible. To 
give an idea of how it can be extended, I have added a new PR#676: SharedTerms 
UniformSplit.

The use-case is when there are many fields. We want to take advantage of the 
FST to share the terms between all the fields, by replacing one FST per field 
with a single FST containing the shared terms. In this case each term is 
stored only once in the block file, and its block line contains the TermState 
for each field in which the term occurs.

term A -> field1 TermState, field2 TermState, field3 TermState

term B -> field3 TermState, field5 TermState

The FST is compact, and this posting format also unlocks the possibility of 
caching when the same term is searched in many fields (but this is not part of 
this PR).

My goal here is to showcase the extensibility of this posting format. This 
extension is in a separate sub-package, sharedterms, and is quite concise. (The 
only tricky part is the custom merge, which merges two segments efficiently by 
directly accessing the sharedterms posting format.)
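
The shared-terms layout can be pictured with a small in-memory model. This is only a sketch: SharedTermsSketch and FieldTermState are hypothetical names, a TreeMap stands in for the FST plus block file, and a single docFreq stands in for the full TermState.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model of a dictionary shared by all fields.
final class SharedTermsSketch {

  /** Stand-in for the per-field TermState stored on a term's block line. */
  static final class FieldTermState {
    final String field;
    final int docFreq;

    FieldTermState(String field, int docFreq) {
      this.field = field;
      this.docFreq = docFreq;
    }
  }

  // One entry per distinct term, whatever the number of fields it occurs in.
  private final TreeMap<String, List<FieldTermState>> dictionary = new TreeMap<>();

  void add(String term, String field, int docFreq) {
    dictionary.computeIfAbsent(term, t -> new ArrayList<>()).add(new FieldTermState(field, docFreq));
  }

  /** Returns the state of the term in the given field, or null if it does not occur there. */
  FieldTermState seek(String term, String field) {
    List<FieldTermState> line = dictionary.get(term);
    if (line == null) {
      return null;
    }
    for (FieldTermState state : line) {
      if (state.field.equals(field)) {
        return state;
      }
    }
    return null;
  }

  int storedTerms() {
    return dictionary.size();
  }
}
```

Loading the "term A" and "term B" lines above into this model stores two terms instead of five, and a single term lookup gives access to the states of every field, which is what makes per-term caching across fields possible.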

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




[jira] [Comment Edited] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-09 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813171#comment-16813171
 ] 

Bruno Roustant edited comment on LUCENE-8753 at 4/9/19 9:26 AM:


I agree.

We profiled wikimediumall and saw that 90% of the time is spent in 
scoring, and less than a couple of percent is spent accessing the dictionary 
blocks.

Our own use-case is to have multiple small-to-medium cores, the size of 
wikimedium500k, which is why we studied it more.


was (Author: bruno.roustant):
I agree.

We profile wikimediumall and we saw that 90% of the time is spent in the 
scoring, and less than a couple of percent is spent to access the dictionary 
blocks.

Our own use-case is to have multiple small-to-medium cores, the size of 
wikimedium500k, that's why we studied it more.

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>    Reporter: Bruno Roustant
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-09 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813171#comment-16813171
 ] 

Bruno Roustant commented on LUCENE-8753:


I agree.

We profiled wikimediumall and saw that 90% of the time is spent in 
scoring, and less than a couple of percent is spent accessing the dictionary 
blocks.

Our own use-case is to have multiple small-to-medium cores, the size of 
wikimedium500k, which is why we studied it more.
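
A quick back-of-the-envelope calculation shows why a faster dictionary barely moves QPS on the large index. The sketch below applies the usual Amdahl-style formula; the fraction 0.02 is an assumed value standing for "a couple of percent", and AmdahlSketch is a hypothetical name.

```java
// Hypothetical helper for a rough estimate; 0.02 is an assumed fraction.
final class AmdahlSketch {

  /** Overall speedup when a fraction of the total time is accelerated by the given factor. */
  static double overallSpeedup(double fraction, double factor) {
    return 1.0 / ((1.0 - fraction) + fraction / factor);
  }

  public static void main(String[] args) {
    double dictionaryFraction = 0.02; // assumed share of query time spent in the dictionary
    System.out.printf("dictionary 2x faster: %.4f%n", overallSpeedup(dictionaryFraction, 2.0));
    System.out.printf("dictionary free:      %.4f%n",
        overallSpeedup(dictionaryFraction, Double.POSITIVE_INFINITY));
  }
}
```

Under that assumption, halving the dictionary access time yields about a 1% overall gain, and even a dictionary that cost nothing would yield only about 2%, which is within the run-to-run noise of the benchmark.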

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>    Reporter: Bruno Roustant
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




[jira] [Comment Edited] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-09 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813106#comment-16813106
 ] 

Bruno Roustant edited comment on LUCENE-8753 at 4/9/19 8:15 AM:


It took me some time to run the wikimediumall 8 GB index (I didn't anticipate 1h 
of indexing initially - a little less for UniformSplit - and then I had an 
exception about facets).

Then I got results which surprised me. BlockTree and UniformSplit had the same 
QPS for Term and Phrase queries. I didn't understand why the behavior differed 
between a small and a large index.

Then I thought about 2 explanations:
 * A much larger index could mean fewer OS IO cache hits. I ran the benchmark 
on a 16 GB laptop and a 64 GB desktop, and I saw nearly no difference in my 
test.
 * A much larger index could mean more results. So the time spent scoring and 
ranking the results could become much larger and diminish the effect of a 
change in the dictionary. I have no clue there at the moment.

Here is the result of wikimediumall on a 64 GB desktop:

(I used the -Jira option, but it does not seem to recognize the "color" tag)

||Task||QPS BT||StdDev BT||QPS CUS||StdDev CUS||Pct diff||
|Fuzzy1|72.81|3.11|21.77|0.71|\{color:red}72%\{color}-\{color:red}67%\{color}|
|Fuzzy2|66.77|3.77|20.41|0.67|\{color:red}72%\{color}-\{color:red}66%\{color}|
|Respell|8.85|0.64|6.02|0.33|\{color:red}40%\{color}-\{color:red}22%\{color}|
|PKLookup|130.83|3.96|121.66|12.37|\{color:red}18%\{color}-\{color:green}5%\{color}|
|Wildcard|25.03|1.33|23.93|1.19|\{color:red}13%\{color}-\{color:green}6%\{color}|
|HighTermMonthSort|19.03|2.55|18.40|1.56|\{color:red}21%\{color}-\{color:green}21%\{color}|
|Prefix3|12.47|0.82|12.10|0.78|\{color:red}14%\{color}-\{color:green}10%\{color}|
|LowTerm|182.95|14.94|177.97|18.67|\{color:red}19%\{color}-\{color:green}17%\{color}|
|IntNRQ|5.21|0.54|5.09|0.56|\{color:red}21%\{color}-\{color:green}21%\{color}|
|MedTerm|90.74|3.99|89.14|4.24|\{color:red}10%\{color}-\{color:green}7%\{color}|
|HighTerm|42.54|1.95|41.86|2.00|\{color:red}10%\{color}-\{color:green}8%\{color}|
|OrNotHighLow|532.96|16.16|526.86|24.40|\{color:red}8%\{color}-\{color:green}6%\{color}|
|HighSloppyPhrase|12.00|0.39|11.90|0.48|\{color:red}7%\{color}-\{color:green}6%\{color}|
|OrNotHighMed|53.64|1.08|53.37|1.22|\{color:red}4%\{color}-\{color:green}3%\{color}|
|MedSloppyPhrase|31.83|0.59|31.67|0.78|\{color:red}4%\{color}-\{color:green}3%\{color}|
|HighPhrase|32.24|0.85|32.09|0.81|\{color:red}5%\{color}-\{color:green}4%\{color}|
|LowSloppyPhrase|29.51|0.43|29.40|0.58|\{color:red}3%\{color}-\{color:green}3%\{color}|
|AndHighHigh|26.97|0.31|26.88|0.37|\{color:red}2%\{color}-\{color:green}2%\{color}|
|MedPhrase|4.95|0.16|4.94|0.15|\{color:red}6%\{color}-\{color:green}6%\{color}|
|AndHighMed|50.03|0.72|49.97|0.72|\{color:red}2%\{color}-\{color:green}2%\{color}|
|OrNotHighHigh|18.85|0.76|18.85|0.82|\{color:red}8%\{color}-\{color:green}8%\{color}|
|OrHighNotHigh|9.35|0.32|9.35|0.35|\{color:red}6%\{color}-\{color:green}7%\{color}|
|OrHighLow|15.85|0.59|15.85|0.52|\{color:red}6%\{color}-\{color:green}7%\{color}|
|OrHighNotLow|17.56|0.71|17.57|0.70|\{color:red}7%\{color}-\{color:green}8%\{color}|
|AndHighLow|284.39|4.41|284.60|5.65|\{color:red}3%\{color}-\{color:green}3%\{color}|
|LowPhrase|224.73|4.35|224.97|4.84|\{color:red}3%\{color}-\{color:green}4%\{color}|
|OrHighNotMed|13.21|0.49|13.22|0.50|\{color:red}7%\{color}-\{color:green}7%\{color}|
|OrHighMed|13.22|0.73|13.30|0.70|\{color:red}9%\{color}-\{color:green}12%\{color}|
|OrHighHigh|7.56|0.43|7.62|0.41|\{color:red}9%\{color}-\{color:green}12%\{color}|
|BrowseMonthTaxoFacets|7.96|1.92|8.06|1.78|\{color:red}36%\{color}-\{color:green}63%\{color}|
|LowSpanNear|11.84|0.19|11.99|0.21|\{color:red}2%\{color}-\{color:green}4%\{color}|
|HighTermDayOfYearSort|20.05|1.40|20.31|2.15|\{color:red}15%\{color}-\{color:green}20%\{color}|
|BrowseDayOfYearTaxoFacets|7.96|1.91|8.07|1.85|\{color:red}37%\{color}-\{color:green}64%\{color}|
|BrowseMonthSSDVFacets|7.95|1.90|8.07|1.87|\{color:red}37%\{color}-\{color:green}64%\{color}|
|BrowseDayOfYearSSDVFacets|7.96|1.93|8.08|1.84|\{color:red}36%\{color}-\{color:green}64%\{color}|
|MedSpanNear|10.50|0.18|10.67|0.21|\{color:red}2%\{color}-\{color:green}5%\{color}|
|BrowseDateTaxoFacets|7.91|1.81|8.07|1.83|\{color:red}35%\{color}-\{color:green}62%\{color}|
|HighSpanNear|8.68|0.19|8.88|0.19|\{color:red}2%\{color}-\{color:green}6%\{color}|
 



[jira] [Comment Edited] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-09 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813106#comment-16813106
 ] 

Bruno Roustant edited comment on LUCENE-8753 at 4/9/19 8:13 AM:



[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-09 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813106#comment-16813106
 ] 

Bruno Roustant commented on LUCENE-8753:



> New PostingFormat - UniformSplit
> 
>
>                 Key: LUCENE-8753
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8753
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.0
>            Reporter: Bruno Roustant
>            Priority: Major
>         Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a proposal to ad

[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809251#comment-16809251
 ] 

Bruno Roustant commented on LUCENE-8753:


{quote}I think this is similar to […] BlockTermsReader/Writer
{quote}
Indeed similar; it mainly differs from VariableGapTermsIndexWriter in the way 
it selects the best term to start a block: the selection is based on the 
minimal distinguishing prefix. The idea is to make the terms index FST more 
compact. That way, for a given max heap memory target, we can potentially have 
more blocks, and therefore smaller ones that are scanned faster. This 
requirement to consume less heap was strong with Lucene 7.1; it may matter less 
now with the recent off-heap FST.
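
To make the notion of minimal distinguishing prefix concrete, here is a minimal Python sketch. It assumes the usual meaning of the term (the shortest prefix of a term that still sorts strictly after the previous term); the function name is hypothetical and this is not the code from the patch.

```python
def minimal_distinguishing_prefix(previous_term: bytes, term: bytes) -> bytes:
    """Shortest prefix of `term` that sorts strictly after `previous_term`.

    Assumes previous_term < term in byte order (sorted, distinct terms).
    """
    if not previous_term < term:
        raise ValueError("terms must be sorted and distinct")
    # Length of the prefix shared by both terms.
    common = 0
    limit = min(len(previous_term), len(term))
    while common < limit and previous_term[common] == term[common]:
        common += 1
    # One byte past the shared prefix is enough to tell the terms apart.
    return term[:common + 1]


print(minimal_distinguishing_prefix(b"search", b"seek"))   # shares "se"
print(minimal_distinguishing_prefix(b"altar", b"beta"))    # shares nothing
```

A term whose minimal distinguishing prefix is short makes a cheap block key, since fewer bytes have to be stored in the FST for it.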

 
{quote}Are you also doing something different to encode/decode postings?
{quote}
No, the postings are written with the regular PostingsWriterBase.

 
{quote}Can you post results on the full wikimediumall?
{quote}
 Good point. Will do tomorrow.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808852#comment-16808852
 ] 

Bruno Roustant commented on LUCENE-8753:


{quote}Is it due to the fact that it doesn't have the ability to fail lookups 
early like BlockTree?
{quote}
This is one cause. While BlockTree builds a kind of prefix trie and may stop 
early if the prefix is not matched, UniformSplit doesn't, so it loads a block.

That said, I noticed that PKLookup performance varies a lot. It is sometimes in 
favor of UniformSplit. Actually, I don't know how the benchmark generates the 
test set, but it clearly has an influence on the metric.
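
The difference for lookups of missing keys can be sketched in Python as follows. Both models are deliberately simplistic and hypothetical: the "prefix check" is a one-byte stand-in for a prefix trie, the block keys are simply the first term of each block, and none of the names come from Lucene.

```python
from bisect import bisect_right

TERMS = sorted([b"apple", b"banana", b"date", b"fig", b"mango",
                b"peach", b"search", b"seek", b"term", b"tree"])
BLOCK_SIZE = 2
BLOCKS = [TERMS[i:i + BLOCK_SIZE] for i in range(0, len(TERMS), BLOCK_SIZE)]
BLOCK_KEYS = [block[0] for block in BLOCKS]      # simplification: first term as key
KNOWN_PREFIXES = {term[:1] for term in TERMS}    # stand-in for a prefix trie


def floor_lookup(target: bytes, stats: dict) -> bool:
    """Seek-floor: find the last block key <= target, then scan that block."""
    index = bisect_right(BLOCK_KEYS, target) - 1
    if index < 0:
        return False                 # target sorts before every block
    stats["blocks_loaded"] += 1      # the block is loaded even for a missing term
    return target in BLOCKS[index]


def prefix_checked_lookup(target: bytes, stats: dict) -> bool:
    """Reject early when no indexed term shares the target's prefix."""
    if target[:1] not in KNOWN_PREFIXES:
        return False                 # no block is loaded
    return floor_lookup(target, stats)


floor_stats = {"blocks_loaded": 0}
prefix_stats = {"blocks_loaded": 0}
print(floor_lookup(b"zebra", floor_stats), floor_stats)
print(prefix_checked_lookup(b"zebra", prefix_stats), prefix_stats)
```

Both lookups agree on the answer; the only difference is whether a block had to be loaded to find out that the term is absent.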


[jira] [Updated] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Bruno Roustant (JIRA)



 [ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-8753:
---
Attachment: luceneutil.benchmark.txt


[jira] [Updated] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Bruno Roustant (JIRA)



 [ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-8753:
---
Description: 
This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
objectives:
 - Clear design and simple code.
 - Easily extensible, for both the logic and the index format.
 - Light memory usage with a very compact FST.
 - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.

(the attached pdf explains the technique visually in more detail)
 The principle is to split the list of terms into blocks and use an FST to 
access the block, but not as a prefix trie, rather with a seek-floor pattern. 
For the selection of the blocks, there is a target average block size (number 
of terms), with an allowed delta variation (10%) within which the terms are 
compared to select the one with the minimal distinguishing prefix.
 There are also several optimizations inside the block to make it more compact 
and speed up the loading/scanning.
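
The block selection can be sketched roughly as follows in Python. This is one possible reading of the paragraph above, not the code of the patch: the function names, the tie-breaking rule (first candidate wins), the empty key for the first block and the minimum delta of one term are all assumptions.

```python
def minimal_distinguishing_prefix(previous_term: bytes, term: bytes) -> bytes:
    """Shortest prefix of `term` that sorts strictly after `previous_term`."""
    common = 0
    limit = min(len(previous_term), len(term))
    while common < limit and previous_term[common] == term[common]:
        common += 1
    return term[:common + 1]


def split_into_blocks(terms, target_size, delta_ratio=0.10):
    """Split sorted, distinct terms into (block_key, block_terms) pairs.

    Each next block starts at the term, within target_size +/- delta of the
    current block start, that has the shortest minimal distinguishing prefix.
    """
    if target_size < 2:
        raise ValueError("target_size must be at least 2")
    if not terms:
        return []
    delta = min(max(1, round(target_size * delta_ratio)), target_size - 1)
    starts = [0]
    while True:
        lo = starts[-1] + target_size - delta
        hi = min(starts[-1] + target_size + delta, len(terms) - 1)
        if lo > hi:
            break  # not enough terms left for another block
        best = min(range(lo, hi + 1),
                   key=lambda i: len(minimal_distinguishing_prefix(terms[i - 1], terms[i])))
        starts.append(best)
    bounds = starts + [len(terms)]
    blocks = []
    for n, start in enumerate(starts):
        # Assumption: the first block gets an empty key so every term has a floor.
        key = b"" if start == 0 else minimal_distinguishing_prefix(terms[start - 1], terms[start])
        blocks.append((key, terms[start:bounds[n + 1]]))
    return blocks


terms = [b"alpha", b"alpine", b"alps", b"also", b"altar",
         b"beta", b"bets", b"better", b"game", b"gamma"]
for key, block in split_into_blocks(terms, target_size=4):
    print(key, block)
```

With this toy term list, the block boundaries land where a single byte is enough to distinguish the first term of a block from its predecessor, which is what keeps the keys stored in the FST short.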

The performance obtained with the luceneutil benchmark, comparing UniformSplit 
with BlockTree, is interesting. Find it in the first comment and also attached 
for better formatting.

Although the precise percentages vary between runs, three main points stand 
out:
 - TermQuery and PhraseQuery are improved.
 - PrefixQuery and WildcardQuery are ok.
 - Fuzzy queries are clearly less performant, because BlockTree is so optimized 
for them.

Compared to BlockTree, FST size is reduced by 15%, and segment writing time is 
reduced by 20%. So this PostingsFormat scales to lots of docs, as BlockTree 
does.

This initial version passes all Lucene tests. Use “ant test 
-Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.

Subjectively, we think we have fulfilled our goal of code simplicity. And we 
have already exercised the extensibility of this PostingsFormat to create a 
different flavor for our own use-case.

Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley




[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808799#comment-16808799
 ] 

Bruno Roustant commented on LUCENE-8753:


Here's the luceneutil benchmark with the wikimedium500k data set using Java 8. 
These results are a bit dated since they were obtained with Lucene 7.1; it'd be 
nice to update to master.

 

Report after iter 19:
||Task||QPS blocktree||StdDev||QPS uniformsplit||StdDev||Pct diff||
|Fuzzy1|508.47|(3.8%)|221.37|(0.9%)|-56.5% (-58% - -53%)|
|Fuzzy2|171.73|(6.4%)|80.62|(1.4%)|-53.1% (-57% - -48%)|
|PKLookup|182.47|(2.4%)|149.62|(2.5%)|-18.0% (-22% - -13%)|
|Wildcard|1788.74|(5.9%)|1729.37|(4.5%)|-3.3% (-12% - 7%)|
|IntNRQ|1561.48|(2.1%)|1564.33|(1.9%)|0.2% (-3% - 4%)|
|Prefix3|1759.69|(5.0%)|1829.74|(4.8%)|4.0% (-5% - 14%)|
|HighTermDayOfYearSort|586.06|(5.4%)|622.34|(8.2%)|6.2% (-6% - 20%)|
|MedPhrase|1204.85|(5.5%)|1282.89|(7.7%)|6.5% (-6% - 20%)|
|HighSpanNear|590.88|(4.1%)|629.64|(6.1%)|6.6% (-3% - 17%)|
|OrHighMed|1101.48|(4.5%)|1220.75|(6.2%)|10.8% (0% - 22%)|
|HighTermMonthSort|2617.10|(2.6%)|2916.34|(4.6%)|11.4% (4% - 19%)|
|HighPhrase|961.04|(5.5%)|1073.62|(6.0%)|11.7% (0% - 24%)|
|MedSloppyPhrase|604.56|(13.3%)|680.31|(13.7%)|12.5% (-12% - 45%)|
|LowSloppyPhrase|954.87|(8.1%)|1075.67|(5.4%)|12.7% (0% - 28%)|
|MedSpanNear|737.14|(5.8%)|830.68|(8.3%)|12.7% (-1% - 28%)|
|OrHighHigh|811.57|(5.7%)|915.01|(6.2%)|12.7% (0% - 26%)|
|AndHighMed|1157.45|(5.3%)|1317.78|(5.1%)|13.9% (3% - 25%)|
|AndHighHigh|1095.29|(5.7%)|1254.16|(4.9%)|14.5% (3% - 26%)|
|HighSloppyPhrase|880.42|(8.2%)|1009.72|(7.0%)|14.7% (0% - 32%)|
|LowPhrase|1245.33|(6.0%)|1473.57|(4.4%)|18.3% (7% - 30%)|
|Respell|81.10|(12.7%)|99.43|(10.3%)|22.6% (0% - 52%)|
|HighTerm|3733.81|(6.1%)|4599.96|(6.8%)|23.2% (9% - 38%)|
|OrHighLow|1960.13|(6.2%)|2415.81|(6.0%)|23.2% (10% - 37%)|
|MedTerm|4411.60|(4.9%)|5450.56|(5.8%)|23.6% (12% - 35%)|
|LowSpanNear|1944.27|(5.3%)|2416.29|(4.5%)|24.3% (13% - 36%)|
|AndHighLow|1978.10|(7.6%)|2500.74|(5.8%)|26.4% (12% - 43%)|
|LowTerm|4949.24|(4.8%)|6589.86|(5.3%)|33.1% (22% - 45%)|
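
As a reading aid, the main figure of the Pct diff column is consistent with the relative change of the mean QPS from blocktree to uniformsplit. The sketch below recomputes it for three rows of the report; the helper name is hypothetical, and the range shown in parentheses in the report is not reproduced here because its exact formula is not given in this thread.

```python
def pct_diff(qps_baseline: float, qps_candidate: float) -> float:
    """Relative QPS change of the candidate versus the baseline, in percent."""
    return 100.0 * (qps_candidate - qps_baseline) / qps_baseline


# (QPS blocktree, QPS uniformsplit) copied from the report above.
rows = {
    "Fuzzy1": (508.47, 221.37),
    "Wildcard": (1788.74, 1729.37),
    "LowTerm": (4949.24, 6589.86),
}
for task, (blocktree, uniformsplit) in rows.items():
    print(f"{task}: {pct_diff(blocktree, uniformsplit):+.1f}%")
```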


[jira] [Created] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Bruno Roustant (JIRA)

Bruno Roustant created LUCENE-8753:
--

             Summary: New PostingFormat - UniformSplit
                 Key: LUCENE-8753
                 URL: https://issues.apache.org/jira/browse/LUCENE-8753
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/codecs
    Affects Versions: 8.0
            Reporter: Bruno Roustant
         Attachments: Uniform Split Technique.pdf

This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
objectives:
- Clear design and simple code.
- Easily extensible, for both the logic and the index format.
- Light memory usage with a very compact FST.
- Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.

(The attached PDF explains the technique visually, in more detail.)
 The principle is to split the list of terms into blocks and use an FST to 
access the block, not as a prefix trie but with a seek-floor pattern. 
For the selection of the blocks, there is a target average block size (number 
of terms) with an allowed delta variation (10%); the terms within that window 
are compared and the one with the minimal distinguishing prefix is selected.
There are also several optimizations inside the block to make it more compact 
and to speed up loading and scanning.
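
To make the block selection and the seek-floor lookup more concrete, here is a 
minimal sketch. It is not the UniformSplit code: the class and method names are 
hypothetical, terms are plain Strings rather than byte sequences, and a TreeMap 
stands in for the FST (only its floor lookup matters here). It assumes the 
terms are sorted, distinct, and that targetSize is greater than delta.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; names and structure are assumptions. */
final class UniformSplitSketch {

  /** Length of the shortest prefix of term that sorts strictly after previous. */
  static int mdpLength(String previous, String term) {
    int i = 0;
    int max = Math.min(previous.length(), term.length());
    while (i < max && previous.charAt(i) == term.charAt(i)) {
      i++;
    }
    return Math.min(i + 1, term.length());
  }

  /** Maps each block key (minimal distinguishing prefix of the block's first term) to its terms. */
  static TreeMap<String, List<String>> buildBlocks(
      List<String> sortedTerms, int targetSize, int delta) {
    if (targetSize - delta < 1) {
      throw new IllegalArgumentException("targetSize must be greater than delta");
    }
    TreeMap<String, List<String>> blocks = new TreeMap<>();
    int start = 0;
    String key = ""; // empty key for the first block, so every term has a floor
    while (start < sortedTerms.size()) {
      int end = sortedTerms.size();
      String nextKey = null;
      int lo = start + targetSize - delta;
      int hi = Math.min(start + targetSize + delta, sortedTerms.size() - 1);
      // Among the candidate split points in the window, keep the shortest prefix.
      for (int i = lo; i <= hi; i++) {
        String term = sortedTerms.get(i);
        int len = mdpLength(sortedTerms.get(i - 1), term);
        if (nextKey == null || len < nextKey.length()) {
          nextKey = term.substring(0, len);
          end = i;
        }
      }
      blocks.put(key, new ArrayList<>(sortedTerms.subList(start, end)));
      start = end;
      key = nextKey;
    }
    return blocks;
  }

  /** Seek-floor: find the block with the greatest key <= term, then scan that block. */
  static boolean contains(TreeMap<String, List<String>> blocks, String term) {
    Map.Entry<String, List<String>> block = blocks.floorEntry(term);
    return block != null && block.getValue().contains(term);
  }

  public static void main(String[] args) {
    List<String> terms = Arrays.asList(
        "apple", "apply", "banana", "band", "bandit", "cat", "cater", "dog", "door", "zebra");
    TreeMap<String, List<String>> blocks = buildBlocks(terms, 4, 1);
    System.out.println(blocks.keySet());
    System.out.println(contains(blocks, "dog"));
  }
}
```

With a target size of 4 and a delta of 1, the sketch splits at "cat" and 
"zebra" because their distinguishing prefixes ("c" and "z") are the shortest 
in each window, which is what keeps the keys stored in the index structure 
small.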

The results obtained with the luceneutil benchmark, comparing UniformSplit 
with BlockTree, are interesting; see the first comment.
 
 Although the precise percentages vary between runs, three main points stand out:
- TermQuery and PhraseQuery are improved.
- PrefixQuery and WildcardQuery are ok.
- Fuzzy queries are clearly less performant, because BlockTree is so optimized 
for them.

Compared to BlockTree, FST size is reduced by 15%, and segment writing time is 
reduced by 20%. So this PostingsFormat scales to lots of docs, as BlockTree does.
 
 This initial version passes all Lucene tests. Use “ant test 
-Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.

Subjectively, we think we have fulfilled our goal of code simplicity. And we 
have already exercised this PostingsFormat extensibility to create a different 
flavor for our own use-case.
 
 Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Closed] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-06-19 Thread Bruno Roustant (JIRA)



 [ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant closed SOLR-11865.
-

Work done

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>    Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Minor
>  Labels: QueryComponent
> Fix For: 7.5
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, 
> 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, 
> SOLR-11865.patch, SOLR-11865.patch, SOLR-11865.patch, SOLR-11865.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.




[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-06-19 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517247#comment-16517247
 ] 

Bruno Roustant commented on SOLR-11865:
---

Thanks for your incredible help [~dsmiley]!

Closing this PR.

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>    Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Minor
>  Labels: QueryComponent
> Fix For: 7.5
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, 
> 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, 
> SOLR-11865.patch, SOLR-11865.patch, SOLR-11865.patch, SOLR-11865.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.




[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-05-31 Thread Bruno Roustant (JIRA)



[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496788#comment-16496788
 ] 

Bruno Roustant commented on SOLR-11865:
---

You're right, MapElevationProvider.buildElevationMap should merge in this case 
(which indeed should not happen, since they have been merged earlier).

I have created the GitHub PR 
(https://github.com/apache/lucene-solr/pull/390), to be enhanced with all 
your improvements.

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>    Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, 
> 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, 
> SOLR-11865.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.




[GitHub] lucene-solr issue #390: Refactor QueryElevationComponent to prepare query su...

2018-05-31 Thread bruno-roustant

Github user bruno-roustant commented on the issue:

https://github.com/apache/lucene-solr/pull/390
  
@dsmiley here is the PR for QueryElevationComponent.



[GitHub] lucene-solr pull request #390: Refactor QueryElevationComponent to prepare q...

2018-05-31 Thread bruno-roustant

GitHub user bruno-roustant opened a pull request:

https://github.com/apache/lucene-solr/pull/390

Refactor QueryElevationComponent to prepare query subset matching 
[SOLR-11865]

See comments in https://issues.apache.org/jira/browse/SOLR-11865

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bruno-roustant/lucene-solr QueryElevationComponent

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/390.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #390


commit 5a6290359c0e12f2372c8470de636796928921cc
Author: broustant 
Date:   2018-01-12T17:03:26Z

Refactor QueryElevationComponent to introduce ElevationProvider

- Refactor to introduce ElevationProvider. The current full-query match 
policy becomes a default simple MapElevationProvider. It can be replaced by a 
more efficient provider in the future, or replaced by an extending class.
- Add overridable methods to handle exceptions during the component 
initialization.
- Add overridable methods to provide the default values for config 
properties.
- No functional change beyond refactoring.
- Adapt unit test.
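
As a rough picture of the refactoring described in this first commit, the 
sketch below shows a provider interface with a map-backed, full-query-match 
default. All names and signatures here are hypothetical stand-ins; the 
interface in the patch may look different, and elevated documents are reduced 
to a list of String ids.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lookup contract; not the actual Solr interface. */
interface ElevationProviderSketch {
  /** Returns the elevated document ids for the query, or an empty list if no rule matches. */
  List<String> getElevatedIds(String queryString);
}

/** Full-query match: a rule applies only when the whole query string equals its key. */
final class MapElevationProviderSketch implements ElevationProviderSketch {
  private final Map<String, List<String>> rules;

  MapElevationProviderSketch(Map<String, List<String>> rules) {
    this.rules = new HashMap<>(rules);
  }

  @Override
  public List<String> getElevatedIds(String queryString) {
    return rules.getOrDefault(queryString, Collections.emptyList());
  }
}

/** Stand-in for the component: a subclass can override the factory to swap the matching policy. */
class ElevationComponentSketch {
  protected ElevationProviderSketch createElevationProvider(Map<String, List<String>> rules) {
    return new MapElevationProviderSketch(rules);
  }

  public static void main(String[] args) {
    Map<String, List<String>> rules = new HashMap<>();
    rules.put("ipod", Arrays.asList("doc1", "doc2"));
    ElevationProviderSketch provider = new ElevationComponentSketch().createElevationProvider(rules);
    System.out.println(provider.getElevatedIds("ipod"));      // rule applies
    System.out.println(provider.getElevatedIds("ipod nano")); // no full-query match
  }
}
```

A later subset-matching provider would then only need to implement the same 
interface and be returned from an overridden factory method, leaving the 
rest of the component untouched.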

commit e9f53315ef0dc230280e93f868055183aa09abb6
Author: broustant 
Date:   2018-03-30T12:04:43Z

Refactor QueryElevationComponent after review

commit 0bad4c66cf4ce89bc6cca3a7e631c20b23c500c4
Author: broustant 
Date:   2018-04-04T15:51:03Z

Remove exception handlers and refactor getBoostDocs





[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-05-15 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476032#comment-16476032
 ] 

Bruno Roustant commented on SOLR-11865:
---

Great! I agree with all your points [~dsmiley].

Indeed, the String IDs in Elevation would be clearer as BytesRefs. And I vote to 
apply the key String => indexed form conversion as early as possible, provided 
the code remains small.

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, 
> 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.




[jira] [Comment Edited] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-15 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475579#comment-16475579
 ] 

Bruno Roustant edited comment on LUCENE-8292 at 5/15/18 9:57 AM:
-

Actually there is also another related issue with this 
FilterLeafReader#FilterTermsEnum delegate pattern.

It delegates neither termState() nor seekExact(BytesRef, TermState). 
This means the termState is never used, so term queries repeat the same seek 
(seekCeil) twice instead of using the termState to improve performance 
(normally the termState is kept by TermContext#build()).

Practical example: When one configures a timeout for queries, internally an 
ExitableDirectoryReader is created. And its ExitableTermsEnum, which extends 
FilterTermsEnum, makes all term queries repeat the same seekCeil() twice.
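
The double seek can be reproduced with a small self-contained sketch. The 
types below are simplified stand-ins, not Lucene classes: terms are Strings 
instead of BytesRef, the term state is a plain Object, and the class names 
are made up for illustration. The base class defaults mimic the fallback 
behavior described above (seekExact falls back to seekCeil, and the state 
variant seeks again).

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Simplified stand-in for a terms enum; only the seek methods are modelled. */
abstract class TermsEnumSketch {
  abstract boolean seekCeil(String term); // true when the exact term is found

  /** Default: fall back to seekCeil. */
  boolean seekExact(String term) {
    return seekCeil(term);
  }

  /** Default: ignore the saved state and seek again. */
  void seekExact(String term, Object state) {
    if (!seekExact(term)) {
      throw new IllegalArgumentException("term not found: " + term);
    }
  }

  /** Default: no reusable state. */
  Object termState() {
    return null;
  }
}

/** Plays the codec's enum: it can reposition from a saved state without a dictionary seek. */
final class CountingTermsEnum extends TermsEnumSketch {
  private final TreeSet<String> terms;
  private String current;
  int dictionarySeeks;

  CountingTermsEnum(String... terms) {
    this.terms = new TreeSet<>(Arrays.asList(terms));
  }

  @Override boolean seekCeil(String term) {
    dictionarySeeks++;
    current = terms.ceiling(term);
    return term.equals(current);
  }

  @Override void seekExact(String term, Object state) {
    current = (String) state; // reposition directly, no dictionary seek
  }

  @Override Object termState() {
    return current;
  }
}

/** Filter that delegates seekCeil only, so the base class defaults apply to the rest. */
class PartialFilterTermsEnum extends TermsEnumSketch {
  protected final TermsEnumSketch in;

  PartialFilterTermsEnum(TermsEnumSketch in) {
    this.in = in;
  }

  @Override boolean seekCeil(String term) {
    return in.seekCeil(term);
  }
}

/** Filter that also delegates seekExact and termState. */
final class FullFilterTermsEnum extends PartialFilterTermsEnum {
  FullFilterTermsEnum(TermsEnumSketch in) {
    super(in);
  }

  @Override boolean seekExact(String term) { return in.seekExact(term); }
  @Override void seekExact(String term, Object state) { in.seekExact(term, state); }
  @Override Object termState() { return in.termState(); }
}

final class FilterTermsEnumDemo {
  /** Emulates a term query: look the term up, keep its state, then reposition with that state. */
  static int seeksForTermQuery(boolean delegateAll) {
    CountingTermsEnum codecEnum = new CountingTermsEnum("lucene", "solr");
    TermsEnumSketch filter =
        delegateAll ? new FullFilterTermsEnum(codecEnum) : new PartialFilterTermsEnum(codecEnum);
    filter.seekExact("solr");
    Object state = filter.termState();
    filter.seekExact("solr", state);
    return codecEnum.dictionarySeeks;
  }

  public static void main(String[] args) {
    System.out.println("partial delegation: " + seeksForTermQuery(false) + " seeks");
    System.out.println("full delegation: " + seeksForTermQuery(true) + " seeks");
  }
}
```

In this sketch the partially delegating filter performs two dictionary seeks 
for one term lookup, while the fully delegating one performs a single seek 
and reuses the saved state.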


was (Author: bruno.roustant):
Actually there is also another related issue with this 
FilterLeafReader#FilterTermsEnum delegate pattern.

It does not delegate termState() nor seekExact(ByteRef, TermState) methods. 
Which means the termState is never used, so the term queries repeat twice the 
same seek (seekCeil) instead of using the termState to improve performance 
(normally the termState is kept by TermContext#build()).

Practical example: When one configures a timeout for queries, internally a 
ExitableDirectoryReader is created. And its ExitableTermsEnum, which extends 
FilterTermsEnum, makes all term queries repeat twice the same seekCeil().

> Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
> --
>
> Key: LUCENE-8292
> URL: https://issues.apache.org/jira/browse/LUCENE-8292
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.2.1
>    Reporter: Bruno Roustant
>Priority: Major
> Fix For: trunk
>
> Attachments: 
> 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, 
> LUCENE-8292.patch
>
>
> FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many 
> methods.
> It misses some seekExact() methods, so the delegate cannot override these 
> methods to provide specific behavior (unlike the TermsEnum API, which allows 
> that).
> The fix is straightforward: simply override these seekExact() methods and 
> delegate.




[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-15 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475579#comment-16475579
 ] 

Bruno Roustant commented on LUCENE-8292:


Actually there is also another related issue with this 
FilterLeafReader#FilterTermsEnum delegate pattern.

It delegates neither termState() nor seekExact(BytesRef, TermState). 
This means the termState is never used, so term queries repeat the same seek 
(seekCeil) twice instead of using the termState to improve performance 
(normally the termState is kept by TermContext#build()).

Practical example: When one configures a timeout for queries, internally an 
ExitableDirectoryReader is created. And its ExitableTermsEnum, which extends 
FilterTermsEnum, makes all term queries repeat the same seekCeil() twice.

> Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
> --
>
> Key: LUCENE-8292
> URL: https://issues.apache.org/jira/browse/LUCENE-8292
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.2.1
>    Reporter: Bruno Roustant
>Priority: Major
> Fix For: trunk
>
> Attachments: 
> 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, 
> LUCENE-8292.patch
>
>
> FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many 
> methods.
> It misses some seekExact() methods, so the delegate cannot override these 
> methods to provide specific behavior (unlike the TermsEnum API, which allows 
> that).
> The fix is straightforward: simply override these seekExact() methods and 
> delegate.




[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-07 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465887#comment-16465887
 ] 

Bruno Roustant commented on LUCENE-8292:


[~dsmiley], if I create a subclass of FilterTermsEnum to override seekExact, 
how can I make other classes in Lucene create this subclass instead of 
FilterTermsEnum? Would I have to also override other classes or other factories?

> Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
> --
>
> Key: LUCENE-8292
> URL: https://issues.apache.org/jira/browse/LUCENE-8292
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.2.1
>    Reporter: Bruno Roustant
>Priority: Major
> Fix For: trunk
>
> Attachments: 
> 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, 
> LUCENE-8292.patch
>
>
> FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many 
> methods.
> It misses some seekExact() methods, so the delegate cannot override these 
> methods to provide specific behavior (unlike the TermsEnum API, which allows 
> that).
> The fix is straightforward: simply override these seekExact() methods and 
> delegate.




[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-07 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465767#comment-16465767
 ] 

Bruno Roustant commented on LUCENE-8292:


I just realized that the current no-default-override behavior is actually 
enforced by a test, TestFilterLeafReader.testOverrideMethods.

I still think all methods should be overridden, but I understand that this may 
not be the currently expected behavior.

> Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
> --
>
> Key: LUCENE-8292
> URL: https://issues.apache.org/jira/browse/LUCENE-8292
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.2.1
>    Reporter: Bruno Roustant
>Priority: Major
> Fix For: trunk
>
> Attachments: 
> 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, 
> LUCENE-8292.patch
>
>
> FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many 
> methods.
> It misses some seekExact() methods, so the delegate cannot override these 
> methods to provide specific behavior (unlike the TermsEnum API, which allows 
> that).
> The fix is straightforward: simply override these seekExact() methods and 
> delegate.



