Re: Any recommended issues to work on for a newcomer?

2024-05-18 Thread Chang Hank
Hey Michael,

I wrote a first version of my idea for implementing RRF in Lucene. Here is the 
link to the code: 
https://gist.github.com/hack4chang/ee2b37eab80bd82e574ff4f94ed204e9
I have two questions right now: one is about which shardIndex to return, the 
other is about the TotalHits value. Please take a look at the code and kindly 
leave some comments below.

Thanks,
Hank
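
For readers following the thread, here is a minimal sketch of Reciprocal Rank
Fusion along the lines of the `rrf(topN, k, hits)` signature suggested in the
quoted discussion. It is not the code from the gist and not a Lucene API:
`RrfSketch`, `Hit`, and `FusedHit` are hypothetical names, and the plain
records stand in for Lucene's `ScoreDoc`, whose `shardIndex` and `doc` fields
together identify a document. It assumes ranks start at 1 and that ties are
broken by shard index and then doc id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RrfSketch {
    /** Hypothetical stand-in for a hit: (shardIndex, doc) identifies one document. */
    public record Hit(int shardIndex, int doc) {}

    /** A document together with its fused RRF score. */
    public record FusedHit(Hit hit, double score) {}

    /**
     * Fuses several ranked lists. A document at 1-based rank r in a list
     * contributes 1 / (k + r); contributions are summed across lists.
     */
    public static List<FusedHit> rrf(int topN, int k, List<List<Hit>> rankedLists) {
        Map<Hit, Double> scores = new HashMap<>();
        for (List<Hit> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<FusedHit> fused = new ArrayList<>();
        scores.forEach((hit, score) -> fused.add(new FusedHit(hit, score)));
        // Highest score first; ties broken by shard index, then doc id (an assumption).
        fused.sort((a, b) -> {
            int c = Double.compare(b.score(), a.score());
            if (c != 0) return c;
            c = Integer.compare(a.hit().shardIndex(), b.hit().shardIndex());
            return c != 0 ? c : Integer.compare(a.hit().doc(), b.hit().doc());
        });
        return new ArrayList<>(fused.subList(0, Math.min(topN, fused.size())));
    }
}
```

With k = 60 and the two lists [(0,1), (0,2)] and [(0,2), (1,1)], document
(0,2) appears in both lists and therefore comes out on top. The sketch leaves
open what the fused result should report as its total hit count.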

> On May 18, 2024, at 2:01 PM, Chang Hank  wrote:
> 
> Or maybe we can first create an issue and PR based on the issue number?
> WDYT?
> 
> Best,
> 
> Hank
> 
>> On May 18, 2024, at 11:29 AM, Chang Hank  wrote:
>> 
>> Hey Michael, 
>> 
>> Sorry I was a bit busy this week, but I’ve looked into the resources you 
>> provided and also some useful advice from Alessandro and Adrien.
>> 
>> I have a brief understanding of how RRF works, but I’m not quite sure how 
>> we should implement it. Based on the advice from Alessandro and Adrien, it 
>> seems we need to consider that the search results are located at different 
>> shards. According to Alessandro, we should aggregate the ranked lists from 
>> all distributed nodes and then apply RRF.
>> Are we going to implement this aggregation logic inside our RRF method? 
>> 
>> Also could you please create a PR so we can discuss more details further?
>> 
>> All the best,
>> 
>> Hank
>> 
>>> On May 13, 2024, at 10:09 AM, Michael Wechner  
>>> wrote:
>>> 
>>> Great, sounds like we have a plan :-)
>>> 
>>> Hank and I can get started trying to understand the internals better ...
>>> 
>>> Thanks
>>> 
>>> Michael
>>> 
>>> On 13.05.24 at 18:21, Alessandro Benedetti wrote:
>>>> Sure, we can make it work but in a distributed environment you have to run 
>>>> first each query distributed (aggregating all nodes) and then RRF on top 
>>>> of the aggregated ranked lists.
>>>> Doing RRF per node first and then aggregate per shard won't return the 
>>>> same results I suspect.
>>>> When I go back to working on the task I'll be able to elaborate more!
>>>> 
>>>> Cheers
>>>> --
>>>> Alessandro Benedetti
>>>> Director @ Sease Ltd.
>>>> Apache Lucene/Solr Committer
>>>> Apache Solr PMC Member
>>>> 
>>>> e-mail: a.benede...@sease.io
>>>> 
>>>> Sease - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>> 
>>>> Website: Sease.io
>>>> LinkedIn | Twitter | Youtube | Github
>>>> 
>>>> On Mon, 13 May 2024 at 14:12, Adrien Grand wrote:
>>>>> > Maybe Adrien Grand and others might also have some feedback :-)
>>>>> 
>>>>> I'd suggest the signature to look something like `TopDocs TopDocs#rrf(int 
>>>>> topN, int k, TopDocs[] hits)` to be consistent with `TopDocs#merge`. 
>>>>> Internally, it should look at `ScoreDoc#shardId` and `ScoreDoc#doc` to 
>>>>> figure out which hits map to the same document.
>>>>> 
>>>>> > Back in the day, I was reasoning on this and I didn't think Lucene was 
>>>>> > the right place for an interleaving algorithm, given that Reciprocal 
>>>>> > Rank Fusion is affected by distribution and it's not supposed to work 
>>>>> > per node.
>>>>> 
>>>>> To me this is like `TopDocs#merge`. There are changes needed on the 
>>>>> application side to hook this call into the logic that combines hits that 
>>>>> come from multiple shards (multiple queries in the case of RRF), but 
>>>>> Lucene can still provide the merging logic.
>>>>> 
>>>>> On Mon, May 13, 2024 at 1:41 PM Michael Wechner 
>>>>> <michael.wech...@wyona.com> wrote:
>>>>>> Thanks for your feedback Alessandro!
>>>>>> 
>>>>>> I am using Lucene independent of Solr or OpenSearch, Elasticsearch, but 
>>>>>> would like to combine different result sets using RRF, therefore think 
>>>>>> that Lucene itself could be a good place actually.
>>>>>> 
>>>>>> Looking forward to your additional elaboration!
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> Michael
>>>>>> 
>>>>>>> On 13.05.2024 at 12:34, Alessandro Benedetti 
>>>>>>> <a.benede...@sease.io> wrote:
>>>>>>> 
>>>>>>> This is not strictly related to Lucene, but I'll give a talk at Berlin 
>>>>>>> Buzzwords on how I am implementing Reciprocal Rank Fusion in Apache 
>>>>>>> Solr.
>>>>>>> I'll resume my work on the contribution next week and have more to 
>>>>>>> share later.
>>>>>>> 
>>>>>>> Back in the day, I was reasoning on this and I didn't think Lucene was 
>>>>>>> the right place for an interleaving algorithm, given that Reciprocal 
>>>>>>> Rank Fusion is affected by distribution and it's not supposed to work 
>>>>>>> per node.
>>>>>>> I think I evaluated the possibility of doing it as a Lucene query or a 
>>>>>>> Lucene component but then ended up with a different approach.
>>>>>>> I'll elaborate more when I go back to the task!
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 

waiting for a PR review regarding the FieldHighlighter.

2024-05-18 Thread 쿨해머
Hello. I have submitted a PR that allows users to decide the final sorting
criteria for passages in the FieldHighlighter. If anyone is interested,
please take a look. I will leave the PR link below.

https://github.com/apache/lucene/pull/13276


Re: How much is ja.dict.UserDictionary used?

2024-05-18 Thread Michael Sokolov
We use it at Amazon. I can't really read it so I'm not sure, but I think
it's used to encode terms that come up that aren't handled well by the
standard dictionary.

On Sat, May 18, 2024 at 8:39 AM Bruno Roustant  wrote:
>
> Hi,
>
> While looking at the various usages of Map with Integer keys, I found 
> ja.dict.UserDictionary with its lookup() method where there is a TODO: can we 
> avoid this treemap/toIndexArray?
>
> I could propose something, but I would like to know how much it is used, and 
> if it is worth improving it.
>
> Thanks
>
> Bruno




Re: Any recommended issues to work on for a newcomer?

2024-05-18 Thread Chang Hank
Or maybe we can first create an issue and PR based on the issue number?
WDYT?

Best,

Hank

> On May 18, 2024, at 11:29 AM, Chang Hank  wrote:
> 
> Hey Michael, 
> 
> Sorry I was a bit busy this week, but I’ve looked into the resources you 
> provided and also some useful advice from Alessandro and Adrien.
> 
> I have a brief understanding of how RRF works, but I’m not quite sure how 
> we should implement it. Based on the advice from Alessandro and Adrien, it 
> seems we need to consider that the search results are located at different 
> shards. According to Alessandro, we should aggregate the ranked lists from 
> all distributed nodes and then apply RRF.
> Are we going to implement this aggregation logic inside our RRF method? 
> 
> Also could you please create a PR so we can discuss more details further?
> 
> All the best,
> 
> Hank
> 
>> On May 13, 2024, at 10:09 AM, Michael Wechner  
>> wrote:
>> 
>> Great, sounds like we have a plan :-)
>> 
>> Hank and I can get started trying to understand the internals better ...
>> 
>> Thanks
>> 
>> Michael
>> 
>> On 13.05.24 at 18:21, Alessandro Benedetti wrote:
>>> Sure, we can make it work but in a distributed environment you have to run 
>>> first each query distributed (aggregating all nodes) and then RRF on top of 
>>> the aggregated ranked lists.
>>> Doing RRF per node first and then aggregate per shard won't return the same 
>>> results I suspect.
>>> When I go back to working on the task I'll be able to elaborate more!
>>> 
>>> Cheers
>>> --
>>> Alessandro Benedetti
>>> Director @ Sease Ltd.
>>> Apache Lucene/Solr Committer
>>> Apache Solr PMC Member
>>> 
>>> e-mail: a.benede...@sease.io 
>>> 
>>> 
>>> Sease - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>> 
>>> Website: Sease.io 
>>> LinkedIn  | Twitter 
>>>  | Youtube 
>>>  | Github 
>>> 
>>> 
>>> On Mon, 13 May 2024 at 14:12, Adrien Grand wrote:
>>>> > Maybe Adrien Grand and others might also have some feedback :-)
>>>> 
>>>> I'd suggest the signature to look something like `TopDocs TopDocs#rrf(int 
>>>> topN, int k, TopDocs[] hits)` to be consistent with `TopDocs#merge`. 
>>>> Internally, it should look at `ScoreDoc#shardId` and `ScoreDoc#doc` to 
>>>> figure out which hits map to the same document.
>>>> 
>>>> > Back in the day, I was reasoning on this and I didn't think Lucene was 
>>>> > the right place for an interleaving algorithm, given that Reciprocal 
>>>> > Rank Fusion is affected by distribution and it's not supposed to work 
>>>> > per node.
>>>> 
>>>> To me this is like `TopDocs#merge`. There are changes needed on the 
>>>> application side to hook this call into the logic that combines hits that 
>>>> come from multiple shards (multiple queries in the case of RRF), but 
>>>> Lucene can still provide the merging logic.
>>>> 
>>>> On Mon, May 13, 2024 at 1:41 PM Michael Wechner wrote:
>>>>> Thanks for your feedback Alessandro!
>>>>> 
>>>>> I am using Lucene independent of Solr or OpenSearch, Elasticsearch, but 
>>>>> would like to combine different result sets using RRF, therefore think 
>>>>> that Lucene itself could be a good place actually.
>>>>> 
>>>>> Looking forward to your additional elaboration!
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Michael
>>>>> 
>>>>>> On 13.05.2024 at 12:34, Alessandro Benedetti 
>>>>>> <a.benede...@sease.io> wrote:
>>>>>> 
>>>>>> This is not strictly related to Lucene, but I'll give a talk at Berlin 
>>>>>> Buzzwords on how I am implementing Reciprocal Rank Fusion in Apache Solr.
>>>>>> I'll resume my work on the contribution next week and have more to share 
>>>>>> later.
>>>>>> 
>>>>>> Back in the day, I was reasoning on this and I didn't think Lucene was 
>>>>>> the right place for an interleaving algorithm, given that Reciprocal 
>>>>>> Rank Fusion is affected by distribution and it's not supposed to work 
>>>>>> per node.
>>>>>> I think I evaluated the possibility of doing it as a Lucene query or a 
>>>>>> Lucene component but then ended up with a different approach.
>>>>>> I'll elaborate more when I go back to the task!
>>>>>> 
>>>>>> Cheers
>>>>>> --
>>>>>> Alessandro Benedetti
>>>>>> Director @ Sease Ltd.
>>>>>> Apache Lucene/Solr Committer
>>>>>> Apache Solr PMC Member
>>>>>> 
>>>>>> e-mail: a.benede...@sease.io
>>>>>> 
>>>>>> Sease - Information Retrieval Applied
>>>>>> Consulting | Training | Open Source
>>>>>> 
>>>>>> Website: Sease.io
>>>>>> LinkedIn | Twitter | Youtube | Github

Re: Any recommended issues to work on for a newcomer?

2024-05-18 Thread Chang Hank
Hey Michael, 

Sorry I was a bit busy this week, but I’ve looked into the resources you 
provided and also some useful advice from Alessandro and Adrien.

I have a brief understanding of how RRF works, but I’m not quite sure how we 
should implement it. Based on the advice from Alessandro and Adrien, it seems 
we need to consider that the search results are located at different shards. 
According to Alessandro, we should aggregate the ranked lists from all 
distributed nodes and then apply RRF.
Are we going to implement this aggregation logic inside our RRF method? 

Also could you please create a PR so we can discuss more details further?

All the best,

Hank
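
A small worked example may help show why the order of the two steps matters.
The sketch below is an illustration under invented assumptions, not Lucene
code: the class `RrfOrderSketch`, the doc ids b, c, d, e, and the shard layout
(b alone on shard 0, the rest on shard 1) are all hypothetical. It compares
fusing the globally aggregated per-query lists with fusing on each shard first
and then merging by fused score.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RrfOrderSketch {
    /** Sums 1 / (k + rank) per document over all ranked lists (rank starts at 1). */
    public static Map<String, Double> rrf(int k, List<List<String>> rankedLists) {
        Map<String, Double> scores = new TreeMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores;
    }

    /** Returns the document with the highest fused score. */
    public static String best(Map<String, Double> scores) {
        return Collections.max(scores.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    /** Aggregate first: each query's list is already merged across both shards. */
    public static String aggregateThenFuse(int k) {
        List<String> queryA = List.of("c", "d", "e", "b");
        List<String> queryB = List.of("d", "e", "c", "b");
        return best(rrf(k, List.of(queryA, queryB)));
    }

    /** Fuse on each shard separately, then merge the shards by fused score. */
    public static String fusePerShardThenMerge(int k) {
        // Shard 0 only holds b, so b is rank 1 for both queries on that shard.
        Map<String, Double> shard0 = rrf(k, List.of(List.of("b"), List.of("b")));
        Map<String, Double> shard1 =
            rrf(k, List.of(List.of("c", "d", "e"), List.of("d", "e", "c")));
        Map<String, Double> merged = new TreeMap<>(shard0);
        merged.putAll(shard1); // doc ids are disjoint across shards in this example
        return best(merged);
    }
}
```

With k = 60, aggregating first puts d on top and b last, while fusing per shard
first puts b on top, because b's local rank of 1 on its own shard overstates
its global rank of 4.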

> On May 13, 2024, at 10:09 AM, Michael Wechner  
> wrote:
> 
> Great, sounds like we have a plan :-)
> 
> Hank and I can get started trying to understand the internals better ...
> 
> Thanks
> 
> Michael
> 
> On 13.05.24 at 18:21, Alessandro Benedetti wrote:
>> Sure, we can make it work but in a distributed environment you have to run 
>> first each query distributed (aggregating all nodes) and then RRF on top of 
>> the aggregated ranked lists.
>> Doing RRF per node first and then aggregate per shard won't return the same 
>> results I suspect.
>> When I go back to working on the task I'll be able to elaborate more!
>> 
>> Cheers
>> --
>> Alessandro Benedetti
>> Director @ Sease Ltd.
>> Apache Lucene/Solr Committer
>> Apache Solr PMC Member
>> 
>> e-mail: a.benede...@sease.io 
>> 
>> 
>> Sease - Information Retrieval Applied
>> Consulting | Training | Open Source
>> 
>> Website: Sease.io 
>> LinkedIn  | Twitter 
>>  | Youtube 
>>  | Github 
>> 
>> 
>> On Mon, 13 May 2024 at 14:12, Adrien Grand wrote:
>>> > Maybe Adrien Grand and others might also have some feedback :-)
>>> 
>>> I'd suggest the signature to look something like `TopDocs TopDocs#rrf(int 
>>> topN, int k, TopDocs[] hits)` to be consistent with `TopDocs#merge`. 
>>> Internally, it should look at `ScoreDoc#shardId` and `ScoreDoc#doc` to 
>>> figure out which hits map to the same document.
>>> 
>>> > Back in the day, I was reasoning on this and I didn't think Lucene was 
>>> > the right place for an interleaving algorithm, given that Reciprocal Rank 
>>> > Fusion is affected by distribution and it's not supposed to work per node.
>>> 
>>> To me this is like `TopDocs#merge`. There are changes needed on the 
>>> application side to hook this call into the logic that combines hits that 
>>> come from multiple shards (multiple queries in the case of RRF), but Lucene 
>>> can still provide the merging logic.
>>> 
>>> On Mon, May 13, 2024 at 1:41 PM Michael Wechner wrote:
>>>> Thanks for your feedback Alessandro!
>>>> 
>>>> I am using Lucene independent of Solr or OpenSearch, Elasticsearch, but 
>>>> would like to combine different result sets using RRF, therefore think 
>>>> that Lucene itself could be a good place actually.
>>>> 
>>>> Looking forward to your additional elaboration!
>>>> 
>>>> Thanks
>>>> 
>>>> Michael
>>>> 
>>>>> On 13.05.2024 at 12:34, Alessandro Benedetti wrote:
>>>>> 
>>>>> This is not strictly related to Lucene, but I'll give a talk at Berlin 
>>>>> Buzzwords on how I am implementing Reciprocal Rank Fusion in Apache Solr.
>>>>> I'll resume my work on the contribution next week and have more to share 
>>>>> later.
>>>>> 
>>>>> Back in the day, I was reasoning on this and I didn't think Lucene was 
>>>>> the right place for an interleaving algorithm, given that Reciprocal Rank 
>>>>> Fusion is affected by distribution and it's not supposed to work per node.
>>>>> I think I evaluated the possibility of doing it as a Lucene query or a 
>>>>> Lucene component but then ended up with a different approach.
>>>>> I'll elaborate more when I go back to the task!
>>>>> 
>>>>> Cheers
>>>>> --
>>>>> Alessandro Benedetti
>>>>> Director @ Sease Ltd.
>>>>> Apache Lucene/Solr Committer
>>>>> Apache Solr PMC Member
>>>>> 
>>>>> e-mail: a.benede...@sease.io
>>>>> 
>>>>> Sease - Information Retrieval Applied
>>>>> Consulting | Training | Open Source
>>>>> 
>>>>> Website: Sease.io
>>>>> LinkedIn | Twitter | Youtube | Github
>>>>> 
>>>>> On Sat, 11 May 2024 at 09:10, Michael Wechner wrote:
>>>>>> sure, no problem!
>>>>>> 
>>>>>> Maybe Adrien Grand and others might also have some feedback :-)
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> Michael
>>>>>> 

Join module dependency

2024-05-18 Thread Bruno Roustant
The facet module has a dependency on com.carrotsearch:hppc.

Is it possible to add the same dependency to the join module? What is the
rule for adding module dependencies?

Thanks

Bruno


How much is ja.dict.UserDictionary used?

2024-05-18 Thread Bruno Roustant
Hi,

While looking at the various usages of Map with Integer keys, I found
ja.dict.UserDictionary with its lookup() method where there is a *TODO: can
we avoid this treemap/toIndexArray?*

I could propose something, but I would like to know how much it is used,
and if it is worth improving it.

Thanks

Bruno