Bugs with Re-ranking/LtR and ExplainAugmenterFactory

2019-01-11 Thread Sambhav Kothari (BLOOMBERG/ LONDON)
Hello,

Currently, if we use the ExplainAugmenterFactory with LtR, it uses the default 
query explain (the tf-idf explanation) instead of the model's/re-ranker's 
explain method. This happens because BasicResultContext does not wrap the 
query with the RankQuery when it is set as the context's query
(https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/java/org/apache/solr/response/BasicResultContext.java#L67),
and that query is then used by the ExplainAugmenterFactory
(https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/java/org/apache/solr/response/transform/ExplainAugmenterFactory.java#L111).
 

As a result, there are discrepancies between queries like:

http://localhost:8983/solr/collection1/select?q=*:*&collection=collectionName&wt=json&fl=[explain style=nl],score&rq={!ltr model=linear-model}

http://localhost:8983/solr/collection1/select?q=*:*&collection=collectionName&wt=json&fl=score&rq={!ltr model=linear-model}&debugQuery=true

The former outputs the SimilarityScorer's explain, while the latter correctly 
uses the LtR ModelScorer's explain.
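
To compare the two requests from a script, here is a minimal sketch that builds 
both URLs with the parameters URL-encoded. The host, collection name and model 
name are the placeholders from the URLs above, and build_url is a hypothetical 
helper introduced only for this illustration.

```python
from urllib.parse import urlencode

# Placeholder endpoint taken from the example URLs above.
BASE = "http://localhost:8983/solr/collection1/select"


def build_url(extra):
    """Return a select URL with the shared LtR re-rank params plus `extra`."""
    params = {
        "q": "*:*",
        "collection": "collectionName",
        "wt": "json",
        "rq": "{!ltr model=linear-model}",
    }
    params.update(extra)
    return BASE + "?" + urlencode(params)


# Explain requested through the [explain] doc transformer.
augmenter_url = build_url({"fl": "[explain style=nl],score"})

# Explain requested through debugQuery.
debug_url = build_url({"fl": "score", "debugQuery": "true"})

print(augmenter_url)
print(debug_url)
```

Fetching both URLs and comparing the explain output for the same document 
should show the discrepancy described above.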

There are a few other problems with the explain augmenter. For example, it 
doesn't work with grouping (although other doc transformers, such as LtR's 
LTRFeatureLoggerTransformerFactory, do work with grouping).

Just wanted to discuss these issues before creating tickets on Jira.

Thanks,
Sam

Help with multi-lang searches

2018-10-22 Thread Sambhav Kothari (BLOOMBERG/ LONDON)
Hi,

We have a problem with searches with multiple languages.
Our schema looks something like this:


field_en = English content for the field
field_es = Spanish content
field_it = Italian content
etc.


When a user searches for a keyword, e.g. "brexit", they can also specify 
several languages they want to see in the response, and the query is then run 
against all the requested fields. 

The issue is that for "brexit" the Italian results are boosted more: a term 
like "Brexit" is rare in Italian content, so its idf in field_it shoots up, 
causing less relevant Italian docs to rank higher than the English ones.
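
To make the effect concrete, here is a small sketch of the arithmetic. It 
assumes a BM25-style idf, log(1 + (N - n + 0.5) / (n + 0.5)), and the document 
counts are invented purely for illustration; they are not taken from our index.

```python
import math


def idf(doc_freq, doc_count):
    """BM25-style idf: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics per field:
# (docs containing "brexit", docs that have the field at all).
stats = {
    "field_en": (50_000, 1_000_000),
    "field_it": (20, 200_000),
}

for field, (df, n) in stats.items():
    print(f"{field}: idf = {idf(df, n):.2f}")
```

With these made-up numbers the Italian field's idf comes out roughly three 
times the English one, so a single match in field_it can outweigh a much more 
relevant match in field_en.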

Is there some way to deal with this problem?

The current solutions we can think of:

1. Create a catch-all copyField and use that to score the docs. But this 
creates problems when a word is present in another language (e.g. English) 
and not in the language of the resulting document (Italian). We would also 
pay for the extra disk space of the copyField and run into problems with 
analysis for multiple languages.
2. Create a new scorer called "IDFGroupScorer" that wraps multiple fields and 
computes an aggregate idf (by averaging or taking the min/max) across the 
fields in the group. 
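
The aggregation in option 2 could look roughly like the sketch below. It only 
illustrates the arithmetic, not an actual Lucene/Solr scorer: group_idf is a 
hypothetical helper, the idf formula is assumed to be BM25-style, and the 
per-field statistics are invented.

```python
import math
from statistics import mean


def idf(doc_freq, doc_count):
    """BM25-style idf: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def group_idf(field_stats, combine=min):
    """Combine the per-field idf values into one value shared by the group.

    field_stats maps field name -> (doc_freq, doc_count) for the term.
    combine is any function over a list of floats, e.g. min, max or mean.
    """
    values = [idf(df, n) for df, n in field_stats.values()]
    return combine(values)


# Hypothetical statistics for the term "brexit" in each language field.
stats = {
    "field_en": (50_000, 1_000_000),
    "field_es": (3_000, 300_000),
    "field_it": (20, 200_000),
}

for name, fn in (("min", min), ("mean", mean), ("max", max)):
    print(f"{name}: {group_idf(stats, fn):.2f}")
```

Every field in the group would then be scored with the same idf, so a term 
that is rare in only one language no longer dominates the ranking. Which of 
min, mean or max behaves best is an open question.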

Any thoughts on any other solutions or any suggestions on how we could possibly 
implement the IDFGroupScorer? 

Thanks,

Sambhav