[ 
https://issues.apache.org/jira/browse/SOLR-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750298#comment-17750298
 ] 

Doug Turnbull commented on SOLR-16894:
--------------------------------------

Hey, thanks all. Sorry it's taken a while to get back. I fixed the links in the 
description.

After discovering SOLR-8311 and learning about lifecycles and why these 
safeguards exist, I went with what is probably the correct solution and built a 
specific field type for this.

Here is the code (WIP): [https://github.com/softwaredoug/managed-stats]

> Configurable doc freq: Allow StatsCache instances to be ResourceLoaderAware
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-16894
>                 URL: https://issues.apache.org/jira/browse/SOLR-16894
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Doug Turnbull
>            Priority: Major
>
> I had been working on a plugin to allow the document frequency stats to be 
> controlled by the user. This has precedent in other search engines where 
> another corpus is more representative of a term's true document frequency / 
> significance. Specifically, [Vespa lets you pass significance at query 
> time|https://docs.vespa.ai/en/reference/query-language-reference.html#annotations].
> From the Vespa query docs linked above:
> > Significance - Significance value for text ranking features - see [text 
> > matching and 
> > ranking.|https://docs.vespa.ai/en/text-matching-ranking.html#weight-significance-and-connectedness]
> From the linked page:
> > Significance - An indication of how rare a term is in the corpus of the 
> > language, used by a number of text matching [rank 
> > features|https://docs.vespa.ai/en/reference/rank-features.html]. This can 
> > be set explicitly for each term in [the 
> > query|https://docs.vespa.ai/en/reference/query-language-reference.html#significance]
>  
> This doesn't just apply to how doc freq is represented, but to the entire set 
> of stats: total term freq, etc.
> This is a common pain point in test corpuses where you have a smaller sample 
> of the documents than the global corpus. It was a frequent bugbear at 
> Shopify, and now at my current employer, for doing relevance testing. It's 
> also a problem whenever you have a corpus that may include some "outliers" 
> that actually aren't outliers in the sense of how your users perceive your 
> corpus. For example, "headache" may not be the jargon used in a medical 
> textbook, so it is rare there just by happenstance. Yet searchers still 
> perceive it as a not very significant term.
> I had made some progress 
> ([here|https://github.com/softwaredoug/managed-stats]). However, I noticed 
> that only certain types of classes can be ResourceLoaderAware in order to 
> read configuration. Specifically, I see this error when running my tests:
>  
> {code:java}
> ./gradlew --stacktrace --info test 
> {code}
> {code:java}
> org.apache.solr.common.SolrException: Invalid 'Aware' object: 
> manual.idf.stats.ManagedStatsCache@5c19c030 -- 
> org.apache.lucene.util.ResourceLoaderAware must be an instance of: 
> [org.apache.lucene.analysis.CharFilterFactory] 
> [org.apache.lucene.analysis.TokenFilterFactory] 
> [org.apache.lucene.analysis.TokenizerFactory] 
> [org.apache.solr.search.QParserPlugin] 
> [org.apache.solr.schema.FieldType]{code}
>  
>  
> Can I propose we add the StatsCache to the list of allowed 
> ResourceLoaderAware objects?
> Some alternatives I've thought about:
>  * I probably can do some ugly hacks to work around this, but I'd rather do 
> the "right thing"
>  * I'd prefer not to create a separate fieldtype that changes how the stats 
> are managed. For one, in my specific case, I don't want to have a radically 
> different test config compared to my setup. This is still "text" with 
> text-like configurability.
>  ** Second, I like the ability with the stats cache to "fall back" to an 
> internal stat if one is missing.
>  * Pass at query time - this is a more radical change similar to what it 
> would take to make BM25 params configurable at query time
>  * It's possible I could create a Similarity to change doc freq; however, it 
> too would apparently not be ResourceLoaderAware.
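
For illustration only, here is a minimal sketch of the "fall back" behaviour mentioned in the description above: use a user-supplied document frequency when one is configured for a term, otherwise keep the index's own value. It uses plain Java with no Solr or Lucene dependencies. `DocFreqOverrides` and its methods are hypothetical names, not classes from Solr or from the managed-stats repository, and the IDF formula is assumed to be the one used by Lucene's BM25Similarity.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of Solr or managed-stats. */
final class DocFreqOverrides {
    // Term -> externally supplied document frequency (e.g. from a larger corpus).
    private final Map<String, Long> overrides;

    DocFreqOverrides(Map<String, Long> overrides) {
        this.overrides = Map.copyOf(overrides);
    }

    /** Returns the configured doc freq for the term, or the index's own value if none is set. */
    long docFreq(String term, long indexDocFreq) {
        return overrides.getOrDefault(term, indexDocFreq);
    }

    /** BM25-style IDF, assumed here to be ln(1 + (N - n + 0.5) / (n + 0.5)). */
    static double bm25Idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }
}
```

With an override that raises the doc freq of "headache", its IDF drops, so the term counts as less significant, while terms without an override keep the index statistic. As the description notes, a real implementation would need to treat the other statistics (doc count, total term freq) consistently, which this sketch does not attempt.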



