[ 
https://issues.apache.org/jira/browse/SOLR-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17743476#comment-17743476
 ] 

Doug Turnbull commented on SOLR-16894:
--------------------------------------

Actually, now that I have seen SOLR-8311, I can see the problem that makes 
adding StatsCache to the ResourceLoaderAware list infeasible.

The StatsCache's lifecycle appears to be tied to searchers, whereas other 
resource loader plugins seem to be instantiated with the core itself. Still 
investigating.

> Configurable doc freq: Allow StatsCache instances to be ResourceLoaderAware
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-16894
>                 URL: https://issues.apache.org/jira/browse/SOLR-16894
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Doug Turnbull
>            Priority: Major
>
> I had been working on a plugin to allow the document frequency stats to be 
> controlled by the user. This has precedent in other search engines where 
> another corpus is more representative of a term's true document frequency / 
> significance. Specifically, [Vespa lets you pass significance at query 
> time|#significance]. This applies not just to how doc freq is 
> represented, but to the entire set of stats, including total term freq, etc.
> This is a common pain point in test corpora where you have a smaller sample 
> of the documents than the global corpus. It was a frequent bugbear at 
> Shopify, and now at my current employer, for doing relevance testing. It's 
> also a problem whenever you have a corpus that may include some "outliers" 
> that actually aren't outliers in the sense of how your users perceive your 
> corpus. For example, "headache" may not be the jargon used in a medical 
> textbook, so it is rare there only by happenstance, yet searchers still 
> perceive it as a not very significant term.
> I had made some progress ([here|http://example.com/]); however, I noticed 
> that only certain types of classes can be ResourceLoaderAware in order to 
> read configuration. Specifically, I see this error when running my tests:
>  
> {code:java}
> ./gradlew --stacktrace --info test 
> {code}
> {code:java}
>                 org.apache.solr.common.SolrException: Invalid 'Aware' object: 
> manual.idf.stats.ManagedStatsCache@5c19c030 -- 
> org.apache.lucene.util.ResourceLoaderAware must be an instance of: 
> [org.apache.lucene.analysis.CharFilterFactory] 
> [org.apache.lucene.analysis.TokenFilterFactory] 
> [org.apache.lucene.analysis.TokenizerFactory] 
> [org.apache.solr.search.QParserPlugin] 
> [org.apache.solr.schema.FieldType]{code}
>  
>  
> Can I propose we add the StatsCache to the list of allowed 
> ResourceLoaderAware objects?
> Some alternatives I've thought about:
>  * I probably can do some ugly hacks to work around this, but I'd rather do 
> the "right thing"
>  * I'd prefer not to create a separate fieldtype that changes how the stats 
> are managed. For one, in my specific case, I don't want to have a 
> radically different test config compared to my setup. This is still "text" 
> with text-like configurability
>  ** Second I like the ability with the stats cache to "fall back" to an 
> internal stat if one is missing.
>  * Pass at query time - this is a more radical change similar to what it 
> would take to make BM25 params configurable at query time
>  * It's possible I could create a Similarity to change doc freq; however, it 
> too would not be ResourceLoaderAware, apparently.
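
For readers unfamiliar with the error quoted above, here is a rough sketch of the kind of allow-list check that would produce it. This is not Solr's actual code: every type below is a hypothetical stand-in defined inside the sketch so that it compiles on its own (Java 9+), the allow-list is shortened to two entries, and the exception type is a placeholder for SolrException.

```java
import java.util.List;

public class AwareCheckSketch {
    // Hypothetical stand-ins for the real Lucene/Solr types, so the sketch compiles alone.
    interface ResourceLoaderAware {}
    static class QParserPlugin {}
    static class FieldType {}
    static class StatsCache {}

    // On the allow-list, because it extends QParserPlugin.
    static class MyQParserPlugin extends QParserPlugin implements ResourceLoaderAware {}
    // Not on the allow-list: StatsCache is not one of the permitted base types.
    static class ManagedStatsCache extends StatsCache implements ResourceLoaderAware {}

    // Base types permitted to be ResourceLoaderAware (shortened for the sketch).
    static final List<Class<?>> ALLOWED = List.of(QParserPlugin.class, FieldType.class);

    // True if the plugin is an instance of at least one permitted base type.
    static boolean isAllowed(Object plugin) {
        for (Class<?> base : ALLOWED) {
            if (base.isInstance(plugin)) {
                return true;
            }
        }
        return false;
    }

    // Throws with a message shaped like the one in the issue when the check fails.
    static void assertAwareCompatibility(Object plugin) {
        if (isAllowed(plugin)) {
            return;
        }
        StringBuilder msg = new StringBuilder("Invalid 'Aware' object: ")
            .append(plugin.getClass().getSimpleName())
            .append(" -- ResourceLoaderAware must be an instance of:");
        for (Class<?> base : ALLOWED) {
            msg.append(" [").append(base.getSimpleName()).append("]");
        }
        throw new IllegalStateException(msg.toString());
    }

    public static void main(String[] args) {
        assertAwareCompatibility(new MyQParserPlugin()); // accepted, no exception
        try {
            assertAwareCompatibility(new ManagedStatsCache());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this reading, the proposal in the issue amounts to adding one more base type to the allow-list. The comment above suggests that this alone may not be enough if the StatsCache is created per searcher rather than with the core.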



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
