Doug Turnbull created SOLR-16894:
------------------------------------

             Summary: Configurable doc freq: Allow StatsCache instances to be 
ResourceLoaderAware
                 Key: SOLR-16894
                 URL: https://issues.apache.org/jira/browse/SOLR-16894
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Doug Turnbull


I have been working on a plugin that lets the user control the document 
frequency stats. This has precedent in other search engines, for cases where 
another corpus is more representative of a term's true document frequency / 
significance. Specifically, [Vespa lets you pass significance at query 
time|https://docs.vespa.ai/en/reference/query-language-reference.html#significance].
This doesn't just apply to how doc freq is represented, but to the entire set 
of stats, such as total term freq.

This is a common pain point with test corpora, where you have a smaller sample 
of the documents than the global corpus. It was a frequent bugbear at Shopify, 
and now at my current employer, for doing relevance testing. It's also a 
problem whenever you have a corpus that may include some "outliers" that 
aren't actually outliers in the sense of natural language.

I have made some progress ([here|http://example.com]). However, I noticed that 
only certain types of classes can be ResourceLoaderAware in order to read 
configuration. Specifically, I see this error when running my tests:

 
{code:bash}
./gradlew --stacktrace --info test
{code}
{code:java}
org.apache.solr.common.SolrException: Invalid 'Aware' object: manual.idf.stats.ManagedStatsCache@5c19c030 -- org.apache.lucene.util.ResourceLoaderAware must be an instance of: [org.apache.lucene.analysis.CharFilterFactory] [org.apache.lucene.analysis.TokenFilterFactory] [org.apache.lucene.analysis.TokenizerFactory] [org.apache.solr.search.QParserPlugin] [org.apache.solr.schema.FieldType]
{code}

Can I propose we add the StatsCache to the list of allowed ResourceLoaderAware 
objects?
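
To make the request concrete, here is a minimal, self-contained sketch of the kind of allowlist check that produces the error above. It does not use Solr's actual classes: `ResourceLoaderAware`, `QParserPlugin`, `StatsCache`, and `ManagedStatsCache` below are empty stand-ins, and `AwareAllowlist` is a hypothetical name invented for illustration. The point is only that widening the allowed types for an "Aware" interface is enough for a `StatsCache` subclass to pass the check.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-ins for illustration only; the real types live in Lucene and Solr.
interface ResourceLoaderAware {}
class QParserPlugin {}
class StatsCache {}
class ManagedStatsCache extends StatsCache implements ResourceLoaderAware {}

// Hypothetical allowlist: maps an "Aware" interface to the base types
// that are permitted to implement it.
final class AwareAllowlist {
    private final Map<Class<?>, List<Class<?>>> allowed = new HashMap<>();

    // Registers (or replaces) the permitted base types for an Aware interface.
    void allow(Class<?> aware, List<Class<?>> types) {
        allowed.put(aware, types);
    }

    // Returns true if obj is an instance of at least one permitted base type.
    boolean isCompatible(Class<?> aware, Object obj) {
        for (Class<?> type : allowed.getOrDefault(aware, List.of())) {
            if (type.isInstance(obj)) {
                return true;
            }
        }
        return false;
    }
}
```

With only `QParserPlugin` registered, a `ManagedStatsCache` instance is rejected; once `StatsCache` is added to the list for `ResourceLoaderAware`, the same instance is accepted.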

Some alternatives I've thought about:
 * I can probably do some ugly hacks to work around this, but I'd rather do the 
"right thing".
 * I'd prefer not to create a separate field type that changes how the stats 
are managed. For one, in my specific case, I don't want to have a radically 
different test config compared to my setup. This is still "text", with 
text-like configurability.
 ** Second, I like the ability of the stats cache to "fall back" to an 
internal stat if one is missing.
 * Pass at query time. This is a more radical change, similar to what it would 
take to make BM25 params configurable at query time.
 * It's possible I could create a Similarity to change doc freq; however, it 
too would apparently not be ResourceLoaderAware.
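
The "fall back" behavior mentioned in the list can be sketched roughly as follows. This is an assumption-laden illustration, not the plugin's code and not Solr's StatsCache API: `OverridableDocFreq` and its method names are hypothetical, and the override map stands in for whatever configuration the plugin would read through the resource loader.

```java
import java.util.Map;

// Hypothetical helper: prefers a user-supplied document frequency for a term
// and falls back to the index's own statistic when no override exists.
final class OverridableDocFreq {
    // Overrides keyed by term, e.g. loaded from a larger reference corpus.
    private final Map<String, Long> overrides;

    OverridableDocFreq(Map<String, Long> overrides) {
        this.overrides = Map.copyOf(overrides);
    }

    // indexDocFreq is the value the index would report on its own.
    long docFreq(String term, long indexDocFreq) {
        Long override = overrides.get(term);
        return override != null ? override : indexDocFreq;
    }
}
```

A term present in the override map gets the configured value; any other term keeps the internal stat, so a partial override file would still behave sensibly.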



