[
https://issues.apache.org/jira/browse/SOLR-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748317#comment-17748317
]
Houston Putman commented on SOLR-16894:
---------------------------------------
I want to read some more, but your progress and Vespa links aren't working.
> Configurable doc freq: Allow StatsCache instances to be ResourceLoaderAware
> ---------------------------------------------------------------------------
>
> Key: SOLR-16894
> URL: https://issues.apache.org/jira/browse/SOLR-16894
> Project: Solr
> Issue Type: Bug
> Reporter: Doug Turnbull
> Priority: Major
>
> I had been working on a plugin to allow the document frequency stats to be
> controlled by the user. This has precedent in other search engines where
> another corpus is more representative of a term's true document frequency /
> significance. Specifically, [Vespa lets you pass significance at query
> time|#significance]. This applies not just to how doc freq is
> represented, but to the entire set of stats, such as total term freq.
> This is a common pain point in test corpora where you have a smaller sample
> of the documents than the global corpus. It was a frequent bugbear at
> Shopify, and now at my current employer, for doing relevance testing. It's
> also a problem whenever you have a corpus that may include some "outliers"
> that actually aren't outliers in the sense of how your users perceive your
> corpus. For example, "headache" may not be the jargon used in a medical
> textbook, so it is rare there only by happenstance. Yet searchers still
> perceive it as a not very significant term.
> I had made some progress ([here|http://example.com/]); however, I noticed
> that only certain types of classes can be ResourceLoaderAware in order to
> read configuration. Specifically, I see this error when running my tests:
>
> {code:java}
> ./gradlew --stacktrace --info test
> {code}
> {code:java}
> org.apache.solr.common.SolrException: Invalid 'Aware' object:
> manual.idf.stats.ManagedStatsCache@5c19c030 --
> org.apache.lucene.util.ResourceLoaderAware must be an instance of:
> [org.apache.lucene.analysis.CharFilterFactory]
> [org.apache.lucene.analysis.TokenFilterFactory]
> [org.apache.lucene.analysis.TokenizerFactory]
> [org.apache.solr.search.QParserPlugin]
> [org.apache.solr.schema.FieldType]{code}
>
>
> Can I propose we add the StatsCache to the list of allowed
> ResourceLoaderAware objects?
> Some alternatives I've thought about:
> * I probably can do some ugly hacks to work around this, but I'd rather do
> the "right thing"
> * I'd prefer not to create a separate fieldtype that changes how the stats
> are managed. For one, in my specific case, I don't want a radically
> different test config compared to my setup. This is still "text" with
> text-like configurability
> ** Second, I like the stats cache's ability to "fall back" to an
> internal stat if one is missing.
> * Pass at query time - this is a more radical change similar to what it
> would take to make BM25 params configurable at query time
> * It's possible I could create a Similarity to change doc freq; however, it
> too would apparently not be ResourceLoaderAware.
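
The "fall back" behavior described in the issue can be sketched in isolation.
What follows is a minimal, self-contained Java illustration and not Solr's
StatsCache API: the class name OverridableDocFreq and its methods are
hypothetical. It assumes the BM25-style IDF formula
ln(1 + (N - n + 0.5) / (n + 0.5)) used by Lucene's BM25Similarity, and it
assumes that an override supplies both the doc freq and the doc count of the
external corpus.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: per-term doc freq overrides with fallback to index stats. */
final class OverridableDocFreq {
    // term -> {docFreq, docCount} taken from a more representative corpus
    private final Map<String, long[]> overrides = new HashMap<>();

    /** Registers externally supplied stats for a term. */
    public void override(String term, long docFreq, long docCount) {
        overrides.put(term, new long[] {docFreq, docCount});
    }

    /** Uses the override if one exists, otherwise falls back to the index's own stats. */
    public double idf(String term, long indexDocFreq, long indexDocCount) {
        long[] stats = overrides.get(term);
        if (stats == null) {
            stats = new long[] {indexDocFreq, indexDocCount};
        }
        return bm25Idf(stats[0], stats[1]);
    }

    /** BM25-style IDF: ln(1 + (N - n + 0.5) / (n + 0.5)). */
    public static double bm25Idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }
}
```

With made-up numbers, overriding "headache" with a doc freq of 500,000 out of
1,000,000 documents gives it a much lower IDF than the index's own 2 out of
1,000 would, while a term with no override keeps its index-derived value.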
--
This message was sent by Atlassian Jira
(v8.20.10#820010)