[ https://issues.apache.org/jira/browse/SOLR-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750298#comment-17750298 ]
Doug Turnbull edited comment on SOLR-16894 at 8/2/23 1:02 PM:
--------------------------------------------------------------

Hey, thanks all. Sorry it's taken a while to get back. I fixed the links in the description.

After discovering SOLR-8311 and learning about the lifecycles of different components, and why these safeguards exist, I went with the probably correct solution and built a specific field type for this.

Here is the code (WIP): [https://github.com/softwaredoug/managed-stats]

was (Author: softwaredoug):
Hey, thanks all. Sorry it's taken a while to get back. I fixed the links in the description.

After discovering SOLR-8311 and learning about lifecycles, and why these safeguards exist, I went with the probably correct solution and built a specific field type for this.

Here is the code (WIP): [https://github.com/softwaredoug/managed-stats]

> Configurable doc freq: Allow StatsCache instances to be ResourceLoaderAware
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-16894
>                 URL: https://issues.apache.org/jira/browse/SOLR-16894
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Doug Turnbull
>            Priority: Major
>
> I had been working on a plugin to allow the document frequency stats to be
> controlled by the user. This has precedent in other search engines where
> another corpus is more representative of a term's true document frequency /
> significance. Specifically, [Vespa lets you pass significance at query
> time|https://docs.vespa.ai/en/reference/query-language-reference.html#annotations].
> From the Vespa query docs linked above:
> > Significance - Significance value for text ranking features - see [text
> > matching and
> > ranking.|https://docs.vespa.ai/en/text-matching-ranking.html#weight-significance-and-connectedness]
> From the linked page:
> > Significance - An indication of how rare a term is in the corpus of the
> > language, used by a number of text matching [rank
> > features|https://docs.vespa.ai/en/reference/rank-features.html]. This can
> > be set explicitly for each term in [the
> > query|https://docs.vespa.ai/en/reference/query-language-reference.html#significance]
>
> This doesn't just apply to how doc freq is represented, but to the entire set
> of stats, such as total term freq.
> This is a common pain point in test corpora where you have a smaller sample
> of the documents than the global corpus. It was a frequent bugbear at
> Shopify, and now at my current employer, for doing relevance testing. It's
> also a problem whenever you have a corpus that may include some "outliers"
> that actually aren't outliers in the sense of how your users perceive your
> corpus. For example, "headache" may not be the jargon used in a medical
> textbook, so it is rare there only by happenstance. Yet searchers still
> perceive it as a not very significant term.
> I had made some progress
> ([here|https://github.com/softwaredoug/managed-stats]), however I noticed
> only certain types of classes can be ResourceLoaderAware in order to read
> configuration.
> Specifically, I see this error when running my tests:
>
> {code:java}
> ./gradlew --stacktrace --info test
> {code}
> {code:java}
> org.apache.solr.common.SolrException: Invalid 'Aware' object:
> manual.idf.stats.ManagedStatsCache@5c19c030 --
> org.apache.lucene.util.ResourceLoaderAware must be an instance of:
> [org.apache.lucene.analysis.CharFilterFactory]
> [org.apache.lucene.analysis.TokenFilterFactory]
> [org.apache.lucene.analysis.TokenizerFactory]
> [org.apache.solr.search.QParserPlugin]
> [org.apache.solr.schema.FieldType]{code}
>
> Can I propose we add the StatsCache to the list of allowed
> ResourceLoaderAware objects?
> Some alternatives I've thought about:
> * I probably can do some ugly hacks to work around this, but I'd rather do
> the "right thing"
> * I'd prefer not to create a separate fieldtype that changes how the stats
> are managed. For one, in my specific case, I don't want to have a
> radically different test config compared to my setup. This is still "text"
> with text-like configurability
> ** Second, I like the ability with the stats cache to "fall back" to an
> internal stat if one is missing.
> * Pass at query time - this is a more radical change, similar to what it
> would take to make BM25 params configurable at query time
> * It's possible I could create a Similarity to change doc freq; however, it
> too would apparently not be ResourceLoaderAware.
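
For readers following the thread, the "fall back" idea mentioned in the alternatives can be illustrated with a small, self-contained Java sketch. It is not taken from the managed-stats repository and uses no Solr or Lucene classes; the class name `ManagedDocFreqSketch` and all of its methods are hypothetical. It assumes a BM25-style idf of the form ln(1 + (N - n + 0.5) / (n + 0.5)), and it assumes the managed document frequencies come with their own document count.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of Solr or of the managed-stats plugin. */
public class ManagedDocFreqSketch {

    /** User-supplied document frequencies from a more representative corpus. */
    private final Map<String, Long> managedDocFreq = new HashMap<>();

    /** Document count of the corpus the managed frequencies were taken from. */
    private final long managedDocCount;

    public ManagedDocFreqSketch(long managedDocCount) {
        this.managedDocCount = managedDocCount;
    }

    /** Registers a managed document frequency for a term. */
    public void put(String term, long docFreq) {
        managedDocFreq.put(term, docFreq);
    }

    /** BM25-style idf: ln(1 + (N - n + 0.5) / (n + 0.5)). */
    public static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Uses the managed statistic when one exists for the term; otherwise
     * falls back to the statistics reported by the local index.
     */
    public double idfFor(String term, long indexDocFreq, long indexDocCount) {
        Long managed = managedDocFreq.get(term);
        if (managed != null) {
            return idf(managed, managedDocCount);
        }
        return idf(indexDocFreq, indexDocCount);
    }

    public static void main(String[] args) {
        // Made-up numbers: "headache" is rare in a small test index but
        // common in the larger corpus the managed stats describe.
        ManagedDocFreqSketch stats = new ManagedDocFreqSketch(1_000_000L);
        stats.put("headache", 200_000L);

        System.out.println("managed idf:  " + stats.idfFor("headache", 2L, 1_000L));
        System.out.println("fallback idf: " + stats.idfFor("migraine", 2L, 1_000L));
    }
}
```

With these made-up numbers the managed idf for "headache" comes out lower than the idf the small test index would give it, which matches the motivation in the description: a term that is rare only by happenstance should not be scored as highly significant.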