[ 
https://issues.apache.org/jira/browse/SOLR-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-16894:
---------------------------------
    Description: 
I had been working on a plugin that lets the user control the document 
frequency stats. This has precedent in other search engines, where another 
corpus is more representative of a term's true document frequency / 
significance. Specifically, [Vespa lets you pass significance at query 
time|https://docs.vespa.ai/en/reference/query-language-reference.html#significance]. 
This applies not just to how doc freq is represented, but to the entire set of 
stats: total term freq, etc.

This is a common pain point in test corpora, where you have a smaller sample 
of the documents than the global corpus. It was a frequent bugbear at Shopify, 
and is now at my current employer, when doing relevance testing. It's also a 
problem whenever a corpus includes some "outliers" that aren't actually 
outliers in terms of how your users perceive your corpus. For example, 
"headache" may not be the jargon used in a medical textbook, so it is rare 
there just by happenstance. Yet searchers still perceive it as a not very 
significant term.
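
To make the effect concrete, here is a minimal sketch of how the same term gets 
a very different IDF depending on which corpus supplies the stats. The formula 
is the IDF form used by Lucene's BM25Similarity; the class name and all of the 
counts are made up for illustration.

```java
// Sketch only: IdfSketch and every count below are hypothetical.
final class IdfSketch {

    // IDF as in Lucene's BM25Similarity:
    // ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // Assumed small test sample: "headache" occurs in 2 of 1,000 docs.
        double sampleIdf = idf(2, 1_000);
        // Assumed representative corpus: 50,000 of 1,000,000 docs.
        double representativeIdf = idf(50_000, 1_000_000);
        // The term looks far more significant in the sample than it "really" is.
        System.out.printf("sample idf=%.3f, representative idf=%.3f%n",
                sampleIdf, representativeIdf);
    }
}
```

With sample stats the term scores as rare and therefore important; with stats 
from the more representative corpus it would be weighted much lower.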

I had made some progress ([here|http://example.com/]); however, I noticed that 
only certain types of classes can be ResourceLoaderAware in order to read 
configuration. Specifically, I see this error when running my tests:

 
{code:java}
./gradlew --stacktrace --info test
{code}
{code:java}
org.apache.solr.common.SolrException: Invalid 'Aware' object: 
manual.idf.stats.ManagedStatsCache@5c19c030 -- 
org.apache.lucene.util.ResourceLoaderAware must be an instance of: 
[org.apache.lucene.analysis.CharFilterFactory] 
[org.apache.lucene.analysis.TokenFilterFactory] 
[org.apache.lucene.analysis.TokenizerFactory] 
[org.apache.solr.search.QParserPlugin] [org.apache.solr.schema.FieldType]
{code}
 

 

Can I propose we add the StatsCache to the list of allowed ResourceLoaderAware 
objects?
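
As a rough model of what is being asked for: the error message above suggests 
an allow-list check on ResourceLoaderAware plugins. The sketch below imitates 
such a check with stand-in types. It is not Solr's code; every class in it is 
a hypothetical placeholder, and the allow-lists are abbreviated.

```java
import java.util.List;

// Sketch only: stand-in types, not the real Solr/Lucene classes.
final class AwareCheckSketch {

    interface ResourceLoaderAware {}
    static class FieldType {}
    static class QParserPlugin {}
    static abstract class StatsCache {}
    // Plays the role of the plugin from the error message.
    static class ManagedStatsCache extends StatsCache implements ResourceLoaderAware {}

    // True if the plugin is an instance of at least one allowed base type.
    static boolean isAllowed(Object plugin, List<Class<?>> allowed) {
        for (Class<?> type : allowed) {
            if (type.isInstance(plugin)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Abbreviated version of the list shown in the error message.
        List<Class<?>> current = List.of(FieldType.class, QParserPlugin.class);
        // The proposal: the same list plus StatsCache.
        List<Class<?>> proposed =
                List.of(FieldType.class, QParserPlugin.class, StatsCache.class);

        Object plugin = new ManagedStatsCache();
        System.out.println("allowed today:    " + isAllowed(plugin, current));
        System.out.println("allowed proposed: " + isAllowed(plugin, proposed));
    }
}
```

Under this model the proposal amounts to one extra entry in the allow-list, 
with no change for the types that are already accepted.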

Some alternatives I've thought about:
 * I can probably do some ugly hacks to work around this, but I'd rather do 
the "right thing".
 * I'd prefer not to create a separate fieldtype that changes how the stats are 
managed. For one, in my specific case, I don't want a test config that is 
radically different from my setup. This is still "text", with text-like 
configurability.
 ** Second, I like the stats cache's ability to "fall back" to an internal 
stat if one is missing.
 * Pass at query time: this is a more radical change, similar to what it would 
take to make BM25 params configurable at query time.
 * It's possible I could create a Similarity to change doc freq; however, it 
too would apparently not be ResourceLoaderAware.
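
The "fall back" behaviour mentioned in the list above can be sketched as a 
lookup that prefers a user-supplied doc freq and otherwise returns the index's 
own value. This is a minimal illustration that assumes the overrides have 
already been loaded into a map; the class and method names are hypothetical 
and are not part of the plugin or of Solr.

```java
import java.util.Map;

// Sketch only: FallbackStatsSketch is a hypothetical name, not a Solr class.
final class FallbackStatsSketch {

    // term -> doc freq taken from a more representative corpus (assumed input)
    private final Map<String, Long> overrides;

    FallbackStatsSketch(Map<String, Long> overrides) {
        this.overrides = Map.copyOf(overrides);
    }

    // Use the override when one exists, else fall back to the index's stat.
    long docFreq(String term, long indexDocFreq) {
        Long override = overrides.get(term);
        return override != null ? override : indexDocFreq;
    }

    public static void main(String[] args) {
        FallbackStatsSketch stats =
                new FallbackStatsSketch(Map.of("headache", 50_000L));
        System.out.println(stats.docFreq("headache", 2)); // overridden term
        System.out.println(stats.docFreq("aspirin", 7));  // falls back to index stat
    }
}
```

A real implementation would also need to keep the other stats (doc count, 
total term freq) consistent with the overridden doc freq.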

  was:
I had been working on a plugin that lets the user control the document 
frequency stats. This has precedent in other search engines, where another 
corpus is more representative of a term's true document frequency / 
significance. Specifically, [Vespa lets you pass significance at query 
time|https://docs.vespa.ai/en/reference/query-language-reference.html#significance]. 
This applies not just to how doc freq is represented, but to the entire set of 
stats: total term freq, etc.

This is a common pain point in test corpora, where you have a smaller sample 
of the documents than the global corpus. It was a frequent bugbear at Shopify, 
and is now at my current employer, when doing relevance testing. It's also a 
problem whenever a corpus includes some "outliers" that aren't actually 
outliers in the sense of natural language.

I had made some progress ([here|http://example.com]); however, I noticed that 
only certain types of classes can be ResourceLoaderAware in order to read 
configuration. Specifically, I see this error when running my tests:

 
{code:java}
./gradlew --stacktrace --info test
{code}
{code:java}
org.apache.solr.common.SolrException: Invalid 'Aware' object: 
manual.idf.stats.ManagedStatsCache@5c19c030 -- 
org.apache.lucene.util.ResourceLoaderAware must be an instance of: 
[org.apache.lucene.analysis.CharFilterFactory] 
[org.apache.lucene.analysis.TokenFilterFactory] 
[org.apache.lucene.analysis.TokenizerFactory] 
[org.apache.solr.search.QParserPlugin] [org.apache.solr.schema.FieldType]
{code}
 

 

Can I propose we add the StatsCache to the list of allowed ResourceLoaderAware 
objects?

Some alternatives I've thought about:
 * I can probably do some ugly hacks to work around this, but I'd rather do 
the "right thing".
 * I'd prefer not to create a separate fieldtype that changes how the stats are 
managed. For one, in my specific case, I don't want a test config that is 
radically different from my setup. This is still "text", with text-like 
configurability.
 ** Second, I like the stats cache's ability to "fall back" to an internal 
stat if one is missing.
 * Pass at query time: this is a more radical change, similar to what it would 
take to make BM25 params configurable at query time.
 * It's possible I could create a Similarity to change doc freq; however, it 
too would apparently not be ResourceLoaderAware.


> Configurable doc freq: Allow StatsCache instances to be ResourceLoaderAware
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-16894
>                 URL: https://issues.apache.org/jira/browse/SOLR-16894
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Doug Turnbull
>            Priority: Major


