[ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277462#comment-16277462
 ] 

Doug Turnbull commented on SOLR-11662:
--------------------------------------

Thanks for helping with the change, David!

I would probably do something like that myself. However, I tend to 
restructure most synonyms into a taxonomy. Many people aren't aware of 
hypernymy/hyponymy. It's not uncommon to see a synonym file at an e-commerce 
client, for example, with a line like `pants,khakis` and another line that's 
`pants,jeans`, which of course creates an unintentional equivalence between 
jeans and khakis. Even when these are mixed in with true synonyms, I tend to 
restructure the whole thing as a taxonomy.

For example, some people avoid this at query time by expanding the query with 
rules like the following and relying on the "as_distinct_terms" behavior, which 
biases towards exact matches:

pants => jeans,pants,khakis
jeans => jeans,pants
khakis => khakis,pants

A search for pants here shows a mix of different kinds of pants (khakis and 
jeans roughly equal)
A search for jeans puts jeans first (low doc freq), followed by various kinds 
of pants (high doc freq)
A search for khakis puts khakis first, followed by various kinds of non-jean 
pants
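
As a rough illustration of why the flat lines `pants,khakis` and `pants,jeans` behave differently from the directional rules, here is a toy matching model in Python. It is only a sketch: the helper names are hypothetical, it assumes the flat lines are applied at both index and query time, and it does not reproduce Solr's actual synonym filter.

```python
def expand_flat(term, groups):
    """Expand a term to every member of each flat synonym line containing it."""
    expanded = {term}
    for group in groups:
        if term in group:
            expanded |= set(group)
    return expanded


def matches(query_terms, indexed_terms):
    """A document matches if it shares at least one term with the query."""
    return bool(query_terms & indexed_terms)


# Flat synonym lines, assumed to be applied at index time and query time.
flat_groups = [("pants", "khakis"), ("pants", "jeans")]
khakis_doc_flat = expand_flat("khakis", flat_groups)   # indexed as khakis + pants
jeans_query_flat = expand_flat("jeans", flat_groups)   # queried as jeans + pants
flat_match = matches(jeans_query_flat, khakis_doc_flat)

# Directional rules, applied at query time only; the document is indexed as-is.
directional = {
    "pants": {"jeans", "pants", "khakis"},
    "jeans": {"jeans", "pants"},
}
khakis_doc_raw = {"khakis"}
jeans_match = matches(directional["jeans"], khakis_doc_raw)
pants_match = matches(directional["pants"], khakis_doc_raw)
```

In this toy model the flat lines let a jeans query reach a khakis-only document through the shared `pants` token, while the directional rules only reach it from a pants query.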

I tend to think of synonyms as hyponyms of a canonical name for an idea. For 
jeans, for example, I might expand to:

blue_jeans => blue_jeans,jeans,pants
denim_jeans => denim_jeans,jeans,pants

When multiple analyzer chains are available, I might recommend using them to 
control how loose the search is. For example, one chain could force a strong 
boost for conceptually similar items, while another could limit the semantic 
expansion so that blue_jeans, for example, only expands up to the jeans level.
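
A minimal sketch of generating such upward expansion rules from a taxonomy, including a cap on how many levels a term expands. The `parent_of` map, the `expansion_rule` helper, and the `max_levels` parameter are hypothetical names for illustration; it assumes an acyclic single-parent taxonomy and only covers expansion towards hypernyms.

```python
def expansion_rule(term, parent_of, max_levels=None):
    """Build a rule 'term => term,parent,...' by walking up the taxonomy.

    max_levels caps how many ancestors are included; None means no cap.
    """
    chain = [term]
    current = term
    while current in parent_of and (max_levels is None or len(chain) - 1 < max_levels):
        current = parent_of[current]
        chain.append(current)
    return f"{term} => {','.join(chain)}"


# Hypothetical taxonomy: each term points at its immediate hypernym.
parent_of = {
    "blue_jeans": "jeans",
    "denim_jeans": "jeans",
    "jeans": "pants",
    "khakis": "pants",
}

full_rule = expansion_rule("blue_jeans", parent_of)
capped_rule = expansion_rule("blue_jeans", parent_of, max_levels=1)
```

Here `full_rule` is `blue_jeans => blue_jeans,jeans,pants`, and `capped_rule` stops at the jeans level.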

There's quite a lot of "it depends". The example above presupposes that pants 
have a higher doc freq than jeans, which may not be the case without a similar 
index-time expansion.


> Make overlapping query term scoring configurable per field type
> ---------------------------------------------------------------
>
>                 Key: SOLR-11662
>                 URL: https://issues.apache.org/jira/browse/SOLR-11662
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Doug Turnbull
>            Assignee: David Smiley
>             Fix For: 7.2, master (8.0)
>
>
> This patch makes the query-time behavior configurable when query terms 
> overlap positions. Right now the only option is SynonymQuery. This is a 
> fantastic default and an improvement on past versions. However, there are use 
> cases where terms overlap positions but don't carry exact synonymy 
> relationships. Often synonyms (or other analyzers) are actually used to model 
> hypernym/hyponym relationships. In that case the individual term scores 
> matter, with terms of higher specificity (hyponyms) scoring higher than terms 
> of lower specificity (hypernyms).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
>   <fieldType name="text_general"  scoreOverlaps="pick_best"  
> class="solr.TextField" positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; suits most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from underlying terms blended.
> *pick_best*
> For a given document, score using the best-scoring synonym (i.e. dismax over 
> the generated terms). 
> Useful when synonyms are not exactly equivalent and are instead used to model 
> hypernym/hyponym relationships, so that each term's score reflects its 
> specificity.
> E.g. with this query-time expansion
> tabby => tabby, cat, animal
> searching the "text" field for tabby generates the dismax query 
> (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre 6.0 behavior.)
> Compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship, but lets 
> scores stack, so the more occurrences of tabby, cat, or animal a document 
> has, the better, with a bias towards the term with the highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> E.g. with this query-time expansion
> tabby => tabby, cat, animal
> searching the "text" field for tabby generates the boolean query 
> (text:tabby text:cat text:animal)
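
The three scoreOverlaps modes described above can be compared with a toy scoring model. This is only a sketch, not Lucene's implementation: the `idf` and `score_overlaps` helpers are hypothetical, each term scores as term frequency times a BM25-style idf, and the blended document frequency for `as_one_term` is assumed to be the maximum over the overlapping terms.

```python
import math


def idf(doc_freq, num_docs):
    """BM25-style inverse document frequency: rarer terms get a higher weight."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def score_overlaps(mode, term_freqs, doc_freqs, num_docs):
    """Score one document against terms that share a query position.

    term_freqs: term -> occurrences in this document
    doc_freqs:  term -> number of documents containing the term
    """
    terms = list(doc_freqs)
    if mode == "as_one_term":
        # One virtual term: summed term frequency, single blended doc freq
        # (assumed here to be the max over the overlapping terms).
        total_tf = sum(term_freqs.get(t, 0) for t in terms)
        return total_tf * idf(max(doc_freqs.values()), num_docs)
    per_term = [term_freqs.get(t, 0) * idf(doc_freqs[t], num_docs) for t in terms]
    if mode == "pick_best":
        return max(per_term)  # dismax: only the best-scoring term counts
    if mode == "as_distinct_terms":
        return sum(per_term)  # boolean OR: per-term scores stack
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical corpus statistics for the expansion tabby => tabby, cat, animal.
NUM_DOCS = 1000
DOC_FREQS = {"tabby": 10, "cat": 200, "animal": 800}

tabby_doc = {"tabby": 1}
animal_doc = {"animal": 1}
both_doc = {"tabby": 1, "animal": 1}
```

Under these assumptions, `as_one_term` scores `tabby_doc` and `animal_doc` identically, `pick_best` ranks `tabby_doc` higher because tabby is the rarer term, and `as_distinct_terms` additionally rewards `both_doc` for matching two terms.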



