[ https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277462#comment-16277462 ]
Doug Turnbull commented on SOLR-11662:
--------------------------------------

Thanks for helping with the change, David!

I would probably personally do something like that. However, I tend to restructure most synonyms into a taxonomy. Many people aren't aware of hypernymy/hyponymy. It's not uncommon to see a synonyms file at an e-commerce client, for example, with a line like `pants,khakis` and another line that's `pants,jeans`, which of course creates an unintentional equivalence between jeans and khakis. Even when these are mixed in with true synonyms, I tend to restructure the whole thing as a taxonomy.

For example, some people avoid this at query time by expanding the query and expecting the "as_distinct_terms" behavior, which biases towards exact match:

pants => jeans,pants,khakis
jeans => jeans,pants
khakis => jeans,khakis

A search for pants here shows a mix of different kinds of pants (khakis and jeans roughly equal).
A search for jeans puts jeans first (low doc freq), followed by various kinds of pants (high doc freq).
A search for khakis puts khakis first, followed by various kinds of non-jean pants.

I tend to think of synonyms as hyponyms of a canonical name for an idea. So for jeans, for example, I might expand to:

blue_jeans => blue_jeans,jeans,pants
denim_jeans => denim_jeans,jeans,pants

I might recommend controlling how loose the search is with multiple analyzer chains. For example, one could see forcing a strong boost for conceptually similar items, or limiting the semantic expansion so that blue_jeans, for example, only expands up to the jeans level.

There's quite a lot of "it depends". The example above presupposes that pants has a higher doc freq than jeans, which may not be the case without a similar index-time expansion.

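To make the ranking claims in the comment concrete, here is a minimal Python sketch of the "as_distinct_terms" idea: each expanded term is scored on its own and the scores are summed. Everything in it is an assumption for illustration, not Solr's implementation: the six-document corpus is hypothetical, every jeans and khakis document is assumed to also carry `pants` (the index-time expansion the comment says the example presupposes), the score is a bare BM25-style IDF with term frequency ignored, and only the `pants` and `jeans` rules are modeled.

```python
import math

# Hypothetical toy corpus: each document is the set of terms indexed for it.
# Assumes index-time expansion already added "pants" to jeans/khakis docs.
DOCS = {
    "d1": {"jeans", "pants"},
    "d2": {"jeans", "pants"},
    "d3": {"khakis", "pants"},
    "d4": {"khakis", "pants"},
    "d5": {"pants"},
    "d6": {"shirt"},
}

# Two of the query-time expansion rules quoted in the comment.
EXPANSIONS = {
    "pants": ["jeans", "pants", "khakis"],
    "jeans": ["jeans", "pants"],
}


def idf(term):
    """BM25-style inverse document frequency: rarer terms weigh more."""
    n = len(DOCS)
    df = sum(1 for terms in DOCS.values() if term in terms)
    return math.log(1 + (n - df + 0.5) / (df + 0.5))


def score_distinct(query):
    """Score each expanded term separately and sum the matches per document."""
    terms = EXPANSIONS.get(query, [query])
    scores = {}
    for doc_id, doc_terms in DOCS.items():
        total = sum(idf(t) for t in terms if t in doc_terms)
        if total > 0:
            scores[doc_id] = total
    return scores


def ranked(query):
    """Document ids by descending score, ties broken by id."""
    scores = score_distinct(query)
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    for q in ("jeans", "pants"):
        print(q, ranked(q), score_distinct(q))
```

In this toy model a search for `jeans` ranks the two jeans documents ahead of the other pants, because `jeans` has the lower document frequency, while a search for `pants` scores the jeans and khakis documents equally.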
> Make overlapping query term scoring configurable per field type
> ---------------------------------------------------------------
>
>                 Key: SOLR-11662
>                 URL: https://issues.apache.org/jira/browse/SOLR-11662
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Doug Turnbull
>            Assignee: David Smiley
>             Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap
> positions. Right now the only option is SynonymQuery. This is a fantastic
> default & improvement on past versions. However, there are use cases where
> terms overlap positions but don't carry exact synonymy relationships. Often
> hypernym/hyponym relationships are actually modeled using synonyms (or
> other analyzers). So the individual term scores matter, with terms with
> higher specificity (hyponym) scoring higher than terms with lower
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType name="text_general" scoreOverlaps="pick_best"
> class="solr.TextField" positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default, most synonym use cases. Uses SynonymQuery
> Treats all terms as if they're exactly equivalent, with document frequency
> from underlying terms blended
> *pick_best*
> For a given document, score using the best scoring synonym (ie dismax over
> generated terms).
> Useful when synonyms are not exactly equivalent. Instead they are used to
> model hypernym/hyponym relationships. Such as expanding to synonyms where
> term scores will reflect that quality
> IE this query time expansion
> tabby => tabby, cat, animal
> Searching "text" generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre 6.0 behavior.)
> Compromise between pick_best and as_one_term
> Appropriate when synonyms reflect a hypernym/hyponym relationship, but lets
> scores stack, so documents with more tabby, cat, or animal the better w/ a
> bias towards the term with highest specificity
> Terms are turned into a boolean OR query, with document frequencies not
> blended
> IE this query time expansion
> tabby => tabby, cat, animal
> Searching "text" generates the boolean query (text:tabby text:cat
> text:animal)
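
As a rough illustration of how the three scoreOverlaps values described in the issue differ, the following Python sketch scores the expansion `tabby => tabby, cat, animal` against a small corpus. It is a toy model under stated assumptions, not Lucene's scoring: the corpus and document names are hypothetical, term frequency is ignored, and the "blending" used for `as_one_term` (document frequency of the union of the terms) is a stand-in chosen for simplicity rather than the formula SynonymQuery uses.

```python
import math

# Hypothetical toy corpus; each document is the set of terms indexed in "text".
DOCS = {
    "tabby_doc": {"tabby", "cat", "animal"},
    "cat_doc": {"cat", "animal"},
    "dog_doc": {"dog", "animal"},
    "bird_doc": {"bird", "animal"},
    "car_doc": {"car"},
}

# Query-time expansion from the issue description: tabby => tabby, cat, animal
EXPANDED = ["tabby", "cat", "animal"]


def idf_from_df(df, n):
    """BM25-style inverse document frequency for a given document frequency."""
    return math.log(1 + (n - df + 0.5) / (df + 0.5))


def term_score(term, doc_terms):
    """Score of one term against one document (0.0 when the term is absent)."""
    if term not in doc_terms:
        return 0.0
    df = sum(1 for terms in DOCS.values() if term in terms)
    return idf_from_df(df, len(DOCS))


def score(mode, doc_terms):
    per_term = [term_score(t, doc_terms) for t in EXPANDED]
    if mode == "as_distinct_terms":
        # Boolean OR: per-term scores stack.
        return sum(per_term)
    if mode == "pick_best":
        # Dismax over the generated terms: keep only the best-scoring one.
        return max(per_term)
    if mode == "as_one_term":
        # Assumed blending: treat the terms as one pseudo-term whose document
        # frequency is the number of documents containing any of them.
        if not any(t in doc_terms for t in EXPANDED):
            return 0.0
        blended_df = sum(
            1 for terms in DOCS.values() if any(t in terms for t in EXPANDED)
        )
        return idf_from_df(blended_df, len(DOCS))
    raise ValueError(f"unknown mode: {mode}")


def score_all(mode):
    return {doc_id: score(mode, terms) for doc_id, terms in DOCS.items()}


if __name__ == "__main__":
    for mode in ("as_one_term", "pick_best", "as_distinct_terms"):
        print(mode, score_all(mode))
```

In this model `as_one_term` gives every matching document the same score, `pick_best` orders documents by their most specific matching term, and `as_distinct_terms` additionally rewards documents that match several of the terms.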