[ https://issues.apache.org/jira/browse/SOLR-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511306 ]
Mike Klaas commented on SOLR-293:
---------------------------------

> Would it be useful to be able to configure this separately for words and
> numbers?

I think it would, but I wasn't sure. Trivial to implement in either case.

> Is there anything that can be done along the same lines, when not catenating
> for the query analyzer, so "foo-bar" will still become "foo bar", but "A9"
> would stay as "A9"?

There are a couple of ways to approach this (though I'm not exactly sure what your question is):

- Instead of a minimum part length, restrict analysis to tokens with length > some value. With N=3, this would let "HiFi/hi-fi" -> "hi fi" but "hi8" -> "hi8". This makes the setting dependent on separator characters.
- Ensure character inclusion. If any letter/number character was not included in any generated subpart, ensure that a larger containing token is generated.
    "high-figh-888" -> "high figh 888" (and not "highfigh888")
    "hi-fi-8" -> "hifi8"
- Approach the delimiter question differently. Currently, parts are delimited on case change, alpha->num (and vice versa), and delimiter chars. The last is a much, much stronger lexical delimiter, and it would be nice to recognize the difference between "java5", "mp3", "4x4" and "99-bottle", "20-cent-piece", etc.

Save for the first, I can't think of easy, efficient implementations. Perhaps WDF shouldn't get too sophisticated.

> Add "minPartLength" to WordDelimiterFilter
> ------------------------------------------
>
>                 Key: SOLR-293
>                 URL: https://issues.apache.org/jira/browse/SOLR-293
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Mike Klaas
>            Assignee: Mike Klaas
>            Priority: Minor
>             Fix For: 1.3
>
>
> WDF is handy but over-tokenizes when faced with short word parts:
> A9
> R2D2
> mp3
> This creates one- or two-character tokens which are extremely slow to query
> as the doc freq is so high (this is contributing to a significant portion of
> our slowest queries).
> This patch adds a "minPartLength" option that disables generation of parts
> below a certain length. It is recommended to use it with catenateAll, so as
> to not lose tokens.
> I'll add factory options and tests if we decide to include this (and are
> happy with the parameter name).
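
To make the proposed behaviour concrete, here is a rough standalone sketch in Python. It is not the patch and not Solr's Java WordDelimiterFilter: the function names (split_parts, word_delimiter), the ASCII-only splitting regex, and the exact way catenate_all is handled are assumptions made purely for illustration. It splits a token on delimiter characters, case changes, and letter/digit transitions, drops parts shorter than min_part_length, and optionally emits the catenated token so that nothing is lost.

```python
import re

# Simplified, ASCII-only splitting rule (an assumption, not WDF's real logic):
# a part is a capitalized or lowercase word, a run of capitals, or a run of
# digits. Anything else (e.g. "-", "/") acts as a delimiter.
_PART_RE = re.compile(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_parts(token):
    """Split on delimiter chars, case change, and alpha<->num transitions."""
    return _PART_RE.findall(token)


def word_delimiter(token, min_part_length=1, catenate_all=False):
    """Return the tokens generated for `token`.

    Parts shorter than `min_part_length` are not generated. With
    `catenate_all`, the concatenation of all parts is also emitted when the
    token was split into more than one part.
    """
    parts = split_parts(token)
    out = [p for p in parts if len(p) >= min_part_length]
    if catenate_all and len(parts) > 1:
        whole = "".join(parts)
        if whole not in out:
            out.append(whole)
    return out


if __name__ == "__main__":
    for tok in ("A9", "R2D2", "mp3", "hi-fi", "PowerShot"):
        print(tok, word_delimiter(tok, min_part_length=3, catenate_all=True))
```

With min_part_length=3 and no catenation, "A9" produces no tokens at all in this sketch, which is the reason for pairing the option with catenateAll.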