[ 
https://issues.apache.org/jira/browse/SOLR-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511306
 ] 

Mike Klaas commented on SOLR-293:
---------------------------------

> Would it be useful to be able to configure this separately for words and 
> numbers? 

I think it would, but I wasn't sure.  Trivial to implement in either case.
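
As a rough illustration of what a separate setting for words and numbers could 
look like (a standalone Python sketch, not the Solr patch; the names 
filter_parts, min_word_length and min_number_length are made up for this 
example, and the splitting is far simpler than what WDF really does):

```python
import re

# Simplified splitting: runs of letters or runs of digits. Delimiter
# characters and letter<->digit transitions end a part; case changes
# are ignored here to keep the sketch short.
_PART = re.compile(r"[A-Za-z]+|[0-9]+")


def filter_parts(token, min_word_length=1, min_number_length=1,
                 catenate_all=True):
    """Drop parts shorter than a per-type minimum length.

    min_word_length and min_number_length are hypothetical parameters.
    With catenate_all, the concatenation of all parts is also emitted so
    that a token whose parts were all dropped is not lost entirely.
    """
    parts = _PART.findall(token)
    kept = []
    for part in parts:
        limit = min_number_length if part.isdigit() else min_word_length
        if len(part) >= limit:
            kept.append(part.lower())
    if catenate_all:
        whole = "".join(parts).lower()
        if whole and whole not in kept:
            kept.append(whole)
    return kept


print(filter_parts("mp3", min_word_length=3, min_number_length=1))
print(filter_parts("mp3", min_word_length=3, min_number_length=2))
print(filter_parts("java5", min_word_length=3, min_number_length=2))
```

With a word minimum of 3 and a number minimum of 1, "mp3" keeps the "3" part 
plus the catenated "mp3"; raising the number minimum to 2 leaves only "mp3".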

> Is there anything that can be done along the same lines, when not catenating 
> for the query analyzer, so "foo-bar" will still become "foo bar", but "A9" 
> would stay as "A9"? 

There are a couple of ways to approach this (though I'm not exactly sure what 
your question is):

- instead of a minimum part length, only analyze tokens longer than some value 
N.  With N=3, this would let "HiFi"/"hi-fi" -> "hi fi" but "hi8" -> "hi8".  
This makes the setting dependent on separator characters, since they count 
toward the token length.

- ensure character inclusion.  If any letter/number character was not included 
in any generated subpart, ensure that a larger containing token is generated.

"high-figh-888" -> "high figh 888" (and not "highfigh888")
"hi-fi-8" -> "hifi8"

- approach the delimiter question differently.  Currently, parts are delimited 
on case change, alpha->num (and vice versa), and delimiter chars.  The last is 
a much, much stronger lexical delimiter, and it would be nice to recognize the 
difference between "java5", "mp3", "4x4" and "99-bottle", "20-cent-piece", etc.

Save for the first, I can't think of easy, efficient implementations.  Perhaps 
WDF shouldn't get too sophisticated.
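
To make the first option concrete, here is a minimal sketch (standalone Python, 
not WDF code).  It reads the N=3 example as "tokens of N characters or fewer 
are left alone"; the names split_parts and analyze, the regular expression, and 
the lowercasing of the output are all assumptions made for illustration:

```python
import re

# Simplified stand-in for the splitting rules: parts end at delimiter
# characters, at letter<->digit transitions, and at lower->upper case
# changes. This is an approximation, not the real WDF logic.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_parts(token):
    """Return the sub-parts of a token under the simplified rules above."""
    return _PART.findall(token)


def analyze(token, n=3):
    """Split only tokens longer than n characters; keep short ones whole.

    The length is measured on the raw token, so delimiter characters
    count toward it.
    """
    if len(token) <= n:
        return [token.lower()]
    return [part.lower() for part in split_parts(token)]


for example in ("HiFi", "hi-fi", "hi8", "a-b", "ab-c"):
    print(example, "->", " ".join(analyze(example)))
```

Note that "a-b" (three characters counting the hyphen) is left alone while 
"ab-c" is split, which is the separator dependence mentioned above.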

> Add "minPartLength" to WordDelimiterFilter
> ------------------------------------------
>
>                 Key: SOLR-293
>                 URL: https://issues.apache.org/jira/browse/SOLR-293
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Mike Klaas
>            Assignee: Mike Klaas
>            Priority: Minor
>             Fix For: 1.3
>
>
> WDF is handy but over-tokenizes when faced with short word parts:
> A9
> R2D2
> mp3
> This creates one- or two-character tokens which are extremely slow to query 
> as the doc freq is so high (this is contributing to a significant portion of 
> our slowest queries).
> This patch adds a "minPartLength" option that disables generation of parts 
> below a certain length.  It is recommended to use it with catenateAll, so as 
> to not lose tokens.
> I'll add factory options and tests if we decide to include this (and are 
> happy with the parameter name).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
