[ https://issues.apache.org/jira/browse/SOLR-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902600#action_12902600 ]
Peter Karich edited comment on SOLR-2059 at 8/25/10 3:46 PM:
-------------------------------------------------------------

Oops, my mistake ... this helped!

> What do you think of the file format, is it ok for describing these
> categories?

I think it is ok. I even had a simpler patch before stumbling over yours:

handleAsChar="@#"

The same thing is now more powerful with your patch, IMHO:

{code}
@ => ALPHA
# => ALPHA
{code}

      was (Author: peathal):
    Oops, my mistake ... this helped!

> What do you think of the file format, is it ok for describing these
> categories?

I think it is ok. I even had a simpler patch before stumbling over yours:

handleAsChar="@#"

The same thing is now more powerful with your patch, IMHO:

@ => ALPHA
# => ALPHA

> Allow customizing how WordDelimiterFilter tokenizes text.
> ---------------------------------------------------------
>
>                 Key: SOLR-2059
>                 URL: https://issues.apache.org/jira/browse/SOLR-2059
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2059.patch
>
>
> By default, WordDelimiterFilter assigns 'types' to each character (computed
> from Unicode Properties).
> Based on these types and the options provided, it splits and concatenates
> text.
> In some circumstances, you might need to tweak how this works.
> It seems the filter already had this in mind, since you can pass in a custom
> byte[] type table.
> But it's not exposed in the factory.
> I think you should be able to customize the defaults with a configuration
> file:
> {noformat}
> # A customized type mapping for WordDelimiterFilterFactory
> # the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM
> #
> # the default for any character without a mapping is always computed from
> # Unicode character properties
> # Map the $, %, '.', and ',' characters to DIGIT
> # This might be useful for financial data.
> $ => DIGIT
> % => DIGIT
> . => DIGIT
> \u002C => DIGIT
> {noformat}
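
For illustration only, the mapping format described in the issue could be parsed roughly as follows. This is a minimal Java sketch, not the code from SOLR-2059.patch: the class name TypeTableSketch, its methods, and the numeric values of the type constants are assumptions made for this example. It also assumes that a line of the form "char => TYPE" is checked before the comment rule, so that "# => ALPHA" counts as a mapping; the actual patch may handle that case differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: turns "char => TYPE" lines into a character-to-type map. */
final class TypeTableSketch {

    // Assumed bit-flag values for the type names listed in the issue;
    // verify them against the WordDelimiterFilter source before relying on them.
    static final byte LOWER = 0x01;
    static final byte UPPER = 0x02;
    static final byte DIGIT = 0x04;
    static final byte SUBWORD_DELIM = 0x08;
    static final byte ALPHA = LOWER | UPPER;
    static final byte ALPHANUM = ALPHA | DIGIT;

    // One non-blank token, "=>", one non-blank token.
    private static final Pattern MAPPING = Pattern.compile("(\\S+)\\s*=>\\s*(\\S+)");

    static Map<Character, Byte> parse(List<String> lines) {
        Map<Character, Byte> table = new LinkedHashMap<Character, Byte>();
        for (String raw : lines) {
            String line = raw.trim();
            Matcher m = MAPPING.matcher(line);
            if (m.matches()) {
                // Mapping rule first, so "# => ALPHA" is a mapping, not a comment.
                table.put(parseChar(m.group(1)), parseType(m.group(2)));
            } else if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment
            } else {
                throw new IllegalArgumentException("Invalid mapping line: " + raw);
            }
        }
        return table;
    }

    // Accepts a single literal character or a backslash-u escape with four hex digits.
    private static char parseChar(String spec) {
        if (spec.length() == 1) {
            return spec.charAt(0);
        }
        if (spec.length() == 6 && spec.startsWith("\\u")) {
            return (char) Integer.parseInt(spec.substring(2), 16);
        }
        throw new IllegalArgumentException("Invalid character spec: " + spec);
    }

    private static byte parseType(String name) {
        if (name.equals("LOWER")) return LOWER;
        if (name.equals("UPPER")) return UPPER;
        if (name.equals("ALPHA")) return ALPHA;
        if (name.equals("DIGIT")) return DIGIT;
        if (name.equals("ALPHANUM")) return ALPHANUM;
        if (name.equals("SUBWORD_DELIM")) return SUBWORD_DELIM;
        throw new IllegalArgumentException("Unknown type: " + name);
    }

    public static void main(String[] args) {
        // The example mappings from the issue description.
        List<String> lines = Arrays.asList(
            "# Map the $, %, '.', and ',' characters to DIGIT",
            "$ => DIGIT",
            "% => DIGIT",
            ". => DIGIT",
            "\\u002C => DIGIT");
        System.out.println(parse(lines));
    }
}
```

Characters that do not appear in the resulting map would keep the default type computed from Unicode character properties, as the issue description states.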