[ https://issues.apache.org/jira/browse/SOLR-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated SOLR-2059:
------------------------------
    Attachment: SOLR-2059.patch

> Allow customizing how WordDelimiterFilter tokenizes text.
> ---------------------------------------------------------
>
>                 Key: SOLR-2059
>                 URL: https://issues.apache.org/jira/browse/SOLR-2059
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2059.patch
>
>
> By default, WordDelimiterFilter assigns a 'type' to each character (computed from Unicode character properties).
> Based on these types and the options provided, it splits and concatenates text.
> In some circumstances, you might need to tweak this behavior.
> It seems the filter already had this in mind, since you can pass in a custom byte[] type table, but it's not exposed in the factory.
> I think you should be able to customize the defaults with a configuration file:
> {noformat}
> # A customized type mapping for WordDelimiterFilterFactory
> # the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM
> #
> # the default for any character without a mapping is always computed from
> # Unicode character properties
> # Map the $, %, '.', and ',' characters to DIGIT
> # This might be useful for financial data.
> $ => DIGIT
> % => DIGIT
> . => DIGIT
> \u002C => DIGIT
> {noformat}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
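To make the proposed file format concrete, here is a minimal sketch of how the factory might parse such a mapping file into a character-to-type table. This is an illustration only: the `TypeTableParser` class and its byte constants are hypothetical, not Lucene's actual `WordDelimiterFilter` type values or the API of the attached patch.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a parser for the proposed type-mapping file format.
// The constants below are illustrative; the real type bytes are
// defined inside WordDelimiterFilter.
public class TypeTableParser {
    // Hypothetical type constants (composed the way the real ones are:
    // ALPHA = LOWER | UPPER, ALPHANUM = ALPHA | DIGIT).
    static final byte LOWER = 0x01, UPPER = 0x02, DIGIT = 0x04, SUBWORD_DELIM = 0x08;
    static final byte ALPHA = LOWER | UPPER, ALPHANUM = (byte) (ALPHA | DIGIT);

    // Parses lines of the form "<char> => <TYPE>"; '#' lines and blanks are skipped.
    static Map<Character, Byte> parse(String config) {
        Map<Character, Byte> table = new HashMap<>();
        for (String line : config.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // comment or blank
            String[] parts = line.split("=>");
            table.put(parseChar(parts[0].trim()), parseType(parts[1].trim()));
        }
        return table;
    }

    // Accepts a literal character or a \\uXXXX escape (e.g. \u002C for ',').
    static char parseChar(String s) {
        if (s.startsWith("\\u")) return (char) Integer.parseInt(s.substring(2), 16);
        return s.charAt(0);
    }

    static byte parseType(String s) {
        switch (s) {
            case "LOWER":         return LOWER;
            case "UPPER":         return UPPER;
            case "ALPHA":         return ALPHA;
            case "DIGIT":         return DIGIT;
            case "ALPHANUM":      return ALPHANUM;
            case "SUBWORD_DELIM": return SUBWORD_DELIM;
            default: throw new IllegalArgumentException("unknown type: " + s);
        }
    }

    public static void main(String[] args) {
        String config = "# financial data\n$ => DIGIT\n% => DIGIT\n. => DIGIT\n\\u002C => DIGIT\n";
        Map<Character, Byte> table = parse(config);
        System.out.println(table.size());            // number of explicit mappings
        System.out.println(table.get(',') == DIGIT); // \u002C resolved to ','
    }
}
```

With a table like this, any character not present in the map would fall back to the type computed from its Unicode properties, exactly as the file format's comments describe.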