[ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929849#action_12929849 ]
Robert Muir commented on SOLR-2211:
-----------------------------------

Great, I look forward to the results.

By the way, on SOLR-2210 I also added the ICU filters. You could consider replacing LowerCaseFilterFactory with ICUNormalizer2Factory (just use the defaults). In addition to better lowercasing (e.g. ß -> ss), this would also bring the advantages described in http://unicode.org/reports/tr15/

Alternatively, if you are already using both LowerCaseFilterFactory and ASCIIFoldingFilterFactory, you can replace both with ICUFoldingFilterFactory, which goes further and also incorporates http://www.unicode.org/reports/tr30/tr30-4.html

> Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-2211
>                 URL: https://issues.apache.org/jira/browse/SOLR-2211
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1
>            Reporter: Tom Burton-West
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2211.patch
>
>
> The Lucene 3.x StandardTokenizer with UAX#29 support provides benefits for
> non-English tokenizing. Presently it can be invoked by using the
> StandardTokenizerFactory and setting the Version to 3.1. However, it would
> be useful to be able to use the improved Unicode processing without
> necessarily including the IP address and email address processing of
> StandardAnalyzer. A FilterFactory that allowed the use of the
> StandardTokenizer with UAX#29 support on its own would be useful.
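
To make the "better lowercasing (e.g. ß -> ss)" point in the comment concrete, here is a minimal sketch of the difference between plain lowercasing and normalization combined with case folding. It uses only Python's standard library as a stand-in and is not the ICU filter itself; the function names are hypothetical, and the NFKC-plus-casefold combination is an assumption that only approximates what an ICU-based normalizer would do.

```python
import unicodedata


def lowercase_only(text: str) -> str:
    """Plain lowercasing: maps each character to its lowercase form only."""
    return text.lower()


def normalize_and_casefold(text: str) -> str:
    """Rough approximation of normalization plus case folding.

    Assumption: NFKC followed by str.casefold() stands in for an
    ICU-style normalizer here. It is not the ICU implementation.
    """
    # Expand compatibility characters (ligatures, fullwidth forms) first.
    expanded = unicodedata.normalize("NFKC", text)
    # Case folding is more aggressive than lowercasing (e.g. ß becomes ss).
    folded = expanded.casefold()
    # Normalize again in case folding produced non-normalized sequences.
    return unicodedata.normalize("NFKC", folded)


if __name__ == "__main__":
    samples = [
        "Stra\u00dfe",               # contains the German sharp s
        "\ufb01le",                  # starts with the "fi" ligature
        "\uff33\uff4f\uff4c\uff52",  # "Solr" written in fullwidth letters
    ]
    for sample in samples:
        print(repr(sample), repr(lowercase_only(sample)),
              repr(normalize_and_casefold(sample)))
```

With plain lowercasing, a query for "strasse" would not match an indexed "Straße", and the ligature and fullwidth variants would stay distinct from their ASCII spellings; after normalization and folding they map to the same term.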