[ https://issues.apache.org/jira/browse/LUCENE-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-2070:
--------------------------------
    Attachment: LUCENE-2070.patch

> document LengthFilter wrt Unicode 4.0
> -------------------------------------
>
>                 Key: LUCENE-2070
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2070
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Muir
>            Priority: Trivial
>             Fix For: 3.1
>
>         Attachments: LUCENE-2070.patch
>
>
> LengthFilter calculates its min/max length from TermAttribute.termLength()
> This is not characters, but instead UTF-16 code units.
> In my opinion this should not be changed, merely documented.
> If we changed it, it would have an adverse performance impact because we
> would have to actually calculate Character.codePointCount() on the text.
> If you feel strongly otherwise, fixing it to count codepoints would be a
> trivial patch, but I'd rather not hurt performance.
> I admit I don't fully understand all the use cases for this filter.
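
To illustrate the distinction the issue draws between UTF-16 code units and code points, here is a minimal standalone sketch. It does not use Lucene's API: the class LengthUnitsDemo and its methods are hypothetical names made up for this example, and the accept() check only approximates a min/max length test rather than reproducing LengthFilter's actual code. The only library call is the JDK's Character.codePointCount(char[], int, int).

```java
public class LengthUnitsDemo {

    /** Length in UTF-16 code units: already known, no scan of the text needed. */
    public static int codeUnitLength(char[] buffer, int length) {
        return length;
    }

    /** Length in Unicode code points: has to walk the buffer to pair up surrogates. */
    public static int codePointLength(char[] buffer, int length) {
        return Character.codePointCount(buffer, 0, length);
    }

    /** Hypothetical inclusive min/max check, for illustration only. */
    public static boolean accept(int len, int min, int max) {
        return len >= min && len <= max;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D11E (MUSICAL SYMBOL G CLEF), which lies outside
        // the BMP and is therefore encoded as a surrogate pair in UTF-16.
        char[] term = "a\uD834\uDD1E".toCharArray();

        int units = codeUnitLength(term, term.length);
        int points = codePointLength(term, term.length);

        System.out.println("code units:  " + units);
        System.out.println("code points: " + points);

        // With min=1 and max=2 the two ways of counting disagree on this term.
        System.out.println("accepted by code units:  " + accept(units, 1, 2));
        System.out.println("accepted by code points: " + accept(points, 1, 2));
    }
}
```

For text made up only of BMP characters the two counts are identical, so a difference shows up only when a term contains supplementary characters. The code point count needs a pass over the text, which is the performance cost the issue mentions.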