[ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771531#action_12771531 ]
Yonik Seeley commented on LUCENE-2016:
--------------------------------------

bq. This is not true. if you map them to replacement characters, then my app is free to use them "process-internally"

Tricky semantics :-)
It rather depends on whether you consider Lucene part of your "process-internally". Depending on the use case, it could be either.

> replace invalid U+FFFF character during indexing
> ------------------------------------------------
>
>                 Key: LUCENE-2016
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2016
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.4, 2.4.1, 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 2.9.1, 3.0
>
>         Attachments: LUCENE-2016.patch
>
>
> If the invalid U+FFFF character is embedded in a token, it actually causes
> indexing to silently corrupt the index by writing duplicate terms into the
> terms dict. CheckIndex will catch the error, and merging will hit exceptions
> (I think).
> We already replace invalid surrogate pairs with the replacement character
> U+FFFD, so I'll just do the same with U+FFFF.
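
For illustration, the replacement described in the issue might look roughly like the sketch below. This is not the code from LUCENE-2016.patch: the class InvalidCharSanitizer and its sanitize methods are hypothetical names, and the sketch assumes the token text is available as a UTF-16 char buffer. It maps U+FFFF and any unpaired surrogate to the replacement character U+FFFD and leaves valid surrogate pairs alone.

```java
// Hypothetical helper, not part of Lucene: a sketch of the idea in the issue.
final class InvalidCharSanitizer {
    private static final char REPLACEMENT = (char) 0xFFFD; // Unicode replacement character
    private static final char INVALID = (char) 0xFFFF;     // noncharacter named in the issue

    private InvalidCharSanitizer() {}

    /**
     * Rewrites buffer[0..length) in place, replacing U+FFFF and unpaired
     * surrogates with U+FFFD. Returns the number of chars replaced.
     */
    static int sanitize(char[] buffer, int length) {
        int replaced = 0;
        for (int i = 0; i < length; i++) {
            char c = buffer[i];
            if (c == INVALID) {
                buffer[i] = REPLACEMENT;
                replaced++;
            } else if (Character.isHighSurrogate(c)) {
                if (i + 1 < length && Character.isLowSurrogate(buffer[i + 1])) {
                    i++; // valid pair: keep both halves and skip the low surrogate
                } else {
                    buffer[i] = REPLACEMENT; // high surrogate with no partner
                    replaced++;
                }
            } else if (Character.isLowSurrogate(c)) {
                buffer[i] = REPLACEMENT; // low surrogate with no preceding high surrogate
                replaced++;
            }
        }
        return replaced;
    }

    /** Convenience wrapper that returns a sanitized copy of the string. */
    static String sanitize(String s) {
        char[] buf = s.toCharArray();
        sanitize(buf, buf.length);
        return new String(buf);
    }
}
```

Whether this is acceptable depends on the point raised in the comment: an application that uses U+FFFF as a process-internal marker would see that marker rewritten once the text passes through such a step.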