[ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771565#action_12771565 ]
Robert Muir commented on LUCENE-2016:
-------------------------------------

Earwin, take a look at LUCENE-2019. I added a hyperlink to the list there...

> replace invalid U+FFFF character during indexing
> ------------------------------------------------
>
>                 Key: LUCENE-2016
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2016
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.4, 2.4.1, 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 2.9.1, 3.0
>
>         Attachments: LUCENE-2016.patch
>
>
> If the invalid U+FFFF character is embedded in a token, it actually causes
> indexing to silently corrupt the index by writing duplicate terms into the
> terms dict. CheckIndex will catch the error, and merging will hit exceptions
> (I think).
> We already replace invalid surrogate pairs with the replacement character
> U+FFFD, so I'll just do the same with U+FFFF.
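
For readers following the issue, the replacement it describes can be illustrated with a minimal Java sketch. This is not the code from LUCENE-2016.patch; the class name InvalidCharSanitizer and the method sanitize are hypothetical, and the sketch works on a String rather than on Lucene's internal term buffers. It only assumes the behavior stated in the description: unpaired surrogates and U+FFFF are each replaced with U+FFFD.

```java
// Hypothetical helper for illustration only; not part of Lucene or of the attached patch.
final class InvalidCharSanitizer {

    // U+FFFD, the Unicode replacement character.
    static final char REPLACEMENT = '\uFFFD';

    // Returns a copy of s in which every unpaired surrogate and every
    // U+FFFF code unit has been replaced with U+FFFD.
    static String sanitize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                boolean paired = i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1));
                if (paired) {
                    // Valid surrogate pair: copy both code units unchanged.
                    out.append(c).append(s.charAt(i + 1));
                    i += 2;
                    continue;
                }
                // High surrogate with no low surrogate after it.
                out.append(REPLACEMENT);
            } else if (Character.isLowSurrogate(c) || c == '\uFFFF') {
                // Low surrogate with no preceding high surrogate, or the
                // noncharacter U+FFFF.
                out.append(REPLACEMENT);
            } else {
                out.append(c);
            }
            i++;
        }
        return out.toString();
    }
}
```

Under these assumptions, a token such as "ab" + U+FFFF + "cd" comes back as "ab" + U+FFFD + "cd", while a well-formed surrogate pair passes through untouched.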