On 14/12/2011 01:28, Marvin Humphrey wrote:
I just committed a test to trunk which verifies that utf8proc's normalization works properly, in that normalizing a second time is a no-op. However, I had to disable the test because utf8proc chokes when fed strings which contain either control characters or non-character code points.
You're right that utf8proc doesn't allow non-characters, but I don't think control characters are blocked.
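For reference, the idempotency property that the test verifies can be sketched with utf8proc's public utf8proc_map() function, roughly as follows. The helper and the flag combination are illustrative, not the actual trunk test:

/* Sketch: normalizing already-normalized text must be a no-op.
 * Returns 1 if a second pass is a no-op, 0 if not, and -1 if
 * utf8proc rejects the input outright. */
#include <stdlib.h>
#include <string.h>
#include <utf8proc.h>

#define NORM_OPTS (UTF8PROC_STABLE | UTF8PROC_COMPOSE | UTF8PROC_COMPAT)

static int
normalization_is_idempotent(const utf8proc_uint8_t *text,
                            utf8proc_ssize_t len) {
    utf8proc_uint8_t *once  = NULL;
    utf8proc_uint8_t *twice = NULL;
    utf8proc_ssize_t len1 = utf8proc_map(text, len, &once, NORM_OPTS);
    if (len1 < 0) { return -1; }   /* e.g. malformed UTF-8 */
    utf8proc_ssize_t len2 = utf8proc_map(once, len1, &twice, NORM_OPTS);
    if (len2 < 0) { free(once); return -1; }
    int ok = (len1 == len2) && memcmp(once, twice, (size_t)len1) == 0;
    free(once);
    free(twice);
    return ok;
}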
It seems wrong that utf8proc rejects strings which contain noncharacters. Noncharacters are not supposed to be used for interchange, but Lucy is a library, not an application, and thus should pass noncharacters cleanly.
By that argument we could also remove the check for Unicode surrogates. OTOH, passing UTF-8 to a library is a kind of interchange.
Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc reports an error, we simply leave the token alone. That seems appropriate in the case of malformed UTF-8, but I question whether it is appropriate for valid UTF-8 sequences containing control characters or non-character code points.
We should either remove the check for non-characters from utf8proc or disallow non-characters in the rest of Lucy. I'm fine with either solution.
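As a side note, the leave-it-alone fallback looks roughly like this. This is a sketch of the pattern, not the actual Normalizer source, and the helper name is made up:

/* Sketch of the fallback described above: attempt normalization and
 * signal the caller to keep the original token bytes whenever
 * utf8proc reports an error.  Returns a malloc'd normalized copy
 * and sets *out_len, or returns NULL meaning "leave the token
 * alone". */
#include <utf8proc.h>

static utf8proc_uint8_t*
maybe_normalize(const utf8proc_uint8_t *token, utf8proc_ssize_t len,
                utf8proc_ssize_t *out_len) {
    utf8proc_uint8_t *normalized = NULL;
    utf8proc_ssize_t result = utf8proc_map(token, len, &normalized,
        UTF8PROC_STABLE | UTF8PROC_COMPOSE | UTF8PROC_COMPAT);
    if (result < 0) {
        /* Malformed UTF-8, a noncharacter, etc.  utf8proc_map()
         * leaves *dstptr NULL on failure, so nothing to free. */
        return NULL;
    }
    *out_len = result;
    return normalized;
}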
+ if ((code_point & 0xFFFF) == 0xFFEF
This should check for 0xFFFE, not 0xFFEF.
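For the record, Unicode defines 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane, i.e. any code point whose low 16 bits are 0xFFFE or 0xFFFF. A rough sketch of the corrected check, with the surrogate test mentioned earlier shown for comparison (helper names are illustrative):

#include <stdbool.h>
#include <stdint.h>

/* True for the 66 Unicode noncharacters. */
static bool
is_noncharacter(uint32_t code_point) {
    if (code_point >= 0xFDD0 && code_point <= 0xFDEF) {
        return true;
    }
    return (code_point & 0xFFFF) >= 0xFFFE;   /* ...FFFE or ...FFFF */
}

/* True for surrogates (U+D800..U+DFFF).  These are not valid
 * Unicode scalar values, so well-formed UTF-8 can never encode
 * them; this check guards against malformed input rather than
 * restricting valid text. */
static bool
is_surrogate(uint32_t code_point) {
    return code_point >= 0xD800 && code_point <= 0xDFFF;
}

Nick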
