On 14/12/2011 01:28, Marvin Humphrey wrote:
> I just committed a test to trunk which verifies that utf8proc's normalization
> works properly, in that normalizing a second time is a no-op.  However, I had
> to disable the test because utf8proc chokes when fed strings which contain
> either control characters or non-character code points.

You're right that utf8proc doesn't allow non-characters, but I don't think that control characters are blocked.

> contain noncharacters.  Noncharacters are not supposed to be used for
> interchange, but Lucy is a library, not an application, and thus should pass
> noncharacters cleanly.

By that argument we could also remove the check for Unicode surrogates. OTOH, passing UTF-8 to a library is a kind of interchange.

> Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc
> reports an error, we simply leave the token alone.  That seems appropriate in
> the case of malformed UTF-8, but I question whether it is appropriate for
> valid UTF-8 sequences containing control characters or non-character code
> points.

We should either remove the check for non-characters from utf8proc or disallow non-characters elsewhere in Lucy. I'm fine with either solution.

> +        if ((code_point & 0xFFFF) == 0xFFEF

This should check for 0xFFFE.

Nick
