Marvin Humphrey wrote on 12/13/11 6:28 PM:
> Greets,
>
> I just committed a test to trunk which verifies that utf8proc's normalization
> works properly, in that normalizing a second time is a no-op. However, I had
> to disable the test because utf8proc chokes when fed strings which contain
> either control characters or non-character code points.
>
> http://svn.apache.org/viewvc?view=revision&revision=1213996
>
> The test uses random UTF-8 data, generated by TestUtils_random_string(). With
> the hack below my sig, the test passes.
>
> Strings which contain control characters are valid UTF-8, as are strings which
> contain noncharacters. Noncharacters are not supposed to be used for
> interchange, but Lucy is a library, not an application, and thus should pass
> noncharacters cleanly.
>
> http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
>
> Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc
> reports an error, we simply leave the token alone. That seems appropriate in
> the case of malformed UTF-8, but I question whether it is appropriate for
> valid UTF-8 sequences containing control characters or non-character code
> points.
Swish3 uses the \003 control character as an internal field delimiter, so
passing that through is pretty vital. Are you saying that utf8proc chokes on
that valid UTF-8 sequence?
>
> Index: core/Lucy/Test/TestUtils.c
> ===================================================================
> --- core/Lucy/Test/TestUtils.c (revision 1213967)
> +++ core/Lucy/Test/TestUtils.c (working copy)
> @@ -17,6 +17,7 @@
> #define C_LUCY_TESTUTILS
> #include "Lucy/Util/ToolSet.h"
> #include <string.h>
> +#include <ctype.h>
>
> #include "Lucy/Test/TestUtils.h"
> #include "Lucy/Test.h"
> @@ -106,6 +107,15 @@
> if (code_point > 0xD7FF && code_point < 0xE000) {
> continue; // UTF-16 surrogate.
> }
> + if (iscntrl(code_point)) {
> + continue;
> + }
> + if ((code_point & 0xFFFF) == 0xFFFE
> + || (code_point & 0xFFFF) == 0xFFFF
> + || (code_point >= 0xFDD0 && code_point <= 0xFDEF)
> + ) {
> + continue; // Unicode non-character code point.
> + }
> break;
> }
> return code_point;
>
>
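For what it's worth, here is a rough standalone sketch of the two checks the
hack seems to be after. The helper names is_noncharacter() and is_control()
are made up for illustration and are not part of Lucy or utf8proc. I'm assuming
the per-plane noncharacters are meant to be U+xxFFFE and U+xxFFFF, and I'm
using explicit ranges rather than iscntrl(), since iscntrl() is only defined
for values that fit in an unsigned char (or EOF).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true for the 66 Unicode noncharacters, i.e.
 * U+FDD0..U+FDEF plus the last two code points of each of the 17 planes
 * (U+xxFFFE and U+xxFFFF). */
static bool
is_noncharacter(uint32_t code_point) {
    if (code_point > 0x10FFFF) {
        return false; /* Outside the Unicode code space entirely. */
    }
    if (code_point >= 0xFDD0 && code_point <= 0xFDEF) {
        return true;
    }
    /* Masking with 0xFFFE matches both ...FFFE and ...FFFF in any plane. */
    return (code_point & 0xFFFE) == 0xFFFE;
}

/* Hypothetical helper: true for C0 controls (U+0000..U+001F), DEL (U+007F),
 * and C1 controls (U+0080..U+009F). */
static bool
is_control(uint32_t code_point) {
    return code_point <= 0x1F
           || (code_point >= 0x7F && code_point <= 0x9F);
}
```

With something like this, the random-string generator could skip a code point
when either helper returns true, and \003 would show up as a control character
rather than depending on what iscntrl() does with large values.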
--
Peter Karman . http://peknet.com/ . [email protected]