[ https://issues.apache.org/jira/browse/LUCY-179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marvin Humphrey updated LUCY-179:
---------------------------------

    Attachment: utf8_validation.patch

This patch provides a core implementation of StrHelp_utf8_valid() which can be
shared across all Lucy host languages.  It is coded to RFC 3629, referencing
the Unicode Standard 6.0 for clarifications and definitions.

  * All "Unicode scalar values" (code points 0x0 to 0xD7FF and 0xE000 to
    0x10FFFF inclusive) are accepted.
  * Code points above 0x10FFFF are rejected.
  * UTF-16 surrogates are rejected, whether paired or in isolation.
  * The entire string must consist of valid byte sequences, each made up of a
    lead byte followed by the appropriate number of continuation bytes.
  * Shortest form is enforced: overlong encodings are rejected.
  * Each byte sequence occupies a maximum of 4 bytes.
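
To make the rules above concrete, here is a minimal sketch of a validator
written against RFC 3629.  It is not the code from utf8_validation.patch; the
function name sketch_utf8_valid() and its signature are assumptions made for
illustration only, and the attached patch may be structured quite differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: hypothetical name and signature, not the
 * StrHelp_utf8_valid() implementation from the attached patch. */
bool
sketch_utf8_valid(const char *ptr, size_t size) {
    const uint8_t *s   = (const uint8_t*)ptr;
    const uint8_t *end = s + size;

    while (s < end) {
        uint8_t  lead = *s++;
        uint32_t code_point;
        uint32_t min;      /* Smallest code point legal at this length. */
        int      trailing; /* Number of continuation bytes expected. */

        if (lead < 0x80) {
            continue;      /* ASCII: one byte, always valid. */
        }
        else if (lead < 0xC2) {
            /* 0x80-0xBF: stray continuation byte.
             * 0xC0-0xC1: lead byte of an overlong 2-byte form. */
            return false;
        }
        else if (lead < 0xE0) {
            trailing = 1; min = 0x80;    code_point = lead & 0x1F;
        }
        else if (lead < 0xF0) {
            trailing = 2; min = 0x800;   code_point = lead & 0x0F;
        }
        else if (lead < 0xF5) {
            trailing = 3; min = 0x10000; code_point = lead & 0x07;
        }
        else {
            /* 0xF5-0xFF: would need more than 4 bytes or exceed 0x10FFFF. */
            return false;
        }

        /* The sequence must not run off the end of the string. */
        if ((size_t)(end - s) < (size_t)trailing) {
            return false;
        }

        /* Each continuation byte must match the bit pattern 10xxxxxx. */
        for (int i = 0; i < trailing; i++) {
            uint8_t byte = *s++;
            if ((byte & 0xC0) != 0x80) {
                return false;
            }
            code_point = (code_point << 6) | (uint32_t)(byte & 0x3F);
        }

        /* Enforce shortest form. */
        if (code_point < min) {
            return false;
        }
        /* Reject code points beyond the Unicode range. */
        if (code_point > 0x10FFFF) {
            return false;
        }
        /* Reject UTF-16 surrogates, paired or isolated. */
        if (code_point >= 0xD800 && code_point <= 0xDFFF) {
            return false;
        }
    }

    return true;
}
```

Under these assumptions, sketch_utf8_valid("\xEF\xBF\xBF", 3) returns true,
since the noncharacter U+FFFF is still a Unicode scalar value, while
sketch_utf8_valid("\xED\xA0\x80", 3) returns false, since those bytes encode
the surrogate U+D800.  Decoding the full code point and then range-checking it
keeps the sketch short; a production version could instead reject bad second
bytes directly, following the table of well-formed byte sequences in the
Unicode Standard.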

References:

  * http://www.ietf.org/rfc/rfc3629.txt
  * http://www.unicode.org/versions/Unicode6.0.0/
  * http://unicode.org/reports/tr36/  (Unicode security considerations)
  * http://lab.gsi.dit.upm.es/semanticwiki/index.php/Using_UTF-8_Encoding_to_Bypass_Validation_Logic

> Tighten UTF-8 validity checks.
> ------------------------------
>
>                 Key: LUCY-179
>                 URL: https://issues.apache.org/jira/browse/LUCY-179
>             Project: Lucy
>          Issue Type: Improvement
>          Components: Util
>            Reporter: Marvin Humphrey
>             Fix For: 0.3.0 (incubating)
>
>         Attachments: utf8_validation.patch
>
>
> Lucy currently outsources UTF-8 validity checking to the Perl C API function
> is_utf8_string().  This suffices for sanity checking of basic byte sequences
> and detecting non-shortest-form, but since is_utf8_string() only validates to
> the loose Perl internal "utf8" format[1], it allows through certain constructs
> we should probably thwart: UTF-8 coded UTF-16 surrogates (both paired and
> isolated), and code points above 0x10FFFF.
>
> Since Lucy is not an application but rather a library, we should continue to
> pass through "noncharacter" code points such as U+FFFF, which are discouraged
> for "public exchange"[2] but are allowed for internal application use.  (Such
> code points may be useful as, e.g., sentinels or separators.)  These code
> points will be allowed to end up in indexes; it will be the responsibility of
> the application to filter them at input or output.
>
> [1] http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8
> [2] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf section 3.2, clause C2

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
