Re: svn commit: r265095 - head/lib/libc/locale

2014-04-30 Thread Pedro Giffuni


On 04/30/14 16:10, Jilles Tjoelker wrote:

On Tue, Apr 29, 2014 at 03:25:57PM +0000, Pedro F. Giffuni wrote:

Author: pfg
Date: Tue Apr 29 15:25:57 2014
New Revision: 265095
URL: http://svnweb.freebsd.org/changeset/base/265095
Log:
   citrus: Avoid invalid code points.
   
   From the OpenBSD log:

   The UTF-8 decoder should not accept byte sequences which decode to unicode
   code positions U+D800 to U+DFFF (UTF-16 surrogates), U+FFFE, and U+FFFF.
   http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
   http://unicode.org/faq/utf_bom.html#utf8-4
   Reported by: Stefan Sperling
   Obtained from:   OpenBSD
   MFC after:   5 days
Modified:
   head/lib/libc/locale/utf8.c
Modified: head/lib/libc/locale/utf8.c
==============================================================================
--- head/lib/libc/locale/utf8.c	Tue Apr 29 15:12:23 2014	(r265094)
+++ head/lib/libc/locale/utf8.c	Tue Apr 29 15:25:57 2014	(r265095)
@@ -203,6 +203,14 @@ _UTF8_mbrtowc(wchar_t * __restrict pwc,
 		errno = EILSEQ;
 		return ((size_t)-1);
 	}
+	if ((wch >= 0xd800 && wch <= 0xdfff) ||
+	    wch == 0xfffe || wch == 0xffff) {
+		/*
+		 * Malformed input; invalid code points.
+		 */
+		errno = EILSEQ;
+		return ((size_t)-1);
+	}
 	if (pwc != NULL)
 		*pwc = wch;
 	us->want = 0;

Hmm, I think U+FFFE and U+FFFF should be passed through normally.
According to http://www.unicode.org/faq/private_use.html they are
"noncharacters" (basically a more private variant of private-use
characters) and must be mapped through UTFs.

The part that rejects U+D800 to U+DFFF is definitely correct, though.
http://unicode.org/faq/utf_bom.html#utf8-4 says to do only that.

The part about U+FFFE and U+FFFF in
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 seems out of date.
Note the last modified date of that page: 2009-05-11.

On another note, everything above U+10FFFF should perhaps be rejected
since those codes, which cannot be encoded in UTF-16, were excluded from
Unicode and ISO 10646.



Thank you! I will fix the UTF-8 part soon.

Pedro.
___
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r265095 - head/lib/libc/locale

2014-04-30 Thread Jilles Tjoelker
On Tue, Apr 29, 2014 at 03:25:57PM +0000, Pedro F. Giffuni wrote:
> Author: pfg
> Date: Tue Apr 29 15:25:57 2014
> New Revision: 265095
> URL: http://svnweb.freebsd.org/changeset/base/265095

> Log:
>   citrus: Avoid invalid code points.
>   
>   From the OpenBSD log:
>   The UTF-8 decoder should not accept byte sequences which decode to unicode
>   code positions U+D800 to U+DFFF (UTF-16 surrogates), U+FFFE, and U+FFFF.

>   http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
>   http://unicode.org/faq/utf_bom.html#utf8-4

>   Reported by:    Stefan Sperling
>   Obtained from:  OpenBSD
>   MFC after:      5 days

> Modified:
>   head/lib/libc/locale/utf8.c

> Modified: head/lib/libc/locale/utf8.c
> ==============================================================================
> --- head/lib/libc/locale/utf8.c	Tue Apr 29 15:12:23 2014	(r265094)
> +++ head/lib/libc/locale/utf8.c	Tue Apr 29 15:25:57 2014	(r265095)
> @@ -203,6 +203,14 @@ _UTF8_mbrtowc(wchar_t * __restrict pwc,
>  		errno = EILSEQ;
>  		return ((size_t)-1);
>  	}
> +	if ((wch >= 0xd800 && wch <= 0xdfff) ||
> +	    wch == 0xfffe || wch == 0xffff) {
> +		/*
> +		 * Malformed input; invalid code points.
> +		 */
> +		errno = EILSEQ;
> +		return ((size_t)-1);
> +	}
>  	if (pwc != NULL)
>  		*pwc = wch;
>  	us->want = 0;

Hmm, I think U+FFFE and U+FFFF should be passed through normally.
According to http://www.unicode.org/faq/private_use.html they are
"noncharacters" (basically a more private variant of private-use
characters) and must be mapped through UTFs.

The part that rejects U+D800 to U+DFFF is definitely correct, though.
http://unicode.org/faq/utf_bom.html#utf8-4 says to do only that.

The part about U+FFFE and U+FFFF in
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 seems out of date.
Note the last modified date of that page: 2009-05-11.

On another note, everything above U+10FFFF should perhaps be rejected
since those codes, which cannot be encoded in UTF-16, were excluded from
Unicode and ISO 10646.
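
As a rough sketch of the check suggested above: reject the surrogate
range and anything past U+10FFFF, and let noncharacters such as U+FFFE
and U+FFFF through. The helper name is_acceptable_code_point() is
hypothetical and is not part of utf8.c; in _UTF8_mbrtowc() the same
test would set errno to EILSEQ and return (size_t)-1 on failure.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in utf8.c): decide whether a decoded
 * code point should be accepted by a UTF-8 decoder.
 */
static bool
is_acceptable_code_point(uint32_t wch)
{
	/* UTF-16 surrogates must never appear in UTF-8. */
	if (wch >= 0xd800 && wch <= 0xdfff)
		return (false);
	/* Values above U+10FFFF cannot be encoded in UTF-16. */
	if (wch > 0x10ffff)
		return (false);
	/* Noncharacters such as U+FFFE and U+FFFF pass through. */
	return (true);
}
```

With this sketch, 0xd800 and 0x110000 are rejected, while 0xfffe,
0xffff and 0x10ffff are accepted.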

-- 
Jilles Tjoelker


svn commit: r265095 - head/lib/libc/locale

2014-04-29 Thread Pedro F. Giffuni
Author: pfg
Date: Tue Apr 29 15:25:57 2014
New Revision: 265095
URL: http://svnweb.freebsd.org/changeset/base/265095

Log:
  citrus: Avoid invalid code points.
  
  From the OpenBSD log:
  The UTF-8 decoder should not accept byte sequences which decode to unicode
  code positions U+D800 to U+DFFF (UTF-16 surrogates), U+FFFE, and U+FFFF.
  
  http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
  http://unicode.org/faq/utf_bom.html#utf8-4
  
  Reported by:    Stefan Sperling
  Obtained from:  OpenBSD
  MFC after:      5 days

Modified:
  head/lib/libc/locale/utf8.c

Modified: head/lib/libc/locale/utf8.c
==============================================================================
--- head/lib/libc/locale/utf8.c	Tue Apr 29 15:12:23 2014	(r265094)
+++ head/lib/libc/locale/utf8.c	Tue Apr 29 15:25:57 2014	(r265095)
@@ -203,6 +203,14 @@ _UTF8_mbrtowc(wchar_t * __restrict pwc,
 		errno = EILSEQ;
 		return ((size_t)-1);
 	}
+	if ((wch >= 0xd800 && wch <= 0xdfff) ||
+	    wch == 0xfffe || wch == 0xffff) {
+		/*
+		 * Malformed input; invalid code points.
+		 */
+		errno = EILSEQ;
+		return ((size_t)-1);
+	}
 	if (pwc != NULL)
 		*pwc = wch;
 	us->want = 0;
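
For illustration only, the code points named in the log all arrive as
three-byte sequences. The sketch below uses a hypothetical decode3()
helper, which is not the decoder in utf8.c and does no validation of
the lead or continuation bytes. It shows that ED A0 80 decodes to
U+D800, ED BF BF to U+DFFF, EF BF BE to U+FFFE and EF BF BF to U+FFFF.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not in utf8.c): combine the payload bits of
 * one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
 * The caller is assumed to have checked the byte prefixes already.
 */
static uint32_t
decode3(const unsigned char *s)
{
	return (((uint32_t)(s[0] & 0x0f) << 12) |
	    ((uint32_t)(s[1] & 0x3f) << 6) |
	    (uint32_t)(s[2] & 0x3f));
}
```

The range check added in this revision runs on the assembled value,
after a step like this one, which is why it can be written purely in
terms of wch.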