On Fri, 27 Jul 2012 20:13:42 +0100
Frediano Ziglio <[email protected]> wrote:
> Hi,
> I'm currently trying to support utf-16 with characters not in plane 0.
>
> I'm currently end up with this patch. Currently is not against latest
> kernel but the problem still reside in last git kernel.
>
> wchar_t is currently 16bit so converting a utf8 encoded characters not
> in plane 0 (>= 0x10000) to wchar_t (that is calling char2uni) lead to a
> -EINVAL return. This patch detect utf8 in cifs_strtoUCS and add special
> code calling directly utf8_to_utf32.
>
> Does it sound a good patch or just a bad hack. Perhaps would be better
> to change char2uni converting to unicode_t (32bit) instead of wchar_t
> but probably many code have to be checked in order to make sure it does
> not lead to wrong conversions, overflows or other bad stuff.
>
> Is it worth working in this hacking way? I'd like to upstream this
> patch.
>
>
> diff -r c2325d754e8d fs/cifs/cifs_unicode.c
> --- a/fs/cifs/cifs_unicode.c Fri Jul 27 15:12:23 2012 +0100
> +++ b/fs/cifs/cifs_unicode.c Fri Jul 27 19:09:04 2012 +0100
> @@ -192,22 +192,40 @@ cifs_strtoUCS(__le16 *to, const char *fr
That function doesn't exist anymore. You should base this on a more
recent upstream tree.
> {
> int charlen;
> int i;
> - wchar_t *wchar_to = (wchar_t *)to; /* needed to quiet sparse */
> + int is_utf8 = !strcmp(codepage->charset, "utf8");
Gross...there must be a better way to do that?
> + wchar_t wchar_to; /* needed to quiet sparse */
> + unicode_t uni;
>
> for (i = 0; len && *from; i++, from += charlen, len -= charlen) {
>
> /* works for 2.4.0 kernel or later */
> - charlen = codepage->char2uni(from, len, &wchar_to[i]);
> + if (is_utf8) {
> + charlen = utf8_to_utf32(from, len, &uni);
> + } else {
> + charlen = codepage->char2uni(from, len, &wchar_to);
> + uni = wchar_to;
> + }
> +
> if (charlen < 1) {
> cERROR(1,
> ("strtoUCS: char2uni of %d returned %d",
> (int)*from, charlen));
> /* A question mark */
> - to[i] = cpu_to_le16(0x003f);
> + wchar_to = 0x003f;
> charlen = 1;
> - } else
> - to[i] = cpu_to_le16(wchar_to[i]);
> -
> + } else if (uni < 0x10000) {
"uni" will be unintialized here if is_utf8 is false.
> + wchar_to = uni;
> + } else if (uni < 0x110000) {
> + uni -= 0x10000;
> + to[i++] = cpu_to_le16(0xD800 | (uni >> 10));
> + wchar_to = 0xDC00 | (uni & 0x3FF);
> + } else {
> + cERROR(1,
> + ("strtoUCS: char2uni of %d returned %d",
> + (int)*from, charlen));
> + wchar_to = 0x003f;
> + }
> + to[i] = cpu_to_le16(wchar_to);
> }
>
> to[i] = 0;
>
> Signed-off-by: "Frediano Ziglio" <[email protected]>
>
> Regards,
> Frediano
>
The basic idea looks ok, but I agree that this could use some more
commenting and/or some #define'd constants. Does the conversion of on
the wire characters to the local charset need similar work?
--
Jeff Layton <[email protected]>
--
To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html