On Fri, 27 Jul 2012 20:13:42 +0100
Frediano Ziglio <[email protected]> wrote:

> Hi,
>   I'm currently trying to support UTF-16 with characters outside plane 0.
> 
> I've ended up with this patch. It is not against the latest kernel,
> but the problem still exists in the latest git kernel.
> 
> wchar_t is currently 16 bits wide, so converting a UTF-8 encoded
> character outside plane 0 (>= 0x10000) to wchar_t (that is, calling
> char2uni) returns -EINVAL. This patch detects utf8 in cifs_strtoUCS
> and adds special code that calls utf8_to_utf32 directly.
> 
> Does this sound like a good patch or just a bad hack? Perhaps it would
> be better to change char2uni to convert to unicode_t (32 bits) instead
> of wchar_t, but a lot of code would probably have to be checked to make
> sure it does not lead to wrong conversions, overflows or other bad
> stuff.
> 
> Is it worth continuing with this hacky approach? I'd like to upstream
> this patch.
> 
> 
> diff -r c2325d754e8d fs/cifs/cifs_unicode.c
> --- a/fs/cifs/cifs_unicode.c  Fri Jul 27 15:12:23 2012 +0100
> +++ b/fs/cifs/cifs_unicode.c  Fri Jul 27 19:09:04 2012 +0100
> @@ -192,22 +192,40 @@ cifs_strtoUCS(__le16 *to, const char *fr

That function doesn't exist anymore. You should base this on a more
recent upstream tree.

>  {
>       int charlen;
>       int i;
> -     wchar_t *wchar_to = (wchar_t *)to; /* needed to quiet sparse */
> +     int is_utf8 = !strcmp(codepage->charset, "utf8");

Gross...there must be a better way to do that?


> +     wchar_t wchar_to; /* needed to quiet sparse */
> +     unicode_t uni;
>  
>       for (i = 0; len && *from; i++, from += charlen, len -= charlen) {
>  
>               /* works for 2.4.0 kernel or later */
> -             charlen = codepage->char2uni(from, len, &wchar_to[i]);
> +             if (is_utf8) {
> +                     charlen = utf8_to_utf32(from, len, &uni);
> +             } else {
> +                     charlen = codepage->char2uni(from, len, &wchar_to);
> +                     uni = wchar_to;
> +             }
> +
>               if (charlen < 1) {
>                       cERROR(1,
>                              ("strtoUCS: char2uni of %d returned %d",
>                               (int)*from, charlen));
>                       /* A question mark */
> -                     to[i] = cpu_to_le16(0x003f);
> +                     wchar_to = 0x003f;
>                       charlen = 1;
> -             } else
> -                     to[i] = cpu_to_le16(wchar_to[i]);
> -
> +             } else if (uni < 0x10000) {

        "uni" will be uninitialized here if is_utf8 is false.

> +                     wchar_to = uni;
> +             } else if (uni < 0x110000) {
> +                     uni -= 0x10000;
> +                     to[i++] = cpu_to_le16(0xD800 | (uni >> 10));
> +                     wchar_to = 0xDC00 | (uni & 0x3FF);
> +             } else {
> +                     cERROR(1,
> +                            ("strtoUCS: char2uni of %d returned %d",
> +                             (int)*from, charlen));
> +                     wchar_to = 0x003f;
> +             }
> +             to[i] = cpu_to_le16(wchar_to);
>       }
>  
>       to[i] = 0;
> 
> Signed-off-by: "Frediano Ziglio" <[email protected]>
> 
> Regards,
>   Frediano
> 

The basic idea looks ok, but I agree that this could use some more
comments and/or some #define'd constants. Does the conversion of
on-the-wire characters to the local charset need similar work?
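
To illustrate what I mean by #define'd constants, here is a rough
userspace sketch of the surrogate-pair arithmetic from the patch, plus
the reverse direction in case that path turns out to need it. Every
macro and function name below is made up for illustration and is not an
existing kernel symbol. It works on host-order values, so the
cpu_to_le16() step from the patch is left out, and it maps bad input to
'?' the same way the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for this sketch; none of these are kernel symbols. */
#define SURROGATE_HIGH_BASE 0xD800u   /* first high (lead) surrogate */
#define SURROGATE_LOW_BASE  0xDC00u   /* first low (trail) surrogate */
#define SURROGATE_LOW_MAX   0xDFFFu   /* last low (trail) surrogate */
#define SURROGATE_BITS      10        /* payload bits per surrogate */
#define SURROGATE_MASK      0x3FFu    /* low 10 bits */
#define PLANE1_BASE         0x10000u  /* first code point outside plane 0 */
#define UNICODE_LIMIT       0x110000u /* one past the last valid code point */
#define REPLACEMENT_CHAR    0x003Fu   /* '?', as used by the patch */

/*
 * Encode one code point as host-order UTF-16.
 * Returns the number of 16-bit units written to out (1 or 2).
 */
int utf16_encode(uint32_t uni, uint16_t *out)
{
	assert(out != NULL);

	if (uni >= UNICODE_LIMIT) {
		/* Not a valid code point: substitute a question mark. */
		out[0] = REPLACEMENT_CHAR;
		return 1;
	}
	if (uni < PLANE1_BASE) {
		/* Plane 0 fits in a single unit. */
		out[0] = (uint16_t)uni;
		return 1;
	}

	/* Split the 20-bit offset into two 10-bit halves. */
	uni -= PLANE1_BASE;
	out[0] = (uint16_t)(SURROGATE_HIGH_BASE | (uni >> SURROGATE_BITS));
	out[1] = (uint16_t)(SURROGATE_LOW_BASE | (uni & SURROGATE_MASK));
	return 2;
}

/*
 * Decode one code point from host-order UTF-16.
 * Returns the number of units consumed (1 or 2), or 0 if len is 0.
 */
int utf16_decode(const uint16_t *in, size_t len, uint32_t *uni)
{
	assert(in != NULL && uni != NULL);

	if (len == 0)
		return 0;

	if (in[0] >= SURROGATE_HIGH_BASE && in[0] < SURROGATE_LOW_BASE &&
	    len >= 2 &&
	    in[1] >= SURROGATE_LOW_BASE && in[1] <= SURROGATE_LOW_MAX) {
		/* Well-formed pair: recombine the two 10-bit halves. */
		*uni = PLANE1_BASE +
		       (((uint32_t)(in[0] - SURROGATE_HIGH_BASE) << SURROGATE_BITS) |
			(in[1] - SURROGATE_LOW_BASE));
		return 2;
	}
	if (in[0] >= SURROGATE_HIGH_BASE && in[0] <= SURROGATE_LOW_MAX) {
		/* Unpaired surrogate: substitute a question mark. */
		*uni = REPLACEMENT_CHAR;
		return 1;
	}

	*uni = in[0];
	return 1;
}
```

With that, U+1F600 encodes to the two units 0xD83D 0xDE00 and decodes
back to the same code point. In the real function the caller would still
have to advance "i" by the returned unit count and make sure the
destination buffer has room for the second unit.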

-- 
Jeff Layton <[email protected]>