Re: Unicode USB strings conversion

Tim Kientzle Wed, 19 Nov 2008 23:13:23 -0800

Nick Hibma wrote:

In the USB code (and I bet it is the same in the USB4BSD code) unicode characters in strings are converted in a very crude way to ASCII. As I have a user on the line who sees rubbish in his logs and when using usbctl/usbdevs/etc., I bet this is the problem.
I'd like to try and fix this problem by using libkern/libiconv.
1) Is this the right approach to convert UTF8 to printable string in the kernel?
2) Is this needed at all in the short term future? I remember seeing attempts at making the kernel use UTF8.
3) Does anyone know of a good example in the code without me having to hunt through the kernel to find it?
For reference: The code that needs replacing is:

usbd_get_string():

        s = buf;
        n = size / 2 - 1;
        for (i = 0; i < n && i < len - 1; i++) {
                c = UGETW(us.bString[i]);
                /* Convert from Unicode, handle buggy strings. */
                if ((c & 0xff00) == 0)
                        *s++ = c;
                else if ((c & 0x00ff) == 0 && swap)
                        *s++ = c >> 8;
                else
                        *s++ = '?';
        }
        *s++ = 0;
I haven't got the USB specs handy, but I believe that this is a simple way of converting LE and BE UTF8 to ASCII.


First, get your terminology straight.  It looks
like UGETW() is returning 16-bit Unicode code points.
That would be UTF-16, not UTF-8.  UTF-8 is a popular
multibyte encoding which uses 1 to 4 bytes per character.
ASCII values (less than 128) get preserved, anything else
gets encoded.

There are two problems with UTF-16:  First is determining
the byte order.  Second is that nobody displays UTF-16
directly.  (Well, almost nobody.)

The code above is fine if you're sure you're getting ASCII
(it looks at each character and guesses the byte order)
but is otherwise pretty lame.  You didn't show the code
that set the 'swap' variable.

If you want legible output, your best option by
far is to actually convert it to UTF-8 and emit that.  That
still preserves ASCII, but gives a chance of viewing
non-ASCII in a suitable terminal program.  (And there
are even a couple of folks looking into UTF-8 support for
syscons.)

<rolling up sleeves>  The basic UTF-16 to UTF-8
conversion is pretty simple:

     if (c < 0x80) { *s++ = c; }              /* 1 byte: U+0000..U+007F */
     else if (c < 0x800) {                    /* 2 bytes: up to U+07FF */
        *s++ = 0xc0 | ((c >> 6) & 0x1f);
        *s++ = 0x80 | (c & 0x3f);
     } else if (c < 0x10000) {                /* 3 bytes: up to U+FFFF */
        *s++ = 0xe0 | ((c >> 12) & 0x0f);
        *s++ = 0x80 | ((c >> 6) & 0x3f);
        *s++ = 0x80 | (c & 0x3f);
     } else {                                 /* 4 bytes: up to U+10FFFF */
        *s++ = 0xf0 | ((c >> 18) & 0x07);
        *s++ = 0x80 | ((c >> 12) & 0x3f);
        *s++ = 0x80 | ((c >> 6) & 0x3f);
        *s++ = 0x80 | (c & 0x3f);
     }

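This assumes that 'c' is a Unicode code point
in native byte order.  A single 16-bit UTF-16 unit never
reaches the last branch; characters above 0xFFFF arrive as
a surrogate pair, which has to be combined into one code
point first.  If you really don't know the
byte order, you'll need to find some way to guess.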
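
For what it's worth, here is a rough sketch of how the whole conversion
could look as a standalone function, including surrogate pairs and a
bounded output buffer.  The name utf16_to_utf8() and its signature are
made up for illustration; nothing like it is claimed to exist in the
USB code.  It assumes the input is already in native byte order and
substitutes U+FFFD for unpaired surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the existing USB code): convert 'n'
 * native-byte-order UTF-16 code units to a NUL-terminated UTF-8 string.
 * Returns the number of bytes written, not counting the NUL.  Output is
 * truncated at a character boundary if 'dst' is too small.
 */
static size_t
utf16_to_utf8(const uint16_t *src, size_t n, char *dst, size_t dstsize)
{
	size_t i, o = 0;

	if (dstsize == 0)
		return (0);
	for (i = 0; i < n; i++) {
		uint32_t c = src[i];
		size_t need;

		if (c >= 0xd800 && c <= 0xdbff && i + 1 < n &&
		    src[i + 1] >= 0xdc00 && src[i + 1] <= 0xdfff) {
			/* Valid surrogate pair: combine into one code point. */
			c = 0x10000 + ((c - 0xd800) << 10) +
			    (uint32_t)(src[i + 1] - 0xdc00);
			i++;
		} else if (c >= 0xd800 && c <= 0xdfff) {
			/* Unpaired surrogate: substitute U+FFFD. */
			c = 0xfffd;
		}

		need = (c < 0x80) ? 1 : (c < 0x800) ? 2 : (c < 0x10000) ? 3 : 4;
		if (o + need >= dstsize)	/* keep room for the NUL */
			break;

		if (c < 0x80) {
			dst[o++] = (char)c;
		} else if (c < 0x800) {
			dst[o++] = (char)(0xc0 | (c >> 6));
			dst[o++] = (char)(0x80 | (c & 0x3f));
		} else if (c < 0x10000) {
			dst[o++] = (char)(0xe0 | (c >> 12));
			dst[o++] = (char)(0x80 | ((c >> 6) & 0x3f));
			dst[o++] = (char)(0x80 | (c & 0x3f));
		} else {
			dst[o++] = (char)(0xf0 | (c >> 18));
			dst[o++] = (char)(0x80 | ((c >> 12) & 0x3f));
			dst[o++] = (char)(0x80 | ((c >> 6) & 0x3f));
			dst[o++] = (char)(0x80 | (c & 0x3f));
		}
	}
	dst[o] = '\0';
	return (o);
}
```

Checking the space needed before writing means a too-small buffer ends
on a whole character rather than half of a multibyte sequence.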

One way to guess is to assume that ASCII characters
are common, in which case, you'll see things with the
high order byte 0.  In some environments, a "Byte-order mark"
is used as the first character.  This is character 0xFEFF.
(The byte-swapped 0xFFFE is illegal, so if you see that,
you know you've got the wrong byte order.)
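
Both heuristics can be rolled into one small function.  This is only a
sketch: utf16_guess_order() and the enum are invented names, and
"native" here just means "as the values were read", e.g. by UGETW().
It checks for a byte-order mark first, then counts which byte is more
often zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for the guess below. */
enum utf16_order { UTF16_NATIVE, UTF16_SWAPPED, UTF16_UNKNOWN };

/*
 * Hypothetical helper: guess whether the 'n' 16-bit values in 's' are
 * in the right byte order or need swapping.
 */
static enum utf16_order
utf16_guess_order(const uint16_t *s, size_t n)
{
	size_t i, native = 0, swapped = 0;

	/* A byte-order mark, if present, settles the question. */
	if (n > 0 && s[0] == 0xfeff)
		return (UTF16_NATIVE);
	if (n > 0 && s[0] == 0xfffe)
		return (UTF16_SWAPPED);

	/* Otherwise assume ASCII is common and see which byte is zero. */
	for (i = 0; i < n; i++) {
		if (s[i] == 0)
			continue;		/* tells us nothing */
		if ((s[i] & 0xff00) == 0)
			native++;
		else if ((s[i] & 0x00ff) == 0)
			swapped++;
	}
	if (native > swapped)
		return (UTF16_NATIVE);
	if (swapped > native)
		return (UTF16_SWAPPED);
	return (UTF16_UNKNOWN);
}
```

Deciding once for the whole string, rather than per character as the
original loop does, avoids mixing both interpretations within one
string.  A string with no ASCII at all still comes back as unknown.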

Good luck!

Tim

_______________________________________________
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers