Re: [linux-usb-devel] [Bugme-new] [Bug 8310] New: USB device names are not sanitized for UTF-8

Alan Stern Tue, 05 Jun 2007 14:01:18 -0700

On Mon, 4 Jun 2007, David Brownell wrote:

> On Monday 04 June 2007, Pete Zaitcev wrote:
> > On Mon, 4 Jun 2007 16:52:01 -0400 (EDT), Alan Stern <[EMAIL PROTECTED]> wrote:
> 
> > > Does anybody think it would be worthwhile to convert string descriptors 
> > > from UCS-16 to UTF-8 (instead of Latin1) when we read them in?
> 
> Or even UTF-7 ... ?   FWIW the input isn't UCS-16; it's UTF16-LE.


Do you happen to know where "UCS-16" is defined?

> > I remember that issue. I thought that we wanted some kind of escape
> > syntax... Like what HTML uses with &#xxxx; perhaps. This would allow
> > to edit xorg.conf on systems which are not UTF-8 clean. But perhaps
> > it's a non-goal. How big is the code to convert (we need both ways,
> > right)?
> 
> How big?  Not big.  UTF-16 to UTF-8 is a simple algorithm.  For
> the reverse, see drivers/usb/gadget/usbstring.c ... the trick is
> you'd need to know enough Unicode to not goof it.  Or, to find
> some code that does it right.

Here's a patch.  Anybody see anything wrong with it?  I don't have any 
devices with non-ASCII characters in the default language descriptors 
for testing.  It would be nice if there was a library in the kernel to 
do these sorts of conversions, but there doesn't appear to be.

Nicolas, does it make your life any easier?

Alan Stern


Index: usb-2.6/drivers/usb/core/message.c
===================================================================
--- usb-2.6.orig/drivers/usb/core/message.c
+++ usb-2.6/drivers/usb/core/message.c
@@ -731,24 +731,71 @@ static int usb_string_sub(struct usb_dev
 }
 
 /**
- * usb_string - returns ISO 8859-1 version of a string descriptor
+ * utf16le_to_utf8 - convert a string encoded in UTF-16LE to UTF-8
+ * @dst: the UTF-8 output buffer
+ * @dst_len: number of bytes available in @dst
+ * @src: the UTF-16LE input buffer
+ * @src_len: number of 16-bit code units in @src
+ *
+ * Stores as many completely converted characters from @src as will fit
+ * in @dst (i.e., no partial character will remain at the end of @dst).
+ * No terminating NUL is appended to @dst.
+ *
+ * Returns the number of bytes stored in @dst.
+ */
+static int utf16le_to_utf8(u8 *dst, size_t dst_len, u8 *src, size_t src_len)
+{
+       unsigned c;
+       u8 *d, *e1, *e2, *e3;
+
+       e1 = dst + dst_len - 1;
+       e2 = e1 - 1;
+       e3 = e2 - 1;
+       for (d = dst; src_len > 0; (--src_len, src += 2)) {
+               c = src[0] | (src[1] << 8);
+               if (c < 0x80) {
+                       /*  0******* */
+                       if (d > e1)
+                               break;
+                       d[0] = c;
+                       d += 1;
+               } else if (c < 0x800) {
+                       /* 110***** 10****** */
+                       if (d > e2)
+                               break;
+                       d[0] = 0xc0 | (c >> 6);
+                       d[1] = 0x80 | (c & 0x3f);
+                       d += 2;
+               } else {
+                       /* 1110**** 10****** 10****** */
+                       if (d > e3)
+                               break;
+                       d[0] = 0xe0 | (c >> 12);
+                       d[1] = 0x80 | ((c >> 6) & 0x3f);
+                       d[2] = 0x80 | (c & 0x3f);
+                       d += 3;
+               }
+       }
+       return d - dst;
+}
+
+/**
+ * usb_string - returns UTF-8 version of a string descriptor
  * @dev: the device whose string descriptor is being retrieved
  * @index: the number of the descriptor
  * @buf: where to put the string
  * @size: how big is "buf"?
  * Context: !in_interrupt ()
  * 
- * This converts the UTF-16LE encoded strings returned by devices, from
- * usb_get_string_descriptor(), to null-terminated ISO-8859-1 encoded ones
- * that are more usable in most kernel contexts.  Note that all characters
- * in the chosen descriptor that can't be encoded using ISO-8859-1
- * are converted to the question mark ("?") character, and this function
- * chooses strings in the first language supported by the device.
+ * This retrieves a UTF-16LE encoded string from a device and converts
+ * it to a NUL-terminated UTF-8 encoded string as used by the rest of
+ * the kernel.  Note that this function chooses strings in the first
+ * language supported by the device.
  *
  * The ASCII (or, redundantly, "US-ASCII") character set is the seven-bit
- * subset of ISO 8859-1. ISO-8859-1 is the eight-bit subset of Unicode,
- * and is appropriate for use many uses of English and several other
- * Western European languages.  (But it doesn't include the "Euro" symbol.)
+ * subset of UTF-8.  Strings containing only ASCII characters appear exactly
+ * the same when encoded in UTF-8.  Characters (or "code-points") with
+ * values above 127 are encoded using multiple bytes.
  *
  * This call is synchronous, and may not be used in an interrupt context.
  *
@@ -758,7 +805,6 @@ int usb_string(struct usb_device *dev, i
 {
        unsigned char *tbuf;
        int err;
-       unsigned int u, idx;
 
        if (dev->state == USB_STATE_SUSPENDED)
                return -EHOSTUNREACH;
@@ -794,20 +840,12 @@ int usb_string(struct usb_device *dev, i
        if (err < 0)
                goto errout;
 
-       size--;         /* leave room for trailing NULL char in output buffer */
-       for (idx = 0, u = 2; u < err; u += 2) {
-               if (idx >= size)
-                       break;
-               if (tbuf[u+1])                  /* high byte */
-                       buf[idx++] = '?';  /* non ISO-8859-1 character */
-               else
-                       buf[idx++] = tbuf[u];
-       }
-       buf[idx] = 0;
-       err = idx;
+       err = utf16le_to_utf8(buf, size - 1, &tbuf[2], (err - 2) / 2);
+       buf[err] = 0;
 
        if (tbuf[1] != USB_DT_STRING)
-               dev_dbg(&dev->dev, "wrong descriptor type %02x for string %d (\"%s\")\n", tbuf[1], index, buf);
+               dev_dbg(&dev->dev, "wrong descriptor type %02x for string "
+                               "%d (\"%s\")\n", tbuf[1], index, buf);
 
  errout:
        kfree(tbuf);
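
One thing that may be worth checking in the conversion above: it encodes each 16-bit unit on its own, so a surrogate pair (a character above U+FFFF) would come out as two 3-byte sequences instead of a single 4-byte one. The following is only a userspace sketch of how pairs could be combined, not part of the patch. The function name utf16le_to_utf8_sketch and the choice to replace unpaired surrogates with U+FFFD are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace sketch (hypothetical helper, not the kernel code):
 * convert UTF-16LE to UTF-8, combining surrogate pairs into 4-byte
 * sequences.  Unpaired surrogates become U+FFFD; that policy is an
 * assumption, other choices are possible.
 * src_len counts 16-bit code units.  Returns the number of bytes
 * written to dst; no terminating NUL is appended.
 */
static size_t utf16le_to_utf8_sketch(uint8_t *dst, size_t dst_len,
                                     const uint8_t *src, size_t src_len)
{
    size_t in = 0, out = 0;

    assert(dst != NULL && src != NULL);
    while (in < src_len) {
        uint32_t c = src[2 * in] | ((uint32_t)src[2 * in + 1] << 8);
        size_t used = 1, need;

        if (c >= 0xd800 && c <= 0xdbff) {
            /* High surrogate: valid only if a low surrogate follows. */
            uint32_t lo = 0;

            if (in + 1 < src_len)
                lo = src[2 * in + 2] | ((uint32_t)src[2 * in + 3] << 8);
            if (lo >= 0xdc00 && lo <= 0xdfff) {
                c = 0x10000 + ((c - 0xd800) << 10) + (lo - 0xdc00);
                used = 2;
            } else {
                c = 0xfffd;
            }
        } else if (c >= 0xdc00 && c <= 0xdfff) {
            c = 0xfffd;     /* stray low surrogate */
        }

        need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (dst_len - out < need)
            break;          /* never store a partial character */

        if (need == 1) {
            dst[out] = (uint8_t)c;
        } else if (need == 2) {
            dst[out] = (uint8_t)(0xc0 | (c >> 6));
            dst[out + 1] = (uint8_t)(0x80 | (c & 0x3f));
        } else if (need == 3) {
            dst[out] = (uint8_t)(0xe0 | (c >> 12));
            dst[out + 1] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
            dst[out + 2] = (uint8_t)(0x80 | (c & 0x3f));
        } else {
            dst[out] = (uint8_t)(0xf0 | (c >> 18));
            dst[out + 1] = (uint8_t)(0x80 | ((c >> 12) & 0x3f));
            dst[out + 2] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
            dst[out + 3] = (uint8_t)(0x80 | (c & 0x3f));
        }
        out += need;
        in += used;
    }
    return out;
}
```

As in the patch, the caller is expected to pass one byte less than the buffer size and store the terminating NUL itself. Whether string descriptors on real devices ever contain characters outside the Basic Multilingual Plane is a separate question; the sketch only shows what handling them would involve.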


