On Mon, 4 Jun 2007, David Brownell wrote:

> On Monday 04 June 2007, Pete Zaitcev wrote:
> > On Mon, 4 Jun 2007 16:52:01 -0400 (EDT), Alan Stern <[EMAIL PROTECTED]>
> > wrote:
>
> > > Does anybody think it would be worthwhile to convert string descriptors
> > > from UCS-16 to UTF-8 (instead of Latin1) when we read them in?
>
> Or even UTF-7 ... ?  FWIW the input isn't UCS-16; it's UTF16-LE.

Do you happen to know where "UCS-16" is defined?

> > I remember that issue. I thought that we wanted some kind of escape
> > syntax... Like what HTML uses with &#xxxx; perhaps. This would allow
> > to edit xorg.conf on systems which are not UTF-8 clean. But perhaps
> > it's a non-goal. How big is the code to convert (we need both ways,
> > right)?
>
> How big?  Not big.  UTF-16 to UTF-8 is a simple algorithm.  For
> the reverse, see drivers/usb/gadget/usbstring.c ... the trick is
> you'd need to know enough Unicode to not goof it.  Or, to find
> some code that does it right.

Here's a patch.  Anybody see anything wrong with it?  I don't have any
devices with non-ASCII characters in the default language descriptors for
testing.

It would be nice if there was a library in the kernel to do these sorts of
conversions, but there doesn't appear to be.

Nicolas, does it make your life any easier?

Alan Stern


Index: usb-2.6/drivers/usb/core/message.c
===================================================================
--- usb-2.6.orig/drivers/usb/core/message.c
+++ usb-2.6/drivers/usb/core/message.c
@@ -731,24 +731,71 @@ static int usb_string_sub(struct usb_dev
 }
 
 /**
- * usb_string - returns ISO 8859-1 version of a string descriptor
+ * utf16le_to_utf8 - convert a string encoded in UTF-16LE to UTF-8
+ * @dst: the UTF-8 output buffer
+ * @dst_len: number of bytes available in @dest
+ * @src: the UTF-16LE input buffer
+ * @src_len: number of two-byte characters in @src
+ *
+ * Stores as many completely converted characters from @src as will fit
+ * in @dst (i.e., no partial character will remain at the end of @dst).
+ * No terminating NULL is appended to @dst.
+ *
+ * Returns the number of bytes stored in @dst.
+ */
+static int utf16le_to_utf8(u8 *dst, size_t dst_len, u8 *src, size_t src_len)
+{
+	unsigned c;
+	u8 *d, *e1, *e2, *e3;
+
+	e1 = dst + dst_len - 1;
+	e2 = e1 - 1;
+	e3 = e2 - 1;
+	for (d = dst; src_len > 0; (--src_len, src += 2)) {
+		c = src[0] | (src[1] << 8);
+		if (c < 0x80) {
+			/* 0******* */
+			if (d > e1)
+				break;
+			d[0] = c;
+			d += 1;
+		} else if (c < 0x800) {
+			/* 110***** 10****** */
+			if (d > e2)
+				break;
+			d[0] = 0xc0 | (c >> 6);
+			d[1] = 0x80 | (c & 0x3f);
+			d += 2;
+		} else {
+			/* 1110**** 10****** 10****** */
+			if (d > e3)
+				break;
+			d[0] = 0xe0 | (c >> 12);
+			d[1] = 0x80 | ((c >> 6) & 0x3f);
+			d[2] = 0x80 | (c & 0x3f);
+			d += 3;
+		}
+	}
+	return d - dst;
+}
+
+/**
+ * usb_string - returns UTF-8 version of a string descriptor
  * @dev: the device whose string descriptor is being retrieved
  * @index: the number of the descriptor
  * @buf: where to put the string
  * @size: how big is "buf"?
  * Context: !in_interrupt ()
  *
- * This converts the UTF-16LE encoded strings returned by devices, from
- * usb_get_string_descriptor(), to null-terminated ISO-8859-1 encoded ones
- * that are more usable in most kernel contexts.  Note that all characters
- * in the chosen descriptor that can't be encoded using ISO-8859-1
- * are converted to the question mark ("?") character, and this function
- * chooses strings in the first language supported by the device.
+ * This retrieves a UTF-16LE encoded string from a device and converts
+ * it to a NULL-terminated UTF-8 encoded string as used by the rest of
+ * the kernel.  Note that this function chooses strings in the first
+ * language supported by the device.
  *
  * The ASCII (or, redundantly, "US-ASCII") character set is the seven-bit
- * subset of ISO 8859-1. ISO-8859-1 is the eight-bit subset of Unicode,
- * and is appropriate for use many uses of English and several other
- * Western European languages.  (But it doesn't include the "Euro" symbol.)
+ * subset of UTF-8.  Strings containing only ASCII characters appear exactly
+ * the same when encoded in UTF-8.  Characters (or "code-points") with
+ * values above 127 are encoded using multiple bytes.
  *
  * This call is synchronous, and may not be used in an interrupt context.
  *
@@ -758,7 +805,6 @@ int usb_string(struct usb_device *dev, i
 {
 	unsigned char *tbuf;
 	int err;
-	unsigned int u, idx;
 
 	if (dev->state == USB_STATE_SUSPENDED)
 		return -EHOSTUNREACH;
@@ -794,20 +840,12 @@ int usb_string(struct usb_device *dev, i
 	if (err < 0)
 		goto errout;
 
-	size--;		/* leave room for trailing NULL char in output buffer */
-	for (idx = 0, u = 2; u < err; u += 2) {
-		if (idx >= size)
-			break;
-		if (tbuf[u+1])			/* high byte */
-			buf[idx++] = '?';  /* non ISO-8859-1 character */
-		else
-			buf[idx++] = tbuf[u];
-	}
-	buf[idx] = 0;
-	err = idx;
+	err = utf16le_to_utf8(buf, size - 1, &tbuf[2], (err - 2) / 2);
+	buf[err] = 0;
 
 	if (tbuf[1] != USB_DT_STRING)
-		dev_dbg(&dev->dev, "wrong descriptor type %02x for string %d (\"%s\")\n", tbuf[1], index, buf);
+		dev_dbg(&dev->dev, "wrong descriptor type %02x for string "
+				"%d (\"%s\")\n", tbuf[1], index, buf);
 
  errout:
 	kfree(tbuf);
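
For anyone wanting to exercise the conversion without hardware, here is a
minimal user-space sketch of the same loop. One thing it does differently
from the patch: the patch converts each 16-bit unit on its own, so a
surrogate pair (a code point above U+FFFF) would come out as two 3-byte
sequences rather than one 4-byte UTF-8 sequence. The name
utf16le_to_utf8_sp, the use of stdint types, and the choice to replace an
unpaired surrogate with U+FFFD are all assumptions made for illustration;
none of them come from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space variant of utf16le_to_utf8() from the patch.
 * Assumptions (not in the patch): surrogate pairs are combined into a
 * single code point and emitted as 4 bytes; an unpaired surrogate is
 * replaced by U+FFFD.  src_len counts 16-bit units, as in the patch.
 * Returns the number of bytes stored in dst; no terminator is appended.
 */
static size_t utf16le_to_utf8_sp(uint8_t *dst, size_t dst_len,
				 const uint8_t *src, size_t src_len)
{
	uint8_t *d = dst;
	size_t i = 0;

	while (i < src_len) {
		uint32_t c = src[2 * i] | ((uint32_t)src[2 * i + 1] << 8);
		size_t used = 1;	/* 16-bit units consumed */
		size_t need;		/* UTF-8 bytes required */

		if (c >= 0xd800 && c <= 0xdbff) {
			/* high surrogate: needs a low surrogate after it */
			uint32_t lo = 0;

			if (i + 1 < src_len)
				lo = src[2 * i + 2] |
				     ((uint32_t)src[2 * i + 3] << 8);
			if (lo >= 0xdc00 && lo <= 0xdfff) {
				c = 0x10000 + ((c - 0xd800) << 10) +
				    (lo - 0xdc00);
				used = 2;
			} else {
				c = 0xfffd;	/* unpaired high surrogate */
			}
		} else if (c >= 0xdc00 && c <= 0xdfff) {
			c = 0xfffd;		/* unpaired low surrogate */
		}

		need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
		if ((size_t)(dst + dst_len - d) < need)
			break;		/* never store a partial character */

		switch (need) {
		case 1:
			d[0] = (uint8_t)c;
			break;
		case 2:
			d[0] = (uint8_t)(0xc0 | (c >> 6));
			d[1] = (uint8_t)(0x80 | (c & 0x3f));
			break;
		case 3:
			d[0] = (uint8_t)(0xe0 | (c >> 12));
			d[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
			d[2] = (uint8_t)(0x80 | (c & 0x3f));
			break;
		default:
			d[0] = (uint8_t)(0xf0 | (c >> 18));
			d[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3f));
			d[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
			d[3] = (uint8_t)(0x80 | (c & 0x3f));
			break;
		}
		d += need;
		i += used;
	}
	return (size_t)(d - dst);
}
```

Feeding it the UTF-16LE bytes for "A", U+00E9, U+20AC and the pair
D83D DE00 exercises the 1-, 2-, 3- and 4-byte branches in turn, and calling
it with a deliberately short dst_len checks that the output stops on a
character boundary.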