On Friday 01 December 2006 16:46, Jason Tackaberry wrote:
> On Fri, 2006-12-01 at 16:36 +0100, Duncan Webb wrote:
> > First, when a string is a Unicode string does this mean that every
> > character is 2 or 4 bytes wide?
>
> Not necessarily. Depends on the encoding. This isn't the case for
> latin1 and UTF8.

But Duncan asked about Unicode strings; how can those be latin1 or UTF-8?
AFAIK, u"Hans" will always be a 16-bit string. Let's see... interesting:
http://docs.python.org/api/unicodeObjects.html says:

    Python's default builds use a 16-bit type for Py_UNICODE and store
    Unicode values internally as UCS2. It is also possible to build a UCS4
    version of Python (most recent Linux distributions come with UCS4
    builds of Python). These builds then use a 32-bit type for Py_UNICODE
    and store Unicode data internally as UCS4. On platforms where wchar_t
    is available and compatible with the chosen Python Unicode build
    variant, Py_UNICODE is a typedef alias for wchar_t to enhance native
    platform compatibility. On all other platforms, Py_UNICODE is a
    typedef alias for either unsigned short (UCS2) or unsigned long (UCS4).

You may encode Unicode strings down to 8-bit strings with the "encode"
member function: u"Hans".encode("utf-8") will give you an 8-bit string
again, but that string has no property which says that it is UTF-8
encoded, which is why using Unicode objects where possible is best for
i18n.

> > Second, file names from a fat system seem to be in latin1 but on the
> > ext2/3 are in utf8. How can they be processed in a safe way without
> > causing UnicodeErrors?
>
> Firstly, the encoding type is not always utf8 on ext3. The filesystem
> encoding can be gotten via sys.getfilesystemencoding(), but that doesn't
> mean a filename isn't encoded latin1 anyway. Consequently, you must
> never use unicode for storing filenames, and always keep them as str
> objects.

I agree with this part of your answer, but...

> For purposes of displaying a filename you can then convert to unicode
> for proper display. kaa.strutils.str_to_unicode attempts to do the
> right thing when you don't know whether a string is encoded latin1 or
> utf8. (kaa.strutils is in kaa.base, you can just copy that function
> into the 1.x tree if you need it.)

..I find this misleading, since with a properly set up system, you *should*
know which encoding the filename has. I am used to Qt, which has
QFile.encodeName() and QFile.decodeName() for proper 8-bit<->Unicode
conversions.
What a pity that Python lacks such useful functions. But as I see now,
str_to_unicode properly tries the user's locale first.

I wonder if there should be an additional filename_to_unicode function
which uses sys.getfilesystemencoding() instead of strutils.ENCODING?

    import sys

    # existing helper in kaa.base, as mentioned above
    from kaa.strutils import str_to_unicode

    def path_to_unicode(s):
        """
        Attempts to convert a local filesystem path to a unicode string.
        First it tries to decode the string based on
        sys.getfilesystemencoding(). If that fails, it uses
        str_to_unicode() as a fallback (which in turn tries the locale's
        preferred encoding, UTF-8, and latin-1 in order).
        """
        # Anything that is not a byte string (e.g. already unicode) is
        # returned unchanged.
        if not isinstance(s, str):
            return s
        try:
            return s.decode(sys.getfilesystemencoding())
        except UnicodeDecodeError:
            pass
        return str_to_unicode(s)

-- 
Ciao, /  /                      .o.
     /--/                       ..o
    /  / ANS                    ooo
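
P.S. For anyone who wants to try the fallback idea without kaa.base installed, here is a minimal self-contained sketch. It is only an illustration under assumptions: it is written for Python 3, where the old str corresponds to bytes and unicode to str, and the name decode_path and its encodings parameter are hypothetical, not part of kaa.strutils.

```python
import locale
import sys


def decode_path(raw, encodings=None):
    """Decode a bytes filename by trying several encodings in order.

    Hypothetical helper for illustration; not part of kaa.strutils.
    """
    # Text input is returned unchanged, like path_to_unicode does
    # for anything that is not a byte string.
    if not isinstance(raw, bytes):
        return raw
    if encodings is None:
        # Assumed default order: filesystem encoding first, then the
        # locale's preferred encoding, then UTF-8.
        encodings = [
            sys.getfilesystemencoding(),
            locale.getpreferredencoding(False),
            "utf-8",
        ]
    for enc in encodings:
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next one.
            continue
    # latin-1 maps every byte value to a code point, so this last
    # resort cannot raise, although the result may be wrong.
    return raw.decode("latin-1")
```

Passing an explicit list, e.g. decode_path(name, ["utf-8"]), makes the behaviour independent of the machine's locale, which is handy for testing. The result of the latin-1 last resort is only suitable for display; the original bytes should still be kept for actually opening the file.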
_______________________________________________
Freevo-devel mailing list
Freevo-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freevo-devel