On Friday 01 December 2006 16:46, Jason Tackaberry wrote:
> On Fri, 2006-12-01 at 16:36 +0100, Duncan Webb wrote:
> > First, when a string is a Unicode string does this mean that every
> > character is 2 or 4 bytes wide?
>
> Not necessarily.  Depends on the encoding.  This isn't the case for
> latin1 and UTF8.
But Duncan asked about Unicode strings; how can those be latin1 or UTF-8?
AFAIK, u"Hans" will always be a 16-bit string.  Let's see... interesting:
http://docs.python.org/api/unicodeObjects.html says:

Python's default builds use a 16-bit type for Py_UNICODE and store Unicode 
values internally as UCS2. It is also possible to build a UCS4 version of 
Python (most recent Linux distributions come with UCS4 builds of Python). 
These builds then use a 32-bit type for Py_UNICODE and store Unicode data 
internally as UCS4. On platforms where wchar_t is available and compatible 
with the chosen Python Unicode build variant, Py_UNICODE is a typedef alias 
for wchar_t to enhance native platform compatibility. On all other platforms, 
Py_UNICODE is a typedef alias for either unsigned short (UCS2) or unsigned 
long (UCS4).
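
To see which variant a given interpreter is, one can look at sys.maxunicode.
A minimal sketch (the helper name unicode_build is made up for illustration;
note that Python 3.3 and later dropped the build distinction and always
report the full range):

```python
import sys

def unicode_build():
    """Return "UCS4" if code points above U+FFFF fit in one unit, else "UCS2"."""
    # Narrow (UCS2) builds report 0xFFFF, wide (UCS4) builds report 0x10FFFF.
    return "UCS4" if sys.maxunicode > 0xFFFF else "UCS2"

print(unicode_build())
```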

You may encode Unicode strings down to 8-bit strings with the "encode"
method: u"Hans".encode("utf-8") returns a plain 8-bit str, but that string
has no property which says that it's in UTF-8 encoding, which is why using
Unicode objects where possible is best for i18n.
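
A small illustration of that point, as a sketch only (the sample name and
variable names are my own): the same Unicode string gives byte strings of
different lengths depending on the encoding, and nothing in the result
records which encoding was used.

```python
# One Unicode string, two different 8-bit encodings of it.
text = u"Hans J\u00fcrgen"           # 11 characters, one of them non-ASCII

as_utf8 = text.encode("utf-8")       # u-umlaut becomes two bytes
as_latin1 = text.encode("latin-1")   # u-umlaut becomes one byte

print(len(text), len(as_utf8), len(as_latin1))

# The byte strings differ, and neither one says which encoding it is in;
# you have to know that in order to get the original text back.
print(as_utf8 == as_latin1)
print(as_utf8.decode("utf-8") == text)
```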

> > Second, file names from a fat system seem to be in latin1 but on the
> > ext2/3 are in utf8. How can they be processed in a safe way without
> > causing UnicodeErrors?
>
> Firstly, the encoding type is not always utf8 on ext3.  The filesystem
> encoding can be gotten via sys.getfilesystemencoding(), but that doesn't 
> mean a filename isn't encoded latin1 anyway.  Consequently, you must
> never use unicode for storing filenames, and always keep them as str
> objects.

I agree with this part of your answer, but...

> For purposes of displaying a filename you can then convert to unicode
> for proper display.  kaa.strutils.str_to_unicode attempts to do the
> right thing when you don't know whether a string is encoded latin1 or
> utf8.  (kaa.strutils is in kaa.base, you can just copy that function
> into the 1.x tree if you need it.)

..I find this misleading, since with a properly set-up system, you *should* 
know which encoding the filename has.  I am used to Qt, which has 
QFile.encodeName() and QFile.decodeName() for proper 8bit<->Unicode 
conversions.  What a pity that Python lacks such useful functions.
But as I see now, str_to_unicode properly tries the user's locale first.
I wonder if there should be an additional filename_to_unicode function which 
uses sys.getfilesystemencoding() instead of strutils.ENCODING?

import sys

from kaa.strutils import str_to_unicode

def path_to_unicode(s):
    """
    Attempts to convert a local filesystem path to a unicode string.
    First it tries to decode the string based on
    sys.getfilesystemencoding().  If that fails, it uses
    str_to_unicode() as a fallback (which in turn tries the
    locale's preferred encoding, UTF-8, and latin-1 in order).
    """
    # Anything that is not an 8-bit string (e.g. already unicode) is
    # returned unchanged.
    if not isinstance(s, str):
        return s

    try:
        return s.decode(sys.getfilesystemencoding())
    except UnicodeDecodeError:
        pass

    return str_to_unicode(s)
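
For anyone who wants to try the idea without kaa.base installed, here is a
rough standalone sketch of the same fallback chain.  The name
decode_filename and its fallbacks parameter are invented for this example,
and it leaves out the locale step that str_to_unicode performs, so it is
not a drop-in replacement:

```python
import sys

def decode_filename(raw, fallbacks=("utf-8", "latin-1")):
    """Decode an 8-bit filename, trying the filesystem encoding first."""
    # Already-decoded text is passed through untouched.
    if not isinstance(raw, bytes):
        return raw

    # getfilesystemencoding() may return None on some old systems.
    fs_encoding = sys.getfilesystemencoding() or "ascii"

    for encoding in (fs_encoding,) + tuple(fallbacks):
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue

    # Only reached if the caller removed latin-1 from the fallbacks;
    # latin-1 maps every byte, so this never fails.
    return raw.decode("latin-1")

print(repr(decode_filename(b"plain.txt")))
print(repr(decode_filename(b"caf\xe9.txt")))
```

Since latin-1 accepts any byte sequence, it has to stay last in the chain;
it guarantees a result, but not a correct one.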

-- 
Ciao, /  /                                                    .o.
     /--/                                                     ..o
    /  / ANS                                                  ooo


_______________________________________________
Freevo-devel mailing list
Freevo-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freevo-devel
