On Mon, Aug 15, 2011 at 4:20 PM, Artie Ziff <artie.z...@gmail.com> wrote:
> if I am using the standard csv library to read contents of a csv file which
> contains Unicode strings (short example: '\xe8\x9f\x92\xe8\x9b\x87'), how do
> I use a python Unicode method such as decode or encode to transform this
> string type into a python unicode type? Must I know the encoding (byte
> groupings) of the Unicode? Can I get this from the file? Perhaps I need to
> open the file with particular attributes?
>
Start here: http://www.joelonsoftware.com/articles/Unicode.html

The CSV file, being stored on disk, cannot contain Unicode strings; it can only contain bytes. If you know the encoding (e.g. UTF-8, UCS-2, etc.), then you can decode the bytes using that. If you don't, your best bet is to ask whoever produced the file; failing that, check the first few bytes: if they're "\xFF\xFE", "\xFE\xFF", or "\xEF\xBB\xBF", then the file is probably UTF-16LE, UTF-16BE, or UTF-8, respectively (those being the encodings of the BOM, the byte order mark). There may be other clues, too, but normally it's best to get the encoding separately from the data rather than try to guess it from the data itself.

Chris Angelico
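For what it's worth, the BOM check described above can be sketched roughly like this. It assumes Python 3 (where csv works on text, so you decode first and parse second), and the names sniff_bom and read_csv_bytes are made up for illustration, not part of any library. The UTF-8 fallback is only a guess; if the producer of the file tells you the encoding, pass that in instead.

```python
import codecs
import csv
import io

# BOM prefix -> codec to decode with. "utf-8-sig" strips the BOM while
# decoding; "utf-16" reads the BOM to pick the byte order and strips it.
# Limitation: a UTF-32LE file also starts with FF FE, so it would be
# misdetected here as UTF-16.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw, default="utf-8"):
    """Guess a codec name from a leading BOM; fall back to `default`."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default  # assumption: no BOM means the default, which may be wrong


def read_csv_bytes(raw, encoding=None):
    """Decode raw CSV bytes to text, then parse the text into rows."""
    text = raw.decode(encoding or sniff_bom(raw))
    # newline="" leaves line endings alone so the csv module can handle them.
    return list(csv.reader(io.StringIO(text, newline="")))


# The bytes from the question happen to be valid UTF-8.
rows = read_csv_bytes(b"\xe8\x9f\x92\xe8\x9b\x87,snake\r\n")
# rows == [['\u87d2\u86c7', 'snake']]
```

When reading straight from disk, open(path, encoding=..., newline="") with the known encoding does the decoding for you, and the sniffing step is only needed if nobody can tell you what the encoding is.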