-------------------------------------------
On Tue, 10/29/13, eryksun <eryk...@gmail.com> wrote:

Subject: Re: [Tutor] UnicodeDecodeError while parsing a .csv file.
To: "Steven D'Aprano" <st...@pearwood.info>
Cc: tutor@python.org
Date: Tuesday, October 29, 2013, 3:24 AM

On Mon, Oct 28, 2013 at 7:49 PM, Steven D'Aprano <st...@pearwood.info> wrote:
>
> By default Python 3 uses UTF-8 when reading files. As the error below
> shows, your file actually isn't UTF-8.

Modules default to UTF-8, but io.TextIOWrapper defaults to the locale's
preferred encoding. To handle terminals, it first tries
os.device_encoding (i.e. _Py_device_encoding). Otherwise, for files, it
defaults to locale.getpreferredencoding(False).

==> Why is do_setlocale=False here? Actually, what does this parameter
do? It seems strange that a getter function has a 'set' argument.

    >>> import locale
    >>> help(locale.getpreferredencoding)
    Help on function getpreferredencoding in module locale:

    getpreferredencoding(do_setlocale=True)
        Return the charset that the user is likely using.

Another remark: I have not read this entire thread, but I was thinking
the OP might use codecs.open to open the file with the correct encoding.
If that encoding is unknown, maybe chardet could be used to guess it:
https://pypi.python.org/pypi/chardet. I have never used this module, but
it seems worth a try.

The other day I received a file that had been encoded multiple times, so
the accented characters were all messed up. I had to reverse-engineer
it, and it turned out that a sequence of latin-1 and utf-8 encodings had
been applied. It would be nice if (1) this didn't happen in the first
place ;-) and (2) some library could help with this "de-mojibake"
process.

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
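On the do_setlocale question: with the default do_setlocale=True,
getpreferredencoding() may temporarily call
locale.setlocale(locale.LC_CTYPE, "") to query the user's environment
locale before reading the codeset, and setlocale() changes process-wide
state, so it is not thread-safe. Passing False asks for the encoding of
the locale as it currently is, without touching it, which is why the
interpreter's own file I/O uses False internally. A minimal check
(the printed value is platform-dependent):

```python
import locale

# do_setlocale=False: report the current locale's encoding without
# calling setlocale() first. open()/io.TextIOWrapper use this form
# because setlocale() mutates process-wide state and isn't thread-safe.
enc = locale.getpreferredencoding(False)
print(enc)  # e.g. 'UTF-8' on Linux/macOS, 'cp1252' on many Windows setups
```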
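For what it's worth, chardet's API is chardet.detect(raw_bytes), which
returns a dict containing an 'encoding' guess. For anyone without the
package installed, here is a hypothetical stdlib-only stand-in (the name
sniff_encoding and the candidate list are mine, not chardet's) that just
tries a few encodings in order:

```python
def sniff_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the file cleanly.

    A crude stand-in for chardet: strict UTF-8 decoding fails fast on
    non-UTF-8 byte sequences, so trying it first is a cheap filter.
    Note that latin-1 maps every possible byte, so it never fails and
    must therefore come last as the catch-all.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None
```

Once you have a guess, open(path, encoding=enc) (or codecs.open on
older Pythons) does the rest.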
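On the "de-mojibake" wish: when the damage is exactly one round of UTF-8
bytes misread as latin-1, it is reversible, because latin-1 maps every
byte to a character and back losslessly. A sketch of the round trip
described above:

```python
# Mojibake: UTF-8 bytes were mistakenly decoded as latin-1.
original = "naïve"
mangled = original.encode("utf-8").decode("latin-1")
print(mangled)  # naÃ¯ve

# Undo it: re-encoding as latin-1 recovers the original UTF-8 bytes,
# which can then be decoded with the correct codec.
repaired = mangled.encode("latin-1").decode("utf-8")
print(repaired)  # naïve
```

For messier cases (multiple rounds, mixed codecs), the third-party ftfy
package automates this kind of repair.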