Hi, I have problems getting my Python code to work with UTF-8 encoding when reading from stdin / writing to stdout.
Say I have a file, utf8_input, that contains a single character, é, encoded as UTF-8:

    $ hexdump -C utf8_input
    00000000  c3 a9
    00000002

If I read this file by opening it in this Python script:

    $ cat utf8_from_file.py
    import codecs
    file = codecs.open('utf8_input', encoding='utf-8')
    data = file.read()
    print "length of data =", len(data)

everything goes well:

    $ python utf8_from_file.py
    length of data = 1

The contents of utf8_input is one character encoded as two bytes, so UTF-8 decoding is working here.

Now, I would like to do the same with standard input. Of course, this:

    $ cat utf8_from_stdin.py
    import sys
    data = sys.stdin.read()
    print "length of data =", len(data)

does not work:

    $ python utf8_from_stdin.py < utf8_input
    length of data = 2

Here, the contents of utf8_input is not interpreted as UTF-8, so Python believes there are two separate characters.

The question, then: how could one get utf8_from_stdin.py to work properly with UTF-8? (And the same question for stdout.)

I googled around and found rather complex stuff (see, for example, http://blog.ianbicking.org/illusive-setdefaultencoding.html), but even that didn't work: I still get "length of data = 2" after calling sys.setdefaultencoding('utf-8').

-- dave
-- http://mail.python.org/mailman/listinfo/python-list
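[For reference, one common approach, sketched here rather than taken from the thread, is to wrap the byte stream in a codec reader/writer from the codecs module instead of touching the default encoding. To keep the example self-contained, an in-memory io.BytesIO stream stands in for sys.stdin/sys.stdout:]

    import codecs
    import io

    # In the real script you would wrap sys.stdin itself (sys.stdin.buffer
    # on Python 3); here an in-memory byte stream stands in for it.
    fake_stdin = io.BytesIO(b'\xc3\xa9')  # the UTF-8 bytes for 'é'

    # codecs.getreader('utf-8') returns a StreamReader class; wrapping the
    # byte stream makes read() return decoded text instead of raw bytes.
    utf8_stdin = codecs.getreader('utf-8')(fake_stdin)
    data = utf8_stdin.read()
    print("length of data =", len(data))  # 1 character, not 2 bytes

    # Writing works symmetrically with a StreamWriter: text goes in,
    # UTF-8 bytes come out on the underlying stream.
    fake_stdout = io.BytesIO()
    utf8_stdout = codecs.getwriter('utf-8')(fake_stdout)
    utf8_stdout.write(data)
    print("bytes written =", len(fake_stdout.getvalue()))  # 2 bytes

[In the actual script, codecs.getreader('utf-8')(sys.stdin) would replace the in-memory stream on Python 2; on Python 3, sys.stdin is already text, so you would wrap sys.stdin.buffer instead.]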