Hi,

I have problems getting my Python code to work with UTF-8 encoding
when reading from stdin / writing to stdout.

Say I have a file, utf8_input, that contains a single character, é,
coded as UTF-8:

        $ hexdump -C utf8_input
        00000000  c3 a9
        00000002

If I read this file by opening it in this Python script:

        $ cat utf8_from_file.py
        import codecs
        f = codecs.open('utf8_input', encoding='utf-8')
        data = f.read()
        print "length of data =", len(data)

everything goes well:

        $ python utf8_from_file.py
        length of data = 1

The file contains one character coded as two bytes, so UTF-8 decoding
is working here.
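
For comparison, here is the same check done with an explicit decode
instead of codecs.open (a sketch, assuming Python 2 as in the scripts
above):

        # Read the raw bytes, then decode them explicitly.
        raw = open('utf8_input', 'rb').read()
        data = raw.decode('utf-8')
        print "raw bytes =", len(raw)               # 2
        print "decoded characters =", len(data)     # 1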

Now, I would like to do the same with standard input. Of course, this:

        $ cat utf8_from_stdin.py
        import sys
        data = sys.stdin.read()
        print "length of data =", len(data)

does not work:

        $ python utf8_from_stdin.py < utf8_input
        length of data = 2

Here, the contents of utf8_input are not interpreted as UTF-8, so
Python sees the two raw bytes as two separate characters.
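
A quick way to confirm that the raw bytes at least arrive intact is to
decode them by hand after reading (a sketch; probably not the cleanest
fix):

        import sys
        # stdin hands us raw bytes; decoding by hand gives one character.
        data = sys.stdin.read().decode('utf-8')
        print "length of data =", len(data)   # now prints 1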

The question, then:
How could one get utf8_from_stdin.py to work properly with UTF-8?
(And the same question for stdout.)
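
For concreteness, the kind of thing I have in mind is wrapping both
streams with codecs.getreader / codecs.getwriter (a sketch; I am not
sure this is the idiomatic approach):

        import codecs
        import sys

        # Wrap the raw byte streams in UTF-8 decoding/encoding wrappers.
        reader = codecs.getreader('utf-8')(sys.stdin)
        writer = codecs.getwriter('utf-8')(sys.stdout)

        data = reader.read()   # unicode string of length 1
        writer.write(data)     # re-encoded to c3 a9 on the way out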

I googled around and found rather complex stuff (see, for example,
http://blog.ianbicking.org/illusive-setdefaultencoding.html), but even
that didn't work: I still get "length of data = 2" even after
successfully calling sys.setdefaultencoding('utf-8').

-- dave