Re: What's wrong with these codes as failed to read the strings in Chinese? Is it because Chinese characters can't be read on Mac? Many thanks

Cameron Simpson Thu, 08 Nov 2018 22:45:02 -0800

On 08Nov2018 19:30, Annie Lu <[email protected]> wrote:

# -*- coding: UTF-8 -*-
... f = open('/Users/annielu/Desktop/namelist1801.txt')

namelist1801txt = f.read()
f.close()
namelist1801txt

'\xe9\x99\x88\xe5\xb7\x8d\n\xe8\x83\xa1\xe6\x99\xba\xe5\x81\xa5\r\xe9\xbb\x84\xe5\x9d\xa4\xe6\xa6\x95\r\xe6\x9d\x8e\xe6\x98\x9f\xe7\x81\xbf\r\xe5\x88\x98\xe8\xb6\x85\xe6\x9d\xb0\r\xe7\x8e\x8b\xe4\xbf\x8a\xe5\x80\xbc\r\xe4\xbd\x99\xe4\xb8\x9c\xe6\xbd\xae\r\xe9\x99\x88\xe6\x80\x9d\xe5\x87\xbd\r\xe5\x86\xaf\xe5\xb0\x91\xe5\x90\x9b\r\xe9\xbb\x84\xe5\x98\x89\xe8\xb0\x8a\r\xe9\xbb\x84\xe7\x90\xaa\xe7\x90\xaa\r\xe8\xb5\x96\xe5\xa9\x89\xe5\xa9\xb7\r\xe8\xb5\x96\xe5\xbd\xa6\xe9\x9c\x8f\r\xe5\xbb\x96\xe7\xbf\xa0\xe7\x9b\x88\r\xe6\x9e\x97\xe7\xbe\xbd\xe7\x8f\x82\r\xe5\x88\x98\xe5\xae\x89\xe7\x90\xaa\r\xe9\xa9\xac\xe7\x91\x9e\r\xe5\xbd\xad\xe5\x98\x89\xe4\xbb\xaa\r\xe9\x82\xb1\xe6\xaf\x93\xe4\xbb\xaa\r\xe5\xad\x99\xe6\xa3\xae\xe6\xa3\x8b\r\xe8\xb0\xad\xe5\x98\x89\xe7\x90\xaa\r\xe7\x8e\x8b\xe5\xa4\xa9\xe9\x9f\xb5\r\xe5\x90\xb4\xe5\xad\x90\xe7\x8f\xba\r\xe6\x9d\xa8\xe5\x88\xa9\xe8\x8c\xb5\r\xe5\xa7\x9a\xe5\x98\x89\xe9\x9b\xaf\r\xe8\xa2\x81\xe6\x9c\x88\xe6\xbb\xa2\r\xe5\xbc\xa0\xe9\x87\x87\xe7\x8e\x89\r\xe5\xbc\xa0\xe6\xb2\x81\xe7\x8e\xa5'


It should be fine, but how it works out is very dependent on:

- your Python version, particularly Python 2 versus Python 3

- the text encoding used in the file namelist1801.txt

If you're not using Python 3, I recommend that you do. I _suspect_ from the output you have shown that you are using Python 2.

On a UNIX system (your Mac is a UNIX system, BTW), a text file is a stream of bytes. Because it contains text, that text is encoded to bytes in some fashion. On modern systems, the commonest encoding is 'utf-8', a variable length encoding of Unicode code points.
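To make "encoded to bytes" concrete, here is a small Python 3 sketch using the first name from your output. The variable names are mine, and I've written the characters as \u escapes so the example does not depend on how this message itself got encoded:

```python
# Python 3: str holds Unicode code points, bytes holds the encoded form.
name = "\u9648\u5dcd"              # the two characters at the start of your file
data = name.encode("utf-8")        # text -> bytes, as stored on disk

print(data)                        # the same \xe9\x99\x88... bytes you saw
print(len(name), len(data))        # 2 characters, 6 bytes: 3 bytes each

# "Variable length" means ASCII characters still take a single byte.
ascii_len = len("A".encode("utf-8"))
print(ascii_len)
```

So the \x.. sequences in your output are not corruption; they are the UTF-8 bytes of the names, shown undecoded.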


In order to read text back from a file, it must be decoded.

You've opened your file as text (which is good, because it contains text).

In Python 2 that is pretty simple minded: you get back _byte_ strings. Python 2 strings are just arrays of bytes, so no decoding really happens. For ASCII text, that gets by. For languages requiring glyphs beyond that, interpretation is needed. You need unicode strings, which are _not_ Python 2's default, so your text needs converting.

In Python 3, strings are unicode strings to start with. You must still indicate the file encoding, but there is a default inferred from your operating environment, and that is usually 'utf-8'.
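If you want to see which Python you are running and what default encoding it would infer, something like this should tell you. It is a sketch: as far as I know, locale.getpreferredencoding(False) is what open() falls back to in Python 3 when no encoding is given, but the exact rule can vary between Python versions.

```python
import locale
import sys

# Major version: 2 means byte strings by default, 3 means unicode strings.
major = sys.version_info[0]
print("Python major version:", major)

# The encoding inferred from the operating environment (locale settings).
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)
```

If the second line does not report UTF-8 on your machine, pass encoding='utf-8' to open() explicitly rather than relying on the default.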


So here's an (untested) Python 2 example loop:

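 # Without this, Python 2 prints ("line =", line) as a tuple.
 from __future__ import print_function

 # 'rU' = universal newlines: your file separates most names with \r.
 with open('namelist.txt', 'rU') as f:
   for line in f:
     line = line.strip()
     print("line =", line)
     # Decode the byte string into a unicode string.
     uline = unicode(line, 'utf-8')
     print("uline =", uline)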

Here's a Python 2 example of taking your text string and converting it:

 >>> s='\xe9\x99\x88\xe5\xb7\x8d\n\xe8\x83\xa1\xe6\x99\xba\xe5\x81\xa5\r\xe9\xbb\x84\xe5\x9d\xa4\xe6\xa6\x95\r\xe6\x9d\x8e\xe6\x98\x9f\xe7\x81\xbf\r\xe5\x88\x98\xe8\xb6\x85'
 >>> unicode(s,'utf-8')
 u'\u9648\u5dcd\n\u80e1\u667a\u5065\r\u9ec4\u5764\u6995\r\u674e\u661f\u707f\r\u5218\u8d85'
 >>> print(unicode(s,'utf-8'))
 陈巍
 刘超灿
 >>>

I cannot read Chinese text, but the glyphs look like it to my eye.

I'm using a Mac, and did nothing special.

Note that I had to take a portion of your text which ended on a complete unicode character, otherwise the decode fails. My first cut/paste stopped one byte beyond the \x85 that ends the string above, and failed. Your entire string should also decode cleanly.
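Here is that failure reproduced as a Python 3 sketch. The variable names are mine; the bytes are the last two characters of the string above, plus the one extra byte that my first cut/paste picked up:

```python
# Two complete characters: 3 bytes each in UTF-8.
complete = b'\xe5\x88\x98\xe8\xb6\x85'
# The same bytes plus the first byte of the next 3-byte character.
truncated = complete + b'\xe6'

text = complete.decode('utf-8')    # decodes cleanly
print(len(text))                   # two characters

try:
    truncated.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError as e:
    # The decoder refuses a character that is cut off part way through.
    decode_failed = True
    print("decode failed:", e)
```

The same thing happens in Python 2 with unicode(s, 'utf-8'), so if you slice a byte string by hand, cut on a character boundary.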


In Python 3 the loop is much cleaner:

 with open('namelist.txt', encoding='utf-8') as f:
   for line in f:
     line = line.strip()
     print("line =", line)

because the file open understands the encoding. I have explicitly specified 'utf-8' there, but you may find that it is the default for you.
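If you want to try this without touching your real file, here is a self-contained Python 3 sketch. It writes the first three names from your output to a temporary file (the temporary file and the variable names are my invention), keeping the same mix of \n and \r separators, and reads them back:

```python
import os
import tempfile

# First three names from the quoted output, as raw UTF-8 bytes.
raw = (b'\xe9\x99\x88\xe5\xb7\x8d\n'
       b'\xe8\x83\xa1\xe6\x99\xba\xe5\x81\xa5\r'
       b'\xe9\xbb\x84\xe5\x9d\xa4\xe6\xa6\x95\r')

# Write the bytes to a throwaway file standing in for namelist1801.txt.
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Text mode decodes the bytes and treats \n, \r and \r\n as line ends.
    with open(path, encoding='utf-8') as f:
        names = [line.strip() for line in f]
finally:
    os.remove(path)

print(names)
```

Python 3's text mode splits on the \r separators by default, which is why each name should come out as its own line here.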


Cheers,
Cameron Simpson <[email protected]>
--
https://mail.python.org/mailman/listinfo/python-list
