On 08Nov2018 19:30, Annie Lu <gabriella19930...@gmail.com> wrote:
# -*- coding: UTF-8 -*-
... f = open('/Users/annielu/Desktop/namelist1801.txt')
namelist1801txt = f.read()
f.close()
namelist1801txt
'\xe9\x99\x88\xe5\xb7\x8d\n\xe8\x83\xa1\xe6\x99\xba\xe5\x81\xa5\r\xe9\xbb\x84\xe5\x9d\xa4\xe6\xa6\x95\r\xe6\x9d\x8e\xe6\x98\x9f\xe7\x81\xbf\r\xe5\x88\x98\xe8\xb6\x85\xe6\x9d\xb0\r\xe7\x8e\x8b\xe4\xbf\x8a\xe5\x80\xbc\r\xe4\xbd\x99\xe4\xb8\x9c\xe6\xbd\xae\r\xe9\x99\x88\xe6\x80\x9d\xe5\x87\xbd\r\xe5\x86\xaf\xe5\xb0\x91\xe5\x90\x9b\r\xe9\xbb\x84\xe5\x98\x89\xe8\xb0\x8a\r\xe9\xbb\x84\xe7\x90\xaa\xe7\x90\xaa\r\xe8\xb5\x96\xe5\xa9\x89\xe5\xa9\xb7\r\xe8\xb5\x96\xe5\xbd\xa6\xe9\x9c\x8f\r\xe5\xbb\x96\xe7\xbf\xa0\xe7\x9b\x88\r\xe6\x9e\x97\xe7\xbe\xbd\xe7\x8f\x82\r\xe5\x88\x98\xe5\xae\x89\xe7\x90\xaa\r\xe9\xa9\xac\xe7\x91\x9e\r\xe5\xbd\xad\xe5\x98\x89\xe4\xbb\xaa\r\xe9\x82\xb1\xe6\xaf\x93\xe4\xbb\xaa\r\xe5\xad\x99\xe6\xa3\xae\xe6\xa3\x8b\r\xe8\xb0\xad\xe5\x98\x89\xe7\x90\xaa\r\xe7\x8e\x8b\xe5\xa4\xa9\xe9\x9f\xb5\r\xe5\x90\xb4\xe5\xad\x90\xe7\x8f\xba\r\xe6\x9d\xa8\xe5\x88\xa9\xe8\x8c\xb5\r\xe5\xa7\x9a\xe5\x98\x89\xe9\x9b\xaf\r\xe8\xa2\x81\xe6\x9c\x88\xe6\xbb\xa2\r\xe5\xbc\xa0\xe9\x87\x87\xe7\x8e\x89\r\xe5\xbc\xa0\xe6\xb2\x81\xe7\x8e\xa5'


It should be fine, but how it works out is very dependent on:

- your Python version, particularly Python 2 versus Python 3

- the text encoding used in the file namelist1801.txt

If you're not using Python 3, I recommend that you switch to it. I _suspect_, from the output you have shown, that you are using Python 2.

On a UNIX system (your Mac is a UNIX system, BTW), a text file is a stream of bytes. Because it contains text, that text is encoded to bytes in some fashion. On modern systems, the commonest encoding is 'utf-8', a variable length encoding of Unicode code points.
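To illustrate what "variable length" means, here is a small Python 3 sketch. The sample characters are my own choice for illustration (the last one is the first character in your file); it just shows that different characters take different numbers of bytes in UTF-8:

```python
# Python 3: encode a few characters and count the resulting UTF-8 bytes.
samples = ['A', 'é', '陈']   # ASCII, Latin-1 range, CJK (sample characters only)

for ch in samples:
    encoded = ch.encode('utf-8')
    print(repr(ch), '->', encoded, len(encoded), 'byte(s)')

# Collected byte counts, in the same order as `samples`.
byte_lengths = [len(ch.encode('utf-8')) for ch in samples]
```

The three-byte result for the last character is the same '\xe9\x99\x88' that starts your output.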

In order to read text back from a file, it must be decoded.

You've opened your file as text (which is good, because it contains text).

In Python 2 that is pretty simple minded: you get back _byte_ strings. Python 2 strings are just arrays of bytes, so no decoding really happens. For ASCII text, that gets by. For languages requiring glyphs beyond that, interpretation is needed. You need unicode strings, which are _not_ Python 2's default, so your text needs converting.

In Python 3, strings are unicode strings to start with. You must still indicate the file encoding, but there is a default inferred from your operating environment, and that is usually 'utf-8'.
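As a rough Python 3 sketch of that distinction: undecoded data is a separate `bytes` type, and decoding it gives a `str`. The byte values are the first six from your output. The `locale` call reports what I understand `open()` to fall back on when you give no encoding, assuming nothing in your environment overrides it:

```python
import locale

# Likely default for open() in text mode when no encoding is given
# (assumption: no UTF-8 mode or similar override is in effect).
print("default encoding:", locale.getpreferredencoding(False))

raw = b'\xe9\x99\x88\xe5\xb7\x8d'   # first six bytes from the quoted output
text = raw.decode('utf-8')          # bytes -> str (unicode)

print(type(raw).__name__, len(raw))     # length counted in bytes
print(type(text).__name__, len(text))   # length counted in characters
```

Six bytes become two characters, which is why the undecoded form looks so much longer than the names themselves.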

So here's an (untested) Python 2 example loop:

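 # make print() a function, so the two-argument calls below print text, not a tuple
 from __future__ import print_function
 with open('namelist.txt') as f:
   for line in f:
     line = line.strip()
     print("line =", line)
     # decode the UTF-8 bytes into a unicode string
     uline = unicode(line, 'utf-8')
     print("uline =", uline)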

Here's a Python 2 example of taking your text string and converting it:

 >>> s='\xe9\x99\x88\xe5\xb7\x8d\n\xe8\x83\xa1\xe6\x99\xba\xe5\x81\xa5\r\xe9\xbb\x84\xe5\x9d\xa4\xe6\xa6\x95\r\xe6\x9d\x8e\xe6\x98\x9f\xe7\x81\xbf\r\xe5\x88\x98\xe8\xb6\x85'
 >>> unicode(s,'utf-8')
 u'\u9648\u5dcd\n\u80e1\u667a\u5065\r\u9ec4\u5764\u6995\r\u674e\u661f\u707f\r\u5218\u8d85'
 >>> print(unicode(s,'utf-8'))
 陈巍
 刘超灿
 >>>

I cannot read Chinese text, but the glyphs look like it to my eye.

I'm using a Mac, and did nothing special.

Note that I had to take a portion of your text which ended on a complete unicode character, otherwise the decode fails. My first cut/paste stopped one byte beyond the \x85 that ends the string above, and failed. Your entire string should also decode cleanly.
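Here is a small Python 3 sketch of that kind of failure. The bytes are the last two characters of the string above, and the extra b'\xe6' is the next byte from your original output, i.e. the first byte of a three-byte character:

```python
# Python 3: a trailing partial character makes the whole decode fail.
data = b'\xe5\x88\x98\xe8\xb6\x85'   # two complete 3-byte characters
truncated = data + b'\xe6'           # plus the first byte of the next character

print(data.decode('utf-8'))          # decodes cleanly

try:
    truncated.decode('utf-8')
    failed = False
except UnicodeDecodeError as e:
    failed = True
    print("decode failed:", e.reason)
```

So if you ever see a UnicodeDecodeError on data you believe is UTF-8, one thing to check is whether it was cut off in the middle of a character.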

In Python 3 the loop is much cleaner:

 with open('namelist.txt', encoding='utf-8') as f:
   for line in f:
     line = line.strip()
     print("line =", line)

because the file open understands the encoding. I have explicitly specified 'utf-8' there, but you may find that it is the default for you.
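One more thing I notice in your data: most of the names are separated by '\r' rather than '\n'. The sketch below (Python 3) writes the first few bytes of your output to a temporary file, which is just a stand-in for your namelist file, and reads it back with the loop above. As far as I know, Python 3's text mode treats '\r', '\n' and '\r\n' all as line endings by default, so each name should come out as its own line:

```python
import os
import tempfile

# First three names from the quoted output: one '\n', then '\r' separators.
raw = (b'\xe9\x99\x88\xe5\xb7\x8d\n'
       b'\xe8\x83\xa1\xe6\x99\xba\xe5\x81\xa5\r'
       b'\xe9\xbb\x84\xe5\x9d\xa4\xe6\xa6\x95\r')

# Temporary file standing in for the real name list.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Default newline handling splits on '\r' as well as '\n'.
    with open(path, encoding='utf-8') as f:
        names = [line.strip() for line in f]
finally:
    os.remove(path)

print(names)
```

Those bare '\r' characters are probably also why the print() output earlier shows only two lines: a terminal treats '\r' as "return to the start of the line", so later names overwrite earlier ones. In Python 2 you would likely need to open the file in 'rU' mode to get the same line splitting.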

Cheers,
Cameron Simpson <c...@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
