Re: unknown encoding problem

2005-04-08 Thread Peter Otten

Uwe Mayer wrote:

> I need to read in a text file which seems to be stored in some unknown
> encoding. Opening and reading the file's content returns:
>
> >>> f.read()
> '\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y\x00>...
>
> Each character has a \x00 prepended to it. I suspect it's some kind of
> unicode - how do I get rid of it?

Intermittent '\x00' bytes are indeed strong evidence for unicode. Use
codecs.open() to access the data in such a file:

>>> import codecs
>>> f = codecs.open(filename, "r", "UTF-16-BE")
>>> f.read()
u'  <logEntry>'

If you don't want unicode, convert back to str:

>>> _.encode("latin1")
'  <logEntry>'

Note that the last step may fail if the file contains characters not
available in the string encoding you specify.
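
A minimal sketch of both steps in Python 3 syntax (the thread itself predates it), where decoding bytes yields a str directly. The sample bytes, the euro-sign example and the helper name to_latin1 are assumptions made for illustration; the second call shows the failure mode mentioned above and one way to soften it with the errors argument.

```python
# Sample data: UTF-16-BE bytes for "  <logEntry>" (assumed, for illustration).
raw = "  <logEntry>".encode("utf-16-be")

# Step 1: decode the bytes into text.
text = raw.decode("utf-16-be")


def to_latin1(s, errors="strict"):
    """Encode text as Latin-1; hypothetical helper, not a library function."""
    return s.encode("latin-1", errors)


# Step 2 works here because every character fits in Latin-1.
plain = to_latin1(text)

# Step 2 can fail: the euro sign has no Latin-1 equivalent.
price = "price: \u20ac5"
try:
    to_latin1(price)
    failed = False
except UnicodeEncodeError:
    failed = True

# errors="replace" substitutes "?" instead of raising.
lossy = to_latin1(price, errors="replace")
```

Whether a lossy "?" is acceptable depends on what the text is used for; keeping the data as Unicode avoids the question.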

Peter
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: unknown encoding problem

2005-04-08 Thread Leif K-Brooks

Uwe Mayer wrote:
> Hi,
>
> I need to read in a text file which seems to be stored in some unknown
> encoding. Opening and reading the file's content returns:
>
> >>> f.read()
> '\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y\x00>...
>
> Each character has a \x00 prepended to it. I suspect it's some kind of
> unicode - how do I get rid of it?

f.read().decode('utf16')
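
One caveat worth sketching: as far as I know, when the data has no byte order mark, Python's plain utf-16 codec falls back to the platform's native byte order, so big-endian data like the sample may decode to the wrong characters on a little-endian machine. The helper below (Python 3 syntax) checks for a BOM first and otherwise uses a caller-supplied default. The name decode_utf16 and the big-endian default are assumptions for illustration.

```python
import codecs


def decode_utf16(data, default="utf-16-be"):
    """Decode UTF-16 bytes, honouring a BOM if one is present.

    Hypothetical helper: falls back to `default` when there is no BOM.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    return data.decode(default)


# Big-endian data without a BOM, like the sample in the question.
no_bom = b"\x00l\x00o\x00g"
# Little-endian data that announces its byte order with a BOM.
with_le_bom = codecs.BOM_UTF16_LE + "log".encode("utf-16-le")
```

Both decode_utf16(no_bom) and decode_utf16(with_le_bom) should return the same text, whatever the byte order of the machine running it.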

Re: unknown encoding problem

2005-04-08 Thread John Machin

On Fri, 08 Apr 2005 15:45:35 +0200, Uwe Mayer [EMAIL PROTECTED]
wrote:

> Hi,
>
> I need to read in a text file which seems to be stored in some unknown
> encoding. Opening and reading the file's content returns:
>
> >>> f.read()
> '\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y\x00>...
>
> Each character has a \x00 prepended to it. I suspect it's some kind of
> unicode - how do I get rid of it?


Interesting attitude. Why do you want to get rid of it? Have you
considered investigating the source of this suspicious text? You never
know, there could be something really interesting in there, like
'\x00v\x00o\x00n\x00 \x04\x1c\x04>\x04A\x04:\x042\x040\x00
\x00m\x00i\x00t\x00 \x00L\x00i\x00e\x00b' :-)

 str.replace('\x00', '')

Why not go the whole hog:

''.join([c for c in foreign_text if 32 <= ord(c) <= 126 or c in '\t\r\n'])

Alternatively, try embracing Unicode -- it's the way forward, and it's
not that difficult.
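
To make the difference concrete, here is a small sketch (Python 3 syntax) that runs both approaches over the sample bytes from this message. It assumes the data is UTF-16-BE and that the second Cyrillic letter is encoded as \x04>, since the archive appears to drop angle brackets; the variable names are illustrative.

```python
# Sample bytes from the message; the "\x04>" pair is an assumed reconstruction.
sample = (b"\x00v\x00o\x00n\x00 \x04\x1c\x04>\x04A\x04:\x042\x040\x00 "
          b"\x00m\x00i\x00t\x00 \x00L\x00i\x00e\x00b")

# "Whole hog" filtering: treat each byte as one character and keep only
# printable ASCII plus tab, carriage return and newline.
as_chars = sample.decode("latin-1")
filtered = "".join(c for c in as_chars if 32 <= ord(c) <= 126 or c in "\t\r\n")

# Embracing Unicode: decode with the matching codec and keep every character.
decoded = sample.decode("utf-16-be")
```

Here filtered comes out as 'von >A:20 mit Lieb', while decoded keeps the Cyrillic word intact.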
