Re: [Tutor] bogus characters in a windows file

2012-02-09 Thread Peter Otten

Garry Willgoose wrote:

 I input the data with the lines
 
 infile = open('c:\cpu.txt','r')
 infile.readline()
 infile.readline()
 infile.readline()
 
 the readline()s yield the following output
 
 '\xff\xfeP\x00r\x00o\x00c\x00e\x00s\x00s\x00I\x00d\x00 \x00 \x00\r\x00\n'
 '\x000\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'
 '\x004\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'

You were already told that you are trying to read a UTF-16-encoded file. 
Here's how to deal with that:

>>> import codecs
>>> with codecs.open("cpu.txt", "rU", encoding="UTF-16") as f:
...     for line in f:
...         print line.rstrip("\n")
...
ProcessId
0
4
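
The same decoding can be checked on the bytes quoted in the question, without touching a file. This is only a sketch: it assumes the three readline() results are joined back together unchanged, the names raw, text and fields are illustrative, and the last \x00 is added by hand because the third readline() stopped halfway through the final two-byte newline.

```python
# The three strings echoed by readline(), joined back into one byte string.
raw = (
    b'\xff\xfeP\x00r\x00o\x00c\x00e\x00s\x00s\x00I\x00d\x00 \x00 \x00\r\x00\n'
    b'\x000\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'
    b'\x004\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'
    b'\x00'  # second byte of the final newline, cut off by readline()
)

# The "utf-16" codec reads the BOM (\xff\xfe), selects little-endian
# byte order, and drops the BOM from the result.
text = raw.decode("utf-16")

# split() discards the padding spaces and the \r\n line endings.
fields = text.split()
print(fields)
```

Each readline() result after the first starts with \x00 because reading in byte mode splits at the \n byte, which is only the first half of the UTF-16 newline.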


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor

Re: [Tutor] bogus characters in a windows file

2012-02-09 Thread Garry Willgoose

 
 I'm reading a file output by the system utility WMIC in windows (so I can 
 track CPU usage by process ID) and the text file WMIC outputs seems to have 
 extra characters in I've not seen before.
 
 I use os.system('WMIC /OUTPUT:c:\cpu.txt PROCESS GET ProcessId') to output 
 the file and parse file c:\cpu.txt
 
 First mistake.  If you use backslash inside a python literal string, you need 
 to do one of two things:
   1) use a raw string
   2) double the backslash
 It so happens that \c is not a python escape sequence, so you escaped this 
 particular bug.

Lucked out on that one ... slipped under my radar. I was just cutting and 
pasting some code from the documentation to WMIC ;-)

 
 The first few lines of the file look like this in notepad
 
 ProcessId
 0
 4
 568
 624
 648
 
 
 I input the data with the lines
 
 infile = open('c:\cpu.txt','r')
 Same thing.  You should either make it r'c:\cpu.txt'   or   'c:\\cpu.txt'  or 
  even 'c:/cpu.txt'
 infile.readline()
 infile.readline()
 infile.readline()
 
 OK, so you throw away the first 3 lines of the file.
 
 the readline()s yield the following output
 
 '\xff\xfeP\x00r\x00o\x00c\x00e\x00s\x00s\x00I\x00d\x00 \x00 \x00\r\x00\n'
 '\x000\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'
 '\x004\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'
 
 Now, how did you get those bytes displayed;  they've already been thrown out.

Simply run the readline() commands at the Python interpreter prompt (or in 
IDLE if you like). The results are not thrown away; they are echoed to the 
screen.

 Now for the first line the title 'ProcessId' is in this string but the 
 individual characters are separated by '\x00' and at least for the first 
 line of the file there is an extra '\xff\xfe'. For subsequent its just 
 '\x00. Now I can just replace the '\x**' with '' but that seems a bit 
 inelegant. I've tried various options on the open 'rU' and 'rb' but no 
 effect.
 
 Does anybody know what the rubbish characters are and what has caused the. 
 I'm using the latest Enthought python if that matters.
 It matters, but it'd save each of us lots of trouble if you told us what 
 version that was;  especially which version of Python.  The latest Enthought 
 I see is called EPD 7.2.  But after 10 minutes on the site, I can't see 
 whether there actually is a Python on there or not.  it seems to be just a 
 bunch of libraries for Python.  But whether they're for CPython, IronPython, 
 or something else, who knows?

My fault. It's Python 2.7.1, with the IPython interpreter. 

 
 
 I don't see any rubbish characters.  What I see is some unicode strings, 
 displayed as though they were byte strings.  the first two bytes are the BOM 
 code, commonly put at the beginning of a file encoded in UTF-16.  The 
 remaining pairs of bytes are UTF-16 encodings for ordinary characters.  
 Notepad would recognize the UTF-16 encoding, and display the characters 
 correctly.  Perhaps you need to do the same.

Yes, well, this was the insight I was after. At one stage I was using a 
distribution compiled for Unicode (so I'm guessing I would never have seen 
this problem then), but it seems the latest distribution from Enthought is 
non-Unicode (I've sent them an email to confirm this, but that's what it 
looks like). This is the first time I've explicitly faced Unicode input from a 
text file, so I was unfamiliar with the \x00 stuff and with how it works and 
displays itself in a normal string. Mostly I've seen Unicode strings in Python 
as u'string' and never paid much attention (unless I passed them as a file 
name to open(), when they caused all sorts of grief until I realised I needed 
to change their type to str with str()).


Since this is a one-off to get one of my PhD students out of a hole, I might 
just filter out the \x** characters explicitly, since the remainder looks OK. 
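
For comparison, here is a sketch of the filtering approach next to proper decoding. It assumes the sample line from the output above (with the trailing \x00 of the newline restored); the variable names are illustrative. Filtering works here only because every character in the WMIC output is ASCII, so the second byte of each pair is always \x00.

```python
# First line of the file as readline() returned it, plus the trailing
# \x00 that belongs to the two-byte newline.
line = (b'\xff\xfeP\x00r\x00o\x00c\x00e\x00s\x00s\x00I\x00d\x00'
        b' \x00 \x00\r\x00\n\x00')

# Approach 1: strip the BOM and the NUL bytes by hand.
filtered = line.replace(b'\xff\xfe', b'').replace(b'\x00', b'').decode('ascii')

# Approach 2: decode as UTF-16, which handles the BOM and byte pairs itself.
decoded = line.decode('utf-16')

print(filtered == decoded)

# A non-ASCII character shows where filtering stops working: the euro sign
# U+20AC is stored as the pair \xac\x20, which contains no NUL byte to remove.
euro = u'\u20ac'.encode('utf-16-le')
mangled = euro.replace(b'\x00', b'')
print(repr(mangled))
```

For process IDs the two approaches give the same text, so filtering is a workable one-off; decoding is the safer habit if the file could ever contain other characters.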

As background, the reason for this is to manage a stand-alone science code 
developed elsewhere and ensure that its CPU usage doesn't go out of control. 
We're doing thousands of runs with this code (Monte Carlo simulation), 
launching the code for each simulation with os.system(). Occasionally a 
simulation goes into an infinite loop, which stalls the Monte Carlo run, so we 
just want to be able to kill that simulation and go to the next one. We do 
this sort of thing on *NIX all the time using the Unix command 'ps', but 
because the executable we need to use is somebody else's we are stuck on 
Windows, and WMIC looks like the easiest, quickest way to achieve this sort of 
process control there. If anybody has any other ideas for doing this directly 
from Python in a way that might be platform independent (being able to set 
some CPU limits on a popen call, for instance) I'd be interested, but most of 
the solutions I've found on the web look rather difficult.
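
One platform-independent possibility, sketched under assumptions: it needs Python 3.5 or later (subprocess.run() and its timeout argument are not available in the Python 2.7.1 mentioned above), it limits wall-clock time rather than CPU time, and the function name run_with_timeout and the stand-in commands are hypothetical.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it overran and was killed.

    The limit is wall-clock time, not CPU time.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and waited for the child here.
        return None


# Stand-ins for the real simulation executable: one child that exits at
# once and one that sleeps for 30 seconds, like a run stuck in a loop.
quick = [sys.executable, "-c", "pass"]
stuck = [sys.executable, "-c", "import time; time.sleep(30)"]

print(run_with_timeout(quick, timeout_s=10))
print(run_with_timeout(stuck, timeout_s=1))
```

Only the direct child is killed, so an executable that spawns its own helper processes would need extra handling.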


Prof Garry Willgoose,
Director, Centre for Climate Impact Management (C2IM),
Head

Re: [Tutor] bogus characters in a windows file

2012-02-08 Thread Marc Tompkins

On Wed, Feb 8, 2012 at 5:46 PM, Garry Willgoose 
garry.willgo...@newcastle.edu.au wrote:

 I'm reading a file output by the system utility WMIC in windows (so I can
 track CPU usage by process ID) and the text file WMIC outputs seems to have
 extra characters in I've not seen before.

 I use os.system('WMIC /OUTPUT:c:\cpu.txt PROCESS GET ProcessId') to output
 the file and parse file c:\cpu.txt

 The first few lines of the file look like this in notepad

 ProcessId
 0
 4
 568
 624
 648


 I input the data with the lines

 infile = open('c:\cpu.txt','r')
 infile.readline()
 infile.readline()
 infile.readline()

 the readline()s yield the following output

 '\xff\xfeP\x00r\x00o\x00c\x00e\x00s\x00s\x00I\x00d\x00 \x00 \x00\r\x00\n'
 '\x000\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'
 '\x004\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'

 Now for the first line the title 'ProcessId' is in this string but the
 individual characters are separated by '\x00' and at least for the first
 line of the file there is an extra '\xff\xfe'. For subsequent its just
 '\x00. Now I can just replace the '\x**' with '' but that seems a bit
 inelegant. I've tried various options on the open 'rU' and 'rb' but no
 effect.

 Does anybody know what the rubbish characters are and what has caused the.
 I'm using the latest Enthought python if that matters.

You're trying to read a Unicode text file byte-by-byte.  It'll end in
tears...
The \xff\xfe at the beginning is the Byte Order Marker or BOM.

Here's a quick primer on Unicode:
http://www.joelonsoftware.com/articles/Unicode.html
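
A small sketch of how the BOM can be used to recognise such a file before decoding it. The function name sniff_utf16 is hypothetical; the two BOM constants come from the standard codecs module. The sketch ignores UTF-32, whose little-endian BOM begins with the same two bytes.

```python
import codecs


def sniff_utf16(first_bytes):
    """Return a codec name if the data starts with a UTF-16 BOM, else None."""
    if first_bytes.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return "utf-16-le"
    if first_bytes.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return "utf-16-be"
    return None


# Start of the WMIC output from the question, then plain ASCII for contrast.
print(sniff_utf16(b'\xff\xfeP\x00r\x00'))
print(sniff_utf16(b'ProcessId'))
```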

Re: [Tutor] bogus characters in a windows file

2012-02-08 Thread Dave Angel


On 02/08/2012 08:46 PM, Garry Willgoose wrote:

I'm reading a file output by the system utility WMIC in windows (so I can track 
CPU usage by process ID) and the text file WMIC outputs seems to have extra 
characters in I've not seen before.

I use os.system('WMIC /OUTPUT:c:\cpu.txt PROCESS GET ProcessId') to output the 
file and parse file c:\cpu.txt


First mistake.  If you use backslash inside a python literal string, you 
need to do one of two things:

   1) use a raw string
   2) double the backslash
It so happens that \c is not a python escape sequence, so you escaped 
this particular bug.
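
To illustrate the point about backslashes, here is a short sketch. The path c:\temp.txt is a hypothetical example chosen because \t, unlike \c, is a real escape sequence; the variable names are illustrative.

```python
# Two spellings that produce exactly the same string.
raw_path = r'c:\cpu.txt'      # raw string: the backslash is kept as-is
doubled_path = 'c:\\cpu.txt'  # doubled backslash

# Forward slashes avoid the issue entirely.
forward_path = 'c:/cpu.txt'

# Hypothetical file name where the unescaped form silently breaks:
# \t is an escape sequence, so this string contains a TAB character.
broken_path = 'c:\temp.txt'

print(raw_path == doubled_path)
print('\t' in broken_path)
print(len(r'c:\temp.txt'), len(broken_path))
```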



The first few lines of the file look like this in notepad

ProcessId
0
4
568
624
648


I input the data with the lines

infile = open('c:\cpu.txt','r')
Same thing.  You should either make it r'c:\cpu.txt'   or   
'c:\\cpu.txt'  or  even 'c:/cpu.txt'

infile.readline()
infile.readline()
infile.readline()


OK, so you throw away the first 3 lines of the file.


the readline()s yield the following output

'\xff\xfeP\x00r\x00o\x00c\x00e\x00s\x00s\x00I\x00d\x00 \x00 \x00\r\x00\n'
'\x000\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'
'\x004\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'

Now, how did you get those bytes displayed;  they've already been thrown 
out.

Now for the first line the title 'ProcessId' is in this string but the 
individual characters are separated by '\x00' and at least for the first line 
of the file there is an extra '\xff\xfe'. For subsequent its just '\x00. Now I 
can just replace the '\x**' with '' but that seems a bit inelegant. I've tried 
various options on the open 'rU' and 'rb' but no effect.

Does anybody know what the rubbish characters are and what has caused the. I'm 
using the latest Enthought python if that matters.
It matters, but it'd save each of us lots of trouble if you told us what 
version that was, especially which version of Python.  The latest 
Enthought I see is called EPD 7.2.  But after 10 minutes on the site, I 
can't see whether there actually is a Python on there or not.  It seems 
to be just a bunch of libraries for Python.  But whether they're for 
CPython, IronPython, or something else, who knows?



I don't see any rubbish characters.  What I see is some Unicode strings, 
displayed as though they were byte strings.  The first two bytes are the 
BOM code, commonly put at the beginning of a file encoded in UTF-16.  
The remaining pairs of bytes are UTF-16 encodings for ordinary 
characters.  Notepad would recognize the UTF-16 encoding, and display 
the characters correctly.  Perhaps you need to do the same.
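
The "pairs of bytes" can be made visible with a few lines of Python. This is a sketch for illustration only: the byte string is the word ProcessId as it appears in the output above, after the BOM, and the names pairs, code_units and word are made up for the example.

```python
import struct

# 'ProcessId' as stored in the file after the BOM: one 2-byte pair
# per character, low byte first.
pairs = b'P\x00r\x00o\x00c\x00e\x00s\x00s\x00I\x00d\x00'

# '<' means little-endian and 'H' an unsigned 16-bit integer, so each
# pair of bytes becomes one UTF-16 code unit.
code_units = struct.unpack('<%dH' % (len(pairs) // 2), pairs)
print(code_units)

# For these characters each code unit is simply the character's code point.
word = ''.join(chr(unit) for unit in code_units)
print(word)
```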


You showed us a fragment of code which would throw away the first 3 
lines of the file.  You don't show us any code indicating what you mean 
by "yield the following output".


So you want us to read your mind, and tell you what's there?



--

DaveA


Re: [Tutor] bogus characters in a windows file

2012-02-08 Thread Marc Tompkins

On Wed, Feb 8, 2012 at 6:09 PM, Marc Tompkins marc.tompk...@gmail.com wrote:

 On Wed, Feb 8, 2012 at 5:46 PM, Garry Willgoose 
 garry.willgo...@newcastle.edu.au wrote:

 I'm reading a file output by the system utility WMIC in windows (so I can
 track CPU usage by process ID) and the text file WMIC outputs seems to have
 extra characters in I've not seen before.

 I use os.system('WMIC /OUTPUT:c:\cpu.txt PROCESS GET ProcessId') to
 output the file and parse file c:\cpu.txt

 The first few lines of the file look like this in notepad

 ProcessId
 0
 4
 568
 624
 648


 I input the data with the lines

 infile = open('c:\cpu.txt','r')
 infile.readline()
 infile.readline()
 infile.readline()

 the readline()s yield the following output

 '\xff\xfeP\x00r\x00o\x00c\x00e\x00s\x00s\x00I\x00d\x00 \x00 \x00\r\x00\n'
 '\x000\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'
 '\x004\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'

 Now for the first line the title 'ProcessId' is in this string but the
 individual characters are separated by '\x00' and at least for the first
 line of the file there is an extra '\xff\xfe'. For subsequent its just
 '\x00. Now I can just replace the '\x**' with '' but that seems a bit
 inelegant. I've tried various options on the open 'rU' and 'rb' but no
 effect.

 Does anybody know what the rubbish characters are and what has caused
 the. I'm using the latest Enthought python if that matters.

 You're trying to read a Unicode text file byte-by-byte.  It'll end in
 tears...
 The \xff\xfe at the beginning is the Byte Order Marker or BOM.

 Here's a quick primer on Unicode:
 http://www.joelonsoftware.com/articles/Unicode.html

 In particular, this phrase:

 we decided to do everything internally in UCS-2 (two byte) Unicode, which
 is what Visual Basic, COM, and Windows NT/2000/XP use as their native
 string type.

