Martin Hvidberg wrote:
I'm reading a fixed-format text file, line by line. The code is below;
I have <snipped> the parts not related to reading the file.
The only relevant detail left out is lstCutters. It looks like this:
[[1, 9], [11, 21], [23, 48], [50, 59], [61, 96], [98, 123], [125, 150]]
It specifies the first and last character position of each token in the
fixed format of the input line.
All this works fine, and is only to explain where I'm going.

The code in the function definition is broken into more lines than
necessary so that I can monitor the variables step by step.

--- Code start ------

import codecs

<snip>

def CutLine2List(strIn,lstCut):
    strIn = strIn.strip()
    print '>InNextLine>',strIn
    # skip if line is empty
    if len(strIn)<1:
        return False
    lstIn = list()
    for cc in lstCut:
        strSubline =strIn[cc[0]-1:cc[1]-1].strip()
        lstIn.append(strSubline)
        print '>InSubline2>'+lstIn[len(lstIn)-1]+'<'
    del strIn, lstCut,cc
    print '>InReturLst>',lstIn
    return lstIn

<snip>

filIn = codecs.open(
    strFileNameIn,
    mode='r',
    encoding='utf-8',
    errors='strict',
    buffering=1)
for linIn in filIn:
    lstIn = CutLine2List(linIn,lstCutters)

--- Code end ------

A sample output, representing one line from the input file looks like this:

 >InNextLine> I 30 2002-12-11 20:01:19.280 563 FANØ
2001-12-12-15.46.12.734502 2001-12-12-15.46.12.734502
 >InSubline2>I<
 >InSubline2>30<
 >InSubline2>2002-12-11 20:01:19.280<
 >InSubline2>563<
 >InSubline2>FANØ<
 >InSubline2>2001-12-12-15.46.12.73450<
 >InSubline2>2001-12-12-15.46.12.73450<
 >InReturLst> [u'I', u'30', u'2002-12-11 20:01:19.280', u'563',
u'FAN\xd8', u'2001-12-12-15.46.12.73450', u'2001-12-12-15.46.12.73450']


Question:
In the last printout, tagged >InReturLst>, all entries turn into
Unicode. What happens here?
Look for the word 'FANØ'. This word changes from 'FANØ' to u'FAN\xd8'.
That's a problem for me, and I don't want it to change like this.

What do I do to stop this behavior?

Best Regards
Martin


If you don't want Unicode, why do you specify that the file is encoded as UTF-8? If it's ASCII, just open the file without using a UTF-8 codec. Of course, then you'll have to fix the input file to make it ASCII.
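To make the difference concrete, here is a minimal sketch that reads the same line once through a UTF-8 decoder and once as raw bytes. The temporary sample file and the helper name `read_first_line` are made up for illustration and are not part of the original script. It uses `io.open`, which should behave the same on Python 2.7 and Python 3.

```python
import io
import os
import tempfile

def read_first_line(path, encoding=None):
    """Return the first line of `path` without its line ending.

    With an encoding the result is decoded text; without one it is raw
    bytes. Hypothetical helper, for illustration only.
    """
    if encoding is None:
        with io.open(path, mode='rb') as f:
            return f.readline().rstrip(b'\r\n')
    with io.open(path, mode='r', encoding=encoding) as f:
        return f.readline().rstrip(u'\r\n')

# Build a throwaway one-line sample file holding the word from the thread.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(u'FAN\xd8\n'.encode('utf-8'))

decoded = read_first_line(sample_path, encoding='utf-8')  # text
raw = read_first_line(sample_path)                        # undecoded bytes

print('%d characters, %d bytes' % (len(decoded), len(raw)))  # 4 characters, 5 bytes

os.remove(sample_path)
```

One thing to watch if you go the raw-bytes route: Ø takes two bytes in UTF-8, so byte offsets after it no longer line up with character positions in a fixed-format line. Decoding first keeps one position per character.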

The character in the input file following the letters "FAN" is not a zero; it's a different character, apparently U+00D8 in the Unicode table (Latin capital letter O with stroke), not U+0030 (digit zero).
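If in doubt about which character you are looking at, the standard `unicodedata` module can name it. A short sketch, comparing the character from the sample output with an ordinary zero:

```python
import unicodedata

# The character after "FAN" in the sample, and a plain zero for comparison.
for ch in (u'\xd8', u'0'):
    print('U+%04X %s' % (ord(ch), unicodedata.name(ch)))

# U+00D8 LATIN CAPITAL LETTER O WITH STROKE
# U+0030 DIGIT ZERO
```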

It didn't "change" in the InReturLst line. You were reading Unicode strings from the file all along. When you print a Unicode string, it is encoded in whatever your console device specifies. But when you print a list, it uses repr() on the elements, so you get to see what their real type is.
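A small sketch of that point, written to run on both Python 2.7 and Python 3. It sticks to comparisons rather than printing the word itself, because what a console shows depends on its encoding. `format_row` is a hypothetical helper, not something from the original script; it is one way to get readable output, assuming you only need the list for display.

```python
def format_row(tokens, sep=u' | '):
    """Join tokens into one text string for display.

    Hypothetical helper: printing this string shows the characters
    themselves, whereas printing the list shows repr() of each element.
    """
    return sep.join(tokens)

word = u'FAN\xd8'          # \xd8 is only an escape for the character U+00D8
tokens = [u'563', word]

# Putting a string in a list does not alter it.
same = tokens[1] == word and len(tokens[1]) == 4

# The text form of the list is built from repr() of each element.
uses_repr = repr(word) in repr(tokens)

row = format_row(tokens)   # one text string; print this instead of the list

print('%s %s' % (same, uses_repr))  # True True
```

On Python 2 the repr() of a unicode object carries the `u` prefix and escapes non-ASCII characters, which is where `u'FAN\xd8'` in the printout comes from. The string stored in the list is still the same four characters.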

DaveA

--
http://mail.python.org/mailman/listinfo/python-list
