Something I've occasionally found helpful with problem text files is to build a histogram of character counts, something like this:
"""
chist.py
print a histogram of character frequencies in a named input file
"""
import sys

whitespace = ' \t\n\r\v\f'
lowercase = 'abcdefghijklmnopqrstuvwxyz'
uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
letters = lowercase + uppercase
ascii_lowercase = lowercase
ascii_uppercase = uppercase
ascii_letters = ascii_lowercase + ascii_uppercase
digits = '0123456789'
hexdigits = digits + 'abcdef' + 'ABCDEF'
octdigits = '01234567'
punctuation = """!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable = digits + letters + punctuation

try:
    fname = sys.argv[1]
except IndexError:
    print "usage is chist yourfilename"
    sys.exit()

# count every byte value; binary mode so Windows does not stop at Ctrl-Z
chars = {}
f = open(fname, "rb")
lines = f.readlines()
f.close()
for line in lines:
    for c in line:
        try:
            chars[ord(c)] += 1
        except KeyError:
            chars[ord(c)] = 1

ords = chars.keys()
ords.sort()
for o in ords:
    # whitespace and control characters are shown as UNP (unprintable)
    if chr(o) in printable:
        c = chr(o)
    else:
        c = "UNP"
    print "%5d %-5s %10d" % (o, c, chars[o])
print "_" * 50

Gerry

On Dec 20, 5:47 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Dec 21, 8:13 am, Steven D'Aprano <[EMAIL PROTECTED]
> cybersource.com.au> wrote:
> > [Fixing top-posting.]
> >
> > On Thu, 20 Dec 2007 12:41:44 -0800, Wojciech Gryc wrote:
> > > On Dec 20, 3:30 pm, John Machin <[EMAIL PROTECTED]> wrote:
> > [snip]
> > >> > However, when I use Python's various methods -- readline(),
> > >> > readlines(), or xreadlines() and loop through the lines of the file,
> > >> > the line program exits at 16,000 lines. No error output or anything
> > >> > -- it seems the end of the loop was reached, and the code was
> > >> > executed successfully.
> > ...
> > >> One possibility: you are running this on Windows and the file contains
> > >> Ctrl-Z aka chr(26) aka '\x1a'.
> >
> > > Hi,
> >
> > > Python 2.5, on Windows XP. Actually, I think you may be right about \x1a
> > > -- there's a few lines that definitely have some strange character
> > > sequences, so this would make sense... Would you happen to know how I
> > > can actually fix this (e.g. replace the character)? Since Python doesn't
> > > see the rest of the file, I don't even know how to get to it to fix the
> > > problem... Due to the nature of the data I'm working with, manual
> > > editing is also not an option.
> >
> > > Thanks,
> > > Wojciech
> >
> > Open the file in binary mode:
> >
> > open(filename, 'rb')
> >
> > and Windows should do no special handling of Ctrl-Z characters.
> >
> > --
> > Steven
>
> I don't know whether it's a bug or a feature or just a dark corner,
> but using mode='rU' does no special handling of Ctrl-Z either.
>
> >>> x = 'foo\r\n\x1abar\r\n'
> >>> f = open('udcray.txt', 'wb')
> >>> f.write(x)
> >>> f.close()
> >>> open('udcray.txt', 'r').readlines()
> ['foo\n']
> >>> open('udcray.txt', 'rU').readlines()
> ['foo\n', '\x1abar\n']
> >>> for line in open('udcray.txt', 'rU'):
> ...     print repr(line)
> ...
> 'foo\n'
> '\x1abar\n'
>
> Using 'rU' should make the OP's task of finding the strange character
> sequences a bit easier -- he won't have to read a block at a time and
> worry about the guff straddling a block boundary.
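
For the "how do I actually replace the character" question quoted above, here is a minimal sketch of the binary-mode approach. It is written in Python 3 syntax (the thread itself is on Python 2.5), and the names find_byte_offsets and replace_ctrl_z are made up for illustration. It also assumes the file fits in memory; a very large file would need to be processed in blocks instead.

```python
CTRL_Z = b"\x1a"  # chr(26), treated as end-of-file by Windows text mode


def find_byte_offsets(data, needle=CTRL_Z):
    """Return the offset of every occurrence of needle in data (bytes)."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + 1)
    return offsets


def replace_ctrl_z(src_path, dst_path, replacement=b""):
    """Copy src_path to dst_path with every Ctrl-Z byte replaced.

    Both files are opened in binary mode, so no platform-specific
    handling of Ctrl-Z or line endings takes place.
    Returns the offsets at which Ctrl-Z was found in the source.
    """
    with open(src_path, "rb") as src:
        data = src.read()  # assumption: the whole file fits in memory
    offsets = find_byte_offsets(data)
    with open(dst_path, "wb") as dst:
        dst.write(data.replace(CTRL_Z, replacement))
    return offsets
```

Calling replace_ctrl_z('udcray.txt', 'udcray_clean.txt') on the test file from the quoted session should leave a copy that reads normally in text mode, and the returned offsets show where the stray bytes were. Writing to a separate output file rather than overwriting the input is deliberate, so the original data stays available for checking.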