On 8 Jul 2005 11:31:14 -0700, "gov" <[EMAIL PROTECTED]> wrote:
>Hi,
>
>I've just started to learn programming and was told this was a good
>place to ask questions :)
>
>Where I work, we receive large quantities of data which is currently
>all printed on large, obsolete, dot matrix printers. This is a problem
>because the replacement parts will not be available for much longer.
>
>So I'm trying to create a program which will capture the fixed width
>text file data and convert as well as sort the data (there are several
>different report types) into a different format which would allow it to
>be printed normally, or viewed on a computer.
>
>I've been reading up on the Regular Expression module and ways in which
>to manipulate strings however it has been difficult to think of a way
>in which to extract an address.
>
>Here's an example of the raw text that I have to work with:
>
>
>ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
>****************************
>
>FOR/POUR AL/LA: 20
> CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
> LANG: E CONS/REGR: #######
> MRS XXX X XXXXXXX
> ### XXXXXXXXX ST DD TYP: P:6
>CHNGD/CHANG
> MONCTON NB LANG: E CONS/REGR:
>#######
> MRS XXX X XXXXXXX
> #####
> ####
> ###-###-#
>
>ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
>****************************
>
>FOR/POUR AL/LA: 30
> BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
> LANG: E CONS/REGR: #######
> MISS XXXX XXXXX
> ### XXXXXXXX ST
> MONCTON NB
>
>EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
>***********
>
>(the # = any number, and the X's are just regular text)
>I would like to extract the address information, but the two different
>text objects on the right hand side are difficult to remove. I think
>it would be easier if I could just extract a fixed square of
>information, but I don't have a clue as to how to go about it.
>
>If anyone could give me suggestions as to methods in sorting this type
>of data, it would be appreciated.
>

If this is all fixed-width font characters and fixed record formats, you
might get some ideas about extracting a "fixed square". I re-joined the
strings of the fixed square with '\n'.join(<lines_of_the_square>) to print
it, but you could extract data from the lines in various ways with regexes
and such. I used your data example and added some under the alternate
header. (Not tested beyond what you see ;-)

----< legacy_data_parsing.py >---------------------------------------------------
data = """\
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************

FOR/POUR AL/LA: 20
 CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
 LANG: E CONS/REGR: #######
 MRS XXX X XXXXXXX
 ### XXXXXXXXX ST DD TYP: P:6 CHNGD/CHANG
 MONCTON NB LANG: E CONS/REGR: #######
 MRS XXX X XXXXXXX
 #####
 ####
 ###-###-#

ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************

FOR/POUR AL/LA: 30
 BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
 LANG: E CONS/REGR: #######
 MISS XXXX XXXXX
 ### XXXXXXXX ST
 MONCTON NB

EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
***********
1 [Don't know what                [<- 1,34 This is a box of
2 goes in this kind                text with top/left
3 of record, but this              character row/col 1,34
4 is some text to show             and bottom/right at 4,62 ->]
5 how it might get
6 extracted]
"""

record_headers = [
"""\
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
""",
"""\
EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
"""
]

import re
recsplitter = re.compile('('+ '|'.join(map(re.escape,record_headers))+')')

def extract_block(tl, br, data):
    lines = [s.ljust(br[1]+1) for s in data.splitlines()]
    return '\n'.join([line[tl[1]:br[1]+1] for line in lines[tl[0]:br[0]+1]])

for i, hdr_or_body in enumerate(recsplitter.split(data)):
    if i==0:
        print '='*10, 'file prefix', '='*30
        data_type = ''
    elif i%2:
        print '='*10, 'record hdr', '='*30
        data_type = hdr_or_body
    else:
        print '='*10, 'record data', '='*30
    print hdr_or_body
    print '='*10
    if not i%2 and data_type == record_headers[1]: # EARNINGS etc
        print '---- earnings record right block ----'
        print extract_block((1,34),(4,62), hdr_or_body)
        print '----'
---------------------------------------------------------------------------------

Produces:

[15:33] C:\pywk\clp>py24 legacy_data_parsing.py
========== file prefix ==============================

==========
========== record hdr ==============================
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:

==========
========== record data ==============================
****************************

FOR/POUR AL/LA: 20
 CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
 LANG: E CONS/REGR: #######
 MRS XXX X XXXXXXX
 ### XXXXXXXXX ST DD TYP: P:6 CHNGD/CHANG
 MONCTON NB LANG: E CONS/REGR: #######
 MRS XXX X XXXXXXX
 #####
 ####
 ###-###-#


==========
========== record hdr ==============================
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:

==========
========== record data ==============================
****************************

FOR/POUR AL/LA: 30
 BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
 LANG: E CONS/REGR: #######
 MISS XXXX XXXXX
 ### XXXXXXXX ST
 MONCTON NB


==========
========== record hdr ==============================
EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:

==========
========== record data ==============================
***********
1 [Don't know what                [<- 1,34 This is a box of
2 goes in this kind                text with top/left
3 of record, but this              character row/col 1,34
4 is some text to show             and bottom/right at 4,62 ->]
5 how it might get
6 extracted]

==========
---- earnings record right block ----
[<- 1,34 This is a box of
 text with top/left
 character row/col 1,34
 and bottom/right at 4,62 ->]
----

HTH

Regards,
Bengt Richter
-- 
http://mail.python.org/mailman/listinfo/python-list
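
For readers on Python 3, the "fixed square" idea from the script above might be sketched roughly as follows. This is only an illustration under stated assumptions: the function name and the inclusive (row, col) corner convention are borrowed from the post, while the sample rows and the LEFT_WIDTH column boundary are hypothetical and would have to be replaced by the real column positions of each report type.

```python
def extract_block(top_left, bottom_right, text):
    """Return the rows of the rectangle between two inclusive (row, col) corners."""
    top, left = top_left
    bottom, right = bottom_right
    rows = text.splitlines()[top:bottom + 1]
    # Pad short lines so that every row of the block has the same width.
    return [row.ljust(right + 1)[left:right + 1] for row in rows]


# Hypothetical layout: address in the first LEFT_WIDTH columns, other fields after it.
LEFT_WIDTH = 20
rows = [
    ("MRS JANE DOE", "TYP: P:6"),
    ("12 MAIN ST", "LANG: E"),
    ("MONCTON NB", ""),
]
sample = "\n".join((left.ljust(LEFT_WIDTH) + right).rstrip() for left, right in rows)

# Cut out the left-hand square and drop the padding on each row.
address = [line.rstrip() for line in extract_block((0, 0), (2, LEFT_WIDTH - 1), sample)]
print("\n".join(address))
```

Returning a list of rows instead of a joined string leaves it to the caller to strip the padding or to run a regex over each row; '\n'.join(...) gives back the printable form used in the post.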