Re: Reading by positions plain text files
On Dec 12, 11:21 pm, Dennis Lee Bieber wlfr...@ix.netcom.com wrote:

On Sun, 12 Dec 2010 07:02:13 -0800 (PST), javivd javiervan...@gmail.com declaimed the following in gmane.comp.python.general:

    f = open(r'c:c:\somefile.txt', 'w')
    f.write('0123456789\n0123456789\n0123456789')

Not the most explanatory sample data... It would be better if the records had different contents.

    f.close()
    f = open(r'c:\somefile.txt', 'r')
    for line in f:

Here you extract one line from the file.

        f.seek(3,0)
        print f.read(1) #just to know if its printing the right column

And here you ignored the entire line you read, seeking to the fourth byte from the beginning of the file, and reading just one byte from it.

I have no idea of how seek()/read() behaves relative to line iteration in the for loop... Given the small size of the test data set it is quite likely that the first "for line in f" resulted in the entire file being read into a buffer, and that buffer scanned to find the line ending and return the data preceding it; then the buffer position is set to after that line ending so the next "for line" continues from that point.

But in a situation with a large data set, or an unbuffered I/O system, the seek()/read() could easily result in resetting the file position used by the "for line", so that the second call returns 456789\n... And all subsequent calls too, resulting in an infinite loop.

Presuming the assignment requires pulling multiple selected fields from individual records, where each record is of the same format/spacing, AND that the field selection can not be preprogrammed...

Sample data file (use fixed width font to view):
-=-=-=-=-=-
Wulfraed       09Ranger  1915
Bask Euren     13Cleric  1511
Aethelwulf     07Mage    0908
Cwiculf        08Mage    1008
-=-=-=-=-=-

Sample format definition file (name and column range, tab separated):
-=-=-=-=-=-
Name	0-14
Level	15-16
Class	17-24
THAC0	25-26
Armor	27-28
-=-=-=-=-=-

Code to process (Python 2.5, with minimal error handling):
-=-=-=-=-=-
class Extractor(object):
    def __init__(self, formatFile):
        ff = open(formatFile, "r")
        self._format = {}
        self._length = 0
        for line in ff:
            form = line.split("\t") #file must be tab separated
            if len(form) != 2:
                print "Invalid file format definition: %s" % line
                continue
            name = form[0]
            columns = form[1].split("-")
            if len(columns) == 1: #single column definition
                start = int(columns[0])
                end = start
            elif len(columns) == 2:
                start = int(columns[0])
                end = int(columns[1])
            else:
                print "Invalid column definition: %s" % form[1]
                continue
            self._format[name] = (start, end)
            self._length = max(self._length, end)
        ff.close()

    def __call__(self, line):
        data = {}
        if len(line) < self._length:
            print "Data line is too short for required format: ignored"
        else:
            for (name, (start, end)) in self._format.items():
                data[name] = line[start:end+1]
        return data

if __name__ == "__main__":
    FORMATFILE = "SampleFormat.tsv"
    DATAFILE = "SampleData.txt"
    characterExtractor = Extractor(FORMATFILE)
    df = open(DATAFILE, "r")
    for line in df:
        fields = characterExtractor(line)
        for (name, value) in fields.items():
            print "Field name: '%s'\t\tvalue: '%s'" % (name, value)
        print
    df.close()
-=-=-=-=-=-

Output from running above code:
-=-=-=-=-=-
Field name: 'Armor'        value: '15'
Field name: 'THAC0'        value: '19'
Field name: 'Level'        value: '09'
Field name: 'Class'        value: 'Ranger  '
Field name: 'Name'         value: 'Wulfraed       '

Field name: 'Armor'        value: '11'
Field name: 'THAC0'        value: '15'
Field name: 'Level'        value: '13'
Field name: 'Class'        value: 'Cleric  '
Field name: 'Name'         value: 'Bask Euren     '

Field name: 'Armor'        value: '08'
Field name: 'THAC0'        value: '09'
Field name: 'Level'        value: '07'
Field name: 'Class'        value: 'Mage    '
Field name: 'Name'         value: 'Aethelwulf     '

Field name: 'Armor'        value: '08'
Field name: 'THAC0'        value: '10'
Field name: 'Level'        value: '08'
Field name: 'Class'        value: 'Mage    '
Field name: 'Name'         value: 'Cwiculf        '
-=-=-=-=-=-

Note that string fields have not been trimmed, also numeric fields are still in text format... The format definition file would need to be expanded to include a string, integer, float (and Boolean?) code in order for the extractor to do proper type
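The closing note about type codes can be illustrated with a short sketch. It is written in Python 3 (not the Python 2.5 of the post), and the third tab-separated column with the codes "s", "i" and "f" is an assumption made here for illustration; the post only says that some such code would be needed.

```python
# Hypothetical type codes; the thread only suggests that such codes would be needed.
CONVERTERS = {
    "s": str.strip,  # string: trim the padding
    "i": int,        # integer
    "f": float,      # float
}

def parse_format(lines):
    """Parse 'name<TAB>start-end<TAB>code' lines into {name: (start, end, converter)}."""
    fields = {}
    for line in lines:
        name, columns, code = line.rstrip("\n").split("\t")
        start, _, end = columns.partition("-")
        end = end or start  # single-column definition such as "27"
        fields[name] = (int(start), int(end), CONVERTERS[code])
    return fields

def extract(fields, line):
    """Slice one fixed-width record (0-based, inclusive columns) and convert each field."""
    return {name: convert(line[start:end + 1])
            for name, (start, end, convert) in fields.items()}

FORMAT = [
    "Name\t0-14\ts",
    "Level\t15-16\ti",
    "Class\t17-24\ts",
    "THAC0\t25-26\ti",
    "Armor\t27-28\ti",
]
# First record of the sample data, rebuilt with explicit padding.
RECORD = "Wulfraed".ljust(15) + "09" + "Ranger".ljust(8) + "19" + "15"

fields = parse_format(FORMAT)
record = extract(fields, RECORD)
print(record)
```

A Boolean code could be added to CONVERTERS in the same way, for example a function mapping "1" and "0" to True and False, if the data uses that convention.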
Re: Reading by positions plain text files
On Dec 1, 7:15 am, Tim Harig user...@ilthio.net wrote:
On 2010-12-01, javivd javiervan...@gmail.com wrote:
On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:
On 2010-11-30, javivd javiervan...@gmail.com wrote:

I have a case now in which another file has been provided (besides the database) that tells me in which column of the file every variable is, because there isn't any blank or tab character that separates the variables; they are stuck together. This second file specifies the variable name and its position:

    VARIABLE NAME    POSITION (COLUMN) IN FILE
    var_name_1       123-123
    var_name_2       124-125
    var_name_3       126-126
    ..               ..
    var_name_N       512-513 (last positions)

I am unclear on the format of these positions. They do not look like what I would expect from absolute references in the data. For instance, 123-123 may only contain one byte??? which could change for different encodings and how you mark line endings. Frankly, the use of the word columns in the header suggests that the data *is* separated by line endings rather than absolute position and the position refers to the line number. In which case, you can use splitlines() to break up the data and then address the proper line by index. Nevertheless, you can use file.seek() to move to an absolute offset in the file, if that really is what you are looking for.

I work in a survey research firm. The data I'm talking about has a lot of 0-1 variables, meaning yes or no to a lot of questions, so only one position of a character is needed (not byte), explaining the 123-123 kind of positions of a lot of variables.

Then file.seek() is what you are looking for; but, you need to be aware of line endings and encodings as indicated. Make sure that you open the file using whatever encoding was used when it was generated or you could have problems with multibyte characters affecting the offsets.

I've tried your advice and something is wrong. Here is my code:

    f = open(r'c:c:\somefile.txt', 'w')
    f.write('0123456789\n0123456789\n0123456789')
    f.close()
    f = open(r'c:\somefile.txt', 'r')
    for line in f:
        f.seek(3,0)
        print f.read(1) #just to know if its printing the right column

I used .seek() in this manner, but it is not working.

Let me put the problem another way. I have a .txt file with NO headers, and NO blanks between any columns. But I know that columns, say, 13 to 15 hold the variable VARNAME_1 (of course, a three-digit variable). How can I extract that column into a list called VARNAME_1? Obviously, this should extend to all the positions and variables I have to extract from the file.

Thanks!

J

-- http://mail.python.org/mailman/listinfo/python-list
Re: Reading by positions plain text files
On 2010-12-12, javivd javiervan...@gmail.com wrote:
On Dec 1, 7:15 am, Tim Harig user...@ilthio.net wrote:
On 2010-12-01, javivd javiervan...@gmail.com wrote:
On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:

encodings and how you mark line endings. Frankly, the use of the word columns in the header suggests that the data *is* separated by line endings rather than absolute position and the position refers to the line number. In which case, you can use splitlines() to break up the data and then address the proper line by index. Nevertheless,

Note that I specifically questioned the use of absolute file position vs. position within a column. These are two different things. You use different methods to extract each.

I work in a survey research firm. The data I'm talking about has a lot of 0-1 variables, meaning yes or no to a lot of questions, so only one position of a character is needed (not byte), explaining the 123-123 kind of positions of a lot of variables.

Then file.seek() is what you are looking for; but, you need to be aware of line endings and encodings as indicated. Make sure that you open the file using whatever encoding was used when it was generated or you could have problems with multibyte characters affecting the offsets.

    f = open(r'c:c:\somefile.txt', 'w')

I suspect you don't need to use the c: twice.

    f.write('0123456789\n0123456789\n0123456789')

Note that the file you are writing contains three lines. Is the data that you are looking for located at an absolute position in the file or at a position within an individual line? If the latter, note that line endings may be composed of more than a single character.

    f.write('0123456789\n0123456789\n0123456789')
                ^ position 3 using seek()

    for line in f:

Perhaps you meant:

    for character in f.read():

or

    for line in f.read().splitlines()

    f.seek(3,0)

This will always take you back to the exact fourth position in the file (indicated above).

I used .seek() in this manner, but it is not working.

It is working the way it is supposed to. If you want the absolute position 3 in a file then:

    f = open('somefile.txt', 'r')
    f.seek(3)
    variable = f.read(1)

If you want the absolute position in a column:

    f = open('somefile.txt', 'r').read().splitlines()
    for column in f:
        variable = column[3]
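The difference between the two readings can be seen side by side in a small sketch. It uses Python 3 rather than the Python 2 of the thread, a temporary file instead of c:\somefile.txt, and sample lines with distinct contents; all three choices are assumptions made for the illustration.

```python
import os
import tempfile

# Three records with different contents, so the two results are easy to tell apart.
data = "abcdefghij\nklmnopqrst\nuvwxyz0123"

# newline="" writes "\n" exactly as given on every platform.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, newline="") as tmp:
    tmp.write(data)
    path = tmp.name

# Absolute position: offset 3 counted from the start of the whole file.
with open(path, "rb") as f:
    f.seek(3)
    absolute = f.read(1).decode("ascii")

# Position within each line: index 3 of every record.
with open(path, "r", newline="") as f:
    per_line = [line[3] for line in f.read().splitlines()]

os.remove(path)
print(absolute)
print(per_line)
```

The seek() call yields a single character for the whole file, while the per-line index yields one character per record, which is closer to what a column definition describes.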
Re: Reading by positions plain text files
On 2010-12-12, Tim Harig user...@ilthio.net wrote:

I used .seek() in this manner, but it is not working.

It is working the way it is supposed to. If you want the absolute position in a column:

    f = open('somefile.txt', 'r').read().splitlines()
    for column in f:
        variable = column[3]

or:

    f = open('somefile.txt', 'r')
    for column in f.readlines():
        variable = column[3]
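Extending the single index to a range of columns gives the list asked for earlier in the thread. The following is a minimal Python 3 sketch; the helper name column_values, the in-memory sample and the assumption that columns 13 to 15 are counted from 1 are all illustrative, not taken from the posts.

```python
import io

def column_values(lines, first, last, one_based=True):
    """Return the text found in columns first..last (inclusive) of every line."""
    start = first - 1 if one_based else first
    stop = last if one_based else last + 1
    return [line.rstrip("\n")[start:stop] for line in lines]

# Two made-up records: 12 filler characters, a three-digit variable, more filler.
sample = io.StringIO("x" * 12 + "123" + "yyyyy\n" + "x" * 12 + "045" + "yyyyy\n")

VARNAME_1 = column_values(sample, 13, 15)
print(VARNAME_1)
```

The values stay strings; converting them with int() is a separate step and would drop the leading zero of "045".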
Re: Reading by positions plain text files
On Dec 1, 3:15 am, Tim Harig user...@ilthio.net wrote:
On 2010-12-01, javivd javiervan...@gmail.com wrote:
On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:
On 2010-11-30, javivd javiervan...@gmail.com wrote:

I have a case now in which another file has been provided (besides the database) that tells me in which column of the file every variable is, because there isn't any blank or tab character that separates the variables; they are stuck together. This second file specifies the variable name and its position:

    VARIABLE NAME    POSITION (COLUMN) IN FILE
    var_name_1       123-123
    var_name_2       124-125
    var_name_3       126-126
    ..               ..
    var_name_N       512-513 (last positions)

I am unclear on the format of these positions. They do not look like what I would expect from absolute references in the data. For instance, 123-123 may only contain one byte??? which could change for different encodings and how you mark line endings. Frankly, the use of the word columns in the header suggests that the data *is* separated by line endings rather than absolute position and the position refers to the line number. In which case, you can use splitlines() to break up the data and then address the proper line by index. Nevertheless, you can use file.seek() to move to an absolute offset in the file, if that really is what you are looking for.

I work in a survey research firm. The data I'm talking about has a lot of 0-1 variables, meaning yes or no to a lot of questions, so only one position of a character is needed (not byte), explaining the 123-123 kind of positions of a lot of variables.

Then file.seek() is what you are looking for; but, you need to be aware of line endings and encodings as indicated. Make sure that you open the file using whatever encoding was used when it was generated or you could have problems with multibyte characters affecting the offsets.

Ok, I will try it and let you know. Thanks all!!
RE: Reading by positions plain text files
Ok. I will try it and let you know. Thanks a lot!!

J

Date: Tue, 30 Nov 2010 20:32:56 -0600
From: python.l...@tim.thechases.com
To: javiervan...@gmail.com
CC: python-list@python.org
Subject: Re: Reading by positions plain text files

On 11/30/2010 08:03 PM, javivd wrote:
On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:

    VARIABLE NAME    POSITION (COLUMN) IN FILE
    var_name_1       123-123
    var_name_2       124-125
    var_name_3       126-126
    ..               ..
    var_name_N       512-513 (last positions)

And no, MRAB, it's not the similar problem (at least from what I understood of it). I have to associate the position this file gives me with the variable name this file gives me for those positions.

MRAB may be referring to my reply in that thread where you can do something like

    OFFSETS = 'offsets.txt'
    offsets = {}
    f = file(OFFSETS)
    f.next() # throw away the headers
    for row in f:
        varname, rest = row.split()[:2]
        # sanity check
        if varname in offsets:
            print "[%s] in %s twice?!" % (varname, OFFSETS)
        if '-' not in rest: continue
        start, stop = map(int, rest.split('-'))
        offsets[varname] = slice(start, stop+1) # 0-based offsets
        #offsets[varname] = slice(start-1, stop) # 1-based offsets
    f.close()

    def do_something_with(data):
        # your real code goes here
        print data['var_name_2']

    for row in file('data.txt'):
        data = dict((name, row[offsets[name]]) for name in offsets)
        do_something_with(data)

There are additional robustness checks I'd include if your offsets file isn't controlled by you (people send me daft data).

-tkc
Reading by positions plain text files
Hi all,

Sorry, newbie question:

I have a database in a plain text file (could be .txt or .dat, it's the same) that I need to read in Python in order to do some data validation. In other cases I read this kind of file with the split() method, reading line by line. But split() relies on a separator character (I think... all I know is that it works OK).

I have a case now in which another file has been provided (besides the database) that tells me in which column of the file every variable is, because there isn't any blank or tab character that separates the variables; they are stuck together. This second file specifies the variable name and its position:

    VARIABLE NAME    POSITION (COLUMN) IN FILE
    var_name_1       123-123
    var_name_2       124-125
    var_name_3       126-126
    ..               ..
    var_name_N       512-513 (last positions)

How can I read this so that each position in the file is associated with each variable name?

Thanks a lot!!

Javier
Re: Reading by positions plain text files
On 2010-11-30, javivd javiervan...@gmail.com wrote:

I have a case now in which another file has been provided (besides the database) that tells me in which column of the file every variable is, because there isn't any blank or tab character that separates the variables; they are stuck together. This second file specifies the variable name and its position:

    VARIABLE NAME    POSITION (COLUMN) IN FILE
    var_name_1       123-123
    var_name_2       124-125
    var_name_3       126-126
    ..               ..
    var_name_N       512-513 (last positions)

I am unclear on the format of these positions. They do not look like what I would expect from absolute references in the data. For instance, 123-123 may only contain one byte??? which could change for different encodings and how you mark line endings. Frankly, the use of the word columns in the header suggests that the data *is* separated by line endings rather than absolute position and the position refers to the line number. In which case, you can use splitlines() to break up the data and then address the proper line by index. Nevertheless, you can use file.seek() to move to an absolute offset in the file, if that really is what you are looking for.
Re: Reading by positions plain text files
On 30/11/2010 21:31, javivd wrote:

Hi all,

Sorry, newbie question:

I have a database in a plain text file (could be .txt or .dat, it's the same) that I need to read in Python in order to do some data validation. In other cases I read this kind of file with the split() method, reading line by line. But split() relies on a separator character (I think... all I know is that it works OK).

I have a case now in which another file has been provided (besides the database) that tells me in which column of the file every variable is, because there isn't any blank or tab character that separates the variables; they are stuck together. This second file specifies the variable name and its position:

    VARIABLE NAME    POSITION (COLUMN) IN FILE
    var_name_1       123-123
    var_name_2       124-125
    var_name_3       126-126
    ..               ..
    var_name_N       512-513 (last positions)

How can I read this so that each position in the file is associated with each variable name?

It sounds like a similar problem to this:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/53e6f41bfff6/123422d510187dc3?show_docid=123422d510187dc3
Re: Reading by positions plain text files
On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:
On 2010-11-30, javivd javiervan...@gmail.com wrote:

I have a case now in which another file has been provided (besides the database) that tells me in which column of the file every variable is, because there isn't any blank or tab character that separates the variables; they are stuck together. This second file specifies the variable name and its position:

    VARIABLE NAME    POSITION (COLUMN) IN FILE
    var_name_1       123-123
    var_name_2       124-125
    var_name_3       126-126
    ..               ..
    var_name_N       512-513 (last positions)

I am unclear on the format of these positions. They do not look like what I would expect from absolute references in the data. For instance, 123-123 may only contain one byte??? which could change for different encodings and how you mark line endings. Frankly, the use of the word columns in the header suggests that the data *is* separated by line endings rather than absolute position and the position refers to the line number. In which case, you can use splitlines() to break up the data and then address the proper line by index. Nevertheless, you can use file.seek() to move to an absolute offset in the file, if that really is what you are looking for.

I work in a survey research firm. The data I'm talking about has a lot of 0-1 variables, meaning yes or no to a lot of questions, so only one position of a character is needed (not byte), explaining the 123-123 kind of positions of a lot of variables.

And no, MRAB, it's not the similar problem (at least from what I understood of it). I have to associate the position this file gives me with the variable name this file gives me for those positions.

Thank you both, and sorry for my English!

J
Re: Reading by positions plain text files
On 01/12/2010 02:03, javivd wrote:
On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:
On 2010-11-30, javivd javiervan...@gmail.com wrote:

I have a case now in which another file has been provided (besides the database) that tells me in which column of the file every variable is, because there isn't any blank or tab character that separates the variables; they are stuck together. This second file specifies the variable name and its position:

    VARIABLE NAME    POSITION (COLUMN) IN FILE
    var_name_1       123-123
    var_name_2       124-125
    var_name_3       126-126
    ..               ..
    var_name_N       512-513 (last positions)

I am unclear on the format of these positions. They do not look like what I would expect from absolute references in the data. For instance, 123-123 may only contain one byte??? which could change for different encodings and how you mark line endings. Frankly, the use of the word columns in the header suggests that the data *is* separated by line endings rather than absolute position and the position refers to the line number. In which case, you can use splitlines() to break up the data and then address the proper line by index. Nevertheless, you can use file.seek() to move to an absolute offset in the file, if that really is what you are looking for.

I work in a survey research firm. The data I'm talking about has a lot of 0-1 variables, meaning yes or no to a lot of questions, so only one position of a character is needed (not byte), explaining the 123-123 kind of positions of a lot of variables.

And no, MRAB, it's not the similar problem (at least from what I understood of it). I have to associate the position this file gives me with the variable name this file gives me for those positions.

Thank you both, and sorry for my English!

You just have to parse the second file to build a list (or dict) containing the name, start position and end position of each variable:

    variables = [('var_name_1', 123, 123), ...]

and then work through that list, extracting the data between those positions in the first file and putting the values in another list (or dict).

You also need to check whether the positions are 1-based or 0-based (Python uses 0-based).
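The two steps described above might look like the following Python 3 sketch. The function names, the in-memory definition text and the made-up record are assumptions for illustration; whether the real positions are 1-based has to be checked against the actual data, as the reply says.

```python
# Definition text in the layout shown in the thread (only the first three variables).
DEFINITIONS = """\
VARIABLE NAME POSITION (COLUMN) IN FILE
var_name_1 123-123
var_name_2 124-125
var_name_3 126-126
"""

def parse_definitions(text):
    """Build [(name, start, end), ...] from the definition lines, skipping the header."""
    variables = []
    for line in text.splitlines()[1:]:
        parts = line.split()
        if len(parts) != 2 or "-" not in parts[1]:
            continue  # not a 'name start-end' line
        start, end = (int(number) for number in parts[1].split("-"))
        variables.append((parts[0], start, end))
    return variables

def extract(record, variables, one_based=True):
    """Return {name: text} for one record; positions are inclusive."""
    shift = 1 if one_based else 0
    return {name: record[start - shift:end - shift + 1]
            for name, start, end in variables}

# Made-up record: filler up to column 122, then the three variables.
record = "." * 122 + "1" + "07" + "0" + "." * 10

variables = parse_definitions(DEFINITIONS)
values = extract(record, variables)
print(values)
```

Calling extract(record, variables, one_based=False) on the same record shifts every field by one character, which is a quick way to spot a wrong assumption about the numbering.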
Re: Reading by positions plain text files
On 11/30/2010 08:03 PM, javivd wrote:
On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:

    VARIABLE NAME    POSITION (COLUMN) IN FILE
    var_name_1       123-123
    var_name_2       124-125
    var_name_3       126-126
    ..               ..
    var_name_N       512-513 (last positions)

And no, MRAB, it's not the similar problem (at least from what I understood of it). I have to associate the position this file gives me with the variable name this file gives me for those positions.

MRAB may be referring to my reply in that thread where you can do something like

    OFFSETS = 'offsets.txt'
    offsets = {}
    f = file(OFFSETS)
    f.next() # throw away the headers
    for row in f:
        varname, rest = row.split()[:2]
        # sanity check
        if varname in offsets:
            print "[%s] in %s twice?!" % (varname, OFFSETS)
        if '-' not in rest: continue
        start, stop = map(int, rest.split('-'))
        offsets[varname] = slice(start, stop+1) # 0-based offsets
        #offsets[varname] = slice(start-1, stop) # 1-based offsets
    f.close()

    def do_something_with(data):
        # your real code goes here
        print data['var_name_2']

    for row in file('data.txt'):
        data = dict((name, row[offsets[name]]) for name in offsets)
        do_something_with(data)

There are additional robustness checks I'd include if your offsets file isn't controlled by you (people send me daft data).

-tkc
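The post does not say which robustness checks are meant. As one possible reading, the Python 3 sketch below reports reversed ranges, ranges that run past the end of a record, and overlapping fields; the function name check_offsets and the sample offsets are hypothetical.

```python
def check_offsets(offsets, record_length):
    """Return a list of problems found in {name: (start, stop)} (0-based, inclusive)."""
    problems = []
    previous_name, previous_stop = None, -1
    # Walk the fields in column order so overlaps with earlier fields are visible.
    for name, (start, stop) in sorted(offsets.items(), key=lambda item: item[1]):
        if start > stop:
            problems.append(f"{name}: start {start} is after stop {stop}")
        if stop >= record_length:
            problems.append(f"{name}: stop {stop} is beyond record length {record_length}")
        if start <= previous_stop:
            problems.append(f"{name}: overlaps {previous_name}")
        if stop > previous_stop:
            previous_name, previous_stop = name, stop
    return problems

# Made-up definitions containing one problem of each kind.
offsets = {"a": (0, 2), "b": (2, 4), "c": (9, 8), "d": (10, 20)}
problems = check_offsets(offsets, record_length=12)
for problem in problems:
    print(problem)
```

Returning the problems as a list, instead of printing and continuing, lets the caller decide whether a bad definition file should stop the run.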
Re: Reading by positions plain text files
On 2010-12-01, javivd javiervan...@gmail.com wrote:
On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:
On 2010-11-30, javivd javiervan...@gmail.com wrote:

I have a case now in which another file has been provided (besides the database) that tells me in which column of the file every variable is, because there isn't any blank or tab character that separates the variables; they are stuck together. This second file specifies the variable name and its position:

    VARIABLE NAME    POSITION (COLUMN) IN FILE
    var_name_1       123-123
    var_name_2       124-125
    var_name_3       126-126
    ..               ..
    var_name_N       512-513 (last positions)

I am unclear on the format of these positions. They do not look like what I would expect from absolute references in the data. For instance, 123-123 may only contain one byte??? which could change for different encodings and how you mark line endings. Frankly, the use of the word columns in the header suggests that the data *is* separated by line endings rather than absolute position and the position refers to the line number. In which case, you can use splitlines() to break up the data and then address the proper line by index. Nevertheless, you can use file.seek() to move to an absolute offset in the file, if that really is what you are looking for.

I work in a survey research firm. The data I'm talking about has a lot of 0-1 variables, meaning yes or no to a lot of questions, so only one position of a character is needed (not byte), explaining the 123-123 kind of positions of a lot of variables.

Then file.seek() is what you are looking for; but, you need to be aware of line endings and encodings as indicated. Make sure that you open the file using whatever encoding was used when it was generated or you could have problems with multibyte characters affecting the offsets.
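The warning about encodings and line endings can be made concrete with a small Python 3 sketch. The record below is made up, and UTF-8 is only an assumed encoding; the thread does not say how the survey file was written.

```python
import io

# Made-up record: the word "nandu" with two accented letters (written as escapes),
# followed by the digits "01".
text = "\u00f1and\u00fa01\n"
raw = text.encode("utf-8")

# Character index 5 is the digit "0" ...
char_at_5 = text[5]
# ... but byte offset 5 falls inside the two-byte encoding of the second accented letter.
byte_at_5 = raw[5:6]

# Decoding with the right encoding before slicing restores character positions.
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8") as stream:
    decoded_char = stream.readline()[5]

# A Windows line ending adds one extra byte per line to every later file offset.
unix_len = len("0123456789\n".encode("ascii"))
windows_len = len("0123456789\r\n".encode("ascii"))

print(repr(char_at_5), byte_at_5, repr(decoded_char), unix_len, windows_len)
```

Reading decoded lines and slicing them sidesteps both effects, since indexes into a str count characters and each line is handled on its own.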