Yi Xing wrote:
> Hi,
>
> I need to read specific lines of huge text files. Each time, I know
> exactly which line(s) I want to read. readlines() or readline() in a
> loop is just too slow. Since different lines have different size, I
> cannot use seek(). So I am thinking of building an index for the file
> for fast access. Can anybody give me some tips on how to do this in
> Python? Thanks.
>
> Yi

I had to do this for some large log files. I wrote one simple script to generate the index file and another that uses the index file to read lines from the log file.

Here are the two scripts, slightly cleaned up for clarity. (Note that they'll only work with files less than 4,294,967,296 bytes long. If your files are larger than that, substitute 'Q' for 'L' in the struct formats.)

First, genoffsets.py:

#!/usr/bin/env python
'''
Write the byte offset of each line.
'''
import fileinput
import struct
import sys

def f(n): return struct.pack('L', n)

def main():
    total = 0

    # Main processing.
    for n, line in enumerate(fileinput.input()):

        sys.stdout.write(f(total))
        total += len(line)

        # Status output.
        if not n % 1000:
            print >> sys.stderr, '%i lines processed' % n

    print >> sys.stderr, '%i lines processed' % (n + 1)

if __name__ == '__main__':
    main()

You use it (on linux) like so:

cat large_file | ./genoffsets.py > index.dat

And here's the getline.py script:

#!/usr/bin/env python
'''
Usage: "getline.py <datafile> <indexfile> <num>"

Prints line num from datafile using indexfile.
Line numbers start at 0.
'''
import struct
import sys

fmt = 'L'
fmt_size = struct.calcsize(fmt)

def F(n, fn):
    '''
    Return the byte offset of line n from index file fn.
    '''
    f = open(fn)

    try:
        f.seek(n * fmt_size)
        data = f.read(fmt_size)
    finally:
        f.close()

    return struct.unpack(fmt, data)[0]

def getline(n, data_file, index_file):
    '''
    Return line n from data file using index file.
    '''
    n = F(n, index_file)
    f = open(data_file)

    try:
        f.seek(n)
        data = f.readline()
    finally:
        f.close()

    return data

if __name__ == '__main__':
    dfn, ifn, lineno = sys.argv[-3:]
    n = int(lineno)
    print getline(n, dfn, ifn)

Hope this helps,
~Simon

-- 
http://mail.python.org/mailman/listinfo/python-list
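
For anyone reading this on Python 3, the same idea (an index file of fixed-width byte offsets, one per line, then seek() into the data file) might be sketched roughly as below. This is not Simon's code: the names build_index and get_line are made up for illustration, and two choices are assumptions on my part. It uses the explicit format '<Q' (little-endian, 8 bytes) because a bare 'L' is the platform's native unsigned long, whose size varies between systems, and it opens both files in binary mode so that offsets count bytes rather than decoded characters.

```python
import struct

# Assumption: a fixed 8-byte little-endian record per line, so the index
# layout does not depend on the platform's native "unsigned long" size.
FMT = '<Q'
FMT_SIZE = struct.calcsize(FMT)


def build_index(data_path, index_path):
    """Write the byte offset of every line in data_path to index_path.

    Returns the number of lines indexed. (Hypothetical helper.)
    """
    offset = 0
    count = 0
    # Binary mode: len(line) is then a byte count, which is what seek() needs.
    with open(data_path, 'rb') as data, open(index_path, 'wb') as index:
        for line in data:
            index.write(struct.pack(FMT, offset))
            offset += len(line)
            count += 1
    return count


def get_line(n, data_path, index_path):
    """Return line n (counting from 0) of data_path as bytes.

    Raises IndexError if the index has no record for line n.
    (Hypothetical helper.)
    """
    if n < 0:
        raise IndexError('line number must be >= 0')
    # Record n of the index lives at byte n * FMT_SIZE.
    with open(index_path, 'rb') as index:
        index.seek(n * FMT_SIZE)
        record = index.read(FMT_SIZE)
    if len(record) != FMT_SIZE:
        raise IndexError('line %d is not in the index' % n)
    (offset,) = struct.unpack(FMT, record)
    # Jump straight to the start of the line and read just that line.
    with open(data_path, 'rb') as data:
        data.seek(offset)
        return data.readline()
```

Usage would be something like build_index('large_file', 'index.dat') once, then get_line(12345, 'large_file', 'index.dat') for each lookup. The result is bytes, trailing newline included; decode it with whatever encoding the file actually uses.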