On Thu, 08 Dec 2005 02:09:58 -0500, Mike Meyer <[EMAIL PROTECTED]> wrote:
>[EMAIL PROTECTED] writes:
>> I have a file which is very large eg over 200Mb , and i am going to use
>> python to code a "tail"
>> command to get the last few lines of the file. What is a good algorithm
>> for this type of task in python for very big files?
>> Initially, i thought of reading everything into an array from the file
>> and just get the last few elements (lines) but since it's a very big
>> file, don't think is efficient.
>
>Well, 200mb isn't all that big these days. But it's easy to code:
>
># untested code
>input = open(filename)
>tail = input.readlines()[:tailcount]
>input.close()
>
>and you're done. However, it will go through a lot of memory. Fastest
>is probably working through it backwards, but that may take multiple
>tries to get everything you want:
>
># untested code
>input = open(filename)
>blocksize = tailcount * expected_line_length
>tail = []
>while len(tail) < tailcount:
>    input.seek(-blocksize, EOF)
>    tail = input.read().split('\n')
>    blocksize *= 2
>input.close()
>tail = tail[:tailcount]
>
>It would probably be more efficient to read blocks backwards and paste
>them together, but I'm not going to get into that.
>
Ok, I'll have a go (only tested slightly ;-)

 >>> def frsplit(fname, nitems=10, splitter='\n', chunk=8192):
 ...     f = open(fname, 'rb')
 ...     f.seek(0, 2)
 ...     bufpos = f.tell() # pos from file beg == file length
 ...     buf = ['']
 ...     for nl in xrange(nitems):
 ...         while len(buf)<2:
 ...             chunk = min(chunk, bufpos)
 ...             bufpos = bufpos-chunk
 ...             f.seek(bufpos)
 ...             buf = (f.read(chunk)+buf[0]).split(splitter)
 ...             if buf== ['']: break
 ...             if bufpos==0: break
 ...         if len(buf)>1: yield buf.pop(); continue
 ...         if bufpos==0: yield buf.pop(); break
 ...

20 lines from the tail of november's python-dev archive

 >>> print '\n'.join(reversed(list(frsplit(r'v:\temp\clp\2005-November.txt', 20))))
 lives in the mimelib project's hidden CVS on SF, but that seems pretty silly.
 Basically I'm just going to add the test script, setup.py, generated html
 docs and a few additional unit tests, along with svn:external refs to pull
 in Lib/email from the appropriate Python svn tree. This way, I'll be able
 to create standalone email packages from the sandbox (which I need to do
 because I plan on fixing a few outstanding email bugs).

 -Barry

 -------------- next part --------------
 A non-text attachment was scrubbed...
 Name: not available
 Type: application/pgp-signature
 Size: 307 bytes
 Desc: This is a digitally signed message part
 Url : http://mail.python.org/pipermail/python-dev/attachments/20051130/e88db51d/attachment.pgp

Might want to throw away the first item returned by frsplit, unless it is !=''
(indicating a last line with no \n). Splitting with os.linesep is a problematical
default, since e.g. it wouldn't work with the above archive, since it has unix
endings, and I didn't download it in a manner that would convert it.

Regards,
Bengt Richter
--
http://mail.python.org/mailman/listinfo/python-list
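
A note on the quoted "untested code": readlines()[:tailcount] takes the
first tailcount lines rather than the last ones (that would be
[-tailcount:]), and EOF is not a defined name (the whence value for
"relative to end of file" is 2, or os.SEEK_END). For comparison, here is a
minimal sketch of the "read blocks backwards and paste them together" idea.
It assumes Python 3, which is not what the session above ran on, and it
returns the lines oldest first instead of yielding them in reverse. The
names tail_lines, count and blocksize are made up for illustration and
appear nowhere in the thread.

```python
import os

def tail_lines(path, count=10, blocksize=8192):
    """Return the last `count` lines of `path` as bytes objects, oldest first.

    Hypothetical helper for illustration only (Python 3).
    """
    if count <= 0:
        return []
    with open(path, 'rb') as f:       # binary mode, so seek offsets are plain byte counts
        f.seek(0, os.SEEK_END)
        pos = f.tell()                # offset of the earliest byte read so far
        data = b''
        # Read blocks backwards until the buffer holds more than `count`
        # newlines (so the first line we keep is complete) or until the
        # start of the file is reached.
        while pos > 0 and data.count(b'\n') <= count:
            step = min(blocksize, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    # bytes.splitlines() copes with \n, \r\n and \r endings and does not
    # produce a trailing empty item when the file ends with a newline.
    return data.splitlines()[-count:]
```

Usage would be along the lines of tail_lines('2005-November.txt', 20).
Because the blocks are joined before splitting, a line ending that
straddles a block boundary is not a problem, and splitting on bytes
sidesteps the os.linesep issue mentioned above. Re-counting newlines and
re-joining the buffer on every pass is wasteful for very large tails, so
this is a starting point, not a tuned implementation.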