On Tue, 2011-07-05 at 22:54 -0700, Phlip wrote:
> Pythonistas
>
> Consider this hashing code:
>
>     import hashlib
>     file = open(path)
>     m = hashlib.md5()
>     m.update(file.read())
>     digest = m.hexdigest()
>     file.close()
>
> If the file were huge, the file.read() would allocate a big string and
> thrash memory. (Yes, in 2011 that's still a problem, because these
> files could be movies and whatnot.)
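For reference, a chunked rewrite of the quoted snippet might look like
this (a minimal sketch: `path` is assumed to name the file, and the
64 KiB chunk size is an arbitrary choice, not a recommendation):

    import hashlib

    def md5_file(path, chunk_size=65536):
        """Hash a file incrementally so memory use stays bounded."""
        m = hashlib.md5()
        with open(path, 'rb') as f:
            # iter() with a sentinel calls f.read(chunk_size) repeatedly
            # until it returns b'' at end of file.
            for chunk in iter(lambda: f.read(chunk_size), b''):
                m.update(chunk)
        return m.hexdigest()

Opening in binary mode matters here: md5.update() wants bytes, and text
mode would also corrupt the digest on platforms that translate newlines.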
Yes, the simple rule is: do not *ever* file.read() a whole file at
once. No matter what the year, this will never be OK. Always chunk
reading a file into reasonable I/O blocks. For example, I use this
function to copy a stream and return a SHA512 digest along with the
stream's size:

    def write(self, in_handle, out_handle):
        m = hashlib.sha512()
        data = in_handle.read(4096)
        while data:
            m.update(data)
            out_handle.write(data)
            data = in_handle.read(4096)
        out_handle.flush()
        return (m.hexdigest(), in_handle.tell())

> Does hashlib have a file-ready mode, to hide the streaming inside some
> clever DMA operations?

Chunk it to something close to the block size of your underlying
filesystem.

-- 
http://mail.python.org/mailman/listinfo/python-list