On Tue, Jul 5, 2011 at 10:54 PM, Phlip <phlip2...@gmail.com> wrote:
> Pythonistas:
>
> Consider this hashing code:
>
> import hashlib
> file = open(path)
> m = hashlib.md5()
> m.update(file.read())
> digest = m.hexdigest()
> file.close()
>
> If the file were huge, the file.read() would allocate a big string and
> thrash memory. (Yes, in 2011 that's still a problem, because these
> files could be movies and whatnot.)
>
> So if I do the stream trick - read one byte, update one byte, in a
> loop - then I'm essentially dragging that movie thru 8 bits of a 64 bit
> CPU. So that's the same problem; it would still be slow.
>
> So now I try this:
>
> sum = os.popen('sha256sum %r' % path).read()
>
> Those of you who like to lie awake at night thinking of new ways to
> flame abusers of 'eval()' may have a good vent, there.
Indeed (*eyelid twitch*). That one-liner is arguably better written as:

sum = subprocess.check_output(['sha256sum', path])

> Does hashlib have a file-ready mode, to hide the streaming inside some
> clever DMA operations?

Barring undocumented voodoo, no, it doesn't appear to. You could always
read from the file in suitably large chunks instead (rather than
byte-by-byte, which is indeed ridiculous); see io.DEFAULT_BUFFER_SIZE
and/or the os.stat() trick referenced therein and/or the block_size
attribute of hash objects.

http://docs.python.org/library/io.html#io.DEFAULT_BUFFER_SIZE
http://docs.python.org/library/hashlib.html#hashlib.hash.block_size

Cheers,
Chris
--
http://rebertia.com
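
P.S. For illustration, here's a rough sketch of the chunked-read
approach. The 64 KiB chunk size, the sha256 choice, the helper name,
and the example path are all just assumptions for the sake of the
example, not anything hashlib prescribes:

import hashlib

def file_digest(path, chunk_size=64 * 1024):
    """Hash a file in fixed-size chunks so only one chunk is in memory at a time."""
    h = hashlib.sha256()
    # Binary mode matters: we want the raw bytes, with no newline translation.
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)  # at most chunk_size bytes held at once
            if not chunk:               # empty bytes means EOF
                break
            h.update(chunk)
    return h.hexdigest()

print(file_digest('/path/to/huge/movie.avi'))  # example path

Memory use stays flat no matter how big the file is, and the chunks are
large enough that the per-read overhead is negligible compared to going
byte-by-byte.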