Mirage Web Studio wrote:

> Try reading the file in chunks instead:
> 
> import hashlib
> 
> CHUNKSIZE = 2**20
> hash = hashlib.md5()
> with open(filename, "rb") as f:
>     while True:
>         chunk = f.read(CHUNKSIZE)
>         if not chunk:
>             break
>         hash.update(chunk)
> hashvalue = hash.hexdigest()
> 
> 
> Thank you Peter for the above valuable reply. But shouldn't read() by
> itself work, since I have enough memory to load the file, or is it a bug?

I think you are right. At the very least you should get a MemoryError (the 
well-behaved way for the Python interpreter to say that it cannot allocate 
enough memory), whereas your description hints at a segmentation fault.

A quick test with the Python versions I have lying around:

$ python -c 'open("bigfile", "rb").read()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
MemoryError
$ python3.3 -c 'open("bigfile", "rb").read()'
Segmentation fault
$ python3.3 -V                               
Python 3.3.2+
$ python3.4 -c 'open("bigfile", "rb").read()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
MemoryError

So the bug occurs in 3.3 at least up to 3.3.2.

If you don't have the latest bugfix release, Python 3.3.4, you can try to 
install that, or, if you are not tied to 3.3, update to 3.4.1.

Note that you may still run out of memory, particularly if you are using a 
32-bit version of Python.

Also, it is never a good idea to load a lot of data into memory when you 
intend to use it only once. Therefore I recommend that you calculate the 
checksum in chunks, the way shown in the example above.

PS: There was an email in my inbox where eryksun suggests potential 
improvements to my code:

> You might see better performance if you preallocate a bytearray and
> `readinto` it. On Windows, you might see even better performance if
> you map sections of the file using mmap; the map `length` needs to be
> a multiple of ALLOCATIONGRANULARITY (except the residual) to set the
> `offset` for a sliding window.
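
Rough, untested sketches of both ideas (the function names and the 
buffer/window sizes below are my own choices, not eryksun's):

import hashlib
import mmap
import os

CHUNKSIZE = 2**20  # 1 MiB buffer, reused for every read

def md5_readinto(filename):
    # Fill a preallocated bytearray in place instead of creating a new
    # bytes object for every chunk.
    md5 = hashlib.md5()
    buffer = bytearray(CHUNKSIZE)
    view = memoryview(buffer)
    with open(filename, "rb") as f:
        while True:
            n = f.readinto(buffer)
            if not n:
                break
            md5.update(view[:n])
    return md5.hexdigest()

def md5_mmap(filename, window=16 * mmap.ALLOCATIONGRANULARITY):
    # Map the file in a sliding window; the window length is a multiple
    # of ALLOCATIONGRANULARITY so it is also a valid offset for the next
    # mapping (only the final chunk may be shorter).
    md5 = hashlib.md5()
    size = os.path.getsize(filename)
    with open(filename, "rb") as f:
        offset = 0
        while offset < size:
            length = min(window, size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as m:
                md5.update(m)
            offset += length
    return md5.hexdigest()

The readinto() version mainly avoids allocating a new bytes object per 
chunk; the mmap version keeps every offset aligned by making the window a 
multiple of the allocation granularity.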

While I don't expect significant improvements, since the problem is "I/O-
bound", i.e. the speed limit is imposed by communication with the hard disk 
rather than by the Python interpreter, you may still find it instructive to 
compare the various approaches.
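
If you want to compare them, a crude timing helper like the one below is 
enough (it assumes the sketch functions from above):

import time

def report(func, filename):
    # Wall-clock timing; the first read of a file is usually slower
    # because it is not yet in the OS cache, so run each variant more
    # than once before drawing conclusions.
    start = time.perf_counter()
    digest = func(filename)
    duration = time.perf_counter() - start
    print("{:<15} {:8.2f}s  {}".format(func.__name__, duration, digest))

for func in [md5_readinto, md5_mmap]:
    report(func, "bigfile")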

Another candidate, when you are working in an environment where the md5sum 
utility is available, is to delegate the work to the "specialist":

import subprocess

hashvalue = subprocess.Popen(
    ["md5sum", filename],
    stdout=subprocess.PIPE).communicate()[0].split()[0].decode()
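
On Python 2.7/3.1 and later the same idea can be written a bit more 
compactly with check_output(), sketched here as a small helper:

import subprocess

def md5_external(filename):
    # check_output() waits for md5sum and raises CalledProcessError on a
    # non-zero exit status instead of failing later with an IndexError
    # on the empty output.
    output = subprocess.check_output(["md5sum", filename])
    return output.split()[0].decode()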
