David Rasmussen wrote:
> Steven D'Aprano wrote:
>
>> On Fri, 28 Oct 2005 06:22:11 -0700, [EMAIL PROTECTED] wrote:
>>
>>> Which is quite fast. The only problems is that the file might be huge.
>>
>> What *you* call huge and what *Python* calls huge may be very different
>> indeed. What are you calling huge?
>
> I'm not saying that it is too big for Python. I am saying that it is too
> big for the systems it is going to run on. These files can be 22 MB or 5
> GB or ..., depending on the situation. It might not be okay to run a
> tool that claims that much memory, even if it is available.
If your files can reach multiple gigabytes, you will definitely need an
algorithm that avoids reading the entire file into memory at once.

[snip]

> print file("filename", "rb").count("\x00\x00\x01\x00")
>
> (or something like that)
>
> instead of the original
>
> print file("filename", "rb").read().count("\x00\x00\x01\x00")
>
> it would be exactly what I am after.

I think I can say, without risk of contradiction, that there is no
built-in method to do that.

> What is the conceptual difference?
> The first solution should be at least as fast as the second. I have to
> read and compare the characters anyway. I just don't need to store them
> in a string. In essence, I should be able to use the "count occurences"
> functionality on more things, such as a file, or even better, a file
> read through a buffer with a size specified by me.

Of course, if you feel like coding the algorithm and submitting it to be
included in the next release of Python... :-)

I can't help feeling that a generator with a buffer is the way to go, but
I just can't *quite* deal with the case where the pattern overlaps the
boundary... it is very annoying.

But not half as annoying as it must be to you :-)

However, there may be a simpler solution *fingers crossed* -- you are
searching for a sub-string "\x00\x00\x01\x00", which is hex 0x100. Surely
you don't want any old substring of "\x00\x00\x01\x00", but only the ones
which align on word boundaries?

So "ABCD\x00\x00\x01\x00" would match (in hex, it is 0x41424344 0x100),
but "AB\x00\x00\x01\x00CD" should not, because that is 0x41420000
0x1004344 in hex.

If that is the case, your problem is simpler: you don't have to worry
about the pattern crossing a boundary, so long as your buffer is a
multiple of four bytes.

-- 
Steven.
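
For what it's worth, here is a rough sketch of both ideas from the message above: a buffered count that copes with the pattern straddling a chunk boundary, and the simpler word-aligned count. It is only one possible approach, not a built-in and not anything from the thread itself. The names `count_in_stream`, `count_aligned_words`, `PATTERN` and `chunk_size` are made up for illustration, and it assumes Python 3 bytes objects rather than the `file()` call quoted above. The first function is meant to give the same answer as `read().count(pattern)`, i.e. non-overlapping matches, while holding only about one chunk in memory.

```python
import io

PATTERN = b"\x00\x00\x01\x00"  # the four-byte pattern from the thread


def count_in_stream(stream, pattern, chunk_size=1 << 20):
    """Count non-overlapping matches, like bytes.count(), one chunk at a time."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    count = 0
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        pos = 0
        while True:
            hit = buf.find(pattern, pos)
            if hit < 0:
                break
            count += 1
            pos = hit + len(pattern)  # resume after the match, as count() does
        # Keep only the tail that could still begin a match crossing into the
        # next chunk: at most len(pattern) - 1 bytes, and never bytes that
        # were already consumed by a counted match.
        keep_from = max(pos, len(buf) - (len(pattern) - 1))
        buf = buf[keep_from:]
    return count


def count_aligned_words(stream, word, chunk_size=1 << 20):
    """Count matches that start at offsets which are multiples of len(word)."""
    size = len(word)
    if size == 0:
        raise ValueError("word must not be empty")
    count = 0
    leftover = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = leftover + chunk
        usable = len(buf) - len(buf) % size  # whole words only
        for i in range(0, usable, size):
            if buf[i:i + size] == word:
                count += 1
        leftover = buf[usable:]  # partial word, finished by the next chunk
    return count


if __name__ == "__main__":
    # Tiny chunk size on purpose, so matches straddle chunk boundaries.
    data = b"ABCD" + PATTERN + b"AB" + PATTERN + b"CD"
    print(count_in_stream(io.BytesIO(data), PATTERN, chunk_size=3))
    print(count_aligned_words(io.BytesIO(data), PATTERN, chunk_size=3))
```

For a real file you would pass something like `open("filename", "rb")` as the stream. The aligned version carries any partial word over to the next read, so it does not strictly need the buffer to be a multiple of four bytes, although choosing one keeps the leftover empty. The per-word loop is slow in pure Python and is only there to show the idea.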