Tony Cappellini wrote:

I was trying to see if I could speed up processing huge files (in the
10's of Gigabytes)
by passing various values to the readline() method of the file object.

No matter what I passed to readline(), each call was slower than
passing no argument at all.

I've used a SATA bus analyzer to see what kind of block sizes Windows
was using for the reads, and typically for non-AHCI reads, the block
sizes were surprisingly small: 8 blocks was typical.

It's somewhat difficult to separate out the I/O traffic that Windows
does routinely, even when you're not reading a specific file, so
putting the file on a secondary drive will eliminate most of that
background traffic.

Can anyone explain why passing any positive value to readline() makes
the file processing slower instead of faster?

Critical question: what version of Python? Print out sys.version for us.

Have you actually measured to be sure the bulk of the time is not spent in processing the lines? And if you have, have you also measured the time spent in a dummy loop of simple read() calls, to make sure it isn't mostly the disk time?
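If it helps, here's a rough harness for that measurement (a sketch only, with hypothetical function names; it assumes a file opened in binary mode and uses nothing beyond the standard time module):

```python
import time

def time_raw_reads(path, blocksize=64 * 1024):
    """Dummy loop of plain read() calls: approximates pure disk time."""
    start = time.time()
    with open(path, "rb") as f:
        while f.read(blocksize):
            pass
    return time.time() - start

def time_line_reads(path):
    """Line-by-line loop: the difference from the raw loop above is
    the overhead of finding line boundaries."""
    start = time.time()
    with open(path, "rb") as f:
        while f.readline():
            pass
    return time.time() - start
```

If time_raw_reads() alone accounts for most of the wall-clock time, the bottleneck is the disk, not readline().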

I'll assume you've somehow narrowed it down to readline() itself. If you haven't, this is mostly a waste of time. (e.g. if half the time is spent waiting for the disk, you're a candidate for multi-threading, and get_line() specifically allows threads for overlapping I/O.)
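To make the multi-threading point concrete, here's one possible shape for overlapping I/O with processing: a background thread reads lines into a queue while the main thread consumes them. This is a sketch, not a benchmark; whether it wins depends on how much disk waiting there actually is.

```python
import queue
import threading

def threaded_lines(path, maxsize=1000):
    """Yield lines of `path`, with the actual reading done in a
    background thread so disk I/O overlaps per-line processing."""
    q = queue.Queue(maxsize=maxsize)

    def reader():
        with open(path, "rb") as f:
            for line in f:
                q.put(line)
        q.put(None)  # sentinel: end of file

    threading.Thread(target=reader, daemon=True).start()
    while True:
        line = q.get()
        if line is None:
            return
        yield line
```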

file.readline is apparently written in C; you could look up the sources to it. I could make a wild guess and say that any tight loop that's checking two termination conditions instead of one would be slower. If it were my code, I might read into a buffer, append a newline to the buffer, and search the buffer for successive newlines. If the user gives me a count, it would slow down that inner loop.
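In Python terms, a variant of that buffer-and-search idea might look like this (a sketch assuming binary mode and \n line endings; it carries a partial last line over to the next chunk rather than appending a sentinel newline):

```python
def iter_lines(f, bufsize=8192):
    """Yield lines from file object f by reading fixed-size chunks
    and splitting on newlines, carrying partial lines forward."""
    leftover = b""
    while True:
        chunk = f.read(bufsize)
        if not chunk:
            if leftover:          # file didn't end with a newline
                yield leftover
            return
        chunk = leftover + chunk
        lines = chunk.split(b"\n")
        leftover = lines.pop()    # partial line (or empty bytes)
        for line in lines:
            yield line + b"\n"
```

The inner loop has only one termination condition per chunk, which is the point of the argument above: adding a per-line byte count would put a second check inside it.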

But I'm guessing your performance is being hampered by the newline logic. For example, specifying "U" on the open call will slow things down. What are you doing to make that quicker? Are the files ASCII? Are they in Unix or Windows mode (the "b" flag on open)? Are you interpreting the file as UTF-8? Are you on Python 2.x or 3.x?

For example, if you're opening a Windows-style text file, with CRLF at the end of each line, you might try timing readline() with the "b" flag on, even though that'll give you \r\n instead of \n at the end of each line. Depending on how you're parsing the line, that might not slow down your other logic at all. So if it speeds up readline(), it might be worth it.
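A quick way to run that comparison (a sketch with a hypothetical helper name, using only the standard timeit module) is to time the same line loop once per mode:

```python
import timeit

def time_line_loop(path, mode, repeat=3):
    """Best-of-N wall-clock time for iterating every line of `path`
    opened in the given mode ("rb" vs "r")."""
    def run():
        with open(path, mode) as f:
            for line in f:
                pass
    return min(timeit.repeat(run, number=1, repeat=repeat))
```

Call it as time_line_loop(path, "rb") and time_line_loop(path, "r") and compare; just remember the "rb" variant hands you lines ending in \r\n.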

Have you investigated using readlines(), with a specified buffer size? Obviously you can't (on Win32 anyway) do a single call with the default argument. But that call is mentioned (in fileinput.py) as "a significant speedup." And elsewhere I think the docs imply that "for line in myfile" would be fastest.
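For a file in the tens of gigabytes, the readlines() approach has to be chunked with a size hint so only a bounded batch of lines is in memory at once. A sketch (the function name is mine, not from any library):

```python
def count_lines_chunked(path, sizehint=1 << 20):
    """Walk a huge file with readlines(sizehint), holding only about
    `sizehint` bytes worth of lines in memory per batch."""
    count = 0
    with open(path, "rb") as f:
        while True:
            lines = f.readlines(sizehint)
            if not lines:         # empty list means end of file
                break
            count += len(lines)   # stand-in for real per-line work
    return count
```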

Looking at the source:
in Objects\fileobject.c, the function file_readline() seems to be the one we want; it delegates to get_line(), which has a very simple loop if you're not in universal-newline mode.

Another interesting tidbit in that file:
/* A larger buffer size may actually decrease performance. */
#define READAHEAD_BUFSIZE 8192





_______________________________________________
python-win32 mailing list
python-win32@python.org
http://mail.python.org/mailman/listinfo/python-win32
