Tony Cappellini wrote:

I was trying to see if I could speed up processing huge files (in the
10's of Gigabytes)
by passing various values to the readline() method of the file object.

No matter what I passed to readline(), each call was slower than
passing no argument at all.

I've used a SATA bus analyzer to see what kind of block sizes Windows
was using for the reads, and typically for non-AHCI reads, the block
sizes were surprisingly small: 8 blocks was typical.

It's somewhat difficult to separate out the I/O traffic that Windows
does routinely, even when you're not reading a specific file, so
putting the file on a secondary drive will eliminate most of that
background traffic.

Can anyone explain why passing any positive value to readline() makes
the file processing slower instead of faster?

Critical question: what version of Python? Print out sys.version for us.

Have you actually measured to be sure the bulk of the time is not spent in processing the lines? And if you have, have you also measured the time spent in a dummy loop of simple read() calls, to make sure it isn't mostly the disk time?
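If it helps, here's a rough harness for that measurement (a sketch only, with hypothetical function names; it assumes a file opened in binary mode and uses nothing beyond the standard time module):

```python
import time

def time_raw_reads(path, blocksize=64 * 1024):
    """Dummy loop of plain read() calls: approximates pure disk time."""
    start = time.time()
    with open(path, "rb") as f:
        while f.read(blocksize):
            pass
    return time.time() - start

def time_line_reads(path):
    """Line-by-line loop: the difference from the raw loop above is
    the overhead of finding line boundaries."""
    start = time.time()
    with open(path, "rb") as f:
        while f.readline():
            pass
    return time.time() - start
```

If time_raw_reads() alone accounts for most of the wall-clock time, the bottleneck is the disk, not readline().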

I'll assume you've somehow narrowed it down to readline() itself. If you haven't, this is mostly a waste of time. (e.g. if half the time is spent waiting for the disk, you're a candidate for multi-threading, and get_line() specifically allows threads for overlapping I/O.)
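To make the multi-threading point concrete, here's one possible shape for overlapping I/O with processing: a background thread reads lines into a queue while the main thread consumes them. This is a sketch, not a benchmark; whether it wins depends on how much disk waiting there actually is.

```python
import queue
import threading

def threaded_lines(path, maxsize=1000):
    """Yield lines of `path`, with the actual reading done in a
    background thread so disk I/O overlaps per-line processing."""
    q = queue.Queue(maxsize=maxsize)

    def reader():
        with open(path, "rb") as f:
            for line in f:
                q.put(line)
        q.put(None)  # sentinel: end of file

    threading.Thread(target=reader, daemon=True).start()
    while True:
        line = q.get()
        if line is None:
            return
        yield line
```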

file.readline is apparently written in C; you could look up the sources to it. I could make a wild guess and say that any tight loop that's checking two termination conditions instead of one would be slower. If it were my code, I might read into a buffer, append a newline to the buffer, and search the buffer for successive newlines. If the user gives me a count, it would slow down that inner loop.
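In Python terms, a variant of that buffer-and-search idea might look like this (a sketch assuming binary mode and \n line endings; it carries a partial last line over to the next chunk rather than appending a sentinel newline):

```python
def iter_lines(f, bufsize=8192):
    """Yield lines from file object f by reading fixed-size chunks
    and splitting on newlines, carrying partial lines forward."""
    leftover = b""
    while True:
        chunk = f.read(bufsize)
        if not chunk:
            if leftover:          # file didn't end with a newline
                yield leftover
            return
        chunk = leftover + chunk
        lines = chunk.split(b"\n")
        leftover = lines.pop()    # partial line (or empty bytes)
        for line in lines:
            yield line + b"\n"
```

The inner loop has only one termination condition per chunk, which is the point of the argument above: adding a per-line byte count would put a second check inside it.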

But I'm guessing your performance is being hampered by the newline logic. For example, specifying "U" on the open call will slow things down. What are you doing to make that quicker? Are the files ASCII? Are they in Unix or Windows mode (the "b" flag on open)? Are you interpreting the file as UTF-8? Are you on Python 2.x or 3.x?

For example, if you're opening a Windows-style text file, with CRLF at the end of each line, you might try timing readline() with the "b" flag on, even though that'll give you \r\n instead of \n at the end of each line. Depending on how you're parsing the line, that might not slow down your other logic at all. So if it speeds up readline(), it might be worth it.
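A quick way to run that comparison (a sketch with a hypothetical helper name, using only the standard timeit module) is to time the same line loop once per mode:

```python
import timeit

def time_line_loop(path, mode, repeat=3):
    """Best-of-N wall-clock time for iterating every line of `path`
    opened in the given mode ("rb" vs "r")."""
    def run():
        with open(path, mode) as f:
            for line in f:
                pass
    return min(timeit.repeat(run, number=1, repeat=repeat))
```

Call it as time_line_loop(path, "rb") and time_line_loop(path, "r") and compare; just remember the "rb" variant hands you lines ending in \r\n.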

Have you investigated using readlines(), with a specified buffer size? Obviously you can't (on Win32 anyway) do a single call with the default argument. But that call is mentioned (in fileinput.py) as "a significant speedup." And elsewhere I think the docs imply that "for line in myfile" would be fastest.
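For a file in the tens of gigabytes, the readlines() approach has to be chunked with a size hint so only a bounded batch of lines is in memory at once. A sketch (the function name is mine, not from any library):

```python
def count_lines_chunked(path, sizehint=1 << 20):
    """Walk a huge file with readlines(sizehint), holding only about
    `sizehint` bytes worth of lines in memory per batch."""
    count = 0
    with open(path, "rb") as f:
        while True:
            lines = f.readlines(sizehint)
            if not lines:         # empty list means end of file
                break
            count += len(lines)   # stand-in for real per-line work
    return count
```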

Looking at the source:
in Objects\fileobject.c, the function file_readline() seems to be the one we want; it delegates to get_line(), which has a very simple loop if you're not in universal-newline mode.

Another interesting tidbit in that file:
/* A larger buffer size may actually decrease performance. */
#define READAHEAD_BUFSIZE 8192





_______________________________________________
python-win32 mailing list
python-win32@python.org
http://mail.python.org/mailman/listinfo/python-win32
