Hi Tony,

> I was trying to see if I could speed up processing huge files (in the 10's of 
> Gigabytes) by passing various values to the readline() method of the file 
> object.

We also work with large text files (we use Python for ETL jobs in
the 100-200 GB range) and have noticed similar behavior.

We're currently running 32- and 64-bit versions of Python 2.6.1 on
Windows XP (32-bit/2 GB) and Windows 2008 (64-bit/48 GB) with SCSI and
eSATA drives. All our file access is local (vs. over a network), and the
boxes our jobs run on are dedicated machines running nothing but our
Python scripts. Our boxes are on an isolated network, and we have no
virus-checking software running in the background.

Since our boxes are maxed out with memory, we thought that supplying
large buffer sizes to open() would improve performance. Like you, we
were surprised that this strategy significantly slowed down our
processing.
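
For reference, here is roughly the comparison we ran - a minimal
sketch, with a placeholder path and an arbitrary 16 MB buffer rather
than our actual job code:

    import time

    path = 'big_input.txt'   # placeholder; our real inputs are 100+ GB

    # 1) Default buffering: let open() pick the buffer size.
    start = time.time()
    f = open(path, 'r')
    for line in f:
        pass                 # the real job parses/transforms each line
    f.close()
    print 'default buffer: %.1f s' % (time.time() - start)

    # 2) Explicit large buffer passed as open()'s third argument.
    start = time.time()
    f = open(path, 'r', 16 * 1024 * 1024)
    for line in f:
        pass
    f.close()
    print 'large buffer:   %.1f s' % (time.time() - start)

On our boxes the second version is consistently the slower one.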

We're currently opening our text files via open() and letting Python
choose the default buffer size. My experience with Python is that many
performance-enhancement techniques are not (initially) intuitive - but
my gut tells me that the behavior we are both seeing is *NOT* one of
those cases, i.e. something smells fishy.

I'm happy to run experiments on our side if anyone has suggestions.
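
For example, here is the shape of the readline() test I'd try first,
per Tony's description - the 64 KB size hint is just an example value,
not something we've benchmarked:

    path = 'big_input.txt'   # same placeholder as above

    # readline() with an explicit size hint. The hint caps the bytes
    # returned per call, so a long line can come back in pieces.
    f = open(path, 'r')
    while True:
        chunk = f.readline(64 * 1024)
        if not chunk:
            break
    f.close()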

Thanks for bringing this up.

Regards,
Malcolm