On Fri, 19 Mar 2010 14:18:17 -0300, djc <slais-...@ucl.ac.uk> wrote:
Ben Finney wrote:
What happens, then, when you make a smaller program that deals with only
one file?
What happens when you make a smaller program that only reads the file,
and doesn't write any? Or a different program that only writes a file,
and doesn't read any?
It's these sorts of reductions that will help narrow down exactly what
the problem is. Do make sure that each example is also complete (i.e.
can be run as is by someone who uses only that code with no additions).
The program reads one csv file of 9,293,271 lines.
869M wb.csv
It creates a set of files containing the same lines, where each output
file in the set contains only those lines in which the value of a
particular column is the same. The number of output files depends on
the number of distinct values in that column; in the example that
results in 19 files.
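
The code itself is not quoted here; a minimal sketch of that kind of
one-pass splitter might look like this (the key column index, the tab
delimiter, and the part_<value>.csv output naming are invented for
illustration):

def split_by_column(filename, col, sep='\t'):
    # One output file per distinct value found in column `col`;
    # each input line is copied verbatim to the file for its key.
    outputs = {}
    try:
        with open(filename, 'rU') as tabfile:
            for line in tabfile:
                key = line.split(sep)[col].strip()
                out = outputs.get(key)
                if out is None:
                    out = outputs[key] = open('part_%s.csv' % key, 'wt')
                out.write(line)
    finally:
        for out_part in outputs.values():
            out_part.close()

split_by_column('wb.csv', 0)
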
Changing
with open(filename, 'rU') as tabfile:
to
with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as tabfile:
and
with open(outfile, 'wt') as out_part:
to
with codecs.open(outfile, 'w', 'utf-8') as out_part:
causes a program that runs in 43 seconds to take 4 minutes to process
the same data. In this particular case that is not very important: any
unicode strings in the data are not worth troubling over, and I have
already spent more time satisfying curiosity than will ever be required
to process the dataset in future. But I have another project in hand
where not only is the unicode significant but the files are very much
larger. Scale up the problem, and the difference between 4 hours and 24
becomes a matter worth some attention.
Ok. Your test program is too large to determine what's going on. Try to
determine first *which* part is slow:
- reading: measure the time it takes only to read a file, with open()
and with codecs.open() (a rough timing harness is sketched below)
The density of non-ASCII characters, and their code points, may matter
here: UTF-8 is much more efficient for ASCII data than for, say, Hanzi
(see the byte counts just after this list)
- processing: measure the time the processing part takes (fed with str
vs unicode data)
- writing: measure the time it takes only to write a file, with open() and
codecs.open()
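
As a quick illustration of the density point (the sample strings are
arbitrary): five ASCII characters encode to five UTF-8 bytes, while
five Han characters encode to fifteen, so the decoder has three times
as many bytes to chew through for the same number of characters:

>>> len(u'hello'.encode('utf-8'))
5
>>> len(u'\u4f60\u597d\u4e16\u754c\u5440'.encode('utf-8'))
15
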
Only then can one focus on optimizing the bottleneck.
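
For the reading measurement, a rough harness might look like the sketch
below. The file name and the codecs.open() arguments are taken from the
post above; io.open(), available from Python 2.6 on, is added as a third
variant worth timing, since the io module was rewritten in C for Python
2.7 and its speed varies a lot by version.

import codecs
import io
import time

def time_read(open_func, *args, **kwargs):
    # Read the whole file line by line, discard the data,
    # and return the elapsed wall-clock time in seconds.
    start = time.time()
    f = open_func(*args, **kwargs)
    try:
        for line in f:
            pass
    finally:
        f.close()
    return time.time() - start

print 'plain open:  %.1fs' % time_read(open, 'wb.csv', 'rU')
print 'codecs.open: %.1fs' % time_read(
    codecs.open, 'wb.csv', 'rU', 'utf-8', 'backslashreplace')
print 'io.open:     %.1fs' % time_read(
    io.open, 'wb.csv', 'r', encoding='utf-8',
    errors='backslashreplace')

The same pattern applies to the processing and writing stages: time
each in isolation before changing anything.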
--
Gabriel Genellina