On Fri, 19 Mar 2010 14:18:17 -0300, djc <slais-...@ucl.ac.uk> wrote:
Ben Finney wrote:

What happens, then, when you make a smaller program that deals with only
one file?

What happens when you make a smaller program that only reads the file,
and doesn't write any? Or a different program that only writes a file,
and doesn't read any?

It's these sorts of reductions that will help narrow down exactly what
the problem is. Do make sure that each example is also complete (i.e.
can be run as-is by someone who uses only that code with no additions).



The program reads one csv file of 9,293,271 lines:
869M wb.csv
It creates a set of files containing the same lines, where each output
file in the set contains only those lines in which the value of a
particular column is the same; the number of output files depends on
the number of distinct values in that column. In this example that
results in 19 files.
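In outline the program does something like this (a simplified sketch,
not the actual code; the real column index and output file names
differ):

import csv  # not needed in this sketch, but the real program is csv-based

def split_by_column(filename, col):
    # Simplified outline: one output file per distinct value in column `col`.
    parts = {}  # column value -> open output file
    try:
        with open(filename, 'rU') as tabfile:
            for line in tabfile:
                key = line.rstrip('\n').split('\t')[col]
                if key not in parts:
                    outfile = 'part_%s.csv' % key  # placeholder naming
                    parts[key] = open(outfile, 'wt')
                parts[key].write(line)
    finally:
        for out_part in parts.values():
            out_part.close()

Keeping the output handles open in a dict avoids reopening a file for
every input line.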

changing
with open(filename, 'rU') as tabfile:
to
with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as tabfile:

and
with open(outfile, 'wt') as out_part:
to
with codecs.open(outfile, 'w', 'utf-8') as out_part:

causes a program that runs in 43 seconds to take 4 minutes to process
the same data. In this particular case that is not very important: any
unicode strings in the data are not worth troubling over, and I have
already spent more time satisfying curiosity than will ever be required
to process the dataset in future. But I have another project in hand
where not only is the unicode significant but the files are very much
larger. Scale up the problem and the difference between 4 hours and 24
becomes a matter worth some attention.

Ok. Your test program is too large to see what's going on. First, try to determine *which* part is slow:

- reading: measure the time it takes only to read the file, with open()
  and codecs.open(). The density of non-ASCII characters and their code
  points may matter (utf-8 is much more efficient for ASCII data than,
  say, Hanzi).
- processing: measure the time the processing part takes, fed with str
  vs. unicode data.
- writing: measure the time it takes only to write a file, with open()
  and codecs.open().
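Something like this throwaway harness isolates the read path (a sketch
only; it assumes your 'wb.csv' and does nothing but iterate over the
lines):

import codecs
import time

def time_read(opener, *args):
    # Time a pass that only iterates over the lines and discards them.
    start = time.time()
    with opener(*args) as tabfile:
        for line in tabfile:
            pass
    return time.time() - start

print('plain open:  %.1fs' % time_read(open, 'wb.csv', 'rU'))
print('codecs.open: %.1fs' % time_read(
    codecs.open, 'wb.csv', 'rU', 'utf-8', 'backslashreplace'))

The same pattern, with a loop that writes a fixed line many times,
isolates the write path; whatever remains of the total run time is the
processing cost.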

Only then one can focus on optimizing the bottleneck.

--
Gabriel Genellina
