"Diez B. Roggisch" <[EMAIL PROTECTED]> writes:

> And if only the html-parsing is slow, you might consider creating an
> extension for that. Using e.g. Pyrex.
I just tried using BeautifulSoup to pull some fields out of some html files--about 2 million files, output of a web crawler. It parsed very nicely, at about 5 files per second.

Of course, Python being Python, I wanted to run the program a whole lot of times, modifying it based on what I found from previous runs, and at 5/sec each run was going to take about 4 days. (OK, I probably could have spread it across 5 or so computers and gotten it under 1 day, at the cost of more effort to write the parallelizing code and to scare up extra machines.)

By simply treating the html as a big string and using string.find to locate the fields I wanted, I got it up to about 800 files/second, which made each run about 1/2 hour.

Simplest of all would be if Python just ran about 100x faster than it does, a speedup which is not outlandish to hope for.

-- 
http://mail.python.org/mailman/listinfo/python-list
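For what it's worth, the string.find approach the post describes can be sketched roughly as below. The marker strings and the sample html are hypothetical stand-ins (the post doesn't say which fields were extracted); the point is just that two find() calls per field avoid building a parse tree entirely, which is where the speedup comes from.

```python
def extract_field(html, start_marker, end_marker):
    """Return the text between start_marker and end_marker, or None.

    This only works when the surrounding markup is regular enough
    that fixed marker strings reliably bracket the field -- a fair
    assumption for crawler output from a single site, but fragile
    in general (which is why one would normally reach for a parser).
    """
    i = html.find(start_marker)
    if i == -1:
        return None
    i += len(start_marker)
    j = html.find(end_marker, i)
    if j == -1:
        return None
    return html[i:j]


# Hypothetical example: pull a price out of a product page.
page = '<div><span class="price">$19.99</span></div>'
price = extract_field(page, '<span class="price">', '</span>')
```

The obvious trade-off: unlike BeautifulSoup, this breaks silently if the site changes its markup, so it fits best for one-off batch runs over a fixed crawl like the one described.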