"Diez B. Roggisch" <[EMAIL PROTECTED]> writes:

> And if only the html-parsing is slow, you might consider creating an
> extension for that. Using e.g. Pyrex.
I just tried using BeautifulSoup to pull some fields out of some html files--about 2 million files, output of a web crawler. It parsed very nicely, at about 5 files per second.

Of course, Python being Python, I wanted to run the program a whole lot of times, modifying it based on what I found from previous runs, and at 5/sec each run was going to take about 4 days. (OK, I probably could have spread it across 5 or so computers and gotten it under 1 day, at the cost of more effort to write the parallelizing code and to scare up extra machines.)

By simply treating the html as a big string and using string.find to locate the fields I wanted, I got it up to about 800 files/second, which made each run about 1/2 hour.

Simplest of all would be if Python just ran about 100x faster than it does, a speedup which is not outlandish to hope for.

-- 
http://mail.python.org/mailman/listinfo/python-list
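For what it's worth, the string.find approach the post describes can be sketched roughly as below. The marker strings and the sample html are hypothetical stand-ins (the post doesn't say which fields were extracted); the point is just that two find() calls per field avoid building a parse tree entirely, which is where the speedup comes from.

```python
def extract_field(html, start_marker, end_marker):
    """Return the text between start_marker and end_marker, or None.

    This only works when the surrounding markup is regular enough
    that fixed marker strings reliably bracket the field -- a fair
    assumption for crawler output from a single site, but fragile
    in general (which is why one would normally reach for a parser).
    """
    i = html.find(start_marker)
    if i == -1:
        return None
    i += len(start_marker)
    j = html.find(end_marker, i)
    if j == -1:
        return None
    return html[i:j]


# Hypothetical example: pull a price out of a product page.
page = '<div><span class="price">$19.99</span></div>'
price = extract_field(page, '<span class="price">', '</span>')
```

The obvious trade-off: unlike BeautifulSoup, this breaks silently if the site changes its markup, so it fits best for one-off batch runs over a fixed crawl like the one described.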