Hi Paul,

On 28/11/15 13:11, Paul Rubin wrote:
> Rob Hills <rhi...@medimorphosis.com.au> writes:
>> Note, in the beginning of this project, I looked at using "Beautiful
>> Soup" but my reading and limited testing led me to believe that it is
>> designed for well-formed HTML/XML and therefore was unsuitable for the
>> text/html soup I have.  If that belief is incorrect, I'd be grateful for
>> general tips about using Beautiful Soup in this scenario...
> Beautiful Soup can deal with badly formed HTML pretty well, or at least
> it could in earlier versions.  It gives you several different parsing
> options to choose from now.  I think the default is lxml which is fast
> but maybe more strict.  Check what the others are and see if a loose
> slow one is still there.  It really is pretty slow so plan on a big
> computation task if you're converting a large forum.

I've had another look at Beautiful Soup and while it doesn't really help
me much with URLs (relative or absolute) embedded within text, it seems
to do a good job of separating out links from the rest, so that could be
useful in itself.
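
A minimal sketch of that kind of separation, for illustration only. It
assumes the standard library's html.parser rather than Beautiful Soup, so
it runs without extra packages. The names LinkSplitter, split_links and
BARE_URL are hypothetical, and the URL pattern is deliberately crude:

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern for bare URLs in running text; real forum data
# will likely need something more careful.
BARE_URL = re.compile(r"https?://[^\s<>\"']+")


class LinkSplitter(HTMLParser):
    """Collect <a href> targets separately from the surrounding text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []          # href values taken from <a> tags
        self.text_parts = []     # text found outside <a> tags
        self._in_anchor = False  # True between <a> and </a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Caveat: an <a> that is never closed hides all text after it
        # until the next </a> turns up.
        if not self._in_anchor:
            self.text_parts.append(data)


def split_links(markup):
    """Return (anchor hrefs, bare URLs found in the text, remaining text)."""
    parser = LinkSplitter()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.text_parts)
    return parser.links, BARE_URL.findall(text), text
```

Calling split_links() on a post body gives the anchor targets, the bare
URLs picked out of the text, and the leftover text as three separate
values, so relative and absolute links can then be rewritten
independently of the surrounding prose.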

WRT time, I'm converting about 65MB of data, which currently takes 14
seconds (on a 3-year-old laptop with an SSD running Ubuntu). I reckon
that is pretty amazing performance for Python 3, especially given my
relatively crude coding skills.  It'll be interesting to see if using
Beautiful Soup adds significantly to that.

> phpBB gets a bad rap that's maybe well-deserved but I don't know what to
> suggest instead.

I did start to investigate Python-based alternatives; I've not heard
much good said about PHP, but I probably move in the wrong circles.
However, our hosting service doesn't support Python so I stopped
hunting.  Plus there is a significant group of forum members who hold
very strong opinions about the functionality they want and it took a lot
of work to get them to agree on something!

All that said, I'd be interested to see specific (and hopefully
unbiased) info about phpBB's failings...


Rob Hills
Waikiki, Western Australia

