Hi Paul,

On 28/11/15 13:11, Paul Rubin wrote:
> Rob Hills <rhi...@medimorphosis.com.au> writes:
>> Note, in the beginning of this project, I looked at using "Beautiful
>> Soup" but my reading and limited testing lead me to believe that it is
>> designed for well-formed HTML/XML and therefore was unsuitable for the
>> text/html soup I have. If that belief is incorrect, I'd be grateful for
>> general tips about using Beautiful Soup in this scenario...
> Beautiful Soup can deal with badly formed HTML pretty well, or at least
> it could in earlier versions. It gives you several different parsing
> options to choose from now. I think the default is lxml which is fast
> but maybe more strict. Check what the others are and see if a loose
> slow one is still there. It really is pretty slow so plan on a big
> computation task if you're converting a large forum.

I've had another look at Beautiful Soup. It doesn't really help me much
with URLs (relative or absolute) embedded within text, but it seems to
do a good job of separating out links from the rest, so that could be
useful in itself.

WRT time, I'm converting about 65MB of data, which currently takes 14
seconds (on a three-year-old laptop with an SSD running Ubuntu). I
reckon that is pretty amazing performance for Python3, especially given
my relatively crude coding skills. It'll be interesting to see if using
Beautiful Soup adds significantly to that.

> phpBB gets a bad rap that's maybe well-deserved but I don't know what to
> suggest instead.

I did start to investigate Python-based alternatives; I've not heard
much good said about PHP, but I probably move in the wrong circles.
However, our hosting service doesn't support Python, so I stopped
hunting. Plus there is a significant group of forum members who hold
very strong opinions about the functionality they want, and it took a
lot of work to get them to agree on something!

All that said, I'd be interested to see specific (and hopefully
unbiased) info about phpBB's failings...

Cheers,

-- 
Rob Hills
Waikiki, Western Australia
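
P.S. In case it helps anyone following along, here is a rough sketch of the "separate the links from the rest" idea, with relative URLs resolved against a base. To keep it runnable without installing anything, it uses the standard library's html.parser (the same module Beautiful Soup's "html.parser" option is built on) rather than Beautiful Soup itself. The names LinkSplitter and split_links, the sample markup and the base URL are all made up for illustration; this is not the conversion script discussed above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkSplitter(HTMLParser):
    """Collect href targets and visible text from possibly malformed HTML."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url   # hypothetical forum base URL
        self.links = []            # absolute URLs, in document order
        self.text_parts = []       # non-empty text fragments

    def handle_starttag(self, tag, attrs):
        # Only anchors with an href are treated as links.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # urljoin leaves absolute URLs alone and resolves relative ones.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Unclosed tags and stray end tags do not stop text from arriving here.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def split_links(html, base_url):
    """Return (list_of_absolute_links, plain_text) for one post body."""
    parser = LinkSplitter(base_url)
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return parser.links, " ".join(parser.text_parts)


if __name__ == "__main__":
    # Deliberately sloppy markup: unclosed <a> and <p>, stray </b>.
    sample = ('<p>See <a href="/viewtopic.php?t=1">this thread'
              '<p>and <a href="http://example.com/x">that</b> one')
    links, text = split_links(sample, "http://forum.example.com/old/")
    print(links)
    print(text)
```

Note that this only finds URLs inside href attributes. Bare URLs typed into the post text would still need separate handling, which is the part Beautiful Soup didn't help me with either.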