Hi,

For my sins I am migrating a volunteer association forum from one platform (WebWiz) to another (phpBB). I am (I hope) 95% of the way through the process.
Posts to our original forum comprise a soup of plain text, HTML and BBCodes. A post *may* include links done as either standard HTML links ( <a href=... ), BBCode links ( [url]http://... [/url] ) or sometimes just text ( http://blah.blah.com.au or even just www.blah.blah.com.au ).

In my conversion process, I am trying to identify cross-links (links from one post on the forum to another) so I can convert them to links that will work in the new forum. My current code uses a Regular Expression (yes, I read the recent posts on this forum about regex and HTML!) to pull out "absolute" links ( starting with http:// ) and then I use Python to identify and convert the specific links I am interested in. However, the forum also contains "cross-links" done using relative links and I'm unsure how best to proceed with that one. Googling so far has not been helpful, but that might be me using the wrong search terms.

Some examples of what I am talking about are:

Post fragment containing an "Absolute" cross-link:

    <br />ive made a new thread: <br />http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958 <br />

converts to:

    <br />ive made a new thread: <br />/viewtopic.php?t=316&p=1958#1958 <br />

Post fragment containing a "Relative" cross-link:

    <font size="3"><u>Battery Management System</u></font><br /><a href="/forum_posts.asp?TID=980&PID=15479#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />

Needs converting to:

    <font size="3"><u>Battery Management System</u></font><br /><a href="/viewtopic.php?p=15479&t=980#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />

So, my question is: What is the best way to extract a list of "relative links" from mixed text/html that I can then walk through to identify the specific ones I want to convert?
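To make the target concrete, here is a minimal sketch of the kind of thing I have in mind. It is not my current code: it matches the forum_posts.asp query string itself and treats the host part as optional, so the same pattern would cover both the absolute and the relative forms. The names CROSS_LINK, find_cross_links and convert_cross_links are made up for illustration, and I am assuming that TID always comes before PID and that phpBB does not care about the order of t= and p=.

```python
import re

# Hypothetical pattern: an optional absolute prefix for the old forum,
# followed by the forum_posts.asp query string shown in the examples above.
CROSS_LINK = re.compile(
    r"""(?<![\w.-])                      # do not start in the middle of another host name
        (?:(?:https?://)?(?:www\.)?aeva\.asn\.au/forums)?   # optional absolute prefix
        /forum_posts\.asp\?
        TID=(?P<tid>\d+)                 # topic id
        (?:(?P<sep>&(?:amp;)?)PID=(?P<pid>\d+))?   # optional post id, "&" or "&amp;"
        (?:\#(?P<anchor>\d+))?           # optional fragment
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_cross_links(post_text):
    """Return every old-style cross-link found in the post, as written."""
    return [match.group(0) for match in CROSS_LINK.finditer(post_text)]


def convert_cross_links(post_text):
    """Rewrite old-style cross-links as relative viewtopic.php links."""

    def _rewrite(match):
        new_link = "/viewtopic.php?t=" + match.group("tid")
        if match.group("pid"):
            # Reuse the original separator so "&amp;" stays escaped.
            new_link += match.group("sep") + "p=" + match.group("pid")
        if match.group("anchor"):
            new_link += "#" + match.group("anchor")
        return new_link

    return CROSS_LINK.sub(_rewrite, post_text)
```

Because the rest of the post is left untouched, this sidesteps the question of whether the surrounding markup is well-formed, but it only finds links that look exactly like the examples above. A link on another site that happens to end in /forum_posts.asp is skipped by the look-behind at the start of the pattern.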
Note: at the beginning of this project I looked at using "Beautiful Soup", but my reading and limited testing led me to believe that it is designed for well-formed HTML/XML and was therefore unsuitable for the text/HTML soup I have. If that belief is incorrect, I'd be grateful for general tips about using Beautiful Soup in this scenario...

TIA,

-- 
Rob Hills
Waikiki, Western Australia