Re: Find relative url in mixed text/html
Hi Paul,

On 28/11/15 13:11, Paul Rubin wrote:
> Rob Hills writes:
>> Note, in the beginning of this project, I looked at using "Beautiful
>> Soup" but my reading and limited testing led me to believe that it is
>> designed for well-formed HTML/XML and therefore was unsuitable for the
>> text/html soup I have. If that belief is incorrect, I'd be grateful for
>> general tips about using Beautiful Soup in this scenario...
> Beautiful Soup can deal with badly formed HTML pretty well, or at least
> it could in earlier versions. It gives you several different parsing
> options to choose from now. I think the default is lxml which is fast
> but maybe more strict. Check what the others are and see if a loose
> slow one is still there. It really is pretty slow so plan on a big
> computation task if you're converting a large forum.

I've had another look at Beautiful Soup and while it doesn't really help me much with URLs (relative or absolute) embedded within text, it seems to do a good job of separating out links from the rest, so that could be useful in itself.

WRT time, I'm converting about 65MB of data, which currently takes 14 seconds (on a 3-year-old laptop with an SSD running Ubuntu). I reckon that is pretty amazing performance for Python 3, especially given my relatively crude coding skills. It'll be interesting to see if using Beautiful Soup adds significantly to that.

> phpBB gets a bad rap that's maybe well-deserved but I don't know what to
> suggest instead.

I did start to investigate Python-based alternatives; I've not heard much good said about PHP, but I probably move in the wrong circles. However, our hosting service doesn't support Python so I stopped hunting. Plus there is a significant group of forum members who hold very strong opinions about the functionality they want and it took a lot of work to get them to agree on something!

All that said, I'd be interested to see specific (and hopefully unbiased) info about phpBB's failings...

Cheers,

--
Rob Hills
Waikiki, Western Australia
--
https://mail.python.org/mailman/listinfo/python-list
Re: Find relative url in mixed text/html
Hi Laura,

On 29/11/15 01:04, Laura Creighton wrote:
> In a message of Sun, 29 Nov 2015 00:25:07 +0800, Rob Hills writes:
>> All that said, I'd be interested to see specific (and hopefully
>> unbiased) info about phpBB's failings...
> People I know of who run different bb software say that the spammers
> really prefer phpBB. So keeping it spam free is about 4 times as much
> work as for, for instance, IPB.
>
> Hackers seem to like it too -- possibly due to this:
> http://defensivedepth.com/2009/03/03/anatomy-of-a-hack-the-phpbbcom-attack/
>
> make sure you aren't vulnerable.

Thanks for the link and the advice.

Personally, I'd rather go with something based on a language I am reasonably familiar with (e.g. Python or Java). However, it seems the vast bulk of forum software is based on PHP :-(

Cheers,

--
Rob Hills
Waikiki, Western Australia
Re: Find relative url in mixed text/html
In a message of Sun, 29 Nov 2015 00:25:07 +0800, Rob Hills writes:
>All that said, I'd be interested to see specific (and hopefully
>unbiased) info about phpBB's failings...

People I know of who run different bb software say that the spammers really prefer phpBB. So keeping it spam free is about 4 times as much work as for, for instance, IPB.

Hackers seem to like it too -- possibly due to this:
http://defensivedepth.com/2009/03/03/anatomy-of-a-hack-the-phpbbcom-attack/

Make sure you aren't vulnerable.
Re: Find relative url in mixed text/html
Rob Hills writes:
> Personally, I'd rather go with something based on a language I am
> reasonably familiar with (eg Python or Java) however it seems the vast
> bulk of Forum software is based on PHP :-(

It's certainly possible to write good software in PHP, so it's mostly a matter of the design and implementation quality. I was on a big phpBB forum years ago and it got very slow as the database got large, and there were multiple incidents of database corruption. The board eventually switched to VBB, which was a lot better.

VBB is the best one I know of but it's not FOSS. I'm on another one right now which uses IPB (also not FOSS) and don't like it much (too clever for its own good). Another one is FluxBB, which is nice and lightweight and FOSS, but the forum where I've seen it is small and the software might not be up to handling a bigger one.

Some people like Discourse. I don't like it much myself, but that's just me.

There's certainly plenty of cheap hosting available these days (or raw VPS) that lets you run Python or whatever else you want. But it seems to me that forum software is something of a ghetto. I do think there is some written in Python but I don't remember any specifics.
Re: Find relative url in mixed text/html
Hi Grobu,

On 28/11/15 15:07, Grobu wrote:
> Is it safe to assume that all the relative (cross) links take one of
> the following forms? :
>
> http://www.aeva.asn.au/forums/forum_posts.asp
> www.aeva.asn.au/forums/forum_posts.asp
> /forums/forum_posts.asp
> /forum_posts.asp (are you really sure about this one?)
>
> If so, and if your goal boils down to converting all instances of old
> style URLs to new style ones regardless of the context where they
> appear, why would a regex fail to meet your needs?

I'm actually not discounting anything and, as I mentioned, I've already used some regex to extract the properly-formed URLs (those starting with http://). I was fortunately able to find some example regex that I could figure out enough to tweak for my purpose. Unfortunately, my small brain hurts whenever I try to understand what a piece of regex is doing, and I don't like having bits in my code that hurt my brain.

BTW, that's not meant to be an invitation for someone to produce some regex for me. If I can't find any other way of doing it, I'll try to create my own regex and come back here if I can't get that working.

Cheers,

--
Rob Hills
Waikiki, Western Australia
Find relative url in mixed text/html
Hi,

For my sins I am migrating a volunteer association forum from one platform (WebWiz) to another (phpBB). I am (I hope) 95% of the way through the process.

Posts to our original forum comprise a soup of plain text, HTML and BBCodes. A post *may* include links done as either standard HTML links ( http://blah.blah.com.au or even just www.blah.blah.com.au ).

In my conversion process, I am trying to identify cross-links (links from one post on the forum to another) so I can convert them to links that will work in the new forum. My current code uses a Regular Expression (yes, I read the recent posts on this forum about regex and HTML!) to pull out "absolute" links ( starting with http:// ) and then I use Python to identify and convert the specific links I am interested in. However, the forum also contains "cross-links" done using relative links and I'm unsure how best to proceed with those. Googling so far has not been helpful, but that might be me using the wrong search terms.

Some examples of what I am talking about are:

Post fragment containing an "Absolute" cross-link:

    ive made a new thread: http://www.aeva.asn.au/forums/forum_posts.asp?TID=316=1958#1958

converts to:

    ive made a new thread: /viewtopic.php?t=316=1958#1958

Post fragment containing a "Relative" cross-link:

    Battery Management SystemVeroboard prototype

Needs converting to:

    Battery Management SystemVeroboard prototype

So, my question is: What is the best way to extract a list of "relative links" from mixed text/html that I can then walk through to identify the specific ones I want to convert?

Note, in the beginning of this project, I looked at using "Beautiful Soup" but my reading and limited testing led me to believe that it is designed for well-formed HTML/XML and therefore was unsuitable for the text/html soup I have. If that belief is incorrect, I'd be grateful for general tips about using Beautiful Soup in this scenario...

TIA,

--
Rob Hills
Waikiki, Western Australia
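One possible way to approach the question above, sketched with the standard library's `html.parser` (swapped in here for Beautiful Soup so that nothing extra needs installing): collect every `href` from anchor tags, keep those with no scheme and no host, then map the old topic links to the new form. The sample fragment, the topic id 980, and the helper names are hypothetical. The mapping of `TID` to `t` follows the absolute example in the post, and any other query parameters are deliberately ignored because the post does not show them clearly.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs


class HrefCollector(HTMLParser):
    """Collect every href value found in <a> tags, however messy the input."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def relative_links(post_text):
    """Return hrefs that have neither a scheme nor a host."""
    collector = HrefCollector()
    collector.feed(post_text)
    collector.close()
    result = []
    for href in collector.hrefs:
        parts = urlsplit(href)
        if not parts.scheme and not parts.netloc:
            result.append(href)
    return result


def convert_cross_link(href):
    """Map an old forum_posts.asp link to a viewtopic.php link, else None.

    Assumption: only the TID parameter is carried over (as "t").
    """
    parts = urlsplit(href)
    if not parts.path.lower().endswith("forum_posts.asp"):
        return None
    topic_ids = parse_qs(parts.query).get("TID")
    if not topic_ids:
        return None
    return "/viewtopic.php?t=" + topic_ids[0]


# Hypothetical post fragment: plain text, an unclosed <b>, and two links.
sample = ('Some text <b>bold with no close '
          '<a href="forum_posts.asp?TID=980">Battery Management System</a> '
          'and <a href="http://example.com/">elsewhere</a>')

for link in relative_links(sample):
    print(link, "->", convert_cross_link(link))
```

Note that this only sees links written as HTML anchors. Bare URLs typed into the text, or links written as BBCode, would still need a separate pass, and a bare `www.` address inside an `href` would be classed as relative by this check.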
Re: Find relative url in mixed text/html
Rob Hills writes:
> Note, in the beginning of this project, I looked at using "Beautiful
> Soup" but my reading and limited testing led me to believe that it is
> designed for well-formed HTML/XML and therefore was unsuitable for the
> text/html soup I have. If that belief is incorrect, I'd be grateful for
> general tips about using Beautiful Soup in this scenario...

Beautiful Soup can deal with badly formed HTML pretty well, or at least it could in earlier versions. It gives you several different parsing options to choose from now. I think the default is lxml which is fast but maybe more strict. Check what the others are and see if a loose slow one is still there. It really is pretty slow so plan on a big computation task if you're converting a large forum.

phpBB gets a bad rap that's maybe well-deserved but I don't know what to suggest instead.
Re: Find relative url in mixed text/html
On 28/11/15 03:35, Rob Hills wrote:
> Hi,
>
> For my sins I am migrating a volunteer association forum from one
> platform (WebWiz) to another (phpBB). I am (I hope) 95% of the way
> through the process.
>
> Posts to our original forum comprise a soup of plain text, HTML and
> BBCodes. A post *may* include links done as either standard HTML links
> ( http://blah.blah.com.au or even just www.blah.blah.com.au ).
>
> In my conversion process, I am trying to identify cross-links (links
> from one post on the forum to another) so I can convert them to links
> that will work in the new forum. My current code uses a Regular
> Expression (yes, I read the recent posts on this forum about regex and
> HTML!) to pull out "absolute" links ( starting with http:// ) and then
> I use Python to identify and convert the specific links I am interested
> in. However, the forum also contains "cross-links" done using relative
> links and I'm unsure how best to proceed with that one. Googling so far
> has not been helpful, but that might be me using the wrong search terms.
>
> Some examples of what I am talking about are:
>
> Post fragment containing an "Absolute" cross-link:
>
>     ive made a new thread: http://www.aeva.asn.au/forums/forum_posts.asp?TID=316=1958#1958
>
> converts to:
>
>     ive made a new thread: /viewtopic.php?t=316=1958#1958
>
> Post fragment containing a "Relative" cross-link:
>
>     Battery Management SystemVeroboard prototype
>
> Needs converting to:
>
>     Battery Management SystemVeroboard prototype
>
> So, my question is: What is the best way to extract a list of "relative
> links" from mixed text/html that I can then walk through to identify
> the specific ones I want to convert?
>
> Note, in the beginning of this project, I looked at using "Beautiful
> Soup" but my reading and limited testing led me to believe that it is
> designed for well-formed HTML/XML and therefore was unsuitable for the
> text/html soup I have. If that belief is incorrect, I'd be grateful for
> general tips about using Beautiful Soup in this scenario...
>
> TIA,

Hi Rob

Is it safe to assume that all the relative (cross) links take one of the following forms? :

http://www.aeva.asn.au/forums/forum_posts.asp
www.aeva.asn.au/forums/forum_posts.asp
/forums/forum_posts.asp
/forum_posts.asp (are you really sure about this one?)

If so, and if your goal boils down to converting all instances of old style URLs to new style ones regardless of the context where they appear, why would a regex fail to meet your needs?
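For what it's worth, the regex route suggested above might look roughly like the sketch below. It assumes the four URL forms listed in this message are the only ones in use, that `TID` is the first query parameter (as in the absolute example from the original post), and that only the topic id needs carrying over to `viewtopic.php?t=`. The names `OLD_LINK` and `rewrite_links` are made up for illustration. `re.VERBOSE` is used so that each piece of the pattern carries its own comment.

```python
import re

# Every prefix part is optional, so one pattern covers all four forms.
OLD_LINK = re.compile(
    r"""
    (?:https?://)?                # optional scheme
    (?:www\.aeva\.asn\.au)?       # optional host
    (?:/forums)?                  # optional /forums directory
    /?forum_posts\.asp            # old page name, with or without leading slash
    \?TID=(?P<tid>\d+)            # topic id (assumed to come first in the query)
    """,
    re.VERBOSE | re.IGNORECASE,
)


def rewrite_links(text):
    """Replace every old-style topic link in text with a new-style one.

    Anything that follows the topic id (further parameters, a #fragment)
    is left in place untouched.
    """
    return OLD_LINK.sub(lambda m: "/viewtopic.php?t=" + m.group("tid"), text)


examples = [
    "http://www.aeva.asn.au/forums/forum_posts.asp?TID=316",
    "www.aeva.asn.au/forums/forum_posts.asp?TID=316",
    "/forums/forum_posts.asp?TID=316",
    "/forum_posts.asp?TID=316",
]
for url in examples:
    print(rewrite_links("ive made a new thread: " + url))
```

Because the substitution works on raw text, it applies equally to links inside HTML attributes, BBCode tags, or plain prose. That is convenient here, but it also means there is no check on the context in which a match appears.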