Re: Find relative url in mixed text/html

2015-11-28 Thread Rob Hills
Hi Paul,

On 28/11/15 13:11, Paul Rubin wrote:
> Rob Hills writes:
>> Note, in the beginning of this project, I looked at using "Beautiful
>> Soup" but my reading and limited testing led me to believe that it is
>> designed for well-formed HTML/XML and therefore was unsuitable for the
>> text/html soup I have.  If that belief is incorrect, I'd be grateful for
>> general tips about using Beautiful Soup in this scenario...
> Beautiful Soup can deal with badly formed HTML pretty well, or at least
> it could in earlier versions.  It gives you several different parsing
> options to choose from now.  I think the default is lxml which is fast
> but maybe more strict.  Check what the others are and see if a loose
> slow one is still there.  It really is pretty slow so plan on a big
> computation task if you're converting a large forum.

I've had another look at Beautiful Soup and while it doesn't really help
me much with URLs (relative or absolute) embedded within text, it seems
to do a good job of separating out links from the rest, so that could be
useful in itself.
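
As a rough illustration of separating links from the surrounding text,
here is a minimal sketch that uses Python's standard-library html.parser
(the module behind Beautiful Soup's "html.parser" option) rather than
Beautiful Soup itself.  The names LinkCollector and extract_links and
the sample post are made up for this example; real posts will probably
need more handling.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lower-cased, whatever case the post used.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(post_text):
    """Return the href of every anchor tag in a mixed text/HTML post."""
    collector = LinkCollector()
    collector.feed(post_text)
    collector.close()
    return collector.links


# Hypothetical post fragment: plain text, a BBCode tag and an unclosed <b>.
sample = (
    'ive made a new thread: [b]here[/b] <b>see '
    '<a href="forum_posts.asp?TID=316">Veroboard prototype</a> and '
    '<A HREF="http://www.aeva.asn.au/forums/forum_posts.asp?TID=42">this</A>'
)
print(extract_links(sample))
```

This only finds links written as HTML anchors; URLs that appear as bare
text or inside BBCode tags pass through as ordinary text and would still
need something like a regex.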

WRT time, I'm converting about 65MB of data which currently takes 14
seconds (on a 3-year-old laptop with an SSD running Ubuntu), which I
reckon is pretty amazing performance for Python 3, especially given my
relatively crude coding skills.  It'll be interesting to see if using
Beautiful Soup adds significantly to that.

> phpBB gets a bad rap that's maybe well-deserved but I don't know what to
> suggest instead.

I did start to investigate Python-based alternatives; I've not heard
much good said about PHP, but I probably move in the wrong circles.
However, our hosting service doesn't support Python so I stopped
hunting.  Plus there is a significant group of forum members who hold
very strong opinions about the functionality they want and it took a lot
of work to get them to agree on something!

All that said, I'd be interested to see specific (and hopefully
unbiased) info about phpBB's failings...

Cheers,

-- 
Rob Hills
Waikiki, Western Australia

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Find relative url in mixed text/html

2015-11-28 Thread Rob Hills
Hi Laura,

On 29/11/15 01:04, Laura Creighton wrote:
> In a message of Sun, 29 Nov 2015 00:25:07 +0800, Rob Hills writes:
>> All that said, I'd be interested to see specific (and hopefully
>> unbiased) info about phpBB's failings...
> People I know of who run different bb software say that the spammers
> really prefer phpBB.  So keeping it spam-free is about 4 times as much
> work as for, for instance, IPB.
>
> Hackers seem to like it too -- possibly due to this:
> http://defensivedepth.com/2009/03/03/anatomy-of-a-hack-the-phpbbcom-attack/
>
> Make sure you aren't vulnerable.

Thanks for the link and the advice.

Personally, I'd rather go with something based on a language I am
reasonably familiar with (e.g. Python or Java); however, it seems the
vast bulk of forum software is based on PHP :-(

Cheers,

-- 
Rob Hills
Waikiki, Western Australia



Re: Find relative url in mixed text/html

2015-11-28 Thread Laura Creighton
In a message of Sun, 29 Nov 2015 00:25:07 +0800, Rob Hills writes:
>All that said, I'd be interested to see specific (and hopefully
>unbiased) info about phpBB's failings...

People I know of who run different bb software say that the spammers
really prefer phpBB.  So keeping it spam-free is about 4 times as much
work as for, for instance, IPB.

Hackers seem to like it too -- possibly due to this:
http://defensivedepth.com/2009/03/03/anatomy-of-a-hack-the-phpbbcom-attack/

Make sure you aren't vulnerable.



Re: Find relative url in mixed text/html

2015-11-28 Thread Paul Rubin
Rob Hills writes:
> Personally, I'd rather go with something based on a language I am
> reasonably familiar with (e.g. Python or Java); however, it seems the
> vast bulk of forum software is based on PHP :-(

It's certainly possible to write good software in PHP, so it's mostly
a matter of the design and implementation quality.

I was on a big phpBB forum years ago and it got very slow as the
database got large, and there were multiple incidents of database
corruption.  The board eventually switched to VBB which was a lot
better.  VBB is the best one I know of but it's not FOSS.

I'm on another one right now which uses IPB (also not FOSS) and don't
like it much (too clever for its own good).

Another one is FluxBB which is nice and lightweight and FOSS, but it's a
small forum and the software might not be up to handling a bigger one.

Some people like Discourse.  I don't like it much myself, but that's
just me.

There's certainly plenty of cheap hosting available these days (or raw
VPS) that lets you run Python or whatever else you want.  But it seems to
me that forum software is something of a ghetto.  I do think there is
some written in Python but I don't remember any specifics.


Re: Find relative url in mixed text/html

2015-11-28 Thread Rob Hills
Hi Grobu,

On 28/11/15 15:07, Grobu wrote:
> Is it safe to assume that all the relative (cross) links take one of
> the following forms? :
>
> http://www.aeva.asn.au/forums/forum_posts.asp
> www.aeva.asn.au/forums/forum_posts.asp
> /forums/forum_posts.asp
> /forum_posts.asp (are you really sure about this one?)
>
> If so, and if your goal boils down to converting all instances of old
> style URLs to new style ones regardless of the context where they
> appear, why would a regex fail to meet your needs?

I'm actually not discounting anything and as I mentioned, I've already
used some regex to extract the properly-formed URLs (those starting with
http://).  I was fortunately able to find some example regex that I
could figure out enough to tweak for my purpose.  Unfortunately, my
small brain hurts whenever I try and understand what a piece of regex is
doing and I don't like having bits in my code that hurt my brain. 

BTW, that's not meant to be an invitation for someone to produce some
regex for me.  If I can't find any other way of doing it, I'll try to
create my own regex and come back here if I can't get that working.

Cheers,

-- 
Rob Hills
Waikiki, Western Australia



Find relative url in mixed text/html

2015-11-27 Thread Rob Hills
Hi,

For my sins I am migrating a volunteer association forum from one
platform (WebWiz) to another (phpBB).  I am (I hope) 95% of the way
through the process.

Posts to our original forum comprise a soup of plain text, HTML and
BBCodes.  A post *may* include links done as either standard HTML
links ( http://blah.blah.com.au or even just
www.blah.blah.com.au ).

In my conversion process, I am trying to identify cross-links (links
from one post on the forum to another) so I can convert them to links
that will work in the new forum.

My current code uses a Regular Expression (yes, I read the recent posts
on this forum about regex and HTML!) to pull out "absolute" links (
starting with http:// ) and then I use Python to identify and convert
the specific links I am interested in.  However, the forum also contains
"cross-links" done using relative links and I'm unsure how best to
proceed with that one.  Googling so far has not been helpful, but that
might be me using the wrong search terms. 

Some examples of what I am talking about are:

Post fragment containing an "Absolute" cross-link:

ive made a new thread:
http://www.aeva.asn.au/forums/forum_posts.asp?TID=316=1958#1958


converts to:


ive made a new thread:
/viewtopic.php?t=316=1958#1958

Post fragment containing a "Relative" cross-link:

Battery Management SystemVeroboard prototype

Needs converting to:

Battery Management SystemVeroboard prototype

So, my question is:  What is the best way to extract a list of "relative
links" from mixed text/html that I can then walk through to identify the
specific ones I want to convert?
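
Once candidate links have been pulled out by whatever means, the
identify-and-convert step might look something like the sketch below,
which leans on urllib.parse so that absolute and relative forms go
through the same code.  It is only a sketch under assumptions: the
function name convert_cross_link and the OLD_HOSTS set are made up, and
only the TID -> t mapping visible in the examples above is handled (any
other query parameters are simply dropped).

```python
from urllib.parse import urlsplit, parse_qs

# Assumptions for illustration: the host names and page name are taken
# from the examples in this thread; "" stands for a relative link.
OLD_HOSTS = {"", "www.aeva.asn.au", "aeva.asn.au"}
OLD_PAGE = "forum_posts.asp"


def convert_cross_link(url):
    """Return a phpBB-style link for an old cross-link, or None if the
    URL does not point at a topic on the old forum."""
    if url.lower().startswith("www."):
        # Bare "www." links have no scheme; add one so urlsplit sees a host.
        url = "http://" + url
    parts = urlsplit(url)
    if parts.netloc.lower() not in OLD_HOSTS:
        return None
    # Relative links may be "/forums/forum_posts.asp" or "forum_posts.asp".
    if parts.path.rsplit("/", 1)[-1].lower() != OLD_PAGE:
        return None
    topic_ids = parse_qs(parts.query).get("TID")
    if not topic_ids:
        return None
    new_url = "/viewtopic.php?t=" + topic_ids[0]
    if parts.fragment:
        # Keep the "#1958"-style anchor as it is.
        new_url += "#" + parts.fragment
    return new_url


print(convert_cross_link("http://www.aeva.asn.au/forums/forum_posts.asp?TID=316#1958"))
print(convert_cross_link("forum_posts.asp?TID=316"))
```

Returning None for anything that is not a cross-link makes it easy to
leave external links untouched while walking the list.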

Note, in the beginning of this project, I looked at using "Beautiful
Soup" but my reading and limited testing led me to believe that it is
designed for well-formed HTML/XML and therefore was unsuitable for the
text/html soup I have.  If that belief is incorrect, I'd be grateful for
general tips about using Beautiful Soup in this scenario...

TIA,

-- 
Rob Hills
Waikiki, Western Australia



Re: Find relative url in mixed text/html

2015-11-27 Thread Paul Rubin
Rob Hills writes:
> Note, in the beginning of this project, I looked at using "Beautiful
> Soup" but my reading and limited testing led me to believe that it is
> designed for well-formed HTML/XML and therefore was unsuitable for the
> text/html soup I have.  If that belief is incorrect, I'd be grateful for
> general tips about using Beautiful Soup in this scenario...

Beautiful Soup can deal with badly formed HTML pretty well, or at least
it could in earlier versions.  It gives you several different parsing
options to choose from now.  I think the default is lxml which is fast
but maybe more strict.  Check what the others are and see if a loose
slow one is still there.  It really is pretty slow so plan on a big
computation task if you're converting a large forum.

phpBB gets a bad rap that's maybe well-deserved but I don't know what to
suggest instead.


Re: Find relative url in mixed text/html

2015-11-27 Thread Grobu

Hi Rob

Is it safe to assume that all the relative (cross) links take one of the 
following forms? :


http://www.aeva.asn.au/forums/forum_posts.asp
www.aeva.asn.au/forums/forum_posts.asp
/forums/forum_posts.asp
/forum_posts.asp (are you really sure about this one?)

If so, and if your goal boils down to converting all instances of old 
style URLs to new style ones regardless of the context where they 
appear, why would a regex fail to meet your needs?
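
For what it's worth, a regex covering the four forms listed above might
be sketched like this.  It is written with re.VERBOSE so that each piece
can carry a comment.  The name OLD_LINK is made up, and treating the
query string as "everything up to whitespace, a quote or a bracket" is
an assumption about the data, not something checked against the real
posts.

```python
import re

# Host and page name come from the four forms above; the rest is a guess.
OLD_LINK = re.compile(
    r"""
    (?<![\w./])                  # not in the middle of a longer word or path
    (?:
        (?:http://)?             # optional scheme
        www\.aeva\.asn\.au       # old forum host
        /forums/                 # followed by the forums directory
      |
        /(?:forums/)?            # or root-relative, with or without /forums
    )
    forum_posts\.asp             # the old topic page
    (?:\?[^\s"'<>\]\[]*)?        # optional query string (and fragment)
    """,
    re.VERBOSE | re.IGNORECASE,
)

text = (
    "see http://www.aeva.asn.au/forums/forum_posts.asp?TID=316#1958 "
    "and www.aeva.asn.au/forums/forum_posts.asp "
    'or <a href="/forums/forum_posts.asp?TID=5">x</a> '
    "or /forum_posts.asp?TID=7 "
    "but not http://other.example/forums/forum_posts.asp?TID=1"
)
for match in OLD_LINK.finditer(text):
    print(match.group(0))
```

The lookbehind at the start is there so that the same page name on some
other host is not picked up as a relative link.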


