Re: Remove all HTML tags

Peter Scott Sat, 21 May 2011 09:09:11 -0700

On Wed, 18 May 2011 09:13:20 -0400, Shawn H Corey wrote:
> On 11-05-18 09:02 AM, Mike Blezien wrote:
>> Is there a Perl module available, or a regex method, that will parse an
>> HTML formatted file then remove ALL the HTML elements so you end up
>> with just the text content of the file?
> 
> HTML::TreeBuilder loads HTML::Element which has a method as_text().  Use
> HTML::Element::look_down() to find the body, then use as_text().
> 
> http://search.cpan.org/~jfearn/HTML-Tree-4.2/lib/HTML/TreeBuilder.pm
> http://search.cpan.org/~jfearn/HTML-Tree-4.2/lib/HTML/Element.pm


That's the answer I would give. I would just add, for the OP, that what
you think the text content of a page ought to be may not match what
as_text() returns. With the markup stripped, text from adjacent
elements runs together, and for the majority of pages the result is a
useless mess. Usually more complex parsing is called for, based on
specific knowledge of the page. If all you want the text content for is
further machine processing such as checksums, concordances, or
indexing, though, then this is fine.
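
For reference, here is a minimal sketch of the approach quoted above. It
assumes HTML::TreeBuilder is installed from CPAN. The sample markup and
the variable names are made up for illustration and are not from the
OP's file.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page, invented for this example.
my $html = <<'HTML';
<html><head><title>Demo</title></head>
<body>
<h1>Heading</h1>
<p>First paragraph.</p>
<ul><li>one</li><li>two</li></ul>
</body></html>
HTML

# Parse the markup into a tree of HTML::Element objects.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Find the <body> element and flatten everything under it to plain text.
my $body = $tree->look_down(_tag => 'body');
my $text = defined $body ? $body->as_text : '';

print "$text\n";

# Release the tree explicitly once the text has been extracted.
$tree->delete;
```

On a page like this, as_text() typically joins the heading, the
paragraph, and the list items with no separator between them, which is
the run-together effect described above.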

-- 
Peter Scott
http://www.perlmedic.com/     http://www.perldebugged.com/
http://www.informit.com/store/product.aspx?isbn=0137001274
http://www.oreillyschool.com/courses/perl3/
