On Wed, 18 May 2011 09:13:20 -0400, Shawn H Corey wrote:
> On 11-05-18 09:02 AM, Mike Blezien wrote:
>> Is there a Perl module available, or a regex method, that will parse an
>> HTML formatted file then remove ALL the HTML elements so you end up
>> with just the text content of the file?
>
> HTML::TreeBuilder loads HTML::Element, which has a method as_text(). Use
> HTML::Element::look_down() to find the body, then use as_text().
>
> http://search.cpan.org/~jfearn/HTML-Tree-4.2/lib/HTML/TreeBuilder.pm
> http://search.cpan.org/~jfearn/HTML-Tree-4.2/lib/HTML/Element.pm
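For reference, a minimal sketch of the approach quoted above, assuming HTML::TreeBuilder is installed from CPAN. The sample HTML and the helper name body_text are made up for illustration. Note that as_text() skips the contents of script and style elements and simply concatenates the remaining text nodes.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input; in practice this would be read from a file.
my $html = '<html><head><title>Demo</title><style>p { color: red }</style></head>'
         . '<body><h1>Heading</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>';

# body_text is an illustrative helper, not part of HTML::Tree.
sub body_text {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);

    # In scalar context look_down() returns the first matching element.
    my $body = $tree->look_down(_tag => 'body');
    my $text = defined $body ? $body->as_text : '';

    $tree->delete;    # release the parse tree explicitly
    return $text;
}

print body_text($html), "\n";
```

The title is not part of the output because it lives under head rather than body.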
That's the answer I would give. I would just add for the OP that what you think the text content of a page ought to be may not match what this returns. Text without formatting runs together, and for the majority of pages that produces a useless mess. Usually more complex parsing is called for, based on specific knowledge of the page. If all you want the text content for is further machine processing, such as checksums, concordance, or indexing, then this is fine.

-- 
Peter Scott
http://www.perlmedic.com/
http://www.perldebugged.com/
http://www.informit.com/store/product.aspx?isbn=0137001274
http://www.oreillyschool.com/courses/perl3/
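
P.S. As a rough sketch of what slightly more page-aware parsing might look like, the following pulls the text of selected block-level elements and puts each on its own line, so headings and paragraphs no longer run together. The helper name block_text and the tag list (h1 to h6, p, li) are assumptions for illustration; a real page will need its own list, and nested matches (a p inside an li, say) would be reported twice.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# block_text is an illustrative helper, not part of HTML::Tree.
sub block_text {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);

    # In list context look_down() returns every match in document order.
    # The tag list here is an assumption; adjust it to the page at hand.
    my @blocks = $tree->look_down(_tag => qr/^(?:h[1-6]|p|li)$/);

    # as_trimmed_text() strips leading/trailing whitespace and collapses runs.
    my @lines = grep { length } map { $_->as_trimmed_text } @blocks;

    $tree->delete;
    return join "\n", @lines;
}

print block_text(
    '<body><h1>Heading</h1><p>First paragraph.</p><p>Second paragraph.</p></body>'
), "\n";
```

With plain as_text() on the same input, the three pieces come back as one unbroken string.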