On Apr 10, 2009, at 3:22 PM, Josh Rosenbaum wrote:

> Lee.M wrote:
>> On Apr 10, 2009, at 10:58 AM, Lee.M wrote:
>>> Along w/ the problem of unbalancing tags there is also the white
>>> space issue (e.g. you want 100 characters you could have 'a' .
>>> $ninety_five_spaces . 'b' . $tons_of_text. and the truncated
>>> verbiage is essentially 'a b')
>>>
>>> length of character entities (e.g. &lt; == 1 character not 4)
>>>
>>> Fortunately it looks like someone has already addressed all of that
>>> for us :)
>>>
>>> http://search.cpan.org/perldoc?HTML::Truncate
>> Also, if you want to go the opposite route of just making it plain
>> text:
>> http://search.cpan.org/perldoc?HTML::Obliterate
>
> For stripping down to text, I usually prefer to use HTML::Parser and
> something like their example script:
> http://cpansearch.perl.org/src/GAAS/HTML-Parser-3.60/eg/htext

interesting, I'll have to study that

> HTML::Parser can usually handle improper HTML better at the expense
> of speed.

I think HTML::Truncate uses it under the hood

> HTML::Strip seems like a good alternative to the HTML::Obliterate
> mentioned above as well:
> http://search.cpan.org/~kilinrax/HTML-Strip-1.06/Strip.pm
>
> HTML::Strip is written in XS and says it's about 5 times quicker than
> regexp. Whether that's true or not is up to someone else to test.

I doubt that in this case. Naturally XS is "fast" and regex can be considered "slow", but Strip looks fairly convoluted: you have to create an object, set tags, call the parse method, and tell it you're done (why 'eof'? That has nothing to do with what we are doing...). In other words, 10 pounds of XS is still heavier than an ounce of regex :)

Plus it optionally decodes HTML entities (which *is* a bunch of regexes). Decoding those is really 'clean up' or 'reformatting', not 'stripping'. I dunno. If I just want 100% of all HTML gone, I'd almost bet that HTML::Obliterate would be faster than HTML::Strip. If I wanted to turn certain entities into their regular version, I'd use HTML::Entities to do it, then rip out the leftover HTML (including entities I don't want preserved).

Now that the claim has my curiosity, I went ahead and benchmarked it, sort of:

  time perl -MHTML::Obliterate -e 'for (1..1000) { print HTML::Obliterate::remove_html_from_string("<p> <em> <strong>fast brutal</strong> </em> </p> howdy") . "\n";}'

  time perl -MHTML::Strip -e 'my $hs=HTML::Strip->new;for (1..1000) { print $hs->parse("<p> <em> <strong>fast brutal</strong> </em> </p> howdy") . "\n";$hs->eof;}'

HTML::Strip results in 'fast ? brutal howdy' in my terminal
HTML::Obliterate results in 'fast brutal howdy' (i.e.
it stripped out all HTML as advertised)

Since benchmarking is a fuzzy thing, I ran both of the commands above 10 times in a row and kept the fastest time for each:

HTML::Obliterate:
  real 0m0.031s
  user 0m0.018s
  sys  0m0.008s

HTML::Strip:
  real 0m0.047s
  user 0m0.026s
  sys  0m0.010s

On a side note, the command using HTML::Strip uses approx. 1/3 MB more memory. I also noticed that as I increased the size of the HTML being parsed, the times stayed about the same relative to each other, *but* HTML::Strip's memory use grew while HTML::Obliterate's did not.

I'd say HTML::Strip needs to put some of its XS mojo to better use than making misleading claims :)

> Thanks for the HTML::Truncate suggestion. I may end up checking that
> out in the future.

np :) I just started using it, so far so good!

_______________________________________________
templates mailing list
templates@template-toolkit.org
http://mail.template-toolkit.org/mailman/listinfo/templates
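
The timing comparison in the message above can also be run in-process, which avoids measuring perl start-up and module load time along with the stripping itself. What follows is a minimal sketch, not the poster's method: it uses the core Benchmark module, assumes HTML::Obliterate and HTML::Strip are installed (each is skipped if missing), and adds a hypothetical regex_strip baseline that is not part of either module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same sample markup as the one-liners in the thread.
my $html = '<p> <em> <strong>fast brutal</strong> </em> </p> howdy';

# Hypothetical baseline (an assumption for illustration, not from either
# module): a naive regex that drops anything that looks like a tag.
sub regex_strip {
    my ($s) = @_;
    $s =~ s/<[^>]*>//g;    # remove tag-like spans
    $s =~ s/\s+/ /g;       # collapse runs of whitespace
    $s =~ s/^ | $//g;      # trim a leading/trailing space
    return $s;
}

my %tests = ( regex => sub { regex_strip($html) } );

# Add each CPAN module only if it is actually installed.
if ( eval { require HTML::Obliterate; 1 } ) {
    $tests{obliterate} =
        sub { HTML::Obliterate::remove_html_from_string($html) };
}
if ( eval { require HTML::Strip; 1 } ) {
    my $hs = HTML::Strip->new;
    # parse() followed by eof(), as in the thread's one-liner
    $tests{strip} = sub { my $t = $hs->parse($html); $hs->eof; $t };
}

# Run each candidate for at least 1 CPU second and print a rate table.
cmpthese( -1, \%tests );
```

cmpthese prints a table of iterations per second. Absolute numbers will vary by machine and module version, so a single run deserves the same caution as the `time` figures quoted above.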