On Apr 10, 2009, at 3:22 PM, Josh Rosenbaum wrote:

> Lee.M wrote:
>> On Apr 10, 2009, at 10:58 AM, Lee.M wrote:
>>> Along w/ the problem of unbalancing tags there is also the white
>>> space issue (e.g. you want 100 characters and you could have
>>> 'a' . $ninety_five_spaces . 'b' . $tons_of_text, so the truncated
>>> verbiage is essentially 'a b').
>>>
>>> length of character entities (e.g. &lt; == 1 character, not 4)
>>>
>>> Fortunately it looks like someone has already addressed all of that
>>> for us :)
>>>
>>> http://search.cpan.org/perldoc?HTML::Truncate
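
To make the whitespace and entity points above concrete, here is a rough
sketch of counting "visible" characters rather than raw ones. The name
visible_length is made up for illustration, it assumes the tags are already
gone, and the regexes are a naive stand-in for what a real module like
HTML::Truncate has to handle:

```perl
use strict;
use warnings;

# Hypothetical helper: approximate how many characters a browser would show.
sub visible_length {
    my ($text) = @_;
    $text =~ s/&#?\w+;/x/g;   # each entity renders as a single character
    $text =~ s/\s+/ /g;       # runs of whitespace collapse to one space
    return length $text;
}

my $raw = 'a' . (' ' x 95) . 'b';

print length($raw), "\n";            # 97 raw characters
print visible_length($raw), "\n";    # 3 visible characters ('a b')
print visible_length('&lt;'), "\n";  # 1, even though the source has 4
```

So a naive substr() to 100 characters can end up showing almost nothing.
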
>> Also, if you want to go the opposite route of just making it plain  
>> text:
>> http://search.cpan.org/perldoc?HTML::Obliterate
>
> For stripping down to text, I usually prefer to use HTML::Parser and  
> something like their example script:
> http://cpansearch.perl.org/src/GAAS/HTML-Parser-3.60/eg/htext

Interesting, I'll have to study that.

>
> HTML::Parser can usually handle improper HTML better at the expense  
> of speed.

I think HTML::Truncate uses it under the hood.

> HTML::Strip seems like a good alternative to the HTML::Obliterate  
> mentioned above as well:
> http://search.cpan.org/~kilinrax/HTML-Strip-1.06/Strip.pm
>
> HTML::Strip is written in XS and says it's about 5 times quicker than
> regexp. Whether that's true or not is up to someone else to test.

I doubt that in this case. Naturally XS is "fast" and regex can be
considered "slow", but Strip looks fairly convoluted: you have to
create an object, set tags, call the parse method, and tell it you're
done (why 'eof'? That has nothing to do with what we are doing...). In
other words, 10 pounds of XS is still heavier than an ounce of regex :)

Plus it optionally decodes HTML entities (which *is* a bunch of
regexes). Decoding those is really 'clean up' or 'reformatting', not
'stripping'. I dunno, if I just want 100% of all HTML gone I'd almost
bet that HTML::Obliterate would be faster than HTML::Strip. If I
wanted to turn certain entities into their regular characters I'd use
HTML::Entities to do it, then rip out the leftover HTML (including
entities I don't want preserved).
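
Roughly what I mean by "decode the ones you want, then rip out the rest",
as a core-Perl-only sketch. The %keep table and the strip_html name are
made up for illustration, and the plain regexes stand in for HTML::Entities
and a real stripper, so treat it as the idea and not as either module's
actual behaviour:

```perl
use strict;
use warnings;

# Hypothetical whitelist of entities to turn into plain characters.
my %keep = (amp => '&', quot => '"');

sub strip_html {
    my ($html) = @_;

    # 1. Decode only the whitelisted named entities, leave the rest alone.
    $html =~ s/&(\w+);/exists $keep{$1} ? $keep{$1} : "&$1;"/ge;

    # 2. Remove tags (naive: assumes no '>' inside attribute values).
    $html =~ s/<[^>]*>//g;

    # 3. Remove whatever entities are left over.
    $html =~ s/&#?\w+;//g;

    return $html;
}

print strip_html('<p>Tom &amp; Jerry &nbsp;<em>rock</em></p>'), "\n";
# prints: Tom & Jerry rock
```
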

Now that his claim has my curiosity I went ahead and benchmarked it,
sort of:

time perl -MHTML::Obliterate -e 'for (1..1000) {
  print HTML::Obliterate::remove_html_from_string("<p> <em> <strong>fast &nbsp; brutal</strong> </em> </p> howdy") . "\n";
}'

time perl -MHTML::Strip -e 'my $hs = HTML::Strip->new; for (1..1000) {
  print $hs->parse("<p> <em> <strong>fast &nbsp; brutal</strong> </em> </p> howdy") . "\n";
  $hs->eof;
}'

HTML::Strip results in 'fast ? brutal   howdy' in my terminal

HTML::Obliterate results in 'fast  brutal   howdy' (i.e. it stripped  
out all HTML as advertised)

Since benchmarking is a fuzzy thing I ran both of the commands above
10 times in a row and kept the fastest time for each:

HTML::Obliterate:
   real 0m0.031s
   user 0m0.018s
   sys  0m0.008s

HTML::Strip:
   real 0m0.047s
   user 0m0.026s
   sys  0m0.010s
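
If you'd rather not re-run things by hand, the "run it 10 times, keep the
fastest" idea can be sketched in Perl itself. The fastest_of name is made
up, and the regex inside is only a placeholder workload (swap in whichever
module call you want to compare). Note this times the loop in-process, so
unlike `time perl ...` it leaves out interpreter startup and module loading:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run $code $runs times, return the fastest wall time.
sub fastest_of {
    my ($runs, $code) = @_;
    my $best;
    for (1 .. $runs) {
        my $start = time;
        $code->();
        my $elapsed = time - $start;
        $best = $elapsed if !defined $best || $elapsed < $best;
    }
    return $best;
}

my $html = '<p> <em> <strong>fast &nbsp; brutal</strong> </em> </p> howdy';

my $best = fastest_of(10, sub {
    for (1 .. 1000) {
        # Placeholder workload: naive regex strip of tags and entities.
        (my $text = $html) =~ s/<[^>]*>|&#?\w+;//g;
    }
});

printf "fastest of 10 runs: %.6fs\n", $best;
```
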

On a side note, the command using HTML::Strip uses approx. 1/3 MB more
memory.

Also I noticed that as I increased the size of the HTML being parsed
the times stayed about the same relative to each other, *but*
HTML::Strip's memory use grew while HTML::Obliterate's did not. I'd say
HTML::Strip needs to put some of its XS mojo to better use than making
misleading claims :)

> Thanks for the HTML::Truncate suggestion. I may end up checking that  
> out in the future.

np :) I just started using it, so far so good!

_______________________________________________
templates mailing list
templates@template-toolkit.org
http://mail.template-toolkit.org/mailman/listinfo/templates
