Kjetil Kjernsmo requested a front end to HTML::StripScripts that, instead of returning HTML text, would return a LibXML Document or DocumentFragment (i.e. a DOM tree).

I have released this as HTML::StripScripts::LibXML:

  http://search.cpan.org/~drtech/HTML-StripScripts-LibXML-0.10/LibXML.pm

It handles messy HTML, strips out XSS, and gives you fine-grained control of the HTML/XML nodes that are returned.

If you are interested in this, please give it a try, and give me some feedback about how to improve it, options to add etc.

The main question mark I have is what to do with encoding - suggestions welcome. Also see my question at Perl Monks:

  http://www.perlmonks.org/index.pl?node_id=624334

Thanks

Clint

On Tue, 2007-06-26 at 16:34 +0200, Kjetil Kjernsmo wrote:
> On Tuesday 26 June 2007 16:22, Clinton Gormley wrote:
> > - used to strip XSS scripting from user submitted HTML
>
> Ooooh, cool! I haven't found any modules that does that well enough.
>
> > - outputs valid HTML (cleans up nesting, context of tags etc)
> >
> > - handles the exploits listed at http://ha.ckers.org/xss.html
>
> Great!
>
> > I hope this helps others, and if anybody has any suggestions, please
> > feed them back to me
>
> Actually, something I would feel would be very useful is if it could
> return an XML::LibXML::DocumentFragment object.
>
> I tend to use XML::LibXML to parse user input and insert in the
> document, which is then going through some XSLT, and since you've
> allready parsed stuff, it seems like a waste to parse again.
>
> So that's my feature request! :-)
>
> Cheers,
>
> Kjetil