-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> The following regexp strips most of the Microsoft "XML" crap, e.g.
> <![if !supportEmptyParas]> :
>
>     s/<\![^>]*>//g;

Very nice. I've modified your regex a bit and extended it; below is
some more code to play with (based on ideas from other people).

There's also Wp2Html[1], which is supposed to do quite a good job of
converting MS-HTML (and WordPerfect) back to "normal" HTML. I haven't
tried it, so if someone could give it a go and let me know, I can add
that to the FAQ as well.

Some other tools to look at are HTML Tidy[2], demoroniser[3], wv[4],
and WordFilter[5]. Each has its own niche. I prefer the Perl solution,
of course.

Another option, which grabs the actual data directly out of a
Microsoft Word document, is this small snippet:

    use strict;     # of course!
    use Win32::OLE; # will only install on Win32 systems

    my $word = Win32::OLE->new('Word.Application')
        or die "Cannot start Word: ", Win32::OLE->LastError;
    my $doc  = $word->Documents->Open('C:\file.doc');

    # Your data is in $text (the document body is the Content range)
    my $text = $doc->Content->Text;

    $doc->Close;
    $word->Quit;

- ----

use strict;
use warnings;
use HTML::TreeBuilder;

# Select the core attributes to ignore
my @ignore_attr = qw(
    bgcolor background color face style link alink vlink text
    onblur onchange onclick ondblclick onfocus onkeydown onkeyup
    onload onmousedown onmousemove onmouseout onmouseover onmouseup
    onreset onselect onunload class xmlns:w xmlns:o xmlns
);

# tags to ignore (the tag is dropped, its content is kept)
my @ignore_tags = qw(font big small body dir html div span);

# tags to drop with content
my @ignore_elements = qw(script style head o:p);

sub un_mshtml {
    my $input = shift;
    my $warn  = 0;
    my $htmlex;

    my $h = HTML::TreeBuilder->new;
    $h->ignore_unknown(0);
    $h->warn($warn);
    $h->parse($input);
    $h->eof;    # tell the parser the document is complete

    # Drop all unwanted tags
    foreach my $tag (@ignore_tags) {
        # <html> is the root element; it is stripped from the
        # output string further down instead
        if ( lc($tag) eq 'html' ) { $htmlex = 1; next; }
        while ( my $ok = $h->look_down( '_tag', $tag ) ) {
            $ok->replace_with_content;
        }
    }

    # Drop all unwanted elements (tags w/content)
    foreach my $tag (@ignore_elements) {
        while ( my $ok = $h->look_down( '_tag', $tag ) ) {
            $ok->detach;
        }
    }

    # Drop all unwanted attributes
    foreach my $attr (@ignore_attr) {
        while ( my $ok =
            $h->look_down( sub { defined( $_[0]->attr($attr) ) } ) )
        {
            $ok->attr( $attr, undef );
        }
    }

    # Drop unwanted script code <![....]>
    # (only plain-text children are tested, not child elements)
    foreach my $ok (
        $h->look_down(
            sub {
                grep { !ref($_) && /^<\s*!\[.+?\]\s*>$/ }
                  $_[0]->content_list;
            }
        )
      )
    {
        $ok->detach_content;
    }

    # params = entities to encode, indent, optional endtags
    my $output = $h->as_HTML( undef, " ", {} );
    $h = $h->delete();

    if ($htmlex) {
        $output =~ s:^\s*<html>::m;
        $output =~ s:</html>\s*$::m;
    }

    return $output;
}

[1] http://www.res.bbsrc.ac.uk/wp2html/
[2] http://www.w3.org/People/Raggett/tidy/
[3] http://www.perl.com/language/misc/demoroniser
[4] http://www.wvware.com
[5] http://office.microsoft.com/downloads/2000/Msohtmf2.aspx

d.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE9233GkRQERnB1rkoRAnOpAJ0YBLLWfdDrCF+sqVwU2MJHbeh/LQCeIDdE
jhohbaeAERgf46wtZbP7jFI=
=M77X
-----END PGP SIGNATURE-----

_______________________________________________
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev