RE: Non-HTML data extraction from MS-HTML

David A. Desrosiers Wed, 20 Nov 2002 04:21:29 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> The following regexp strips most of the Microsoft "XML" crap, e.g. <![if
> !supportEmptyParas]> :
>
> s/<\![^>]*>//g;

        Very nice. I've modified your regex a bit and extended it; below is
some more code to play with (based on ideas from other people).
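
        For anyone who wants to see what the quoted one-liner actually
removes, here is a minimal sketch using only core Perl. The sub name
strip_ms_markers and the sample string are made up for illustration; they
are not part of anyone's posted code. Note that the pattern also eats
DOCTYPE declarations and ordinary comments, and a comment that contains a
">" would only be cut up to that character.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the quoted substitution to a copy of the
# input and returns the result.
sub strip_ms_markers {
        my ($html) = @_;
        # Remove anything that starts with "<!" up to the next ">"
        $html =~ s/<\![^>]*>//g;
        return $html;
}

# Made-up sample of Word-style conditional markup
my $sample = '<p><![if !supportEmptyParas]>&nbsp;<![endif]>Hello</p>';

print strip_ms_markers($sample), "\n";  # prints: <p>&nbsp;Hello</p>
```

        The text between the two markers (&nbsp; here) survives; only the
markers themselves are dropped.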

        There's also Wp2Html[1], which is supposed to do quite a good job of
converting the MS-HTML (and WordPerfect) back to "normal" HTML. I haven't
tried it, so if someone could give it a go and let me know, I can add that
to the FAQ as well.

        Some other tools to look at are HTML Tidy[2], demoroniser[3], wv[4],
and WordFilter[5]. Each has its own niche. I prefer the Perl solution, of
course.

        Another option, to grab the actual data directly out of a
Microsoft Word document, is this small snippet:

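        use strict;     # of course!
        use Win32::OLE; # will only install on Win32 systems
        # 'Quit' is called on the Word object when $word goes out of scope
        my $word        = Win32::OLE->new('Word.Application', 'Quit')
                or die "Cannot start Word: " . Win32::OLE->LastError;
        my $doc         = $word->Documents->Open('C:\file.doc');
        # The document body is a Range object; its Text property holds the data
        my $text        = $doc->Content->{Text};
        $doc->Close(0); # 0 = close without saving changes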


- ----
# Select the core attributes to ignore
my @ignore_attr = qw (bgcolor background color face style link alink vlink
                      text onblur onchange onclick ondblclick onfocus
                      onkeydown onkeyup onload onmousedown onmousemove
                      onmouseout onmouseover onmouseup onreset onselect
                      onunload class xmlns:w xmlns:o xmlns
);

# tags to ignore
my @ignore_tags = qw(font big small body dir html div span);

# tags to drop with content
my @ignore_elements = qw(script style head o:p);

sub un_mshtml {
        use HTML::TreeBuilder;

        my $input = shift;
        my $warn  = 0;
        my $htmlex;

        my $h = HTML::TreeBuilder->new;
        $h->ignore_unknown(0);
        $h->warn($warn);
        $h->parse($input);
        $h->eof;        # tell the parser the document is complete

        # Drop all unwanted tags (their content is kept)
        foreach (@ignore_tags) {
                # <html> is the tree root; it is stripped from the output below
                $htmlex = 1, next if lc($_) eq "html";
                while ( my $ok = $h->look_down( '_tag', $_ ) ) {
                        $ok->replace_with_content;
                }
        }

        # Drop all unwanted elements (tags w/content)
        foreach (@ignore_elements) {
                while ( my $ok = $h->look_down( '_tag', $_ ) ) {
                        $ok->detach;
                }
        }

        # Drop all unwanted attributes
        foreach my $attr (@ignore_attr) {
                while (my $ok = $h->look_down(
                        sub { defined($_[0]->attr($attr)) } ))
                {
                        $ok->attr($attr, undef);
                }
        }

        # Drop unwanted script code <![....]>
        # (this empties every element that has such a text node as a child)
        foreach my $ok ( $h->look_down( sub {
                grep { /^<\s*!\[.+?\]\s*>$/ } $_[0]->content_list;
        } ) )
        {
                $ok->detach_content;
        }

        # params = entities to encode, indent, optional endtags
        my $output = $h->as_HTML( undef, " ", {} );
        $h = $h->delete();
        if ($htmlex) {
                $output =~ s:^\s*<html>::m;
                $output =~ s:</html>\s*$::m;
        }
        return $output;
}


[1] http://www.res.bbsrc.ac.uk/wp2html/
[2] http://www.w3.org/People/Raggett/tidy/
[3] http://www.perl.com/language/misc/demoroniser
[4] http://www.wvware.com
[5] http://office.microsoft.com/downloads/2000/Msohtmf2.aspx

d.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE9233GkRQERnB1rkoRAnOpAJ0YBLLWfdDrCF+sqVwU2MJHbeh/LQCeIDdE
jhohbaeAERgf46wtZbP7jFI=
=M77X
-----END PGP SIGNATURE-----

_______________________________________________
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev
