RE: Stripping HTML from a text file.

Hanson, Rob Thu, 04 Sep 2003 17:38:18 -0700

A simple regex will do the trick...

# untested
$text = "...";
$text =~ s|<head>.*?</head>||s;


Or something more generic...

# untested
$tag = "head";
# \b stops "head" from also matching a tag such as <header>
$text =~ s|<$tag\b[^>]*>.*?</$tag>||s;

This second one also allows for possible attributes in the start tag.  You
may need more than this if the HTML isn't well-formed, if the tags are in
mixed case (add the /i modifier), or if there are extra spaces in your tags.
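
A slightly more tolerant version could look like the sketch below.  The sub
name strip_element is made up for illustration.  It assumes the element is
never nested inside itself and that no attribute value contains a literal
">", so treat it as a starting point rather than a real HTML parser.

```perl
use strict;
use warnings;

# strip_element is a hypothetical helper name, not from any module.
# Removes every <$tag ...>...</$tag> block, ignoring case and allowing
# attributes in the start tag and spaces before the closing '>'.
sub strip_element {
    my ($text, $tag) = @_;
    # \Q...\E quotes any regex metacharacters in the tag name,
    # \b keeps "head" from matching "header", /s lets . match newlines.
    $text =~ s{<\Q$tag\E\b[^>]*>.*?</\Q$tag\E\s*>}{}gis;
    return $text;
}

my $html = "<html><HEAD profile=\"x\"><title>Test</title></head >\n<body>Hi</body></html>";
print strip_element($html, "head"), "\n";
```

The /g modifier removes every matching block, not just the first, which
matters for tags like script or style that can appear more than once.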

If you want something for the command line you could do this...

(Note: for *nix, needs modification for Win [untested])
perl -e '$x=join("",<>);$x=~s|<head>.*?</head>||s;print $x' myfile.html > newfile.html
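
If the shell quoting gets awkward (on Windows, for instance), the same
substitution can live in a small script instead.  This is only a sketch: the
sub name strip_head_file and the script name striphead.pl are placeholders,
and it handles just the plain <head>...</head> case.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file, drop the <head> block,
# and write the result to a second file.
sub strip_head_file {
    my ($in, $out) = @_;
    open my $ifh, '<', $in or die "Can't read $in: $!";
    local $/;                        # slurp mode: read the file as one string
    my $text = <$ifh>;
    close $ifh;
    $text =~ s|<head>.*?</head>||s;  # /s lets . match newlines
    open my $ofh, '>', $out or die "Can't write $out: $!";
    print {$ofh} $text;
    close $ofh or die "Can't close $out: $!";
    return;
}

# Usage: perl striphead.pl myfile.html newfile.html
strip_head_file(@ARGV) if @ARGV == 2;
```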

Rob


-----Original Message-----
From: Sara [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 03, 2003 6:32 AM
To: beginperl
Subject: Stripping HTML from a text file.


I have a couple of text files with HTML code in them, e.g.

---------- Text File --------------
<html>
    <head>
        <title>This is Test File</title>
    </head>
<body>
<font size=2 face=arial>This is the test file contents<br>
<p>
blah blah blah.........
</body>
</html>

-----------------------------------------

What I want to do is remove HTML code from the text file, from one
certain tag up to another.

For example, I want to completely delete the code that comes between
<head> and </head> (including any style tags, embedded JavaScript, etc.).

Any ideas?

Thanks in advance.

Sara.

