Have a look at preg_replace() in the manual. You could also get your users to
save as filtered HTML, which gets rid of some of it. There's also an MS tool,
"Microsoft Office HTML Filter 2", that will clean it up some more; it says it's
for Word 2000, but it works fine for Word 2002/XP.

But your best option is to use preg_replace() to swap out all the "smart tags" etc.
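
Something along these lines might do as a starting point. It's only a rough
sketch: the function name word_html_to_text() and the exact patterns are my own
assumptions, not anything from the manual, and it has only been thought through
against the sample posted below, so expect to tune it for real uploads. The
"s" modifier lets "." match newlines, which is what allows a pattern to catch
tags and blocks that span several lines, so run it on the whole file rather
than line by line.

```php
<?php
// Hypothetical helper (name and patterns are illustrative assumptions).
// Strips the Word-specific clutter first, then lets strip_tags() do the rest.
function word_html_to_text($html)
{
    $patterns = [
        '#<head\b.*?</head>#is',             // whole <head>: meta, title, styles, XML
        '#<!--.*?-->#s',                     // comments, incl. <!--[if gte mso 9]>...<![endif]-->
        '#<(style|script|xml)\b.*?</\1>#is', // any style/script/xml blocks left over
        '#</?[a-z0-9]+:[^>]*>#i',            // namespaced tags such as <o:p> or smart tags
    ];

    // Patterns are applied in order, each on the result of the previous one.
    $cleaned = preg_replace($patterns, '', $html);
    if ($cleaned === null) {
        // preg_replace() returns null on a PCRE error; fall back to the raw input.
        $cleaned = $html;
    }

    $text = strip_tags($cleaned);                  // remaining <p>, <span>, <div>, ...
    $text = str_replace('&nbsp;', ' ', $text);     // Word's empty paragraphs
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    $text = preg_replace('/[ \t]+/', ' ', $text);     // collapse runs of spaces and tabs
    $text = preg_replace('/\s*\n\s*/', "\n", $text);  // collapse blank lines

    return trim($text);
}
```

Read the uploaded file in one go (file_get_contents() or similar) and pass the
whole string in. Note that this throws away the <title> along with the rest of
the head; if you want to keep it, pull it out with preg_match() before calling
the function.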
Paul Roberts
http://www.paul-roberts.com
[EMAIL PROTECTED]
++++++++++++++++++++++++


----- Original Message ----- 
From: "DL Neil" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Saturday, August 31, 2002 12:02 AM
Subject: Re: [PHP] Ridding myself of HTML tags


> Liam,
> If you were to stristr()/remove everything up to and including the </head>
> tag, would that take care of things?
> =dn
> 
> 
> > I've got a lil problem with HTML tags. Here's the description.
> >
> > My site accepts HTML files by upload. A lot of these files are written in
> > MS Word and then saved as HTML files from that. MS Word likes to put a
> > bunch of garbage at the beginning of the file. Now, when users upload
> > their HTML files, my script goes and striptags all of the unnecessary junk
> > in there except it can't rid all this junk (HTML, XML, CSS, JavaScript) at
> > the beginning of the HTML file. Some of these tags span multiple lines,
> > and my script goes through line-by-line, so it won't identify these as
> > tags. Is there a simpler fashion? I don't need the junk about style
> > sheeting and stuff, because I have a style sheet that will take care of
> > styling the files the way they should be. I don't want the extra tags,
> > even though they're invisible to users when they web-view, because these
> > are e-mailable files (for HTML mail, it's fine; for text mail, I need to
> > strip it down and that's the problem).
> >
> > =================================================
> > Just in case, I've included the HTML code below:
> >
> >
> > <html xmlns:o="urn:schemas-microsoft-com:office:office"
> > xmlns:w="urn:schemas-microsoft-com:office:word"
> > xmlns="http://www.w3.org/TR/REC-html40">
> >
> > <head>
> > <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
> > <meta name=ProgId content=Word.Document>
> > <meta name=Generator content="Microsoft Word 10">
> > <meta name=Originator content="Microsoft Word 10">
> > <link rel=File-List href="NW100_files/filelist.xml">
> > <title>Test test test</title>
> > <!--[if gte mso 9]><xml>
> >  <o:DocumentProperties>
> >   <o:Author>Liam Gibbs</o:Author>
> >   <o:LastAuthor>Liam Gibbs</o:LastAuthor>
> >   <o:Revision>1</o:Revision>
> >   <o:TotalTime>1</o:TotalTime>
> >   <o:Created>2002-08-30T18:09:00Z</o:Created>
> >   <o:LastSaved>2002-08-30T18:10:00Z</o:LastSaved>
> >   <o:Pages>1</o:Pages>
> >   <o:Words>13</o:Words>
> >   <o:Characters>79</o:Characters>
> >   <o:Company>SXIA</o:Company>
> >   <o:Lines>1</o:Lines>
> >   <o:Paragraphs>1</o:Paragraphs>
> >   <o:CharactersWithSpaces>91</o:CharactersWithSpaces>
> >   <o:Version>10.3501</o:Version>
> >  </o:DocumentProperties>
> > </xml><![endif]--><!--[if gte mso 9]><xml>
> >  <w:WordDocument>
> >   <w:SpellingState>Clean</w:SpellingState>
> >   <w:GrammarState>Clean</w:GrammarState>
> >   <w:Compatibility>
> >    <w:BreakWrappedTables/>
> >    <w:SnapToGridInCell/>
> >    <w:WrapTextWithPunct/>
> >    <w:UseAsianBreakRules/>
> >   </w:Compatibility>
> >   <w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
> >  </w:WordDocument>
> > </xml><![endif]-->
> > <style>
> > <!--
> >  /* Style Definitions */
> >  p.MsoNormal, li.MsoNormal, div.MsoNormal
> > {mso-style-parent:"";
> > margin:0cm;
> > margin-bottom:.0001pt;
> > mso-pagination:widow-orphan;
> > font-size:12.0pt;
> > font-family:"Times New Roman";
> > mso-fareast-font-family:"Times New Roman";}
> > span.SpellE
> > {mso-style-name:"";
> > mso-spl-e:yes;}
> > @page Section1
> > {size:612.0pt 792.0pt;
> > margin:72.0pt 90.0pt 72.0pt 90.0pt;
> > mso-header-margin:35.4pt;
> > mso-footer-margin:35.4pt;
> > mso-paper-source:0;}
> > div.Section1
> > {page:Section1;}
> > -->
> > </style>
> > <!--[if gte mso 10]>
> > <style>
> >  /* Style Definitions */
> >  table.MsoNormalTable
> > {mso-style-name:"Table Normal";
> > mso-tstyle-rowband-size:0;
> > mso-tstyle-colband-size:0;
> > mso-style-noshow:yes;
> > mso-style-parent:"";
> > mso-padding-alt:0cm 5.4pt 0cm 5.4pt;
> > mso-para-margin:0cm;
> > mso-para-margin-bottom:.0001pt;
> > mso-pagination:widow-orphan;
> > font-size:10.0pt;
> > font-family:"Times New Roman";}
> > </style>
> > <![endif]-->
> > </head>
> >
> > <body lang=EN-US style='tab-interval:36.0pt'>
> >
> > <div class=Section1>
> >
> > <p class=MsoNormal>Test <span class=SpellE>test</span> <span
> > class=SpellE>test</span></p>
> >
> > <p class=MsoNormal align=center style='text-align:center'><span
> > class=SpellE>Fdjfkasdjfkla</span></p>
> >
> > <p class=MsoNormal align=center style='text-align:center'><span
> > class=SpellE><b
> > style='mso-bidi-font-weight:normal'>Fdjkslafjdklaf</b></span></p>
> >
> > <p class=MsoNormal style='text-align:justify'><o:p>&nbsp;</o:p></p>
> >
> > <p class=MsoNormal style='text-align:justify'><span
> > class=SpellE>Fdasfdfasffasdfdaadfdfs</span></p>
> >
> > <p class=MsoNormal style='text-align:justify'><span
> > class=SpellE>Dfsdfs</span></p>
> >
> > <p class=MsoNormal style='text-align:justify'>Hi</p>
> >
> > <p class=MsoNormal style='text-align:justify'><o:p>&nbsp;</o:p></p>
> >
> > <p class=MsoNormal style='text-align:justify'><span
> > style='mso-tab-count:3'> </span><span
> > class=SpellE>Jfdklas</span></p>
> >
> > <p class=MsoNormal style='text-align:justify'><o:p>&nbsp;</o:p></p>
> >
> > </div>
> >
> > </body>
> >
> > </html>
> >
> > --
> > PHP General Mailing List (http://www.php.net/)
> > To unsubscribe, visit: http://www.php.net/unsub.php
> >
> >


