Have a look at preg_replace in the manual. You could also get your users to save as "filtered HTML", which gets rid of some of it. There's also an MS tool, "Microsoft Office HTML Filter 2", that will clean it up some more; it says it's for Word 2000, but it works fine for Word 2002/XP.
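For what it's worth, here is a rough sketch of the preg_replace route, assuming the whole upload is read into one string rather than processed line by line. The function name clean_word_html() and the three patterns are only my guesses based on the sample file quoted below, not a complete list of what Word can emit.

```php
<?php
// Sketch only: clean_word_html() is a made-up helper name, and the patterns
// are assumptions drawn from the sample file, not an exhaustive list.
function clean_word_html($html)
{
    $patterns = array(
        // The whole <head> block: meta tags, Office XML, <style> sections.
        // "s" lets "." match newlines, "i" ignores case.
        '/<head\b.*?<\/head>/is',
        // Any remaining comments, including <!--[if gte mso 9]>...<![endif]-->.
        '/<!--.*?-->/s',
        // Namespaced Office tags such as <o:p> and </o:p>.
        '/<\/?[a-z][a-z0-9]*:[^>]*>/i',
    );

    // Patterns are applied in order to the whole string, so tags that span
    // several lines are matched as well.
    $cleaned = preg_replace($patterns, '', $html);

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return $cleaned === null ? $html : $cleaned;
}
```

For the plain-text mail you could then run something like strip_tags(clean_word_html(file_get_contents($path))), where $path stands for wherever your upload script saved the file.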
Your best option, though, is to use preg_replace to swap out all the "smart tags" etc.

Paul Roberts
http://www.paul-roberts.com
[EMAIL PROTECTED]
++++++++++++++++++++++++

----- Original Message -----
From: "DL Neil" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Saturday, August 31, 2002 12:02 AM
Subject: Re: [PHP] Ridding myself of HTML tags

> Liam,
> If you were to stristr()/remove everything up to and including the </head>
> tag, would that take care of things?
> =dn
>
>
> > I've got a lil problem with HTML tags. Here's the description.
> >
> > My site accepts HTML files by upload. A lot of these files are written in
> > MS Word and then saved as HTML files from that. MS Word likes to put a
> > bunch of garbage at the beginning of the file. Now, when users upload
> > their HTML files, my script goes and striptags all of the unnecessary junk
> > in there except it can't rid all this junk (HTML, XML, CSS, JavaScript) at
> > the beginning of the HTML file. Some of these tags span multiple lines,
> > and my script goes through line-by-line, so it won't identify these as
> > tags. Is there a simpler fashion? I don't need the junk about style
> > sheeting and stuff, because I have a style sheet that will take care of
> > styling the files the way they should be. I don't want the extra tags,
> > even though they're invisible to users when they web-view, because these
> > are e-mailable files (for HTML mail, it's fine; for text mail, I need to
> > strip it down and that's the problem).
> >
> > =================================================
> > Just in case, I've included the HTML code below:
> >
> >
> > <html xmlns:o="urn:schemas-microsoft-com:office:office"
> > xmlns:w="urn:schemas-microsoft-com:office:word"
> > xmlns="http://www.w3.org/TR/REC-html40">
> >
> > <head>
> > <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
> > <meta name=ProgId content=Word.Document>
> > <meta name=Generator content="Microsoft Word 10">
> > <meta name=Originator content="Microsoft Word 10">
> > <link rel=File-List href="NW100_files/filelist.xml">
> > <title>Test test test</title>
> > <!--[if gte mso 9]><xml>
> > <o:DocumentProperties>
> > <o:Author>Liam Gibbs</o:Author>
> > <o:LastAuthor>Liam Gibbs</o:LastAuthor>
> > <o:Revision>1</o:Revision>
> > <o:TotalTime>1</o:TotalTime>
> > <o:Created>2002-08-30T18:09:00Z</o:Created>
> > <o:LastSaved>2002-08-30T18:10:00Z</o:LastSaved>
> > <o:Pages>1</o:Pages>
> > <o:Words>13</o:Words>
> > <o:Characters>79</o:Characters>
> > <o:Company>SXIA</o:Company>
> > <o:Lines>1</o:Lines>
> > <o:Paragraphs>1</o:Paragraphs>
> > <o:CharactersWithSpaces>91</o:CharactersWithSpaces>
> > <o:Version>10.3501</o:Version>
> > </o:DocumentProperties>
> > </xml><![endif]--><!--[if gte mso 9]><xml>
> > <w:WordDocument>
> > <w:SpellingState>Clean</w:SpellingState>
> > <w:GrammarState>Clean</w:GrammarState>
> > <w:Compatibility>
> > <w:BreakWrappedTables/>
> > <w:SnapToGridInCell/>
> > <w:WrapTextWithPunct/>
> > <w:UseAsianBreakRules/>
> > </w:Compatibility>
> > <w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
> > </w:WordDocument>
> > </xml><![endif]-->
> > <style>
> > <!--
> > /* Style Definitions */
> > p.MsoNormal, li.MsoNormal, div.MsoNormal
> > {mso-style-parent:"";
> > margin:0cm;
> > margin-bottom:.0001pt;
> > mso-pagination:widow-orphan;
> > font-size:12.0pt;
> > font-family:"Times New Roman";
> > mso-fareast-font-family:"Times New Roman";}
> > span.SpellE
> > {mso-style-name:"";
> > mso-spl-e:yes;}
> > @page Section1
> > {size:612.0pt 792.0pt;
> > margin:72.0pt 90.0pt 72.0pt 90.0pt;
> > mso-header-margin:35.4pt;
> > mso-footer-margin:35.4pt;
> > mso-paper-source:0;}
> > div.Section1
> > {page:Section1;}
> > -->
> > </style>
> > <!--[if gte mso 10]>
> > <style>
> > /* Style Definitions */
> > table.MsoNormalTable
> > {mso-style-name:"Table Normal";
> > mso-tstyle-rowband-size:0;
> > mso-tstyle-colband-size:0;
> > mso-style-noshow:yes;
> > mso-style-parent:"";
> > mso-padding-alt:0cm 5.4pt 0cm 5.4pt;
> > mso-para-margin:0cm;
> > mso-para-margin-bottom:.0001pt;
> > mso-pagination:widow-orphan;
> > font-size:10.0pt;
> > font-family:"Times New Roman";}
> > </style>
> > <![endif]-->
> > </head>
> >
> > <body lang=EN-US style='tab-interval:36.0pt'>
> >
> > <div class=Section1>
> >
> > <p class=MsoNormal>Test <span class=SpellE>test</span> <span
> > class=SpellE>test</span></p>
> >
> > <p class=MsoNormal align=center style='text-align:center'><span
> > class=SpellE>Fdjfkasdjfkla</span></p>
> >
> > <p class=MsoNormal align=center style='text-align:center'><span
> > class=SpellE><b
> > style='mso-bidi-font-weight:normal'>Fdjkslafjdklaf</b></span></p>
> >
> > <p class=MsoNormal style='text-align:justify'><o:p> </o:p></p>
> >
> > <p class=MsoNormal style='text-align:justify'><span
> > class=SpellE>Fdasfdfasffasdfdaadfdfs</span></p>
> >
> > <p class=MsoNormal style='text-align:justify'><span
> > class=SpellE>Dfsdfs</span></p>
> >
> > <p class=MsoNormal style='text-align:justify'>Hi</p>
> >
> > <p class=MsoNormal style='text-align:justify'><o:p> </o:p></p>
> >
> > <p class=MsoNormal style='text-align:justify'><span
> > style='mso-tab-count:3'> </span><span
> > class=SpellE>Jfdklas</span></p>
> >
> > <p class=MsoNormal style='text-align:justify'><o:p> </o:p></p>
> >
> > </div>
> >
> > </body>
> >
> > </html>
> >
> > --
> > PHP General Mailing List (http://www.php.net/)
> > To unsubscribe, visit: http://www.php.net/unsub.php