Re: [PHP] Good HTML parser needed
On Wed, May 14, 2008 at 10:56 PM, Yi Wang wrote:
> Can anyone provide some code that can't be stripped by strip_tags?

I meant if you used the allow tags parameter. If you allow, say, the <b> tag, then you could say <b key=value> and it would pass right through.

<?php
$str = "<b>hi</b><b onMouseOver='alert(/xss/);'>xss</b>";

echo "raw:\n";
var_dump($str);

echo "strip tags:\n";
var_dump(strip_tags($str));

echo "allow b:\n";
var_dump(strip_tags($str, '<b>'));
?>

raw:
string '<b>hi</b><b onMouseOver='alert(/xss/);'>xss</b>' (length=47)
strip tags:
string 'hixss' (length=5)
allow b:
string '<b>hi</b><b onMouseOver='alert(/xss/);'>xss</b>' (length=47)

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

Re: [PHP] Good HTML parser needed
On Tue, May 13, 2008 at 6:06 AM, Per Jessen wrote:
> Shelley wrote:
>> If any tags not matched, just remove them.
>
> Except for the last part, any XML parser will do. Sablotron, xalan, libxsl etc.

... except when the HTML is not well-formed XML, as I find is often the case when accepting input from users. That last part, as you say, is kind of essential.

It could be as simple as tags that don't close in HTML (e.g. img, br, hr), or it could be something much trickier to clean up, such as mismatched tags, improper nesting, missing closing tags (since some browsers are too forgiving of not closing td, li or option), HTML entities that are not valid in XML, etc. In these cases, the DOM-type parsers will usually choke. You might be able to salvage something with the stream-based parsers like SAX. (I've never tried it.)

Andrew

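To make Andrew's point concrete, here is a minimal sketch, assuming PHP's DOM extension is loaded. The helper name is_well_formed_xml() is made up for this illustration and does not come from the thread. It only shows that an XML parser rejects markup that browsers accept as HTML.

```php
<?php
// Hypothetical helper (not from the thread): true if $markup parses as XML.
function is_well_formed_xml(string $markup): bool
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok  = $doc->loadXML($markup);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok !== false;
}

// A self-closed <br/> is well-formed XML.
var_dump(is_well_formed_xml('<p>line one<br/>line two</p>')); // bool(true)

// An HTML-style <br> is never closed, so the XML parser gives up.
var_dump(is_well_formed_xml('<p>line one<br>line two</p>'));  // bool(false)
```

The same check fails for unclosed li or td elements, which is why a plain XML parser is a poor fit for user-submitted HTML.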
Re: [PHP] Good HTML parser needed
On 5/15/08, Eric Butera wrote:
> I meant if you used the allow tags parameter. If you allow, say, the <b> tag, then you could say <b key=value> and it would pass right through.

Yes, you are right. I had only ever used it where plain text was involved. Thanks!

--
cheers,
Yi Wang

Re: [PHP] Good HTML parser needed
Thank you all. I have it working excellently now. The solution is here: http://phparch.cn

On Tue, May 13, 2008 at 1:34 PM, Robert Cummings wrote:
> Whoops... noticed some bugs there :B
>
> <?php
>
> $html = preg_replace( '#^.*<body>#Uis', '', $html );
> $html = preg_replace( '#</body>.*$#Uis', '', $html );
>
> ?>

--
Regards,
Shelley

Re: [PHP] Good HTML parser needed
On Wed, 2008-05-14 at 18:50 +0800, Shelley wrote:
> Thank you all. I have it working excellently now. The solution is here: http://phparch.cn

Ah, there you go... show_body_only. I was too lazy when I used tidy a while back to look through every option, so a quick preg stripping sufficed :)

Cheers,
Rob.
--
http://www.interjinn.com
Application and Templating Framework for PHP

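For readers who cannot reach the linked solution, here is a rough sketch of the show-body-only approach Rob refers to. It assumes the tidy extension is installed. The helper name repair_fragment() and the output-xhtml setting are assumptions for this example, not details taken from Shelley's code.

```php
<?php
// Hypothetical helper: repair an HTML fragment with tidy and return only
// the body content. Returns null when the tidy extension is not loaded.
function repair_fragment(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null;
    }

    $config = [
        'show-body-only' => true, // omit doctype, <html>, <head> and <body>
        'output-xhtml'   => true, // assumption: XHTML-style output is wanted
    ];

    $repaired = tidy_repair_string($html, $config, 'utf8');

    return $repaired === false ? null : trim($repaired);
}

// Unclosed <li> elements are the kind of careless input from the original post.
$fixed = repair_fragment('<ul><li>one<li>two</ul>');
echo $fixed === null ? "tidy extension not available\n" : $fixed . "\n";
```

With show-body-only set, no regex stripping of the html, head and body wrapper is needed afterwards.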
Re: [PHP] Good HTML parser needed
On Tue, May 13, 2008 at 4:07 AM, James Dempster wrote:
> http://htmlpurifier.org/

This is the only real solution.

Re: [PHP] Good HTML parser needed
On Wed, 2008-05-14 at 11:18 -0400, Eric Butera wrote:
> On Tue, May 13, 2008 at 4:07 AM, James Dempster wrote:
>> http://htmlpurifier.org/
>
> This is the only real solution.

That depends... if I'm the webmaster and I want to input arbitrary HTML, then htmlpurifier is unnecessary.

Cheers,
Rob.
--
http://www.interjinn.com
Application and Templating Framework for PHP

Re: [PHP] Good HTML parser needed
Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are *standards compliant.*

Set it up how you want it.

--
/James

On Wed, May 14, 2008 at 4:38 PM, Robert Cummings wrote:
> That depends... if I'm the webmaster and I want to input arbitrary HTML, then htmlpurifier is unnecessary.

Re: [PHP] Good HTML parser needed
On Wed, May 14, 2008 at 11:38 AM, Robert Cummings wrote:
> That depends... if I'm the webmaster and I want to input arbitrary HTML, then htmlpurifier is unnecessary.

OP said users. strip_tags doesn't bother with tag attributes, so that is a security hole. Any regex-type solution will encounter the same set of issues. Htmlpurifier actually strips down and rebuilds your HTML from the ground up against a nice whitelist filtering system that you can customize to your needs. No nasty tags/attributes will get through unless you want them to.

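The "strip down and rebuild against a whitelist" idea can be sketched with PHP's DOM extension. To be clear, this is not HTML Purifier and not its API. The function names below are invented for illustration, the code has not been security audited, and it should not replace a maintained library. It only shows why rebuilding drops attributes that strip_tags lets through.

```php
<?php
// Hypothetical sketch of whitelist rebuilding (NOT HTML Purifier).
// Parses the input leniently, then re-emits only allowed tags, without
// any attributes, and with all text re-escaped.
function rebuild_with_whitelist(string $html, array $allowedTags): string
{
    $previous = libxml_use_internal_errors(true); // malformed input is expected
    $doc = new DOMDocument();
    // The XML declaration prefix is a common trick to make loadHTML read UTF-8.
    $doc->loadHTML('<?xml encoding="utf-8"?>' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);

    return $body === null ? '' : rebuild_children($body, $allowedTags);
}

function rebuild_children(DOMNode $node, array $allowedTags): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Text is escaped again on the way out.
            $out .= htmlspecialchars($child->nodeValue, ENT_QUOTES, 'UTF-8');
        } elseif ($child instanceof DOMElement) {
            $name = strtolower($child->nodeName);
            if ($name === 'script' || $name === 'style') {
                continue; // drop these elements together with their content
            }
            $inner = rebuild_children($child, $allowedTags);
            // Allowed tags are rebuilt bare; other tags keep only their content.
            $out .= in_array($name, $allowedTags, true)
                ? "<$name>$inner</$name>"
                : $inner;
        }
        // Comments and other node types are dropped.
    }
    return $out;
}

$str = "<b>hi</b><b onMouseOver='alert(/xss/);'>xss</b><script>alert(1)</script>";
var_dump(rebuild_with_whitelist($str, ['b'])); // string(19) "<b>hi</b><b>xss</b>"
```

Because every tag is emitted fresh from the whitelist, the onMouseOver attribute has no way to survive, which is the difference Eric describes.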
Re: [PHP] Good HTML parser needed
Yeah, you are right, friend. Users' input should only ever go inside the body tag.

On Wed, May 14, 2008 at 11:06 PM, Robert Cummings wrote:
> Ah, there you go... show_body_only. I was too lazy when I used tidy a while back to look through every option, so a quick preg stripping sufficed :)

--
Regards,
Shelley

Re: [PHP] Good HTML parser needed
Can anyone provide some code that can't be stripped by strip_tags?

On 5/15/08, Eric Butera wrote:
> OP said users. Strip tags doesn't bother with tag attributes so that is a security hole. Any regex type solution will encounter the same set of issues.

--
Regards,
Wang Yi

Re: [PHP] Good HTML parser needed
Gabriel Sosa wrote:
> this one
>
> strip_tags('%3C%68%31%3E%68%65%6C%6C%6F%20%77%6F%72%6C%64%3C%2F%68%31%3E');
>
> aka <h1>hello world</h1> using urlencode
>
> from http://ha.ckers.org/xss.html
>
> take care the possible xss
>
> saludos
> gabriel
>
> On Wed, May 14, 2008 at 11:56 PM, Yi Wang wrote:
>> Can anyone provide some code that can't be stripped by strip_tags?

Yes, this raw string can't be stripped by strip_tags. But how would the string actually carry XSS? The string has already been urldecoded before we use it. For example, assuming the URL is

test.php?test_string=%3C%68%31%3E%68%65%6C%6C%6F%20%77%6F%72%6C%64%3C%2F%68%31%3E

then

<?php
var_dump( strip_tags( $_GET[ 'test_string' ] ) );
?>

should produce

string(11) "hello world"

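Both posters have a point, and the order of operations decides which one applies. The sketch below is an illustration added for clarity, not code from the thread. It shows that the encoded string is only a problem when something decodes it after strip_tags has already run, for example a second urldecode() call on a value PHP has already decoded.

```php
<?php
$encoded = '%3C%68%31%3E%68%65%6C%6C%6F%20%77%6F%72%6C%64%3C%2F%68%31%3E';

// No angle brackets are present yet, so strip_tags has nothing to remove.
var_dump(strip_tags($encoded) === $encoded); // bool(true)

// Decode first (PHP does this for $_GET automatically), then strip.
var_dump(strip_tags(urldecode($encoded)));   // string(11) "hello world"

// The risky order: strip first, decode afterwards. The markup comes back.
var_dump(urldecode(strip_tags($encoded)));   // string(20) "<h1>hello world</h1>"
```

In other words, filter the value in the form in which it will finally be output, and avoid decoding it again afterwards.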
Re: [PHP] Good HTML parser needed
http://htmlpurifier.org/

--
/James

On Tue, May 13, 2008 at 4:34 AM, Shelley wrote:
> I want to know whether there are some good HTML parsers written in PHP.

Re: [PHP] Good HTML parser needed
Shelley wrote:
> I want to know whether there are some good HTML parsers written in PHP.
>
> That is, the parser checks whether html tags like table, tr, td, div, dt, dl, dd, script, ul, li, span, h1, h2, etc. are nested correctly. If any tags not matched, just remove them.

Except for the last part, any XML parser will do. Sablotron, xalan, libxsl etc.

/Per Jessen, Zürich

[PHP] Good HTML parser needed
Hi all,

I have a site that allows users to post hypertext articles. However, I have seen that sometimes, because of careless input, the articles are not rendered correctly.

I want to know whether there are some good HTML parsers written in PHP. That is, the parser checks whether HTML tags like table, tr, td, div, dt, dl, dd, script, ul, li, span, h1, h2, etc. are nested correctly. If any tags are not matched, just remove them.

Any suggestion is greatly appreciated.

--
Regards,
Shelley

Re: [PHP] Good HTML parser needed
strip_tags does the trick.

www.php.net/manual/en/function.strip-tags.php

BTW, why is cn2 dot php.net blocked by the mail server? The rejected message:

This is an automatically generated Delivery Status Notification

Delivery to the following recipient failed permanently:

    php-general@lists.php.net

Technical details of permanent failure:
PERM_FAILURE: Gmail tried to deliver your message, but it was rejected by the recipient domain. The error that the other server returned was: 550 550-5.7.1 mail rejected by policy. SURBL hit 550-Spammy URLs in your message 550 See http://master.php.net/mail/why.php?why=SURBL. We recommend contacting the other email provider for further information about the cause of this error. Thanks for your continued support. (state 17)

On 5/13/08, Shelley wrote:
> I want to know whether there are some good HTML parsers written in PHP.

--
Regards,
Wang Yi

Re: [PHP] Good HTML parser needed
On Tue, 2008-05-13 at 11:34 +0800, Shelley wrote:
> I want to know whether there are some good HTML parsers written in PHP.
>
> That is, the parser checks whether html tags like table, tr, td, div, dt, dl, dd, script, ul, li, span, h1, h2, etc. are nested correctly. If any tags not matched, just remove them.

http://ca3.php.net/manual/en/book.tidy.php

Cheers,
Rob.
--
http://www.interjinn.com
Application and Templating Framework for PHP

Re: [PHP] Good HTML parser needed
Maybe I didn't use tidy correctly. I don't want the html, head and body parts, just the parsed string.

On Tue, May 13, 2008 at 12:00 PM, Robert Cummings wrote:
> http://ca3.php.net/manual/en/book.tidy.php

--
Regards,
Shelley

Re: [PHP] Good HTML parser needed
You should pass the second parameter to the function. Like this:

$allowable_tags = '<p><a><td><table>';
strip_tags( $text, $allowable_tags );

On 5/13/08, Shelley wrote:
> Not that. It will just remove all html tags, you know.

--
Regards,
Wang Yi

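A short illustration of what the second parameter does, added here as a sketch with a made-up sample string. Note that strip_tags removes the tags it does not allow but keeps the text between them, including the text inside a script element.

```php
<?php
$text = '<p>Hello <i>world</i></p><script>alert(1)</script>';
$allowable_tags = '<p><a><td><table>';

// <p> survives; <i> and <script> tags are removed, their text content is kept.
var_dump(strip_tags($text, $allowable_tags));
// string(26) "<p>Hello world</p>alert(1)"
```

The allowed tags also keep whatever attributes they were given, so this parameter controls which tags remain, not whether they are safe.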
Re: [PHP] Good HTML parser needed
On Tue, 2008-05-13 at 12:28 +0800, Shelley wrote:
> Maybe I didn't use that tidy correctly. I don't want html, head, body things. Just parsed string.

So strip them...

<?php

// ...

tidy_parse_string( $html );
tidy_clean_repair();
$html = tidy_get_output();

$html = preg_replace( '#^.*<body>#Uis', '', $html )
$html = preg_replace( '#</body>#Uis', '', $html )

//...

?>

Cheers,
Rob.
--
http://www.interjinn.com
Application and Templating Framework for PHP

Re: [PHP] Good HTML parser needed
On Tue, 2008-05-13 at 01:27 -0400, Robert Cummings wrote:
> So strip them...
>
> $html = preg_replace( '#^.*<body>#Uis', '', $html )
> $html = preg_replace( '#</body>#Uis', '', $html )

Whoops... noticed some bugs there :B

<?php

$html = preg_replace( '#^.*<body>#Uis', '', $html );
$html = preg_replace( '#</body>.*$#Uis', '', $html );

?>

Cheers,
Rob.
--
http://www.interjinn.com
Application and Templating Framework for PHP

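Rob's corrected patterns can be wrapped into a small, testable function. This is a sketch under a few assumptions: the helper name body_only() is invented here, the opening pattern is loosened to accept attributes on the body tag, and the U modifier is replaced by an explicit lazy quantifier. It is a variation on the code above, not the author's exact version.

```php
<?php
// Hypothetical helper: keep only what sits between <body ...> and </body>.
function body_only(string $html): string
{
    // Drop everything up to and including the opening body tag.
    // [^>]* allows attributes such as class="..." on the tag.
    $html = preg_replace('#^.*?<body[^>]*>#is', '', $html);

    // Drop the closing body tag and everything after it.
    $html = preg_replace('#</body>.*$#is', '', $html);

    return trim($html);
}

$page = "<html><head><title>t</title></head>\n"
      . "<body class=\"x\">\n<p>hi</p>\n</body>\n</html>\n";

var_dump(body_only($page)); // string(9) "<p>hi</p>"
```

If the input has no body tag at all, neither pattern matches and the string comes back unchanged apart from trimming.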