Re: [PHP] Good HTML parser needed

2008-05-15 Thread Eric Butera
On Wed, May 14, 2008 at 10:56 PM, Yi Wang [EMAIL PROTECTED] wrote:
 Can anyone provide some code that can't be stripped by strip_tags?


 On 5/15/08, Eric Butera [EMAIL PROTECTED] wrote:
 On Wed, May 14, 2008 at 11:38 AM, Robert Cummings [EMAIL PROTECTED] wrote:
  
  
On Wed, 2008-05-14 at 11:18 -0400, Eric Butera wrote:
 On Tue, May 13, 2008 at 4:07 AM, James Dempster [EMAIL PROTECTED] 
 wrote:
  http://htmlpurifier.org/
 
   --
   /James
 

 This is the only real solution.
  
That depends... if I'm the webmaster and I want to input arbitrary HTML,
then htmlpurifier is unnecessary.
  
  
  
Cheers,
Rob.
--
http://www.interjinn.com
Application and Templating Framework for PHP
  
  


 OP said users.  Strip tags doesn't bother with tag attributes so
  that is a security hole.  Any regex type solution will encounter the
  same set of issues.

  Htmlpurifier actually strips down and re-builds your html from the
  ground against a nice whitelist filtering system that you can
  customize to your needs.  No nasty tags/attributes will get through
  unless you want them to.


  --
  PHP General Mailing List (http://www.php.net/)
  To unsubscribe, visit: http://www.php.net/unsub.php




 --
 Regards,
 Wang Yi


I meant if you used the allow tags parameter.  If you allow say the
<b> tag, then you could say <b key=value> and it would pass right
through.

<?php

$str = "<b>hi</b><b onMouseOver='alert(/xss/);'>xss</b>";

echo "raw:\n";
var_dump($str);

echo "strip tags:\n";
var_dump(strip_tags($str));

echo "allow b:\n";
var_dump(strip_tags($str, '<b>'));
?>

raw:
string '<b>hi</b><b onMouseOver='alert(/xss/);'>xss</b>' (length=47)
strip tags:
string 'hixss' (length=5)
allow b:
string '<b>hi</b><b onMouseOver='alert(/xss/);'>xss</b>' (length=47)

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Good HTML parser needed

2008-05-15 Thread Andrew Ballard
On Tue, May 13, 2008 at 6:06 AM, Per Jessen [EMAIL PROTECTED] wrote:
 Shelley wrote:

 I want to know whether there are some good HTML parsers written in
 PHP.

 That is,
 the parser checks whether html tags like table, tr, td, div, dt, dl,
 dd, script, ul, li, span, h1, h2, etc. are nested correctly.
 If any tags not matched, just remove them.

 Except for the last part, any XML parser will do.  Sablotron, xalan,
 libxsl etc.


 /Per Jessen, Zürich

... except when the HTML is not well formed XML, as I find is often
the case when accepting input from users. That last part, as you
say, is kind of essential. It could be as simple as tags that don't
close in HTML (e.g. <img>, <br>, <hr>) or it could be something much
trickier to clean up such as mismatched tags, improper nesting,
missing closing tags (since some browsers are too forgiving of not
closing <td>, <li> or <option>), HTML entities that are not valid in
XML, etc. In these cases, the DOM-type parsers will usually choke. You
might be able to salvage something with the stream-based parsers like
SAX. (I've never tried it.)
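
A minimal sketch of the difference, assuming PHP's bundled DOM extension
(not mentioned elsewhere in this thread) and a made-up input string: the
strict XML loader rejects an unclosed tag, while the same class's HTML
loader repairs it. It only illustrates the point and is not a
recommendation for filtering user input.

```php
<?php
// Sketch only: the sample markup is made up for illustration.
$userHtml = '<p>one<br>two</p>';

// Collect parser errors quietly instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Strict XML parsing rejects the unclosed <br>.
$asXml = new DOMDocument();
$xmlOk = $asXml->loadXML($userHtml);

// The HTML parser repairs the markup and builds a tree.
$asHtml = new DOMDocument();
$htmlOk = $asHtml->loadHTML($userHtml);
libxml_clear_errors();

// Serialise only the children of <body>, leaving out the wrapper
// elements the parser adds around the fragment.
$fragment = '';
$body = $asHtml->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $fragment .= $asHtml->saveHTML($child);
    }
}

var_dump($xmlOk, $htmlOk);
echo $fragment, "\n";
```

Note that a repaired tree is not a sanitised one: attributes and script
content survive this step untouched.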

Andrew


Re: [PHP] Good HTML parser needed

2008-05-15 Thread Yi Wang

On 5/15/08, Eric Butera [EMAIL PROTECTED] wrote:
 On Wed, May 14, 2008 at 10:56 PM, Yi Wang [EMAIL PROTECTED] wrote:
   Can anyone provide some code that can't be stripped by strip_tags?
  
  
   On 5/15/08, Eric Butera [EMAIL PROTECTED] wrote:
   On Wed, May 14, 2008 at 11:38 AM, Robert Cummings 
[EMAIL PROTECTED] wrote:



  On Wed, 2008-05-14 at 11:18 -0400, Eric Butera wrote:
   On Tue, May 13, 2008 at 4:07 AM, James Dempster 
[EMAIL PROTECTED] wrote:

http://htmlpurifier.org/
   
 --
 /James
   
  
   This is the only real solution.

  That depends... if I'm the webmaster and I want to input 
arbitrary HTML,

  then htmlpurifier is unnecessary.



  Cheers,
  Rob.
  --
  http://www.interjinn.com
  Application and Templating Framework for PHP


  
  
   OP said users.  Strip tags doesn't bother with tag attributes so
that is a security hole.  Any regex type solution will encounter the
same set of issues.
  
Htmlpurifier actually strips down and re-builds your html from the
ground against a nice whitelist filtering system that you can
customize to your needs.  No nasty tags/attributes will get through
unless you want them to.
  
  
  
  
  
  
   --
   Regards,
   Wang Yi
  


 I meant if you used the allow tags parameter.  If you allow say the
  <b> tag, then you could say <b key=value> and it would pass right
  through.

  <?php

  $str = "<b>hi</b><b onMouseOver='alert(/xss/);'>xss</b>";

  echo "raw:\n";
  var_dump($str);

  echo "strip tags:\n";
  var_dump(strip_tags($str));

  echo "allow b:\n";
  var_dump(strip_tags($str, '<b>'));
  ?>

  raw:
  string '<b>hi</b><b onMouseOver='alert(/xss/);'>xss</b>' (length=47)
  strip tags:
  string 'hixss' (length=5)
  allow b:
  string '<b>hi</b><b onMouseOver='alert(/xss/);'>xss</b>' (length=47)


Yes, you are right. I have only ever used it with plain text.

Thanks!

--

cheers,
Yi Wang




Re: [PHP] Good HTML parser needed

2008-05-14 Thread Shelley
Thank you all.
I have it working very well now.
The solution is here: http://phparch.cn

On Tue, May 13, 2008 at 1:34 PM, Robert Cummings [EMAIL PROTECTED]
wrote:


 On Tue, 2008-05-13 at 01:27 -0400, Robert Cummings wrote:
  On Tue, 2008-05-13 at 12:28 +0800, Shelley wrote:
   Maybe I didn't use that tidy correctly.
   I don't want html, head, body things. Just parsed string.
 
  So strip them...
 
  <?php
  // ...
 
  tidy_parse_string( $html );
  tidy_clean_repair();
 
  $html = tidy_get_output();
 
  $html = preg_replace( '#^.*<body>#Uis', '', $html )
  $html = preg_replace( '#</body>#Uis', '', $html )
 
  //...
  ?>

 Whoops... noticed some bugs there :B

 <?php

 $html = preg_replace( '#^.*<body>#Uis', '', $html );
 $html = preg_replace( '#</body>.*$#Uis', '', $html );

 ?>

 Cheers,
 Rob.
 --
 http://www.interjinn.com
 Application and Templating Framework for PHP




-- 
Regards,
Shelley


Re: [PHP] Good HTML parser needed

2008-05-14 Thread Robert Cummings

On Wed, 2008-05-14 at 18:50 +0800, Shelley wrote:
 Thank you all.
 I have made it working excellent for me now.
 The solution is here: http://phparch.cn

Ah, there you go... show_body_only. I was too lazy when I used tidy a
while back to look through every option, so a quick preg stripping
sufficed :)
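
A minimal sketch of that option, assuming the tidy extension is installed
and using a made-up input string. In tidy's configuration the option is
spelled show-body-only; passing it to tidy_repair_string() returns the
repaired fragment without the html, head and body wrappers, so no regex
stripping is needed.

```php
<?php
// Sketch only: requires the tidy extension; the input string is made up.
$dirty = '<p>unclosed paragraph<b>bold text';

if (extension_loaded('tidy')) {
    // 'show-body-only' asks tidy for the repaired fragment alone,
    // without the html, head and body wrapper elements.
    $clean = tidy_repair_string($dirty, ['show-body-only' => true], 'utf8');
    echo $clean, "\n";
} else {
    echo "tidy extension not available\n";
}
```

Tidy repairs markup; it does not remove scripts or event-handler
attributes, so it is not a substitute for a sanitiser.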

Cheers,
Rob.
-- 
http://www.interjinn.com
Application and Templating Framework for PHP





Re: [PHP] Good HTML parser needed

2008-05-14 Thread Eric Butera
On Tue, May 13, 2008 at 4:07 AM, James Dempster [EMAIL PROTECTED] wrote:
 http://htmlpurifier.org/

  --
  /James


This is the only real solution.




Re: [PHP] Good HTML parser needed

2008-05-14 Thread Robert Cummings

On Wed, 2008-05-14 at 11:18 -0400, Eric Butera wrote:
 On Tue, May 13, 2008 at 4:07 AM, James Dempster [EMAIL PROTECTED] wrote:
  http://htmlpurifier.org/
 
   --
   /James
 
 
 This is the only real solution.

That depends... if I'm the webmaster and I want to input arbitrary HTML,
then htmlpurifier is unnecessary.

Cheers,
Rob.
-- 
http://www.interjinn.com
Application and Templating Framework for PHP





Re: [PHP] Good HTML parser needed

2008-05-14 Thread James Dempster

 Purifier will not only remove all malicious code (better known as XSS)
 with a thoroughly audited, secure yet permissive whitelist, it will also
 make sure your documents are *standards compliant.*


Set it up how you want it.
--
/James

On Wed, May 14, 2008 at 4:38 PM, Robert Cummings [EMAIL PROTECTED]
wrote:


 On Wed, 2008-05-14 at 11:18 -0400, Eric Butera wrote:
  On Tue, May 13, 2008 at 4:07 AM, James Dempster [EMAIL PROTECTED]
 wrote:
   http://htmlpurifier.org/
  
--
/James
  
 
  This is the only real solution.

 That depends... if I'm the webmaster and I want to input arbitrary HTML,
 then htmlpurifier is unnecessary.

 Cheers,
 Rob.
 --
 http://www.interjinn.com
 Application and Templating Framework for PHP




Re: [PHP] Good HTML parser needed

2008-05-14 Thread Eric Butera
On Wed, May 14, 2008 at 11:38 AM, Robert Cummings [EMAIL PROTECTED] wrote:


  On Wed, 2008-05-14 at 11:18 -0400, Eric Butera wrote:
   On Tue, May 13, 2008 at 4:07 AM, James Dempster [EMAIL PROTECTED] wrote:
http://htmlpurifier.org/
   
 --
 /James
   
  
   This is the only real solution.

  That depends... if I'm the webmaster and I want to input arbitrary HTML,
  then htmlpurifier is unnecessary.



  Cheers,
  Rob.
  --
  http://www.interjinn.com
  Application and Templating Framework for PHP



OP said users.  strip_tags() doesn't look at tag attributes, so any tag
you allow is a security hole.  Any regex-type solution will run into the
same set of issues.

HTML Purifier actually tears your HTML down and rebuilds it from the
ground up against a whitelist filtering system that you can
customize to your needs.  No nasty tags/attributes will get through
unless you want them to.
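
To make the "rebuild against a whitelist" idea concrete, here is a toy
sketch. It is not HTML Purifier and not its API: it assumes PHP's bundled
DOM extension, and renderAllowed() and whitelistFilter() are hypothetical
names made up for this illustration. It parses the input, then writes out
only whitelisted tags, never copying attributes.

```php
<?php
// Toy illustration only: this is NOT HTML Purifier and not a safe filter.
// renderAllowed() and whitelistFilter() are hypothetical helper names.

function renderAllowed(DOMNode $node, array $allowedTags): string
{
    // Text (including CDATA) is re-escaped on output.
    if ($node instanceof DOMText) {
        return htmlspecialchars($node->nodeValue, ENT_QUOTES, 'UTF-8');
    }
    // Comments, processing instructions etc. are dropped.
    if (!($node instanceof DOMElement)) {
        return '';
    }
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= renderAllowed($child, $allowedTags);
    }
    $name = strtolower($node->nodeName);
    if (in_array($name, $allowedTags, true)) {
        // Rebuild the tag from scratch; attributes are never copied.
        return "<$name>$inner</$name>";
    }
    // Disallowed element: keep its (escaped) text, drop the tag itself.
    return $inner;
}

function whitelistFilter(string $html, array $allowedTags): string
{
    if (trim($html) === '') {
        return '';
    }
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : renderAllowed($body, $allowedTags);
}

$str = "<b>hi</b><b onMouseOver='alert(/xss/);'>xss</b>";
echo whitelistFilter($str, ['b']), "\n";
```

A real sanitiser also has to handle void elements, character encodings,
per-tag attribute rules and URL schemes, which is why a maintained
library is the better choice for user input.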




Re: [PHP] Good HTML parser needed

2008-05-14 Thread Shelley
Yes, you are right, friend: users' input should only ever go inside the
body tag.

On Wed, May 14, 2008 at 11:06 PM, Robert Cummings [EMAIL PROTECTED]
wrote:


 On Wed, 2008-05-14 at 18:50 +0800, Shelley wrote:
  Thank you all.
  I have made it working excellent for me now.
  The solution is here: http://phparch.cn

 Ah, there you go... show_body_only. I was too lazy when I used tidy a
 while back to look through every option, so a quick preg stripping
 sufficed :)

 Cheers,
 Rob.
 --
 http://www.interjinn.com
 Application and Templating Framework for PHP




-- 
Regards,
Shelley


Re: [PHP] Good HTML parser needed

2008-05-14 Thread Yi Wang
Can anyone provide some code that can't be stripped by strip_tags?


On 5/15/08, Eric Butera [EMAIL PROTECTED] wrote:
 On Wed, May 14, 2008 at 11:38 AM, Robert Cummings [EMAIL PROTECTED] wrote:
  
  
On Wed, 2008-05-14 at 11:18 -0400, Eric Butera wrote:
 On Tue, May 13, 2008 at 4:07 AM, James Dempster [EMAIL PROTECTED] 
 wrote:
  http://htmlpurifier.org/
 
   --
   /James
 

 This is the only real solution.
  
That depends... if I'm the webmaster and I want to input arbitrary HTML,
then htmlpurifier is unnecessary.
  
  
  
Cheers,
Rob.
--
http://www.interjinn.com
Application and Templating Framework for PHP
  
  


 OP said users.  Strip tags doesn't bother with tag attributes so
  that is a security hole.  Any regex type solution will encounter the
  same set of issues.

  Htmlpurifier actually strips down and re-builds your html from the
  ground against a nice whitelist filtering system that you can
  customize to your needs.  No nasty tags/attributes will get through
  unless you want them to.






-- 
Regards,
Wang Yi




Re: [PHP] Good HTML parser needed

2008-05-14 Thread Yi Wang

Gabriel Sosa wrote:

this one
strip_tags('%3C%68%31%3E%68%65%6C%6C%6F%20%77%6F%72%6C%64%3C%2F%68%31%3E');

aka <h1>hello world</h1> using urlencode from http://ha.ckers.org/xss.html

watch out for possible XSS

regards

gabriel



On Wed, May 14, 2008 at 11:56 PM, Yi Wang [EMAIL PROTECTED] wrote:

Can anyone provide some code that can't be stripped by strip_tags?


On 5/15/08, Eric Butera [EMAIL PROTECTED] wrote:

On Wed, May 14, 2008 at 11:38 AM, Robert Cummings [EMAIL PROTECTED] wrote:
 
 
   On Wed, 2008-05-14 at 11:18 -0400, Eric Butera wrote:
On Tue, May 13, 2008 at 4:07 AM, James Dempster [EMAIL PROTECTED] wrote:
 http://htmlpurifier.org/

  --
  /James

   
This is the only real solution.
 
   That depends... if I'm the webmaster and I want to input arbitrary HTML,
   then htmlpurifier is unnecessary.
 
 
 
   Cheers,
   Rob.
   --
   http://www.interjinn.com
   Application and Templating Framework for PHP
 
 


OP said users.  Strip tags doesn't bother with tag attributes so
 that is a security hole.  Any regex type solution will encounter the
 same set of issues.

 Htmlpurifier actually strips down and re-builds your html from the
 ground against a nice whitelist filtering system that you can
 customize to your needs.  No nasty tags/attributes will get through
 unless you want them to.






--
Regards,
Wang Yi









Yes, strip_tags can't strip this raw string. But how would the string
actually carry XSS? It has already been urldecoded before we use it.


For example:

assuming the URL is
test.php?test_string=%3C%68%31%3E%68%65%6C%6C%6F%20%77%6F%72%6C%64%3C%2F%68%31%3E


<?php
var_dump( strip_tags( $_GET[ 'test_string' ] ) );
?>

should produce string(11) "hello world".
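
The same check can be run without a web server. In this sketch $encoded
is a stand-in for the raw query-string value, and the third case is an
assumption about where trouble could come from: code that filters first
and decodes afterwards.

```php
<?php
// Sketch only: $encoded stands in for a raw query-string value.
$encoded = '%3C%68%31%3E%68%65%6C%6C%6F%20%77%6F%72%6C%64%3C%2F%68%31%3E';

// strip_tags() sees no angle brackets here, so nothing is removed.
$strippedRaw = strip_tags($encoded);

// PHP decodes $_GET values once, so the filter normally sees real tags.
$strippedDecoded = strip_tags(urldecode($encoded));

// Risky order: filtering first and decoding afterwards brings the tags back.
$decodedTooLate = urldecode(strip_tags($encoded));

var_dump($strippedRaw === $encoded);   // bool(true)
var_dump($strippedDecoded);            // string(11) "hello world"
var_dump($decodedTooLate);             // string(20) "<h1>hello world</h1>"
```

So the encoded string is harmless on the normal $_GET path, and only
becomes a problem if something decodes it again after filtering.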





Re: [PHP] Good HTML parser needed

2008-05-13 Thread James Dempster
http://htmlpurifier.org/

--
/James

On Tue, May 13, 2008 at 4:34 AM, Shelley [EMAIL PROTECTED] wrote:

 Hi all,

 The fact is that I have a site that allow users to post hypertext
 articles.
 However, I saw that sometimes, because of their careless input,
 the articles is not rendered correctly.

 I want to know whether there are some good HTML parsers written in PHP.

 That is,
 the parser checks whether html tags like table, tr, td, div, dt, dl, dd,
 script, ul,
 li, span, h1, h2, etc. are nested correctly. If any tags not matched, just
 remove them.

 Any suggection is greatly appreciated.

 --
 Regards,
 Shelley



Re: [PHP] Good HTML parser needed

2008-05-13 Thread Per Jessen
Shelley wrote:

 I want to know whether there are some good HTML parsers written in
 PHP.
 
 That is,
 the parser checks whether html tags like table, tr, td, div, dt, dl,
 dd, script, ul, li, span, h1, h2, etc. are nested correctly. 
 If any tags not matched, just remove them.

Except for the last part, any XML parser will do.  Sablotron, xalan,
libxsl etc.  


/Per Jessen, Zürich





[PHP] Good HTML parser needed

2008-05-12 Thread Shelley
Hi all,

The fact is that I have a site that allows users to post hypertext articles.
However, I have seen that sometimes, because of careless input,
the articles are not rendered correctly.

I want to know whether there are any good HTML parsers written in PHP.

That is,
the parser checks whether HTML tags like table, tr, td, div, dt, dl, dd,
script, ul, li, span, h1, h2, etc. are nested correctly. If any tags are
not matched, it just removes them.

Any suggestion is greatly appreciated.

-- 
Regards,
Shelley


Re: [PHP] Good HTML parser needed

2008-05-12 Thread Yi Wang
strip_tags does the trick.

www.php.net/manual/en/function.strip-tags.php

BTW,
Why is cn2 dot php.net blocked by the mail server?

The rejected message:

This is an automatically generated Delivery Status Notification

Delivery to the following recipient failed permanently:

php-general@lists.php.net

Technical details of permanent failure:
PERM_FAILURE: Gmail tried to deliver your message, but it was rejected
by the recipient domain. The error that the other server returned was:
550 550-5.7.1 mail rejected by policy.  SURBL hit
550-Spammy URLs in your message
550 See http://master.php.net/mail/why.php?why=SURBL. We recommend
contacting the other email provider for further information about the
cause of this error. Thanks for your continued support. (state 17)

On 5/13/08, Shelley [EMAIL PROTECTED] wrote:
 Hi all,

  The fact is that I have a site that allow users to post hypertext articles.
  However, I saw that sometimes, because of their careless input,
  the articles is not rendered correctly.

  I want to know whether there are some good HTML parsers written in PHP.

  That is,
  the parser checks whether html tags like table, tr, td, div, dt, dl, dd,
  script, ul,
  li, span, h1, h2, etc. are nested correctly. If any tags not matched, just
  remove them.

  Any suggection is greatly appreciated.

  --
  Regards,

 Shelley



-- 
Regards,
Wang Yi




Re: [PHP] Good HTML parser needed

2008-05-12 Thread Robert Cummings

On Tue, 2008-05-13 at 11:34 +0800, Shelley wrote:
 Hi all,
 
 The fact is that I have a site that allow users to post hypertext articles.
 However, I saw that sometimes, because of their careless input,
 the articles is not rendered correctly.
 
 I want to know whether there are some good HTML parsers written in PHP.
 
 That is,
 the parser checks whether html tags like table, tr, td, div, dt, dl, dd,
 script, ul,
 li, span, h1, h2, etc. are nested correctly. If any tags not matched, just
 remove them.

http://ca3.php.net/manual/en/book.tidy.php

Cheers,
Rob.
-- 
http://www.interjinn.com
Application and Templating Framework for PHP





Re: [PHP] Good HTML parser needed

2008-05-12 Thread Shelley
Maybe I didn't use tidy correctly.
I don't want the html, head and body elements, just the parsed string.


On Tue, May 13, 2008 at 12:00 PM, Robert Cummings [EMAIL PROTECTED]
wrote:


 On Tue, 2008-05-13 at 11:34 +0800, Shelley wrote:
  Hi all,
 
  The fact is that I have a site that allow users to post hypertext
 articles.
  However, I saw that sometimes, because of their careless input,
  the articles is not rendered correctly.
 
  I want to know whether there are some good HTML parsers written in PHP.
 
  That is,
  the parser checks whether html tags like table, tr, td, div, dt, dl, dd,
  script, ul,
  li, span, h1, h2, etc. are nested correctly. If any tags not matched,
 just
  remove them.

 http://ca3.php.net/manual/en/book.tidy.php

 Cheers,
 Rob.
 --
 http://www.interjinn.com
 Application and Templating Framework for PHP




-- 
Regards,
Shelley


Re: [PHP] Good HTML parser needed

2008-05-12 Thread Yi Wang
You should pass the second parameter to the function. Like this:

$allowable_tags = '<p><a><td><table>';
strip_tags( $text, $allowable_tags );



On 5/13/08, Shelley [EMAIL PROTECTED] wrote:
 Not that.

 It will just remove all html tags, you know.


 --
 Regards,
 Shelley


-- 
Regards,
Wang Yi




Re: [PHP] Good HTML parser needed

2008-05-12 Thread Robert Cummings
On Tue, 2008-05-13 at 12:28 +0800, Shelley wrote:
 Maybe I didn't use that tidy correctly.
 I don't want html, head, body things. Just parsed string.

So strip them...

<?php
// ...

tidy_parse_string( $html );
tidy_clean_repair();

$html = tidy_get_output();

$html = preg_replace( '#^.*<body>#Uis', '', $html )
$html = preg_replace( '#</body>#Uis', '', $html )

//...
?>

Cheers,
Rob.
-- 
http://www.interjinn.com
Application and Templating Framework for PHP





Re: [PHP] Good HTML parser needed

2008-05-12 Thread Robert Cummings

On Tue, 2008-05-13 at 01:27 -0400, Robert Cummings wrote:
 On Tue, 2008-05-13 at 12:28 +0800, Shelley wrote:
  Maybe I didn't use that tidy correctly.
  I don't want html, head, body things. Just parsed string.
 
 So strip them...
 
 <?php
 // ...
 
 tidy_parse_string( $html );
 tidy_clean_repair();
 
 $html = tidy_get_output();
 
 $html = preg_replace( '#^.*<body>#Uis', '', $html )
 $html = preg_replace( '#</body>#Uis', '', $html )
 
 //...
 ?>

Whoops... noticed some bugs there :B

<?php

$html = preg_replace( '#^.*<body>#Uis', '', $html );
$html = preg_replace( '#</body>.*$#Uis', '', $html );

?>

Cheers,
Rob.
-- 
http://www.interjinn.com
Application and Templating Framework for PHP

