Edit report at https://bugs.php.net/bug.php?id=62032&edit=1

ID:                 62032
User updated by:    iamcraigcampbell at gmail dot com
Reported by:        iamcraigcampbell at gmail dot com
Summary:            filter_var incorrectly strips characters from strings after "<"
Status:             Open
Type:               Bug
Package:            Filter related
Operating System:   Mac OS X
PHP Version:        5.4.3
Block user comment: N
Private report:     N

New Comment:

@anon I agree with many of your sentiments :)

Just wanted to point out one thing. The issue of unclosed script tags or other tags would not be a problem assuming the output is escaped which it should be. Therefore if you had "<script" in the string it would end up being output as &lt;script and would not cause the issues that you mentioned.

As for displaying what the user typed I could see an argument either way on that.

The fact still remains that this is a bug.

Previous Comments:
------------------------------------------------------------------------
[2012-05-15 21:37:07] anon at anon dot anon

Well I never heard of this "SANITIZE_STRING" filter before, but it looks just as retarded as it sounds, and about as retarded as strip_tags. 99.99% of the time, strip_tags just should not be used at all because it's horribly broken. The real bugs are (1) strip_tags exists, and (2) that PHP should imply that any kind of magical all-purpose "string sanitization" process could exist.

@iamcraigcampbell:
>Well I can understand stripping it if there is a closing > somewhere, but if
>it is a < that is not followed by a matching > then it should be allowed in the string and not stripped.

In that case:
(1) Unclosed tags will eat extra page content, breaking page layout.
(2) Pages consist of many echo statements. By your logic, "<script" is a possibly legal string to echo, but if some later string contains a ">", we need to implement a delayed-choice quantum eraser to make all the parallel universes in which the earlier echo statement occurred cease to exist.

>I think it is more expected behavior to display this string as "This is NOT
>good!".

No.
Display what users type. Don't delete text from their posts based on the quirks of what just happens to be the underlying display format on a particular day.

Suppose your hypothetical forum also displays posts in another format, e.g., it has a Flash or iPhone-based app, or it tweets posts, or a few years from now we're all using a completely different markup language. Should it then also strip HTML-like tags from all text in perpetuity from all media just because HTML happened to be a relevant format to someone somewhere once upon a time, or should user-submitted text throw integrity to the wind and change depending on what kind of device someone is attempting to use to view it, whether or not that device's markup was invented when the post was made?

What if someone is trying to use text that legitimately resembles an HTML tag (it happens), or, more likely, they're trying to quote or talk about HTML -- no filter can handle this.

No no no no no. Display what they type and don't confuse the poor souls. I.e., use htmlspecialchars() if outputting to HTML; or if not, use whatever other escaping method is appropriate to the particular output format that still preserves the integrity of the user-typed text in that format, while making exception for the formatting markup that is legitimately supported and documented to be supported by the forum, such as markdown or bbcode syntax (and probably not HTML, since besides the fact that HTML is ugly and over-complicated for most forum post needs, strip_tags with an allowed tags parameter is the most dangerous of the lot and allows blatant abuse via attributes).

And don't get me started on entities.

tl;dr: no amount of wrapping it in flashy filter functions changes the fact that strip_tags confuses countless souls, is brain-damaged, and ought to be deprecated to death.
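
For illustration only, the "escape at output time" approach described in the comment above might look roughly like the sketch below. The helper name escape_for_html() and the sample posts are assumptions made for this example (the strings are borrowed from elsewhere in this report), and the type declarations assume PHP 7 or later rather than the 5.4.3 the report was filed against.

```php
<?php
// Hypothetical helper (not from the report): escape user text when it is
// written into an HTML page, instead of stripping it when it is submitted.
function escape_for_html(string $text): string
{
    // ENT_QUOTES escapes both double and single quotes, so the result can
    // also be placed inside a quoted attribute value.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// The stored posts are exactly what the users typed.
$posts = [
    'This is true: 2<5',
    '<script',
    'This is <strong>NOT</strong> good!',
];

foreach ($posts as $post) {
    echo escape_for_html($post), "\n";
}
// Prints:
// This is true: 2&lt;5
// &lt;script
// This is &lt;strong&gt;NOT&lt;/strong&gt; good!
```

A browser renders each escaped line as the text the user typed, and an unclosed "<script" never reaches the page as markup. A non-HTML output format would need its own escaping function in place of htmlspecialchars().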
------------------------------------------------------------------------
[2012-05-15 15:06:26] iamcraigcampbell at gmail dot com

@pajoye I agree with you, but there is a use case that encoding will not solve. Let's say there is a forum where users are posting. If the user posts:

"This is <strong>NOT</strong> good!"

and the tags get encoded, then the HTML tags will be displayed in the forum as plain text. I think it is more expected behavior to display this string as "This is NOT good!".

So one option would be encoding the < only if it is not followed by a >, but that is a lot of extra work to figure out.

At the end of the day the point is that no matter how you look at it I still think this is a bug.

$string = 'This is true: 2<5';

strip_tags($string);

and

filter_var($string, FILTER_SANITIZE_STRING);

should not strip out <5 since that is not an HTML tag.

------------------------------------------------------------------------
[2012-05-15 14:51:09] aleksey dot v dot korzun at gmail dot com

How is stripping anything after < with a space a valid operation? That seems like a lazy man's html stripper. Let's just blindly strip everything that can possibly be made into an html tag of any sort. Not.

------------------------------------------------------------------------
[2012-05-15 14:49:02] paj...@php.net

> or < should be encoded then, see http://www.php.net/manual/en/filter.filters.sanitize.php

btw, any option should be added using the option array or defaults, as is already the case.

------------------------------------------------------------------------
[2012-05-15 14:45:27] iamcraigcampbell at gmail dot com

So in that case I think strip_tags and filter_var are both broken. In this context:

"It is true that 5<10"
"It is true that 5 < 10"

Neither of these is an HTML tag, so the string should not be touched regardless of whether there is a space or not.
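
As a small reproduction sketch of the two strings in the comment above, the snippet below runs them through strip_tags() with no allowed-tags argument. The variable names are made up for this example. filter_var() with FILTER_SANITIZE_STRING is left out because that filter is deprecated as of PHP 8.1, so the sketch covers only the strip_tags() half of the complaint.

```php
<?php
// The two inputs quoted in the report; neither contains an HTML tag.
$withoutSpace = 'It is true that 5<10';
$withSpace    = 'It is true that 5 < 10';

// A "<" followed directly by a non-whitespace character is treated as the
// start of a tag. With no closing ">" in the string, everything from the
// "<" to the end of the string is dropped.
echo strip_tags($withoutSpace), "\n"; // It is true that 5

// A "<" followed by whitespace is kept as a literal character, so this
// string comes back unchanged.
echo strip_tags($withSpace), "\n";    // It is true that 5 < 10
```

The only difference between the two inputs is the space after "<", which is the inconsistency the report objects to.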
------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at

    https://bugs.php.net/bug.php?id=62032

-- 
Edit this bug report at https://bugs.php.net/bug.php?id=62032&edit=1