Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
Hi,

(Long time...)

On Sat, 04 Oct 2003 21:01:46 -0400 Gerard Samuel [EMAIL PROTECTED] wrote:
> - Edwin - wrote:
> > Far East languages are not necessarily in this form: &#n;
> > So, running htmlspecialchars() on, say, Japanese characters would do
> > NO harm since <, >, ', ", & are NOT Japanese characters ;)
> > Or, am I missing something? :)
>
> Not exactly. When storing Far East languages in a database, for
> example, that's the format it's stored as: &#x;

Hmm... I don't think that's the way they're stored. Maybe it's the way you see them, especially if you're doing the query on a terminal that doesn't support the encoding in question...

> Also, I've seen it in that form in the $_POST array from a form.

I've never seen this one. For example, any Japanese characters $_POSTed would *always* show up as they were entered when you do a print_r($_POST). If not, it most probably has something to do with the way you display the page (e.g. wrong encoding).

- E -

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
- Edwin - wrote:
> Far East languages are not necessarily in this form: &#n;
> So, running htmlspecialchars() on, say, Japanese characters would do
> NO harm since <, >, ', ", & are NOT Japanese characters ;)
> Or, am I missing something? :)

Not exactly. When storing Far East languages in a database, for example, that's the format it's stored as: &#x;

Also, I've seen it in that form in the $_POST array from a form.
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
* Thus wrote Gerard Samuel ([EMAIL PROTECTED]):
> - Edwin - wrote:
> > Far East languages are not necessarily in this form: &#n;
> > So, running htmlspecialchars() on, say, Japanese characters would do
> > NO harm since <, >, ', ", & are NOT Japanese characters ;)
> > Or, am I missing something? :)
>
> Not exactly. When storing Far East languages in a database, for
> example, that's the format it's stored as: &#x;
> Also, I've seen it in that form in the $_POST array from a form.

That is an HTML entity, and it is not how the text is stored. How that entity gets displayed depends entirely on what encoding you have set for the page.

The Japanese characters (charset ISO-2022-JP) used to display the phrase for 'Contents' are:

    ^[$B$3$s$F$s$D^[(B        (^[ == escape character)

I can store those exact characters (with the proper escape character) in a database without a problem.

Curt
--
List Stats: http://zirzow.dyndns.org/html/mlists/php_general/
"I used to think I was indecisive, but now I'm not so sure."
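
To make the distinction between stored bytes and HTML entities concrete, here is a minimal sketch that takes the ISO-2022-JP byte sequence quoted above, converts it to UTF-8, and then escapes it. It assumes the iconv extension is available and that the page is served as UTF-8; the variable names are illustrative and do not come from the thread.

```php
<?php
// ISO-2022-JP bytes for the phrase quoted above. Single-quoted pieces keep
// PHP from treating "$B", "$3", ... as variable interpolation.
$esc = "\x1B";
$jis = $esc . '$B' . '$3$s$F$s$D' . $esc . '(B';

// Convert to UTF-8 for a UTF-8 page. iconv() returns false on failure.
$utf8 = iconv('ISO-2022-JP', 'UTF-8', $jis);
if ($utf8 === false) {
    fwrite(STDERR, "conversion failed\n");
    exit(1);
}

// htmlspecialchars() only rewrites &, <, >, " and ', so the Japanese
// text passes through unchanged.
$html = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');
echo $html, "\n";
```

No numeric character references appear at any point: the text is stored and sent as encoded bytes, and the page's declared charset tells the browser how to read them.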
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
* Thus wrote Gerard Samuel ([EMAIL PROTECTED]):
> CPT John W. Holmes wrote:
> > From: Eugene Lee [EMAIL PROTECTED]
> > > On Wed, Oct 01, 2003 at 01:12:16AM -0400, Gerard Samuel wrote:
> > > :
> > > : Got a problem with htmlspecialchars being too greedy, where
> > > : for example, it converts
> > > : &foo;
> > > : to
> > > : &amp;foo;
> > > :
> > > : Yes, it displays correctly in the browser for some content, but not all.
> > > : (An example is posted below.)
> > > : So I came up with this example code, but I'm not sure if there is an
> > > : easier/better way to get the correct end result.
> > > : If there is a better way, feel free to let me know.
> > > : Thanks
> > > :
> > > : Note: I don't read/speak Chinese, so if it's offensive please forgive me.
> > > :
> > > : --
> > > : <?php
> > > :
> > > : $foo = '&#20013;&#25991; http://www.foo.com/index.php?foo=1&bar=2';
> >
> > Maybe you should run html_entity_decode() on the string first, then run
> > encode again. The decode will take &#20013; and turn it into its actual
> > character but not affect anything else. Then the recoding will turn it
> > back into &#20013; and also encode any other characters.
>
> John, a good idea, but unfortunately, after some tests and re-reading
> the manual, it seems html_entity_decode() only recognises HTML
> entities, not the numeric values of language characters.
> So I'm going to push ahead with my code, and see if it breaks anything :)

Hmm... take a look at
http://php.net/manual/en/function.get-html-translation-table.php

That will do exactly what John suggested.

Curt
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
Curt Zirzow wrote:
> That is an HTML entity, and it is not how the text is stored. How that
> entity gets displayed depends entirely on what encoding you have set
> for the page.
>
> The Japanese characters (charset ISO-2022-JP) used to display the
> phrase for 'Contents' are:
>
>     ^[$B$3$s$F$s$D^[(B        (^[ == escape character)
>
> I can store those exact characters (with the proper escape character)
> in a database without a problem.

Then I haven't a clue as to why it's stored that way, or whether it's just the way the command line displays the content. Using PostgreSQL on a database set up for Unicode storage, I'm getting:

test=# select topic_title from forum_topics order by topic_time desc limit 1 offset 4;
 topic_title
---
 testing japanese &#31038;&#20250;&#12491;&#12517;&#12540;&#12473;
(1 row)

I read your other post. I'll see what's possible with get_html_translation_table().
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
On Wed, Oct 01, 2003 at 02:46:14PM -0400, Gerard Samuel wrote:
:
: When I say that I don't know what characters I'm expecting,
: I'm not talking about normal HTML entities, like &amp; &nbsp; &lt;
: I'm talking about Chinese/Japanese/Korean/Taiwanese alphabets, numbers
: (even punctuation if applicable).
: Maybe I'm thinking too hard, but trying to take Far East languages'
: alphanumeric characters into account
: seems like overkill.
: Feel free to correct me.

Okay, I will. :-)

There are two issues: input and output.

HTML character references address the problem of displaying certain characters on a web browser. This is an output issue.

When you get CJKV data, you are most likely getting it in some encoding. Different Asian languages use their own encoding sets. For example, if you get Japanese text, it will be encoded in JIS, Shift-JIS, EUC, or something Unicode. You *have* to determine the type of data and its encoding. This is an input issue.
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
On Thu, 2 Oct 2003 01:54:43 -0500 Eugene Lee [EMAIL PROTECTED] wrote:
> On Wed, Oct 01, 2003 at 02:46:14PM -0400, Gerard Samuel wrote:
> :
> : When I say that I don't know what characters I'm expecting,
> : I'm not talking about normal HTML entities, like &amp; &nbsp; &lt;
> : I'm talking about Chinese/Japanese/Korean/Taiwanese alphabets, numbers
> : (even punctuation if applicable).
> : Maybe I'm thinking too hard, but trying to take Far East languages'
> : alphanumeric characters into account
> : seems like overkill.
> : Feel free to correct me.
>
> Okay, I will. :-)
>
> There are two issues: input and output.
>
> HTML character references address the problem of displaying certain
> characters on a web browser. This is an output issue.
>
> When you get CJKV data, you are most likely getting it in some
> encoding. Different Asian languages use their own encoding sets.
> For example, if you get Japanese text, it will be encoded in JIS,
> Shift-JIS, EUC, or something Unicode. You *have* to determine the
> type of data and its encoding. This is an input issue.

Hmm... but the characters in question were already (at least in the examples used) in something Unicode (&#n;). So, there's really no need to know whether it's JIS, Shift-JIS, EUC, etc.?

- E -
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
But,

On Wed, 01 Oct 2003 14:46:14 -0400 Gerard Samuel [EMAIL PROTECTED] wrote:

...[snip]...

> When I say that I don't know what characters I'm expecting,
> I'm not talking about normal HTML entities, like &amp; &nbsp; &lt;
> I'm talking about Chinese/Japanese/Korean/Taiwanese alphabets, numbers
> (even punctuation if applicable).
> Maybe I'm thinking too hard, but trying to take Far East languages'
> alphanumeric characters into account
> seems like overkill.
> Feel free to correct me.

Far East languages are not necessarily in this form: &#n;

So, running htmlspecialchars() on, say, Japanese characters would do NO harm since <, >, ', ", & are NOT Japanese characters ;)

Or, am I missing something? :)

- E -
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
On Thu, Oct 02, 2003 at 05:15:32PM +0900, - Edwin - wrote:
:
: On Thu, 2 Oct 2003 01:54:43 -0500 Eugene Lee [EMAIL PROTECTED] wrote:
: >
: > There are two issues: input and output.
: >
: > HTML character references address the problem of displaying certain
: > characters on a web browser. This is an output issue.
: >
: > When you get CJKV data, you are most likely getting it in some
: > encoding. Different Asian languages use their own encoding sets.
: > For example, if you get Japanese text, it will be encoded in JIS,
: > Shift-JIS, EUC, or something Unicode. You *have* to determine the
: > type of data and its encoding. This is an input issue.
:
: Hmm... but the characters in question were already (at least in the
: examples used) in something Unicode (&#n;). So, there's really
: no need to know whether it's JIS, Shift-JIS, EUC, etc.

The Chinese characters in question were already converted to their HTML numeric character references. That's because someone made the conscious and purposeful decision to provide the correct Unicode decimal numbers for those Chinese characters.

Your concern about accepting CJKV data is vague because you don't explain the source of the data or the encoding method of the data. You must determine these critical bits of info before you can decide how to display the data. There's no guarantee that the user will send you CJKV data in nice HTML numeric character references.

I'm not even sure exactly what you're trying to do.
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
On Thu, 2 Oct 2003 03:58:48 -0500 Eugene Lee [EMAIL PROTECTED] wrote:
> On Thu, Oct 02, 2003 at 05:15:32PM +0900, - Edwin - wrote:
> :
> : On Thu, 2 Oct 2003 01:54:43 -0500 Eugene Lee [EMAIL PROTECTED] wrote:
> : >
> : > There are two issues: input and output.
> : >
> : > HTML character references address the problem of displaying certain
> : > characters on a web browser. This is an output issue.
> : >
> : > When you get CJKV data, you are most likely getting it in some
> : > encoding. Different Asian languages use their own encoding sets.
> : > For example, if you get Japanese text, it will be encoded in JIS,
> : > Shift-JIS, EUC, or something Unicode. You *have* to determine the
> : > type of data and its encoding. This is an input issue.
> :
> : Hmm... but the characters in question were already (at least in the
> : examples used) in something Unicode (&#n;). So, there's really
> : no need to know whether it's JIS, Shift-JIS, EUC, etc.
>
> The Chinese characters in question were already converted to their HTML
> numeric character references. That's because someone made the conscious
> and purposeful decision to provide the correct Unicode decimal numbers
> for those Chinese characters.
>
> Your concern about accepting CJKV data is vague because you don't
> explain the source of the data or the encoding method of the data.

I can't really tell you that since *I* wasn't the OP ;)

> You must determine these critical bits of info before you can decide
> how to display the data.

Unless you know it's already in Unicode...

> There's no guarantee that the user will send you CJKV data in nice
> HTML numeric character references.

Very true.

> I'm not even sure exactly what you're trying to do.

Again, it's not me ;) (Wrong thread!) But I'm wondering as well: I'm not even sure exactly what the original poster is trying to do. :)

- E -
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
From: Eugene Lee [EMAIL PROTECTED]
> On Wed, Oct 01, 2003 at 01:12:16AM -0400, Gerard Samuel wrote:
> :
> : Got a problem with htmlspecialchars being too greedy, where
> : for example, it converts
> : &foo;
> : to
> : &amp;foo;
> :
> : Yes, it displays correctly in the browser for some content, but not all.
> : (An example is posted below.)
> : So I came up with this example code, but I'm not sure if there is an
> : easier/better way to get the correct end result.
> : If there is a better way, feel free to let me know.
> : Thanks
> :
> : Note: I don't read/speak Chinese, so if it's offensive please forgive me.
> :
> : --
> : <?php
> :
> : $foo = '&#20013;&#25991; http://www.foo.com/index.php?foo=1&bar=2';
>
> The problem isn't with htmlspecialchars(). It doesn't know what parts
> of the string are HTML character references and which parts are not.
> But if you're willing to dig up the numeric character references for
> those specific Chinese characters, then split the string into the part
> that needs no translation and the part that needs it. That is:
>
>     $foo1encoded = '&#20013;&#25991;';
>     $foo2raw = ' http://www.foo.com/index.php?foo=1&bar=2';
>     $foo = $foo1encoded . htmlspecialchars($foo2raw);

Maybe you should run html_entity_decode() on the string first, then run encode again. The decode will take &#20013; and turn it into its actual character but not affect anything else. Then the recoding will turn it back into &#20013; and also encode any other characters.

---John Holmes...
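
On a current PHP version, the decode-then-encode idea above can be sketched roughly as follows. This assumes the string and the page are both UTF-8, and the variable names are illustrative only. One difference from the description above: the second step leaves the Chinese characters as raw UTF-8 instead of turning them back into numeric references, which is fine as long as the page declares UTF-8.

```php
<?php
$foo = '&#20013;&#25991; http://www.foo.com/index.php?foo=1&bar=2';

// Step 1: turn every character reference into the character it stands for.
// "&bar" has no closing semicolon, so it is left alone.
$decoded = html_entity_decode($foo, ENT_QUOTES, 'UTF-8');

// Step 2: escape only the HTML-significant characters (&, <, >, ", ').
$safe = htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
```

The trade-off is that any text that was deliberately double-escaped in the input (for example `&amp;lt;`) loses one level of escaping on the way through.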
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
CPT John W. Holmes wrote:
> From: Eugene Lee [EMAIL PROTECTED]
> > On Wed, Oct 01, 2003 at 01:12:16AM -0400, Gerard Samuel wrote:
> > :
> > : Got a problem with htmlspecialchars being too greedy, where
> > : for example, it converts
> > : &foo;
> > : to
> > : &amp;foo;
> > :
> > : Yes, it displays correctly in the browser for some content, but not all.
> > : (An example is posted below.)
> > : So I came up with this example code, but I'm not sure if there is an
> > : easier/better way to get the correct end result.
> > : If there is a better way, feel free to let me know.
> > : Thanks
> > :
> > : Note: I don't read/speak Chinese, so if it's offensive please forgive me.
> > :
> > : --
> > : <?php
> > :
> > : $foo = '&#20013;&#25991; http://www.foo.com/index.php?foo=1&bar=2';
> >
> > The problem isn't with htmlspecialchars(). It doesn't know what parts
> > of the string are HTML character references and which parts are not.
> > But if you're willing to dig up the numeric character references for
> > those specific Chinese characters, then split the string into the part
> > that needs no translation and the part that needs it. That is:
> >
> >     $foo1encoded = '&#20013;&#25991;';
> >     $foo2raw = ' http://www.foo.com/index.php?foo=1&bar=2';
> >     $foo = $foo1encoded . htmlspecialchars($foo2raw);
>
> Maybe you should run html_entity_decode() on the string first, then run
> encode again. The decode will take &#20013; and turn it into its actual
> character but not affect anything else. Then the recoding will turn it
> back into &#20013; and also encode any other characters.

Eugene, your example leads me to believe that one knows beforehand which characters need special attention, in order to not run them through htmlspecialchars. I would never know which characters need special attention.

John, a good idea, but unfortunately, after some tests and re-reading the manual, it seems html_entity_decode() only recognises HTML entities, not the numeric values of language characters.

So I'm going to push ahead with my code, and see if it breaks anything :)
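
The code being referred to is not shown anywhere in this thread. As a rough illustration of what such a "diet" could look like, here is one possible approach: escape everything, then undo the escaping only for numeric character references. The function name escapeKeepingNumericRefs() is hypothetical and the UTF-8 charset is an assumption.

```php
<?php
/**
 * Escape a string for HTML but leave numeric character references intact.
 * Hypothetical helper for illustration; not the code from the thread.
 */
function escapeKeepingNumericRefs(string $text): string
{
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // htmlspecialchars() turned "&#20013;" into "&amp;#20013;". Undo that
    // for decimal (&#123;) and hexadecimal (&#x7B;) references only.
    return preg_replace('/&amp;(#[0-9]+|#[xX][0-9a-fA-F]+);/', '&$1;', $escaped);
}

$foo = '&#20013;&#25991; <b> http://www.foo.com/index.php?foo=1&bar=2';
echo escapeKeepingNumericRefs($foo), "\n";
```

Named entities such as `&amp;` are still escaped here, so this sketch bakes in exactly one assumption about the data: numeric references are markup, everything else is plain text.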
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
On Wed, Oct 01, 2003 at 02:02:08PM -0400, Gerard Samuel wrote:
: CPT John W. Holmes wrote:
: > From: Eugene Lee [EMAIL PROTECTED]
: > > On Wed, Oct 01, 2003 at 01:12:16AM -0400, Gerard Samuel wrote:
: > > :
: > > : Got a problem with htmlspecialchars being too greedy, where
: > > : for example, it converts
: > > : &foo;
: > > : to
: > > : &amp;foo;
[...]
: > > : $foo = '&#20013;&#25991; http://www.foo.com/index.php?foo=1&bar=2';
: > >
: > > The problem isn't with htmlspecialchars(). It doesn't know what parts
: > > of the string are HTML character references and which parts are not.
: > > But if you're willing to dig up the numeric character references for
: > > those specific Chinese characters, then split the string into the part
: > > that needs no translation and the part that needs it. That is:
: > >
: > >     $foo1encoded = '&#20013;&#25991;';
: > >     $foo2raw = ' http://www.foo.com/index.php?foo=1&bar=2';
: > >     $foo = $foo1encoded . htmlspecialchars($foo2raw);
: >
: > Maybe you should run html_entity_decode() on the string first, then run
: > encode again. The decode will take &#20013; and turn it into its actual
: > character but not affect anything else. Then the recoding will turn it
: > back into &#20013; and also encode any other characters.
:
: Eugene, your example leads me to believe that one knows beforehand
: which characters need special attention, in order to not run them
: through htmlspecialchars. I would never know which characters need
: special attention.

But it seems that you do know what characters need to be converted, because you included the exact Unicode character references for those Chinese characters.

You have to know your data. Or modify your code with specific assumptions about the data. For example, let's say I have a string that I got from somewhere (database, user form, text file, another web site, etc.):

    $foo = 'Dick &amp; Jane';

When you eventually display this to someone's web browser, what do you want them to see?

    Dick & Jane

or

    Dick &amp; Jane

This really depends on the format of the data inside $foo. Is '&amp;' a character reference that you want to leave alone? Or is it a literal string that you want to convert to '&amp;amp;' for display? And the only person that knows the format of the data is you.

Again, you have to know your data.
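
The two readings of that string can be spelled out in a short sketch. The variable names $asMarkup and $asText are illustrative, and a UTF-8 page is assumed.

```php
<?php
$foo = 'Dick &amp; Jane';

// Reading 1: $foo is already HTML. Send it as-is, and the browser
// renders the reference as an ampersand: Dick & Jane
$asMarkup = $foo;

// Reading 2: $foo is plain text that happens to contain "&amp;".
// Escape it, and the browser renders the literal text: Dick &amp; Jane
$asText = htmlspecialchars($foo, ENT_QUOTES, 'UTF-8');

echo $asMarkup, "\n"; // sent to the browser: Dick &amp; Jane
echo $asText, "\n";   // sent to the browser: Dick &amp;amp; Jane
```

Nothing in the string itself says which reading is right; that information has to come from wherever the data originated.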
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
Eugene Lee wrote:
> On Wed, Oct 01, 2003 at 02:02:08PM -0400, Gerard Samuel wrote:
> : CPT John W. Holmes wrote:
> : > From: Eugene Lee [EMAIL PROTECTED]
> : > > On Wed, Oct 01, 2003 at 01:12:16AM -0400, Gerard Samuel wrote:
> : > > :
> : > > : Got a problem with htmlspecialchars being too greedy, where
> : > > : for example, it converts
> : > > : &foo;
> : > > : to
> : > > : &amp;foo;
> [...]
> : > > : $foo = '&#20013;&#25991; http://www.foo.com/index.php?foo=1&bar=2';
> : > >
> : > > The problem isn't with htmlspecialchars(). It doesn't know what parts
> : > > of the string are HTML character references and which parts are not.
> : > > But if you're willing to dig up the numeric character references for
> : > > those specific Chinese characters, then split the string into the part
> : > > that needs no translation and the part that needs it. That is:
> : > >
> : > >     $foo1encoded = '&#20013;&#25991;';
> : > >     $foo2raw = ' http://www.foo.com/index.php?foo=1&bar=2';
> : > >     $foo = $foo1encoded . htmlspecialchars($foo2raw);
> : >
> : > Maybe you should run html_entity_decode() on the string first, then run
> : > encode again. The decode will take &#20013; and turn it into its actual
> : > character but not affect anything else. Then the recoding will turn it
> : > back into &#20013; and also encode any other characters.
> :
> : Eugene, your example leads me to believe that one knows beforehand
> : which characters need special attention, in order to not run them
> : through htmlspecialchars. I would never know which characters need
> : special attention.
>
> But it seems that you do know what characters need to be converted,
> because you included the exact Unicode character references for those
> Chinese characters.
>
> You have to know your data. Or modify your code with specific
> assumptions about the data. For example, let's say I have a string that
> I got from somewhere (database, user form, text file, another web site,
> etc.):
>
>     $foo = 'Dick &amp; Jane';
>
> When you eventually display this to someone's web browser, what do you
> want them to see?
>
>     Dick & Jane
>
> or
>
>     Dick &amp; Jane
>
> This really depends on the format of the data inside $foo. Is '&amp;'
> a character reference that you want to leave alone? Or is it a literal
> string that you want to convert to '&amp;amp;' for display? And the
> only person that knows the format of the data is you.
>
> Again, you have to know your data.

When I say that I don't know what characters I'm expecting, I'm not talking about normal HTML entities, like &amp; &nbsp; &lt;

I'm talking about Chinese/Japanese/Korean/Taiwanese alphabets, numbers (even punctuation if applicable).

Maybe I'm thinking too hard, but trying to take Far East languages' alphanumeric characters into account seems like overkill. Feel free to correct me.
Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet
On Wed, Oct 01, 2003 at 01:12:16AM -0400, Gerard Samuel wrote:
:
: Got a problem with htmlspecialchars being too greedy, where
: for example, it converts
: &foo;
: to
: &amp;foo;
:
: Yes, it displays correctly in the browser for some content, but not all.
: (An example is posted below.)
: So I came up with this example code, but I'm not sure if there is an
: easier/better way to get the correct end result.
: If there is a better way, feel free to let me know.
: Thanks
:
: Note: I don't read/speak Chinese, so if it's offensive please forgive me.
:
: --
: <?php
:
: $foo = '&#20013;&#25991; http://www.foo.com/index.php?foo=1&bar=2';

The problem isn't with htmlspecialchars(). It doesn't know what parts of the string are HTML character references and which parts are not. But if you're willing to dig up the numeric character references for those specific Chinese characters, then split the string into the part that needs no translation and the part that needs it. That is:

    $foo1encoded = '&#20013;&#25991;';
    $foo2raw = ' http://www.foo.com/index.php?foo=1&bar=2';
    $foo = $foo1encoded . htmlspecialchars($foo2raw);
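
For anyone reading this in the archive later: PHP 5.2.3 and newer give htmlspecialchars() a fourth parameter, $double_encode, which did not exist when this thread was written. Setting it to false tells the function to leave existing character references alone, which may cover the case above without splitting the string by hand. A minimal sketch, assuming a UTF-8 page:

```php
<?php
$foo = '&#20013;&#25991; http://www.foo.com/index.php?foo=1&bar=2';

// Fourth argument ($double_encode) is false: references already present
// in the string are kept, while a bare "&" is still escaped.
$html = htmlspecialchars($foo, ENT_QUOTES, 'UTF-8', false);

echo $html, "\n";
```

This still treats every well-formed reference in the input as markup, so it only fits data where that assumption holds.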