Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-09 Thread - Edwin -
Hi,

(Long time...)

On Sat, 04 Oct 2003 21:01:46 -0400
Gerard Samuel <[EMAIL PROTECTED]> wrote:

> - Edwin - wrote:
>
> > Far east languages are not necessarily in this form: &#n; So,
> > running htmlspecialchars() on, say, Japanese characters would do NO
> > harm since &, ", ', <, > are NOT Japanese characters ;)
> >
> > Or, am I missing something? :)
>
> Not exactly.  When storing far east languages in a database, for
> example, that's the format it's stored as:
> &#x;

Hmm... I don't think that's the way they're stored--maybe the way you
see them esp. if you're doing the query on a terminal that doesn't
support the encoding in question...

> Also, I've seen it in that form in the $_POST array from a form.

I've never seen that. Japanese characters that are $_POSTed *always*
show up as they were entered when you do a print_r($_POST). If they
don't, it most probably has something to do with the way you display
the page (e.g. the wrong encoding).

- E -


__
Do You Yahoo!?
Yahoo! BB is Broadband by Yahoo!
http://bb.yahoo.co.jp/

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-04 Thread Gerard Samuel
- Edwin - wrote:

> Far east languages are not necessarily in this form: &#n; So,
> running htmlspecialchars() on, say, Japanese characters would do NO
> harm since &, ", ', <, > are NOT Japanese characters ;)
>
> Or, am I missing something? :)

Not exactly.  When storing far east languages in a database, for
example, that's the format it's stored as:
&#x;
Also, I've seen it in that form in the $_POST array from a form.



Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-04 Thread Curt Zirzow
* Thus wrote Gerard Samuel ([EMAIL PROTECTED]):
> - Edwin - wrote:
>
> > Far east languages are not necessarily in this form: &#n; So,
> > running htmlspecialchars() on, say, Japanese characters would do NO
> > harm since &, ", ', <, > are NOT Japanese characters ;)
> >
> > Or, am I missing something? :)
>
> Not exactly.  When storing far east languages in a database, for
> example, that's the format it's stored as:
> &#x;
> Also, I've seen it in that form in the $_POST array from a form.

That is an HTML entity, and it is not how the text is stored. How that
entity gets displayed depends entirely on what encoding you have set
for the page.

The Japanese characters (charset ISO-2022-JP) used to display the
word for 'Contents' are:

^[$B$3$s$F$s$D^[(B

(^[ == escape character)

I can store those exact characters (with the proper escape
character) in a database without a problem.


Curt
-- 
List Stats: http://zirzow.dyndns.org/html/mlists/php_general/

I used to think I was indecisive, but now I'm not so sure.




Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-04 Thread Curt Zirzow
* Thus wrote Gerard Samuel ([EMAIL PROTECTED]):
> CPT John W. Holmes wrote:
>
> > From: Eugene Lee <[EMAIL PROTECTED]>
> >
> > > On Wed, Oct 01, 2003 at 01:12:16AM -0400, Gerard Samuel wrote:
> > > :
> > > : Got a problem with htmlspecialchars being too greedy, where
> > > : for example, it converts
> > > : &foo;
> > > : to
> > > : &amp;foo;
> > > :
> > > : Yes it displays correctly in the browser for some content, but not all.
> > > : (an example is posted below)
> > > : So I came up with this example code, but not sure if there is an
> > > : easier/better way to get the correct end result.
> > > : If there is a better way, feel free to let me know.
> > > : Thanks
> > > :
> > > : Note: I don't read/speak Chinese, so if it's offensive please forgive me.
> > > :
> > > : --
> > > : <?php
> > > :
> > > : $foo = '&#20013;&#25991;  http://www.foo.com/index.php?foo=1&bar=2';
> >
> > Maybe you should run html_entity_decode() on the string first, then run
> > encode again. The decode will take &#20013; and turn it into its actual
> > character but not affect anything else. Then the recoding will turn it back
> > into &#20013; and also encode any other characters.
>
> John, a good idea, but unfortunately, after some tests, and re-reading
> the manual, it seems html_entity_decode()
> only recognises html entities.  Not ascii values of language characters.
> So I'm going to push ahead with my code, and see if it breaks anything :)

Hmm... take a look at
  http://php.net/manual/en/function.get-html-translation-table.php

That will do exactly what John suggested.
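
A minimal sketch of what get_html_translation_table() gives you, assuming a PHP build with default settings. The variable names are only for illustration. Note that the table lists the characters PHP itself translates; it has no entries for arbitrary numeric references such as &#20013;, so it may not cover every case discussed in this thread.

```php
<?php
// The table htmlspecialchars() works from: character => entity.
$table = get_html_translation_table(HTML_SPECIALCHARS);

// Flipping it gives entity => character, which strtr() can use to decode.
$decode_map = array_flip($table);
$decoded = strtr('foo=1&amp;bar=2 &lt;b&gt;', $decode_map);

echo $decoded, "\n";
```

Passing HTML_ENTITIES instead of HTML_SPECIALCHARS returns the larger table that htmlentities() uses.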

 
 

Curt



Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-04 Thread Gerard Samuel
Curt Zirzow wrote:

> That is an HTML entity, and it is not how the text is stored. How that
> entity gets displayed depends entirely on what encoding you have set
> for the page.
>
> The Japanese characters (charset ISO-2022-JP) used to display the
> word for 'Contents' are:
>
> ^[$B$3$s$F$s$D^[(B
>
> (^[ == escape character)
>
> I can store those exact characters (with the proper escape
> character) in a database without a problem.

Then I haven't a clue as to why it's stored that way, or it's just the way
the command line displays the content.
Using PostgreSQL on a database set up for Unicode storage, I'm getting:

test=# select topic_title from forum_topics order by topic_time desc
limit 1 offset 4;
   topic_title
---
testing japanese &#31038;&#20250;&#12491;&#12517;&#12540;&#12473;
(1 row)

I read your other post; I'll see what's possible with
get_html_translation_table().



Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-02 Thread Eugene Lee
On Wed, Oct 01, 2003 at 02:46:14PM -0400, Gerard Samuel wrote:
: 
: When I say that I don't know what characters I'm expecting,
: I'm not talking about normal html entities like &amp; &nbsp; &lt;
: I'm talking about Chinese/Japanese/Korean/Taiwanese alphabet, numbers
: (even punctuation if applicable).
: Maybe I'm thinking too hard, but trying to take far east languages'
: alphanumeric characters into account
: seems like overkill.
: Feel free to correct me.

Okay, I will.  :-)

There are two issues: input and output.

HTML character references address the problem of displaying certain
characters on a web browser.  This is an output issue.

When you get CJKV data, you are most likely getting it in some encoding.
Different Asian languages use their own encoding sets.  For example, if
you get Japanese text, it will be encoded in JIS, Shift-JIS, EUC, or
something Unicode.  You *have* to determine the type of data and its
encoding.  This is an input issue.
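
The two steps can be kept apart in code. This is only a sketch, assuming the iconv extension is available, that the input is already known to be Shift-JIS, and that the page is served as UTF-8. The variable names are made up for the example.

```php
<?php
// Assumption: we know from the form, the headers, or the database that
// the input is Shift-JIS. These bytes spell "日本語" in Shift-JIS.
$sjis_bytes = "\x93\xFA\x96\x7B\x8C\xEA";

// Input step: normalise to one internal encoding (UTF-8 here).
$utf8_text = iconv('SHIFT_JIS', 'UTF-8', $sjis_bytes);

// Output step: escape for HTML, telling PHP which encoding is in use.
$html_text = htmlspecialchars($utf8_text, ENT_QUOTES, 'UTF-8');

echo $html_text, "\n";
```

If the source encoding is not known, it has to be found out first; the conversion above cannot guess it.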




Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-02 Thread - Edwin -
On Thu, 2 Oct 2003 01:54:43 -0500
Eugene Lee <[EMAIL PROTECTED]> wrote:

> On Wed, Oct 01, 2003 at 02:46:14PM -0400, Gerard Samuel wrote:
> :
> : When I say that I don't know what characters I'm expecting,
> : I'm not talking about normal html entities like &amp; &nbsp; &lt;
> : I'm talking about Chinese/Japanese/Korean/Taiwanese alphabet, numbers
> : (even punctuation if applicable).
> : Maybe I'm thinking too hard, but trying to take far east languages'
> : alphanumeric characters into account
> : seems like overkill.
> : Feel free to correct me.
>
> Okay, I will.  :-)
>
> There are two issues: input and output.
>
> HTML character references address the problem of displaying certain
> characters on a web browser.  This is an output issue.
>
> When you get CJKV data, you are most likely getting it in some
> encoding. Different Asian languages use their own encoding sets.
> For example, if you get Japanese text, it will be encoded in JIS,
> Shift-JIS, EUC, or something Unicode.  You *have* to determine the
> type of data and its encoding.  This is an input issue.

Hmm... but the characters in question were already (at least in the
examples used) in something Unicode (&#n;). So, there's really
no need to know whether it's JIS, Shift-JIS, EUC, etc.

?

- E -



Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-02 Thread - Edwin -
But,

On Wed, 01 Oct 2003 14:46:14 -0400
Gerard Samuel <[EMAIL PROTECTED]> wrote:

...[snip]...

> When I say that I don't know what characters I'm expecting,
> I'm not talking about normal html entities like &amp; &nbsp; &lt;
> I'm talking about Chinese/Japanese/Korean/Taiwanese alphabet, numbers
> (even punctuation if applicable).
> Maybe I'm thinking too hard, but trying to take far east languages'
> alphanumeric characters into account
> seems like overkill.
> Feel free to correct me.

Far east languages are not necessarily in this form: &#n; So,
running htmlspecialchars() on, say, Japanese characters would do NO
harm since &, ", ', <, > are NOT Japanese characters ;)

Or, am I missing something? :)
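
A quick way to check this, sketched under the assumption that the text is UTF-8 and that htmlspecialchars() is told so through its third argument. The sample string is made up.

```php
<?php
// Assumption: the string is UTF-8 and the page is served as UTF-8.
$title = '目次 <b>"Contents"</b> & more';

// Only &, ", ', < and > are rewritten; the Japanese characters pass through.
$escaped_title = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

echo $escaped_title, "\n";
```

With other encodings the result depends on the charset argument matching the data, so this is not a general guarantee.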

- E -



Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-02 Thread Eugene Lee
On Thu, Oct 02, 2003 at 05:15:32PM +0900, - Edwin - wrote:
: 
: On Thu, 2 Oct 2003 01:54:43 -0500 Eugene Lee <[EMAIL PROTECTED]> wrote:
: >
: > There are two issues: input and output.
: >
: > HTML character references address the problem of displaying certain
: > characters on a web browser.  This is an output issue.
: >
: > When you get CJKV data, you are most likely getting it in some
: > encoding. Different Asian languages use their own encoding sets.
: > For example, if you get Japanese text, it will be encoded in JIS,
: > Shift-JIS, EUC, or something Unicode.  You *have* to determine the
: > type of data and its encoding.  This is an input issue.
:
: Hmm... but the characters in question were already (at least in the
: examples used) in something Unicode (&#n;). So, there's really
: no need to know whether it's JIS, Shift-JIS, EUC, etc.

The Chinese characters in question were already converted to their HTML
numeric character references.  That's because someone made the conscious
and purposeful decision to provide the correct Unicode decimal numbers
for those Chinese characters.  Your concern about accepting CJKV data is
vague because you don't explain the source of the data or the encoding
method of the data.  You must determine these critical bits of info
before you can decide how to display the data.  There's no guarantee
that the user will send you CJKV data in nice HTML numeric character
references.  I'm not even sure exactly what you're trying to do.




Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-02 Thread - Edwin -
On Thu, 2 Oct 2003 03:58:48 -0500
Eugene Lee <[EMAIL PROTECTED]> wrote:

> On Thu, Oct 02, 2003 at 05:15:32PM +0900, - Edwin - wrote:
> :
> : On Thu, 2 Oct 2003 01:54:43 -0500 Eugene Lee <[EMAIL PROTECTED]> wrote:
> : >
> : > There are two issues: input and output.
> : >
> : > HTML character references address the problem of displaying certain
> : > characters on a web browser.  This is an output issue.
> : >
> : > When you get CJKV data, you are most likely getting it in some
> : > encoding. Different Asian languages use their own encoding sets.
> : > For example, if you get Japanese text, it will be encoded in JIS,
> : > Shift-JIS, EUC, or something Unicode.  You *have* to determine the
> : > type of data and its encoding.  This is an input issue.
> :
> : Hmm... but the characters in question were already (at least in the
> : examples used) in something Unicode (&#n;). So, there's really
> : no need to know whether it's JIS, Shift-JIS, EUC, etc.
>
> The Chinese characters in question were already converted to their
> HTML numeric character references.  That's because someone made the
> conscious and purposeful decision to provide the correct Unicode
> decimal numbers for those Chinese characters.  Your concern about
> accepting CJKV data is vague because you don't explain the source of
> the data or the encoding method of the data.

I can't really tell you that since *I* wasn't the OP ;)

> You must determine these critical bits of info
> before you can decide how to display the data.

Unless you know it's already in Unicode...

> There's no guarantee
> that the user will send you CJKV data in nice HTML numeric character
> references.

Very true.

> I'm not even sure exactly what you're trying to do.

Again, it's not me ;) (Wrong thread!)

But I'm wondering as well; I'm not even sure exactly what the
original poster is trying to do. :)

- E -



Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-01 Thread CPT John W. Holmes
From: Eugene Lee <[EMAIL PROTECTED]>

> On Wed, Oct 01, 2003 at 01:12:16AM -0400, Gerard Samuel wrote:
> :
> : Got a problem with htmlspecialchars being too greedy, where
> : for example, it converts
> : &foo;
> : to
> : &amp;foo;
> :
> : Yes it displays correctly in the browser for some content, but not all.
> : (an example is posted below)
> : So I came up with this example code, but not sure if there is an
> : easier/better way to get the correct end result.
> : If there is a better way, feel free to let me know.
> : Thanks
> :
> : Note: I don't read/speak Chinese, so if it's offensive please forgive me.
> :
> : --
> : <?php
> :
> : $foo = '&#20013;&#25991;  http://www.foo.com/index.php?foo=1&bar=2';
>
> The problem isn't with htmlspecialchars().  It doesn't know what parts
> of the string are HTML character references and which parts are not.
> But if you're willing to dig up the numeric character references for
> those specific Chinese characters, then split the string into the part
> that needs no translation and the part that needs it.  That is:
>
> $foo1encoded = '&#20013;&#25991;';
> $foo2raw = '  http://www.foo.com/index.php?foo=1&bar=2';
> $foo = $foo1encoded . htmlspecialchars($foo2raw);

Maybe you should run html_entity_decode() on the string first, then run
encode again. The decode will take &#20013; and turn it into its actual
character but not affect anything else. Then the recoding will turn it back
into &#20013; and also encode any other characters.

---John Holmes...
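
A sketch of the decode-then-encode idea, assuming a current PHP version and UTF-8 throughout (elsewhere in this thread, tests on the PHP of the time found that html_entity_decode() did not decode these references). One difference from the description above: htmlspecialchars() leaves the decoded characters as real characters instead of turning them back into &#20013;.

```php
<?php
$foo = '&#20013;&#25991;  http://www.foo.com/index.php?foo=1&bar=2';

// Step 1: turn character references into real characters (UTF-8 here).
$decoded_foo = html_entity_decode($foo, ENT_QUOTES, 'UTF-8');

// Step 2: escape again. Only &, <, >, " and ' are touched, so the
// Chinese characters stay as characters and the URL's & becomes &amp;.
$reencoded_foo = htmlspecialchars($decoded_foo, ENT_QUOTES, 'UTF-8');

echo $reencoded_foo, "\n";
```

This only displays correctly if the page itself is sent as UTF-8.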




Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-01 Thread Gerard Samuel
CPT John W. Holmes wrote:

> From: Eugene Lee <[EMAIL PROTECTED]>
>
> > On Wed, Oct 01, 2003 at 01:12:16AM -0400, Gerard Samuel wrote:
> > :
> > : Got a problem with htmlspecialchars being too greedy, where
> > : for example, it converts
> > : &foo;
> > : to
> > : &amp;foo;
> > :
> > : Yes it displays correctly in the browser for some content, but not all.
> > : (an example is posted below)
> > : So I came up with this example code, but not sure if there is an
> > : easier/better way to get the correct end result.
> > : If there is a better way, feel free to let me know.
> > : Thanks
> > :
> > : Note: I don't read/speak Chinese, so if it's offensive please forgive me.
> > :
> > : --
> > : <?php
> > :
> > : $foo = '&#20013;&#25991;  http://www.foo.com/index.php?foo=1&bar=2';
> >
> > The problem isn't with htmlspecialchars().  It doesn't know what parts
> > of the string are HTML character references and which parts are not.
> > But if you're willing to dig up the numeric character references for
> > those specific Chinese characters, then split the string into the part
> > that needs no translation and the part that needs it.  That is:
> >
> > $foo1encoded = '&#20013;&#25991;';
> > $foo2raw = '  http://www.foo.com/index.php?foo=1&bar=2';
> > $foo = $foo1encoded . htmlspecialchars($foo2raw);
>
> Maybe you should run html_entity_decode() on the string first, then run
> encode again. The decode will take &#20013; and turn it into its actual
> character but not affect anything else. Then the recoding will turn it back
> into &#20013; and also encode any other characters.

Eugene, your example leads me to believe that one knows beforehand what
characters need special attention,
in order to not run them through htmlspecialchars.  I would never know
what characters need special attention.

John, a good idea, but unfortunately, after some tests, and re-reading
the manual, it seems html_entity_decode()
only recognises html entities.  Not ascii values of language characters.
So I'm going to push ahead with my code, and see if it breaks anything :)



Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-01 Thread Eugene Lee
On Wed, Oct 01, 2003 at 02:02:08PM -0400, Gerard Samuel wrote:
: CPT John W. Holmes wrote:
: > From: Eugene Lee <[EMAIL PROTECTED]>
: > > On Wed, Oct 01, 2003 at 01:12:16AM -0400, Gerard Samuel wrote:
: > > :
: > > : Got a problem with htmlspecialchars being too greedy, where
: > > : for example, it converts
: > > : &foo;
: > > : to
: > > : &amp;foo;
[...]
: > > : $foo = '&#20013;&#25991;  http://www.foo.com/index.php?foo=1&bar=2';
: > >
: > > The problem isn't with htmlspecialchars().  It doesn't know what parts
: > > of the string are HTML character references and which parts are not.
: > > But if you're willing to dig up the numeric character references for
: > > those specific Chinese characters, then split the string into the part
: > > that needs no translation and the part that needs it.  That is:
: > >
: > > $foo1encoded = '&#20013;&#25991;';
: > > $foo2raw = '  http://www.foo.com/index.php?foo=1&bar=2';
: > > $foo = $foo1encoded . htmlspecialchars($foo2raw);
: >
: > Maybe you should run html_entity_decode() on the string first, then run
: > encode again. The decode will take &#20013; and turn it into its actual
: > character but not affect anything else. Then the recoding will turn it back
: > into &#20013; and also encode any other characters.
:
: Eugene, your example leads me to believe that one knows beforehand
: what characters need special attention, in order to not run them
: through htmlspecialchars.  I would never know what characters need
: special attention.

But it seems that you do know what characters need to be converted,
because you included the exact Unicode character references for those
Chinese characters.  You have to know your data.  Or modify your code
with specific assumptions about the data.

For example, let's say I have a string that I got from somewhere
(database, user form, text file, another web site, etc.):

$foo = 'Dick &amp; Jane';

When you eventually display this to someone's web browser, what do you
want them to see?

Dick & Jane
or
Dick &amp; Jane

This really depends on the format of the data inside $foo.  Is '&amp;'
a character reference that you want to leave alone?  Or is it a literal
string that you want to convert to '&amp;amp;' for display?  And the
only person that knows the format of the data is you.  Again, you have
to know your data.
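
The two readings of the same string can be put side by side. This is only an illustration; the variable names $as_html and $as_text are made up for it.

```php
<?php
$foo = 'Dick &amp; Jane';

// Case 1: $foo already holds HTML, so '&amp;' is a character reference.
// Send it as-is; the browser renders "Dick & Jane".
$as_html = $foo;

// Case 2: $foo holds plain text, so someone literally typed "&amp;".
// Escape it; the browser renders "Dick &amp; Jane".
$as_text = htmlspecialchars($foo, ENT_QUOTES, 'UTF-8');

echo $as_html, "\n";
echo $as_text, "\n";
```

Nothing in the string itself says which case applies; that knowledge has to come from wherever the data was produced.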




Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-10-01 Thread Gerard Samuel
Eugene Lee wrote:

> On Wed, Oct 01, 2003 at 02:02:08PM -0400, Gerard Samuel wrote:
> : CPT John W. Holmes wrote:
> : > From: Eugene Lee <[EMAIL PROTECTED]>
> : > > On Wed, Oct 01, 2003 at 01:12:16AM -0400, Gerard Samuel wrote:
> : > > :
> : > > : Got a problem with htmlspecialchars being too greedy, where
> : > > : for example, it converts
> : > > : &foo;
> : > > : to
> : > > : &amp;foo;
> [...]
> : > > : $foo = '&#20013;&#25991;  http://www.foo.com/index.php?foo=1&bar=2';
> : > >
> : > > The problem isn't with htmlspecialchars().  It doesn't know what parts
> : > > of the string are HTML character references and which parts are not.
> : > > But if you're willing to dig up the numeric character references for
> : > > those specific Chinese characters, then split the string into the part
> : > > that needs no translation and the part that needs it.  That is:
> : > >
> : > > $foo1encoded = '&#20013;&#25991;';
> : > > $foo2raw = '  http://www.foo.com/index.php?foo=1&bar=2';
> : > > $foo = $foo1encoded . htmlspecialchars($foo2raw);
> : >
> : > Maybe you should run html_entity_decode() on the string first, then run
> : > encode again. The decode will take &#20013; and turn it into its actual
> : > character but not affect anything else. Then the recoding will turn it back
> : > into &#20013; and also encode any other characters.
> :
> : Eugene, your example leads me to believe that one knows beforehand
> : what characters need special attention, in order to not run them
> : through htmlspecialchars.  I would never know what characters need
> : special attention.
>
> But it seems that you do know what characters need to be converted,
> because you included the exact Unicode character references for those
> Chinese characters.  You have to know your data.  Or modify your code
> with specific assumptions about the data.
>
> For example, let's say I have a string that I got from somewhere
> (database, user form, text file, another web site, etc.):
>
> $foo = 'Dick &amp; Jane';
>
> When you eventually display this to someone's web browser, what do you
> want them to see?
>
> Dick & Jane
> or
> Dick &amp; Jane
>
> This really depends on the format of the data inside $foo.  Is '&amp;'
> a character reference that you want to leave alone?  Or is it a literal
> string that you want to convert to '&amp;amp;' for display?  And the
> only person that knows the format of the data is you.  Again, you have
> to know your data.

When I say that I don't know what characters I'm expecting,
I'm not talking about normal html entities like &amp; &nbsp; &lt;
I'm talking about Chinese/Japanese/Korean/Taiwanese alphabet, numbers
(even punctuation if applicable).
Maybe I'm thinking too hard, but trying to take far east languages'
alphanumeric characters into account
seems like overkill.
Feel free to correct me.



Re: [PHP] Attempt at putting greedy htmlspecialchars on a diet

2003-09-30 Thread Eugene Lee
On Wed, Oct 01, 2003 at 01:12:16AM -0400, Gerard Samuel wrote:
: 
: Got a problem with htmlspecialchars being too greedy, where
: for example, it converts
: &foo;
: to
: &amp;foo;
: 
: Yes it displays correctly in the browser for some content, but not all.  
: (an example is posted below)
: So I came up with this example code, but not sure if there is an 
: easier/better way to get the correct end result.
: If there is a better way, feel free to let me know.
: Thanks
: 
: Note: I don't read/speak Chinese, so if it's offensive please forgive me.
: 
: --
: <?php
: 
: $foo = '&#20013;&#25991;  http://www.foo.com/index.php?foo=1&bar=2';

The problem isn't with htmlspecialchars().  It doesn't know what parts
of the string are HTML character references and which parts are not.
But if you're willing to dig up the numeric character references for
those specific Chinese characters, then split the string into the part
that needs no translation and the part that needs it.  That is:

$foo1encoded = '&#20013;&#25991;';
$foo2raw = '  http://www.foo.com/index.php?foo=1&bar=2';
$foo = $foo1encoded . htmlspecialchars($foo2raw);
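
When the string cannot be split by hand, one possible alternative is to escape everything and then undo the escaping for numeric character references only. This is a sketch, not a definitive fix: it assumes every &#NNN; or &#xHHH; sequence in the input is meant as a reference, and the helper name escape_keep_numeric_refs() is made up for the example. On PHP 5.2.3 and later, passing false as the fourth argument (double_encode) of htmlspecialchars() has a similar effect, though it also leaves named entities such as &amp; alone.

```php
<?php
// Hypothetical helper: escape HTML special characters but keep
// numeric character references (&#20013; or &#x4E2D;) intact.
// Assumption: every such sequence in the input is an intended reference.
function escape_keep_numeric_refs($text)
{
    // Escape everything first; '&#20013;' becomes '&amp;#20013;'.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Then restore only the numeric references.
    return preg_replace('/&amp;(#[0-9]+|#[xX][0-9a-fA-F]+);/', '&$1;', $escaped);
}

$foo = '&#20013;&#25991;  http://www.foo.com/index.php?foo=1&bar=2';
$safe = escape_keep_numeric_refs($foo);

echo $safe, "\n";
```

The ampersand in the URL is escaped because it is not followed by a numeric reference, while the two Chinese character references are left as they were.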
