[PHP] Re: languages and PHP

2007-10-02 Thread tedd

At 11:09 AM +0100 10/2/07, Colin Guthrie wrote:

> tedd wrote:
>> Isn't UTF-8 the big fish here?
>>
>> Sure there's UTF-16 and larger, but everything else is a subset of UTF-8,
>> is it not?
>>
>> So, what's the problem if you get a character defined by ISO -- it's
>> still within the UTF-8 super-group, right?
>
> Individual characters are sometimes OK, but it's the sequence of
> characters that could be invalid.
>
> UTF-8 works by using special bits at the MSB end of the byte to say, "I
> can't represent this character in one byte, I need to use 2 bytes (or 3
> bytes)" (and maybe also 4? can't remember off the top of my head).
>
> In a multi-byte sequence the MSB end of all the bytes must follow a
> pre-defined scheme. If they do not they are syntactically invalid UTF-8.
>
> So it's more than just individual characters, the order of them is
> important.
>
> Hope that explains it (although probably a bad explanation as I'm very
> tired right now!).
>
> Col



Ah, I see what you're saying. I've run into that before when studying 
Unicode. The mb_ series of functions deal with larger than ASCII 
coding, but I don't know of any that deals with character 
sequence/combinations or right/left readings. That's all Greek to me, 
pardon the pun.


Cheers,

tedd

--
---
http://sperling.com  http://ancientstones.com  http://earthstones.com

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Re: languages and PHP

2007-10-02 Thread Per Jessen
Colin Guthrie wrote:

> UTF-8 works by using special bits at the MSB end of the byte to say,
> "I can't represent this character in one byte, I need to use 2 bytes
> (or 3 bytes)" (and maybe also 4? can't remember off the top of my
> head).

Yep, a UTF-8 character is 1 to 4 bytes.
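
The 1-to-4-byte range is easy to see from PHP, since strlen() counts bytes
rather than characters. A minimal sketch; the sample characters and the
$samples / $byteLengths names are only illustrative:

```php
<?php
// Byte strings are written with hex escapes so the example does not depend
// on the encoding of this source file.
$samples = [
    'U+0041 A'       => "\x41",             // ASCII, 1 byte
    'U+00E9 e-acute' => "\xC3\xA9",         // 2 bytes
    'U+20AC euro'    => "\xE2\x82\xAC",     // 3 bytes
    'U+1F600 emoji'  => "\xF0\x9F\x98\x80", // 4 bytes
];

$byteLengths = [];
foreach ($samples as $label => $bytes) {
    // strlen() returns the number of bytes in the string.
    $byteLengths[$label] = strlen($bytes);
    printf("%-15s %d byte(s)\n", $label, $byteLengths[$label]);
}
```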


/Per Jessen, Zürich




[PHP] Re: languages and PHP

2007-10-02 Thread Colin Guthrie
tedd wrote:
> Isn't UTF-8 the big fish here?
> 
> Sure there's UTF-16 and larger, but everything else is a subset of UTF-8,
> is it not?
> 
> So, what's the problem if you get a character defined by ISO -- it's
> still within the UTF-8 super-group, right?

Individual characters are sometimes OK, but it's the sequence of
characters that could be invalid.

UTF-8 works by using special bits at the MSB end of the byte to say, "I
can't represent this character in one byte, I need to use 2 bytes (or 3
bytes)" (and maybe also 4? can't remember off the top of my head).

In a multi-byte sequence the MSB end of all the bytes must follow a
pre-defined scheme. If they do not they are syntactically invalid UTF-8.

So it's more than just individual characters, the order of them is
important.
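
A short sketch of that point, assuming the mbstring extension is loaded:
the same "é" is a single byte in ISO-8859-1 and two bytes in UTF-8, and the
single-byte form is not well-formed UTF-8. mb_check_encoding() is the real
mbstring function; the variable names are just for illustration.

```php
<?php
// "café" encoded two ways, written as hex escapes.
$latin1 = "caf\xE9";     // ISO-8859-1: é is the single byte 0xE9
$utf8   = "caf\xC3\xA9"; // UTF-8: é is the two bytes 0xC3 0xA9

// 0xE9 is 1110 1001 in binary, which announces a 3-byte sequence, but no
// continuation bytes (10xxxxxx) follow, so the string is invalid as UTF-8.
$latin1Valid = mb_check_encoding($latin1, 'UTF-8');

// The lead byte 0xC3 (110xxxxx) is followed by one continuation byte.
$utf8Valid = mb_check_encoding($utf8, 'UTF-8');

// A continuation byte with no lead byte in front of it is also invalid.
$strayValid = mb_check_encoding("\x80", 'UTF-8');

// Plain ASCII is valid UTF-8 as it stands.
$asciiValid = mb_check_encoding("cafe", 'UTF-8');

var_dump($latin1Valid, $utf8Valid, $strayValid, $asciiValid);
```

So ISO-8859-1 text is only "within UTF-8" as far as its ASCII range goes;
anything above 0x7F has to be converted, not just relabelled.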

Hope that explains it (although probably a bad explanation as I'm very
tired right now!).

Col




Re: [PHP] Re: languages and PHP

2007-09-29 Thread tedd

At 12:36 PM -0500 9/28/07, Edward Vermillion wrote:
> My question was more mental prodding than anything else. The OP had
> a function to convert incoming text into UTF-8 before they did
> anything with it. A couple of folks said that was unnecessary, if
> you set your form to UTF-8 your incoming data will be in UTF-8
> already.
>
> I was just trying to make the point that if you expect your incoming
> data to be in a certain state in your code you should make sure that
> it is in that state before you act on it, since you can't guarantee
> its source. Checking to make sure the incoming data is in its
> expected state is not a waste of time (or unnecessary, or whatever
> term of derision they picked) but is actually good coding practice.
>
> I pretty much gave up on the thread when I got the reply along the
> lines of "if it breaks something it's their problem, not mine".
>
> Ed


I still don't see the problem: If you are receiving in UTF-8 and
someone sends you something UTF-8 or less, then you can catch it. If,
on the other hand, you are set up for a lesser charset, then there's
no way you can be assured that you can catch what they send.


If given the choice, use the super-group.

This is too obvious, I must be missing something.

Cheers,

tedd




Re: [PHP] Re: languages and PHP

2007-09-28 Thread Per Jessen
Edward Vermillion wrote:
 
> On Sep 28, 2007, at 1:05 PM, Per Jessen wrote:
>>
>> Ed, your question was a good one, but so was my answer.  In my case,
>> I don't cater to an open community, but to a closed one.  If you're
>> not authenticated, you're not getting anywhere to start with.  If you
>> somehow manage to bypass that, and attempt to submit data I don't
>> expect, my priority is the survival of my application, nothing else.
>>
> 
> But that was my point. Your way, your app may disintegrate at some
> uncontrolled point. 

As long as it is only the app, it's not a real problem. If it affects
apache, it's a different issue.  If the app throws a couple of
unexpected exceptions or something, no big deal. 

> At least if you're checking/validating your input then
> you can take control of the situation and ensure the "survival of your
> application". Otherwise who knows where it will break and what it will
> mean when it does.

I agree, but to check for unwanted character sets and do conversions and
what have you is way overkill IMHO.

> And just because the community is closed, don't drop your guard on
> basic security practices. You don't control what comes into your site,
> you can only react to it.

I agree - like I said, authentication is required.


/Per




Re: [PHP] Re: languages and PHP

2007-09-28 Thread Edward Vermillion


On Sep 28, 2007, at 1:05 PM, Per Jessen wrote:


> Edward Vermillion wrote:
>
>> I pretty much gave up on the thread when I got the reply along the
>> lines of "if it breaks something it's their problem, not mine".
>
> Ed, your question was a good one, but so was my answer.  In my case, I
> don't cater to an open community, but to a closed one.  If you're not
> authenticated, you're not getting anywhere to start with.  If you
> somehow manage to bypass that, and attempt to submit data I don't
> expect, my priority is the survival of my application, nothing else.




But that was my point. Your way, your app may disintegrate at some
uncontrolled point. At least if you're checking/validating your input
then you can take control of the situation and ensure the "survival
of your application". Otherwise who knows where it will break and
what it will mean when it does.


And just because the community is closed, don't drop your guard on  
basic security practices. You don't control what comes into your  
site, you can only react to it.


Ed




Re: [PHP] Re: languages and PHP

2007-09-28 Thread Per Jessen
Edward Vermillion wrote:

> I pretty much gave up on the thread when I got the reply along the
> lines of "if it breaks something it's their problem, not mine".

Ed, your question was a good one, but so was my answer.  In my case, I
don't cater to an open community, but to a closed one.  If you're not
authenticated, you're not getting anywhere to start with.  If you
somehow manage to bypass that, and attempt to submit data I don't
expect, my priority is the survival of my application, nothing else. 


/Per Jessen, Zürich




Re: [PHP] Re: languages and PHP

2007-09-28 Thread Edward Vermillion


On Sep 28, 2007, at 11:34 AM, tedd wrote:


> At 2:01 PM -0500 9/27/07, Edward Vermillion wrote:
>> So back to my original question, what breaks if you're *expecting*
>> UTF-8 and you don't *get* UTF-8?
>>
>> Ed
>
> Isn't UTF-8 the big fish here?
>
> Sure there's UTF-16 and larger, but everything else is a subset of
> UTF-8, is it not?
>
> So, what's the problem if you get a character defined by ISO --
> it's still within the UTF-8 super-group, right?
>
> The only problem I see here is IF the user has the char set to
> display the glyph correctly -- OR am I off on something else that
> you guys aren't even discussing?




Probably very relevant to the original question, but...

My question was more mental prodding than anything else. The OP had a  
function to convert incoming text into UTF-8 before they did anything  
with it. A couple of folks said that was unnecessary, if you set your  
form to UTF-8 your incoming data will be in UTF-8 already.


I was just trying to make the point that if you expect your incoming
data to be in a certain state in your code you should make sure that
it is in that state before you act on it, since you can't guarantee
its source. Checking to make sure the incoming data is in its
expected state is not a waste of time (or unnecessary, or whatever
term of derision they picked) but is actually good coding practice.
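
As a rough sketch of that kind of check (not anyone's actual code from this
thread): walk the incoming fields and collect the ones that are not
well-formed UTF-8, so the decision to reject, log or convert happens in one
controlled place. invalidUtf8Fields() and the sample $post array are
hypothetical; mb_check_encoding() needs the mbstring extension, and the type
declarations assume PHP 7 or later.

```php
<?php
/**
 * Hypothetical helper: returns the names of fields whose values are not
 * well-formed UTF-8. Nested arrays (e.g. name="tags[]") are walked too.
 */
function invalidUtf8Fields(array $input, string $prefix = ''): array
{
    $bad = [];
    foreach ($input as $key => $value) {
        $name = $prefix === '' ? (string) $key : $prefix . '[' . $key . ']';
        if (is_array($value)) {
            $bad = array_merge($bad, invalidUtf8Fields($value, $name));
        } elseif (!mb_check_encoding((string) $value, 'UTF-8')) {
            $bad[] = $name;
        }
    }
    return $bad;
}

// Simulated request data; in a real script this would be $_POST.
$post = [
    'name'    => "Z\xC3\xBCrich", // valid UTF-8
    'comment' => "caf\xE9",       // ISO-8859-1 bytes, invalid as UTF-8
    'tags'    => ['ok', "\xFF"],  // 0xFF never occurs in UTF-8
];

$bad = invalidUtf8Fields($post);
if ($bad !== []) {
    // One controlled place to decide what happens next.
    echo 'Rejecting request, invalid UTF-8 in: ' . implode(', ', $bad) . "\n";
}
```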


I pretty much gave up on the thread when I got the reply along the  
lines of "if it breaks something it's their problem, not mine".


Ed




Re: [PHP] Re: languages and PHP

2007-09-28 Thread tedd

At 2:01 PM -0500 9/27/07, Edward Vermillion wrote:
> So back to my original question, what breaks if you're *expecting*
> UTF-8 and you don't *get* UTF-8?
>
> Ed


Isn't UTF-8 the big fish here?

Sure there's UTF-16 and larger, but everything else is a subset of
UTF-8, is it not?


So, what's the problem if you get a character defined by ISO -- it's 
still within the UTF-8 super-group, right?


The only problem I see here is IF the user has the char set to 
display the glyph correctly -- OR am I off on something else that you 
guys aren't even discussing?


Cheers,

tedd





Re: [PHP] Re: languages and PHP

2007-09-27 Thread Per Jessen
Edward Vermillion wrote:

> ... and you can guarantee that any data coming into your site comes
> from your form?!? WOW!!!
> 
> ;)
> 
> So back to my original question, what breaks if you're *expecting*
> UTF-8 and you don't *get* UTF-8?

As long as my server isn't vulnerable to it, I couldn't care less. I'm
certainly not going to be that defensive and write my code to deal with
such situations.  If someone POSTs or GETs something not in UTF-8 from
outside my site, and I can't deal with it, it's their problem, not
mine. 

And with PHP being the solid platform it is, it wouldn't dream of taking
down my server just because I get KOI8-R when I expected UTF-8 :-)


/Per Jessen, Zürich




Re: [PHP] Re: languages and PHP

2007-09-27 Thread Edward Vermillion


On Sep 27, 2007, at 1:49 PM, Per Jessen wrote:


> Edward Vermillion wrote:
>
>> But what happens if you get data that's *not* UTF-8? Just because
>> your html/form is set to UTF-8 doesn't mean that all your incoming
>> data will be UTF-8.
>
> Yes it does. If your HTML page was sent in UTF-8, any request
> originating from that page will also be in UTF8.




... and you can guarantee that any data coming into your site comes  
from your form?!? WOW!!!


;)

So back to my original question, what breaks if you're *expecting*  
UTF-8 and you don't *get* UTF-8?


Ed




Re: [PHP] Re: languages and PHP

2007-09-27 Thread mike
On 9/27/07, Edward Vermillion <[EMAIL PROTECTED]> wrote:

> But what happens if you get data that's *not* UTF-8? Just because
> your html/form is set to UTF-8 doesn't mean that all your incoming
> data will be UTF-8.

just my experience, but as long as it has the meta tag w/ utf-8 in it,
the browser sends (and receives) utf-8. i can store the strings in
mysql without modification or character set conversion, it works like
a charm.

the only thing that might need help then is doing string modifications
like urlencoding, or replacement on the utf-8 characters... i haven't
had to do that yet, but otherwise the end-to-end utf-8 solution has
worked like a charm for me.

but yes, it does require browsers+utf8, running it out of that context
may or may not work depending on what you're trying to do with the
data..




Re: [PHP] Re: languages and PHP

2007-09-27 Thread Per Jessen
Edward Vermillion wrote:

> But what happens if you get data that's *not* UTF-8? Just because
> your html/form is set to UTF-8 doesn't mean that all your incoming
> data will be UTF-8.

Yes it does. If your HTML page was sent in UTF-8, any request
originating from that page will also be in UTF8. 


/Per Jessen, Zürich




Re: [PHP] Re: languages and PHP

2007-09-27 Thread Per Jessen
Colin Guthrie wrote:

> Per Jessen wrote:
>> I work almost exclusively in UTF-8 (language irrelevant), but I've
>> never had to do any of the above.  The mb_convert_encoding()
> >> from UTF-8 to UTF-8 doesn't seem to make much sense?
> 
> I agree. Provided your HTML is dished out with UTF-8 in the doctype
> definition etc. then all forms are automatically sent in UTF-8

Quite so.

> Also for multilingual content (labels etc.) in PHP have a look at the
> gettext extension. This will let you code in your default language
> with a minimal wrapper and translate to other languages with catalog
> files.

I haven't done much with gettext yet - I use apache language content
negotiation, separate source-files per language, and try hard to keep
my PHP code language neutral.


/Per Jessen, Zürich




Re: [PHP] Re: languages and PHP

2007-09-27 Thread Edward Vermillion


On Sep 27, 2007, at 10:09 AM, Colin Guthrie wrote:


> Per Jessen wrote:
>> David Christopher Zentgraf wrote:
>>
>>> Your biggest problem will be if you accept any kind of user input
>>> which could be in any kind of language.
>>> Depending on your server configuration you'll probably have some
>>> serious cleaning and filtering to do.
>>> I often have to employ this line for example:
>>> foreach (array_keys($_POST) as $key) $clean[$key] =
>>> mb_convert_encoding($_POST[$key], "UTF-8");
>>>
>>> Trying to make sure that you'll receive UTF-8 helps as well:
>>
>> I work almost exclusively in UTF-8 (language irrelevant), but I've never
>> had to do any of the above.  The mb_convert_encoding() from UTF-8 to
>> UTF-8 doesn't seem to make much sense?
>
> I agree. Provided your HTML is dished out with UTF-8 in the doctype
> definition etc. then all forms are automatically sent in UTF-8



But what happens if you get data that's *not* UTF-8? Just because  
your html/form is set to UTF-8 doesn't mean that all your incoming  
data will be UTF-8.


Ed




[PHP] Re: languages and PHP

2007-09-27 Thread Colin Guthrie
Per Jessen wrote:
> David Christopher Zentgraf wrote:
> 
>> Your biggest problem will be if you accept any kind of user input
>> which could be in any kind of language.
>> Depending on your server configuration you'll probably have some
>> serious cleaning and filtering to do.
>> I often have to employ this line for example:
>> foreach (array_keys($_POST) as $key) $clean[$key] =
>> mb_convert_encoding($_POST[$key], "UTF-8");
>>
>> Trying to make sure that you'll receive UTF-8 helps as well:
>> <form ... accept-charset="utf-8">
> 
> I work almost exclusively in UTF-8 (language irrelevant), but I've never
> had to do any of the above.  The mb_convert_encoding() from UTF-8 to
> UTF-8 doesn't seem to make much sense? 

I agree. Provided your HTML is dished out with UTF-8 in the doctype
definition etc. then all forms are automatically sent in UTF-8

Also for multilingual content (labels etc.) in PHP have a look at the
gettext extension. This will let you code in your default language with
a minimal wrapper and translate to other languages with catalog files.

Be careful about breaking strings tho' as the xgettext crawler will
not be able to extract strings properly. Also do not use variable
substitution but rather use printf/sprintf e.g.:
  $var = sprintf(_('Here is %d example of a translatable string.'), 1);

That way the string you translate is nice and standard (here _() is the
name of the gettext function, which is quite common).
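
A slightly fuller sketch of the same idea. The t() wrapper is hypothetical
and only exists so the example still runs without the gettext extension;
with the extension loaded it calls the real gettext(). Note that xgettext
would need to be told about a custom wrapper name (its --keyword option),
whereas _() and gettext() are picked up by default.

```php
<?php
// Hypothetical wrapper: uses gettext() when available, otherwise returns
// the message unchanged.
function t(string $message): string
{
    return function_exists('gettext') ? gettext($message) : $message;
}

$count = 3;
$user  = 'Col';

// Hard to translate: the values are baked into the string, so there is no
// constant message for xgettext to extract or for a catalog to match.
// $msg = t("There are $count examples for $user.");

// Translatable: one constant message, values filled in afterwards.
// Numbered placeholders (%1$d, %2$s) let a translator reorder them.
$msg = sprintf(t('There are %1$d examples for %2$s.'), $count, $user);
echo $msg, "\n";
```

With no catalog bound, the message comes back untranslated and only the
sprintf() substitution is visible.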

You may leave this up to your templating engine (e.g. no labels in
code), so it may not matter.

Also if you are writing modular code, ensure you think about the gettext
domains first and you'll probably want to pass the domain in with
*every* string in your system to ensure its modular design (e.g. each
module has its own gettext domain). If you want a good example of
modular gettext usage, I'd recommend looking at the source of gallery2.
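
One way the per-module domain idea could look, as a sketch only: this is
not how gallery2 does it, and the Module / GalleryModule classes and the
'modules_gallery' domain name are made up for illustration. dgettext() is
the real gettext function that takes the domain with every lookup.

```php
<?php
/**
 * Hypothetical base class: each module names its own gettext domain and
 * every lookup goes through dgettext() with that domain.
 */
abstract class Module
{
    abstract protected function domain(): string;

    protected function t(string $message): string
    {
        // Falls back to the untranslated string without the gettext extension.
        return function_exists('dgettext')
            ? dgettext($this->domain(), $message)
            : $message;
    }
}

final class GalleryModule extends Module
{
    protected function domain(): string
    {
        return 'modules_gallery'; // assumed domain name
    }

    public function title(int $count): string
    {
        return sprintf($this->t('%d photos in this album'), $count);
    }
}

$title = (new GalleryModule())->title(12);
echo $title, "\n";
```

In a real setup each domain would also be pointed at its catalog directory
with bindtextdomain() before the first lookup.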

HTH.

Col
