[PHP] Re: languages and PHP
At 11:09 AM +0100 10/2/07, Colin Guthrie wrote:
> tedd wrote:
>> Isn't UTF-8 the big fish here?
>>
>> Sure there's UTF-16 and larger, but everything else is a subset of UTF-8, is it not?
>>
>> So, what's the problem if you get a character defined by ISO -- it's still within the UTF-8 super-group, right?
>
> Individual characters are sometimes OK, but it's the sequence of characters that could be invalid.
>
> UTF-8 works by using special bits at the MSB end of the byte to say, "I can't represent this character in one byte, I need to use 2 bytes (or 3 bytes)" (and maybe also 4? can't remember off the top of my head).
>
> In a multi-byte sequence the MSB end of all the bytes must follow a pre-defined scheme. If they do not they are syntactically invalid UTF-8.
>
> So it's more than just individual characters, the order of them is important.
>
> Hope that explains it (although probably a bad explanation as I'm very tired right now!).
>
> Col

Ah, I see what you're saying. I've run into that before when studying Unicode.

The mb_ series of functions deal with larger-than-ASCII encodings, but I don't know of any that deal with character sequences/combinations or right/left readings. That's all Greek to me, pardon the pun.

Cheers,

tedd

--
---
http://sperling.com http://ancientstones.com http://earthstones.com

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Re: languages and PHP
Colin Guthrie wrote:
> UTF-8 works by using special bits at the MSB end of the byte to say,
> "I can't represent this character in one byte, I need to use 2 bytes
> (or 3 bytes)" (and maybe also 4? can't remember off the top of my
> head).

Yep, a UTF-8 character is 1 to 4 bytes.

/Per Jessen, Zürich
[PHP] Re: languages and PHP
tedd wrote:
> Isn't UTF-8 the big fish here?
>
> Sure there's UTF-16 and larger, but everything else is a subset of UTF-8,
> is it not?
>
> So, what's the problem if you get a character defined by ISO -- it's
> still within the UTF-8 super-group, right?

Individual characters are sometimes OK, but it's the sequence of characters that could be invalid.

UTF-8 works by using special bits at the MSB end of the byte to say, "I can't represent this character in one byte, I need to use 2 bytes (or 3 bytes)" (and maybe also 4? can't remember off the top of my head).

In a multi-byte sequence the MSB end of all the bytes must follow a pre-defined scheme. If they do not they are syntactically invalid UTF-8.

So it's more than just individual characters, the order of them is important.

Hope that explains it (although probably a bad explanation as I'm very tired right now!).

Col
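The bit scheme Colin describes can be sketched in a few lines of PHP. This is only an illustration, assuming the mbstring extension is loaded; utf8_expected_length() is a hypothetical helper, not a PHP built-in, and it looks only at the high bits of the first byte rather than doing full validation (mb_check_encoding() does the full check).

```php
<?php
// Sketch only: utf8_expected_length() is a hypothetical helper, not a PHP built-in.

/**
 * Return how many bytes a UTF-8 sequence should have, judging only by the
 * high bits of its first byte, or null if the byte cannot start a sequence.
 */
function utf8_expected_length(int $leadByte): ?int
{
    if (($leadByte & 0x80) === 0x00) return 1; // 0xxxxxxx: plain ASCII
    if (($leadByte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($leadByte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($leadByte & 0xF8) === 0xF0) return 4; // 11110xxx
    return null;                               // 10xxxxxx (continuation byte) or 11111xxx
}

$uUmlaut = "\xC3\xBC"; // U+00FC (u with umlaut), encoded as two bytes
$broken  = "\xC3\x28"; // lead byte followed by "(" instead of a continuation byte

var_dump(utf8_expected_length(ord($uUmlaut[0]))); // int(2)
var_dump(mb_check_encoding($uUmlaut, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($broken, 'UTF-8'));    // bool(false)
```

The second string shows the point about sequences: both of its bytes are fine on their own, but the pair is not well-formed UTF-8.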
Re: [PHP] Re: languages and PHP
At 12:36 PM -0500 9/28/07, Edward Vermillion wrote:
> My question was more mental prodding than anything else. The OP had a function to convert incoming text into UTF-8 before they did anything with it. A couple of folks said that was unnecessary, if you set your form to UTF-8 your incoming data will be in UTF-8 already.
>
> I was just trying to make the point that if you expect your incoming data to be in a certain state in your code you should make sure that it is in that state before you act on it, since you can't guarantee its source. Checking to make sure the incoming data is in its expected state is not a waste of time (or unnecessary, or whatever term of derision they picked) but is actually good coding practice.
>
> I pretty much gave up on the thread when I got the reply along the lines of "if it breaks something it's their problem, not mine".
>
> Ed

I still don't see the problem:

If you are receiving in UTF-8 and someone sends you something UTF-8 or less, then you can catch it.

If, on the other hand, you are set up for a lesser charset, then there's no way you can be assured that you can catch what they send.

If given the choice, use the super-group. This is too obvious, I must be missing something.

Cheers,

tedd

--
---
http://sperling.com http://ancientstones.com http://earthstones.com
Re: [PHP] Re: languages and PHP
Edward Vermillion wrote:
> On Sep 28, 2007, at 1:05 PM, Per Jessen wrote:
>>
>> Ed, your question was a good one, but so was my answer. In my case,
>> I don't cater to an open community, but to a closed one. If you're
>> not authenticated, you're not getting anywhere to start with. If you
>> somehow manage to bypass that, and attempt to submit data I don't
>> expect, my priority is the survival of my application, nothing else.
>>
>
> But that was my point. Your way, your app may disintegrate at some
> uncontrolled point.

As long as it is only the app, it's not a real problem. If it affects Apache, it's a different issue. If the app throws a couple of unexpected exceptions or something, no big deal.

> At least if you're checking/validating your input then
> you can take control of the situation and ensure the "survival of your
> application". Otherwise who knows where it will break and what it will
> mean when it does.

I agree, but to check for unwanted character sets and do conversions and what have you is way overkill IMHO.

> And just because the community is closed, don't drop your guard on
> basic security practices. You don't control what comes into your site,
> you can only react to it.

I agree - like I said, authentication is required.

/Per
Re: [PHP] Re: languages and PHP
On Sep 28, 2007, at 1:05 PM, Per Jessen wrote:
> Edward Vermillion wrote:
>> I pretty much gave up on the thread when I got the reply along the lines of "if it breaks something it's their problem, not mine".
>
> Ed, your question was a good one, but so was my answer. In my case, I don't cater to an open community, but to a closed one. If you're not authenticated, you're not getting anywhere to start with. If you somehow manage to bypass that, and attempt to submit data I don't expect, my priority is the survival of my application, nothing else.

But that was my point. Your way, your app may disintegrate at some uncontrolled point. At least if you're checking/validating your input then you can take control of the situation and ensure the "survival of your application". Otherwise who knows where it will break and what it will mean when it does.

And just because the community is closed, don't drop your guard on basic security practices. You don't control what comes into your site, you can only react to it.

Ed
Re: [PHP] Re: languages and PHP
Edward Vermillion wrote:
> I pretty much gave up on the thread when I got the reply along the
> lines of "if it breaks something it's their problem, not mine".

Ed, your question was a good one, but so was my answer. In my case, I don't cater to an open community, but to a closed one. If you're not authenticated, you're not getting anywhere to start with. If you somehow manage to bypass that, and attempt to submit data I don't expect, my priority is the survival of my application, nothing else.

/Per Jessen, Zürich
Re: [PHP] Re: languages and PHP
On Sep 28, 2007, at 11:34 AM, tedd wrote:
> At 2:01 PM -0500 9/27/07, Edward Vermillion wrote:
>> So back to my original question, what breaks if you're *expecting* UTF-8 and you don't *get* UTF-8?
>>
>> Ed
>
> Isn't UTF-8 the big fish here?
>
> Sure there's UTF-16 and larger, but everything else is a subset of UTF-8, is it not?
>
> So, what's the problem if you get a character defined by ISO -- it's still within the UTF-8 super-group, right?
>
> The only problem I see here is IF the user has the char set to display the glyph correctly -- OR am I off on something else that you guys aren't even discussing?

Probably very relevant to the original question, but...

My question was more mental prodding than anything else. The OP had a function to convert incoming text into UTF-8 before they did anything with it. A couple of folks said that was unnecessary, if you set your form to UTF-8 your incoming data will be in UTF-8 already.

I was just trying to make the point that if you expect your incoming data to be in a certain state in your code you should make sure that it is in that state before you act on it, since you can't guarantee its source. Checking to make sure the incoming data is in its expected state is not a waste of time (or unnecessary, or whatever term of derision they picked) but is actually good coding practice.

I pretty much gave up on the thread when I got the reply along the lines of "if it breaks something it's their problem, not mine".

Ed
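Ed's "make sure it is in that state before you act on it" could look roughly like the following. This is a minimal sketch, assuming the mbstring extension is available; find_non_utf8_fields() is a hypothetical helper name, and a literal array stands in for $_POST so the example runs on its own.

```php
<?php
// Sketch only: find_non_utf8_fields() is a hypothetical helper name.

/**
 * Return the keys of all string fields that are not well-formed UTF-8.
 * Non-string values (e.g. nested arrays) are ignored in this sketch.
 */
function find_non_utf8_fields(array $input): array
{
    $bad = [];
    foreach ($input as $key => $value) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[] = $key;
        }
    }
    return $bad;
}

// In a request handler $fields would be $_POST; a literal keeps the sketch runnable.
$fields = [
    'name' => "Z\xC3\xBCrich", // valid UTF-8
    'city' => "Z\xFCrich",     // ISO-8859-1 bytes, not valid UTF-8
];

$bad = find_non_utf8_fields($fields);
if ($bad !== []) {
    // Reject at a point you control instead of failing somewhere later.
    echo 'Rejected fields: ', implode(', ', $bad), PHP_EOL;
}
```

Whether to reject, log, or attempt a conversion at that point is a policy decision the thread leaves open; the sketch only shows where the check would sit.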
Re: [PHP] Re: languages and PHP
At 2:01 PM -0500 9/27/07, Edward Vermillion wrote:
> So back to my original question, what breaks if you're *expecting* UTF-8 and you don't *get* UTF-8?
>
> Ed

Isn't UTF-8 the big fish here?

Sure there's UTF-16 and larger, but everything else is a subset of UTF-8, is it not?

So, what's the problem if you get a character defined by ISO -- it's still within the UTF-8 super-group, right?

The only problem I see here is IF the user has the char set to display the glyph correctly -- OR am I off on something else that you guys aren't even discussing?

Cheers,

tedd

--
---
http://sperling.com http://ancientstones.com http://earthstones.com
Re: [PHP] Re: languages and PHP
Edward Vermillion wrote:
> ... and you can guarantee that any data coming into your site comes
> from your form?!? WOW!!!
>
> ;)
>
> So back to my original question, what breaks if you're *expecting*
> UTF-8 and you don't *get* UTF-8?

As long as my server isn't vulnerable to it, I couldn't care less. I'm certainly not going to be that defensive and write my code to deal with such situations. If someone POSTs or GETs something not in UTF-8 from outside my site, and I can't deal with it, it's their problem, not mine.

And with PHP being the solid platform it is, it wouldn't dream of taking down my server just because I get KOI8-R when I expected UTF-8 :-)

/Per Jessen, Zürich
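For the KOI8-R case specifically, a short sketch shows what the receiving script would actually see. It assumes the mbstring extension is loaded (mbstring lists KOI8-R among its supported encodings); the sample word is written with byte escapes so the result does not depend on the encoding of the source file.

```php
<?php
// The Russian word "Privet" in Cyrillic, as UTF-8 bytes (two bytes per letter).
$utf8 = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

// Simulate a client that sends the same text as KOI8-R (one byte per letter).
$koi8 = mb_convert_encoding($utf8, 'KOI8-R', 'UTF-8');

var_dump(strlen($utf8));                      // int(12)
var_dump(strlen($koi8));                      // int(6)
var_dump(mb_check_encoding($koi8, 'UTF-8'));  // bool(false)

// Nothing crashes, but the text is only recoverable if the real
// source encoding is known.
var_dump(mb_convert_encoding($koi8, 'UTF-8', 'KOI8-R') === $utf8); // bool(true)
```

So the server survives, as Per says; the cost is that the bytes are not valid UTF-8 and will be stored or displayed as garbage unless something detects and handles them.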
Re: [PHP] Re: languages and PHP
On Sep 27, 2007, at 1:49 PM, Per Jessen wrote:
> Edward Vermillion wrote:
>> But what happens if you get data that's *not* UTF-8? Just because your html/form is set to UTF-8 doesn't mean that all your incoming data will be UTF-8.
>
> Yes it does. If your HTML page was sent in UTF-8, any request originating from that page will also be in UTF-8.

... and you can guarantee that any data coming into your site comes from your form?!? WOW!!!

;)

So back to my original question, what breaks if you're *expecting* UTF-8 and you don't *get* UTF-8?

Ed
Re: [PHP] Re: languages and PHP
On 9/27/07, Edward Vermillion <[EMAIL PROTECTED]> wrote:
> But what happens if you get data that's *not* UTF-8? Just because
> your html/form is set to UTF-8 doesn't mean that all your incoming
> data will be UTF-8.

just my experience, but as long as it has the meta tag w/ utf-8 in it, the browser sends (and receives) utf-8. i can store the strings in mysql without modification or character set conversion, it works like a charm.

the only thing that might need help then is doing string modifications like urlencoding, or replacement on the utf-8 characters... i haven't had to do that yet, but otherwise the end-to-end utf-8 solution has worked like a charm for me.

but yes, it does require browsers+utf8, running it out of that context may or may not work depending on what you're trying to do with the data..
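On the "string modifications" caveat: PHP's plain string functions count bytes, while the mb_ functions count characters, which is where end-to-end UTF-8 can go wrong. A small sketch, assuming the mbstring extension is loaded:

```php
<?php
$city = "Z\xC3\xBCrich"; // "Zurich" with u-umlaut, as UTF-8 bytes

var_dump(strlen($city));              // int(7)  bytes
var_dump(mb_strlen($city, 'UTF-8'));  // int(6)  characters

// Byte-based substr() can cut a multi-byte character in half.
$byteCut = substr($city, 0, 2);            // "Z" plus a dangling lead byte
$charCut = mb_substr($city, 0, 2, 'UTF-8'); // "Z" plus the whole u-umlaut

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)

// URL encoding works byte by byte, so each UTF-8 byte gets its own %XX.
var_dump(rawurlencode($city)); // string(11) "Z%C3%BCrich"
```

URL encoding is safe here because it never splits or reinterprets bytes; truncation and replacement are the operations that need the mb_ variants.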
Re: [PHP] Re: languages and PHP
Edward Vermillion wrote:
> But what happens if you get data that's *not* UTF-8? Just because
> your html/form is set to UTF-8 doesn't mean that all your incoming
> data will be UTF-8.

Yes it does. If your HTML page was sent in UTF-8, any request originating from that page will also be in UTF-8.

/Per Jessen, Zürich
Re: [PHP] Re: languages and PHP
Colin Guthrie wrote:
> Per Jessen wrote:
>> I work almost exclusively in UTF-8 (language irrelevant), but I've
>> never had to do any of the above. The mb_convert_encoding()
>> from UTF-8 to UTF-8 doesn't seem to make much sense?
>
> I agree. Provided your HTML is dished out with UTF-8 in the doctype
> definition etc. then all forms are automatically sent in UTF-8

Quite so.

> Also for multilingual content (labels etc.) in PHP have a look at the
> gettext extension. This will let you code in your default language
> with a minimal wrapper and translate to other languages with catalog
> files.

I haven't done much with gettext yet - I use Apache language content negotiation, separate source files per language, and try hard to keep my PHP code language neutral.

/Per Jessen, Zürich
Re: [PHP] Re: languages and PHP
On Sep 27, 2007, at 10:09 AM, Colin Guthrie wrote:
> Per Jessen wrote:
>> David Christopher Zentgraf wrote:
>>> Your biggest problem will be if you accept any kind of user input which could be in any kind of language.
>>> Depending on your server configuration you'll probably have some serious cleaning and filtering to do.
>>> I often have to employ this line for example:
>>> foreach (array_keys($_POST) as $key) $clean[$key] = mb_convert_encoding($_POST[$key], "UTF-8");
>>>
>>> Trying to make sure that you'll receive UTF-8 helps as well:
>>
>> I work almost exclusively in UTF-8 (language irrelevant), but I've never had to do any of the above. The mb_convert_encoding() from UTF-8 to UTF-8 doesn't seem to make much sense?
>
> I agree. Provided your HTML is dished out with UTF-8 in the doctype definition etc. then all forms are automatically sent in UTF-8

But what happens if you get data that's *not* UTF-8? Just because your html/form is set to UTF-8 doesn't mean that all your incoming data will be UTF-8.

Ed
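One way to reconcile the two positions is to convert only when the input is not already well-formed UTF-8, and to name the source encoding explicitly; the quoted one-liner omits the third argument, so mb_convert_encoding() falls back to the internal encoding as the source. The sketch below assumes the mbstring extension is loaded. to_utf8() is a hypothetical helper, and ISO-8859-1 as the fallback is purely an assumption for illustration, since the true source encoding cannot be known from the bytes alone.

```php
<?php
// Sketch only: to_utf8() is a hypothetical helper; the ISO-8859-1 fallback is an assumption.

/**
 * Return $value unchanged if it is already well-formed UTF-8,
 * otherwise convert it from the assumed fallback encoding.
 */
function to_utf8(string $value, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', $fallback);
}

// In a request handler $input would be $_POST; a literal keeps the sketch runnable.
$input = [
    'already_utf8' => "Z\xC3\xBCrich",
    'latin1'       => "Z\xFCrich",
];

$clean = [];
foreach ($input as $key => $value) {
    $clean[$key] = is_string($value) ? to_utf8($value) : $value;
}

var_dump($clean['already_utf8'] === $clean['latin1']); // bool(true)
```

Valid UTF-8 passes through untouched, so the "UTF-8 to UTF-8" conversion Per objects to never happens, while stray non-UTF-8 input is still handled at a known point.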
[PHP] Re: languages and PHP
Per Jessen wrote:
> David Christopher Zentgraf wrote:
>
>> Your biggest problem will be if you accept any kind of user input
>> which could be in any kind of language.
>> Depending on your server configuration you'll probably have some
>> serious cleaning and filtering to do.
>> I often have to employ this line for example:
>> foreach (array_keys($_POST) as $key) $clean[$key] =
>> mb_convert_encoding($_POST[$key], "UTF-8");
>>
>> Trying to make sure that you'll receive UTF-8 helps as well:
>>
> accept-charset="utf-8">
>
> I work almost exclusively in UTF-8 (language irrelevant), but I've never
> had to do any of the above. The mb_convert_encoding() from UTF-8 to
> UTF-8 doesn't seem to make much sense?

I agree. Provided your HTML is dished out with UTF-8 in the doctype definition etc. then all forms are automatically sent in UTF-8

Also for multilingual content (labels etc.) in PHP have a look at the gettext extension. This will let you code in your default language with a minimal wrapper and translate to other languages with catalog files.

Be careful about breaking strings tho', as the xgettext crawler will not be able to extract strings properly. Also do not use variable substitution but rather use printf/sprintf, e.g.:

$var = sprintf(_('Here is %d example of a translatable string.'), 1);

That way the string you translate is nice and standard (here _() is the name of the gettext function, which is quite common).

You may leave this up to your templating engine (e.g. no labels in code), so it may not matter.

Also if you are writing modular code, ensure you think about the gettext domains first and you'll probably want to pass the domain in with *every* string in your system to ensure its modular design (e.g. each module has its own gettext domain).

If you want a good example of modular gettext usage, I'd recommend looking at the source of gallery2.

HTH.

Col
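Passing the domain in with every string could be done with a thin wrapper around dgettext(). This is a rough sketch, not how gallery2 does it: t() and the domain name 'example_module' are hypothetical, and the function_exists() guard is only there so the example still runs where the gettext extension is not installed.

```php
<?php
// Sketch only: t() and the 'example_module' domain are hypothetical names.

/**
 * Translate $message in the given gettext domain, then fill in any
 * printf-style placeholders. Falls back to the untranslated message
 * when the gettext extension is not available.
 */
function t(string $domain, string $message, ...$args): string
{
    $translated = function_exists('dgettext')
        ? dgettext($domain, $message)
        : $message;

    return $args === [] ? $translated : vsprintf($translated, $args);
}

// With no catalog bound for the domain, gettext returns the message unchanged.
echo t('example_module', 'Here is %d example of a translatable string.', 1), PHP_EOL;
```

The message stays a single unbroken literal with a placeholder, as recommended above, so a translator sees the whole sentence; in a real module the catalog directory would first be registered with bindtextdomain().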