Re: [PHP-DEV] Unicode support
Hello,

I think that Rowan is right: PHP users need to manipulate grapheme clusters first (and code points in some rare situations). The fact that most of us live in a world where NFC composes all our characters only hides this reality.

A typical use case is a template engine: nearly all string manipulations there need grapheme awareness: cutting strings to get an excerpt, inserting a space between every character, changing the case, etc. That is a typical use case for a PHP app.

Another use case is if you want to implement text indexing in PHP: you need to normalize before indexing, handle case folding, and thus think in terms of graphemes. I'm not sure this is frequent in PHP though.

As already said, alongside grapheme clusters, we should also deal with string matching: collations are out of scope, but normalization and case folding are in. Please do not forget the Turkish alphabet also: https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/class/Patchwork/TurkishUtf8.php This is required IMHO to have what users expect for str_replace, strpos, strcmp, etc.

I wrote a quite successful PHP lib to deal with this in PHP: https://github.com/nicolas-grekas/Patchwork-UTF8

My experience from this is the following:

- dealing with grapheme clusters in current PHP is OK with the grapheme_*() functions, but these require intl. It would be great to have them (or an equivalent) in core,
- NFC normalization of all input is required to deal with string comparisons, so having Normalizer in core looks required also,
- almost everybody uses mbstring when dealing with UTF-8 strings, but almost all cases should use a grapheme_*() function instead.

>> To be clear, I am suggesting that we aim to be the language which gets this right, where other languages get it wrong.
> Thank you for explaining this. I also think it could do better. I think Unicode-aware strrev() shouldn't be too complicated to do.

Perl 6 identified the subject very well and invented what they call NFG, which is NFC + dynamic internal code points for non-composable grapheme clusters: http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html Maybe worth looking at? Cheers, Nicolas
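
The matching requirement described in the message above (normalize, then case-fold, before comparing) can be sketched as follows. This is a minimal illustration in Python rather than PHP, and it is not Patchwork-UTF8's code; the helper names match_key and same_text are hypothetical. Python's str.casefold() is locale-independent, so it does not apply the Turkish dotted/dotless i rules mentioned above.

```python
import unicodedata


def match_key(text):
    """Build a key for caseless, normalization-insensitive comparison.

    NFC-normalize, case-fold, then NFC-normalize again, because case
    folding can itself produce a sequence that is no longer normalized.
    """
    folded = unicodedata.normalize("NFC", text).casefold()
    return unicodedata.normalize("NFC", folded)


def same_text(a, b):
    """Compare two strings by their match keys."""
    return match_key(a) == match_key(b)


precomposed = "No\u00ebl"   # "ë" as the single code point U+00EB
decomposed = "NOE\u0308L"   # "E" followed by U+0308 COMBINING DIAERESIS

print(precomposed == decomposed)            # plain code point comparison
print(same_text(precomposed, decomposed))   # normalized, case-folded comparison
```

Collation (ordering) is deliberately left out, in line with the "out of scope" remark above; the sketch only answers "are these the same text?".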
Re: [PHP-DEV] Unicode support
> Good point. That's what i meant by border-line case. Could you possibly point me to a specific example of such false positive? I'm interested in well-formed UTF-8 string. I believe noël test is ill-formed UTF-8 and doesn't conform to shortest-form requirement.

You're confusing two concepts here: well-formed UTF-8 represents any single code point with the smallest number of bytes, but it makes no requirements about what code points are represented. Representing ë as two code points is perfectly valid Unicode, and would in fact be required under NFD.

That most input sources would prefer the combined form seems like a weak assumption to base a library on; it only takes one popular third-party to routinely return data in NFD for the problems to start showing up.

>> It's pretty meaningless to say you support Unicode, but only the easy bits. You might as well just tag each string with one of the pages of ISO-8859.
> As far as i'm concerned Unicode specification does not require to implement all annexes or even support entire character set to be conformant. I think there are always trade-offs involved, depending on what is more important for you.

Sure, but there are certain user expectations of what Unicode support means. Handling Korean characters in a meaningful way would definitely be on that list.

As I said at the top of my first post, the important thing is to capture what those requirements actually are. Just as you'd choose what array functions were needed if you were adding array support to a language.

To put it a different way, in what situation would you actively want to know the number of code points in a string, rather than either the number of bytes in its UTF8 representation, or the number of graphemes?
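
The distinction drawn above between well-formed UTF-8 and normalization form can be checked with a short sketch. It is written in Python for illustration and relies on Python's strict UTF-8 decoder as the well-formedness test; the helper name is_well_formed_utf8 is hypothetical.

```python
def is_well_formed_utf8(data):
    """Return True if the bytes decode under a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# "noël" with the diaeresis as a separate code point (U+0308), as in NFD.
decomposed = "noe\u0308l"
encoded = decomposed.encode("utf-8")

# Each code point is encoded in its shortest form, so this is well-formed
# even though the text is not in NFC.
print(is_well_formed_utf8(encoded))

# A non-shortest ("overlong") two-byte encoding of U+002F "/".
overlong_slash = b"\xc0\xaf"
print(is_well_formed_utf8(overlong_slash))

print(len(decomposed), len(encoded))  # code points, then bytes
```

Shortest form is a property of how each individual code point is serialized; whether the sequence of code points is composed or decomposed is a separate question.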
Re: [PHP-DEV] Unicode support
On 15/10/14 10:04, Rowan Collins wrote:

Rowan,

> As I said at the top of my first post, the important thing is to capture what those requirements actually are. Just as you'd choose what array functions were needed if you were adding array support to a language.

I'm sorry for not making myself clear. What I'm essentially saying is that I think the noël test is synthetic and impractical; it's also solvable by requiring NFC strings at input, and this is not an implementation defect. I also believe that Hangul is most likely to be precomposed and will work all right. And I have another opinion on UTF-8 shortest form. This is my personal opinion, of course.

That aside, I think requirements are what I was asking about. I'm assuming that your standpoint is that string modification routines are at least required to take into account entire characters, not only code points. Am I correct?

What is confusing me is that I think you're seeing it as a major implementation defect. To avoid arguable implementations, I've made a short example in Java:

    System.out.println(new StringBuffer("noël").reverse().toString());

It does produce the string l̈eon, as I would expect. Precomposed noël also works as I would expect, producing the string lëon.

What do you think, is this an implementation issue or solely a requirements issue?

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
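
The behaviour reported for the Java one-liner above can be reproduced with a code point reversal in any language. A minimal sketch in Python (chosen for illustration; the function name reverse_code_points is hypothetical), with escape sequences making the two spellings of "noël" explicit:

```python
def reverse_code_points(text):
    """Reverse a string one code point at a time."""
    return text[::-1]


decomposed = "noe\u0308l"    # e + U+0308 COMBINING DIAERESIS
precomposed = "no\u00ebl"    # U+00EB LATIN SMALL LETTER E WITH DIAERESIS

# The combining mark now follows "l", so the diaeresis renders on the "l".
print(reverse_code_points(decomposed))

# With the precomposed letter, the diaeresis stays on the "e".
print(reverse_code_points(precomposed))
```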
Re: [PHP-DEV] Unicode support
Aleksey Tulinov wrote (on 15/10/2014): On 15/10/14 10:04, Rowan Collins wrote: Rowan, As I said at the top of my first post, the important thing is to capture what those requirements actually are. Just as you'd choose what array functions were needed if you were adding array support to a language. I'm sorry for not making myself clear. What i'm essentially saying is that i think noël test is synthetic and impractical I remain unconvinced on that, and it's just one example. There are plenty of forms which don't have a combined form, otherwise there would be no need for combining diacritics to exist in the first place. it's also solvable with requirement of NFC strings at input and this is not implementation defect. I also believe that Hangul is most likely to be precomposed and will work alright. Requiring a particular normal form on input is not something a programming language can do. The only way you can guarantee NFC form is by performing the normalisation. And i have another opinion on UTF-8 shortest-form. There's no need for opinion there, we can consult the standard. http://www.unicode.org/versions/Unicode6.0.0/ D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points. D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence. D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange. [...] The Unicode Standard uses 8-bit code units in the UTF-8 encoding form [...] D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence. D85a Minimal well-formed code unit subsequence: A well-formed Unicode code unit sequence that maps to a single Unicode scalar value. D92 UTF-8 encoding form: The Unicode encoding form that assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length, as specified in Table 3-6 and Table 3-7. 
Before the Unicode Standard, Version 3.1, the problematic “non-shortest form” byte sequences in UTF-8 were those where BMP characters could be represented in more than one way. These sequences are ill-formed, because they are not allowed by Table 3-7. In short: UTF-8 defines a mapping between sequences of 8-bit code units to abstract Unicode scalar values. Every Unicode scalar value maps to a single unique sequence of code units, but all Unicode scalar values can be represented. Since U+0308 COMBINING DIAERESIS is a valid Unicode scalar value, a UTF-8 string representing that value can be well-formed. It is only alternative representations of the same Unicode scalar value which must be in shortest form. There may be standards for interchange in particular situations which enforce additional constraints, such as that all strings should be in NFC, but the applicability or correct implementation of such standards is not something that you can use to define handling in an entire programming language. That aside. I think requirements is what i was asking about, i'm assuming that your standpoint is that string modification routines are at least required to take into account entire characters, not only code points. Am i correct? Yes, I think that at least some functions should be available which work on characters as users would define them, such as length and perhaps safe truncation. What is confusing me is that i think you're seeing it as a major implementation defect. To avoid arguable implementations, i've made short example in Java: System.out.println(new StringBuffer(noël).reverse().toString()); It does produce string l̈eon as i would expect. Why do you expect that? Is this a result which would ever be useful? To be clear, I am suggesting that we aim to be the language which gets this right, where other languages get it wrong. Precomposed noël also works as i would expect producing string lëon. 
What do you think, is this implementation issue or solely requirements issue? Well, you can only define an implementation defect with respect to the original requirement. If the requirement was to reverse characters, as most users would understand that term, then moving the diacritic to a different letter fails that requirement, because a user would not consider a diacritic a separate character. If the requirement was to reverse code points, regardless of their meaning, then the implementation is fine, but I would argue that the requirement failed to capture what most users would actually want. Regards, -- Rowan Collins [IMSoP] -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php
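
If the requirement is "reverse characters as users understand them", the reversal has to operate on grapheme clusters. The sketch below, in Python for illustration, uses a deliberately simplified clustering rule: a code point with a non-zero canonical combining class is attached to the preceding base. That is an assumption made to keep the example within the standard library; it is not the full grapheme cluster algorithm of UAX #29 and does not cover Hangul jamo or emoji sequences. The names simple_clusters and reverse_clusters are hypothetical.

```python
import unicodedata


def simple_clusters(text):
    """Split text into base-plus-combining-marks groups (simplified)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # combining mark: stays with its base
        else:
            clusters.append(ch)     # new base character
    return clusters


def reverse_clusters(text):
    """Reverse the order of clusters, keeping each cluster intact."""
    return "".join(reversed(simple_clusters(text)))


decomposed = "noe\u0308l"
print(simple_clusters(decomposed))    # the "e" and its diaeresis stay together
print(reverse_clusters(decomposed))   # the diaeresis remains on the "e"
```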
Re: [PHP-DEV] Unicode support
On 15/10/14 15:58, Rowan Collins wrote:

Rowan,

>> What is confusing me is that i think you're seeing it as a major implementation defect. To avoid arguable implementations, i've made short example in Java: System.out.println(new StringBuffer("noël").reverse().toString()); It does produce string l̈eon as i would expect.
> Why do you expect that? Is this a result which would ever be useful?

I think I expect it to work this way because I know that this is a good trade-off between performance and produced result. It also leaves a possibility to do it better if I need to.

> To be clear, I am suggesting that we aim to be the language which gets this right, where other languages get it wrong.

Thank you for explaining this. I also think it could do better. I think a Unicode-aware strrev() shouldn't be too complicated to do.
[PHP-DEV] Unicode support
Hey,

I can't find any recent discussion on this mailing list on this topic; I think the closest one is http://grokbase.com/t/php/php-internals/143b6aevsp/unicode-strings. I was also reading articles like this one: http://www.infoworld.com/article/2618358/application-development/php-5-4-emerges-from-the-collapse-of-php-6-0.html The latter refers to difficulties like excess memory usage and the need to rewrite the language.

I'm developing an open-source Unicode implementation library (nunicode). It doesn't consume any heap at all, and it works on native binary strings, as PHP does. Hence I think that maybe it could help with at least these two problems. But I hardly understand whether my work is even applicable here. My library is a rather pragmatic implementation: it's conformant to Unicode 7.0 and ISO/IEC 14651, but it does not implement the whole Unicode specification.

I would appreciate it if someone would point me to a good read or explain the collective opinion on this topic. I'm basically interested in the following questions:

1. Is there a need for more Unicode support in PHP?
2. What is currently missing in that regard?
3. Is this a good place to ask such questions?

Thanks.
Re: [PHP-DEV] Unicode support
On 14 October 2014 10:04, Aleksey Tulinov aleksey.tuli...@gmail.com wrote: Hey, I can't find any recent discussion in this mailing list on this topic, i think that most close one is http://grokbase.com/t/php/php-internals/143b6aevsp/unicode-strings. I was also reading papers like that: http://www.infoworld.com/article/2618358/application-development/php-5-4-emerges-from-the-collapse-of-php-6-0.html Latter is referring to difficulties like excess memory usage and rewrite the language. I'm developing an open-source Unicode implementation library (nunicode), and it doesn't consume any heap at all, it also works on native binary strings, as PHP does. Hence i thinks that maybe it could help with at least these two problems. On the face of it, this implies a rather large performance hit and a tendency to overflow the stack much more readily, do you have any details on these elements? But i hardly understand if my work is even applicable here. My library is a rather pragmatic implementation, it's conformant to Unicode 7.0 and ISO/IEC 14651, but it does not implement the whole Unicode specification. I would appreciate if someone would point me to a good read or explain collective opinion on this topic. I'm basically interested in the following questions: The only additional thing I can find quickly is something Pierre put together earlier this year, when PHP6 (now 7) discussions were started: https://wiki.php.net/ideas/php6/unicode 1. Is there a need for more Unicode support in PHP? 2. What is currently missing in that regard? 3. Is this a good place to ask such questions? My *personal* view on questions 1 and 2 is no and nothing respectively, but I think this is not a popular opinion (and those answers are a vast oversimplification of the issues). This is certainly a good place to ask those questions, though. Thanks. 
Re: [PHP-DEV] Unicode support
On 14 Oct 2014, at 10:04, Aleksey Tulinov aleksey.tuli...@gmail.com wrote:

> I would appreciate if someone would point me to a good read or explain collective opinion on this topic. I'm basically interested in the following questions:
> 1. Is there a need for more Unicode support in PHP?

Yes.

> 2. What is currently missing in that regard?

Unicode string support.

> 3. Is this a good place to ask such questions?

Yes.

If you want to see a pragmatic, actually working, work-in-progress attempt at better PHP unicode support, see this: https://github.com/krakjoe/ustring

It would add a UString class to PHP for Unicode strings. This would make Unicode text manipulation much easier than it is now. And both internal and userland code which accepts strings would already be compatible as it has a __toString method, but new code could also choose to accept UStrings directly.

--
Andrea Faulds http://ajf.me/
Re: [PHP-DEV] Unicode support
On 14/10/14 14:00, Chris Wright wrote: Chris, Latter is referring to difficulties like excess memory usage and rewrite the language. I'm developing an open-source Unicode implementation library (nunicode), and it doesn't consume any heap at all, it also works on native binary strings, as PHP does. Hence i thinks that maybe it could help with at least these two problems. On the face of it, this implies a rather large performance hit and a tendency to overflow the stack much more readily, do you have any details on these elements? I can't really tell if hit is going to be large before understanding what final result would be, at least approximately. I can tell that internal complexity of nunicode is O(1) everywhere. I'm comparing performance to ICU and nunicode mostly outperforms it. I've compiled some numbers here: https://bitbucket.org/alekseyt/nunicode#markdown-header-performance-considerations Regarding stack, i'm not sure if get the point. As far as i'm concerned, library does not have recursive calls, it does not have internal representation and does not allocate on stack aggressively. Everything works on immutable binary strings, stack will be used mostly for function calls. But honestly, i feel like i'm not answering your question at all. Could you possibly clarify it? I would appreciate if someone would point me to a good read or explain collective opinion on this topic. I'm basically interested in the following questions: The only additional thing I can find quickly is something Pierre put together earlier this year, when PHP6 (now 7) discussions were started: https://wiki.php.net/ideas/php6/unicode Thank you, this is exactly what i was looking for. I would appreciate if someone would comment on the following: Some of the keys point we need to take care of are: 1) UTF-8 storage 2) UTF-8 support for almost (if not all) existing string APIs 3) Performance As of today, I did not find any library covering at least two of these key points. 
I think i could claim that nunicode is covering at least two key points, maybe all of them, but i'm not sure about point 2). API do include operations on strings, but this API is simply following standard string functions (UTF equivalents of strcoll(), strchr(), strstr(), etc). Does that sound good or not? -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP-DEV] Unicode support
On 14 October 2014 16:09, Aleksey Tulinov aleksey.tuli...@gmail.com wrote: On 14/10/14 14:00, Chris Wright wrote: Chris, Latter is referring to difficulties like excess memory usage and rewrite the language. I'm developing an open-source Unicode implementation library (nunicode), and it doesn't consume any heap at all, it also works on native binary strings, as PHP does. Hence i thinks that maybe it could help with at least these two problems. On the face of it, this implies a rather large performance hit and a tendency to overflow the stack much more readily, do you have any details on these elements? I can't really tell if hit is going to be large before understanding what final result would be, at least approximately. I can tell that internal complexity of nunicode is O(1) everywhere. I'm comparing performance to ICU and nunicode mostly outperforms it. I've compiled some numbers here: https://bitbucket.org/alekseyt/nunicode#markdown-header-performance-considerations Great, thanks for this Regarding stack, i'm not sure if get the point. As far as i'm concerned, library does not have recursive calls, it does not have internal representation and does not allocate on stack aggressively. Everything works on immutable binary strings, stack will be used mostly for function calls. But honestly, i feel like i'm not answering your question at all. Could you possibly clarify it? My apologies, this was a case of typing before thinking properly. I was envisaging very large stack frames due to large char arrays being allocated on the stack but when I actually apply my brain to what you are doing I realise that this isn't going to be the case. Carry on. I would appreciate if someone would point me to a good read or explain collective opinion on this topic. 
I'm basically interested in the following questions: The only additional thing I can find quickly is something Pierre put together earlier this year, when PHP6 (now 7) discussions were started: https://wiki.php.net/ideas/php6/unicode Thank you, this is exactly what i was looking for. I would appreciate if someone would comment on the following: Some of the keys point we need to take care of are: 1) UTF-8 storage 2) UTF-8 support for almost (if not all) existing string APIs 3) Performance As of today, I did not find any library covering at least two of these key points. I think i could claim that nunicode is covering at least two key points, maybe all of them, but i'm not sure about point 2). API do include operations on strings, but this API is simply following standard string functions (UTF equivalents of strcoll(), strchr(), strstr(), etc). Does that sound good or not? -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP-DEV] Unicode support
On 14/10/14 16:50, Andrea Faulds wrote:

> If you want to see a pragmatic, actually working, work-in-progress attempt at better PHP unicode support, see this: https://github.com/krakjoe/ustring It would add a UString class to PHP for Unicode strings. This would make Unicode text manipulation much easier than it is now. And both internal and userland code which accepts strings would already be compatible as it has a __toString method, but new code could also choose to accept UStrings directly.

Looking at it now. UString and the repo linked in its description are a very good read indeed. Thank you.
Re: [PHP-DEV] Unicode support
On 14/10/2014 14:50, Andrea Faulds wrote:

>> 2. What is currently missing in that regard?
> Unicode string support.

I know that was probably deliberately flippant, but I think there is a genuine question to be asked here. A lot of people talk about Unicode support like they talk about XPath support; but XPath is an API you can adhere to, Unicode is a whole lot more (and less) than that.

What it probably means to most people is string functions which do what I expect with a vast range of obscure Unicode code point sequences. Those expectations need to be documented *before* an API is written, rather than writing a whole load of functions which use a Unicode library, but don't actually provide the tools that people need.

> If you want to see a pragmatic, actually working, work-in-progress attempt at better PHP unicode support, see this: https://github.com/krakjoe/ustring

It looks like a good prototype, but glancing at the documentation, I'm not clear exactly what the assumptions of some of the functions are. There's a lot of talk of characters, which is a *very* slippery notion in Unicode; charAt() returns a single code point, and $length returns a number of code points. This makes me wonder if it will pass the noël test [1] - does a combining diacritic move onto a different letter when you run ->reverse()?

As I've mentioned before, a lot of the time what people actually want to deal with is grapheme clusters - the kind of thing that you'd think of as a character if you were writing by hand. Most people, if asked the length of the string "noël", would answer 4, but there may be 5 code points. (That's not just a case of normalisation choices; most combinations of letter+diacritic have no single code point, that's why the combining forms exist.)

A good Unicode string API should probably give clear labels and choices for such things - $string->codePointAt(3) is not the same as $string->graphemeAt(3), $string->codePointCount is not the same as $string->graphemeCount, and so forth. A single property $length seems more user-friendly, until the user finds it means something different to what they wanted.

Similarly, an automatic __toString() function is handy, but what encoding does it output, and why? UTF-8? The same encoding that the string was constructed with? If I know that my database is expecting UTF-8, I probably want to say $string->getByteString('UTF-8'). I may also want to say $string->getByteStringWithMaxLength('UTF-8', 20) to fit an exact number of graphemes into a 20-byte binary space; something that neither $string->substring(0, 20)->getByteString('UTF-8') nor substr( $string->getByteString('UTF-8'), 0, 20 ) can do.

In short, we can only abstract so much - supporting Unicode automatically means supporting its complexity, not just pretending it's a really big version of ASCII.

[1] http://mortoray.com/2013/11/27/the-string-type-is-broken/

--
Rowan Collins [IMSoP]
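
The three different "lengths" discussed above can be put side by side. A minimal sketch in Python for illustration; the function names are hypothetical and are not UString's API. The grapheme count is an approximation that simply skips code points with a non-zero canonical combining class, which is an assumption and not the full UAX #29 segmentation.

```python
import unicodedata


def byte_count(text, encoding="utf-8"):
    """Number of bytes in the chosen encoding."""
    return len(text.encode(encoding))


def code_point_count(text):
    """Number of Unicode code points (what a code point $length reports)."""
    return len(text)


def approx_grapheme_count(text):
    """Approximate user-perceived characters: count non-combining code points."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


word = "noe\u0308l"  # "noël" with a separate combining diaeresis

print(byte_count(word))             # 6
print(code_point_count(word))       # 5
print(approx_grapheme_count(word))  # 4
```

Giving each measure its own explicitly named accessor, as suggested above, makes the caller state which of the three numbers is wanted.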
Re: [PHP-DEV] Unicode support
On 14 Oct 2014, at 19:01, Rowan Collins rowan.coll...@gmail.com wrote:

>> If you want to see a pragmatic, actually working, work-in-progress attempt at better PHP unicode support, see this: https://github.com/krakjoe/ustring
> It looks like a good prototype, but glancing at the documentation, I'm not clear exactly what the assumptions of some of the functions are. There's a lot of talk of characters, which is a *very* slippery notion in Unicode; charAt() returns a single code point, and $length returns a number of code points. This makes me wonder if it will pass the noël test [1] - does a combining diacritic move onto a different letter when you run ->reverse()?
> As I've mentioned before, a lot of the time what people actually want to deal with is grapheme clusters - the kind of thing that you'd think of as a character if you were writing by hand. Most people, if asked the length of the string "noël", would answer 4, but there may be 5 code points. (That's not just a case of normalisation choices; most combinations of letter+diacritic have no single code point, that's why the combining forms exist.)
> A good Unicode string API should probably give clear labels and choices for such things - $string->codePointAt(3) is not the same as $string->graphemeAt(3), $string->codePointCount is not the same as $string->graphemeCount, and so forth. A single property $length seems more user-friendly, until the user finds it means something different to what they wanted.

This is true. It ought to talk about code points but doesn’t. Length is primarily needed for iterating through strings and the like. If you want length in characters, you probably need to implement your own algorithm, as it really depends on your specific use case.

It will, however, always produce valid UTF8 strings for output. That’s better than standard string functions which can mangle UTF8.

> Similarly, an automatic __toString() function is handy, but what encoding does it output, and why? UTF-8? The same encoding that the string was constructed with?

Always UTF-8.

> If I know that my database is expecting UTF-8, I probably want to say $string->getByteString('UTF-8').

You can do that.

> I may also want to say $string->getByteStringWithMaxLength('UTF-8', 20) to fit an exact number of graphemes into a 20-byte binary space; something that neither $string->substring(0, 20)->getByteString('UTF-8') nor substr( $string->getByteString('UTF-8'), 0, 20 ) can do.

I’m not sure quite how you’d do that. There might be a function in mbstring for that.

> In short, we can only abstract so much - supporting Unicode automatically means supporting its complexity, not just pretending it's a really big version of ASCII.

Sure. But just handling code points safely is hard enough as it is. This handles that. It doesn’t handle characters, sure, but it’s a start. And for many applications, you do not need to handle characters.

--
Andrea Faulds http://ajf.me/
Re: [PHP-DEV] Unicode support
On 14/10/14 21:01, Rowan Collins wrote:

Rowan,

> As I've mentioned before, a lot of the time what people actually want to deal with is grapheme clusters - the kind of thing that you'd think of as a character if you were writing by hand. Most people, if asked the length of the string "noël", would answer 4, but there may be 5 code points. (That's not just a case of normalisation choices; most combinations of letter+diacritic have no single code point, that's why the combining forms exist.)

Very good point. I'll give another example: is there a substring "s" in the string "Maße"? If it's a case-sensitive search, then there is no such substring, but if it's a case-insensitive search, then ß folds into ss and the substring "s" appears.

This works both ways. For instance, if someone wants to split the string MASSE after ß in a case-insensitive manner, one approach might be: 1) find the ß position, it's +2; 2) split the string at +3. The result would be two strings: MAS and SE.

Back to combining characters: I dig the idea of introducing graphemes, but I think a French person would write the word noël using the precomposed character. I'm using the French keyboard at https://translate.google.com/#fr/. ë is Shift + ^, then e, and it produces precomposed U+00EB.

If a script doesn't have a precomposed equivalent, then this grapheme will always be in the same decomposed form and collation will work. Substring search will also work, because the needle will be decomposed in the same way as the haystack.

There are some border-line cases possible, but are they really practical in the scope of Unicode support in a programming language? Any ideas?

P.S. Point about documentation taken.
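
The Maße example above can be tried directly. A minimal sketch in Python for illustration, using the built-in str.casefold(), which applies Unicode full case folding; the helper name contains_caseless is hypothetical.

```python
def contains_caseless(haystack, needle):
    """Substring test after case-folding both sides."""
    return needle.casefold() in haystack.casefold()


word = "Ma\u00dfe"  # "Maße"

print("s" in word)                    # case-sensitive: no "s" in the string
print(contains_caseless(word, "s"))   # caseless: ß folds to "ss", so "s" appears
print(contains_caseless("MASSE", "\u00df"))  # and ß is found inside MASSE

# Folding changes the length, so offsets found in the folded text do not
# map one-to-one onto offsets in the original string.
print(len(word), len(word.casefold()))
```

The last line is the reason the split example needs care: a match position computed on folded text has to be translated back before the original string is cut.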
Re: [PHP-DEV] Unicode support
On Tue, 2014-10-14 at 23:18 +0300, Aleksey Tulinov wrote:

> Very good point. I'll give another example: is there a substring s in string Maße? If it's case-sensitive search, when there is no such substring, but if it's case-insensitive search, then ß folds into ss and substring s appears.

In Unicode 5.1 there is ẞ U+1E9E LATIN CAPITAL LETTER SHARP S.

(The point of this post mostly is to show that there is another dimension making this even more complicated, again - different Unicode versions)

johannes
Re: [PHP-DEV] Unicode support
On 14/10/2014 21:18, Aleksey Tulinov wrote: Back to combining characters, i dig the idea of introducing graphemes, but i think French person would write word noël using precomposed character. I'm using French keyboard at https://translate.google.com/#fr/. ë is Shift + ^, then e, it produces precomposed U+00EB. You don't even need to rely on the input method using the combined form, Unicode includes an algorithm for normalisation to this form (where such composites are coded), known as NFC. If script doesn't have precomposed equivalent, then this grapheme will always be in the same decomposed form and collation will work. Substring search will also work, because needle will be decomposed in the same way as haystack. No, it won't. You won't get false negatives as long as both strings are normalised to the same form (whether that is NFC or NFD), but you will get false positives. For instance, searching for the substring e would not match a combined ë, but it would match an uncombined sequence with e at its base (e.g. with two diacritics). Normalising to NFD (fully de-composed) would at least mean that e consistently matched all graphemes with e at their base, but is not a lossless operation, so performing it implicitly is probably not a good idea. All of which ignores the questions of length and string reversal, which I think are much more important in this respect. There are some border-line cases possible, but are they really practical in a scope of Unicode support in a programming language? As I understand it, the entirety of the Korean writing system is an edge case in this respect - it uses 3 code points for each grapheme, and cutting one of those graphemes apart leaves you with gibberish. It's pretty meaningless to say you support Unicode, but only the easy bits. You might as well just tag each string with one of the pages of ISO-8859. 
-- Rowan Collins [IMSoP]
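
The false-positive argument in the message above can be made concrete. A minimal sketch in Python for illustration, using unicodedata.normalize from the standard library. The "q" with diaeresis is an example chosen here, on the assumption that no precomposed code point exists for it; it does not come from the thread. The Hangul lines show the decomposed (NFD) form of one syllable; precomposed syllables are single code points.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "noe\u0308l")   # composes to U+00EB
nfd = unicodedata.normalize("NFD", "no\u00ebl")    # decomposes to e + U+0308

print("e" in nfc)   # no match: the "e" is part of the precomposed letter
print("e" in nfd)   # match: a bare "e" code point is present

# A letter+diacritic pair with no precomposed form stays decomposed under
# NFC, so a search for the bare base letter matches it.
q_diaeresis = unicodedata.normalize("NFC", "q\u0308")
print("q" in q_diaeresis)

# One Hangul syllable: a single code point in NFC, three jamo in NFD.
han_nfc = "\ud55c"
han_nfd = unicodedata.normalize("NFD", han_nfc)
print(len(han_nfc), len(han_nfd))

# Cutting after two code points no longer round-trips to the same syllable.
print(unicodedata.normalize("NFC", han_nfd[:2]) == han_nfc)
```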
Re: [PHP-DEV] Unicode support
On 14/10/2014 20:51, Andrea Faulds wrote:

> If you want length in characters, you probably need to implement your own algorithm, as it really depends on your specific use case.

I disagree, Unicode has very well-defined algorithms for these things, and the average PHP developer (or even PHP framework developer) is unlikely to do better.

> It will, however, always produce valid UTF8 strings for output. That’s better than standard string functions which can mangle UTF8.

They will be valid UTF-8 sequences, but they may not be meaningful strings - you might truncate halfway along a set of combining diacritics, or worse, halfway through a Korean syllable character (3 codepoints, 1 grapheme).

>> I may also want to say $string->getByteStringWithMaxLength('UTF-8', 20) to fit an exact number of graphemes into a 20-byte binary space; something that neither $string->substring(0, 20)->getByteString('UTF-8') nor substr( $string->getByteString('UTF-8'), 0, 20 ) can do.
> I’m not sure quite how you’d do that.

Nor am I, that's why I want the library to do it for me! :P

More seriously, a simple algorithm is easy enough to design - serialize your abstract string into bytes one grapheme at a time, tracking the current and previous lengths. If current length exceeds the maximum, track back to the previous length and return; otherwise, continue until all graphemes are serialized.

> Sure. But just handling code points safely is hard enough as it is. This handles that. It doesn’t handle characters, sure, but it’s a start. And for many applications, you do not need to handle characters.

We already have mbstring and intl for doing various things a bit better; the goal of more centralised support should be to do them as well as possible, not be just another variation that doesn't quite get there.

--
Rowan Collins [IMSoP]
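
The truncation algorithm outlined above can be sketched as follows, in Python for illustration. The clustering rule is a simplification (combining marks attach to the preceding base) and is an assumption, not the UAX #29 algorithm; the names simple_clusters and encode_with_max_bytes are hypothetical stand-ins for the proposed getByteStringWithMaxLength(). Instead of appending and then tracking back, the sketch checks the size before appending, which has the same effect.

```python
import unicodedata


def simple_clusters(text):
    """Split text into base-plus-combining-marks groups (simplified)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def encode_with_max_bytes(text, max_bytes, encoding="utf-8"):
    """Encode as many whole clusters as fit into max_bytes."""
    out = bytearray()
    for cluster in simple_clusters(text):
        chunk = cluster.encode(encoding)
        if len(out) + len(chunk) > max_bytes:
            break               # the next cluster would not fit: stop here
        out += chunk
    return bytes(out)


word = "noe\u0308l"

# A plain byte slice cuts through the two-byte combining mark.
print(word.encode("utf-8")[:4])

# Cluster-aware truncation drops the whole "e + diaeresis" instead.
print(encode_with_max_bytes(word, 4))
```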
Re: [PHP-DEV] Unicode support
On 14/10/14 10:04, Aleksey Tulinov wrote:
> 1. Is there a need for more Unicode support in PHP?
> 2. What is currently missing in that regard?
> 3. Is this a good place to ask such questions?

I need to ask ... Is this discussion only about improving support for UTF8 content in PHP? What is the current state of play with regard to function and variable names?

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Re: [PHP-DEV] Unicode support
On 14/10/14 23:48, Johannes Schlüter wrote:
> On Tue, 2014-10-14 at 23:18 +0300, Aleksey Tulinov wrote:
>> Very good point. I'll give another example: is there a substring "s" in the string "Maße"? If it's a case-sensitive search, then there is no such substring, but if it's a case-insensitive search, then ß folds into ss and the substring "s" appears.
>
> In Unicode 5.1 there is ẞ U+1E9E LATIN CAPITAL LETTER SHARP S.
>
> (The point of this post mostly is to show that there is another dimension making this even more complicated, again - different Unicode versions)

It's still in Unicode 7.0. According to the Unicode character database, the uppercase of ß is SS, the lowercase of ẞ is ß, and both case-fold into ss. Thus upper(lower(ẞ)) should produce SS.

There is another dimension indeed.
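The case mappings discussed above can be checked with a short script. This is an illustration in Python rather than PHP; it relies on Python's built-in `str.upper()`, `str.lower()` and `str.casefold()`, which apply the full Unicode case mappings, and the variable names are only for this example.

```python
word = "Maße"

# Case-sensitive search: the original string contains no "s".
print("s" in word)                       # False

# Case folding maps ß to "ss", so the folded string does contain "s".
print(word.casefold())                   # masse
print("s" in word.casefold())            # True

sharp_s = "\u00df"          # ß LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S

print(sharp_s.upper())                   # SS
print(capital_sharp_s.lower())           # ß
print(capital_sharp_s.casefold())        # ss

# The round trip does not return to the starting character.
print(capital_sharp_s.lower().upper())   # SS
```

The last line is the point made in the message: lowercasing and then uppercasing ẞ yields SS, so case conversion is not reversible here.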
Re: [PHP-DEV] Unicode support
On 15/10/14 00:04, Rowan Collins wrote:

Rowan,

>> Back to combining characters, I dig the idea of introducing graphemes, but I think a French person would write the word noël using a precomposed character. I'm using the French keyboard at https://translate.google.com/#fr/. ë is Shift + ^, then e; it produces precomposed U+00EB.
>
> You don't even need to rely on the input method using the combined form, Unicode includes an algorithm for normalisation to this form (where such composites are coded), known as NFC.

The problem with NFC is that it's not only composition, but decomposition + reordering + re-composition. I know about the NFC quick check, but the issue is that if the check fails and the string needs transformation, this would be very challenging, if not impossible, to do while keeping the string immutable and without introducing an internal representation of that string.

Internal representation and string modifications bring overhead which might eventually render the implementation unusable for a range of applications.

>> On the other side, language specific characters which can be precomposed are likely to be precomposed. If a script doesn't have a precomposed equivalent, then this grapheme will always be in the same decomposed form and collation will work. Substring search will also work, because the needle will be decomposed in the same way as the haystack.
>
> No, it won't. You won't get false negatives as long as both strings are normalised to the same form (whether that is NFC or NFD), but you will get false positives. For instance, searching for the substring e would not match a combined ë, but it would match an uncombined sequence with e at its base (e.g. with two diacritics). Normalising to NFD (fully de-composed) would at least mean that e consistently matched all graphemes with e at their base, but is not a lossless operation, so performing it implicitly is probably not a good idea.

Good point. That's what I meant by a border-line case. Could you possibly point me to a specific example of such a false positive? I'm interested in a well-formed UTF-8 string. I believe the noël test is ill-formed UTF-8 and doesn't conform to the shortest-form requirement.

> It's pretty meaningless to say you support Unicode, but only the easy bits. You might as well just tag each string with one of the pages of ISO-8859.

As far as I'm concerned, the Unicode specification does not require implementing all annexes or even supporting the entire character set to be conformant. I think there are always trade-offs involved, depending on what is more important for you.
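The normalisation behaviour and the false-positive case debated above can be reproduced with a short script. This is an illustration in Python rather than PHP, using the standard `unicodedata.normalize()` function. The specific test strings are assumptions chosen for this sketch: e + U+030A COMBINING RING ABOVE stands in for "a sequence with no precomposed form", and e + acute + dot below is used to show reordering.

```python
import unicodedata


def code_points(text):
    """Return the code points of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


composed = "no\u00ebl"     # ë as precomposed U+00EB
decomposed = "noe\u0308l"  # e followed by U+0308 COMBINING DIAERESIS

# The two spellings differ as code point sequences until normalised.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# NFC is decomposition + reordering + recomposition:
# acute (combining class 230) is typed before dot below (class 220).
unordered = "e\u0301\u0323"
print(code_points(unicodedata.normalize("NFD", unordered)))
# ['U+0065', 'U+0323', 'U+0301']
print(code_points(unicodedata.normalize("NFC", unordered)))
# ['U+1EB9', 'U+0301']

# False positive: searching for "e" after NFC normalisation.
ring = unicodedata.normalize("NFC", "e\u030a")  # stays as two code points
print("e" in unicodedata.normalize("NFC", decomposed))  # False: ë is composed
print("e" in ring)                                      # True: bare e + mark
```

Both inputs are normalised to the same form, yet a code point search for "e" misses ë and matches the e-with-ring grapheme, which is the inconsistency Rowan describes.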
[PHP-DEV] Unicode support for *printf()
Hello all.

Attached is the patch which adds Unicode support to *printf() functions stack.
We (Andrei and me) made several assumptions that are worth mentioning:

sprintf() and vsprintf():
- use runtime_encoding when dealing with Unicode data.

printf() and vprintf():
- the result data is converted to output_encoding when formatting is done;
- return _number of bytes_ outputted.

fprintf() and vfprintf():
- use runtime_encoding, as all conversions are done by underlying streams API;
- both functions return the number returned by streams API (which seems to be number of bytes).

I did not run any benchmarks yet, but I don't expect it to cause any major slowdown.
I would like to hear your comments before applying the patch, so don't hesitate to post them.

-- 
Wbr,
Antony Dovgal

Index: ext/standard/formatted_print.c
===
RCS file: /repository/php-src/ext/standard/formatted_print.c,v
retrieving revision 1.88
diff -u -p -d -r1.88 formatted_print.c
--- ext/standard/formatted_print.c	7 Dec 2006 20:45:21 -	1.88
+++ ext/standard/formatted_print.c	11 Dec 2006 22:44:21 -
@@ -40,6 +40,9 @@
 #define MAX_FLOAT_DIGITS 38
 #define MAX_FLOAT_PRECISION 40

+#define PHP_OUTPUT 0
+#define PHP_RUNTIME 1
+
 #if 0
 /* trick to control varargs functions through cpp */
 # define PRINTF_DEBUG(arg) php_printf arg
@@ -50,7 +53,10 @@
 static char hexchars[] = "0123456789abcdef";
 static char HEXCHARS[] = "0123456789ABCDEF";

+static UChar u_hexchars[] = {0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66};
+static UChar u_HEXCHARS[] = {0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46};

+/* php_sprintf_appendchar() {{{ */
 inline static void
 php_sprintf_appendchar(char **buffer, int *pos, int *size, char add TSRMLS_DC)
 {
@@ -62,8 +68,21 @@ php_sprintf_appendchar(char **buffer, in
 	PRINTF_DEBUG(("sprintf: appending '%c', pos=\n", add, *pos));
 	(*buffer)[(*pos)++] = add;
 }
+/* }}} */

+/* php_u_sprintf_appendchar() {{{ */
+inline static void
+php_u_sprintf_appendchar(UChar **buffer, int *pos, int *size, UChar add TSRMLS_DC)
+{
+	if ((*pos + 1) >= *size) {
+		*size <<= 1;
+		*buffer = eurealloc(*buffer, *size);
+	}
+	(*buffer)[(*pos)++] = add;
+}
+/* }}} */

+/* php_sprintf_appendstring() {{{ */
 inline static void
 php_sprintf_appendstring(char **buffer, int *pos, int *size, char *add,
 						   int min_width, int max_width, char padding,
@@ -112,10 +131,57 @@ php_sprintf_appendstring(char **buffer, 
 		}
 	}
 }
+/* }}} */

+/* php_u_sprintf_appendstring() {{{ */
+inline static void
+php_u_sprintf_appendstring(UChar **buffer, int *pos, int *size, UChar *add,
+						   int min_width, int max_width, UChar padding,
+						   int alignment, int len, int neg, int expprec, int always_sign)
+{
+	register int npad;
+	int req_size;
+	int copy_len;
+
+	copy_len = (expprec ? MIN(max_width, len) : len);
+	npad = min_width - copy_len;
+
+	if (npad < 0) {
+		npad = 0;
+	}
+
+	req_size = *pos + MAX(min_width, copy_len) + 1;
+
+	if (req_size > *size) {
+		while (req_size > *size) {
+			*size <<= 1;
+		}
+		*buffer = eurealloc(*buffer, *size);
+	}
+	if (alignment == ALIGN_RIGHT) {
+		if ((neg || always_sign) && padding == 0x30 /* '0' */) {
+			(*buffer)[(*pos)++] = (neg) ? 0x2D /* '-' */ : 0x2B /* '+' */;
+			add++;
+			len--;
+			copy_len--;
+		}
+		while (npad-- > 0) {
+			(*buffer)[(*pos)++] = padding;
+		}
+	}
+	u_memcpy(&(*buffer)[*pos], add, copy_len + 1);
+	*pos += copy_len;
+	if (alignment == ALIGN_LEFT) {
+		while (npad--) {
+			(*buffer)[(*pos)++] = padding;
+		}
+	}
+}
+/* }}} */

+/* php_sprintf_appendint() {{{ */
 inline static void
-php_sprintf_appendint(char **buffer, int *pos, int *size, long number,
+php_sprintf_appendint(char **buffer, int *pos, int *size, long number,
 						int width, char padding, int alignment, int 
 						always_sign)
 {
@@ -155,7 +221,49 @@ php_sprintf_appendint(char **buffer, int
 							 padding, alignment, (NUM_BUF_SIZE - 1) - i,
 							 neg, 0, always_sign);
 }
+/* }}} */

+/* php_u_sprintf_appendint() {{{ */
+inline static void
+php_u_sprintf_appendint(UChar **buffer, int *pos, int