Re: [PHP-DEV] Unicode support

2014-10-16 Thread Nicolas Grekas
Hello,

I think that Rowan is right: PHP users need to manipulate grapheme clusters
first (and code points in some rare situations). The fact that most of us
live in a world where NFC composes all our characters only hides this
reality.

A typical use case is a template engine: nearly all string manipulations
there need grapheme awareness: cutting strings to get an excerpt,
inserting a space between every character, changing the case, etc. This is
a typical use case for a PHP app.
Another use case is implementing text indexing in PHP: you
need to normalize before indexing, handle case folding, and thus think in
terms of graphemes. I'm not sure this is frequent in PHP though.

As already said, alongside grapheme clusters, we should also deal
with string matching: collations are out of scope, but normalization and
case folding are in. Please do not forget the Turkish alphabet either:
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/class/Patchwork/TurkishUtf8.php
This is required IMHO to give users what they expect from str_replace, strpos,
strcmp, etc.
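
The Turkish case is easy to reproduce. The following is a minimal sketch in Java (the language of the reversal example elsewhere in this thread), not PHP; the class and method names are made up for illustration. It shows that mapping the letter i between cases depends on the locale, which is why a locale-blind case fold can surprise Turkish users.

```java
import java.util.Locale;

class TurkishCase {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Upper-case using Turkish rules: i maps to dotted capital I (U+0130).
    static String upperTurkish(String s) {
        return s.toUpperCase(TURKISH);
    }

    // Lower-case using Turkish rules: I maps to dotless small i (U+0131).
    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    public static void main(String[] args) {
        System.out.println("i".toUpperCase(Locale.ROOT).equals("I")); // true
        System.out.println(upperTurkish("i").equals("\u0130"));       // true
        System.out.println(lowerTurkish("I").equals("\u0131"));       // true
    }
}
```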

I wrote a quite successful PHP lib to deal with this in PHP:
https://github.com/nicolas-grekas/Patchwork-UTF8

My experience from this is the following:
- dealing with grapheme clusters in current PHP is ok with grapheme_*()
functions, but these require intl. It would be great to have them (or an
equivalent) in core,
- NFC normalization of all input is required to deal with string
comparisons, so having Normalizer in core looks required also,
- almost everybody uses mbstring when dealing with UTF-8 strings, but almost
all cases should use a grapheme_*() function instead.
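
To illustrate the second point, here is a small sketch of comparing strings only after normalizing both to NFC. It is written in Java with java.text.Normalizer as a stand-in for the intl Normalizer mentioned above; the class name and the helper equalsNfc are hypothetical.

```java
import java.text.Normalizer;

class NormalizedCompare {
    // Compare two strings after bringing both to NFC.
    static boolean equalsNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "no\u00EBl";  // single code point U+00EB
        String decomposed = "noe\u0308l";  // e followed by U+0308
        System.out.println(precomposed.equals(decomposed));     // false
        System.out.println(equalsNfc(precomposed, decomposed)); // true
    }
}
```

Without the normalization step, the two spellings of the same word compare as different.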


 To be clear, I am suggesting that we aim to be the language which gets
 this right, where other languages get it wrong.


 Thank you for explaining this. I also think it could do better. I think
 Unicode-aware strrev() shouldn't be too complicated to do.



Perl 6 identified the subject very well and invented what they call NFG,
which is NFC + dynamic internal code points for non-composable grapheme
clusters:
http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html

Maybe worth looking at?

Cheers,
Nicolas


Re: [PHP-DEV] Unicode support

2014-10-15 Thread Rowan Collins

Good point. That's what i meant by border-line case. Could you possibly
point me to a specific example of such false positive? I'm interested
in well-formed UTF-8 string. I believe noël test is ill-formed UTF-8
and doesn't conform to shortest-form requirement.

You're confusing two concepts here: well-formed UTF-8 represents any single
code point with the smallest number of bytes, but it makes no requirements
about what code points are represented. Representing ë as two code points
is perfectly valid Unicode, and would in fact be required under NFD.

That most input sources would prefer the combined form seems like a weak 
assumption to base a library on; it only takes one popular third-party to 
routinely return data in NFD for the problems to start showing up.

 It's pretty meaningless to say you support Unicode, but only the easy
 bits. You might as well just tag each string with one of the pages of
 ISO-8859.


As far as i'm concerned Unicode specification does not require to 
implement all annexes or even support entire character set to be 
conformant. I think there are always trade-offs involved, depending on 
what is more important for you.

Sure, but there are certain user expectations of what Unicode support means.
Handling Korean characters in a meaningful way would definitely be on
that list.

As I said at the top of my first post, the important thing is to capture what 
those requirements actually are. Just as you'd choose what array functions were 
needed if you were adding array support to a language.

To put it a different way, in what situation would you actively want to know
the number of code points in a string, rather than either the number of bytes
in its UTF-8 representation, or the number of graphemes?
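
The three lengths in that question can be put side by side. This is an illustrative sketch in Java, assuming java.text.BreakIterator's character instance is an acceptable approximation of grapheme segmentation; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.Locale;

class StringLengths {
    // Number of bytes in the UTF-8 encoding of s.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of user-perceived characters, as segmented by BreakIterator.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String decomposed = "noe\u0308l"; // e followed by U+0308
        System.out.println(utf8Bytes(decomposed));  // 6
        System.out.println(codePoints(decomposed)); // 5
        System.out.println(graphemes(decomposed));  // 4
    }
}
```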

Re: [PHP-DEV] Unicode support

2014-10-15 Thread Aleksey Tulinov

On 15/10/14 10:04, Rowan Collins wrote:

Rowan,


As I said at the top of my first post, the important thing is to capture
what those requirements actually are. Just as you'd choose what array
functions were needed if you were adding array support to a language.



I'm sorry for not making myself clear. What I'm essentially saying is
that I think the noël test is synthetic and impractical; it is also
solvable by requiring NFC strings at input, and this is not an
implementation defect. I also believe that Hangul is most likely to be
precomposed and will work all right. And I have another opinion on UTF-8
shortest form.


This is my personal opinion of course.

That aside.

I think requirements are what I was asking about. I'm assuming that your
standpoint is that string modification routines are at least required to
take into account entire characters, not only code points. Am I correct?


What is confusing me is that I think you're seeing it as a major
implementation defect. To avoid arguable implementations, I've made a
short example in Java:


System.out.println(new StringBuffer("noël").reverse().toString());

It does produce the string l̈eon, as I would expect. Precomposed noël also
works as I would expect, producing the string lëon. What do you think, is
this an implementation issue or solely a requirements issue?


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Unicode support

2014-10-15 Thread Rowan Collins

Aleksey Tulinov wrote (on 15/10/2014):

On 15/10/14 10:04, Rowan Collins wrote:

Rowan,


As I said at the top of my first post, the important thing is to capture
what those requirements actually are. Just as you'd choose what array
functions were needed if you were adding array support to a language.



I'm sorry for not making myself clear. What i'm essentially saying is 
that i think noël test is synthetic and impractical


I remain unconvinced on that, and it's just one example. There are 
plenty of forms which don't have a combined form, otherwise there would 
be no need for combining diacritics to exist in the first place.


it's also solvable with requirement of NFC strings at input and this 
is not implementation defect. I also believe that Hangul is most 
likely to be precomposed and will work alright.


Requiring a particular normal form on input is not something a 
programming language can do. The only way you can guarantee NFC form is 
by performing the normalisation.



And i have another opinion on UTF-8 shortest-form.


There's no need for opinion there, we can consult the standard. 
http://www.unicode.org/versions/Unicode6.0.0/


 D76 Unicode scalar value: Any Unicode code point except high-surrogate
 and low-surrogate code points.

 D77 Code unit: The minimal bit combination that can represent a unit of
 encoded text for processing or interchange. [...] The Unicode Standard
 uses 8-bit code units in the UTF-8 encoding form [...]

 D79 A Unicode encoding form assigns each Unicode scalar value to a
 unique code unit sequence.

 D85a Minimal well-formed code unit subsequence: A well-formed Unicode
 code unit sequence that maps to a single Unicode scalar value.

 D92 UTF-8 encoding form: The Unicode encoding form that assigns each
 Unicode scalar value to an unsigned byte sequence of one to four bytes
 in length, as specified in Table 3-6 and Table 3-7.

 Before the Unicode Standard, Version 3.1, the problematic
 “non-shortest form” byte sequences in UTF-8 were those where BMP
 characters could be represented in more than one way. These sequences
 are ill-formed, because they are not allowed by Table 3-7.

In short: UTF-8 defines a mapping from sequences of 8-bit code
units to abstract Unicode scalar values. Every Unicode scalar value
maps to a single unique sequence of code units, and all Unicode scalar
values can be represented. Since U+0308 COMBINING DIAERESIS is a valid
Unicode scalar value, a UTF-8 string representing that value can be
well-formed. It is only alternative representations of the same Unicode
scalar value which must be in shortest form.
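
The distinction can be checked mechanically. The sketch below uses Java's strict UTF-8 decoder as an assumed stand-in for a well-formedness check (the class and method names are hypothetical): the decomposed sequence e + U+0308 decodes cleanly, while a two-byte "non-shortest form" encoding of the letter e is rejected.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8WellFormed {
    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isWellFormed(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // e (0x65) followed by U+0308 (0xCC 0x88): two scalar values, each in shortest form.
        byte[] decomposed = {0x65, (byte) 0xCC, (byte) 0x88};
        // 0xC1 0xA5 is an overlong two-byte encoding of e.
        byte[] overlong = {(byte) 0xC1, (byte) 0xA5};
        System.out.println(isWellFormed(decomposed)); // true
        System.out.println(isWellFormed(overlong));   // false
    }
}
```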


There may be standards for interchange in particular situations which 
enforce additional constraints, such as that all strings should be in 
NFC, but the applicability or correct implementation of such standards 
is not something that you can use to define handling in an entire 
programming language.




That aside.

I think requirements is what i was asking about, i'm assuming that 
your standpoint is that string modification routines are at least 
required to take into account entire characters, not only code points. 
Am i correct?


Yes, I think that at least some functions should be available which work 
on characters as users would define them, such as length and perhaps 
safe truncation.




What is confusing me is that i think you're seeing it as a major 
implementation defect. To avoid arguable implementations, i've made 
short example in Java:


System.out.println(new StringBuffer("noël").reverse().toString());

It does produce string l̈eon as i would expect. 


Why do you expect that? Is this a result which would ever be useful?

To be clear, I am suggesting that we aim to be the language which gets 
this right, where other languages get it wrong.


Precomposed noël also works as i would expect producing string 
lëon. What do you think, is this implementation issue or solely 
requirements issue?


Well, you can only define an implementation defect with respect to the 
original requirement. If the requirement was to reverse characters, as 
most users would understand that term, then moving the diacritic to a 
different letter fails that requirement, because a user would not 
consider a diacritic a separate character.


If the requirement was to reverse code points, regardless of their 
meaning, then the implementation is fine, but I would argue that the 
requirement failed to capture what most users would actually want.
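
Both requirements can be written down and compared. This is a sketch only, in Java to match the example under discussion, and it assumes BreakIterator's character instance as the definition of "character"; reverseGraphemes is a hypothetical helper, not an existing API.

```java
import java.text.BreakIterator;
import java.util.Locale;

class GraphemeReverse {
    // Reverse the order of grapheme clusters, keeping each cluster intact.
    static String reverseGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        StringBuilder out = new StringBuilder(s.length());
        int end = it.last();
        for (int start = it.previous(); start != BreakIterator.DONE; start = it.previous()) {
            out.append(s, start, end); // copy one whole cluster
            end = start;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String decomposed = "noe\u0308l"; // e followed by U+0308
        // Code point reversal: the diaeresis ends up on the l.
        System.out.println(new StringBuilder(decomposed).reverse().toString().equals("l\u0308eon")); // true
        // Cluster reversal: the diaeresis stays on the e.
        System.out.println(reverseGraphemes(decomposed).equals("le\u0308on")); // true
    }
}
```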


Regards,
--
Rowan Collins
[IMSoP]




Re: [PHP-DEV] Unicode support

2014-10-15 Thread Aleksey Tulinov

On 15/10/14 15:58, Rowan Collins wrote:

Rowan,


What is confusing me is that i think you're seeing it as a major
implementation defect. To avoid arguable implementations, i've made
short example in Java:

System.out.println(new StringBuffer("noël").reverse().toString());

It does produce string l̈eon as i would expect.


Why do you expect that? Is this a result which would ever be useful?



I think I expect it to work this way because I know that this is a good
trade-off between performance and the result produced. It also leaves open
the possibility of doing it better if I need to.



To be clear, I am suggesting that we aim to be the language which gets
this right, where other languages get it wrong.



Thank you for explaining this. I also think it could do better. I think 
Unicode-aware strrev() shouldn't be too complicated to do.





[PHP-DEV] Unicode support

2014-10-14 Thread Aleksey Tulinov

Hey,

I can't find any recent discussion on this topic in this mailing list; I
think the closest one is
http://grokbase.com/t/php/php-internals/143b6aevsp/unicode-strings. I
was also reading articles like this one:
http://www.infoworld.com/article/2618358/application-development/php-5-4-emerges-from-the-collapse-of-php-6-0.html


The latter refers to difficulties like excess memory usage and the need to
rewrite the language. I'm developing an open-source Unicode
implementation library (nunicode). It doesn't consume any heap at
all, and it works on native binary strings, as PHP does. Hence I think
that maybe it could help with at least these two problems.


But I'm not sure whether my work is even applicable here. My library
is a rather pragmatic implementation: it's conformant to Unicode 7.0 and
ISO/IEC 14651, but it does not implement the whole Unicode specification.


I would appreciate if someone would point me to a good read or explain 
collective opinion on this topic. I'm basically interested in the 
following questions:


1. Is there a need for more Unicode support in PHP?
2. What is currently missing in that regard?
3. Is this a good place to ask such questions?

Thanks.




Re: [PHP-DEV] Unicode support

2014-10-14 Thread Chris Wright
On 14 October 2014 10:04, Aleksey Tulinov aleksey.tuli...@gmail.com wrote:
 Hey,

 I can't find any recent discussion in this mailing list on this topic, i
 think that most close one is
 http://grokbase.com/t/php/php-internals/143b6aevsp/unicode-strings. I was
 also reading papers like that:
 http://www.infoworld.com/article/2618358/application-development/php-5-4-emerges-from-the-collapse-of-php-6-0.html

 Latter is referring to difficulties like excess memory usage and rewrite
 the language. I'm developing an open-source Unicode implementation library
 (nunicode), and it doesn't consume any heap at all, it also works on native
 binary strings, as PHP does. Hence i thinks that maybe it could help with at
 least these two problems.

On the face of it, this implies a rather large performance hit and a
tendency to overflow the stack much more readily, do you have any
details on these elements?

 But i hardly understand if my work is even applicable here. My library is a
 rather pragmatic implementation, it's conformant to Unicode 7.0 and ISO/IEC
 14651, but it does not implement the whole Unicode specification.

 I would appreciate if someone would point me to a good read or explain
 collective opinion on this topic. I'm basically interested in the following
 questions:

The only additional thing I can find quickly is something Pierre put
together earlier this year, when PHP6 (now 7) discussions were
started:
https://wiki.php.net/ideas/php6/unicode

 1. Is there a need for more Unicode support in PHP?
 2. What is currently missing in that regard?
 3. Is this a good place to ask such questions?

My *personal* view on questions 1 and 2 is no and nothing
respectively, but I think this is not a popular opinion (and those
answers are a vast oversimplification of the issues).

This is certainly a good place to ask those questions, though.

 Thanks.




Re: [PHP-DEV] Unicode support

2014-10-14 Thread Andrea Faulds

On 14 Oct 2014, at 10:04, Aleksey Tulinov aleksey.tuli...@gmail.com wrote:

 I would appreciate if someone would point me to a good read or explain 
 collective opinion on this topic. I'm basically interested in the following 
 questions:
 
 1. Is there a need for more Unicode support in PHP?

Yes.

 2. What is currently missing in that regard?

Unicode string support.

 3. Is this a good place to ask such questions?

Yes.


If you want to see a pragmatic, actually working, work-in-progress attempt at 
better PHP unicode support, see this: https://github.com/krakjoe/ustring

It would add a UString class to PHP for Unicode strings. This would make 
Unicode text manipulation much easier than it is now. And both internal and 
userland code which accepts strings would already be compatible as it has a 
__toString method, but new code could also choose to accept UStrings directly.
--
Andrea Faulds
http://ajf.me/








Re: [PHP-DEV] Unicode support

2014-10-14 Thread Aleksey Tulinov

On 14/10/14 14:00, Chris Wright wrote:

Chris,


Latter is referring to difficulties like excess memory usage and rewrite
the language. I'm developing an open-source Unicode implementation library
(nunicode), and it doesn't consume any heap at all, it also works on native
binary strings, as PHP does. Hence i thinks that maybe it could help with at
least these two problems.


On the face of it, this implies a rather large performance hit and a
tendency to overflow the stack much more readily, do you have any
details on these elements?



I can't really tell whether the hit is going to be large before understanding
what the final result would be, at least approximately.


I can tell that the internal complexity of nunicode is O(1) everywhere. I'm
comparing its performance to ICU, and nunicode mostly outperforms it. I've
compiled some numbers here:
https://bitbucket.org/alekseyt/nunicode#markdown-header-performance-considerations


Regarding the stack, I'm not sure I get the point. As far as I'm concerned,
the library does not have recursive calls, it does not have an internal
representation and it does not allocate on the stack aggressively. Everything
works on immutable binary strings; the stack will be used mostly for
function calls.


But honestly, i feel like i'm not answering your question at all. Could 
you possibly clarify it?



I would appreciate if someone would point me to a good read or explain
collective opinion on this topic. I'm basically interested in the following
questions:


The only additional thing I can find quickly is something Pierre put
together earlier this year, when PHP6 (now 7) discussions were
started:
https://wiki.php.net/ideas/php6/unicode



Thank you, this is exactly what i was looking for.

I would appreciate if someone would comment on the following:

 Some of the keys point we need to take care of are:

 1) UTF-8 storage
 2) UTF-8 support for almost (if not all) existing string APIs
 3) Performance

 As of today, I did not find any library covering at least two of 
these key points.


I think I could claim that nunicode is covering at least two key points,
maybe all of them, but I'm not sure about point 2). The API does include
operations on strings, but it simply follows the standard string
functions (UTF equivalents of strcoll(), strchr(), strstr(), etc). Does
that sound good or not?





Re: [PHP-DEV] Unicode support

2014-10-14 Thread Chris Wright
On 14 October 2014 16:09, Aleksey Tulinov aleksey.tuli...@gmail.com wrote:
 On 14/10/14 14:00, Chris Wright wrote:

 Chris,

 Latter is referring to difficulties like excess memory usage and rewrite
 the language. I'm developing an open-source Unicode implementation library
 (nunicode), and it doesn't consume any heap at all, it also works on native
 binary strings, as PHP does. Hence i thinks that maybe it could help with at
 least these two problems.


 On the face of it, this implies a rather large performance hit and a
 tendency to overflow the stack much more readily, do you have any
 details on these elements?


 I can't really tell if hit is going to be large before understanding what
 final result would be, at least approximately.

 I can tell that internal complexity of nunicode is O(1) everywhere. I'm
 comparing performance to ICU and nunicode mostly outperforms it. I've
 compiled some numbers here:
 https://bitbucket.org/alekseyt/nunicode#markdown-header-performance-considerations

Great, thanks for this

 Regarding stack, i'm not sure if get the point. As far as i'm concerned,
 library does not have recursive calls, it does not have internal
 representation and does not allocate on stack aggressively. Everything works
 on immutable binary strings, stack will be used mostly for function calls.

 But honestly, i feel like i'm not answering your question at all. Could you
 possibly clarify it?

My apologies, this was a case of typing before thinking properly. I
was envisaging very large stack frames due to large char arrays being
allocated on the stack but when I actually apply my brain to what you
are doing I realise that this isn't going to be the case.

Carry on.

 I would appreciate if someone would point me to a good read or explain
 collective opinion on this topic. I'm basically interested in the following
 questions:


 The only additional thing I can find quickly is something Pierre put
 together earlier this year, when PHP6 (now 7) discussions were
 started:
 https://wiki.php.net/ideas/php6/unicode


 Thank you, this is exactly what i was looking for.

 I would appreciate if someone would comment on the following:

 Some of the keys point we need to take care of are:

 1) UTF-8 storage
 2) UTF-8 support for almost (if not all) existing string APIs
 3) Performance

 As of today, I did not find any library covering at least two of these key
 points.

 I think i could claim that nunicode is covering at least two key points,
 maybe all of them, but i'm not sure about point 2). API do include
 operations on strings, but this API is simply following standard string
 functions (UTF equivalents of strcoll(), strchr(), strstr(), etc). Does that
 sound good or not?




Re: [PHP-DEV] Unicode support

2014-10-14 Thread Aleksey Tulinov

On 14/10/14 16:50, Andrea Faulds wrote:


If you want to see a pragmatic, actually working, work-in-progress attempt at 
better PHP unicode support, see this: https://github.com/krakjoe/ustring

It would add a UString class to PHP for Unicode strings. This would make 
Unicode text manipulation much easier than it is now. And both internal and 
userland code which accepts strings would already be compatible as it has a 
__toString method, but new code could also choose to accept UStrings directly.



Looking at it now. UString and the repo linked in its description are a very
good read indeed. Thank you.





Re: [PHP-DEV] Unicode support

2014-10-14 Thread Rowan Collins

On 14/10/2014 14:50, Andrea Faulds wrote:

2. What is currently missing in that regard?

Unicode string support.


I know that was probably deliberately flippant, but I think there is a 
genuine question to be asked here. A lot of people talk about Unicode 
support like they talk about XPath support; but XPath is an API you 
can adhere to, Unicode is a whole lot more (and less) than that.


What it probably means to most people is string functions which do what 
I expect with a vast range of obscure Unicode code point sequences. 
Those expectations need to be documented *before* an API is written, 
rather than writing a whole load of functions which use a Unicode 
library, but don't actually provide the tools that people need.



If you want to see a pragmatic, actually working, work-in-progress attempt at 
better PHP unicode support, see this: https://github.com/krakjoe/ustring


It looks like a good prototype, but glancing at the documentation, I'm 
not clear exactly what the assumptions of some of the functions are.


There's a lot of talk of characters, which is a *very* slippery notion
in Unicode; charAt() returns a single code point, and $length returns a
number of code points. This makes me wonder if it will pass the noël
test [1] - does a combining diacritic move onto a different letter when
you run ->reverse()?


As I've mentioned before, a lot of the time what people actually want to 
deal with is grapheme clusters - the kind of thing that you'd think of 
as a character if you were writing by hand. Most people, if asked the 
length of the string noël, would answer 4, but there may be 5 code 
points. (That's not just a case of normalisation choices; most 
combinations of letter+diacritic have no single code point, that's why 
the combining forms exist.)


A good Unicode string API should probably give clear labels and choices
for such things - $string->codePointAt(3) is not the same as
$string->graphemeAt(3), $string->codePointCount is not the same as
$string->graphemeCount, and so forth. A single property $length seems
more user-friendly, until the user finds it means something different to
what they wanted.


Similarly, an automatic __toString() function is handy, but what 
encoding does it output, and why? UTF-8? The same encoding that the 
string was constructed with?


If I know that my database is expecting UTF-8, I probably want to say
$string->getByteString('UTF-8'). I may also want to say
$string->getByteStringWithMaxLength('UTF-8', 20) to fit an exact number
of graphemes into a 20-byte binary space; something that neither
$string->substring(0, 20)->getByteString('UTF-8') nor substr(
$string->getByteString('UTF-8'), 0, 20 ) can do.
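
What such a byte-limited export might do can be sketched as follows. This is an assumption about the intended behaviour, written in Java rather than against the UString prototype: utf8WithMaxBytes is a hypothetical helper that keeps adding whole grapheme clusters (per BreakIterator) until the next one would exceed the byte budget.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.Locale;

class GraphemeTruncate {
    // Encode as many leading grapheme clusters of s as fit into maxBytes of UTF-8.
    static byte[] utf8WithMaxBytes(String s, int maxBytes) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int keptChars = 0; // index in s up to which clusters are kept
        int usedBytes = 0; // UTF-8 bytes used by the kept clusters
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            int clusterBytes = s.substring(start, end).getBytes(StandardCharsets.UTF_8).length;
            if (usedBytes + clusterBytes > maxBytes) {
                break; // the next cluster would not fit as a whole
            }
            usedBytes += clusterBytes;
            keptChars = end;
        }
        return s.substring(0, keptChars).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decomposed = "noe\u0308l"; // the cluster e + U+0308 takes 3 bytes
        // With a 4-byte budget only "no" fits; the e is not split from its diacritic.
        System.out.println(utf8WithMaxBytes(decomposed, 4).length); // 2
        System.out.println(utf8WithMaxBytes(decomposed, 5).length); // 5
    }
}
```

Cutting by code points first, or by bytes afterwards, can both separate the e from its combining mark, which is the gap the paragraph above describes.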


In short, we can only abstract so much - supporting Unicode 
automatically means supporting its complexity, not just pretending it's 
a really big version of ASCII.


[1] http://mortoray.com/2013/11/27/the-string-type-is-broken/

--
Rowan Collins
[IMSoP]





Re: [PHP-DEV] Unicode support

2014-10-14 Thread Andrea Faulds

On 14 Oct 2014, at 19:01, Rowan Collins rowan.coll...@gmail.com wrote:

 
 If you want to see a pragmatic, actually working, work-in-progress attempt 
 at better PHP unicode support, see this: https://github.com/krakjoe/ustring
 
 It looks like a good prototype, but glancing at the documentation, I'm not 
 clear exactly what the assumptions of some of the functions are.
 
 There's a lot of talk of characters, which is a *very* slippery notion in 
 Unicode; charAt() returns a single code point, and $length returns a number 
 of code points. This makes me wonder if it will pass the noël test [1] - 
 does a combining diacritic move onto a different letter when you run
 ->reverse()?
 
 As I've mentioned before, a lot of the time what people actually want to deal 
 with is grapheme clusters - the kind of thing that you'd think of as a 
 character if you were writing by hand. Most people, if asked the length of 
 the string noël, would answer 4, but there may be 5 code points. (That's 
 not just a case of normalisation choices; most combinations of 
 letter+diacritic have no single code point, that's why the combining forms 
 exist.)
 
 A good Unicode string API should probably give clear labels and choices for 
 such things - $string->codePointAt(3) is not the same as
 $string->graphemeAt(3), $string->codePointCount is not the same as
 $string->graphemeCount, and so forth. A single property $length seems more
 user-friendly, until the user finds it means something different to what they 
 wanted.

This is true. It ought to talk about code points but doesn’t. Length is
primarily needed for iterating through strings and the like. If you want the
length in characters, you probably need to implement your own algorithm, as it
really depends on your specific use case.

It will, however, always produce valid UTF8 strings for output. That’s better 
than standard string functions which can mangle UTF8.

 Similarly, an automatic __toString() function is handy, but what encoding 
 does it output, and why? UTF-8? The same encoding that the string was 
 constructed with?

Always UTF-8.

 If I know that my database is expecting UTF-8, I probably want to say 
 $string->getByteString('UTF-8').

You can do that.

 I may also want to say $string->getByteStringWithMaxLength('UTF-8', 20) to
 fit an exact number of graphemes into a 20-byte binary space; something that
 neither $string->substring(0, 20)->getByteString('UTF-8') nor substr(
 $string->getByteString('UTF-8'), 0, 20 ) can do.

I’m not sure quite how you’d do that. There might be a function in mbstring for 
that.

 In short, we can only abstract so much - supporting Unicode automatically 
 means supporting its complexity, not just pretending it's a really big 
 version of ASCII.

Sure. But just handling code points safely is hard enough as it is. This 
handles that. It doesn’t handle characters, sure, but it’s a start. And for 
many applications, you do not need to handle characters.
--
Andrea Faulds
http://ajf.me/








Re: [PHP-DEV] Unicode support

2014-10-14 Thread Aleksey Tulinov

On 14/10/14 21:01, Rowan Collins wrote:

Rowan,


As I've mentioned before, a lot of the time what people actually want to
deal with is grapheme clusters - the kind of thing that you'd think of
as a character if you were writing by hand. Most people, if asked the
length of the string noël, would answer 4, but there may be 5 code
points. (That's not just a case of normalisation choices; most
combinations of letter+diacritic have no single code point, that's why
the combining forms exist.)



Very good point. I'll give another example: is there a substring s in the
string Maße? If it's a case-sensitive search, then there is no such
substring, but if it's a case-insensitive search, then ß folds into ss
and the substring s appears.


This works both ways. For instance, if someone wants to split the string
MASSE after ß in a case-insensitive manner, one approach might be: 1)
find the position of ß, which is +2; 2) split the string at +3. The result
would be two strings: MAS and SE.
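
The Maße example can be tried out with a rough approximation of case folding. The sketch below is Java, and fold() is a hypothetical helper that upper-cases and then lower-cases in the root locale; this is an assumption made for illustration, not the Unicode case folding algorithm itself.

```java
import java.util.Locale;

class CaseFoldSearch {
    // Rough stand-in for case folding: upper-case, then lower-case, locale-independently.
    static String fold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Case-insensitive substring test on the folded forms.
    static boolean containsFolded(String haystack, String needle) {
        return fold(haystack).contains(fold(needle));
    }

    public static void main(String[] args) {
        String word = "Ma\u00DFe"; // U+00DF is the sharp s
        System.out.println(word.contains("s"));        // false
        System.out.println(containsFolded(word, "s")); // true
        // Folding changes the length, so offsets found in the folded
        // string do not map directly back onto the original.
        System.out.println(word.length());       // 4
        System.out.println(fold(word).length()); // 5
    }
}
```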


Back to combining characters, I dig the idea of introducing graphemes,
but I think a French person would write the word noël using a precomposed
character. I'm using the French keyboard at
https://translate.google.com/#fr/. ë is Shift + ^, then e, and it
produces the precomposed U+00EB.


If a script doesn't have a precomposed equivalent, then this grapheme will
always be in the same decomposed form and collation will work. Substring
search will also work, because the needle will be decomposed in the same way
as the haystack. There are some border-line cases possible, but are they
really practical in the scope of Unicode support in a programming language?


Any ideas?

P.S. Point about documentation taken.




Re: [PHP-DEV] Unicode support

2014-10-14 Thread Johannes Schlüter
On Tue, 2014-10-14 at 23:18 +0300, Aleksey Tulinov wrote:
 Very good point. I'll give another example: is there a substring s in 
 string Maße? If it's case-sensitive search, when there is no such 
 substring, but if it's case-insensitive search, then ß folds into ss 
 and substring s appears.

In Unicode 5.1 there is ẞ U+1E9E LATIN CAPITAL LETTER SHARP S.

(The point of this post mostly is to show that there is another
dimension making this even more complicated, again - different Unicode
versions)

johannes






Re: [PHP-DEV] Unicode support

2014-10-14 Thread Rowan Collins

On 14/10/2014 21:18, Aleksey Tulinov wrote:
Back to combining characters, i dig the idea of introducing graphemes, 
but i think French person would write word noël using precomposed 
character. I'm using French keyboard at 
https://translate.google.com/#fr/. ë is Shift + ^, then e, it 
produces precomposed U+00EB.


You don't even need to rely on the input method using the combined form, 
Unicode includes an algorithm for normalisation to this form (where such 
composites are coded), known as NFC.


If script doesn't have precomposed equivalent, then this grapheme will 
always be in the same decomposed form and collation will work. 
Substring search will also work, because needle will be decomposed in 
the same way as haystack.


No, it won't. You won't get false negatives as long as both strings are 
normalised to the same form (whether that is NFC or NFD), but you will 
get false positives. For instance, searching for the substring "e" would 
not match a combined ë, but it would match an uncombined sequence with "e" 
at its base (e.g. with two diacritics).


Normalising to NFD (fully de-composed) would at least mean that "e" 
consistently matched all graphemes with "e" at their base, but it is not a 
lossless operation, so performing it implicitly is probably not a good idea.
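
A minimal sketch of that false positive, in Python for illustration only (the thread itself is about PHP). It assumes that "e" followed by U+033D COMBINING X ABOVE has no precomposed form, so NFC leaves that sequence as two code points.

```python
import unicodedata


def nfc(s):
    return unicodedata.normalize("NFC", s)


def nfd(s):
    return unicodedata.normalize("NFD", s)


# "e" + COMBINING DIAERESIS composes to the single code point U+00EB.
composed = nfc("noe\u0308l")
# Assumed to have no precomposed form, so the bare "e" survives NFC.
uncomposed = nfc("e\u033d")

# Both strings are in the same normal form, yet a plain substring
# search for "e" treats them differently.
e_in_composed = "e" in composed
e_in_uncomposed = "e" in uncomposed

# After NFD, "e" is found at the base of both graphemes.
e_in_both_nfd = "e" in nfd(composed) and "e" in nfd(uncomposed)
```

Under those assumptions the search misses the "e" inside ë but finds it inside the uncomposed sequence, while NFD makes both match.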


All of which ignores the questions of length and string reversal, which 
I think are much more important in this respect.
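
A minimal sketch of the length and reversal problem, in Python for illustration only. The `clusters` helper is hypothetical and deliberately simplified (a base character plus any combining marks that follow it); it is not the full grapheme cluster algorithm defined by Unicode.

```python
import unicodedata


def clusters(text):
    """Rough approximation: attach combining marks to the preceding character."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


word = "noe\u0308l"  # "noël" with a decomposed ë

code_point_length = len(word)
cluster_length = len(clusters(word))

# Reversing code points moves the diaeresis onto the "l".
reversed_code_points = word[::-1]
# Reversing clusters keeps the diaeresis on the "e".
reversed_clusters = "".join(reversed(clusters(word)))
```

Here the code point length is 5 while a reader sees 4 characters, and only the cluster-wise reversal keeps the mark attached to its base letter.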


> There may be some borderline cases, but are they really
> relevant within the scope of Unicode support in a programming language?


As I understand it, the entirety of the Korean writing system is an 
edge case in this respect - it uses 3 code points for each grapheme, 
and cutting one of those graphemes apart leaves you with gibberish.


It's pretty meaningless to say you support Unicode, but only the easy 
bits. You might as well just tag each string with one of the pages of 
ISO-8859.


--
Rowan Collins
[IMSoP]





Re: [PHP-DEV] Unicode support

2014-10-14 Thread Rowan Collins

On 14/10/2014 20:51, Andrea Faulds wrote:

> If you want length in characters, you probably need to implement your own
> algorithm, as it really depends on your specific use case.


I disagree, Unicode has very well-defined algorithms for these things, 
and the average PHP developer (or even PHP framework developer) is 
unlikely to do better.



> It will, however, always produce valid UTF-8 strings for output. That’s better
> than standard string functions, which can mangle UTF-8.


They will be valid UTF-8 sequences, but they may not be meaningful 
strings - you might truncate halfway along a set of combining 
diacritics, or worse, halfway through a Korean syllable character (3 
codepoints, 1 grapheme).
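
A minimal sketch of that truncation, in Python for illustration only. Note the assumption: the three-code-point case applies to Hangul in decomposed (conjoining jamo) form; in NFC a syllable such as 한 is a single code point.

```python
import unicodedata

syllable = "\ud55c"  # 한, one precomposed Hangul syllable
jamo = unicodedata.normalize("NFD", syllable)  # three conjoining jamo

# Cut after two of the three code points.
truncated = jamo[:2]

# The result still encodes and decodes as valid UTF-8...
truncated_bytes = truncated.encode("utf-8")

# ...but it no longer represents the syllable the user wrote.
recomposed = unicodedata.normalize("NFC", truncated)
```

The truncated string is well-formed, yet it recomposes to 하 (U+D558) rather than 한 (U+D55C).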





>> I may also want to say $string->getByteStringWithMaxLength('UTF-8', 20) to fit an exact
>> number of graphemes into a 20-byte binary space; something that neither
>> $string->substring(0, 20)->getByteString('UTF-8') nor substr(
>> $string->getByteString('UTF-8'), 0, 20 ) can do.
>
> I’m not sure quite how you’d do that.


Nor am I, that's why I want the library to do it for me! :P

More seriously, a simple algorithm is easy enough to design - serialize 
your abstract string into bytes one grapheme at a time, tracking the 
current and previous lengths. If current length exceeds the maximum, 
track back to the previous length and return; otherwise, continue until 
all graphemes are serialized.
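
The algorithm described above can be sketched as follows, in Python for illustration only. The names `simple_clusters` and `truncate_to_bytes` are hypothetical, and the cluster splitting is a rough stand-in (base character plus following combining marks) for a real grapheme cluster implementation.

```python
import unicodedata


def simple_clusters(text):
    """Yield a base character together with the combining marks after it."""
    cluster = ""
    for ch in text:
        if cluster and unicodedata.category(ch).startswith("M"):
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster


def truncate_to_bytes(text, max_bytes, encoding="utf-8"):
    """Serialize one cluster at a time; roll back if the limit is exceeded."""
    out = bytearray()
    for cluster in simple_clusters(text):
        previous_length = len(out)
        out += cluster.encode(encoding)
        if len(out) > max_bytes:
            # Track back to the previous length and stop.
            del out[previous_length:]
            break
    return bytes(out)


word = "noe\u0308l"  # decomposed ë takes 3 bytes in UTF-8
four_bytes = truncate_to_bytes(word, 4)
five_bytes = truncate_to_bytes(word, 5)
```

With a 4-byte limit the decomposed ë does not fit, so the result is just "no"; with 5 bytes the whole cluster is kept.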



> Sure. But just handling code points safely is hard enough as it is. This
> handles that. It doesn’t handle characters, sure, but it’s a start. And for
> many applications, you do not need to handle characters.


We already have mbstring and intl for doing various things a bit 
better; the goal of more centralised support should be to do them as 
well as possible, not be just another variation that doesn't quite get 
there.


--
Rowan Collins
[IMSoP]





Re: [PHP-DEV] Unicode support

2014-10-14 Thread Lester Caine
On 14/10/14 10:04, Aleksey Tulinov wrote:
> 1. Is there a need for more Unicode support in PHP?
> 2. What is currently missing in that regard?
> 3. Is this a good place to ask such questions?

I need to ask ...

Is this discussion only about improving support for UTF-8 content in PHP?
What is the current state of play with regard to function and variable names?

-- 
Lester Caine - G8HFL
-
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk




Re: [PHP-DEV] Unicode support

2014-10-14 Thread Aleksey Tulinov

On 14/10/14 23:48, Johannes Schlüter wrote:



> On Tue, 2014-10-14 at 23:18 +0300, Aleksey Tulinov wrote:
>> Very good point. I'll give another example: is there a substring "s" in
>> the string "Maße"? If it's a case-sensitive search, then there is no such
>> substring, but if it's a case-insensitive search, then ß folds into "ss"
>> and the substring "s" appears.
>
> In Unicode 5.1 there is ẞ U+1E9E LATIN CAPITAL LETTER SHARP S.
>
> (The point of this post mostly is to show that there is another
> dimension making this even more complicated, again - different Unicode
> versions)



It's still in Unicode 7.0. According to the Unicode Character Database, the 
uppercase of ß is SS and the lowercase of ẞ is ß; both case-fold to ss. Thus 
upper(lower(ẞ)) should produce SS. There is another dimension indeed.
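
These mappings can be checked with a short sketch, in Python for illustration only. It relies on Python's built-in full case mappings; `contains_caseless` is a hypothetical helper, and the results depend on the Unicode version bundled with the interpreter.

```python
sharp_s = "\u00df"          # ß LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S

upper_of_small = sharp_s.upper()
lower_of_capital = capital_sharp_s.lower()

# Lowercasing and then uppercasing does not return the original letter.
round_trip = capital_sharp_s.lower().upper()


def contains_caseless(haystack, needle):
    """Case-insensitive substring test based on case folding."""
    return needle.casefold() in haystack.casefold()


case_sensitive_hit = "s" in "Maße"
caseless_hit = contains_caseless("Maße", "S")
```

The case-sensitive search finds no "s" in "Maße", while the case-folded search does, because ß folds to "ss".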





Re: [PHP-DEV] Unicode support

2014-10-14 Thread Aleksey Tulinov

On 15/10/14 00:04, Rowan Collins wrote:

Rowan,


>> Back to combining characters, I dig the idea of introducing graphemes,
>> but I think a French person would write the word noël using a precomposed
>> character. I'm using the French keyboard at
>> https://translate.google.com/#fr/. ë is Shift + ^, then e; it
>> produces precomposed U+00EB.
>
> You don't even need to rely on the input method using the combined form:
> Unicode includes an algorithm for normalisation to this form (where such
> composites are coded), known as NFC.



The problem with NFC is that it is not only composition, but 
decomposition + reordering + re-composition. I know about the NFC quick 
check, but the issue is that if the check fails and the string needs 
transformation, this would be very challenging, if not impossible, to do 
while keeping the string immutable and without introducing an internal 
representation of that string.
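
A minimal sketch of why NFC is more than composition, in Python for illustration only. It assumes Python 3.8 or later for `unicodedata.is_normalized`, and that no precomposed "e with dot below and diaeresis" exists.

```python
import unicodedata

# Precomposed ë followed by COMBINING DOT BELOW.
text = "\u00eb\u0323"

# The input starts with a composed character, yet it is not in NFC.
already_nfc = unicodedata.is_normalized("NFC", text)

# Decomposition plus canonical reordering puts the dot below first.
nfd = unicodedata.normalize("NFD", text)

# Re-composition then combines "e" with the dot below, not the diaeresis.
nfc = unicodedata.normalize("NFC", text)
```

The precomposed ë has to be taken apart again: the result is U+1EB9 (e with dot below) followed by a combining diaeresis, so a new string must be produced.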


An internal representation and string modifications bring overhead, which 
might eventually render the implementation unusable for a range of applications.


On the other hand, language-specific characters which can be 
precomposed are likely to be precomposed.



>> If a script doesn't have a precomposed equivalent, then this grapheme will
>> always be in the same decomposed form and collation will work.
>> Substring search will also work, because the needle will be decomposed in
>> the same way as the haystack.
>
> No, it won't. You won't get false negatives as long as both strings are
> normalised to the same form (whether that is NFC or NFD), but you will
> get false positives. For instance, searching for the substring "e" would
> not match a combined ë, but it would match an uncombined sequence with "e"
> at its base (e.g. with two diacritics).
>
> Normalising to NFD (fully de-composed) would at least mean that "e"
> consistently matched all graphemes with "e" at their base, but it is not a
> lossless operation, so performing it implicitly is probably not a good
> idea.



Good point. That's what I meant by a borderline case. Could you possibly 
point me to a specific example of such a false positive? I'm interested in 
a well-formed UTF-8 string. I believe the noël test is ill-formed UTF-8 and 
doesn't conform to the shortest-form requirement.



> It's pretty meaningless to say you support Unicode, but only the easy
> bits. You might as well just tag each string with one of the pages of
> ISO-8859.



As far as I'm aware, the Unicode specification does not require an 
implementation to implement all annexes, or even to support the entire 
character set, in order to be conformant. I think there are always 
trade-offs involved, depending on what is more important for you.





[PHP-DEV] Unicode support for *printf()

2006-12-11 Thread Antony Dovgal

Hello all.

Attached is a patch which adds Unicode support to the *printf() family of functions.
We (Andrei and I) made several assumptions that are worth mentioning:

sprintf() and vsprintf(): 
- use runtime_encoding when dealing with Unicode data.


printf() and vprintf():
- the result data is converted to output_encoding when formatting is done;
- return _number of bytes_ outputted.

fprintf() and vfprintf():
- use runtime_encoding, as all conversions are done by underlying streams API;
- both functions return the number returned by streams API (which seems to be 
number of bytes).
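
A minimal sketch of what returning a byte count means for callers, in Python for illustration only. `formatted_bytes` is a hypothetical helper standing in for "format, then convert to the output encoding"; it is not part of the patch.

```python
def formatted_bytes(fmt, *args, output_encoding="utf-8"):
    """Format the arguments, then encode the result for output."""
    text = fmt % args
    return text.encode(output_encoding)


word = "no\u00ebl"  # four code points

# What a byte-counting printf() would report for each output encoding.
utf8_count = len(formatted_bytes("%s", word))
latin1_count = len(formatted_bytes("%s", word, output_encoding="iso-8859-1"))
```

The same four-character string yields 5 bytes in UTF-8 and 4 in ISO-8859-1, so the return value depends on the output encoding rather than on the number of characters.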

I have not run any benchmarks yet, but I don't expect it to cause any major 
slowdown. I would like to hear your comments before applying the patch, so 
don't hesitate to post them.

--
Wbr, 
Antony Dovgal
Index: ext/standard/formatted_print.c
===
RCS file: /repository/php-src/ext/standard/formatted_print.c,v
retrieving revision 1.88
diff -u -p -d -r1.88 formatted_print.c
--- ext/standard/formatted_print.c  7 Dec 2006 20:45:21 -   1.88
+++ ext/standard/formatted_print.c  11 Dec 2006 22:44:21 -
@@ -40,6 +40,9 @@
 #define MAX_FLOAT_DIGITS 38
 #define MAX_FLOAT_PRECISION 40
 
+#define PHP_OUTPUT 0
+#define PHP_RUNTIME 1
+
 #if 0
 /* trick to control varargs functions through cpp */
 # define PRINTF_DEBUG(arg) php_printf arg
@@ -50,7 +53,10 @@
 static char hexchars[] = "0123456789abcdef";
 static char HEXCHARS[] = "0123456789ABCDEF";
 
+static UChar u_hexchars[] = {0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66};
+static UChar u_HEXCHARS[] = {0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46};
 
+/* php_sprintf_appendchar() {{{ */
 inline static void
 php_sprintf_appendchar(char **buffer, int *pos, int *size, char add TSRMLS_DC)
 {
@@ -62,8 +68,21 @@ php_sprintf_appendchar(char **buffer, in
 	PRINTF_DEBUG(("sprintf: appending '%c', pos=\n", add, *pos));
 	(*buffer)[(*pos)++] = add;
 }
+/* }}} */
 
+/* php_u_sprintf_appendchar() {{{ */
+inline static void
+php_u_sprintf_appendchar(UChar **buffer, int *pos, int *size, UChar add TSRMLS_DC)
+{
+	if ((*pos + 1) >= *size) {
+		*size <<= 1;
+		*buffer = eurealloc(*buffer, *size);
+	}
+	(*buffer)[(*pos)++] = add;
+}
+/* }}} */
 
+/* php_sprintf_appendstring() {{{ */
 inline static void
 php_sprintf_appendstring(char **buffer, int *pos, int *size, char *add,
 						   int min_width, int max_width, char padding,
@@ -112,10 +131,57 @@ php_sprintf_appendstring(char **buffer, 
 		}
 	}
 }
+/* }}} */
 
+/* php_u_sprintf_appendstring() {{{ */
+inline static void
+php_u_sprintf_appendstring(UChar **buffer, int *pos, int *size, UChar *add,
+						   int min_width, int max_width, UChar padding,
+						   int alignment, int len, int neg, int expprec, int always_sign)
+{
+	register int npad;
+	int req_size;
+	int copy_len;
+
+	copy_len = (expprec ? MIN(max_width, len) : len);
+	npad = min_width - copy_len;
+
+	if (npad < 0) {
+		npad = 0;
+	}
+	
+	req_size = *pos + MAX(min_width, copy_len) + 1;
+
+	if (req_size > *size) {
+		while (req_size > *size) {
+			*size <<= 1;
+		}
+		*buffer = eurealloc(*buffer, *size);
+	}
+	if (alignment == ALIGN_RIGHT) {
+		if ((neg || always_sign) && padding == 0x30 /* '0' */) {
+			(*buffer)[(*pos)++] = (neg) ? 0x2D /* '-' */ : 0x2B /* '+' */;
+			add++;
+			len--;
+			copy_len--;
+		}
+		while (npad-- > 0) {
+			(*buffer)[(*pos)++] = padding;
+		}
+	}
+	u_memcpy(&(*buffer)[*pos], add, copy_len + 1);
+	*pos += copy_len;
+	if (alignment == ALIGN_LEFT) {
+		while (npad--) {
+			(*buffer)[(*pos)++] = padding;
+		}
+	}
+}
+/* }}} */
 
+/* php_sprintf_appendint() {{{ */ 
 inline static void
-php_sprintf_appendint(char **buffer, int *pos, int *size, long number,
+php_sprintf_appendint(char **buffer, int *pos, int *size, long number, 
 						int width, char padding, int alignment, 
 						int always_sign)
 {
@@ -155,7 +221,49 @@ php_sprintf_appendint(char **buffer, int
 							 padding, alignment, (NUM_BUF_SIZE - 1) - i,
 							 neg, 0, always_sign);
 }
+/* }}} */
 
+/* php_u_sprintf_appendint() {{{ */ 
+inline static void
+php_u_sprintf_appendint(UChar **buffer, int *pos, int