Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Tim Starling
On 25/11/21 11:34 pm, Paul Crovella wrote:
> On Thu, Nov 25, 2021 at 3:14 AM Tim Starling  wrote:
>> On 25/11/21 7:57 pm, Côme Chilliet wrote:
>>
>>> To reuse the example from the RFC, if I want to convert a UTF string to 
>>> uppercase using Turkish rules and get dotted capital I, what should I use?
>> For case-insensitive comparison you can use Collator. But for display
>> you just have to do it yourself. For the Turkish Wikipedia and other
>> Turkic language websites we are currently using str_replace().
> Any particular reason not to use transliterators? https://3v4l.org/I038T

Thanks, I missed that.

You would need to do your own mapping from language code to
transliterator name, since it only has converters for az/tr, el, lt
and "Any", with no fallbacks. For example if you did
Transliterator::create("en-Upper")->transliterate('a') you would get a
fatal error.
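
A minimal sketch of such a mapping, assuming only the transliterator IDs named above (az/tr, el, lt and "Any"). The helper name upperTransliteratorId() is hypothetical and not part of PHP or intl. The actual conversion needs the intl extension, so that part is guarded.

```php
<?php
// Hypothetical helper (not part of PHP or intl): picks a case-conversion
// transliterator ID for a language code, falling back to the "Any" rules.
function upperTransliteratorId(string $lang): string
{
    // Only the languages named in this thread get a dedicated ID.
    $special = [
        'az' => 'az-Upper',
        'tr' => 'tr-Upper',
        'el' => 'el-Upper',
        'lt' => 'lt-Upper',
    ];
    // Reduce e.g. "tr_TR" or "tr-TR" to the bare language subtag "tr".
    $primary = preg_split('/[-_]/', $lang)[0];
    // ASCII-only lowercasing, independent of setlocale().
    $primary = strtr($primary, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
    return $special[$primary] ?? 'Any-Upper';
}

// Requires the intl extension. create() returns null for an unknown ID,
// so check the result before calling transliterate().
if (class_exists('Transliterator')) {
    $t = Transliterator::create(upperTransliteratorId('tr_TR'));
    if ($t !== null) {
        echo $t->transliterate('istanbul'), "\n";
    }
}
```

Checking the return value of create() avoids the fatal error mentioned above, since the failure otherwise only surfaces when transliterate() is called on null.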

Presumably if I submitted a PR adding wrappers for u_strToUpper()
etc., it would not be rejected on the basis that we already have
transliterators.

-- Tim Starling

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Dan Ackroyd
On Thu, 25 Nov 2021 at 05:05, Tim Starling  wrote:
>
> Voting is now open for my RFC on locale-independent case conversion.
>

It seems popular, and likely to pass, but I voted no as the "Backward
Incompatible Changes" section is missing which makes it hard to
evaluate the impact.

cheers
Dan
Ack




Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Paul Crovella
On Thu, Nov 25, 2021 at 3:14 AM Tim Starling  wrote:
>
> On 25/11/21 7:57 pm, Côme Chilliet wrote:
>
> > To reuse the example from the RFC, if I want to convert a UTF string to 
> > uppercase using Turkish rules and get dotted capital I, what should I use?
>
> For case-insensitive comparison you can use Collator. But for display
> you just have to do it yourself. For the Turkish Wikipedia and other
> Turkic language websites we are currently using str_replace().

Any particular reason not to use transliterators? https://3v4l.org/I038T




Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Nicolas Grekas
On Thu, 25 Nov 2021 at 12:23, Tim Starling wrote:

> On 25/11/21 8:58 pm, Nicolas Grekas wrote:
>
> > The RFC says:
> > > because they also use isdigit() and isspace(),
> >
> > Does that mean "too much work needed"? I would totally understand that of
> > course but I hope someone could do these last miles.
>
> Yes.
>
> > > and because they are intended for natural language processing
> >
> > I definitely do not agree with this argument and it should be removed from
> > the RFC to me as it might add confusion in the future.
>
> Done.
>

Great, thanks!


Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Tim Starling
On 25/11/21 8:58 pm, Nicolas Grekas wrote:
>
> The RFC says:
> > because they also use isdigit() and isspace(),
>
> Does that mean "too much work needed"? I would totally understand
> that of course but I hope someone could do these last miles.
>
Yes.

> > and because they are intended for natural language processing
>
> I definitely do not agree with this argument and it should be
> removed from the RFC to me as it might add confusion in the future.

Done.

-- Tim Starling



Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Tim Starling
On 25/11/21 7:57 pm, Côme Chilliet wrote:
> Hello,
>
> The RFC is missing information about alternatives:
> Do all of these functions have an mbstring version?

The following functions have an mbstring version: strtolower,
strtoupper, stristr, stripos, strripos.

mb_convert_case() provides functionality equivalent to lcfirst,
ucfirst and ucwords.

There is no mbstring version of str_ireplace; that is tracked as
https://bugs.php.net/bug.php?id=75225

There is no mbstring equivalent for the array sorting functions with
SORT_FLAG_CASE, but there is Collator::asort() in intl.
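
As a rough illustration of the mbstring counterparts listed above, here is a sketch that assumes the mbstring extension and UTF-8 input. utf8Ucfirst() is a hypothetical helper name used only for this example.

```php
<?php
// Requires the mbstring extension. utf8Ucfirst() is a hypothetical helper.
function utf8Ucfirst(string $s): string
{
    // Uppercase only the first character and leave the rest untouched.
    return mb_strtoupper(mb_substr($s, 0, 1, 'UTF-8'), 'UTF-8')
        . mb_substr($s, 1, null, 'UTF-8');
}

// Title-case every word, roughly what ucwords() does for ASCII input
// (unlike ucwords(), MB_CASE_TITLE also lowercases the rest of each word).
echo mb_convert_case('élan vital', MB_CASE_TITLE, 'UTF-8'), "\n";

// First character only, the multibyte counterpart of ucfirst().
echo utf8Ucfirst('élan vital'), "\n";

// Case-insensitive search, the multibyte counterpart of stripos().
var_dump(mb_stripos('Crème Brûlée', 'BRÛLÉE', 0, 'UTF-8'));
```

None of these calls consult setlocale(), which is the sense in which the mbstring functions are locale-independent.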

> Are those locale dependent or have an option for it?

The mbstring functions are locale-independent.

Unfortunately there do not seem to be PHP wrappers for the family of
case conversion functions in ICU's ustring.h. There is
IntlChar::tolower() and IntlChar::toupper(), but they provide
locale-independent case conversion, equivalent to mbstring. It's not
ideal to change the case of a string character by character, since
some languages have multi-character mappings. ICU calls this
context-sensitive case conversion.
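
To illustrate why converting one character at a time falls short, here is a sketch that assumes mbstring on PHP 7.3 or later, where mb_convert_case() has both a full mode and a one-to-one "simple" mode. It covers multi-character mappings only, not language-specific rules.

```php
<?php
// Requires mbstring on PHP 7.3 or later, where the *_SIMPLE modes exist.
$word = 'straße';

// Simple mapping: one code point in, one out, so ß is left as it is.
$simple = mb_convert_case($word, MB_CASE_UPPER_SIMPLE, 'UTF-8');

// Full mapping: ß expands to the two-character sequence SS.
$full = mb_convert_case($word, MB_CASE_UPPER, 'UTF-8');

echo $simple, ' vs ', $full, "\n";
```

The full mapping changes the length of the string, which a per-character API cannot express.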

Considering the lack of wide character support or context-sensitive
case conversion in the existing strtoupper/strtolower, I would
consider this missing functionality rather than functionality which I
am removing.

> To reuse the example from the RFC, if I want to convert a UTF string to 
> uppercase using Turkish rules and get dotted capital I, what should I use?

For case-insensitive comparison you can use Collator. But for display
you just have to do it yourself. For the Turkish Wikipedia and other
Turkic language websites we are currently using str_replace().
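
A minimal sketch of the str_replace() approach, assuming UTF-8 input and the mbstring extension. turkishUpper() is a hypothetical name, and this is not the code used on Wikipedia.

```php
<?php
// Hypothetical helper, not the code used on Wikipedia: uppercase a UTF-8
// string with the Turkish dotted/dotless i rule applied first.
function turkishUpper(string $s): string
{
    // Map dotted lowercase i to dotted capital İ (U+0130) before the generic
    // conversion, which would otherwise turn it into a plain I.
    $s = str_replace('i', 'İ', $s);
    // Locale-independent Unicode uppercasing; dotless ı (U+0131) becomes I.
    return mb_strtoupper($s, 'UTF-8');
}

echo turkishUpper('istanbul'), "\n";
```

Lowercasing would need the reverse replacements (İ to i and I to ı) before calling mb_strtolower().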

-- Tim Starling




Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Nicolas Grekas
On Thu, 25 Nov 2021 at 11:34, Christoph M. Becker wrote:

> On 25.11.2021 at 10:58, Nicolas Grekas wrote:
>
> > On Thu, 25 Nov 2021 at 10:47, Tim Starling wrote:
> >
> >> and because they are intended for natural language processing
> >
> > I definitely do not agree with this argument and it should be removed
> > from the RFC to me as it might add confusion in the future.
>
> Yeah, the PHP manual says[1]:
>
> | This function implements a comparison algorithm that orders
> | alphanumeric strings in the way a human being would, this is described
> | as a "natural ordering".
>
> [1] 
>

Yep, yet "natural language processing" means processing sentences we write
as humans, e.g. processing this very message. The natcase sorting functions
are not designed for that. They're useful for sorting items in a list,
typically file names, in a way that makes sense to humans. This is very
different from "natural language processing". Having natsort() vary by locale
makes no more sense than having sort() vary by locale. That's my point. The
argument doesn't stand against implementing locale-insensitivity for these
functions, and I think the RFC shouldn't make that argument.

Nicolas


Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Christoph M. Becker
On 25.11.2021 at 10:58, Nicolas Grekas wrote:

> On Thu, 25 Nov 2021 at 10:47, Tim Starling wrote:
>
>> and because they are intended for natural language processing
>
> I definitely do not agree with this argument and it should be removed from
> the RFC to me as it might add confusion in the future.

Yeah, the PHP manual says[1]:

| This function implements a comparison algorithm that orders
| alphanumeric strings in the way a human being would, this is described
| as a "natural ordering".

[1] 

--
Christoph M. Becker




Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Nicolas Grekas
On Thu, 25 Nov 2021 at 10:47, Tim Starling wrote:

> On 25/11/21 7:55 pm, Nicolas Grekas wrote:
>
> > I voted yes because I want to see this happen but I raised a point in
> > https://externals.io/message/116141#116259 and didn't get an answer:
> >
> > > Despite their name, I never used natcase functions for natural language
> > > processing. I use them e.g. to sort lists of files in a directory, to
> > > account for numbers mainly. But that's not what I would call natural
> > > language processing. I'm not aware of anyone using them for that
> > > actually. I'm wondering if it's a good idea to postpone migrating them
> > > to a hypothetical future as to me, the whole reasoning of the RFC
> > > applies to them.
> >
> > I wish the strnatcasecmp() and natcasesort() functions, as well as the
> > SORT_NATURAL flag, were also covered by this RFC.
> > Is that possible? Would it make sense?
>
> I'm not going to migrate those functions at this time. It's just a project
> scope decision.
>

Why?

The RFC says:
> because they also use isdigit() and isspace(),

Does that mean "too much work needed"? I would totally understand that of
course but I hope someone could do these last miles.

> and because they are intended for natural language processing

I definitely do not agree with this argument and it should be removed from
the RFC to me as it might add confusion in the future.

Nicolas


Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Tim Starling
On 25/11/21 7:55 pm, Nicolas Grekas wrote:
>
> I voted yes because I want to see this happen but I raised a point
> in https://externals.io/message/116141#116259 and didn't get an
> answer:
>
> > Despite their name, I never used natcase functions for natural
> > language processing. I use them e.g. to sort lists of files in a
> > directory, to account for numbers mainly. But that's not what I
> > would call natural language processing. I'm not aware of anyone
> > using them for that actually. I'm wondering if it's a good idea to
> > postpone migrating them to a hypothetical future as to me, the
> > whole reasoning of the RFC applies to them.
>
> I wish the strnatcasecmp() and natcasesort() functions, as well as
> the SORT_NATURAL flag, were also covered by this RFC.
> Is that possible? Would it make sense?


I'm not going to migrate those functions at this time. It's just a
project scope decision.

-- Tim Starling



Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Dusk
On Nov 25, 2021, at 01:08, Hans Henrik Bergan  wrote:
> btw why is this code *not* getting dotted capital i on 3v4l?
> https://3v4l.org/D1WG1#v7.4.26
> it gets ["res_hex"]=> string(2) "49"
> 
>  setlocale(LC_ALL, "Turkish");

Because "Turkish" isn't a locale. "tr_TR" is.

https://3v4l.org/GD91W#v7.4.26

Notice that the output doesn't show up correctly, as it is not UTF-8. (Which is 
part of the problem addressed by this RFC.)
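
A small sketch of how to detect this case: setlocale() returns false when none of the requested names is installed, so the return value shows whether the call had any effect. The locale names tried here are illustrative, and which of them exist depends on the system.

```php
<?php
// setlocale() tries each name in order and returns the one it applied,
// or false if none is installed (the previous setting then stays active).
$result = setlocale(LC_CTYPE, 'tr_TR.UTF-8', 'tr_TR', 'Turkish');

if ($result === false) {
    echo "No Turkish locale available; LC_CTYPE unchanged\n";
} else {
    echo "LC_CTYPE is now {$result}\n";
}
```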



Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Hans Henrik Bergan
btw why is this code *not* getting dotted capital i on 3v4l?
https://3v4l.org/D1WG1#v7.4.26
it gets ["res_hex"]=> string(2) "49"

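<?php
// Lines up to "var_dump([" were lost in the archive; reconstructed as an assumption.
setlocale(LC_ALL, "Turkish");
$str = "i";
$res = strtoupper($str);
var_dump([
"str"=>$str,
"str_hex"=>bin2hex($str),
"res"=>$res,
"res_hex"=>bin2hex($res),
]);
?>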

On Thu, 25 Nov 2021 at 09:57, Côme Chilliet  wrote:

> On Thursday, 25 November 2021, 06:05:37 CET, Tim Starling wrote:
> > Voting is now open for my RFC on locale-independent case conversion.
> >
> > https://wiki.php.net/rfc/strtolower-ascii
>
> Hello,
>
> The RFC is missing information about alternatives:
> Do all of these functions have an mbstring version?
> Are those locale dependent or have an option for it?
>
> To reuse the example from the RFC, if I want to convert a UTF string to
> uppercase using Turkish rules and get dotted capital I, what should I use?
>
> Côme


Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Côme Chilliet
On Thursday, 25 November 2021, 06:05:37 CET, Tim Starling wrote:
> Voting is now open for my RFC on locale-independent case conversion.
> 
> https://wiki.php.net/rfc/strtolower-ascii

Hello,

The RFC is missing information about alternatives:
Do all of these functions have an mbstring version?
Are those locale dependent or have an option for it?

To reuse the example from the RFC, if I want to convert a UTF string to 
uppercase using Turkish rules and get dotted capital I, what should I use?

Côme




Re: [PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-25 Thread Nicolas Grekas
On Thu, 25 Nov 2021 at 06:05, Tim Starling wrote:

> Voting is now open for my RFC on locale-independent case conversion.
>
> https://wiki.php.net/rfc/strtolower-ascii
>
> Voting will close in two weeks, on 2021-12-09.
>

Hi Tim,

I voted yes because I want to see this happen but I raised a point in
https://externals.io/message/116141#116259 and didn't get an answer:

> Despite their name, I never used natcase functions for natural language
> processing. I use them e.g. to sort lists of files in a directory, to account
> for numbers mainly. But that's not what I would call natural language
> processing. I'm not aware of anyone using them for that actually. I'm
> wondering if it's a good idea to postpone migrating them to a hypothetical
> future as to me, the whole reasoning of the RFC applies to them.
>

I wish the strnatcasecmp() and natcasesort() functions, as well as the
SORT_NATURAL flag, were also covered by this RFC.
Is that possible? Would it make sense?

Nicolas


[PHP-DEV] [VOTE] Locale-independent case conversion

2021-11-24 Thread Tim Starling
Voting is now open for my RFC on locale-independent case conversion.

https://wiki.php.net/rfc/strtolower-ascii

Voting will close in two weeks, on 2021-12-09.

-- Tim Starling
