On 08/03/2020 14:08, Dan Ackroyd wrote:
> Related to this discussion, please could someone remind me why the
> mbstring extension is an extension and not part of core PHP?
>
> I realise at the time it was introduced, UTF-8 was far less widely
> used: https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg
>
> But now UTF-8 is pretty much the default for the vast majority of
> projects, so does that decision to keep it as an optional extension
> still hold up?

From what I can make out, mbstring was not actually built for Unicode
string-handling, but for what we would now consider "legacy encodings".
Its original niche seems to have been support for various Japanese text
encodings, and UTF-8 support was added relatively late.
That has some implications for its design:
- nearly every string function takes an encoding parameter, which
defaults to a run-time global setting (see mb_internal_encoding)
- on the other hand, there is no support for locales in functions which
would benefit, e.g. mb_convert_case, mb_stripos
- Unicode is treated as just another character encoding, so there is no
support for concepts like normalisation, graphemes, character
properties, etc
- instead, there are lots of niche functions for CJK languages like
mb_convert_kana and mb_strwidth
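
To make the first and third points concrete, here is a minimal sketch,
assuming PHP 7.0 or later with both ext/mbstring and ext/intl loaded. The
sample strings and variable names are mine, chosen purely for
illustration. It shows the same mb_strlen() call giving different answers
depending on the global encoding, and ext/intl handling two concepts
(graphemes and normalisation) that mbstring has no notion of.

```php
<?php
// Illustrative only: requires ext/mbstring and ext/intl.

$composed   = "\u{00E9}";   // "é" as a single code point (2 bytes in UTF-8)
$decomposed = "e\u{0301}";  // "e" followed by COMBINING ACUTE ACCENT

// Without an explicit encoding, the result depends on global state.
$saved = mb_internal_encoding();
mb_internal_encoding('ISO-8859-1');
$asLatin1 = mb_strlen($composed);   // 2: each byte counts as one character
mb_internal_encoding('UTF-8');
$asUtf8 = mb_strlen($composed);     // 1
mb_internal_encoding($saved);       // restore the previous global setting

// mbstring counts code points; ext/intl can count user-perceived characters.
$codePoints = mb_strlen($decomposed, 'UTF-8');  // 2
$graphemes  = grapheme_strlen($decomposed);     // 1

// The two spellings of "é" only compare equal after normalisation.
$equalRaw = ($decomposed === $composed);        // false
$equalNfc = (Normalizer::normalize($decomposed, Normalizer::FORM_C)
             === $composed);                    // true
```

Passing the encoding explicitly on every call avoids the dependence on
the global setting, at the cost of repeating it everywhere.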
It also includes some things which probably wouldn't pass review if
proposed today:
- a lot of global state, with combined get-or-set functions like
mb_detect_order(), mb_substitute_character(), etc
- mb_send_mail seems oddly specific, and has its own concept of
"language" not shared by anything else
- there's an entire regex implementation (mb_ereg_*), with its own API
and some compatibility with the removed ereg_* functions; the preg_*
functions included in core already support UTF-8 via the /u modifier
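
As a rough sketch of the first and last points, assuming PHP 7.0 or later
with ext/mbstring loaded (the sample text and variable names are
illustrative, not taken from any real code base): the same function both
reads and overwrites a request-wide setting, and preg_* matches whole
UTF-8 characters once the /u modifier is given.

```php
<?php
// Illustrative only: requires ext/mbstring; PCRE is always available.

// Get-or-set: with no argument the function reads the global setting...
$before = mb_substitute_character();
// ...and with an argument it overwrites it for the rest of the request.
mb_substitute_character('none');
$during = mb_substitute_character();   // "none"
mb_substitute_character($before);      // put the previous value back

// Three CJK characters (U+65E5 U+672C U+8A9E), 3 bytes each in UTF-8.
$text = "\u{65E5}\u{672C}\u{8A9E}";

// With /u, "." matches one UTF-8 character; without it, one byte.
$withU    = preg_match_all('/./u', $text);  // 3
$withoutU = preg_match_all('/./', $text);   // 9
```

Any code that changes one of these global settings has to remember to
restore it, which is the main hazard of the get-or-set style.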
For handling of Unicode, ext/intl is generally superior, with a more
structured API based on Unicode-specific concepts, rather than
attempting to map them to concepts used in older character encodings.
There may be a need for a more user-friendly subset of this (a "UString"
class is a common suggestion), but it shouldn't look like ext/mbstring,
IMHO.
I believe both extensions require fairly large external libraries, which
probably justifies them being optional. From what I've read, ICU, which
ext/intl is built on, would have been bundled with PHP 6, but its size
and performance contributed to the failure of that project.
Regards,
--
Rowan Tommins (né Collins)
[IMSoP]