Joost Verburg wrote:
Abdelrazak Younes wrote:
UTF-8 can use up to 6 bytes per character. 4 bytes would not have been
enough to store 4 bytes of data plus the extra bits necessary to decode
UTF-8.
Unicode does not use the full 32 bits. Code points run from U+0000 to
U+10FFFF, which is only 2^20+2^16 = 1,114,112 code points, so the whole
of Unicode fits in 21 bits (there is no 21-bit integer type, of course).
Therefore it is possible to encode any Unicode code point in 1-4 bytes
as UTF-8.
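
As a rough illustration of the 1-4 byte claim, here is a minimal Python
sketch. The function name utf8_length and the constant TOTAL_CODE_POINTS
are made up for this example; the thresholds are the standard UTF-8
length boundaries. It only computes lengths and does not treat the
surrogate range U+D800..U+DFFF specially.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one Unicode code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range U+0000..U+10FFFF")
    if code_point < 0x80:       # up to 7 bits of payload
        return 1
    if code_point < 0x800:      # up to 11 bits of payload
        return 2
    if code_point < 0x10000:    # up to 16 bits of payload
        return 3
    return 4                    # up to 21 bits of payload


# Number of code points in the range U+0000..U+10FFFF.
TOTAL_CODE_POINTS = 0x10FFFF + 1

if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF):
        encoded = chr(cp).encode("utf-8")
        # Python's own encoder should agree with the helper.
        print(f"U+{cp:04X}: {utf8_length(cp)} byte(s), encoded as {encoded.hex()}")
```

The largest code point, U+10FFFF, has a bit length of 21, which is why
the 4-byte form is sufficient.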
By the way, UCS-4 and UTF-32 can be taken to be identical. So it is
better to only use the names UTF-8, UTF-16 and UTF-32. Names like "UTF-8
to UCS-4" confuse people.
(Historically, before the Unicode specification contained this code
point limit, there was indeed a difference between UCS-4 and UTF-32, and
it was theoretically possible for a character to need more than 4 bytes
in UTF-8. This is no longer the case.)
Hopefully this makes things clear.
Indeed. Thanks a lot.
Abdel.