Re: Surrogates and noncharacters

2015-05-12 Thread Philippe Verdy
Even if UTF-8 initially started as part of some Unix standardization
process, it was for the purpose of allowing interchange across systems.
The networking concept was already there (otherwise it would not have been
part of the emerging *nix standardization processes, and would have
remained a proprietary encoding in local systems).

At the same time, the Internet was also about to emerge as a worldwide
network, but it was still very limited and full of restrictions,
accessible only from a few (very costly) gateways in other countries, and
not even with the IP protocol but with many specific protocols (maybe you
remember the time of CompuServe, billed only in US dollars and only via
international payments and costly bank processing fees; you also had to
call an international phone number before a few national phone numbers
appeared, co-operated by CompuServe and some national or regional services).

At that time, the telcos were not even interested in participating and all
wanted to develop their own national or regional networks with their own
protocols and national standards; real competition in telecommunications
only started just before Y2K, with the deregulation in North America and
some parts of Europe (in fact just in the EEA), before progressively going
worldwide when the initial competitors started to restructure/split/merge
and to align their too many technical standards with the need for a common
interoperable one that would work in all their new local branches. In
fact the worldwide Internet would not have become THE global network
without the reorganisation of the older deregulated national telcos and the
end of their monopolies.

The development of the Internet and the development of the UCS then
proceeded completely in parallel. Both emerged to replace former
national standards in the same domains previously operated by the former
monopolies in telecommunications (which also needed computing and data
standards, not just networking standards).

In the early days of the Internet, the IP protocol was still not really
established as the universal internetworking protocol (other competitors
were also proposed by private companies, notably Token-Ring by IBM, and the
X.21/X.25 family promoted essentially by European telcos, which preferred
realtime protocols with guaranteed/reserved bandwidth, and switching by
packets rather than by variable-size frames).

Even today, some parts of the X.* network family remain, but only for
short-distance private links: e.g. with ATM (in xDSL technologies), or for
local buses within electronic devices (under the 1-meter limit), or in some
mission-critical settings (realtime constraints for networking equipment in
aircraft, which has its own standards, with a few of them developed recently
as adaptations of Internet technologies over channels in a realtime network,
generally structured not as a mesh but with a star topology and dedicated
bandwidths).

If you want to look for remaining text encoding standards that are still
not based on the UCS, look into aircraft technologies and military
equipment (there's also the GSM family of protocols, which continues to
keep many legacy proprietary standards, with poor adaptation to Internet
technologies and the UCS...)

The situation is starting to change now in aircraft/military technology too
(first Airbus in Europe, now also adopted by its major US competitors) and
in mobile networks (4G), with the full integration of the IEEE Ethernet
standard, which allows a more natural and straightforward integration of IP
protocols and the UCS standards with it (even if compatibility is kept by
reserving a space for former protocols, something that the IEEE Ethernet
standard has already facilitated for the Internet we know now, both in
worldwide communications and in private LANs)...


2015-05-12 17:58 GMT+02:00 Hans Aberg haber...@telia.com:


  On 12 May 2015, at 16:50, Philippe Verdy verd...@wanadoo.fr wrote:
 
  Indeed, that is why UTF-8 was invented for use in Unix-like
 environments.
 
  Not the main reason: communication protocols, and data storage is also
 based on 8-bit code units (even if storage group them by much larger
 blocks).

 There is some history here:
   https://en.wikipedia.org/wiki/UTF-8#History





Re: Surrogates and noncharacters

2015-05-12 Thread Steffen Nurpmeso
Hans Aberg haber...@telia.com wrote:
 | On 12 May 2015, at 16:50, Philippe Verdy verd...@wanadoo.fr wrote:
 | Indeed, that is why UTF-8 was invented for use in Unix-like environments.
 | 
 | Not the main reason: communication protocols, and data storage \
 | is also based on 8-bit code units (even if storage group \
 | them by much larger blocks).
 |
 |There is some history here:
 |  https://en.wikipedia.org/wiki/UTF-8#History

What happened was this:

  http://doc.cat-v.org/bell_labs/utf-8_history

--steffen


FYI: The world’s languages, in 7 maps and charts

2015-05-12 Thread Mark Davis ☕️
http://www.washingtonpost.com/blogs/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/


Re: Surrogates and noncharacters

2015-05-12 Thread Hans Aberg

 On 12 May 2015, at 16:50, Philippe Verdy verd...@wanadoo.fr wrote:
 
 Indeed, that is why UTF-8 was invented for use in Unix-like environments.
 
 Not the main reason: communication protocols, and data storage is also based 
 on 8-bit code units (even if storage group them by much larger blocks).

There is some history here:
  https://en.wikipedia.org/wiki/UTF-8#History





Re: FYI: The world’s languages, in 7 maps and charts

2015-05-12 Thread Karl Williamson

On 05/12/2015 03:05 PM, Mark Davis ☕️ wrote:

http://www.washingtonpost.com/blogs/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/


And a critique:

http://languagelog.ldc.upenn.edu/nll/?p=18844


Re: FYI: The world’s languages, in 7 maps and charts

2015-05-12 Thread dzo
And a tangent, picking up on a complaint that Swahili wasn't represented on one 
of the 7 WaPost graphics:

http://niamey.blogspot.com/2015/05/how-many-people-speak-what-in-africa.html

Two other recent posts on this blog (Beyond Niamey) critique the Africa part 
of a set of graphics/maps of Second Most Spoken Languages Worldwide (on the 
Olivet Nazarene University site) - another thought-provoking effort that could 
be more informative if redone. 

Don Osborn


--Original Message--
From: Karl Williamson
Sender: Unicode
To: Mark Davis ☕️
To: Unicode Public
Subject: Re: FYI: The world’s languages, in 7 maps and charts
Sent: May 12, 2015 6:19 PM

On 05/12/2015 03:05 PM, Mark Davis ☕️ wrote:
 http://www.washingtonpost.com/blogs/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/

And a critique:

http://languagelog.ldc.upenn.edu/nll/?p=18844

Sent via BlackBerry by AT&T



Re: Surrogates and noncharacters

2015-05-12 Thread Philippe Verdy
2015-05-11 23:53 GMT+02:00 Hans Aberg haber...@telia.com:

 It is perfectly fine considering the Unicode code points as abstract
 integers, with UTF-32 and UTF-8 encodings that translate them into byte
 sequences in a computer. The code points that conflict with UTF-16 might
 have been merely declared not in use until UTF-16 has been fallen out of
 use, replaced by UTF-8 and UTF-32.


The deprecation of UTF-16 and UTF-32 as encoding *schemes* (charsets in
MIME) is already very advanced. But they will certainly not disappear as
encoding *forms* for internal use in binary APIs and in several very
popular programming languages: Java, Javascript, even C++ on Windows
platforms (where the 8-bit interface, based on legacy code pages and with
poor support of the UTF-8 encoding scheme as a Windows code page, is the
one now being phased out), C#, J#...
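
To make the encoding-form point concrete, here is a minimal sketch (Python
chosen only for brevity; nothing here is tied to any particular API named
above): a single supplementary code point is one UTF-32 code unit, two
UTF-16 code units (a surrogate pair), and four UTF-8 code units.

    # U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so the UTF-16
    # encoding form must represent it with a surrogate pair of 16-bit units.
    s = "\U0001D11E"
    utf16 = s.encode("utf-16-be")          # bytes d8 34 dd 1e
    units = [int.from_bytes(utf16[i:i+2], "big") for i in range(0, len(utf16), 2)]
    print([hex(u) for u in units])         # ['0xd834', '0xdd1e'] -> two UTF-16 code units
    print(len(s.encode("utf-8")))          # 4 UTF-8 code units (bytes)
    print(len(s.encode("utf-32-be")) // 4) # 1 UTF-32 code unit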

UTF-8 will also remain for a long time the preferred internal encoding for
Python and PHP (even if Python also introduced a 16-bit native datatype).

In all cases, programming languages are not based on any Unicode encoding
form but on more or less opaque streams of code units using datatypes that
are not constrained by Unicode (because their character or byte
datatype is also used for binary I/O and for supporting the conversion
of various binary structures, including executable code, and also because
this datatype is not necessarily 8-bit but may be larger, and not even a
multiple of 8 bits).
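
A small illustrative sketch of that point, assuming nothing beyond the
standard Python library: a bytes object is just an opaque stream of 8-bit
units, and whether it is well-formed UTF-8 is only decided when someone
explicitly decodes (validates) it.

    # 'A', then 'e' with acute accent, then a lone surrogate encoded UTF-8-style
    blob = b"\x41\xc3\xa9\xed\xa0\x80"
    try:
        blob.decode("utf-8", errors="strict")   # validation happens here, not before
    except UnicodeDecodeError as exc:
        print("not well-formed UTF-8:", exc.reason)
    # The very same bytes remain perfectly usable as opaque binary data.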

 One is going to check that the code points are valid Unicode values somewhere,
 so it is hard to see the point of restricting UTF-8 to align it with UTF-16.


What I meant when I started this discussion in this thread was just to
obsolete the unnecessary definitions of x-bit strings from TUS. The
standard does not need these definitions, and if we want it to be really
open to various architectures, languages and protocols, all that is needed
is the definition of the code units specific to each standard UTF (an
encoding form, or an encoding scheme when splitting code units into smaller
code units and ordering them), only determining this order and the minimum
set of distinct values that these code units must support: we should not
speak about bits, just about sets of distinct elements with a sufficient
cardinality.
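
As a rough illustration of the form/scheme distinction (a Python sketch,
not an excerpt from TUS): the UTF-16 encoding form yields a sequence of
16-bit code units, and the UTF-16BE/UTF-16LE encoding schemes then
serialize each of those units into bytes in one order or the other.

    s = "A€"                            # U+0041, U+20AC
    print(s.encode("utf-16-be").hex())  # '004120ac': code units 0x0041, 0x20AC, big-endian bytes
    print(s.encode("utf-16-le").hex())  # '4100ac20': the same code units, little-endian byte order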

So let's just speak about UTF-8 code units, UTF-16 code units, UTF-32
code units (not just code units and not even Unicode code units, which
is also nonsense given the existence of standardized compression schemes
that also define their own XXX code units).

If expressions like 16-bit code units have been used, it is purely for
internal use, as a shortcut for the complete name, and these shortcuts are
not part of the external entities to standardize (they are not precise
enough and cannot be used safely out of their local context): consider
these definitions as private ones (in the same sense as in OOP), boxed as
internals of the TUS seen as a black box.

It's not the focus of TUS to discuss what strings are: that is just a
matter for each integration platform that wants to use TUS.

In summary, the definitions in TUS should be split in two parts: those that
are public and needed by external references (in other standards), and
those that are private (many of them do not even have to be within the
generic section of the standard; they should be listed in the appropriate
sections needing them locally, also clearly separating the public and
private interfaces).

In all cases, the public interfaces must define precise and unambiguous
terms, bound to the standard or section of the standard defining them, even
if later within that section a shortcut is used as a convenience (to
make the text easier to read). We need scopes for these definitions (and
shorter aliases must be made private).


Re: Surrogates and noncharacters

2015-05-12 Thread Hans Aberg

 On 12 May 2015, at 15:45, Philippe Verdy verd...@wanadoo.fr wrote:
 
 
 
 2015-05-11 23:53 GMT+02:00 Hans Aberg haber...@telia.com:
 It is perfectly fine considering the Unicode code points as abstract 
 integers, with UTF-32 and UTF-8 encodings that translate them into byte 
 sequences in a computer. The code points that conflict with UTF-16 might 
 have been merely declared not in use until UTF-16 has been fallen out of 
 use, replaced by UTF-8 and UTF-32.
 
 The deprecation of UTF-16 and UTF-32 as encoding *schemes* (charsets in 
 MIME) is already very advanced. 

UTF-32 is usable for internal use in programs.

 But they will certinaly not likely disappear as encoding *forms* for internal 
 use in binary APIs and in several very popular programming languages: Java, 
 Javascript, even C++ on Windows platforms (where it is the 8-bit interface, 
 based on legacy code pages and with poor support of the UTF-8 encoding 
 scheme as a Windows code page, is the one that is now being phased out), 
 C#, J#…

That is legacy, which may remain for a long time. For example, C/C++ trigraphs are 
only being removed now, having long been just a bother for compiler implementations. Java is 
very old, designed around 32-bit programming with limits on function code size, 
which was a limitation of pre-PPC CPUs that went out of use in the early 1990s.

 UTF-8 will also remain for long as the prefered internal encoding for Python, 
 PHP (even if Python introduced also a 16-bit native datatype).
 
 In all cases, programming languages are not based on any Unicode encoding 
 forms but on more or less opaque streams of code units using datatypes that 
 are not constrained by Unicode (because their character or byte datatype 
 is also used for binary I/O and for supporting also the conversion of various 
 binary structures, including executable code, and also because even this 
 datatype is not necessarily 8-bit but may be larger and not even an even 
 multiple of 8-bits)

Indeed, that is why UTF-8 was invented for use in Unix-like environments.





Re: Surrogates and noncharacters

2015-05-12 Thread Philippe Verdy
2015-05-12 15:56 GMT+02:00 Hans Aberg haber...@telia.com:


 Indeed, that is why UTF-8 was invented for use in Unix-like environments.


Not the main reason: communication protocols and data storage are also
based on 8-bit code units (even if storage groups them into much larger
blocks).

UTF-8 is the default choice for all Internet protocols because all these
protocols are based on these units.

This last remark is true except at the lower levels, on link interfaces and
physical links, where the unit is the bit or sometimes even smaller units
(fractions of bits), grouped into frames that carry not only data bits but
also specific items required by physical constraints, such as maintaining
the mean polarity, restricting the frequency bandwidth, reducing noise in
the side bands, synchronizing clocks for data sampling, reducing power
usage, allowing bandwidth to be adapted by inserting new parallel streams
into the same shared band, allowing the framing format to change when the
signal-to-noise ratio degrades (by using additional signals normally not
used by the regular data stream), or adapting to degradation of the
transport medium or to emergency situations (or sometimes to local legal
requirements) that require reducing usage to leave room for priority
traffic (e.g. air-traffic regulation or military use)...

Whenever the transport medium has to be shared with third parties (as is
the case for infrastructure networks or for radio frequencies in the public
airspace, which may also be shared internationally), or if the medium is
known to degrade slowly in quality (e.g. SSD storage), the transport and
storage protocols never use the whole available bandwidth; they reserve
some regulatory space for specific signaling that may be needed so that
current usage can adapt automatically: the physical format of data streams
can change at any time, and what was initially encoded one way will then be
encoded another way. (Such things also occur extremely locally, for example
on data buses within computers, between the various electronic chips on the
same motherboard, or in optional extensions plugged into it! Electronic
devices are full of bus adapters that have to manage priority between
unpredictable concurrent traffic, under changing environmental conditions
such as the current state of power sources.)

Programmers, however, only see the result at the upper-layer data frames,
where they manage bits; from these they can create streams of bytes, which
are usable by transport protocols and for interchange over a larger network
or computing system.

But for the worldwide network (the Internet), everything is based on 8-bit
bytes, which are the minimal units of information (and also the maximal
units: larger units are not portable, not interoperable over the global
network) in all related protocols (including for negotiating options in
these protocols): UTF-8 is then THE universal encoding that will
interoperate everywhere on the Internet, even if locally (in connected
hosts) other encodings may be used (which ''may'' be more efficiently
processed) after a simple conversion (this does not necessarily require
changing the size of code units used in local protocols and interfaces; for
example there could be some reencoding, or data compression or expansion).
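
A closing sketch of that workflow (Python, with a made-up payload; the
choice of UTF-32 as the local form is only an example): UTF-8 travels on
the wire as 8-bit units, is decoded once on the receiving host, may be held
in whatever internal form is convenient there, and is re-encoded as UTF-8
for interchange without loss.

    wire = "Ünïcode réseau".encode("utf-8")  # hypothetical payload received as 8-bit units
    text = wire.decode("utf-8")              # validate and convert to the local string form
    local = text.encode("utf-32-le")         # e.g. a fixed-width internal representation
    assert local.decode("utf-32-le").encode("utf-8") == wire   # the round trip is lossless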