Re: Surrogates and noncharacters
Even if UTF-8 initially started as part of some Unix standardization process, it was for the purpose of allowing interchange across systems. The networking concept was already there (otherwise it would not have been part of the emerging *nix standardization processes, and would have remained a proprietary encoding in local systems). At the same time, the Internet was about to emerge as a worldwide network, but it was still very limited and full of restrictions, accessible only from a few (very costly) gateways in other countries, and not even with the IP protocol but with many specific protocols (maybe you remember the time of CompuServe, billed only in US dollars and only via international payments with costly bank processing fees; you also had to call an international phone number before a few national phone numbers appeared, operated jointly by CompuServe and some national or regional services). At that time the telcos were not even interested in participating, and all wanted to develop their own national or regional networks with their own protocols and national standards; real competition in telecommunications only started just before Y2K, with the deregulation in North America and some parts of Europe (in fact just in the EEA), before progressively going worldwide when the initial competitors started to restructure, split, or merge and to align their too many technical standards on a common interoperable one that would work in all their new local branches. In fact the worldwide Internet would not have become THE global network without the reorganisation of the older deregulated national telcos and the end of their monopolies.

The development of the Internet and the development of the UCS were then made completely in parallel. Both appeared to replace former national standards in the same domains previously operated by the former telecommunications monopolies (domains that also needed computing and data standards, not just networking standards). In the early days of the Internet, the IP protocol was still not really accepted as the universal internetworking protocol (other competitors were also proposed by private companies, notably Token Ring by IBM, and the X.21/X.25 family promoted essentially by European telcos, which preferred real-time protocols with guaranteed/reserved bandwidth and switching by packets instead of by frames of variable sizes). Even today there are some remaining parts of the X.* network family, but only for short-distance private links: e.g. ATM (in xDSL technologies), local buses within electronic devices (under the 1-meter limit), or some mission-critical uses (real-time constraints for networking equipment in aircraft, which has its own standards, with a few of them developed recently as adaptations of Internet technologies over channels in a real-time network, generally structured not as a mesh but with a star topology and dedicated bandwidths).

If you want to look for remaining text encoding standards that are still not based on the UCS, look into aircraft technologies and military equipment (there's also the GSM family of protocols, which continues to keep many legacy proprietary standards, with poor adaptation to Internet technologies and the UCS...)
The situation is starting to change now in aircraft/military technology too (first at Airbus in Europe, now also adopted by its major US competitors) and in mobile networks (4G), with the full integration of the IEEE Ethernet standard, which allows a more natural and straightforward integration of IP protocols and the UCS standards with it (even if compatibility is kept by reserving a space for former protocols, something that the IEEE Ethernet standard has already facilitated for the Internet we know now, both in worldwide communications and in private LANs)...

2015-05-12 17:58 GMT+02:00 Hans Aberg haber...@telia.com:

On 12 May 2015, at 16:50, Philippe Verdy verd...@wanadoo.fr wrote:
Indeed, that is why UTF-8 was invented for use in Unix-like environments.

Not the main reason: communication protocols and data storage are also based on 8-bit code units (even if storage groups them into much larger blocks).

There is some history here:
https://en.wikipedia.org/wiki/UTF-8#History
Re: Surrogates and noncharacters
Hans Aberg haber...@telia.com wrote:
| On 12 May 2015, at 16:50, Philippe Verdy verd...@wanadoo.fr wrote:
| Indeed, that is why UTF-8 was invented for use in Unix-like environments.
|
| Not the main reason: communication protocols, and data storage \
| is also based on 8-bit code units (even if storage groups \
| them by much larger blocks).
|
| There is some history here:
| https://en.wikipedia.org/wiki/UTF-8#History

What happened was this:
http://doc.cat-v.org/bell_labs/utf-8_history

--steffen
FYI: The world’s languages, in 7 maps and charts
http://www.washingtonpost.com/blogs/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/
Re: Surrogates and noncharacters
On 12 May 2015, at 16:50, Philippe Verdy verd...@wanadoo.fr wrote:

Indeed, that is why UTF-8 was invented for use in Unix-like environments.

Not the main reason: communication protocols and data storage are also based on 8-bit code units (even if storage groups them into much larger blocks).

There is some history here:
https://en.wikipedia.org/wiki/UTF-8#History
Re: FYI: The world’s languages, in 7 maps and charts
On 05/12/2015 03:05 PM, Mark Davis ☕️ wrote:
http://www.washingtonpost.com/blogs/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/

And a critique:
http://languagelog.ldc.upenn.edu/nll/?p=18844
Re: FYI: The world’s languages, in 7 maps and charts
And a tangent, picking up on a complaint that Swahili wasn't represented on one of the 7 WaPost graphics:
http://niamey.blogspot.com/2015/05/how-many-people-speak-what-in-africa.html

Two other recent posts on this blog (Beyond Niamey) critique the Africa part of a set of graphics/maps of "Second Most Spoken Languages Worldwide" (on the Olivet Nazarene University site) - another thought-provoking effort that could inform better if redone.

Don Osborn
Re: Surrogates and noncharacters
2015-05-11 23:53 GMT+02:00 Hans Aberg haber...@telia.com:

It is perfectly fine considering the Unicode code points as abstract integers, with UTF-32 and UTF-8 encodings that translate them into byte sequences in a computer. The code points that conflict with UTF-16 might have been merely declared not in use until UTF-16 had fallen out of use, replaced by UTF-8 and UTF-32.

The deprecation of UTF-16 and UTF-32 as encoding *schemes* (charsets in MIME) is already very advanced. But they will certainly not disappear as encoding *forms* for internal use in binary APIs and in several very popular programming languages: Java, JavaScript, even C++ on Windows platforms (where the 8-bit interface, based on legacy code pages and with poor support of the UTF-8 encoding scheme as a Windows code page, is the one now being phased out), C#, J#... UTF-8 will also remain for a long time the preferred internal encoding for Python and PHP (even if Python also introduced a 16-bit native datatype).

In all cases, programming languages are not based on any Unicode encoding form but on more or less opaque streams of code units, using datatypes that are not constrained by Unicode (because their character or byte datatype is also used for binary I/O and for converting various binary structures, including executable code, and also because this datatype is not necessarily 8-bit: it may be larger, and not even a multiple of 8 bits).

One is going to check that the code points are valid Unicode values somewhere, so it is hard to see the point of restricting UTF-8 to align it with UTF-16.

What I meant when starting the discussion in this thread was just to obsolete the unnecessary definitions of x-bit strings in TUS. The standard does not need these definitions, and if we want it to be really open to various architectures, languages, and protocols, all that is needed is the definition of the code units specific to each standard UTF (encoding form, or encoding scheme when splitting code units into smaller code units and ordering them), determining only this order and the minimum set of distinct values that these code units must support: we should not speak about bits, just about sets of distinct elements with a sufficient cardinality.

So let's just speak about UTF-8 code units, UTF-16 code units, UTF-32 code units (not just "code units", and not even "Unicode code units", which is also nonsense given the existence of standardized compression schemes defining their own XXX code units). If the expression "16-bit code units" has been used, it is purely for internal use as a shortcut for the complete name, and these shortcuts are not part of the external entities to standardize (they are not precise enough and cannot be used safely out of their local context): consider these definitions as private ones (same meaning as in OOP), boxed as internals to TUS seen as a black box. It is not the focus of TUS to discuss what strings are: that is just the matter of each integration platform that wants to use TUS.

In summary, the definitions in TUS should be split in two parts: those that are public and needed by external references (in other standards), and those that are private (many of which do not even have to be in the generic section of the standard; they should be listed in the appropriate sections needing them locally, also clearly separating the public and private interfaces).

In all cases, the public interfaces must define precise and unambiguous terms, bound to the standard or the section of the standard defining them, even if later within that section a shortcut is used as a convenience (to make the text easier to read). We need scopes for these definitions (and shorter aliases must be made private).
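[As a concrete illustration of the distinction being drawn here, this is a minimal sketch, not part of the original message; it is written in Python, and the choice of U+1F600 as the sample code point is arbitrary. It shows how one code point becomes a different number of code units, of different widths, in each standard encoding form.]

    # Illustrative only: one code point, three encoding forms, three kinds of code units.
    import struct

    cp = 0x1F600  # an arbitrary supplementary-plane code point

    utf8_units = list(chr(cp).encode('utf-8'))                       # 8-bit UTF-8 code units
    utf16_units = struct.unpack('<2H', chr(cp).encode('utf-16-le'))  # 16-bit UTF-16 code units (a surrogate pair)
    utf32_units = struct.unpack('<I', chr(cp).encode('utf-32-le'))   # a single 32-bit UTF-32 code unit

    print([hex(u) for u in utf8_units])   # ['0xf0', '0x9f', '0x98', '0x80']
    print([hex(u) for u in utf16_units])  # ['0xd83d', '0xde00']
    print([hex(u) for u in utf32_units])  # ['0x1f600']

[Nothing in this sketch depends on the width of the host's native character type; it only exercises the per-UTF code unit sequences, which is the point of naming them "UTF-8 code units", "UTF-16 code units", and "UTF-32 code units" rather than "n-bit strings".]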
Re: Surrogates and noncharacters
On 12 May 2015, at 15:45, Philippe Verdy verd...@wanadoo.fr wrote:

2015-05-11 23:53 GMT+02:00 Hans Aberg haber...@telia.com:
It is perfectly fine considering the Unicode code points as abstract integers, with UTF-32 and UTF-8 encodings that translate them into byte sequences in a computer. The code points that conflict with UTF-16 might have been merely declared not in use until UTF-16 had fallen out of use, replaced by UTF-8 and UTF-32.

The deprecation of UTF-16 and UTF-32 as encoding *schemes* (charsets in MIME) is already very advanced.

UTF-32 is usable for internal use in programs.

But they will certainly not disappear as encoding *forms* for internal use in binary APIs and in several very popular programming languages: Java, JavaScript, even C++ on Windows platforms (where the 8-bit interface, based on legacy code pages and with poor support of the UTF-8 encoding scheme as a Windows code page, is the one now being phased out), C#, J#…

That is legacy, which may remain for a long time. For example, C/C++ trigraphs are only being removed now, having long been just a bother for compiler implementation. Java is very old, designed around 32-bit programming with limits on function code size, a limitation of pre-PPC CPUs that went out of use in the early 1990s.

UTF-8 will also remain for a long time the preferred internal encoding for Python and PHP (even if Python also introduced a 16-bit native datatype). In all cases, programming languages are not based on any Unicode encoding form but on more or less opaque streams of code units, using datatypes that are not constrained by Unicode (because their character or byte datatype is also used for binary I/O and for converting various binary structures, including executable code, and also because this datatype is not necessarily 8-bit: it may be larger, and not even a multiple of 8 bits).

Indeed, that is why UTF-8 was invented for use in Unix-like environments.
Re: Surrogates and noncharacters
2015-05-12 15:56 GMT+02:00 Hans Aberg haber...@telia.com:
Indeed, that is why UTF-8 was invented for use in Unix-like environments.

Not the main reason: communication protocols and data storage are also based on 8-bit code units (even if storage groups them into much larger blocks). UTF-8 is the default choice for all Internet protocols because all these protocols are based on these units.

This last remark is true except at the lower levels, on the link interfaces and on physical links, where the unit is the bit or sometimes even smaller units (fractions of bits), grouped into frames that transport not only data bits but also specific items required by the physical constraints, such as maintaining the mean polarity, restricting the frequency bandwidth, reducing noise in lateral bands, synchronizing clocks for data sampling, reducing power usage, allowing adaptation of bandwidth by insertion of new parallel streams in the same shared band, allowing the framing format to change when the signal-to-noise ratio degrades (by using additional signals normally not used by the normal data stream), or adapting to degradation of the transport medium, or to emergency situations (or sometimes local legal requirements) that require reducing usage to leave space for priority traffic (e.g. air-traffic regulation or military use)...

Each time the transport medium has to be shared with third parties (as is the case for infrastructure networks, or for radio frequencies in the public airspace, which may also be shared internationally), or if the medium is known to have slowly degrading quality (e.g. SSD storage), the transport and storage protocols never use the whole available bandwidth and reserve some regulatory space for specific signaling that may be needed to let the current usage adapt automatically: the physical format of data streams can change at any time, and what was initially encoded one way will then be encoded another way. (Such things also occur extremely locally, for example on data buses within computers, between the various electronic chips on the same motherboard or on optional extensions plugged into it! Electronic devices are full of bus adapters that have to manage the priority between concurrent, unpredictable traffic, under changing environmental conditions such as the current state of power sources.)

Programmers, however, only see the result in the upper-layer data frames, where they manage bits and can then create streams of bytes usable by transport protocols and for interchange over a larger network or computing system. But for the worldwide network (the Internet), everything is based on 8-bit bytes, which are the minimal units of information (and also the maximal units: larger units are not portable, not interoperable over the global network) in all the related protocols (including for negotiating options in these protocols). UTF-8 is then THE universal encoding that will interoperate everywhere on the Internet, even if locally (in connected hosts) other encodings may be used (which *may* be processed more efficiently) after a simple conversion (this does not necessarily require changing the size of the code units used in local protocols and interfaces; for example there could be some re-encoding, or data compression or expansion).
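[To illustrate that last point, here is a small sketch of my own, again in Python; the function names to_wire/from_wire are made up for the example. Whatever code units a host uses internally, the interchange form on the network is an 8-bit UTF-8 byte stream, and the boundary conversion is trivial.]

    # Hypothetical boundary conversion: internal text -> UTF-8 bytes on the wire, and back.
    def to_wire(text: str) -> bytes:
        # The interchange form for Internet protocols is the 8-bit UTF-8 byte stream.
        return text.encode('utf-8')

    def from_wire(payload: bytes) -> str:
        # Strict decoding: reject byte sequences that are not well-formed UTF-8.
        return payload.decode('utf-8', errors='strict')

    message = "réseau 網絡 🌍"
    assert from_wire(to_wire(message)) == message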