2015-05-12 15:56 GMT+02:00 Hans Aberg <[email protected]>:
>
> Indeed, that is why UTF-8 was invented for use in Unix-like environments.
>
Not the main reason: communication protocols and data storage are also based on 8-bit code units (even if storage groups them into much larger blocks). UTF-8 is the default choice for all Internet protocols because all of these protocols are based on those units.

This last remark is true except at the lower levels, on link interfaces and physical links, where the unit is the bit or sometimes even smaller units of fractions of bits, grouped into frames that transport not only data bits but also specific items required by physical constraints: maintaining the mean polarity, restricting the frequency bandwidth, reducing noise in the side bands, synchronizing clocks for data sampling, reducing power usage, allowing the bandwidth to adapt by inserting new parallel streams into the same shared band, allowing the framing format to change when the signal-to-noise ratio degrades (by using additional signals not normally used by the regular data stream), or adapting to degradation of the transport medium or to emergency situations (or sometimes to local legal requirements) that require reducing usage to leave room for priority traffic (e.g. air-traffic regulation or military use)...

Whenever the transport medium has to be shared with third parties (as with infrastructure networks, or with radio frequencies in public airspace, which may also be shared internationally), or when the medium is known to degrade slowly (e.g. SSD storage), the transport and storage protocols never use the whole available bandwidth: they reserve some regulatory space for the specific signaling that may be needed to let the current usage adapt itself. The physical format of data streams can change at any time, and what was initially encoded one way will later be encoded another way. (Such things also happen very locally, for example on data buses within computers, between the various electronic chips on the same motherboard or plugged into it as optional extensions! Electronic devices are full of bus adapters that have to arbitrate between unpredictable concurrent traffic and changing environmental conditions such as the current state of the power sources.)

Programmers, however, only see the result at the upper-layer data frames, where they handle bits; from these they can build streams of bytes, usable by transport protocols and for interchange over a larger network or computing system. But for the worldwide network (the Internet), everything is based on 8-bit bytes, which are the minimal units of information (and also the maximal ones: larger units are not portable or interoperable over the global network) in all related protocols, including when negotiating options in those protocols. UTF-8 is therefore THE universal encoding that will interoperate everywhere on the Internet, even if locally (in connected hosts) another encoding may be used (which ''may'' be processed more efficiently) after a simple conversion (this does not necessarily require changing the size of the code units used in local protocols and interfaces; for example there could be some re-encoding, or data compression or expansion).
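
To make that last point concrete, here is a minimal Python sketch (mine, not from the thread; the function names are purely illustrative) of the "simple conversion" between the UTF-8 byte stream used on the wire and a host-local encoding such as UTF-16, assuming the host prefers 16-bit code units internally:

    # Illustrative sketch: converting between the UTF-8 byte stream used on
    # the wire and a hypothetical host-local 16-bit encoding (UTF-16-LE).

    def receive_text(wire_bytes: bytes) -> str:
        # Internet protocols exchange 8-bit code units; decode them as UTF-8.
        return wire_bytes.decode("utf-8")

    def send_text(text: str) -> bytes:
        # Re-encode to UTF-8 before putting the data back on the wire.
        return text.encode("utf-8")

    def to_local_utf16(text: str) -> bytes:
        # A host may prefer 16-bit code units internally; the conversion is
        # lossless and does not change the text itself.
        return text.encode("utf-16-le")

    # The same text round-trips between the wire format and the local
    # 16-bit representation without loss.
    wire = "héllo".encode("utf-8")
    assert to_local_utf16(receive_text(wire)).decode("utf-16-le") == "héllo"
    assert send_text(receive_text(wire)) == wire

The point of the sketch is only that the choice of local code-unit size is an implementation detail of the host: what crosses the network stays a stream of 8-bit UTF-8 code units.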

