XML was never started from scratch based on old versions of SGML or any updated version of SGML. When it was created, Unicode already existed and its support was mandatory in XML from the start, including support for UTF-8 by default. It also built on the earlier work on XHTML, which already included Unicode support by default, and on the then-current development of HTML4, which was also updated to enforce this behavior for Unicode (notably, it was made clear that to be conforming, numeric character references could only refer to UCS code points, independently of the charset used for the document, and that all charsets had to have a mapping to the UCS).
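That rule is easy to check with any conforming XML parser: a numeric character reference resolves to the same UCS code point whatever charset the document itself declares. A minimal sketch with Python's standard-library parser (the document content here is made up for illustration):

```python
import xml.etree.ElementTree as ET

# An XML document declared in a legacy charset (ISO-8859-1). Per the XML
# spec, the numeric character reference &#x4E2D; must still resolve to the
# UCS code point U+4E2D, independently of the document's own encoding.
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>&#x4E2D;</name>'
root = ET.fromstring(doc)
assert root.text == "\u4e2d"  # the sinogram U+4E2D, not a Latin-1 byte
```

The same parse with no encoding declaration at all would default to UTF-8, as the XML specification requires.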
Now the issue is possibly elsewhere: when a language uses a script or orthography not based on Unicode because it is still not well supported or has problems.
- There were problems for Korean in Unicode 1.0 before the merger with ISO 10646, but Unicode 1.0 has been dead for a long time and no software today makes any reference to it.
- There have been problems with the Unicode encoding of Burmese and Mongolian. They are mostly solved, except for Mongolian, where work is still pending on the behavior of some clusters and the best way to encode the vowels. This will soon change, and yes, in that case there are problems; but the change will not come from adopting Unicode or not, but from choosing the best sequences of Unicode characters to represent these clusters. This is an orthographic change, not a change of encodings; it does mean replacing some Unicode fonts with other, updated Unicode fonts, but no hack based on legacy charsets is involved.
Then there remain languages/scripts not encoded at all (not in Unicode, and not in any other charset either): making a reference to a legacy ISO charset is inapplicable, as there is no such legacy charset.
All that can be done for now in these languages is to use some transliteration (not necessarily Latin): Uyghur, for example, is generally written in that case using Chinese sinograms (with some specific forms in rare cases), or Arabic (with some additional diacritics and forms; but if these forms are not handled in fonts, at least there's a basic orthography that is readable, the same way that we can substitute some characters in Latin, remove some diacritics for African languages, or simply not encode some ligatures by writing digraphs instead). This is what already happens when these languages are used in some international documents and forms such as passports: there's a degraded orthography, but it is still readable and sufficiently distinctive for practical uses, and isolated text fragments are not the only source of disambiguation, as there is other contextual information, including photo and biometric data, unique identifiers, a scanned handwritten signature, plus personal data such as an address for identification purposes.

Anyway, even if there's a preferred orthography, slight deviations of orthography are very common and frequently used in public displays or advertising, and no one is confused. And the "preferred" orthography is just a matter of choice and is unstable across time, or even space, when there are competing authorities providing their own local terminology for some local official uses, not mandatory everywhere (and most languages also have lots of dialects that may use a different orthography to render their own local phonology and accents: not everyone agrees with these preferred forms, even in the same location, where dialects also compete). And let's remember that all modern languages continue to evolve, borrow a lot from other languages, and creatively add new terms.
Finally, there are orthographic reforms, but they take a considerable time to be adopted, or never reach any acceptance, and legacy orthographies remain visible in lots of places and publications (plus, people are much more mobile today, and there are widespread communities around the world that constantly adapt to their new context and on which the official reforms have no impact).

So in conclusion, there's no other choice than Unicode today. Unicode is mandatory in XML, and in OSM. Don't speak about legacy charsets. We are just concerned by support in fonts: ALL characters encoded up to Unicode 9.0 have suitable fonts immediately usable; these fonts are all free for use and based on TrueType/OpenType. All OSM rendering software should be able to use TrueType/OpenType fonts. The only remaining problem is the existence of mobile phones that don't embed many fonts and support a more limited set. But none of them use or need any legacy charsets.

On Thu, Nov 28, 2019 at 15:11, John Whelan <jwhelan0...@gmail.com> wrote:

> The way I would approach this professionally would be to define the requirements first.
>
> In this case we have a requirement to display the name in the language of choice.
>
> We also have a requirement to be compatible with existing software.
>
> Pragmatically I would recommend changing the name field to use only an 8 bit Latin alphabet character set, recognizing that not all systems can handle more complex character sets. Which precise character set should be chosen would be a subject for discussion, but either ISO-8859-1 or Windows-1252 would be contenders. My personal preference would be the ISO standard.
>
> Unicode is nice but we managed with 6 bit character sets for many years when I started with computers. Even accented characters were a major problem.
> Also remember that .OSM data is in XML format, and XML came out of SGML, which was first used to transmit documents over modems, so only 7 bits were available for encoding characters. The extended characters use a special escape code sequence to hold the Unicode characters.
>
> Realistically, software never wears out but source code gets lost. Compilers and operating systems get updated. It may not be possible to modify existing software to handle Unicode characters. I have a perfectly good scanner sitting in the corner that no longer can be used with Win 10 because of a new and improved driver. With the OpenStreetMap environment there isn't even a way to get a complete list of software that uses the OpenStreetMap data so it can be tested.
>
> The local language can be added in a name: tag; then software that can handle the local names can pick it up. Osmand etc. can be configured to use the local name transparently so the local population can use it in the language of their choice.
>
> This approach would appear to meet the requirements. The argument that we should change all the existing software to meet a requirement that was not clearly defined when the software was written doesn't make sense to me.
>
> Cheerio John
>
> Frederik Ramm wrote on 2019-11-28 3:25 AM:
>
> John,
>
> On 28.11.19 01:40, John Whelan wrote:
>
> Is there any reason why name:en could not be used?
>
> The country's official language requires a "non-standard" font to be available, which does not seem to be a given on all platforms. Like if you set up a standard tile server and don't install extra fonts, you will see little squares instead of place names all over China.
>
> Apparently not all applications are as good in name:xx handling as OsmAnd.
> A recurring point in the discussion is that the proponents of using the official language say "we shouldn't fall back to English name tags just because some apps/web sites are broken, we should file bug reports with them instead", and the proponents of using English say "let's be pragmatic, there's no way all these apps/sites will be fixed within a short time, so we should use English".
>
> Bye
> Frederik
>
> --
> Sent from Postbox <https://www.postbox-inc.com>
> _______________________________________________
> HOT mailing list
> HOT@openstreetmap.org
> https://lists.openstreetmap.org/listinfo/hot
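For concreteness, the name:xx approach discussed in the quoted messages looks like this in OSM's XML data format. The node below is a hypothetical sketch (the id and coordinates are made up): both the local-script name and the English fallback are carried as plain UTF-8 in the same document, with no legacy charset anywhere.

```python
import xml.etree.ElementTree as ET

# A hypothetical OSM node carrying a local-script name plus a name:en
# fallback; any consumer that ignores name:en still reads well-formed UTF-8.
osm = """<osm version="0.6">
  <node id="1" lat="39.9" lon="116.4">
    <tag k="name" v="北京"/>
    <tag k="name:en" v="Beijing"/>
  </node>
</osm>"""
root = ET.fromstring(osm)
tags = {t.get("k"): t.get("v") for t in root.iter("tag")}
assert tags["name"] == "北京" and tags["name:en"] == "Beijing"
```

A renderer like OsmAnd can then prefer name:xx for the user's configured language and fall back to name, exactly as described above.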