XML did not start from old versions of SGML, nor from any updated
version of SGML.
When it was created, Unicode was already there and its support in XML
was mandatory from the start, including support for UTF-8 by default.
And it was aligned with the work on HTML4 (and later XHTML), which
already included Unicode support by default and was updated to enforce
the Unicode behavior (notably, it was made clear that to be conforming,
numeric character references could only refer to UCS code points,
independently of the charset used for the document, and that all
charsets had to have a mapping to the UCS).
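
As a minimal illustration (in Python, standard library only; the element
name and sample text are just placeholders), a numeric character
reference resolves to the same UCS code point whatever charset the
document declares:

    import xml.etree.ElementTree as ET

    # &#x4E2D; always means U+4E2D, whatever the declared charset is.
    doc_utf8 = '<?xml version="1.0"?><name>&#x4E2D;&#x6587;</name>'
    doc_latin1 = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>&#x4E2D;&#x6587;</name>'

    assert ET.fromstring(doc_utf8).text == ET.fromstring(doc_latin1).text == "\u4e2d\u6587"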

Now the issue is possibly elsewhere: when a language uses a script or
orthography not based on Unicode because it is still not well supported
or has problems.
- there were problems for Korean in Unicode 1.0 before the merge with
ISO 10646, but Unicode 1.0 has been dead for a long time and no software
today makes any reference to Unicode 1.0;
- there have been problems with the Unicode encoding for Burmese and
Mongolian; they are mostly solved, except for Mongolian, with work still
pending on the behavior of some clusters and the best way to encode the
vowels. This will change soon, and yes, in that case there are problems;
but the change will not come from adopting Unicode or not, but from the
best sequences of Unicode characters to use to represent these clusters:
this is an orthographic change, not a change of encodings. And yes, in
that case it means changing Unicode fonts for other, updated Unicode
fonts; no hacks based on legacy charsets are involved.

Now, there remain languages/scripts not encoded at all (neither in
Unicode nor in any other charset): referring to a legacy ISO charset is
inapplicable there, as no such legacy charset exists. All that can be
done for now in these languages is to use some transliteration (but not
necessarily Latin): Uyghur, for example, is generally written in that
case using Chinese sinograms (with some specific forms in rare cases),
or Arabic (with some additional diacritics and forms). Even if these
forms are not handled in fonts, at least there is a basic orthography
that is readable, the same way that we can substitute some characters
in Latin or remove some diacritics for African languages, or simply not
encode some ligatures by writing digraphs instead (see the sketch after
this paragraph). This is what already happens when these languages are
used in some international documents and forms such as passports: there
is a degraded orthography, but it is still readable and sufficiently
distinctive for practical uses, and isolated text fragments are not the
only source of disambiguation, as there is other contextual information,
including photo and biometric data or unique identifiers, a scanned
handwritten signature, and personal data, including an address, for
identification purposes.
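
As a minimal sketch of that kind of degraded-but-readable fallback (in
Python, standard library only; the sample word is just an example), one
can strip the decomposable diacritics via Unicode normalization:

    import unicodedata

    def strip_diacritics(text):
        # Decompose (NFD), drop combining marks (category Mn), recompose (NFC).
        decomposed = unicodedata.normalize("NFD", text)
        stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
        return unicodedata.normalize("NFC", stripped)

    print(strip_diacritics("Yaoundé"))  # -> Yaounde
    # Letters with no canonical decomposition (e.g. hooked African letters)
    # pass through unchanged and would need an explicit substitution table.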

Anyway, even if there is a preferred orthography, slight deviation from
that orthography is very common and frequently used in public displays
or advertising, and no one is confused. And the "preferred" orthography
is just a matter of choice and is unstable across time, or even space,
when there are competing authorities providing their own local
terminology for some local official uses, and it is not mandatory
everywhere (most languages also have a lot of dialects that may use a
different orthography to render their own local phonology and accents:
not everyone agrees with these preferred forms, even in the same
location, where dialects are also competing). And let's remember that
all modern languages continue to evolve and borrow a lot from other
languages, and new terms are creatively added. Finally, there are
orthographic reforms, but they take a considerable time to be adopted,
or never reach any acceptance, and legacy orthographies remain visible
in a lot of places and publications (plus, people are much more mobile
today and there are widespread communities located around the world
that adapt constantly to their new context and on which the official
reforms have no impact).

So in conclusion, there is no other choice than Unicode today. Unicode
is mandatory in XML, and in OSM. Don't speak about legacy charsets. But
we are just concerned with support in fonts: ALL characters encoded up
to Unicode 9.0 have suitable fonts immediately usable, these fonts are
all free for use, and they are based on TrueType/OpenType. All OSM
rendering software should be able to use TrueType/OpenType fonts. The
only remaining problem is the existence of mobile phones that don't
embed a lot of fonts and support a more limited set. But none of them
use or need any legacy charsets.
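
As a sketch of checking that font coverage (using the third-party
fontTools library; the font file name and the sample string below are
only examples, not a recommendation):

    from fontTools.ttLib import TTFont  # pip install fonttools

    def uncovered_characters(font_path, text):
        # The best available Unicode cmap subtable maps code points to glyphs.
        cmap = TTFont(font_path)["cmap"].getBestCmap()
        return {c for c in text if ord(c) not in cmap}

    # Example: does this Noto font cover a Burmese place name?
    print(uncovered_characters("NotoSansMyanmar-Regular.ttf", "ရန်ကုန်") or "all covered")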


On Thu, 28 Nov 2019 at 15:11, John Whelan <jwhelan0...@gmail.com> wrote:

> The way I would approach this professionally would be to define the
> requirements first.
>
> In this case we have a requirement to display the name in the language of
> choice.
>
> We also have a requirement to be compatible with existing software.
>
> Pragmatically I would recommend changing the name field to use only an 8
> bit Latin alphabet character set recognizing that not all systems can
> handle more complex character sets.  Which precise character set should be
> chosen would be a subject for discussion, but either ISO-8859-1 or Windows-1252
> would be contenders.  My personal preference would be the ISO standard.
>
> Unicode is nice but we managed with 6 bit character sets for many years
> when I started with computers.  Even accented characters were a major
> problem.  Also remember that .OSM data is in XML format and XML came out of
> SGML which was first used to transmit documents over modems so only 7 bits
> were available for encoding characters.  The extended characters use a
> special escape code sequence to hold the unicode characters.
>
> Realistically software never wears out but source code gets lost.
> Compilers and operating systems get updated.  It may not be possible to
> modify existing software to handle unicode characters.  I have a perfectly
> good scanner sitting in the corner that can no longer be used with Win 10
> because of a new and improved driver.  With the OpenStreetMap environment
> there isn't even a way to get a complete list of software that uses the
> OpenStreetMap data so it can be tested.
>
> The local language can be added in a name: tag, then software that can handle
> the local names can pick it up.  Osmand etc. can be configured to use the
> local name transparently so the local population can use it in the language
> of their choice.
>
> This approach would appear to meet the requirements.  The argument that we
> should change all the existing software to meet a requirement that was not
> clearly defined when the software was written doesn't make sense to me.
>
> Cheerio John
>
> Frederik Ramm wrote on 2019-11-28 3:25 AM:
>
> John,
>
> On 28.11.19 01:40, John Whelan wrote:
>
> Is there any reason why name:en could not be used?
>
> The country's official language requires a "non-standard" font to be
> available which does not seem to be a given on all platforms. Like if
> you set up a standard tile server and don't install extra fonts you will
> see little squares instead of place names all over China.
>
> Apparently not all applications are as good in name:xx handling as
> OsmAnd. A recurring point in the discussion is that the proponents of
> using the official language say "we shouldn't fall back to English name
> tags just because some apps/web sites are broken, we should file bug
> reports with them instead", and the proponents of using English say
> "let's be pragmatic, there's no way all these apps/sites will be fixed
> within a short time, so we should use English".
>
> Bye
> Frederik
>
>
>
> --
> Sent from Postbox <https://www.postbox-inc.com>
> _______________________________________________
> HOT mailing list
> HOT@openstreetmap.org
> https://lists.openstreetmap.org/listinfo/hot
>