Martin, you seem to be labouring under the impression that HTML5 is a 
substitute for character encoding. If it is, why do we need Unicode? We could 
just have documents laden with <IMG> tags and restrict ourselves to ASCII.

It seems I need to spell out one more time why HTML is not a character encoding:

1. HTML5 doesn’t separate one particular representation (font, size, etc) from 
the actual meaning of the character. So you can’t paste it somewhere and expect 
to increase its point size or change its font.
2. It’s highly inefficient in space to drop multi-kilobyte strings into a 
document to represent one character.
3. The entire design of HTML has nothing to do with characters. So there is no 
way to process a string of characters interspersed with HTML elements and know 
which of those elements are a “character”. This makes programmatic manipulation 
impossible, and means most computer applications simply will not allow HTML in 
scenarios where they expect a list of “characters”.
4. There is no way to compare two HTML elements and know they are talking about 
the same character. I could put some HTML representation of a character in my 
document, you could put a different one in, and there would be absolutely no 
way to know that they are the same character, even if we are in the same 
community and agree on the existence of this character. (See the short sketch 
after this list.)
5. Similarly, there is no way to search or index HTML elements. If an HTML 
document contained an image of a particular custom character, there would be no 
way to ask Google (or whatever) to find all the documents with that character, 
because different documents would represent it differently. HTML is a rendering 
technology. It makes things LOOK a particular way, without actually ENCODING 
anything about it. The only part of HTML that is actually searchable in a 
deterministic fashion is the part that is encoded - the Unicode part.
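
To make points 3-5 concrete, here is a minimal, purely illustrative Python 
sketch. The file names and markup in it are hypothetical; the point is only 
that a real code point can be compared, searched and counted, while an 
image-as-character cannot.

# Illustrative sketch (hypothetical file names and markup): why an inline
# HTML "character" doesn't behave like an encoded character.

# With a real code point, identity, search and counting are deterministic:
a = "I feel \U0001F4A9 today"      # U+1F4A9 PILE OF POO
b = "I feel \U0001F4A9 today"
print(a == b)                      # True  - same character, trivially comparable
print("\U0001F4A9" in a)           # True  - searchable / indexable
print(len(a))                      # the emoji counts as one code point

# With an image standing in for a character, two documents that "mean" the
# same thing are just different runs of markup:
c = 'I feel <img src="poop_v1.png" alt=""> today'
d = 'I feel <img src="http://example.org/poo.gif"> today'
print(c == d)                      # False - no way to tell they denote the same symbol
print(len(c), len(d))              # dozens of "characters" of markup for one symbol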

Unicode encodes symbols that have “reasonable popularity”. (a) That is not all 
of them. (b) How can a symbol attain reasonable popularity when it is not in 
Unicode? Of course some can, but others have their popularity hindered by the 
very fact that they are not encoded!

Take the poop emoji that people have recently been talking about here. It 
gained popularity because the Japanese telecom companies decided to encode it. 
If they hadn’t encoded it, would it have become popular through normal culture, 
such that the Unicode Consortium would have adopted it? No, it wouldn’t! The 
Japanese telcos were able to do this because they controlled their entire user 
base, from hardware on up to encodings. That won’t be happening in the future, 
so new, interesting and potentially universal emojis won’t ever come into 
existence the way this one did, because of the control the Unicode Consortium 
exercises over this technology. But the problem isn’t restricted to emojis; 
many other potentially popular symbols can’t come into existence either. The 
internet *COULD* be the birthplace of lots of interesting new symbols, in the 
same way that the Japanese telecom companies birthed the original emojis, but 
it won’t be, because the Unicode Consortium rules it from the top down.

Summary: 
1. HTML renders stuff; it encodes nothing. It addresses a completely different 
problem domain. If rendering and encoding were the same problem, Unicode could 
disband now.
2. Unicode encodes stuff, but isn’t extensible in a way that is broadly useful, 
i.e. in a way that allows anybody (or any application) receiving a custom 
character to know what it is, how to render it, or how to combine it with other 
custom character sets.
3. The problem under discussion is not a rendering problem. HTML5 lacks nothing 
in terms of ability to render, yet the problem remains, because it’s an 
encoding problem. Encoding problems are in the Unicode domain, not in the HTML5 
domain.

You say that character encodings work best when they are used widely and 
uniformly.  But they can only be as wide or as uniform as reality itself.  We 
could try to conform reality to technology and… for example… force the whole 
world to use Latin characters and the 128 ASCII code points. OR we can conform 
technology to reality. Not all encodings need to be, or ought to be, so 
universal that they require one worldwide committee to pass judgment on them.



> On 3 Jun 2015, at 11:09 am, Martin J. Dürst <due...@it.aoyama.ac.jp> wrote:
> 
> On 2015/06/03 07:55, Chris wrote:
> 
>> As you point out, "The UCS will not encode characters without a demonstrated 
>> usage.”. But there are use cases for characters that don’t meet UCS’s 
>> criteria for a world wide standard, but are necessary for more specific use 
>> cases, like specialised regional, business, or domain specific situations.
> 
> Unicode contains *a lot* of characters for specialized regional, business, or 
> domain specific situations.

> 
>> My question is, given that unicode can’t realistically (and doesn’t aim to) 
>> encode every possible symbol in the world, why shouldn’t there be an 
>> EXTENSIBLE method for encoding, so that people don’t have to totally 
>> rearchitect their computing universe because they want ONE non-standard 
>> character in their documents?
> 
> As has been explained, there are technologies that allow you to do (more or 
> less) that. Information technology, like many other technologies, works best 
> when finding common cases used by many people. Let's look at some examples:
> 
> Character encodings work best when they are used widely and uniformly. I 
> don't know anybody who actually uses all the characters in Unicode (except 
> the guys that work on the standard itself). So for each individual, a smaller 
> set would be okay. And there were (and are) smaller sets, not for 
> individuals, but for countries, regions, scripts, and so on. Originally (when 
> memory was very limited), these legacy encodings were more efficient overall, 
> but that's no longer the case. So everything is moving towards Unicode.
> 
> Most Website creators don't use all the features in HTML5. So having 
> different subsets for different use cases may seem to be convenient. But 
> overall, it's much more efficient to have one Hypertext Markup Language, so 
> that's were everybody is converging to.
> 
> From your viewpoint, it looks like having something in between character 
> encodings and HTML is what you want. It would only contain the features you 
> need, and nothing more, and would work in all the places you wanted it to 
> work. Asmus's "inline" text may be something similar.
> 
> The problem is that such an intermediate technology only makes sense if it 
> covers the needs of lots and lots of people. It would add a third technology 
> level (between plain text and marked-up text), which would divert energy from 
> the current two levels and make things more complicated.
> 
> Up to now, such as third level hasn't emerged, among else because both 
> existing technologies were good at absorbing the most important use cases 
> from the middle. Unicode continues to encode whatever symbols that gain 
> reasonable popularity, so every time somebody has a "real good use case" for 
> the middle layer with a symbol that isn't yet in Unicode, that use case gets 
> taken away. HTML (or Web technology in general) also worked to improve the 
> situation, with technologies such as SVG and Web Fonts.
> 
> No technology is perfect, and so there are still some gaps between character 
> encoding and markup, some of which may in due time eventually be filled up, 
> but I don't think a third layer in the middle will emerge soon.
> 
> Regards,   Martin.

