At 10:52 PM 25/11/2018 -0700, you wrote:

>> Then adds a plain ASCII space 0x20 just to be sure.
>
>I don't think it's adding a plain ASCII space 0x20 just to be sure. 
>Looking at the source of the message, I see =C2=A0, which is the UTF-8 
>representation followed by the space.  My MUA that understands UTF-8 
>shows that "=C2=A0 " translates to "  ".  Further, "=C2=A0 =C2=A0" 
>translates to "   ".

I was speaking poetically. Perhaps "the mail software he uses was
written by morons" is clearer.
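For what it's worth, the decoding is easy to verify mechanically. A quick Python sketch (the quoted-printable fragment is copied straight from the message source discussed above):

```python
import quopri

# Decode the quoted-printable fragment seen in the raw message source.
raw = quopri.decodestring(b"=C2=A0 =C2=A0")

# The byte pair C2 A0 is the UTF-8 encoding of U+00A0 NO-BREAK SPACE,
# so the decoded text is NBSP, ordinary space, NBSP.
text = raw.decode("utf-8")
print([hex(ord(c)) for c in text])  # ['0xa0', '0x20', '0xa0']
```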

>Some of the reading that I did indicates that many things, HTML 
>included, use white space compaction (by default), which means that 
>multiple white space characters are reduced to a single white space 
>character.

Oh yes, tell me about the html 'there is no such thing as hard formatting
and you can't have any even when you want it' concept. Thank you, Tim
Berners-Lee.
  http://everist.org/NobLog/20130904_Retarded_ideas_in_comp_sci.htm
  http://everist.org/NobLog/20140427_p-term_is_retarded.htm

>  So, when Ed wants multiple white spaces, his MUA has to do 
>something to state that two consecutive spaces can't be compacted. 
>Hence the non-breaking space.

Except that 'non-breaking space' is mostly about inhibiting line wrap at
that word gap. But anyway, there's little point trying to psychoanalyze
the writers of that software. Probably involved pointy-headed bosses.


>As stated in another reply, I don't think ASCII was ever trying to be 
>the Babel fish.  (Thank you Douglas Adams.)

Of course not. It was for American English only. This is one of the major
points of failure in the history of information processing.

>> Takeaway: Ed, one space is enough. I don't know how you got the idea 
>> people might miss seeing a single space, and so you need to type two or 
>> more.
>
>I wondered if it wasn't a typo or keyboard sensitivity issue.  I 
>remember I had to really slow down the double click speed for my grandpa 
>(R.I.P.) so that he could use the mouse.  Maybe some users actuate keys 
>slowly enough that the computer thinks that it's repeated keys.  ¯\_(ツ)_/¯

Well now he's flaunting it in his latest posts. Never mind. :)

>> And since plain ASCII is hard-formatted, extra spaces are NOT ignored 
>> and make for wider spacing between words.
>
>It seems as if you made an assumption.  Just because the underlying 
>character set is ASCII (per RFC 821 & 822, et al) does not mean that the 
>data that they are carrying is also ASCII.  As is evident by the 
>Content-Type: header stating the character set of UTF-8.

Carrying extended Unicode characters via UTF-8 doesn't make it a
non-hard-formatted medium. In ASCII a space is a space, and multiple spaces
DON'T collapse. White space collapse is a feature of html, and whether an
email is html or not is determined by the sending utility.
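The difference is simple to demonstrate. A rough Python sketch (the regex is only an approximation of an html renderer's collapsing behaviour, which is really governed by CSS white-space rules):

```python
import re

plain = "Which  looks    very       odd"

# A plain-text viewer renders every byte; runs of spaces stay put.
assert "    " in plain

# An html renderer collapses runs of whitespace to a single space
# (approximately; the real rules live in CSS 'white-space' handling).
collapsed = re.sub(r"\s+", " ", plain)
print(collapsed)  # Which looks very odd
```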


>Especially when textual white space compression does exactly that, 
>ignore extra white spaces.
>
>> Which  looks    very       odd, even if your mail utility didn't try to 
>> do something 'special' with your unusual user input.

As you see, this IS NOT HTML, since those extra spaces and your diagram
below would have collapsed if it was html. Also, saving it as text and
opening it in a plain text editor or hex editor absolutely reveals what
it is.


>I frequently use multiple spaces with ASCII diagrams.
>
>+------+
>| This |
>|  is  |
>|   a  |
>|  box |
>+------+


>> Btw, I changed the subject line, because this is a wider topic. I've been 
>> meaning to start a conversation about the original evolution of ASCII, 
>> and various extensions. Related to a side project of mine.
>
>I'm curious to know more about your side project.

Hmm... the problem is it's intended to be serious, but is still far from
exposure-ready. So if I talk about it now, I risk having specific terms
I've coined in the doco (including the project name) getting meme-jammed
or trademarked by others. The plan is to release it all in one go,
eventually. Definitely will be years before that happens, if ever.

However, here's a cut-n-paste (in plain text) of a section of the
Introduction (html with diags.)
----------
Almost always, a first attempt at some unfamiliar, complex task produces a less 
than optimal result. Only with the knowledge gained from actually doing a new 
thing can one look back and see the mistakes made. It usually takes at least 
one more cycle of doing it over from scratch to produce something that is 
optimal for the needs of the situation. Sometimes, especially where deep and 
subtle conceptual innovations are involved, it takes many iterations.

Human development of computing science (including information coding schemes) 
has been effectively a 'first time effort', since we kept on developing new 
stuff built on top of earlier work. We almost never went back to the roots and 
rebuilt everything, applying insights gained from the many mistakes made.

In reviewing the evolution of information coding schemes since very early 
stages such as the Morse code, telegraph signal recording, typewriters, etc, 
through early computing systems, mass data storage and file systems, computer 
languages from Assembler through Compilers and Interpreters, and so on, several 
points can be identified at which early (inadequate) concepts became embedded 
then used as foundations for further developments. This made the original 
concepts seem like fundamentals, difficult to question (because they are 
underlying principles for much more complex later work), and virtually 
impossible to alter (due to the vast amounts of code dependent on them.)

And yet, when viewed in hindsight many of the early concepts are seriously 
flawed. They effectively hobbled all later work dependent on them.

Examples of these pivotal conceptual errors:

Defects in the ASCII code table. This was a great improvement at the time, but 
fails to implement several utterly essential concepts. The lack of these 
concepts in the character coding scheme underlying virtually all information 
processing since the 1960s was unfortunate. Just one (of many) bad 
consequences has been the proliferation of 'patch-up' text coding schemes such 
as proprietary document formats (e.g. MS Word), postscript, pdf, html (and its 
even more nutty academia-gone-mad variants like XML), UTF-8, Unicode and so on.

        [pic]

This is a scan from the 'Recommended USA Standard Code for Information 
Interchange (USASCII) X3.4 - 1967'. The hex digits A-F on rows 10-15 were 
added here; hexadecimal notation was not commonly in use in the 1960s.
Fig. ___ The original ASCII definition table.

ASCII's limitations were so severe that even the text (i.e. ASCII) program code 
source files used by programmers to develop literally everything else in 
computing science had major shortcomings and inconveniences.

A few specific examples of ASCII's flaws:

    Missing concept of control vs data channel separation. And so we needed the 
"< >" syntax of html, etc.
    Inability to embed meta-data about the text in standard programmatically 
accessible form.
    Absence of anything related to text adornments, i.e. italics, underline and 
bold. The most basic essentials of expressive text, completely ignored.
    Absence of any provision for creative typography. No awareness of fonts, 
type sizes, kerning, etc.
    Lack of logical 'new line', 'new paragraph' and 'new page' codes.
    Inadequate support of basic formatting elements such as tabular columns, 
text blocks, etc.
    Even the extremely fundamental and essential concept of 'tab columns' is 
improperly implemented in ASCII, hence almost completely dysfunctional.
    No concept of general extensible-typed functional blocks within text, with 
the necessary opening and closing delimiters.
    Missing symmetry of quote characters. (A consequence of the absence of 
typed functional blocks.)
    No provision for code commenting. Hence the gaggle of comment delimiting 
styles in every coding language since. (Another consequence of the absence of 
typed functional blocks.)
    No awareness of programmatic operations such as Inclusion, Variable 
substitution, Macros, Indirection, Introspection, Linking, Selection, etc.
    No facility for embedding of multi-byte character and binary code sequences.
    Missing an informational equivalent to the pure 'zero' symbol of number 
systems. A specific "There is no information here" symbol. (The NUL symbol has 
other meanings.) This lack has very profound implications.
    No facility to embed multiple data object types within text streams.
    No facility to correlate coded text elements to associated visual 
typographical elements within digital images, AV files, and other 
representational constructs. This has crippled efforts to digitize the cultural 
heritage of humankind.
    Non-configurable geometry of text flow, when representing the text in 2D 
planes. (Or 3D space for that matter.)
    Many of the 32 'control codes' (characters 0x00 to 0x1F) were allocated to 
hardware-specific uses that have since become obsolete and fallen into disuse. 
Leaving those codes as a wasted resource.
    ASCII defined only a 7-bit (128 codes) space, rather than the full 8-bit 
(256 codes) space available with byte sized architectures. This left the 
'upper' 128 code page open to multiple chaotic, conflicting usage 
interpretations. For example the IBM PC code page symbol sets (multiple 
languages and graphics symbols, in pre-Unicode days) and the UTF-8 character 
bit-size extensions.
    Inability to create files which encapsulate the entirety of the visual 
appearance of the physical object or text which the file represents, without 
dependence on any external information. Even plain ASCII text files depend on 
the external definition of the character glyphs that the character codes 
represent. This can be a problem if files are intended to serve as long term 
historical records, potentially for geological timescales. This problem became 
much worse with the advent of the vast Unicode glyph set, and typeset formats 
such as PDF. The PDF 'archival' format (in which all referenced fonts must be 
defined in the file) is a step in the right direction — except that format 
standard is still proprietary and not available for free. 
----------
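Incidentally, the 'upper 128' bit-size extension point in the excerpt is easy to see in the raw bytes. A small Python illustration (the NBSP example ties back to the =C2=A0 business earlier in this thread):

```python
# 7-bit ASCII: every code point fits in one byte with the top bit clear.
assert all(b < 0x80 for b in "plain ASCII".encode("utf-8"))

# UTF-8 repurposes the 'upper' 128 byte values as lead/continuation
# bytes, so a single non-ASCII character becomes a multi-byte sequence.
nbsp = "\u00a0".encode("utf-8")
print(nbsp.hex())                 # c2a0  -- the =C2=A0 from the message source
print([b >= 0x80 for b in nbsp])  # [True, True]
```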

Sorry to be a tease.

Soon I'd like to have a discussion about the functional evolution of
the various ASCII control codes, and how they are used (or disused) now. 
But am a bit too busy atm to give it adequate attention.

Guy



