At 11:05 AM -0800 7/2/00, John Hudson wrote:
>At 09:16 AM 7/2/00 -0800, Doug Ewell wrote:
>
> >The problem with the phrase "plain text ceases to be plain if you decide
> >that layout information needs to be encoded" is the word "layout."  In
> >the broadest sense, line and paragraph separation could be considered
> >"layout," and nobody would suggest doing away with the plain-text
> >characters needed to control those functions.

The problem with the phrase "plain text" is that it is a polite 
fiction. ASCII characters, printing and non-printing, originated as 
commands to printers. What we originally called plain text files are 
those that would give reasonable results when printed on an ASCII 
teleprinter used as a terminal. The mechanical functions of Teletypes 
defined the original semantics of the control characters used in text 
files, and since carried over to screen and laser printer output--

CR Carriage Return    Move printing point to beginning of current line.
LF Line Feed          Move printing point down one line.
BS Back Space         Move printing point one space left, unless at left limit.
HT Horizontal Tab     Move printing point right to next tab stop, unless at
                         right limit.
FF Form Feed          Move printing point to top of next page.

and are the reason why many of us call CR-LF either a line or 
paragraph break today. Explicit line breaks were, of course, 
essential on the original devices. Both CR and BS were routinely used 
for overstriking.
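
The table above can be sketched as a toy printing terminal. This is only 
an illustration, not a model of any real device: the name teleprint, the 
72-column width, the 8-column tab stops, leaving the column unchanged on 
FF, and ignoring all other control characters are assumptions made for 
the example.

```python
def teleprint(data, width=72, tab=8):
    """Return {(page, row, col): glyphs struck at that spot}."""
    struck = {}
    page = row = col = 0
    for ch in data:
        if ch == "\r":        # CR: beginning of the current line
            col = 0
        elif ch == "\n":      # LF: down one line, column unchanged
            row += 1
        elif ch == "\b":      # BS: one space left, unless at left limit
            col = max(col - 1, 0)
        elif ch == "\t":      # HT: next tab stop, unless at right limit
            col = min((col // tab + 1) * tab, width - 1)
        elif ch == "\f":      # FF: top of next page (column kept: assumption)
            page, row = page + 1, 0
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            pass              # other controls are ignored in this sketch
        else:
            if ch != " ":     # a space moves the printing point, prints nothing
                key = (page, row, col)
                struck[key] = struck.get(key, "") + ch
            col = min(col + 1, width - 1)
    return struck
```

With these rules, "a" BS "_" strikes both glyphs in the same column, 
which is the overstriking mentioned above, and LF without CR leaves the 
next line starting in mid-column.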

The semantics of these and other ASCII control characters have been 
changing with technology. *Some* computer system designers, noticing 
that the demands of printing terminals were not requirements on 
system file internals, chose to use either CR alone or LF alone for 
line or paragraph ends, all without coordination. Line breaks in 
files became optional on systems that provided word wrap on output or 
display. Users were given options for setting tab stops, margins, and 
page lengths. Character 7F, DEL, originally meant "not a character; 
deleted" on punched paper tape, but began turning into destructive 
backspace even before tape died. ESC has undoubtedly mutated the 
most. The use of 1A SUB for end of file in several operating systems 
including PC DOS is a violation of the ASCII standard, which provides 
both 03 ETX (End of Text) and 04 EOT (End of Transmission), but who 
cared?

There are now numerous incompatible formats bearing the name "plain 
text". Some are distinguished by the choice of line end string. In 
some cases, line ends are required, especially if there is a maximum 
line length. Lines of unlimited length may represent paragraphs or 
database records. Character sets other than ASCII may be used, 
especially ISO 8859-1 or Windows code page 1252. These days, people want 
to be able to use any coded character set and still call it plain 
text. In fact, people want to introduce all kinds of markup, 
including furigana/rubi, language tags, ligature marking, and even 
character set shift sequences (not just the poky SI and SO), and 
still call the result plain text.
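
As a small illustration of the line end problem alone, here is a sketch 
that counts which line end strings a file actually contains. The function 
name count_line_ends is made up for the example, and it deliberately 
ignores everything else that varies between formats (character set, 
maximum line length, markup).

```python
import re

def count_line_ends(data):
    """Count CR-LF pairs, lone CRs and lone LFs in a bytes object."""
    counts = {"CRLF": 0, "CR": 0, "LF": 0}
    # CR-LF is listed first so that a pair is never counted as CR plus LF.
    for match in re.finditer(rb"\r\n|\r|\n", data):
        found = match.group()
        if found == b"\r\n":
            counts["CRLF"] += 1
        elif found == b"\r":
            counts["CR"] += 1
        else:
            counts["LF"] += 1
    return counts
```

A file in which more than one of the three counts is nonzero has probably 
passed through several systems, or through someone's hands.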

>I think this is a fair comment, if one assumes so broad a sense of
>'layout'. On the other hand, I wouldn't consider a paragraph break to be
>necessarily 'layout', since it is primarily a textual convention that can
>be represented in layout in a myriad of different ways: double spacing,
>indentation, pilcrows, etc. Now, we have interpreted a paragraph break in
>a particular way in plain text code -- a hard break and a move to a new
>line, i.e. the behaviour of a typewriter 'return' key --

by way of the Teletype

>and have further
>muddied things by using this code to force layout by, for instance,
>entering two paragraph breaks
>
>to achieve this particular layout.

The use of tabs, spaces, CR, and LF to lay out "plain text" is 
necessary in mail and news, and a total pain in documents that will 
need to be converted to anything else.

>Personally, I think a truly plain text paragraph break would have no
>particular layout behaviour associated with it; rather, it would indicate a
>textual break that would be interpreted by applications according to user
>defined layout preferences. In e-mail, it is handy to have paragraphs
>separated by a 'double return', especially when several correspondents are
>being quoted, but elsewhere I would prefer indented, single-spaced
>paragraphs. Since it is the same textual break that is being indicated, I
>don't think these two layout options should be differently encoded. I think
>equating a digital paragraph break with the return key on a manual
>typewriter is actually a failure to encode plain text.

It is too late for such simple solutions. If we want to have a 
standard for plain text, we have to provide for each of the common 
usages. We have tried to start such a project twice on this list, and 
have failed utterly both times.

>That said, I realise that this might be an extremist view, and I certainly
>don't expect anybody to change anything now. Although I have to add, as
>someone who has typeset books, that having to remove all the double returns
>in a document before I can properly control the paragraph breaks is almost
>as annoying as replacing multiple tabs or word spaces when these have been
>used to force layout in 'plain text'. Thank goodness for macros.

Hear, hear. I have wasted a remarkable amount of time over the years 
on reformatting Word documents into FrameMaker. The "pain text" [sic] 
markup habits of engineers are responsible for most of the work in 
those conversions. Thank goodness for book-wide search and replace in 
FM 6.
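
For what it is worth, the cleanup in question amounts to something like 
the following sketch. It assumes that every line end in the source is a 
paragraph break, that empty paragraphs were only there to force spacing, 
and that runs of tabs or spaces were only there to force alignment, so it 
would wreck a genuine table. The name strip_forced_layout is invented for 
the example.

```python
import re

def strip_forced_layout(text):
    """Return the non-empty paragraphs of text with forced spacing removed."""
    paragraphs = []
    # Accept any of the three common line end strings.
    for line in re.split(r"\r\n|\r|\n", text):
        # Collapse runs of tabs and spaces, then trim both ends.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # drop the empty paragraphs left by double returns
            paragraphs.append(line)
    return paragraphs
```

The result is a list of paragraphs with no layout attached, which the 
receiving application can then space or indent as it likes.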

>John Hudson
>
>Tiro Typeworks
>Vancouver, BC
>http://www.tiro.com


Edward Cherlin, Spamfighter <http://www.cauce.org>
"It isn't what you don't know that hurts you, it's
what you know that ain't so."--Mark Twain, or else
some other prominent 19th century humorist and wit
