On 2/21/2020 7:53 AM, Costello, Roger L. via Unicode wrote:

Text files may indeed contain binary (i.e., bytes that are not interpretable as characters). Namely, text files may contain newlines, tabs, and some other invisible things.

Question: "characters" are defined as only the visible things, right?

No. You've gone astray right there. Please read Chapter 2 of the Unicode Standard, and in particular, Section 2.4, Code Points and Characters:

https://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564

All of the types of code points described there can occur in Unicode plain text, with the exception of surrogate code points.
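As a small illustration of that point, the Python sketch below looks up the General Category of a few "invisible" characters and checks whether a surrogate code point can be encoded. It assumes Python's standard `unicodedata` module; the helper names `describe` and `encodes_as_utf8` are made up for this example.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and General Category of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.category(ch)}"

def encodes_as_utf8(ch: str) -> bool:
    """Report whether the character can be serialized as UTF-8."""
    try:
        ch.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# Tab, newline, space and zero width space are all characters,
# even though none of them has a visible glyph.
for ch in ["\t", "\n", " ", "A", "\u200b"]:
    print(describe(ch))
# U+0009 Cc
# U+000A Cc
# U+0020 Zs
# U+0041 Lu
# U+200B Cf

# A lone surrogate code point is the exception: it is not allowed in UTF-8.
print(encodes_as_utf8("\ud800"))  # False
```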

I conclude:

Binary files may contain arbitrary text.

Binary files can contain *whatever*, including text.

Text files may contain binary, but only a restricted set of binary.

The distinction is definitional. A text file contains *only* characters, interpretable by a specific character encoding (usually Unicode, these days).
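One rough way to express that definition in code, assuming UTF-8 as the default encoding: a byte sequence counts as text if and only if every byte decodes to a character under the chosen encoding. The function name `is_text` is hypothetical, and this is only a sketch of the definitional test, not a general-purpose file-type detector.

```python
def is_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if every byte of data decodes as characters in the encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Newlines and tabs are characters, so this is text.
print(is_text(b"line one\n\tline two\n"))   # True
# Non-ASCII characters are fine once they are encoded consistently.
print(is_text("na\u00efve".encode("utf-8")))  # True
# The first byte of a PNG signature is not valid UTF-8.
print(is_text(b"\x89PNG\r\n\x1a\n"))         # False
```

Note that the answer depends on the encoding passed in: the same bytes can be text under one encoding and not under another.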

But a text file need not be "plain text". An HTML file is an example of a text file (it contains only a sequence of characters, whose identity and interpretation are all clearly specified by looking them up in the Unicode Standard), but it is not *plain* text. It is *rich* text, consisting of markup tags interspersed with runs of plain text.
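To make the "markup tags interspersed with runs of plain text" structure concrete, here is a minimal sketch using Python's standard `html.parser` module. The class `TextRuns` and the function `split_markup` are names invented for this illustration.

```python
from html.parser import HTMLParser

class TextRuns(HTMLParser):
    """Collect start-tag names and the plain-text runs between tags."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.runs = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Skip runs that are only whitespace.
        if data.strip():
            self.runs.append(data)

def split_markup(html: str):
    """Return (start tags, plain-text runs) found in an HTML string."""
    parser = TextRuns()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.runs

tags, runs = split_markup("<p>Hello, <b>world</b>!</p>")
print(tags)  # ['p', 'b']
print(runs)  # ['Hello, ', 'world', '!']
```

The input is a text file by the definition above (it is all characters); the parser merely separates the markup from the plain-text runs.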

Another distinction that may be leading you astray is the distinction between binary file transfer and text file transfer. If you are using ftp, for example, you can specify use of binary file transfer, *even if* the file you are transferring is actually a text file. That simply means that the file transfer will agree to treat the entire file as a binary blob and transfer it byte-for-byte intact. A text file transfer, on the other hand, may look for "lines" in a text file and may adjust line endings to suit the receiving platform's conventions.
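The difference between the two transfer modes can be sketched as two functions over bytes. This is an illustration of the idea only, not how ftp is implemented; the names `transfer_binary` and `transfer_text` and the `newline` parameter are assumptions made for the example.

```python
def transfer_binary(data: bytes) -> bytes:
    """Binary mode: deliver the bytes exactly as they were sent."""
    return data

def transfer_text(data: bytes, newline: bytes = b"\n") -> bytes:
    """Text mode: rewrite line endings to the receiver's convention."""
    # First normalize CRLF and lone CR to LF, then apply the target ending.
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return normalized.replace(b"\n", newline)

windows_file = b"first line\r\nsecond line\r\n"

# Binary mode leaves a text file untouched, byte for byte.
print(transfer_binary(windows_file) == windows_file)  # True
# Text mode adjusts the line endings for a receiver that uses LF.
print(transfer_text(windows_file))  # b'first line\nsecond line\n'
```

This also shows why text mode is unsafe for files that are not text: any byte that happens to equal 0x0D or 0x0A would be rewritten.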

Do you agree?

No.

--Ken
