Stephan,

> If it has ANY bytes above 127, it's not, by definition, ASCII. i.e.
> "it's binary."

I disagree with part of this statement. I agree that ASCII defines only the 7-bit code values, but I think this whole thread has run off the rails by treating the byte values in a file as what determines whether the file is "text" or "binary".

This discussion of content heuristics misses the point of why there is a distinction to be made in the first place. That, I think, has more to do with whether the content is organized into "lines".

In a functional sense for Fossil, a "text" file is one for which it is useful to display a line-oriented difference. For all other files ("binary" files) the difference can only be displayed in a way that is agnostic of the internal structure (if any) of the content.
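To make that functional distinction concrete, here is a minimal Python sketch. It is not Fossil's code; `describe_change` and its parameters are hypothetical names chosen for illustration, and the text/binary flag is assumed to be supplied by the caller rather than guessed from content.

```python
import difflib
import hashlib


def describe_change(old: bytes, new: bytes, is_text: bool,
                    encoding: str = "utf-8") -> list:
    """Describe the change from old to new content.

    is_text and encoding are treated as metadata handed in by the caller;
    this sketch does not try to infer them from the bytes.
    """
    if is_text:
        # Text: a line-oriented difference is meaningful.
        old_lines = old.decode(encoding).splitlines(keepends=True)
        new_lines = new.decode(encoding).splitlines(keepends=True)
        return list(difflib.unified_diff(old_lines, new_lines, "old", "new"))
    # Binary: report the change without assuming any internal structure.
    return [
        f"binary content changed: {len(old)} -> {len(new)} bytes",
        f"old sha1 {hashlib.sha1(old).hexdigest()}",
        f"new sha1 {hashlib.sha1(new).hexdigest()}",
    ]
```

Called with `is_text=True` the result is a unified diff; with `is_text=False` it is only a size and hash summary, which is about all that can be said without knowing the file's structure.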

Given that there is no universal heuristic for discriminating "text" from "binary" files based on content, that determination must be treated as a bit of metadata about the file.

Likewise, it is necessary to know for a given file what representation is used to separate lines. Knowledge of the line separator is seldom carried as metadata, because it is usually uniform within a given system. But in these days of interoperable systems and multi-platform support, this detail may also be a necessary piece of metadata to know about a file. ASCII defines the CR (carriage return) and LF (line feed) control characters. DOS-based systems (including Windows) follow the direct ASCII tradition of using CR and LF, paired in that order (and often written as CRLF), as the line separator. That tradition is also embodied in the Internet Mail standards for message content, header and body (absent MIME extensions). Unix-based systems use the LF character alone as the line separator in files (aka "newline"). Other systems have used CR alone (classic Mac OS, for example).
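When the separator is not recorded anywhere, about the best a tool can do is count the candidates. The following Python sketch shows one way to do that; the function names are hypothetical, and the result is only a guess that explicit metadata should be allowed to override.

```python
def count_line_separators(data: bytes) -> dict:
    """Count CRLF pairs, lone LF and lone CR bytes in data."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF bytes not preceded by CR
    cr = data.count(b"\r") - crlf   # CR bytes not followed by LF
    return {"CRLF": crlf, "LF": lf, "CR": cr}


def dominant_separator(data: bytes):
    """Return the most frequent separator name, or None if there is none.

    Ties go to the first of CRLF, LF, CR; that ordering is arbitrary.
    """
    counts = count_line_separators(data)
    name, n = max(counts.items(), key=lambda kv: kv[1])
    return name if n > 0 else None
```

A file with mixed separators yields more than one non-zero count, which is itself worth reporting to the user rather than silently normalizing.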

In addition, the character set used to represent text in a file must be carried as metadata, because the ISO-8859 family and other code-page based character sets assign different characters to the same byte values above 127.
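A short Python illustration of why the character set has to be known: the same four bytes are perfectly good text under two different 8-bit character sets, but they display differently, and they are rejected outright if strict 7-bit ASCII is assumed. The helper name `interpretations` is hypothetical.

```python
def interpretations(raw: bytes, charsets) -> dict:
    """Decode raw under each named charset; None marks a failed decode."""
    result = {}
    for name in charsets:
        try:
            result[name] = raw.decode(name)
        except UnicodeDecodeError:
            result[name] = None
    return result


raw = b"caf\xe9"  # one byte above 127
views = interpretations(raw, ["ascii", "iso-8859-1", "cp437"])
# views["ascii"] is None: byte 0xE9 is outside 7-bit ASCII
# views["iso-8859-1"] == "café"
# views["cp437"] == "cafΘ"
```

Nothing in the bytes themselves says which reading is the intended one.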

Only if all these items of metadata are known can the file content, or differences in the file content, be displayed in a useful form. So returning to this thread, it is convenient to have a heuristic that works most of the time to discriminate "text" from "binary" files, but it is necessary to also have a way for the user to explicitly provide that metadata (and ideally the character set metadata).
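As a sketch of that arrangement in Python: a cheap content heuristic supplies the default, and an explicit declaration from the user always wins. The NUL-byte test and the 8192-byte sample size are assumptions made for this example, not a description of what Fossil does, and both function names are hypothetical.

```python
def looks_like_text(data: bytes, sample_size: int = 8192) -> bool:
    """Heuristic guess: treat content as text if the sample has no NUL byte."""
    return b"\x00" not in data[:sample_size]


def is_text(data: bytes, declared=None) -> bool:
    """Return the user's declaration if given, else fall back to the guess.

    declared is True, False, or None (None means the user said nothing).
    """
    if declared is not None:
        return declared
    return looks_like_text(data)
```

The heuristic accepts bytes above 127, so ISO-8859 text passes, but it gets UTF-16 text wrong because that encoding contains NUL bytes: `is_text("hi".encode("utf-16"))` is `False`, while `is_text("hi".encode("utf-16"), declared=True)` is `True`. That is exactly the case where the explicit metadata is needed.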

-- Shal

_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
