Are the LS and PS characters actually used in real plain-text documents?

I ask because plain text documents are created by text editors. The text editor I happen to use is TextPad (there are hundreds of others, and everyone has their favorite). It can save in UTF-8, and so on. But it always saves documents with CRLF separating the lines. (It's a Windows system).

Going a little deeper, applications like this are often written in C or C++. These languages have the convention that "\n" in a string literal means "new line". Strictly speaking, the C and C++ specs define "\n" only as the "new-line" character, whose actual value is implementation-defined; on ASCII-based platforms it is LF, and nothing else. Even on Windows the in-memory string contains a lone LF -- it is the runtime that translates LF to CRLF (and back) when the bytes pass through a file or stream opened in text mode. Yes, it's a kludge, but it obviously works quite well. I suspect (but I don't know for sure) that classic Mac compilers interpreted "\n" as CR only.

It would seem impossible (or at least a stretch of the C/C++ specs) to reinterpret "\n" as LS in C/C++ ... but then again, the text-mode CRLF translation on Windows is already a platform-specific reinterpretation, so maybe the precedent is there and that no longer matters.

Nonetheless, it would seem at least /slightly/ sensible to me that text files encoded as UTF-8 should be using LS instead of CRLF. But this appears to be difficult to achieve. There is no dedicated C/C++ escape sequence which is defined to mean LS (unless you're prepared to write "\xE2\x80\xA8" instead of "\n" all over the place), and what "\n" generates is platform-dependent.

We can't change C or C++, of course, but would it make sense for other computer languages, in particular future ones, either to redefine "\n" to mean LS (if the encoding is capable of representing it)*, or to introduce a new escape sequence ("\l"?) to mean LS? (Of course, if we introduced "\l" for LS, we could also introduce "\p" for PS.)

Thoughts anyone?

Jill



*FOOTNOTE - this is actually quite difficult to achieve if you're storing strings internally as bytes. Windows knows whether or not to convert LF->CRLF and vice versa by means of a parameter passed to fopen(), but this parameter can only distinguish between "text" and "binary", not between "latin-1 text" and "utf-8 text". Things get easier if you store chars internally as Unicode chars, of course.



