RE: Backslash n [OT] was Line Separator and Paragraph Separator
So this legacy encoding of end-of-lines is now quite obsolete even on MacOS. I don't think it can be called obsolete as long as files generated using that line-end convention exist. Or, at least, applications that have an operation for reading a line will have to cope with it. (In other words, all of CR, LF, CR+LF, and LF+CR should mark an end of line.)
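The line-reading behaviour argued for here can be sketched in a few lines of Python (an illustration of ours, not anything from the original mails; it treats CR, LF, and CR+LF each as one line end, which means LF+CR counts as two):

```python
import re

# Split text into lines, treating each of CR, LF, and the CR+LF sequence
# as a single end-of-line. Whether LF+CR is one line end or two is
# disputed in this thread; this sketch counts it as two.
def split_lines(text: str) -> list[str]:
    # Try CR+LF first, so it is not seen as a lone CR plus a lone LF.
    return re.split(r'\r\n|\r|\n', text)

print(split_lines('a\r\nb\rc\nd'))  # ['a', 'b', 'c', 'd']
```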
Re: Backslash n [OT] was Line Separator and Paragraph Separator
From: [EMAIL PROTECTED] I wrote: So this legacy encoding of end-of-lines is now quite obsolete even on MacOS. I don't think it can be called obsolete as long as files generated using that line-end convention exist. Or, at least, applications that have an operation for reading a line will have to cope with it. (In other words, all of CR, LF, CR+LF, and LF+CR should mark an end of line.) I was not speaking about the actual encoding of files into bytes, but only about the interpretation of '\n' or '\r' in C/C++, which was the real subject of the message. You are referring to the run-time behavior of I/O readers/writers for files or network messages, and of course this is not obsolete, as the text/plain MIME format, as well as the RFC 822 message format (also used in the HTTP protocol), still use a CR+LF sequence for end-of-line marks in headers (this is even mandatory for RFC 822 and HTTP conformance). I just wonder if more recent C/C++ compilers for MacOS still compile a CR for the '\n' _source_ string or character constants.
RE: GDP by language
Mark Davis wrote: BTW, some time ago I had generated a pie chart of world GDP divided up by language. Those quotients are immoral. Of course, this immorality is not the fault of he who did the calculation: the immorality is out there, and those infamous numbers are just an arithmetical expression of it. In practice, those quotients say that, e.g., Italian (spoken by 50 million people or less) is more important than Hindi (spoken by nearly one billion people), just because an average Italian is richer than an average Indian. In other terms, each Indian (or any other citizen from poor countries) has 1/20 or less of the linguistic rights of an Italian (or any other citizen from rich countries). BTW, by summing up languages written with the same script, it is easy to derive the immoral quotients of writing systems:

Latin 59.13%
Han 20.60%
Arabic 3.82%
Cyrillic 2.99%
Devanagari 2.54%
Hangul 1.84%
Thai 0.87%
Bengali 0.44%
Telugu 0.42%
Greek 0.40%
Tamil 0.34%
Gujarati 0.26%

Marco
RE: Backslash n [OT] was Line Separator and Paragraph Separator
all of CR, LF, CR+LF, LF+CR should mark an end of line.) All of CR, LF, CR+LF, NEL, LS, PS, and EOF(!). (Assuming that the encoding of the text file is recognised.) Don't know about LF+CR. I think that should be two line ends. /kent k
RE: Backslash n [OT] was Line Separator and Paragraph Separator
all of CR, LF, CR+LF, LF+CR should mark an end of line.) All of CR, LF, CR+LF, NEL, LS, PS, and EOF(!). (Assuming that the I was still staying within the ASCII and \r \n discussion, but yes, if one goes Latin-1 / Unicode, then NEL and LS, PS (why not FF, then?), and of course EOF. encoding of the text file is recognised.) Don't know about LF+CR. I think that should be two line ends. That's a good question: is it a case of mixing different EOLs in the same file, or a question of a \r\n emitted by MacOS Classic? /kent k
Re: Backslash n [OT] was Line Separator and Paragraph Separator
Kent Karlsson scripsit: All of CR, LF, CR+LF, NEL, LS, PS, and EOF(!). (Assuming that the encoding of the text file is recognised.) XML 1.0 treats CR, LF, and CR+LF as line terminators and reports them all as LF. XML 1.1 will treat CR, LF, NEL, CR+LF, CR+NEL, and LS as line terminators and report them all as LF. PS is left alone, because of the bare possibility that it is being used as quasi-markup. I can't imagine why EOF should be called a line terminator, except in the sense that a read-a-line operation should obviously not attempt to read past EOF. Calling it a line terminator means that every document is forced into the mold of being an integral number of lines long, regardless of the facts. Don't know about LF+CR. I think that should be two line ends. I agree. I don't know any system that uses this sequence. -- BALIN FUNDINUL UZBAD KHAZADDUMU [EMAIL PROTECTED] BALIN SON OF FUNDIN LORD OF KHAZAD-DUM http://www.ccil.org/~cowan
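The XML 1.1 normalization described here is easy to state as code; a sketch of ours (not the XML specification's own wording), using Python string escapes for NEL (U+0085) and LS (U+2028):

```python
import re

# XML 1.1 end-of-line handling as described above: the two-character
# sequences CR+LF and CR+NEL, and the single characters CR, NEL (U+0085)
# and LS (U+2028), are each reported to the application as one LF.
def normalize_eol_xml11(text: str) -> str:
    return re.sub('\r\n|\r\u0085|[\r\u0085\u2028]', '\n', text)

print(repr(normalize_eol_xml11('a\r\nb\u0085c\u2028d\r')))  # 'a\nb\nc\nd\n'
```

Note that PS (U+2029) is deliberately absent from the pattern, matching the "PS is left alone" decision above.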
Encoding for Fun (was Line Separator)
-Original Message- From: Doug Ewell [mailto:[EMAIL PROTECTED]] Sent: Wednesday, October 22, 2003 6:19 AM To: Unicode Mailing List Cc: Marco Cimarosti; Jill Ramonsky Subject: Re: Line Separator and Paragraph Separator Importance: Low Jill, I'd be interested in details of your invented encodings, just for fun. Please e-mail privately to avoid incurring the wrath of group (b). I'm going to risk the wrath of the group because I hereby place this in the public domain. Now you can't patent it! :-) Unicode list, please note: I used this a few years back internally, within one particular piece of software. It was never intended for wider use ... and that's the case for the defence, m'lud! The only invented encoding which got any real use was the following (currently nameless) one: We define an 8X byte as a byte with bit pattern 1000 xxxx (i.e. 0x80 to 0x8F). We define a 9X byte as a byte with bit pattern 1001 xxxx (i.e. 0x90 to 0x9F). The rules are: (1) If the codepoint is in the range U+00 to U+7F, represent it as a single byte (that covers ASCII). (2) If the codepoint is in the range U+A0 to U+FF, also represent it as a single byte (that covers Latin-1, minus the C1 controls). (3) In all other cases, represent the codepoint as a sequence of one or more 8X bytes followed by a single 9X byte. A sequence of N 8X bytes plus one 9X byte therefore contains 4(N+1) bits of "payload", which are then interpreted literally as a Unicode codepoint. EXAMPLES: U+2A ('*') would be represented as 2A (all Latin-1 chars are left unchanged apart from the C1s). U+85 (NEL) would be represented as 88 95 (just to prove that we haven't lost the C1 controls altogether!). U+20AC (Euro sign) would be represented as 82 80 8A 9C. As you can see, the hex value of the encoded codepoint is actually "readable" from the hex, if you just look at the second nibble of each 8X or 9X byte. Another interesting feature: starting from a random point in a string, it is easy to scan backwards or forwards to find the start-byte or end-byte of a character. This is valuable, as it means that you don't have to parse a string from the beginning in order not to get lost. Finally, of course, the big plus is that it "looks like ASCII". Although this was used for "internal use only", it is interesting to speculate how it might have been declared, had it been a published encoding. Because, you see, it is quite interpretable by any engine which understands only Latin-1. The worst outcome is that any 8X...9X sequences will be incorrectly displayed as multiple unknown-character glyphs ... but that is not much worse than displaying a single unknown-character glyph. On the other hand, if you declare it as "LATIN-1-PLUS" or something, then any application which does not recognise that encoding name will be forced to interpret the stream as 7-bit ASCII, thereby replacing all codepoints above U+7F with '?' or something. Which behavior is preferable, I wonder? What we'd really want the encoding name to say is "interpret as LATIN-1-PLUS if you can, otherwise interpret as LATIN-1", but there doesn't seem to be any way of saying that with current encoding nomenclature. Jill
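For concreteness, the scheme described above can be sketched in Python (the function names and the left-to-right decoder structure are our own; only the byte format comes from the description):

```python
# A sketch of the nameless 8X/9X scheme described above. Code points
# U+0000-U+007F and U+00A0-U+00FF pass through as single bytes;
# everything else becomes one or more 8X bytes followed by one 9X byte,
# each byte carrying one hex nibble of the code point in its low nibble.
def encode(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0x7F or 0xA0 <= cp <= 0xFF:
            out.append(cp)
        else:
            nibbles = [int(d, 16) for d in format(cp, 'x')]
            out.extend(0x80 | n for n in nibbles[:-1])   # leading 8X bytes
            out.append(0x90 | nibbles[-1])               # terminating 9X byte
    return bytes(out)

def decode(data: bytes) -> str:
    out, acc = [], 0
    for b in data:
        if 0x80 <= b <= 0x8F:            # continuation: accumulate a nibble
            acc = (acc << 4) | (b & 0x0F)
        elif 0x90 <= b <= 0x9F:          # terminator: last nibble, emit char
            out.append(chr((acc << 4) | (b & 0x0F)))
            acc = 0
        else:                            # plain Latin-1 byte
            out.append(chr(b))
    return ''.join(out)

print(encode('\u20ac').hex(' '))  # 82 80 8a 9c, matching the example above
```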
Re: Backslash n [OT] was Line Separator and Paragraph Separator
From: John Cowan [EMAIL PROTECTED] Kent Karlsson scripsit: All of CR, LF, CR+LF, NEL, LS, PS, and EOF(!). (Assuming that the encoding of the text file is recognised.) XML 1.0 treats CR, LF, and CR+LF as line terminators and reports them all as LF. XML 1.1 will treat CR, LF, NEL, CR+LF, CR+NEL, and LS as line terminators and report them all as LF. PS is left alone, because of the bare possibility that it is being used as quasi-markup. [...] I also have some old documents that use VT=U+000B instead of LF=U+000A to increase the interparagraph spacing. This is still mapped to the '\v' character constant in C/C++ (and Java as well, except that Java _requires_ that '\v' be mapped only to VT). Some applications still seem to use VT after CR to create soft line breaks, in text files where paragraphs are normally ended by CR+LF. CR was intended to create an overstrike on the previously written (but still complete) line, for example to underline some characters on that line. This is what '\r' should imply in C, and in fact such a '\r' should no longer be used in C, as its purpose is to add visual attributes to the previous text. That's why CR comes before the LF that terminates the paragraph. Of course there will still be many more usages in terminal emulation protocols, which technically are not text file encodings, as they can create dynamic effects, or can encode and render a text in a non-logical order, for example when emulating blinking, or creating ASCII art: I consider that terminal emulation protocols (including printing protocols) are supersets of the plain text format, but plain texts should not attempt to reproduce all the terminal features. So what is the status of VT in plain text files? For me it should have the same behavior as LF, except that it does not imply an end of paragraph. Is there a good replacement for this legacy control, that just means an explicit soft line break in the middle of a paragraph (in which case it may occur instead of a SPACE and act as a word separator, except if it occurs after a soft hyphen, where it becomes ignorable)?
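One candidate answer, suggested by the subject of this very thread, is U+2028 LINE SEPARATOR: a break within a paragraph that does not end it. A minimal migration sketch (the VT-to-LS mapping is a suggestion of ours, not an established convention):

```python
# U+2028 LINE SEPARATOR breaks the line without ending the paragraph,
# which is the behaviour asked for above. Mapping legacy VT (U+000B) to
# LS is our own suggestion here, not anything mandated by Unicode.
def vt_to_ls(text: str) -> str:
    return text.replace('\x0b', '\u2028')

print(repr(vt_to_ls('line one\x0bline two\r\n')))
```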
Re: Encoding for Fun (was Line Separator)
From: Jill Ramonsky From: Doug Ewell [mailto:[EMAIL PROTECTED]] Jill, I'd be interested in details of your invented encodings, just for fun. Please e-mail privately to avoid incurring the wrath of group (b). I'm going to risk the wrath of the group because I hereby place this in the public domain. Now you can't patent it! :-) Unicode list, please note, I used this a few years back internally, within one particular piece of software. It was never intended for wider use ... and that's the case for the defence, m'lud! The only invented encoding which got any real use was the following (currently nameless) one: We define an 8X byte as a byte with bit pattern 1000 xxxx. We define a 9X byte as a byte with bit pattern 1001 xxxx. The rules are: (1) If the codepoint is in the range U+00 to U+7F, represent it as a single byte (that covers ASCII). (2) If the codepoint is in the range U+A0 to U+FF, also represent it as a single byte (that covers Latin-1, minus the C1 controls). (3) In all other cases, represent the codepoint as a sequence of one or more 8X bytes followed by a single 9X byte. A sequence of N 8X bytes plus one 9X byte therefore contains 4(N+1) bits of payload, which are then interpreted literally as a Unicode codepoint. EXAMPLES: U+2A ('*') would be represented as 2A (all Latin-1 chars are left unchanged apart from the C1s). U+85 (NEL) would be represented as 88 95 (just to prove that we haven't lost the C1 controls altogether!). U+20AC (Euro sign) would be represented as 82 80 8A 9C. As you can see, the hex value of the encoded codepoint is actually readable from the hex, if you just look at the second nibble of each 8X or 9X byte. That's a quite simple encoding. At least it has the merit of not being restricted in encoding length (but this may also be a security issue in systems that would implement it, as there's no limitation on the number of bytes to scan forward or backward to get the whole sequence, unless you specify that there can be no more than five 8X bytes, as the longest valid sequence would then be {0x81, 0x80, 0x8F, 0x8F, 0x8F, 0x9D} = U+10FFFD). However, UTF-8 is much more compact. The second merit is that the technique can be used on top of all ISO-8859-* charsets, by replacing the C1 controls mapped in the 0x8X and 0x9X positions. It could as well be mapped over EBCDIC, using the mapping between standard ISO Latin 1 and EBCDIC Latin 1, but there's a problem caused by the legacy and widely used NEL control: you can't then say that it is fully compatible with ISO-8859-1, as it breaks the reversible compatibility with an EBCDIC transcoding (unless you are sure that no internal system or protocol will transcode your text files to/from EBCDIC). But one could argue that 8-bit JIS and EUC also do not offer this reversibility of encodings for C1 controls, except through ISO 2022 codepage switches and escaping mechanisms which allow a reversible conversion between 8-bit and 7-bit encodings (through SS2, SI and SO controls and escape sequences).
RE: Backslash n [OT] was Line Separator and Paragraph Separator
John Cowan wrote: XML 1.1 will treat CR, LF, NEL, CR+LF, CR+NEL, and LS as line terminators and report them all as LF. PS is left alone, because of the bare possibility that it is being used as quasi-markup. I'm not sure why CR+NEL should be seen as a single line end. And I think PS should be seen as a line end for XML too. It, like LS, can be used to format the XML source, but should not be interpreted as other than a line end when parsing the XML source. E.g., PS is not a begin-end markup, which all other XML markup is; nor do I know of a way of attaching style to a PS, like can be done for <p>...</p> etc. Following (ex-) UAX 14 fully, FF and VT should be seen as line separators too. Though they are unlikely in XML source files. FF shouldn't be interpreted as generating a page break in the styled output of an XML file, should it? I can't imagine why EOF should be called a line terminator, except in the sense that a read-a-line operation should obviously not attempt to read past EOF. There have been Unix programs that (mistakenly, I'd say) *discarded* the last (possibly partial) line of input, just because it had no LF at its end... And LS is a separator, not a terminator, so EOF has to be a line terminator. Calling it a line terminator means that every document is forced into the mold of being an integral number of lines long, regardless of the facts. ?? If you mean that concatenating files should not generate a line break between the files, I agree. /kent k
Re: Encoding for Fun (was Line Separator)
The only invented encoding which got any real use was the following (currently nameless) one: We define an 8X byte as a byte with bit pattern 1000 xxxx. We define a 9X byte as a byte with bit pattern 1001 xxxx. The rules are: (1) If the codepoint is in the range U+00 to U+7F, represent it as a single byte (that covers ASCII). (2) If the codepoint is in the range U+A0 to U+FF, also represent it as a single byte (that covers Latin-1, minus the C1 controls). (3) In all other cases, represent the codepoint as a sequence of one or more 8X bytes followed by a single 9X byte. A sequence of N 8X bytes plus one 9X byte therefore contains 4(N+1) bits of payload, which are then interpreted literally as a Unicode codepoint. EXAMPLES: U+2A ('*') would be represented as 2A (all Latin-1 chars are left unchanged apart from the C1s). U+85 (NEL) would be represented as 88 95 (just to prove that we haven't lost the C1 controls altogether!). U+20AC (Euro sign) would be represented as 82 80 8A 9C. If you used this for interchange between components, there would be a potential security issue if you allowed for over-long encodings, such as encoding U+002F as 0x82 0x9F. Beyond that, of course, one can use whatever encodings one wants privately.
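The overlong-form risk is mechanical to check. A sketch of a strict decoder for the quoted scheme that rejects non-shortest forms (the two rejection rules are our reading of "shortest form" for this encoding, not part of the original spec):

```python
# A strict decoder for the 8X/9X scheme quoted above that rejects
# overlong forms (e.g. 82 9F for U+002F): no leading zero nibble, and
# no 8X...9X sequence for a code point that has a single-byte form.
def decode_strict(data: bytes) -> str:
    out, nibbles = [], []
    for b in data:
        if 0x80 <= b <= 0x8F:
            nibbles.append(b & 0x0F)       # leading nibble(s)
        elif 0x90 <= b <= 0x9F:
            nibbles.append(b & 0x0F)       # final nibble: emit a character
            if len(nibbles) > 1 and nibbles[0] == 0:
                raise ValueError('overlong: leading zero nibble')
            cp = 0
            for n in nibbles:
                cp = (cp << 4) | n
            if cp <= 0x7F or 0xA0 <= cp <= 0xFF:
                raise ValueError('overlong: code point has a single-byte form')
            out.append(chr(cp))
            nibbles = []
        else:
            out.append(chr(b))             # plain Latin-1 byte
    return ''.join(out)

print(decode_strict(bytes([0x82, 0x80, 0x8A, 0x9C])))  # the Euro sign, U+20AC
```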
Re: Backslash n [OT] was Line Separator and Paragraph Separator
Philippe Verdy scripsit: I also have some old documents that use VT=U+000B instead of LF=U+000A to increase the interparagraph spacing. This is still mapped to the '\v' character constant in C/C++ (and Java as well, except that Java _requires_ that '\v' be mapped only to VT). The XML Core WG also looked at FF, but decided that like PS it might be markup, and therefore shouldn't arbitrarily be mapped to LF. We didn't look at VT as far as I remember. Historically and originally, VT was meant to control line printers, which had a paper tape loop inside that selected the number of lines per page, and was advanced by one frame for each line printed. A hole punched in a certain column represented line 1, and so FF was implemented by advancing the tape and the paper until this hole was detected. Another column could contain holes for vertical tabulation points, and VT advanced the tape and paper until the next such hole was reached. Thus VT was strictly analogous to TAB. Some applications still seem to use VT after CR to create soft line breaks, in text files where paragraphs are normally ended by CR+LF. IIRC, Microsoft Word uses VT internally to indicate a hard line break, and CR for a paragraph break. CR was intended to create an overstrike on the previously written (but still complete) line, for example to underline some characters on that line. This is what '\r' should imply in C, and in fact such a '\r' should no longer be used in C, as its purpose is to add visual attributes to the previous text. That's why CR comes before the LF that terminates the paragraph. In addition, Teletype terminals that received LF, CR would not reliably print the next character in the first horizontal position, because of the time it took to execute a CR. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan "Not to perambulate the corridors during the hours of repose in the boots of ascension." --Sign in Austrian ski-resort hotel
Re: Backslash n [OT] was Line Separator and Paragraph Separator
Kent Karlsson scripsit: John Cowan wrote: XML 1.1 will treat CR, LF, NEL, CR+LF, CR+NEL, and LS as line terminators and report them all as LF. PS is left alone, because of the bare possibility that it is being used as quasi-markup. I'm not sure why CR+NEL should be seen as a single line end. The IBM people, who are authoritative about their own mainframes, asked for it. It primarily arises out of semi-broken conversion programs that map LF to NEL but fail to remove a preceding CR. Since all line terminators are inherently a matter of legacy (i.e. de facto) practice, we accepted it. And I think PS should be seen as a line end for XML too. It, like LS, can be used to format the XML source, but should not be interpreted as other than a line end when parsing the XML source. We are not here concerned, as the UAX is, with when to stop reading characters in a read-line routine. We are concerned with which distinctions to hide in the name of simplicity. Our predecessors considered that the differences between CR, CR+LF, and LF were non-semantic, and somewhat arbitrarily chose LF as the character to be passed to applications. We decided that CR+NEL, NEL, and LS had this same semantic. But PS and FF and VT have their own semantics, and we did not consider it justifiable to make it impossible for XML applications to receive and process them. E.g., PS is not a begin-end markup, which all other XML markup is; nor do I know of a way of attaching style to a PS, like can be done for <p>...</p> etc. PS is strictly analogous to an XML empty-tag without attributes. While it is traditional in SGML/XML to use container elements for paragraphs, there is no necessity to do so. Following (ex-) UAX 14 fully, FF and VT should be seen as line separators too. Though they are unlikely in XML source files. FF shouldn't be interpreted as generating a page break in the styled output of an XML file, should it? It should be interpreted however the application chooses to interpret it. Arbitrarily turning it into a LF makes it impossible for the application to interpret it at all. I can't imagine why EOF should be called a line terminator, except in the sense that a read-a-line operation should obviously not attempt to read past EOF. There have been Unix programs that (mistakenly, I'd say) *discarded* the last (possibly partial) line of input, just because it had no LF at its end... And LS is a separator, not a terminator, so EOF has to be a line terminator. It would be a corruption of the input to infer a LF at the end of a document. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan "First known example of political correctness: After Nurhachi had united all the other Jurchen tribes under the leadership of the Manchus, his successor Abahai (1592-1643) issued an order that the name Jurchen should be banned, and from then on, they were all to be called Manchus." --S. Robert Ramsey, _The Languages of China_
Re: Encoding for Fun (was Line Separator)
Philippe Verdy scripsit: It could as well be mapped over EBCDIC, using the mapping between standard ISO Latin 1 and EBCDIC Latin 1, but there's a problem caused by the legacy and widely used controls NEL: <irony>Why, that is no problem! Just ignore the EBCDIC difference between NEL and LF, and map ASCIIoid LFs to EBCDIC NELs! Doesn't everybody know that's the Right Thing anyhow? This business of treating NEL as a distinct line delimiter is a complete waste of time and money. And nobody cares about clanking iron dinosaurs, anyway. They aren't cool.</irony> -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan "Does anybody want any flotsam? / I've gotsam. Does anybody want any jetsam? / I can getsam." --Ogden Nash, _No Doctors Today, Thank You_
Re: Backslash n [OT] was Line Separator and Paragraph Separator
On 22/10/2003 05:19, Kent Karlsson wrote: ... And LS it's a separator, not a terminator, so EOF has to be a line terminator. Calling it a line terminator means that every document is forced into the mold of being an integral number of lines long, regardless of the facts. ?? If you mean that concatenating files should not generate a line break between the files, I agree. /kent k But if two files each consist of one or more lines of text separated by LS (but with no final LS), when they are concatenated, surely LS must be added as a separator. Similarly with paragraphs and PS. And this applies even when each consists of one line or one paragraph, hence no LS or PS in either file. Conclusion: both LS and PS must be added in ANY concatenation. Way to avoid this absurd conclusion: redefine LS and PS as line and paragraph terminators, to be used at end of file when (as is normal) this corresponds to a line or paragraph end. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Backslash n [OT] was Line Separator and Paragraph Separator
So this legacy encoding of end-of-lines is now quite obsolete even on MacOS. I don't think it can be called obsolete as long as files generated using that line-end convention exist. Or, at least, applications that have an operation for reading a line will have to cope with it. (In other words, all of CR, LF, CR+LF, and LF+CR should mark an end of line.) I was not speaking about the actual encoding of files into bytes, but only about the interpretation of '\n' or '\r' in C/C++, which was the real subject of the message. ISO 14882 says that \n is LF (and also that it is "newline", i.e. LF is the newline function as far as C++ is concerned) and \r is CR. It does not define this relative to any given character set. So there is nothing in the standard to prevent char being interpreted as an implementation-defined character encoding which is identical to, say, US-ASCII or a part of ISO 8859, except for having CR encoded as 0x0A and LF encoded as 0x0D. This would simplify converting newline functions when writing text files on Macs, but potentially cause problems elsewhere. However, because the universal-character-name escapes (\u and \U) are defined relative to a particular encoding, namely ISO 10646, it would be an error if ('\n' != '\u000A' || '\r' != '\u000D'). Whether this is implemented by using the values 0x0A and 0x0D for LF and CR respectively (e.g. by using US-ASCII or a proper superset of US-ASCII such as Unicode) or by converting those values to another encoding when parsing isn't specified. Given that C and C++ are intended to be neutral to encodings, and indeed they do not even mandate that a char be an octet, or that a wchar_t be the same size as 2 or 4 chars, this is not surprising. The consequence is that we cannot assume that conversion of character, wide character, and string literals to and from Unicode will be trivial.
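The point that literal-to-Unicode conversion is non-trivial is easy to see with an EBCDIC code page; a quick illustration using Python's cp037 codec (chosen here purely as an example of a non-ASCII execution character set, not something from the original mail):

```python
# The execution character set need not put newline at 0x0A: under EBCDIC
# (Python's cp037 codec, used purely as an illustration), '\n' encodes
# as byte 0x25, and byte 0x15 is NEL (U+0085). So the byte value of a
# character literal really does depend on the encoding in force.
assert '\n'.encode('ascii') == b'\x0a'
assert '\n'.encode('cp037') == b'\x25'
assert b'\x15'.decode('cp037') == '\u0085'  # NEL, not LF
print('newline byte values are encoding-dependent')
```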
RE: Encoding for Fun (was Line Separator)
Well, that was considerably less wrath than I was expecting. Phew! But to justify a few design decisions - yes, the encoding is longer (in general) than UTF-8, but UTF-8 only attempts to preserve ASCII. I needed to preserve ISO-8859-1. The reasons for this are complicated, but basically I had to find a way to feed a Unicode string (originally an array of 32-bit integers) into a legacy engine which was designed, many years previously (by somebody else), to assume that everything in the world was Latin-1. That legacy engine did ascribe meaning to the U+A0 to U+FF characters, so I couldn't use them for anything else. But all I needed it to do with the non-Latin-1 Unicode characters was preserve them. Essentially, I needed round-trip compatibility when converting from Unicode to Latin-1 and back. This is of course impossible ... but the C1 controls weren't being used, so I made it possible. Security wasn't an issue, as the encoding never "leaked" into the outside world, and its spec was never published. If I had wanted to use it for interchange, I would obviously have further specified that all characters be stored in the minimum number of bytes. My software didn't check for violations of this, but only because it didn't need to. Jill
Re: GDP by language
Marco, I certainly wouldn't draw that conclusion. This is not the appropriate forum for a political or ethical discussion, but equating GDP with "more important" in any general sense is clearly a huge leap, and one that I certainly would not make. There is a rough correlation of GDP with "currently has more money to spend for products", but that is only very, very rough. And the "currently" is very important; projections are for this chart to change pretty dramatically over the course of the next 20-50 years. See, for example, http://www.gs.com/insight/research/reports/99.pdf, http://www.economist.com/displaystory.cfm?story_id=1632512, and http://www.economist.com/displaystory.cfm?story_id=1923383. (It would be pretty interesting to make a dynamic pie chart with pieces growing/shrinking over the period of some seconds to reflect projected changes in the future.) The goal of the chart was different. Many people mistakenly think the potential customer base of non-English-speakers is smaller than it actually is. The goal was to graphically illustrate -- in a very general fashion -- that if a product only works with English, it misses a huge potential market. Mark __ http://www.macchiato.com - Original Message - From: Marco Cimarosti [EMAIL PROTECTED] To: 'Mark Davis' [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Wed, 2003 Oct 22 02:17 Subject: RE: GDP by language Mark Davis wrote: BTW, some time ago I had generated a pie chart of world GDP divided up by language. Those quotients are immoral. Of course, this immorality is not the fault of he who did the calculation: the immorality is out there, and those infamous numbers are just an arithmetical expression of it. In practice, those quotients say that, e.g., Italian (spoken by 50 million people or less) is more important than Hindi (spoken by nearly one billion people), just because an average Italian is richer than an average Indian.
In other terms, each Indian (or any other citizen from poor countries) has 1/20 or less of the linguistic rights of an Italian (or any other citizen from rich countries). BTW, by summing up languages written with the same script, it is easy to derive the immoral quotients of writing systems:

Latin 59.13%
Han 20.60%
Arabic 3.82%
Cyrillic 2.99%
Devanagari 2.54%
Hangul 1.84%
Thai 0.87%
Bengali 0.44%
Telugu 0.42%
Greek 0.40%
Tamil 0.34%
Gujarati 0.26%

Marco
Re: Backslash n [OT] was Line Separator and Paragraph Separator
Peter Kirk scripsit: But if two files each consist of one or more lines of text separated by LS (but with no final LS), when they are concatenated, surely LS must be added as a separator. Similarly with paragraphs and PS. But your protasis is a petitio principii. Files may or may not consist of lines of text: a file may contain less than one line. Way to avoid this absurd conclusion: redefine LS and PS as line and paragraph terminators, to be used at end of file when (as is normal) this corresponds to a line or paragraph end. No doubt this is the de facto position. (The *true* de facto position, of course, is not to use LS or PS at all.) -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan http://www.reutershealth.com "Dream projects long deferred usually bite the wax tadpole." --James Lileks
RE: Backslash n [OT] was Line Separator and Paragraph Separator
Peter Kirk wrote: But if two files each consist of one or more lines of text separated by LS (but with no final LS), when they are concatenated, surely LS must be added as a separator. Similarly with paragraphs and PS. And this applies even when each consists of one line or one paragraph, hence no LS or PS in either file. Conclusion: both LS and PS must be added in ANY concatenation. Way to avoid this absurd conclusion: redefine LS and PS as line and paragraph terminators, to be used at end of file when (as is normal) this corresponds to a line or paragraph end. No, and no. The first and last lines in a text file may well be partial. If one wants a PS or LS in-between when concatenating them (assuming they are of the same encoding), the LS or PS must be explicitly concatenated in. (The result of reading, line-by-line, first file A then file B is not always the same as reading, line-by-line, the concatenation of files A and B. I.e. readline does not distribute over concatenation, if you like that kind of formulation. Maybe you would like it to, but it doesn't, never has.) /kent k
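Kent's point can be made concrete in a couple of lines (Python's splitlines stands in here for a generic readline; the example strings are ours):

```python
# Reading file A line-by-line and then file B is not the same as reading
# the concatenation of A and B, when A's last line has no terminator:
# A's partial last line and B's first line merge into one.
a = "first\nsecond"   # no trailing line end: "second" may be a partial line
b = "third\n"

assert a.splitlines() + b.splitlines() == ['first', 'second', 'third']
assert (a + b).splitlines() == ['first', 'secondthird']   # lines merged
print('readline does not distribute over concatenation')
```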
RE: Encoding for Fun (was Line Separator)
I can't argue with that ... but my strings were always in (32-bit wide) Unicode at sort-time. I'm not sure exactly how much value there is in a lexicographical sort anyway. I mean, even in Latin-1, surely 'é' should not come after 'z'? Of course, UTF-16 doesn't have the binary sort property either. Jill -Original Message- From: John Cowan [mailto:[EMAIL PROTECTED]] Sent: Wednesday, October 22, 2003 4:32 PM To: Jill Ramonsky Cc: [EMAIL PROTECTED] Subject: Re: Encoding for Fun (was Line Separator) UTF-8 has this property too. This protocol lacks, however, the binary sorting property that UTF-8 has.
Re: Encoding for Fun (was Line Separator)
Jill Ramonsky scripsit: The only invented encoding which got any real use was the following (currently nameless) one: I believe the name UTF-4 is currently unclaimed. :-) I like this idea. Another interesting feature: starting from a random point in a string, it is easy to scan backwards or forwards to find the start-byte or end-byte of a character. This is valuable, as it means that you don't have to parse a string from the beginning in order not to get lost. UTF-8 has this property too. This protocol lacks, however, the binary sorting property that UTF-8 has. -- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com If I have seen farther than others, it is because I was standing on the shoulders of giants. --Isaac Newton
Re: Backslash n [OT] was Line Separator and Paragraph Separator
Unicode UAX 14 (Line Breaking Properties) also has a bit to say on this topic of line separators. From http://www.unicode.org/reports/tr14/

BK - Mandatory Break (A) - (normative). Explicit breaks act independently of the surrounding characters.
000C FORM FEED: Form Feed separates pages. The text on the new page starts at the beginning of the line. No paragraph formatting is applied.
2028 LINE SEPARATOR: The text after the Line Separator starts at the beginning of the line. No paragraph formatting is applied. This is similar to HTML's <BR>.
2029 PARAGRAPH SEPARATOR: The text of the new paragraph starts at the beginning of the line. Paragraph formatting is applied.

NEW LINE FUNCTION (NLF): New line functions provide additional explicit breaks. They are not individual characters, but are expressed as sequences of the control characters NEL, LF, and CR. What particular sequence(s) form an NLF depends on the implementation and other circumstances as described in [Unicode] Section 5.8, Newline Guidelines. If a character sequence for a new line function contains more than one character, it is kept together. The default behavior is to break after LF or CR, but not between CR and LF. Two additional line breaking classes have been added for convenience in this operation.

Mandatory breaks:
LB 3a: Always break after hard line breaks (but never between CR and LF). (BK !)
LB 3b: Treat CR followed by LF, as well as CR, LF and NL, as hard line breaks. (CR × LF; CR !; LF !; NL !)

-- Andy Heninger [EMAIL PROTECTED]
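The two mandatory-break rules quoted above can be sketched as follows (a toy implementation of ours; it covers only the BK characters and the CR/LF/NEL new line functions, not the rest of UAX 14):

```python
# A toy version of the mandatory-break rules quoted above: break after
# the BK characters (FF, LS, PS) and after the NLF characters (CR, LF,
# NEL), but never between CR and LF (rule "CR x LF").
HARD = {'\x0c', '\u2028', '\u2029', '\r', '\n', '\x85'}

def mandatory_breaks(text: str) -> list[int]:
    """Return the indices after which a hard line break occurs."""
    breaks = []
    for i, ch in enumerate(text):
        if ch in HARD:
            if ch == '\r' and i + 1 < len(text) and text[i + 1] == '\n':
                continue  # CR followed by LF: the break comes after the LF
            breaks.append(i)
    return breaks

print(mandatory_breaks('a\r\nb\u2028c'))  # [2, 4]
```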
RE: Encoding for Fun (was Line Separator)
I can't argue with that ... but my strings were always in (32-bit wide) Unicode at sort-time. I'm not sure exactly how much value there is in a lexicographical sort anyway. I mean, even in Latin-1, surely 'é' should not come after 'z'? Not always. In particular, there are times when a dependable sort order is required, but just what that sort order is isn't important. In those cases it can be useful that UTF-8 and UTF-32 both do a binary sort with equivalent results. Of course, UTF-16 doesn't have the binary sort property either. Nope, though an efficient mechanism to sort UTF-16 in code point order is available.
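The efficient mechanism alluded to is presumably the familiar code-unit fixup trick; here is a sketch in C under that assumption (the function names are mine). The remapping moves the surrogate range D800..DFFF above E000..FFFF, so comparing fixed-up units matches code point order without ever decoding surrogate pairs:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Remap a UTF-16 code unit so that unsigned comparison of the
   results yields code point order:
     0000..D7FF stay put,
     E000..FFFF move down to D800..F7FF,
     surrogates D800..DFFF move up to F800..FFFF (they encode
     code points above FFFF, so they must sort last). */
static uint16_t fixup(uint16_t u)
{
    return (u >= 0xE000) ? (uint16_t)(u - 0x800)
         : (u >= 0xD800) ? (uint16_t)(u + 0x2000)
         : u;
}

/* Compare two UTF-16 strings in code point order. */
static int utf16_cmp_cp_order(const uint16_t *a, size_t alen,
                              const uint16_t *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    for (size_t i = 0; i < n; i++) {
        uint16_t fa = fixup(a[i]), fb = fixup(b[i]);
        if (fa != fb)
            return fa < fb ? -1 : 1;
    }
    return alen == blen ? 0 : (alen < blen ? -1 : 1);
}
```

With this, U+FFFD correctly sorts before U+10000 (encoded as the pair D800 DC00), which a naive code-unit comparison would get backwards.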
Re: Backslash n [OT] was Line Separator and Paragraph Separator
On 22/10/2003 08:36, John Cowan wrote: Peter Kirk scripsit: But if two files each consist of one or more lines of text separated by LS (but with no final LS), when they are concatenated, surely LS must be added as a separator. Similarly with paragraphs and PS. But your protasis is a petitio principii. Files may or may not consist of lines of text: a file may contain less than one line. One way to avoid this absurd conclusion: redefine LS and PS as line and paragraph terminators, to be used at end of file when (as is normal) this corresponds to a line or paragraph end. No doubt this is the de facto position. (The *true* de facto position, of course, is not to use LS or PS at all.) Well, perhaps this needs to be read as disproof by reductio ad absurdum. I have shown it to be absurd to consider files to consist of one or more lines of text separated by LS, most obviously because it becomes impossible to tell whether the last line is intended to be complete or not. But Kent did imply this model of file structure when he wrote And LS it's a separator, not a terminator, so EOF has to be a line terminator. But according to Kent's latest posting (my emphasis), The *first* and last lines in a text file may well be partial. How can one tell, in any encoding, whether the first line is partial? And it seems that, in a file where LS is used as a separator not a terminator, EOF is a line terminator except when it isn't. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Encoding for Fun (was Line Separator)
Jill Ramonsky scripsit: I can't argue with that ... but my strings were always in (32-bit wide) Unicode at sort-time. I'm not sure exactly how much value there is a lexicographical sort anyway. I mean, even in Latin-1, surely 'é' should not come after 'z'? Fair enough. Another good property that your UTF-4 scheme has is that 8-bit search will work correctly, which is true of UTF-8 as well but not of UTF-16. -- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com I must confess that I have very little notion of what [s. 4 of the British Trade Marks Act, 1938] is intended to convey, and particularly the sentence of 253 words, as I make them, which constitutes sub-section 1. I doubt if the entire statute book could be successfully searched for a sentence of equal length which is of more fuliginous obscurity. --MacKinnon LJ, 1940
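John's point that byte-wise search works on UTF-8 follows from the same self-synchronisation property: no character's encoding appears as a subsequence of another character's encoding. A tiny C sketch (the helper name is mine) using plain strstr on the raw bytes:

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Find a UTF-8 needle in a UTF-8 haystack by plain byte search.
   Because UTF-8 lead and continuation bytes come from disjoint
   ranges, a match can only occur at a real character boundary.
   Returns the byte offset of the first match, or -1. */
static ptrdiff_t find_utf8(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);
    return hit ? hit - haystack : -1;
}
```

Searching for the two bytes of "é" (C3 A9) in "caf\xC3\xA9" finds it at offset 3, and can never match in the middle of some other character. The same byte-level search on UTF-16 data would be unreliable, since an arbitrary 8-bit pattern can straddle code-unit boundaries.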
Re: GDP by language
On 22/10/2003 02:17, Marco Cimarosti wrote: ... BTW, by summing up languages written with the same script, it is easy to derive the immoral quotients of writing systems: Latin 59.13% Han 20.60% Arabic 3.82% Cyrillic 2.99% Devanagari 2.54% Hangul 1.84% Thai 0.87% Bengali 0.44% Telugu 0.42% Greek 0.40% Tamil 0.34% Gujarati 0.26% _ Marco The data doesn't support addition to this degree of accuracy because of the effect of the "others" area. Cyrillic may even overtake Arabic, because there are several countries using the Cyrillic alphabet, but not Russian or Ukrainian, which might each contribute 0.1-0.2%, but no countries as far as I know using Arabic script but not Arabic, Persian or Urdu as official languages (except perhaps Pashto in Afghanistan). Also of course the GDP data is surely not reliable to sufficient accuracy. Also you might get a slightly different picture if you add in the relatively prosperous users of non-western scripts who have migrated to western countries - Hebrew and Armenian as well as south Asian scripts. As for the morality issue: while we can't do much about the relative availability of computers, it is encouraging to see that commercial software vendors as well as the open source community are making internationalisation packages and localised versions of software available, sometimes free to all and sometimes at greatly reduced cost in poorer countries. Unicode isn't going to solve inequalities on its own, but it can hardly be blamed for contributing to them. In the long term, and if other factors allow it, we might even find that the computer revolution is the key to breaking down these inequalities. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Public Review Issues - closing Oct 27
This is to remind everyone that several current Unicode Public Review Issues will close for comments on October 27. Please send soon any comments that you have not already submitted. The Public Review Issues page is here: http://www.unicode.org/review/ The page includes instructions for returning comments for UTC consideration. Regards, Rick McGowan Unicode, Inc.
Re: GDP by language
Peter Kirk wrote: On 22/10/2003 02:17, Marco Cimarosti wrote: ... BTW, by summing up languages written with the same script, it is easy to derive the immoral quotients of writing systems: Latin 59.13% Han 20.60% Arabic 3.82% Cyrillic 2.99% Devanagari 2.54% Hangul 1.84% Thai 0.87% Bengali 0.44% Telugu 0.42% Greek 0.40% Tamil 0.34% Gujarati 0.26% The data doesn't support addition to this degree of accuracy because of the effect of the "others" area. Cyrillic may even overtake Arabic, because there are several countries using the Cyrillic alphabet, but not Russian or Ukrainian, which might each contribute 0.1-0.2%, but no countries as far as I know using Arabic script but not Arabic, Persian or Urdu as official languages (except perhaps Pashto in Afghanistan). Also of course the GDP data is surely not reliable to sufficient accuracy. Don't forget to take into account that Latin and Greek letters are used in most languages, e.g. as part of mathematical formulae. Stefan
Re: Backslash n [OT] was Line Separator and Paragraph Separator
On 22 Oct 2003, at 6:53, John Cowan wrote: Kent Karlsson scripsit: All of CR, LF, CR LF, NEL, LS, PS, and EOF(!). (Assuming that the encoding of the text file is recognised.) XML 1.0 treats CR, LF, and CR LF as line terminators and reports them all as LF. XML 1.1 will treat CR, LF, NEL, CR LF, CR NEL, and LS as line terminators and report them all as LF. PS is left alone, because of the bare possibility that it is being used as quasi-markup. I can't imagine why EOF should be called a line terminator, except in the sense that a "read a line" operation should obviously not attempt to read past EOF. Calling it a line terminator means that every document is forced into the mold of being an integral number of lines long, regardless of the facts. Don't know about LF CR. I think that should be two line ends. I agree. I don't know any system that uses this sequence. The BBC Micro---well-known to a generation of British schoolchildren---used this sequence. You can probably find files in that encoding on some 5.25in floppies in DFS format in some store cupboards somewhere (for what that's worth). I wrote a little line-conversion (f)utility recently, and the (minimal) research I did suggested that the following was a complete set of line terminators that might be found in practice:

CRLF
CRFF
CRVT
LF
FF
VT
CR
LFCR
NEL
CRCRLF
NUL
end of file (not control-D or control-Z, I mean the real end-of-file)

CRLF is derived from standard printer technology. CRFF and CRVT are how you would get the printer to move by more than a line. More recent practice allows LF, FF or VT to be used solo. If sent directly to a printer they still terminate the line, though the printed output would look different since the carriage would not return. CR is from MacOS, LFCR is from the BBC Micro. NEL is a dedicated character with the right meaning. CRCRLF is generated by some buggy software I have to put up with. And I can't remember why I wanted to allow NUL. 
I probably reasoned that (in its C role as end of string) it must terminate a line, just as EOF does. This is all for Latin-1 only. Obviously, it's pretty idiosyncratic, but it looks like I missed at least CRNEL---any others? I think someone mentioned the IND (index) character recently in the context of line-breaking. I'd like to ask, what is its intended function? /| o o o (_|/ /| (_/
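For what it's worth, the kind of normalisation such a (f)utility performs might look like the sketch below. This is my own guess at the shape of it, handling the sequences listed above for a Latin-1 byte stream (NEL as 0x85) and collapsing each terminator to a single LF; Jonathan's actual program surely differs in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Copy `in` to `out`, rewriting each line terminator (CRLF, CRFF,
   CRVT, CRCRLF, LFCR, bare CR/LF/FF/VT, and NEL) to a single LF.
   Returns the number of bytes written; `out` must be at least as
   large as `in`. */
static size_t normalise_newlines(const unsigned char *in, size_t n,
                                 unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        unsigned char c = in[i];
        if (c == '\r') {
            i++;
            if (i + 1 < n && in[i] == '\r' && in[i + 1] == '\n')
                i += 2;                      /* CRCRLF (buggy apps)  */
            else if (i < n && (in[i] == '\n' || in[i] == '\f'
                                             || in[i] == '\v'))
                i++;                         /* CRLF, CRFF, CRVT     */
            out[o++] = '\n';
        } else if (c == '\n') {
            i++;
            if (i < n && in[i] == '\r')
                i++;                         /* LFCR (BBC Micro)     */
            out[o++] = '\n';
        } else if (c == '\f' || c == '\v' || c == 0x85) {
            i++;                             /* solo FF, VT, or NEL  */
            out[o++] = '\n';
        } else {
            out[o++] = in[i++];
        }
    }
    return o;
}
```

On the input "a\r\nb\n\rc\r\r\nd" this yields "a\nb\nc\nd": CRLF, LFCR, and CRCRLF each collapse to one LF.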
Re: Backslash n [OT] was Line Separator and Paragraph Separator
Jonathan Coxhead [EMAIL PROTECTED] wrote: On 22 Oct 2003, at 6:53, John Cowan wrote: Kent Karlsson scripsit: Don't know about LF CR. I think that should be two line ends. I agree. I don't know any system that uses this sequence. The BBC Micro---well-known to a generation of British schoolchildren---used this sequence. You can probably find files in that encoding on some 5.25in floppies in DFS format in some store cupboards somewhere (for what that's worth). Also the PRIME computers of the 1970s and 80s. If you remember an online service called The Source (similar to Compuserve, but different), it ran on big PRIMEs. File transfer protocols (such as Kermit) that were used to get text files into and out of The Source swapped LFCR to CRLF (and stripped the 8th bit from its native negative ASCII / Mark parity encoding). LFCR makes historical sense if you think about how manual typewriters work. When you push the carriage return lever (did you ever wonder where the name Carriage Return came from?), the platen rolls up one line immediately (LF), and then as you keep pushing it, the carriage returns to the left margin (CR). See: http://xavier.xu.edu:8000/~polt/tw-parts.html (hey, I never saw a left-handed typewriter before...) I wrote a little line-conversion (f)utility recently, and the (minimal) research I did suggested that the following was a complete set of line terminators that might be found in practice: You can't really tell by inspection what any of these sequences is supposed to do without knowing where and how the file was created. Is CR a line terminator, or a paragraph separator, or is it being used for overstriking (a common method of underlining)? Treating EOF as EOL is dangerous. In Unix, many applications flag an error when a text file does not end with a line terminator, since it might mean the file is incomplete. 
This is in contrast to the Windows practice of auto-detecting-and-correcting-and-accepting everything, on the assumption that users can't possibly know what they are doing. Another interesting scheme is used in VMS text files of a certain format: each line begins with LF and ends with CR. - Frank
Preliminary minutes from UTC 96 (August 2003) posted publicly
The preliminary minutes from UTC 96 (August 2003) have been posted for public access at http://www.unicode.org/consortium/utc-minutes.html

Magda Danish
Administrative Director
The Unicode Consortium
650-693-3921
Re: Backslash n [OT] was Line Separator and Paragraph Separator
From: [EMAIL PROTECTED] However, because the universal-character-name escapes (\u and \U) are defined relative to a particular encoding, namely ISO 10646, it would be an error if ('\n' != '\u000A' || '\r' != '\u000D'). Whether this is implemented by using the values 0x0A and 0x0D for LF and CR respectively (e.g. by using US-ASCII or a proper superset of US-ASCII such as Unicode) or by converting those values to another encoding when parsing isn't specified. You're wrong here: neither Unicode nor ISO specifies that the source constants '\n' or '\r', which are made with an escaping mechanism of _multiple_ distinct characters specific to some programming languages, must be bound at compile time or run time to a LF or CR character. The '\n' and '\r' conventions are specific to each language, and C/C++ use conventions distinct from those in Java, for example... This is not an encoding issue, but a language feature. In C or C++, if you want to be sure that your program will be portable when you need to specify LF or CR exclusively, you MUST NOT use the '\n' and '\r' constants but instead the numeric escapes in strings (i.e. \012 or \x0A for LF, and \015 or \x0D for CR), or simply the integer constants for the char, int, or wchar_t datatypes (i.e. 10 or 012 or 0x0A for LF, and 13 or 015 or 0x0D for CR), and make sure that your run-time library will map these values correctly to your run-time locale or system environment (you may need to specify file-open flags to control this mapping, such as the "t" flag for fopen function calls). So a test like if ('\n' == 10) may or may not be true in C/C++, depending on the compiler implementation (but not on the system platform...), while the same test in Java will always be true...
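As a minimal illustration of the portability point (the constant names below are mine, and an ASCII-based compiler is assumed for the stated values): '\n' is whatever code the implementation chooses (on EBCDIC compilers it is typically not 0x0A), while a numeric escape pins the value down exactly on every conforming compiler.

```c
#include <assert.h>

/* Numeric escapes denote fixed byte values regardless of what the
   implementation maps '\n' and '\r' to. */
enum {
    LINE_FEED       = '\x0A',   /* always 0x0A, on every compiler */
    CARRIAGE_RETURN = '\x0D'    /* always 0x0D, on every compiler */
};
```

Code that must emit exactly LF or CR (e.g. when writing RFC 822 or HTTP headers) can use these constants and a binary-mode stream, sidestepping both the compiler's character set and the run-time library's newline translation.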
Re: Backslash n [OT] was Line Separator and Paragraph Separator
At 11:56 AM -0700 10/22/03, Jonathan Coxhead wrote: Don't know about LF, CR. I think that should be two line ends. I agree. I don't know any system that uses this sequence. The BBC Micro---well-known to a generation of British schoolchildren---used this sequence. You can probably find files in that encoding on some 5.25in floppies in DFS format in some store cupboards somewhere (for what that's worth). My God! I had no idea. Those poor British school children who can't write XML on their BBC micros! Clearly we must allow LFCR as a legal line ending in XML 1.1. It's a matter of justice! (Tongue firmly in cheek.) -- Elliotte Rusty Harold [EMAIL PROTECTED] Processing XML with Java (Addison-Wesley, 2002) http://www.cafeconleche.org/books/xmljava http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA