RE: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread jarkko.hietaniemi
 So this legacy encoding of end-of-lines is now quite obsolete
 even on MacOS.

I don't think it can be called obsolete as long as files generated using
that line end convention exist.  Or, at least, applications that have an
operation for "read a line" will have to cope with it.  (In other words,
all of CR, LF, CRLF, and LFCR should mark an end of line.)



  



Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread Philippe Verdy
From: [EMAIL PROTECTED]

 I wrote:

  So this legacy encoding of end-of-lines is now quite obsolete
  even on MacOS.

 I don't think it can be called obsolete as long as files generated using
 that line end convention exist.  Or, at least, applications that have an
 operation for "read a line" will have to cope with it.  (In other words,
 all of CR, LF, CRLF, and LFCR should mark an end of line.)

I was not speaking about the actual encoding of files into bytes, but
only about the interpretation of '\n' or '\r' in C/C++, which was the real
subject of the message.

You are referring to the run-time behavior of I/O readers/writers for
files or network messages, and of course this is not obsolete, as the
text/plain MIME format, as well as the RFC 822 message format (also used
in the HTTP protocol), still uses a CR+LF sequence for end-of-line
marks in headers (this is even mandatory for RFC 822 and HTTP
conformance).

I just wonder whether more recent C/C++ compilers for MacOS still compile
the '\n' escape in _source_ string or character constants to a CR.




RE: GDP by language

2003-10-22 Thread Marco Cimarosti
Mark Davis wrote:
 BTW, some time ago I had generated a pie chart of world GDP 
 divided up by language.

Those quotients are immoral.

Of course, this immorality is not the fault of the person who did the
calculation: the immorality is out there, and those infamous numbers are just
an arithmetical expression of it.

In practice, those quotients say that, e.g., Italian (spoken by 50 million
people or less) is more important than Hindi (spoken by nearly one billion
people), just because an average Italian is richer than an average Indian.

In other words, each Indian (or any other citizen of a poor country) has
1/20 or less of the linguistic rights of an Italian (or any other citizen
of a rich country).

BTW, by summing up languages written with the same script, it is easy to
derive the immoral quotients of writing systems:

Latin   59.13%
Han 20.60%
Arabic   3.82%
Cyrillic 2.99%
Devanagari   2.54%
Hangul   1.84%
Thai 0.87%
Bengali  0.44%
Telugu   0.42%
Greek0.40%
Tamil0.34%
Gujarati 0.26%

_ Marco




RE: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread Kent Karlsson
 all of CR, LF, CRLF, and LFCR should mark an end of line.)

All of CR, LF, CR+LF, NEL, LS, PS, and EOF(!).  (Assuming that the
encoding of the text file is recognised.)

Don't know about LF+CR. I think that should be two line ends.

/kent k




RE: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread jarkko.hietaniemi
 
  all of CR, LF, CRLF, and LFCR should mark an end of line.)
 
 All of CR, LF, CR+LF, NEL, LS, PS, and EOF(!). (Assuming that the

I was still staying within the ASCII and \r \n discussion, but yes,
if one goes Latin-1 / Unicode there are also NEL and LS and PS (why not
FF, then?), and of course EOF.

 encoding of the text file is recognised.)
 
 Don't know about LF+CR. I think that should be two line ends.

That's a good question: is it a case of mixing different EOLs in the
same file, or a question of a "\r\n" emitted by MacOS Classic (where '\n'
compiles to CR and '\r' to LF, so "\r\n" comes out as LF followed by CR)?
 
   /kent k
 



Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread John Cowan
Kent Karlsson scripsit:

 All of CR, LF, CR+LF, NEL, LS, PS, and EOF(!). (Assuming that the
 encoding of the text file is recognised.)

XML 1.0 treats CR, LF, and CR+LF as line terminators and reports
them all as LF.

XML 1.1 will treat CR, LF, NEL, CR+LF, CR+NEL, and LS as line
terminators and report them all as LF.  PS is left alone, because of
the bare possibility that it is being used as quasi-markup.
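
For concreteness, here is a rough sketch in C of that normalization pass over
already-decoded code points (the function name and the buffer-to-buffer shape
are my own illustrative choices, not the actual parser code of either spec):

    #include <stddef.h>

    /* Sketch of XML 1.1 style line-end handling: CR+LF, CR+NEL, lone CR,
       NEL and LS are each reported as a single LF; PS and everything else
       pass through unchanged.  Returns the normalized length. */
    size_t normalize_newlines(const unsigned long *in, size_t n,
                              unsigned long *out)
    {
        size_t i = 0, o = 0;
        while (i < n) {
            unsigned long c = in[i++];
            if (c == 0x000D) {                      /* CR ... */
                if (i < n && (in[i] == 0x000A || in[i] == 0x0085))
                    i++;                            /* ... swallow a following LF or NEL */
                out[o++] = 0x000A;
            } else if (c == 0x000A || c == 0x0085 || c == 0x2028) {
                out[o++] = 0x000A;                  /* LF, NEL, LS -> LF */
            } else {
                out[o++] = c;                       /* includes PS (0x2029) */
            }
        }
        return o;
    }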

I can't imagine why EOF should be called a line terminator, except
in the sense that a "read a line" operation should obviously not attempt
to read past EOF.  Calling it a line terminator means that every
document is forced into the mold of being an integral number of lines
long, regardless of the facts.

 Don't know about LF+CR. I think that should be two line ends.

I agree.  I don't know any system that uses this sequence.

-- 
BALIN FUNDINUL  UZBAD KHAZADDUMU[EMAIL PROTECTED]
BALIN SON OF FUNDIN LORD OF KHAZAD-DUM  http://www.ccil.org/~cowan



Encoding for Fun (was Line Separator)

2003-10-22 Thread Jill Ramonsky






 -Original Message-
 From: Doug Ewell [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, October 22, 2003 6:19 AM
 To: Unicode Mailing List
 Cc: Marco Cimarosti; Jill Ramonsky
 Subject: Re: Line Separator and Paragraph Separator
 Importance: Low
 
 
 Jill, I'd be interested in details of your invented 
 encodings, just for
 fun. Please e-mail privately to avoid incurring the wrath of 
 group (b).
 

I'm going to risk the wrath of the group because I hereby place this in
the public domain. Now you can't patent it! :-)
Unicode list, please note, I used this a few years back internally,
within one particular piece of software. It was never intended for
wider use ... and that's the case for the defence, m'lud!



The only invented encoding which got any real use was the following
(currently nameless) one:

We define an "8X" byte as a byte with bit pattern 1000xxxx (i.e. 0x80-0x8F)
We define a "9X" byte as a byte with bit pattern 1001xxxx (i.e. 0x90-0x9F)

The rules are:
(1) If the codepoint is in the range U+00 to U+7F, represent it as a
single byte (that covers ASCII)
(2) If the codepoint is in the range U+A0 to U+FF, also represent it as
a single byte (that covers Latin-1, minus the C1 controls)
(3) In all other cases, represent the codepoint as a sequence of one or
more 8X bytes followed by a single 9X byte.

A sequence of N 8X bytes plus one 9X byte therefore contains 4(N+1)
bits of "payload", which are then interpreted literally as a Unicode
codepoint.

EXAMPLES:
U+2A ('*') would be represented as 2A (all Latin-1 chars are left
unchanged apart from the C1s).
U+85 (NEL) would be represented as 88 95 (just to prove that we haven't
lost the C1 controls altogether!)
U+20AC (Euro sign) would be represented as 82 80 8A 9C

As you can see, the hex value of the encoded codepoint is actually
"readable" from the byte sequence, if you just look at the second nibble
of each 8X or 9X byte.
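
To make the rules above concrete, here is a minimal sketch in C of an encoder
for one codepoint (the function name, the output-buffer convention and the
shortest-form choice are my assumptions, not part of the original in-house code):

    #include <stddef.h>

    /* Rules (1)-(3) above: ASCII and A0-FF bytes pass through; anything
       else becomes a run of 8X bytes carrying the high nibbles, closed by
       one 9X byte carrying the low nibble.  Shortest form is emitted.
       Returns the number of bytes written; out must have room for 6. */
    size_t encode_cp(unsigned long cp, unsigned char *out)
    {
        size_t n = 0;
        int shift = 20;                       /* U+10FFFF needs at most 6 nibbles */

        if (cp <= 0x7F || (cp >= 0xA0 && cp <= 0xFF)) {
            out[n++] = (unsigned char)cp;     /* rules (1) and (2) */
            return n;
        }
        while (shift > 4 && ((cp >> shift) & 0xF) == 0)
            shift -= 4;                       /* skip leading zero nibbles */
        for (; shift >= 4; shift -= 4)
            out[n++] = (unsigned char)(0x80 | ((cp >> shift) & 0xF));
        out[n++] = (unsigned char)(0x90 | (cp & 0xF));
        return n;
    }

Feeding it U+0085 and U+20AC reproduces the 88 95 and 82 80 8A 9C sequences above.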

Another interesting feature: starting from a random point in a string,
it is easy to scan backwards or forwards to find the start-byte or
end-byte of a character. This is valuable, as it means that you don't
have to parse a string from the beginning in order not to get lost.

Finally, of course, the big plus is that it "looks like ASCII".
Although this was used for "internal use only", it is interesting to
speculate how it might have been declared, had it been a published
encoding. Because, you see, it is quite interpretable by any engine
which understands only Latin-1. The worst outcome is that any 8X...9X
sequences will be incorrectly displayed as multiple unknown-character
glyphs ... but that is not much worse than displaying a single
unknown-character glyph. On the other hand, if you declare it as
"LATIN-1-PLUS" or something, then any application which does not
recognise that encoding name will be forced to interpret the stream as
7-bit, ASCII, thereby replacing all codepoints above U+7F with '?' or
something. Which behavior is preferable, I wonder? What we'd really want
the encoding name to say is "interpret as LATIN-1-PLUS if you can,
otherwise interpret as LATIN-1", but there doesn't seem to be any way of
saying that with current encoding nomenclature.

Jill








Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread Philippe Verdy
From: John Cowan [EMAIL PROTECTED]

 Kent Karlsson scripsit:
 
  All of CR, LF, CR+LF, NEL, LS, PS, and EOF(!). (Assuming that the
  encoding of the text file is recognised.)
 
 XML 1.0 treats CR, LF, and CR+LF as line terminators and reports
 them all as LF.
 XML 1.1 will treat CR, LF, NEL, CR+LF, CR+NEL, and LS as line
 terminators and report them all as LF.  PS is left alone, because of
 the bare possibility that it is being used as quasi-markup.
 [...]

I also have some old documents that use VT=U+000B instead of
LF=U+000A to increase the interparagraph spacing. This is still
mapped to the source '\v' character constant in C/C++ (and Java
as well, except that Java _requires_ that '\v' be mapped only to
VT).

Some applications still seem to use VT after CR to create soft line
breaks, in text files where paragraphs are normally ended by CRLF.

CR was intended to create an overstrike on the previously written (but
still complete) line, for example to underline some characters on that
line. This is what '\r' should imply in C, and in fact such a '\r' should
no longer be used in C, as its purpose is to add visual attributes to the
previous text. That is why CR comes before the LF that terminates the paragraph.

Of course these controls will still see many more uses in terminal emulation
protocols, which technically are not text file encodings, as they can
create dynamic effects, or can encode and render text in a non-logical
order, for example when emulating blinking, or creating ASCII art:
I consider that terminal emulation protocols (including printing protocols)
are supersets of the plain text format, but plain text should not attempt
to reproduce all the terminal features.

So what is the status of VT in plain text files? For me it should
have the same behavior as LF, except that it does not imply an end
of paragraph. Is there a good replacement for this legacy control, one
that just means an explicit soft line break in the middle of a paragraph
(in which case it may occur instead of a SPACE and act as a word
separator, except if it occurs after a soft hyphen, where it
becomes ignorable)?




Re: Encoding for Fun (was Line Separator)

2003-10-22 Thread Philippe Verdy
From: Jill Ramonsky 
  From: Doug Ewell [mailto:[EMAIL PROTECTED]
   
   Jill, I'd be interested in details of your invented 
   encodings, just for
   fun.  Please e-mail privately to avoid incurring the wrath of 
   group (b).
 
 I'm going to risk the wrath of the group because I hereby
 place this in the public domain. Now you can't patent it! :-)
 Unicode list, please note, I used this a few years back
 internally, within one particular piece of software.
 It was never intended for wider use ... and that's the case
 for the defence, m'lud!
 
 The only invented encoding which got any real use was the
 following (currently nameless) one:
 
 We define an "8X" byte as a byte with bit pattern 1000xxxx
 We define a "9X" byte as a byte with bit pattern 1001xxxx
 
 The rules are:
 (1) If the codepoint is in the range U+00 to U+7F, represent
 it as a single byte (that covers ASCII)
 (2) If the codepoint is in the range U+A0 to U+FF, also represent
 it as a single byte (that covers Latin-1, minus the C1 controls)
 (3) In all other cases, represent the codepoint as a sequence of
 one or more 8X bytes followed by a single 9X byte.
 A sequence of N 8X bytes plus one 9X byte therefore contains
 4(N+1) bits of payload, which are then interpreted literally as
 a Unicode codepoint.
 
 EXAMPLES:
 U+2A ('*') would be represented as 2A (all Latin-1 chars are left
 unchanged apart from the C1s).
 U+85 (NEL) would be represented as 88 95 (just to prove that we
 haven't lost the C1 controls altogether!)
 U+20AC (Euro sign) would be represented as 82 80 8A 9C
 
 As you can see, the hex value of the encoded codepoint is actually
 readable from the hex, if you just look at the second nibble of
 each 8X or 9X byte.

That's a quite simple encoding. At least it has the merit of not being
restricted in encoding length (but this may also be a security issue
in systems that would implement it, as there's no limit on the
number of bytes to scan forward or backward to get the whole
sequence, unless you specify that there can be no more than
five 8X bytes, as the longest valid sequence would be
{0x81, 0x80, 0x8F, 0x8F, 0x8F, 0x9D} = U+10FFFD).
However, UTF-8 is much more compact.

The second merit is that the technique can be used on top of all
ISO-8859-* charsets, by replacing the C1 controls mapped in the
0x8X and 0x9X positions.

It could as well be mapped over EBCDIC, using the mapping
between standard ISO Latin-1 and EBCDIC Latin-1, but there's
a problem caused by the legacy and widely used NEL control:

You can't then say that it is fully compatible with ISO-8859-1,
as it breaks reversible round-tripping through an EBCDIC
transcoding (unless you are sure that no internal system or
protocol will transcode your text files to/from EBCDIC). But one
could argue that 8-bit JIS and EUC do not offer this
reversibility of encodings for C1 controls either, except through
ISO 2022 codepage switches and escaping mechanisms, which
allow a reversible conversion between 8-bit and 7-bit encodings
(through SS2, SI and SO controls and escape sequences).




RE: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread Kent Karlsson


John Cowan wrote:
 XML 1.1 will treat CR, LF, NEL, CR+LF, CR+NEL, and LS as line
 terminators and report them all as LF.  PS is left alone, because of
 the bare possibility that it is being used as quasi-markup.

I'm not sure why CR+NEL should be seen as a single line end.

And I think PS should be seen as a line end for XML too.
It, like LS, can be used to format the XML source, but should not
be interpreted as other than line end when parsing the XML source.
E.g., PS is not a begin-end markup, which all other XML markup is;
nor do I know of a way of attaching style to a PS, like can be done
for <p>...</p> etc.

Following (ex-) UAX 14 fully, FF and VT should be seen as line
separators too. Though they are unlikely in XML source files.
FF shouldn't be interpreted as generating a page break in the
styled output of an XML file, should it?

 I can't imagine why EOF should be called a line terminator, except
 in the sense that a "read a line" operation should obviously
 not attempt to read past EOF.

There have been Unix programs that (mistakenly, I'd say) *discarded*
the last (possibly partial) line of input, just because it had no LF at
its end...  And LS is a separator, not a terminator, so EOF has to be a
line terminator.

  Calling it a line terminator means that every
 document is forced into the mold of being an integral number of lines
 long, regardless of the facts.

?? If you mean that concatenating files should not generate a line break
between the files, I agree.

/kent k




Re: Encoding for Fun (was Line Separator)

2003-10-22 Thread jon
 The only invented encoding which got any real use was the following 
 (currently nameless) one:
 
 We define an "8X" byte as a byte with bit pattern 1000xxxx
 We define a "9X" byte as a byte with bit pattern 1001xxxx
 
 The rules are:
 (1) If the codepoint is in the range U+00 to U+7F, represent it as a 
 single byte (that covers ASCII)
 (2) If the codepoint is in the range U+A0 to U+FF, also represent it as 
 a single byte (that covers Latin-1, minus the C1 controls)
 (3) In all other cases, represent the codepoint as a sequence of one or 
 more 8X bytes followed by a single 9X byte.
 
 A sequence of N 8X bytes plus one 9X byte therefore contains 4(N+1)
 bits of payload, which are then interpreted literally as a Unicode
 codepoint.
 
 EXAMPLES:
 U+2A ('*') would be represented as 2A (all Latin-1 chars are left 
 unchanged apart from the C1s).
 U+85 (NEL) would be represented as 88 95 (just to prove that we haven't 
 lost the C1 controls altogether!)
 U+20AC (Euro sign) would be represented as 82 80 8A 9C

If you used this for interchange between components there would be a potential 
security issue if you allowed for over-long encodings, such as encoding 
U+002F as 0x82 0x9F.
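
For what it's worth, here is a hedged sketch of the corresponding check at
decode time (my own function shape, not anything Jill published): reject a
leading 8X byte whose payload nibble is zero, reject sequences longer than the
six bytes U+10FFFD needs, and reject multi-byte forms of codepoints that rules
(1) and (2) say must be single bytes.

    /* Sketch: decode one character of the 8X...9X scheme from *p (up to end),
       refusing overlong or overlength sequences.  Returns the codepoint or
       -1 on error; on success *p is advanced past the sequence. */
    long decode_cp(const unsigned char **p, const unsigned char *end)
    {
        const unsigned char *s = *p;
        unsigned long cp = 0;
        int count = 0;

        if (s >= end)
            return -1;
        if ((*s & 0xE0) != 0x80) {            /* not an 8X or 9X byte: rules (1)/(2) */
            *p = s + 1;
            return (long)*s;
        }
        if (*s == 0x80)
            return -1;                        /* overlong: leading zero nibble */
        while (s < end && (*s & 0xF0) == 0x80) {
            if (++count > 5)
                return -1;                    /* longer than U+10FFFD ever needs */
            cp = (cp << 4) | (*s++ & 0x0F);
        }
        if (count == 0 || s >= end || (*s & 0xF0) != 0x90)
            return -1;                        /* rule (3): 8X bytes then one 9X byte */
        cp = (cp << 4) | (*s++ & 0x0F);
        if (cp <= 0x7F || (cp >= 0xA0 && cp <= 0xFF))
            return -1;                        /* overlong, e.g. 0x82 0x9F for U+002F */
        *p = s;
        return (long)cp;
    }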

Beyond that of course one can use whatever encodings one wants privately.



Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread John Cowan
Philippe Verdy scripsit:

 I also have some old documents that use VT=U+000B instead of
 LF=U+000A to increase the interparagraph spacing. This is still
 mapped to the source '\v' character constant in C/C++ (and Java
 as well, except that Java _requires_ that '\v' be mapped only to
 VT).

The XML Core WG also looked at FF, but decided that like PS it might
be markup, and therefore shouldn't arbitrarily be mapped to LF.
We didn't look at VT as far as I remember.

Historically and originally, VT was meant to control line printers,
which had a paper tape loop inside that selected the number of lines
per page, and was advanced by one frame for each line printed.  A hole
punched in a certain column represented line 1, and so FF was implemented
by advancing the tape and the paper until this hole was detected.  Another
column could contain holes for vertical tabulation points, and VT advanced
the tape and paper until the next such hole was reached.  Thus VT was
strictly analogous to TAB.

 Some applications still seem to use VT after CR to create soft line
 breaks, in text files where paragraphs are normally ended by CRLF.

IIRC, Microsoft Word uses VT internally to indicate a hard line break,
and CR for a paragraph break.

 CR was intended to create an overstrike on the previously written (but
 still complete) line, for example to underline some characters on that
 line. This is what '\r' should imply in C, and in fact such a '\r' should
 no longer be used in C, as its purpose is to add visual attributes to the
 previous text. That is why CR comes before the LF that terminates the paragraph.

In addition, Teletype terminals that received LF+CR would not reliably
print the next character in the first horizontal position, because of
the time it took to execute a CR.

-- 
Not to perambulate John Cowan [EMAIL PROTECTED]
the corridors  http://www.reutershealth.com
during the hours of repose http://www.ccil.org/~cowan
in the boots of ascension.   --Sign in Austrian ski-resort hotel  



Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread John Cowan
Kent Karlsson scripsit:
 
 
 John Cowan wrote:
  XML 1.1 will treat CR, LF, NEL, CR+LF, CR+NEL, and LS as line
  terminators and report them all as LF.  PS is left alone, because of
  the bare possibility that it is being used as quasi-markup.
 
 I'm not sure why CR+NEL should be seen as a single line end.

The IBM people, who are authoritative about their own mainframes, asked
for it.  It primarily arises out of semi-broken conversion programs
that map LF to NEL but fail to remove a preceding CR.  Since all line
terminators are inherently a matter of legacy (i.e. de facto) practice,
we accepted it.

 And I think PS should be seen as a line end for XML too.
 It, like LS, can be used to format the XML source, but should not
 be interpreted as other than line end when parsing the XML source.

We are not here concerned, as the UAX is, with when to stop reading
characters in a read-line routine.  We are concerned with which
distinctions to hide in the name of simplicity.  Our predecessors
considered that the differences between CR, CR+LF, and LF were
non-semantic, and somewhat arbitrarily chose LF as the character to be
passed to applications.  We decided that CR+NEL, NEL, and LS had
this same semantic.  But PS and FF and VT have their own semantics,
and we did not consider it justifiable to make it impossible for XML
applications to receive and process them.

 E.g., PS is not a begin-end markup, which all other XML markup is;
 nor do I know of a way of attaching style to a PS, like can be done
 for <p>...</p> etc.

PS is strictly analogous to an XML empty-tag without attributes.
While it is traditional in SGML/XML to use container elements for
paragraphs, there is no necessity to do so.

 Following (ex-) UAX 14 fully, FF and VT should be seen as line
 separators too. Though they are unlikely in XML source files.
 FF shouldn't be interpreted as generating a page break in the
 styled output of an XML file, should it?

It should be interpreted however the application chooses to interpret it.
Arbitrarily turning it into a LF makes it impossible for the application
to interpret it at all.

  I can't imagine why EOF should be called a line terminator, except
  in the sense that a read a line operation should obviously 
  not attempt to read past EOF.
 
 There have been Unix programs that (mistakenly, I'd say) *discarded*
 the last (possibly partial) line of input, just because it had no LF at
 its end... And LS is a separator, not a terminator, so EOF has to be a
 line terminator.

It would be a corruption of the input to infer a LF at the end of a
document.

-- 
First known example of political correctness:   John Cowan
After Nurhachi had united all the otherhttp://www.reutershealth.com
Jurchen tribes under the leadership of the  http://www.ccil.org/~cowan
Manchus, his successor Abahai (1592-1643)   [EMAIL PROTECTED]
issued an order that the name Jurchen should   --S. Robert Ramsey,
be banned, and from then on, they were all _The Languages of China_
to be called Manchus.



Re: Encoding for Fun (was Line Separator)

2003-10-22 Thread John Cowan
Philippe Verdy scripsit:

 It could as well be mapped over EBCDIC, using the mapping
 between standard ISO Latin 1 and EBCDIC Latin 1, but there's
 a problem caused by the legacy and widely used controls NEL:

<irony>Why, that is no problem!  Just ignore the EBCDIC difference
between NEL and LF, and map ASCIIoid LFs to EBCDIC NELs!  Doesn't
everybody know that's the Right Thing anyhow?  This business of
treating NEL as a distinct line delimiter is a complete waste
of time and money.  And nobody cares about clanking iron dinosaurs,
anyway.  They aren't cool.</irony>

-- 
John Cowan  [EMAIL PROTECTED]  http://www.ccil.org/~cowan
Does anybody want any flotsam? / I've gotsam.
Does anybody want any jetsam? / I can getsam.
--Ogden Nash, _No Doctors Today, Thank You_



Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread Peter Kirk
On 22/10/2003 05:19, Kent Karlsson wrote:

 ... And LS is a separator, not a terminator, so EOF has to be a line
 terminator.

  Calling it a line terminator means that every
  document is forced into the mold of being an integral number of lines
  long, regardless of the facts.

 ?? If you mean that concatenating files should not generate a line break
 between the files, I agree.

 /kent k


But if two files each consist of one or more lines of text separated by 
LS (but with no final LS), when they are concatenated, surely LS must be 
added as a separator. Similarly with paragraphs and PS. And this applies 
even when each consists of one line or one paragraph, hence no LS or PS 
in either file. Conclusion: both LS and PS must be added in ANY 
concatenation. Way to avoid this absurd conclusion: redefine LS and PS 
as line and paragraph terminators, to be used at end of file when (as is 
normal) this corresponds to a line or paragraph end.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread jon
   So this legacy encoding of end-of-lines is now quite obsolete
   even on MacOS.
 
  I don't think it can be called obsolete as long as files generated using
  that line end convention exist.  Or, at least, applications that have an
  operation for "read a line" will have to cope with it.  (In other words,
  all of CR, LF, CRLF, and LFCR should mark an end of line.)
 
 I was not speaking about the actual encoding of files into bytes, but
 only about the interpretation of '\n' or '\r' in C/C++, which was the real
 subject of the message.

ISO 14882 says that \n is LF (and also it is newline, i.e. LF is the newline 
function as far as C++ is concerned) and \r is CR.

It does not define this relative to any given character set. So there is 
nothing in the standard to prevent char being interpreted as an implementation-
defined character encoding which is identical to, say US-ASCII or a part of ISO 
8859, except for having CR encoded as 0x0A and LF encoded as 0x0D. This would 
simplify converting newline functions when writing text files on Macs, but 
potentially cause problems elsewhere.

However, because the universal-character-name escapes (\u and \U)
are defined relative to a particular encoding, namely ISO 10646, it would be an
error if ('\n' != '\u000A' || '\r' != '\u000D'). Whether this is implemented by
using the values 0x0A and 0x0D for LF and CR respectively (e.g. by using
US-ASCII or a proper superset of US-ASCII such as Unicode) or by converting
those values to another encoding when parsing isn't specified.

Given that C and C++ are intended to be neutral to encodings, and indeed they 
do not even mandate that a char be an octet, or that a wchar_t be of the same 
size as 2 or 4 chars, this is not surprising. The consequence is that we cannot 
assume that conversion of character, wide character, and string literals to and 
from Unicode will be trivial.



RE: Encoding for Fun (was Line Separator)

2003-10-22 Thread Jill Ramonsky




Well, that was considerably less wrath than I was expecting. Phew!

But to justify a few design decisions - yes, the encoding is longer (in
general) than UTF-8, but UTF-8 only attempts to preserve ASCII. I
needed to preserve ISO-8859-1. The reasons for this are complicated,
but basically I had to find a way to feed a Unicode string (originally
an array of 32-bit integers) into a legacy engine which was designed,
many years previously (by somebody else), to assume that everything in
the world was Latin-1. That legacy engine did ascribe
meaning to the U+A0 to U+FF characters, so I couldn't use them for
anything else. But all I needed it to do with the non-Latin-1-Unicode
characters was preserve them. Essentially, I needed round-trip
compatibility when converting from Unicode to Latin-1 and back. This is
of course impossible ... but the C1 controls weren't being used, so I
made it possible.

Security wasn't an issue, as the encoding never "leaked" into the
outside world, and its spec was never published. If I had wanted to use
it for interchange, I would obviously have further specified that all
characters be stored in the minimum number of bytes. My software didn't
check for violations of this, but only because it didn't need to.

Jill




Re: GDP by language

2003-10-22 Thread Mark Davis
Marco, I certainly wouldn't draw that conclusion. This is not the appropriate
forum for a political or ethical discussion, but equating GDP with "more
important" in any general sense is clearly a huge leap, and one that I certainly
would not make. There is a rough correlation of GDP with "currently has more
money to spend for products", but that is only very, very rough.

And the "currently" is very important; projections are for this chart to change
pretty dramatically over the course of the next 20-50 years. See, for example,
http://www.gs.com/insight/research/reports/99.pdf,
http://www.economist.com/displaystory.cfm?story_id=1632512, and
http://www.economist.com/displaystory.cfm?story_id=1923383.

(It would be pretty interesting to make a dynamic pie chart with pieces
growing/shrinking over the period of some seconds to reflect projected changes
in the future.)

The goal of the chart was different. Many people mistakenly think the potential
customer base of non-English-speakers is smaller than it actually is. The goal
was to graphically illustrate -- in a very general fashion -- that if a product
only works with English, it misses a huge potential market.

Mark
__
http://www.macchiato.com
  

- Original Message - 
From: Marco Cimarosti [EMAIL PROTECTED]
To: 'Mark Davis' [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]
Sent: Wed, 2003 Oct 22 02:17
Subject: RE: GDP by language


 Mark Davis wrote:
  BTW, some time ago I had generated a pie chart of world GDP
  divided up by language.

 Those quotients are immoral.

 Of course, this immorality is not the fault of the person who did the calculation:
 the immorality is out there, and those infamous numbers are just an
 arithmetical expression of it.

 In practice, those quotients say that, e.g., Italian (spoken by 50 million
 people or less) is more important than Hindi (spoken by nearly one billion
 people), just because an average Italian is richer than an average Indian.

 In other words, each Indian (or any other citizen of a poor country) has
 1/20 or less of the linguistic rights of an Italian (or any other citizen
 of a rich country).

 BTW, by summing up languages written with the same script, it is easy to
 derive the immoral quotients of writing systems:

 Latin 59.13%
 Han   20.60%
 Arabic3.82%
 Cyrillic  2.99%
 Devanagari 2.54%
 Hangul1.84%
 Thai  0.87%
 Bengali   0.44%
 Telugu0.42%
 Greek 0.40%
 Tamil 0.34%
 Gujarati  0.26%

 _ Marco






Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread John Cowan
Peter Kirk scripsit:

 But if two files each consist of one or more lines of text separated by 
 LS (but with no final LS), when they are concatenated, surely LS must be 
 added as a separator. Similarly with paragraphs and PS. 

But your protasis is a petitio principii.  Files may or may not consist of
lines of text: a file may contain less than one line.

 Way to avoid this absurd conclusion: redefine LS and PS 
 as line and paragraph terminators, to be used at end of file when (as is 
 normal) this corresponds to a line or paragraph end.

No doubt this is the de facto position.  (The *true* de facto position,
of course, is not to use LS or PS at all.)

-- 
Dream projects long deferredJohn Cowan [EMAIL PROTECTED]
usually bite the wax tadpole.http://www.ccil.org/~cowan
--James Lileks  http://www.reutershealth.com



RE: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread Kent Karlsson

Peter Kirk wrote:

 But if two files each consist of one or more lines of text separated by
 LS (but with no final LS), when they are concatenated, surely LS must be
 added as a separator. Similarly with paragraphs and PS. And this applies
 even when each consists of one line or one paragraph, hence no LS or PS
 in either file. Conclusion: both LS and PS must be added in ANY
 concatenation. Way to avoid this absurd conclusion: redefine LS and PS
 as line and paragraph terminators, to be used at end of file when (as is
 normal) this corresponds to a line or paragraph end.

No, and no.

The first and last lines in a text file may well be partial. If one wants
a PS or LS in-between when concatenating them (assuming they are
of the same encoding), the LS or PS must be explicitly concatenated in.

(The result of reading, line-by-line, first file A then file B is not always
the same as reading, line-by-line, the concatenation of files A and B.
I.e. readline does not distribute over concatenation, if you like that kind
of formulation. Maybe you would like it to, but it doesn't, never has.)

/kent k




RE: Encoding for Fun (was Line Separator)

2003-10-22 Thread Jill Ramonsky
I can't argue with that ... but my strings were always in (32-bit wide)
Unicode at sort-time. I'm not sure exactly how much value there is in a
lexicographical sort anyway. I mean, even in Latin-1, surely 'é' should
not come after 'z'?

Of course, UTF-16 doesn't have the binary sort property either.
Jill
 -Original Message-
 From: John Cowan [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, October 22, 2003 4:32 PM
 To: Jill Ramonsky
 Cc: [EMAIL PROTECTED]
 Subject: Re: Encoding for Fun (was Line Separator)


 UTF-8 has this property too.  This protocol lacks, however, the binary
 sorting property that UTF-8 has.




Re: Encoding for Fun (was Line Separator)

2003-10-22 Thread John Cowan
Jill Ramonsky scripsit:

 The only invented encoding which got any real use was the following 
 (currently nameless) one:

I believe the name UTF-4 is currently unclaimed.  :-)  I like this idea.

 Another interesting feature: starting from a random point in a string, 
 it is easy to scan backwards or forwards to find the start-byte or 
 end-byte of a character. This is valuable, as it means that you don't 
 have to parse a string from the beginning in order not to get lost.

UTF-8 has this property too.  This protocol lacks, however, the binary
sorting property that UTF-8 has.

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
If I have seen farther than others, it is because I was standing on
the shoulders of giants.
--Isaac Newton



Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread Andy Heninger
Unicode UAX 14 (Line Breaking Properties) also has a bit to say on this
topic of line separators.

From http://www.unicode.org/reports/tr14/

BK - Mandatory Break (A) - (normative)

  Explicit breaks act independently of the surrounding characters.

  000C  FORM FEED
        Form Feed separates pages. The text on the new page starts at
        the beginning of the line. No paragraph formatting is applied.

  2028  LINE SEPARATOR
        The text after the Line Separator starts at the beginning of
        the line. No paragraph formatting is applied. This is similar
        to HTML <BR>.

  2029  PARAGRAPH SEPARATOR
        The text of the new paragraph starts at the beginning of the
        line. Paragraph formatting is applied.

NEW LINE FUNCTION (NLF)

  New line functions provide additional explicit breaks.
  They are not individual characters, but are expressed as sequences
  of the control characters NEL, LF, and CR. What particular sequence(s)
  form a NLF depends on the implementation and other circumstances
  as described in [Unicode] Section 5.8, Newline Guidelines.

  If a character sequence for a new line function contains more than
  one character, it is kept together. The default behavior is to break
  after LF or CR, but not between CR and LF. Two additional line
  breaking classes have been added for convenience in this operation.

Mandatory breaks:

  LB 3a  Always break after hard line breaks (but never between CR and LF).

         BK !

  LB 3b  Treat CR followed by LF, as well as CR, LF and NL as hard line breaks.

         CR x LF
         CR !
         LF !
         NL !
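
As a rough illustration of the LB 3a/3b idea above (break after a hard line
break, but never between CR and LF), something like the following over decoded
codepoints; this is a sketch of the rule, not ICU or any published API:

    #include <stddef.h>

    /* Sketch: is there a mandatory line break between text[i-1] and text[i]?
       Break after LF, CR, NEL, FF, LS or PS, but never inside a CR LF pair.
       Call with 1 <= i < length. */
    int is_mandatory_break(const unsigned long *text, size_t i)
    {
        unsigned long prev = text[i - 1];

        if (prev == 0x000D && text[i] == 0x000A)
            return 0;                         /* CR x LF: keep the pair together */
        return prev == 0x000A || prev == 0x000D || prev == 0x0085   /* LF, CR, NEL */
            || prev == 0x000C || prev == 0x2028 || prev == 0x2029;  /* FF, LS, PS  */
    }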


--
  -- Andy Heninger
 [EMAIL PROTECTED]




RE: Encoding for Fun (was Line Separator)

2003-10-22 Thread jon
 I can't argue with that ... but my strings were always in (32-bit wide)
 Unicode at sort-time. I'm not sure exactly how much value there is in a
 lexicographical sort anyway. I mean, even in Latin-1, surely 'é' should
 not come after 'z'?

Not always. In particular there are times when a dependable sort order is
required, but just what that sort order is isn't important. In those cases it
can be useful that UTF-8 and UTF-32 will both do a binary sort with equivalent
results.

 
 Of course, UTF-16 doesn't have the binary sort property either.

Nope, though an efficient mechanism to sort UTF-16 in codepoint order is
available.
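
The mechanism alluded to is, I believe, the usual fix-up that shifts the
surrogate range above the rest of the BMP before comparing code units; a
sketch, assuming well-formed UTF-16 input (names are illustrative only):

    #include <stddef.h>
    #include <stdint.h>

    /* Remap a UTF-16 code unit so that plain unit-by-unit comparison of the
       remapped values yields codepoint order: E000-FFFF drops below the
       surrogates, which move up to F800-FFFF. */
    static uint16_t fixup(uint16_t u)
    {
        if (u >= 0xE000) return (uint16_t)(u - 0x0800);
        if (u >= 0xD800) return (uint16_t)(u + 0x2000);
        return u;
    }

    int cmp_utf16_codepoint_order(const uint16_t *a, size_t na,
                                  const uint16_t *b, size_t nb)
    {
        size_t i, n = na < nb ? na : nb;
        for (i = 0; i < n; i++) {
            uint16_t x = fixup(a[i]), y = fixup(b[i]);
            if (x != y)
                return x < y ? -1 : 1;
        }
        return na < nb ? -1 : (na > nb ? 1 : 0);
    }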



Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread Peter Kirk
On 22/10/2003 08:36, John Cowan wrote:

 Peter Kirk scripsit:

  But if two files each consist of one or more lines of text separated by
  LS (but with no final LS), when they are concatenated, surely LS must be
  added as a separator. Similarly with paragraphs and PS.

 But your protasis is a petitio principii.  Files may or may not consist of
 lines of text: a file may contain less than one line.

  Way to avoid this absurd conclusion: redefine LS and PS
  as line and paragraph terminators, to be used at end of file when (as is
  normal) this corresponds to a line or paragraph end.

 No doubt this is the de facto position.  (The *true* de facto position,
 of course, is not to use LS or PS at all.)


Well, perhaps this needs to be read as disproof by reductio ad absurdum.
I have shown it to be absurd to consider files to consist of one or more
lines of text separated by LS, most obviously because it becomes
impossible to tell whether the last line is intended to be complete or
not. But Kent did imply this model of file structure when he wrote "And
LS it's a separator, not a terminator, so EOF has to be a line terminator."

But according to Kent's latest posting (my emphasis), "The *first* and
last lines in a text file may well be partial." How can one tell, in any
encoding, whether the first line is partial? And it seems that, in a
file where LS is used as a separator not a terminator, EOF is a line
terminator except when it isn't.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Encoding for Fun (was Line Separator)

2003-10-22 Thread John Cowan
Jill Ramonsky scripsit:
 
 I can't argue with that ... but my strings were always in (32-bit wide) 
 Unicode at sort-time. I'm not sure exactly how much value there is a 
 lexicographical sort anyway. I mean, even in Latin-1, surely 'é' should 
 not come after 'z'?

Fair enough.  Another good property that your UTF-4 scheme has is that
8-bit search will work correctly, which is true of UTF-8 as well but not
of UTF-16.

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
I must confess that I have very little notion of what [s. 4 of the British
Trade Marks Act, 1938] is intended to convey, and particularly the sentence
of 253 words, as I make them, which constitutes sub-section 1.  I doubt if
the entire statute book could be successfully searched for a sentence of
equal length which is of more fuliginous obscurity. --MacKinnon LJ, 1940



Re: GDP by language

2003-10-22 Thread Peter Kirk
On 22/10/2003 02:17, Marco Cimarosti wrote:

...

BTW, by summing up languages written with the same script, it is easy to
derive the immoral quotients of writing systems:
Latin   59.13%
Han 20.60%
Arabic   3.82%
Cyrillic 2.99%
Devanagari   2.54%
Hangul   1.84%
Thai 0.87%
Bengali  0.44%
Telugu   0.42%
Greek0.40%
Tamil0.34%
Gujarati 0.26%
_ Marco

 

The data doesn't support addition to this degree of accuracy because of
the effect of the "others" area. Cyrillic may even overtake Arabic,
because there are several countries using the Cyrillic alphabet, but not
Russian or Ukrainian, which might each contribute 0.1-0.2%, but no
countries as far as I know using Arabic script but not Arabic, Persian
or Urdu as official languages (except perhaps Pashto in Afghanistan).
Also of course the GDP data is surely not reliable to sufficient accuracy.

Also you might get a slightly different picture if you add in the 
relatively prosperous users of non-western scripts who have migrated to 
western countries - Hebrew and Armenian as well as south Asian scripts.

As for the morality issue: while we can't do much about the relative 
availability of computers, it is encouraging to see that commercial 
software vendors as well as the open source community are making 
internationalisation packages and localised versions of software 
available, sometimes free to all and sometimes at greatly reduced cost 
in poorer countries. Unicode isn't going to solve inequalities on its 
own, but it can hardly be blamed for contributing to them. In the long 
term, and if other factors allow it, we might even find that the 
computer revolution is the key to breaking down these inequalities.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Public Review Issues - closing Oct 27

2003-10-22 Thread Rick McGowan
This is to remind everyone that several current Unicode Public Review  
Issues  will close for comments on October 27. Please send soon any  
comments that you have not already submitted.

The Public Review Issues page is here:

http://www.unicode.org/review/

The page includes instructions for returning comments for UTC consideration.

Regards,
Rick McGowan
Unicode, Inc.



Re: GDP by language

2003-10-22 Thread Stefan Persson
Peter Kirk wrote:

On 22/10/2003 02:17, Marco Cimarosti wrote:

...

BTW, by summing up languages written with the same script, it is easy to
derive the immoral quotients of writing systems:
Latin 59.13%
Han   20.60%
Arabic 3.82%
Cyrillic   2.99%
Devanagari 2.54%
Hangul 1.84%
Thai   0.87%
Bengali0.44%
Telugu 0.42%
Greek  0.40%
Tamil  0.34%
Gujarati   0.26%
The data doesn't support addition to this degree of accuracy because 
of the effect of the others area. Cyrillic may even overtake Arabic, 
because there are several countries using the Cyrillic alphabet, but 
not Russian or Ukrainian, which might each contribute 0.1-0.2%, but no 
countries as far as I know using Arabic script but not Arabic, Persian 
or Urdu as official languages (except perhaps Pashto in Afghanistan). 
Also of course the GDP data is surely not reliable to sufficient 
accuracy.
Don't forget to take into account that Latin and Greek letters are used in
most languages, e.g. as part of mathematical formulae.

Stefan




Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread Jonathan Coxhead
   On 22 Oct 2003, at 6:53, John Cowan wrote:

 Kent Karlsson scripsit:
 
   All of CR, LF, CR+LF, NEL, LS, PS, and EOF(!). (Assuming that the
  encoding of the text file is recognised.)
 
  XML 1.0 treats CR, LF, and CR+LF as line terminators and reports
 them as LF.
 
  XML 1.1 will treat CR, LF, NEL, CR+LF, CR+NEL, and LS as line
 terminators and report them all as LF.  PS is left alone, because of
 the bare possibility that it is being used as quasi-markup.
 
 I can't imagine why EOF should be called a line terminator, except
  in the sense that a "read a line" operation should obviously not attempt
 to read past EOF.  Calling it a line terminator means that every
 document is forced into the mold of being an integral number of lines
 long, regardless of the facts.
 
   Don't know about LF+CR. I think that should be two line ends.
 
 I agree.  I don't know any system that uses this sequence.

   The BBC Micro---well-known to a generation of British schoolchildren---used 
this sequence. You can probably find files in that encoding on some 5.25in 
floppies in DFS format in some store cupboards somewhere (for what that's 
worth).

   I wrote a little line-conversion (f)utility recently, and the (minimal) 
research I did suggested that the following was a complete set of line-
terminators that might be found in practice:

  CRLF
  CRFF
  CRVT
  LF
  FF
  VT
  CR
  LFCR
  NEL
  CRCRLF
  NUL
  end of file (not control-D or control-Z, I mean the real end-of-file)

   CRLF is derived from standard printer technology. CRFF and CRVT are how you 
would get the printer to move by more than a line.

   More recent practice allows LF, FF or VT to be used solo. If sent directly 
to a printer they still terminate the line, though the printed output would 
look different since the carriage would not return.

   CR is from MacOS, LFCR is from the BBC Micro. NEL is a dedicated character 
with the right meaning.

   CRCRLF is generated by some buggy software I have to put up with. And I 
can't remember why I wanted to allow NUL. I probably reasoned that (in its C 
role as end of string) it must terminate a line, just as EOF does.

   This is all for Latin-1 only. Obviously, it's pretty idiosyncratic, but it 
looks like I missed at least CRNEL---any others?

   I think someone mentioned the IND (index) character recently in the 
context of line-breaking. I'd like to ask, what is its intended function?

/|
 o o o (_|/
/|
   (_/



Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread Frank da Cruz
Jonathan Coxhead [EMAIL PROTECTED] wrote:
 On 22 Oct 2003, at 6:53, John Cowan wrote:
  Kent Karlsson scripsit:
  
    Don't know about LF+CR. I think that should be two line ends.
  
  I agree.  I don't know any system that uses this sequence.
 
 The BBC Micro---well-known to a generation of British schoolchildren---used 
 this sequence. You can probably find files in that encoding on some 5.25in 
 floppies in DFS format in some store cupboards somewhere (for what that's 
 worth).
 
Also the PRIME computers of the 1970s and 80s.  If you remember an online
service called The Source (similar to Compuserve, but different), it ran on
big PRIMEs.  File transfer protocols (such as Kermit) that were used to get
text files into and out of The Source swapped LFCR to CRLF (and stripped the
8th bit from its native negative ASCII / Mark parity encoding).

LFCR makes historical sense if you think about how manual typewriters work.
When you push the carriage return lever (did you ever wonder where the
name Carriage Return came from?), the platen rolls up one line immediately
(LF), and then as you keep pushing it, the carriage returns to the left
margin (CR).  See:

  http://xavier.xu.edu:8000/~polt/tw-parts.html

(hey, I never saw a left-handed typewriter before...)

 I wrote a little line-conversion (f)utility recently, and the (minimal) 
 research I did suggested that the following was a complete set of line-
 terminators that might be found in practice:
 
You can't really tell by inspection what any of these sequences is supposed
to do, without knowing where and how the file was created.  Is CR a line
terminator, or a paragraph separator, or is it being used for overstriking
(a common method of underlining)?

Treating EOF as EOL is dangerous.  In Unix, many applications flag an error
when a text file does not end with a line terminator, since it might mean
the file is incomplete.  This is in contrast to the Windows practice of
auto-detecting-and-correcting-and-accepting everything, on the assumption
that users can't possibly know what they are doing.

Another interesting scheme is used in VMS text files of a certain format:
each line begins with LF and ends with CR.

- Frank



Preliminary minutes from UTC 96 (August 2003) posted publicly

2003-10-22 Thread Magda Danish \(Unicode\)



The preliminary minutes from UTC 96 (August 2003) have been posted for
public access at
http://www.unicode.org/consortium/utc-minutes.html

Magda Danish
Administrative Director
The Unicode Consortium
650-693-3921



Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread Philippe Verdy
From: [EMAIL PROTECTED]
 However, because the universal-character-name escapes (\u and \U)
 are defined relative to a particular encoding, namely ISO 10646, it would
 be an error if ('\n' != '\u000A' || '\r' != '\u000D'). Whether this is
 implemented by using the values 0x0A and 0x0D for LF and CR respectively
 (e.g. by using US-ASCII or a proper superset of US-ASCII such as Unicode)
 or by converting those values to another encoding when parsing isn't
 specified.

You're wrong here:
Neither Unicode nor ISO specifies that the source constants '\n' or '\r', which
are made with an escaping mechanism of _multiple_ distinct characters
specific to some programming languages, must be bound at compile-time or
run-time to a LF or CR character.

The '\n' and '\r' conventions are specific to each language, and C/C++ use
conventions distinct from those in Java for example... This is not an
encoding issue, but a language feature.

In C or C++, if you want to be sure that your program will be portable when
you need to specify LF or CR exclusively, you MUST NOT use the '\n' and '\r'
constants but instead the numeric escapes in strings (i.e. "\012" or "\x0A"
for LF, and "\015" or "\x0D" for CR), or simply the integer constants for
the char, int, or wchar_t datatypes (i.e. 10 or 012 or 0x0A for LF, and 13
or 015 or 0x0D for CR), and make sure that your run-time library will map
these values correctly with your run-time locale or system environment (you
may need to specify file-open flags to control this mapping, such as the "t"
flag for fopen function calls).

So a test like if ('\n' == 10) may or may not be true in C/C++, depending
on the compiler implementation (but not on the system platform...), and the
same test in Java will always be true...
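
A minimal sketch of that kind of check (it only reports what one particular
implementation does; nothing here is guaranteed by the standard):

    #include <stdio.h>

    /* Print the execution-character-set values this compiler assigned to the
       '\n' and '\r' escapes.  An ASCII-mapped implementation prints 10 and 13;
       MacOS Classic compilers or EBCDIC systems may print something else. */
    int main(void)
    {
        printf("'\\n' = %d, '\\r' = %d\n", '\n', '\r');
        if ('\n' != 10 || '\r' != 13)
            printf("not an ASCII-mapped execution character set\n");
        return 0;
    }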




Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread Elliotte Rusty Harold
At 11:56 AM -0700 10/22/03, Jonathan Coxhead wrote:

  Don't know about LF+CR. I think that should be two line ends.

 I agree.  I don't know any system that uses this sequence.
   The BBC Micro---well-known to a generation of British schoolchildren---used
this sequence. You can probably find files in that encoding on some 5.25in
floppies in DFS format in some store cupboards somewhere (for what that's
worth).
My God! I had no idea. Those poor British school children who can't 
write XML on their BBC micros! Clearly we must allow LFCR as a legal 
line ending in XML 1.1.  It's a matter of justice!

(Tongue firmly in cheek.)

--

  Elliotte Rusty Harold
  [EMAIL PROTECTED]
  Processing XML with Java (Addison-Wesley, 2002)
  http://www.cafeconleche.org/books/xmljava
  http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA