[unicode] Re: UCS-2 Files

2001-03-27 Thread Pierpaolo BERNARDI


On Fri, 23 Mar 2001, Roozbeh Pournader wrote:

 (And for those Knuth fans out there, yes I know that he has introduced
 MMIX now, that even adopts Unicode in the identifiers and strings of its
 assembly language, MMIXAL, 

But, if I remember correctly, he only allows the 1st plane.
So, even Knuth fell prey to the 'Unicode is 16 bit' myth.

P.





[unicode] Re: UCS-2 Files

2001-03-23 Thread Otto Stolz

On 2001-03-22 at 14:31 CET, Tomas McGuinness wrote:
 I am currently developing a product that will support UCS-2

For a new project, it would be better to support UTF-16, rather
than UCS-2, from the very beginning. There are already characters
accepted for standardization that cannot be encoded in UCS-2.
Cf.
- http://www.unicode.org/unicode/faq/,
- http://www.unicode.org/unicode/faq/utf_bom.html#5,
- http://www.unicode.org/unicode/alloc/Pipeline.html#Characters and
  Scripts Accepted for Unicode.
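
To illustrate the difference, here is a minimal Python sketch of how UTF-16
represents a code point above U+FFFF as a surrogate pair, which plain UCS-2
cannot do. The function name to_utf16_units is hypothetical and not taken
from any library.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # Fits in one 16-bit unit; this is all that UCS-2 can express.
        return [code_point]
    # Supplementary code point: split the 20-bit offset into two halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


print([hex(u) for u in to_utf16_units(0x0041)])
print([hex(u) for u in to_utf16_units(0x10400)])
```

A reader that treats every 16-bit unit as a complete character would see the
two surrogates as two separate, meaningless characters.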

Best wishes,
  Otto Stolz




[unicode] Re: UCS-2 Files

2001-03-23 Thread Carl W. Brown

Jeff,

A byte is the least addressable portion of memory.  The IBM 1401 for example
has 6 bit bytes + a word mark.  Parity bits don't count.  A lot of systems
in the 50's and early 60's had 6 bit bytes.  That is why octal became so
popular.

Bytes were not used for systems like the IBM 1620 which was a scientific
system.  Memory was an array of number registers and was not character
based.  Instead the least addressable memory unit was a word.

A byte may be 8 bits now but it was not always 8 bits.

Carl











[unicode] Re: UCS-2 Files

2001-03-23 Thread Carl W. Brown

Marco,

I find that people often understand it better when you get away from bytes,
octets etc. and describe Unicode strings as an array of unsigned short (16
bit unsigned integers) in the same manner as single byte characters are an
array of 8 bit integers.  This way the only time you have to deal with
endian issues is when you deal with the memory or transmission layout of the
data.  This also helps when you get into null terminated strings.  You
cannot terminate a Unicode string with a single null byte; the terminator
has to be a full 16 bit unit.
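
A rough Python sketch of this view, assuming little-endian storage; the
helper names utf16_units and read_terminated are hypothetical. It shows why
a scan for a single zero byte stops too early, while a scan for a 16-bit
zero unit does not.

```python
import struct


def utf16_units(text):
    """Return the 16-bit code units of text as a tuple of unsigned integers."""
    data = text.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(data) // 2), data)


def read_terminated(units):
    """Collect units up to (not including) the first 16-bit zero terminator."""
    out = []
    for unit in units:
        if unit == 0x0000:
            break
        out.append(unit)
    return out


raw = "AB".encode("utf-16-le")            # octets 41 00 42 00
print(raw.find(b"\x00"))                  # a byte-wise scan finds a "null" inside 'A'
print(read_terminated(utf16_units("AB") + (0x0000,)))
```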

Carl






[unicode] Re: UCS-2 Files

2001-03-23 Thread J M Sykes


 A byte may be 8 bits now but it was not always 8 bits.

Au contraire!

It was the designers of System/360 who invented the word "byte" to mean the
smallest addressable unit of storage, in their case 8 bits. It is others who
have appropriated the word for their own purposes, as has happened with so
many words since language was invented.

Remember Humpty Dumpty!

Mike.







[unicode] Re: UCS-2 Files

2001-03-22 Thread Michael (michka) Kaplan


This is known as a byte order mark or BOM. It can be used to determine several
things:

1) That it is a Unicode file
2) The byte order of the file (little endian or big endian)
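
As a small illustration in Python (a sketch, not a complete reader): the
same text written in either byte order decodes to the same string once the
BOM is taken into account. The standard "utf-16" codec reads the BOM and
removes it; the helper name decode_with_bom is hypothetical.

```python
import codecs

text = "Hi"
# The same text serialized in both byte orders, each preceded by its BOM.
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")   # starts FF FE
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")      # starts FE FF


def decode_with_bom(data):
    """Decode UTF-16 data, letting the codec pick the byte order from the BOM."""
    return data.decode("utf-16")


print(decode_with_bom(little) == decode_with_bom(big))
```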

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

- Original Message -
From: "Tomas McGuinness" [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, March 22, 2001 5:31 AM
Subject: [unicode] UCS-2 Files



 Hi,

 I have a question relating to UCS-2. I am currently developing a product
 that will support UCS-2 and I have been sent several documents encoded in
 UCS-2. I have no reader or writer for UCS-2 but I have performed Hexdumps
 in UNIX. At the beginning of the UCS-2 characters there are two rogue
 characters 0xFF and 0xFE. Have these characters any importance?

 thanks in advance,

 Tom McGuinness







[unicode] Re: UCS-2 Files

2001-03-22 Thread Marco Cimarosti


Tomas McGuinness wrote:
 I have a question relating to UCS-2. I am currently 
 developing a product
 that will support UCS-2 and I have been sent several 
 documents encoded in
 UCS-2. I have no reader or writer for UCS-2 but I have 
 performed Hexdumps in
 UNIX. At the beginning of the UCS-2 characters there are two rogue
 characters 0xFF and 0xFE. Have these characters any importance?

They are quite important, yes. See
http://www.unicode.org/unicode/faq/utf_bom.html#24 for details.

But, beware that they are NOT characters: they are OCTETS (also known as
"bytes")!

The first thing I'd suggest when you start working with Unicode and other
character sets is to carefully separate the terms "byte" and "character".
Better if you also keep the distinction between "octet" (a series of 8 bits)
and "byte" (a series of n bits, where n is often but NOT always 8).

In brief, those two octets tell you that:

1.  It is a Unicode text file.

2.  It is in format UCS-2, UTF-16, or UTF-32 (to determine whether it is
UTF-32 you need to read the next two octets: if they are 0x00 0x00, then it
is UTF-32. Else it is either UCS-2 or UTF-16, which basically you don't need
to distinguish).

3.  The 16-bit units are little endian, so you have to interpret these
two octets as (0xFF + 0xFE * 256), which yields 0xFEFF, the code of the
"BOM".

4.  All subsequent pairs of octets a,b are interpreted the same way: (a
+ b * 256).
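
The four steps above can be sketched in Python roughly as follows. This is
only an illustration under the assumptions of this message (a file that
begins with 0xFF 0xFE); the names sniff_bom and little_endian_units are
hypothetical, and the sample octets are made up for the example.

```python
def sniff_bom(data):
    """Classify data by its leading octets, following steps 1 and 2."""
    if data[:4] == b"\xff\xfe\x00\x00":
        return "UTF-32, little endian"
    if data[:2] == b"\xff\xfe":
        return "UCS-2 or UTF-16, little endian"
    return "no little-endian BOM found"


def little_endian_units(data):
    """Combine each pair of octets a, b into the 16-bit unit a + b * 256."""
    if len(data) % 2:
        raise ValueError("odd number of octets")
    return [data[i] + data[i + 1] * 256 for i in range(0, len(data), 2)]


sample = b"\xff\xfe\x48\x00\x69\x00"      # BOM followed by two 16-bit units
units = little_endian_units(sample)
print(sniff_bom(sample))
print(hex(units[0]))                       # the BOM itself, 0xfeff
print("".join(chr(u) for u in units[1:]))  # the text after the BOM
```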

Regards.
_ Marco




[unicode] Re: UCS-2 Files

2001-03-22 Thread Gaute B Strokkenes


On Thu, 22 Mar 2001, [EMAIL PROTECTED] wrote:

 Better if you also keep the distinction between "octet" (a series of
 8 bits) and "byte" (a series of n bits, where n is often but NOT
 always 8).

When is a byte not eight bits?

-- 
Gaute Strokkenes  http://www.srcf.ucam.org/~gs234/
PEGGY FLEMING is stealing BASKET BALLS to feed the babies in VERMONT.




[unicode] Re: UCS-2 Files

2001-03-22 Thread Jeff Guevin

 
 On Thu, 22 Mar 2001, [EMAIL PROTECTED] wrote:
 
 Better if you also keep the distinction between "octet" (a series of
 8 bits) and "byte" (a series of n bits, where n is often but NOT
 always 8).
 
 When is a byte not eight bits?
 

The Web version of the Oxford English Dictionary (http://dictionary.oed.com)
says a byte is always eight bits:

"A group of eight consecutive bits operated on as a unit in a computer."

1964 BLAAUW & BROOKS in IBM Systems Jrnl. III. 122 An 8-bit unit of
information is fundamental to most of the formats [of the System/360]. A
consecutive group of n such units constitutes a field of length n.
Fixed-length fields of length one, two, four, and eight are termed bytes,
halfwords, words, and double words respectively. 1964 IBM Jrnl. Res. &
Developm. VIII. 97/1 When a byte of data appears from an I/O device, the CPU
is seized, dumped, used and restored. 1967 P. A. STARK Digital Computer
Programming xix. 351 The normal operations in fixed point are done on four
bytes at a time. 1968 Dataweek 24 Jan. 1/1 Tape reading and writing is at
from 34,160 to 192,000 bytes per second.

 -- 
 Gaute Strokkenes  http://www.srcf.ucam.org/~gs234/
 PEGGY FLEMING is stealing BASKET BALLS to feed the babies in VERMONT.
 
 





[unicode] Re: UCS-2 Files

2001-03-22 Thread Frank da Cruz

On Thu, 22 Mar 2001 15:00:55 -0500, Jeff Guevin [EMAIL PROTECTED] wrote:
  On Thu, 22 Mar 2001, [EMAIL PROTECTED] wrote:
  Better if you also keep the distinction between "octet" (a series of
  8 bits) and "byte" (a series of n bits, where n is often but NOT
  always 8).
  
  When is a byte not eight bits?
 
 The Web version of the Oxford English Dictionary (http://dictionary.oed.com)
 says a byte is always eight bits:
 
I hate to quibble with the OED, but its definition is overly IBM-360-centric.
I'm sure some of you remember the 36-bit-word machines of the 1950s through
80s (and I believe at least one is still in production to this day -- the
heir to the Sperry 1100 line).  Many of these machines had bytes of non-8
sizes, and some had variable-length bytes.  On the PDP-6 and PDP-10, for
example, bytes were commonly 7 bits long, but could be (and often were) any
size between 1 and 36 bits.
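
For readers who have never seen variable-size bytes, here is a loose Python
model of the idea: a "byte" is just a bit field of a chosen size at a chosen
position inside a 36-bit word. The function names are hypothetical, and the
layout used in pack_ascii5 (five 7-bit characters, left-justified, with the
lowest bit unused) is stated here as an assumption for the example rather
than something taken from this message.

```python
WORD_BITS = 36


def load_byte(word, position, size):
    """Extract a size-bit field that has `position` bits to its right."""
    if not (1 <= size <= WORD_BITS and 0 <= position <= WORD_BITS - size):
        raise ValueError("field does not fit in a 36-bit word")
    return (word >> position) & ((1 << size) - 1)


def pack_ascii5(text):
    """Pack exactly five 7-bit characters into one 36-bit word."""
    if len(text) != 5:
        raise ValueError("expected exactly five characters")
    word = 0
    for ch in text:
        word = (word << 7) | (ord(ch) & 0x7F)
    return word << 1   # leave the lowest bit of the word unused


word = pack_ascii5("HELLO")
# Read the word back as five 7-bit bytes, leftmost first.
print("".join(chr(load_byte(word, WORD_BITS - 7 * (i + 1), 7)) for i in range(5)))
# The same word can equally be read as six 6-bit bytes.
print([load_byte(word, WORD_BITS - 6 * (i + 1), 6) for i in range(6)])
```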

I mention this only because these influential but by now largely forgotten
machines are poised to make a comeback, thanks to *several* PDP-10 emulators
that were released in recent weeks.

For more about one of the 36-bit machines, the DECSYSTEM-20, see:

  http://www.columbia.edu/kermit/dec20.html

and see the links at the bottom if you want to find the emulators.  For more
about some of the other 36-bit machines (such as the IBM 701 and its
descendants), see:

  http://www.columbia.edu/~fdc/timeline.html

None of this has a thing to do with Unicode, so anybody who finds this
interesting is encouraged to "turn on the wayback machine" at:

  news:alt.sys.pdp10

- Frank





[unicode] Re: UCS-2 Files

2001-03-22 Thread Roozbeh Pournader



On Thu, 22 Mar 2001, Jeff Guevin wrote:

 The Web version of the Oxford English Dictionary (http://dictionary.oed.com)
 says a byte is always eight bits:
 [...]

There is at least one computer currently in use whose bytes are 6 bits! :)
It's the MIX machine by Donald Knuth, which you can find out about in The Art
of Computer Programming, Volume 1, Fundamental Algorithms. I can't find
the exact wording, since I don't have a copy here. It should be at the
very beginning of Section 1.2 (or 1.3?).

(And for those Knuth fans out there, yes I know that he has introduced
MMIX now, that even adopts Unicode in the identifiers and strings of its
assembly language, MMIXAL, but that came after the publication of the latest
editions of the first three volumes, so all examples in the books are still
in MIX. It's not official yet!)

--roozbeh





[unicode] Re: UCS-2 Files

2001-03-22 Thread jgo


 On Thu, 2001-03-22, marco.cimarosti wrote:
 Better if you also keep the distinction between "octet" (a series of
 8 bits) and "byte" (a series of n bits, where n is often but NOT
 always 8).

 When is a byte not eight bits?

When it is 6 bits or 12 bits or 16 bits or 18 bits...

 The Web version of the Oxford English Dictionary
 (http://dictionary.oed.com) says a byte is always eight bits:

 "A group of eight consecutive bits operated on as a unit in a computer."
 
 1964 BLAAUW & BROOKS in IBM Systems Jrnl. III. 122 An 8-bit unit of
 information is fundamental to most of the formats [of the System/360]...

That's what you (& OED) get for relying on Ill-Begotten Monstrosities
for your definitions.

To those maroons a queue is a "spool" because their systems were
so primitive they had to put all files to be printed onto mag-tape,
then have the operator physically move the mag-tape to a drive on
another system that could do printing.

On Control Data Cybers (designed by Seymour Cray) a byte was either
6 bits or 12 bits, depending on context, and the fundamental definition
of byte was the number of bits needed to represent a character.

It's only been recently that people have resorted to the clearer,
less baggage-perverted term "octet" to be exactly 8 bits regardless
of the system, and then speak of the number of octets needed to
represent something (e.g. an IP address or a character).

John G. Otto Nisus Software, Engineering
www.infoclick.com  www.mathhelp.com  www.nisus.com  software4usa.com
EasyAlarms  PowerSleuth  NisusEMail  NisusWriter  MailKeeper  QUED/M
   My opinions are probably not those of Nisus Software, Inc.






[unicode] Re: UCS-2 Files

2001-03-22 Thread Mike Brown

  When is a byte not eight bits?
 
 The Web version of the Oxford English Dictionary 
 (http://dictionary.oed.com)
 says a byte is always eight bits:

Well, just my cursory research shows that to be an overstatement.

http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?query=byte says:

A byte may be 9 bits on 36-bit computers. Some older
architectures used "byte" for quantities of 6 or 7
bits, and the PDP-10 and IBM 7030 supported "bytes"
that were actually bit-fields of 1 to 36 (or 64)
bits! These usages are now obsolete [...]

However, it is not difficult to find character encodings that are defined in
terms of, or that refer to, 7-bit "bytes" -- ASCII [1] and ISO-2022-JP [2]
being examples thereof.

ISO 2022 [3] in fact defines a byte as "a bit string that is operated upon
as a unit" and goes on to say "A graphic character shall have a coded
representation comprising one or more 8-bit combinations (bytes) in an 8-bit
code, and one or more 7-bit combinations (bytes) in a 7-bit code. Within a
coded graphic character set each character shall be represented by the same
number of such bit combinations."

So you can see that "octets" is the preferable term when referring to units
comprised of exactly 8 bits.

 [1] ftp://ftp.ecma.ch/ecma-st/Ecma-006.pdf (close enough)
 [2] http://www.faqs.org/rfcs/rfc1468.html
 [3] ftp://ftp.ecma.ch/ecma-st/Ecma-035.pdf (close enough)




[unicode] Re: UCS-2 Files

2001-03-22 Thread Peter_Constable


On 03/22/2001 12:09:09 PM unicode-bounce wrote:

When is a byte not eight bits?

When it's seven or less, and when it's nine or more. For some, the
definition of byte allows such possibilities. This is reflected in the fact
that ISO uses the term "octet" where you would use "byte".



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





