[unicode] Re: UCS-2 Files
On Fri, 23 Mar 2001, Roozbeh Pournader wrote: (And for those Knuth fans out there, yes I know that he has introduced MMIX now, that even adopts Unicode in the identifiers and strings of its assembly language, MMIXAL, But, if I remember correctly, he only allows the 1st plane. So, even Knuth fell prey to the 'Unicode is 16 bit' myth. P.
[unicode] Re: UCS-2 Files
On 2001-03-22 at 14:31 CET, Tomas McGuinness wrote: I am currently developing a product that will support UCS-2 [...] For a new project, it would be better to support UTF-16, rather than UCS-2, from the very beginning. There are already characters accepted for standardization that cannot be encoded in UCS-2. Cf. - http://www.unicode.org/unicode/faq/, - http://www.unicode.org/unicode/faq/utf_bom.html#5, - http://www.unicode.org/unicode/alloc/Pipeline.html#Characters and Scripts Accepted for Unicode. Best wishes, Otto Stolz
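To make Otto's point concrete, here is a minimal Python sketch of how UTF-16 represents a code point above U+FFFF as a pair of 16-bit surrogates, something UCS-2 has no way to express. The function name `utf16_code_units` is hypothetical and chosen only for this illustration.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units (16-bit integers) for one code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0xFFFF:
        # Basic Multilingual Plane: a single 16-bit unit, identical to UCS-2.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two surrogates.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]
```

For example, `utf16_code_units(0x1D11E)` yields the two units 0xD834 and 0xDD1E, while any code point up to U+FFFF (outside the surrogate range) comes back as a single unit.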
[unicode] Re: UCS-2 Files
Jeff, A byte is the least addressable portion of memory. The IBM 1401 for example has 6 bit bytes + a word mark. Parity bits don't count. A lot of systems in the 50's and early 60's had 6 bit bytes. That is why octal became so popular. Bytes were not used for systems like the IBM 1620 which was a scientific system. Memory was an array of number registers and was not character based. Instead the least addressable memory unit was a word. A byte may be 8 bits now but it was not always 8 bits. Carl -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Jeff Guevin Sent: Thursday, March 22, 2001 12:01 PM To: [EMAIL PROTECTED] Subject: [unicode] Re: UCS-2 Files On Thu, 22 Mar 2001, [EMAIL PROTECTED] wrote: Better if you also keep the distinction between "octet" (a series of 8 bits) and "byte" (a series of n bits, where n is often but NOT always 8). When is a byte not eight bits? The Web version of the Oxford English Dictionary (http://dictionary.oed.com) says a byte is always eight bits: "A group of eight consecutive bits operated on as a unit in a computer." 1964 BLAAUW & BROOKS in IBM Systems Jrnl. III. 122 An 8-bit unit of information is fundamental to most of the formats [of the System/360]. A consecutive group of n such units constitutes a field of length n. Fixed-length fields of length one, two, four, and eight are termed bytes, halfwords, words, and double words respectively. 1964 IBM Jrnl. Res. Developm. VIII. 97/1 When a byte of data appears from an I/O device, the CPU is seized, dumped, used and restored. 1967 P. A. STARK Digital Computer Programming xix. 351 The normal operations in fixed point are done on four bytes at a time. 1968 Dataweek 24 Jan. 1/1 Tape reading and writing is at from 34,160 to 192,000 bytes per second. -- Gaute Strokkenes http://www.srcf.ucam.org/~gs234/ PEGGY FLEMING is stealing BASKET BALLS to feed the babies in VERMONT.
[unicode] Re: UCS-2 Files
Marco, I find that people often understand it better when you get away from bytes, octets etc. and describe Unicode strings as an array of unsigned short (16 bit unsigned integers) in the same manner as single byte characters are an array of 8 bit integers. This way the only time you have to deal with endian issues is when you deal with the memory or transmission layout of the data. This also helps when you get into null terminated strings. You cannot terminate a Unicode string with a byte null; it has to be a full 16 bit character. Carl -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Marco Cimarosti Sent: Thursday, March 22, 2001 7:03 AM To: 'Tomas McGuinness'; [EMAIL PROTECTED] Subject: [unicode] Re: UCS-2 Files Tomas McGuinness wrote: I have a question relating to UCS-2. I am currently developing a product that will support UCS-2 and I have been sent several documents encoded in UCS-2. I have no reader or writer for UCS-2 but I have performed Hexdumps in UNIX. At the beginning of the UCS-2 characters there are two rogue characters 0xFF and 0xFE. Have these characters any importance? They are quite important, yes. See http://www.unicode.org/unicode/faq/utf_bom.html#24 for details. But, beware that they are NOT characters: they are OCTETS (also known as "bytes")! The first thing that I'd suggest you do when starting to work with Unicode and other character sets is to carefully separate the terms "byte" and "character". Better if you also keep the distinction between "octet" (a series of 8 bits) and "byte" (a series of n bits, where n is often but NOT always 8). In brief, those two octets tell you that: 1. It is a Unicode text file. 2. It is in format UCS-2, UTF-16, or UTF-32 (to determine whether it is UTF-32 you need to read the next two octets: if they are 0x00 0x00, then it is UTF-32. Else it is either UCS-2 or UTF-16, which basically you don't need to distinguish). 3. The 16-bit units are little endian, so you have to interpret these two octets as (0xFF + 0xFE * 256), which yields 0xFEFF, the code of the "BOM". 4. All subsequent pairs of octets a,b are interpreted the same way: (a + b * 256). Regards. _ Marco
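Carl's suggestion of treating the text as an array of 16-bit unsigned integers, terminated by a full 16-bit zero rather than a zero octet, could be sketched in Python roughly as follows. The function name `read_terminated_units` and its `little_endian` parameter are assumptions made for this illustration, not part of any library.

```python
import struct


def read_terminated_units(data: bytes, little_endian: bool = True) -> list[int]:
    """Read 16-bit units up to (not including) the first 16-bit zero unit."""
    fmt = "<H" if little_endian else ">H"
    units = []
    # Step two octets at a time; a trailing odd octet is ignored.
    for offset in range(0, len(data) - 1, 2):
        (unit,) = struct.unpack_from(fmt, data, offset)
        if unit == 0:
            break  # only a whole 16-bit zero terminates the string
        units.append(unit)
    return units
```

With the little-endian octets `b"A\x00B\x00\x00\x00"`, a search for the first zero octet would stop at index 1, in the middle of the letter "A", whereas scanning 16-bit units returns the two units 0x41 and 0x42 before reaching the terminator.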
[unicode] Re: UCS-2 Files
A byte may be 8 bits now but it was not always 8 bits. Au contraire! It was the designers of System/360 who invented the word "byte" to mean the smallest addressable unit of storage, in their case 8 bits. It is others who have appropriated the word for their own purposes, as has happened with so many words since language was invented. Remember Humpty Dumpty! Mike.
[unicode] Re: UCS-2 Files
This is known as a byte order mark or BOM. It can be used to determine several things: 1) That it is a Unicode file 2) The byte order of the file (little endian or big endian) MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/ ----- Original Message ----- From: "Tomas McGuinness" [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, March 22, 2001 5:31 AM Subject: [unicode] UCS-2 Files Hi, I have a question relating to UCS-2. I am currently developing a product that will support UCS-2 and I have been sent several documents encoded in UCS-2. I have no reader or writer for UCS-2 but I have performed Hexdumps in UNIX. At the beginning of the UCS-2 characters there are two rogue characters 0xFF and 0xFE. Have these characters any importance? thanks in advance, Tom McGuinness
[unicode] Re: UCS-2 Files
Tomas McGuinness wrote: I have a question relating to UCS-2. I am currently developing a product that will support UCS-2 and I have been sent several documents encoded in UCS-2. I have no reader or writer for UCS-2 but I have performed Hexdumps in UNIX. At the beginning of the UCS-2 characters there are two rogue characters 0xFF and 0xFE. Have these characters any importance? They are quite important, yes. See http://www.unicode.org/unicode/faq/utf_bom.html#24 for details. But, beware that they are NOT characters: they are OCTETS (also known as "bytes")! The first thing that I'd suggest you do when starting to work with Unicode and other character sets is to carefully separate the terms "byte" and "character". Better if you also keep the distinction between "octet" (a series of 8 bits) and "byte" (a series of n bits, where n is often but NOT always 8). In brief, those two octets tell you that: 1. It is a Unicode text file. 2. It is in format UCS-2, UTF-16, or UTF-32 (to determine whether it is UTF-32 you need to read the next two octets: if they are 0x00 0x00, then it is UTF-32. Else it is either UCS-2 or UTF-16, which basically you don't need to distinguish). 3. The 16-bit units are little endian, so you have to interpret these two octets as (0xFF + 0xFE * 256), which yields 0xFEFF, the code of the "BOM". 4. All subsequent pairs of octets a,b are interpreted the same way: (a + b * 256). Regards. _ Marco
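The four steps Marco lists can be sketched in Python along these lines. The function names `sniff_bom` and `decode_units_le` are hypothetical, and the sketch follows Marco's simplification that FF FE followed by 00 00 means UTF-32; strictly, those four octets could also be little-endian UTF-16 text whose first character after the BOM is U+0000.

```python
def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length) guessed from a leading BOM, or (None, 0)."""
    # Test the 4-octet UTF-32 marks first: FF FE 00 00 also begins with FF FE.
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le", 4
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be", 4
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le", 2
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be", 2
    return None, 0


def decode_units_le(data: bytes) -> list[int]:
    """Combine each octet pair a, b as a + b * 256 (little endian)."""
    return [data[i] + data[i + 1] * 256 for i in range(0, len(data) - 1, 2)]
```

Applied to the octets FF FE 48 00 69 00, `sniff_bom` reports little-endian UTF-16 with a two-octet mark, and `decode_units_le` gives the units 0xFEFF, 0x48, 0x69: the BOM followed by "H" and "i".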
[unicode] Re: UCS-2 Files
On Thu, 22 Mar 2001, [EMAIL PROTECTED] wrote: Better if you also keep the distinction between "octet" (a series of 8 bits) and "byte" (a series of n bits, where n is often but NOT always 8). When is a byte not eight bits? -- Gaute Strokkenes http://www.srcf.ucam.org/~gs234/ PEGGY FLEMING is stealing BASKET BALLS to feed the babies in VERMONT.
[unicode] Re: UCS-2 Files
On Thu, 22 Mar 2001, [EMAIL PROTECTED] wrote: Better if you also keep the distinction between "octet" (a series of 8 bits) and "byte" (a series of n bits, where n is often but NOT always 8). When is a byte not eight bits? The Web version of the Oxford English Dictionary (http://dictionary.oed.com) says a byte is always eight bits: "A group of eight consecutive bits operated on as a unit in a computer." 1964 BLAAUW & BROOKS in IBM Systems Jrnl. III. 122 An 8-bit unit of information is fundamental to most of the formats [of the System/360]. A consecutive group of n such units constitutes a field of length n. Fixed-length fields of length one, two, four, and eight are termed bytes, halfwords, words, and double words respectively. 1964 IBM Jrnl. Res. Developm. VIII. 97/1 When a byte of data appears from an I/O device, the CPU is seized, dumped, used and restored. 1967 P. A. STARK Digital Computer Programming xix. 351 The normal operations in fixed point are done on four bytes at a time. 1968 Dataweek 24 Jan. 1/1 Tape reading and writing is at from 34,160 to 192,000 bytes per second. -- Gaute Strokkenes http://www.srcf.ucam.org/~gs234/ PEGGY FLEMING is stealing BASKET BALLS to feed the babies in VERMONT.
[unicode] Re: UCS-2 Files
On Thu, 22 Mar 2001 15:00:55 -0500, Jeff Guevin [EMAIL PROTECTED] wrote: On Thu, 22 Mar 2001, [EMAIL PROTECTED] wrote: Better if you also keep the distinction between "octet" (a series of 8 bits) and "byte" (a series of n bits, where n is often but NOT always 8). When is a byte not eight bits? The Web version of the Oxford English Dictionary (http://dictionary.oed.com) says a byte is always eight bits: I hate to quibble with the OED, but its definition is overly IBM-360-centric. I'm sure some of you remember the 36-bit-word machines of the 1950s through 80s (and I believe at least one is still in production to this day -- the heir to the Sperry 1100 line). Many of these machines had bytes of non-8 sizes, and some had variable-length bytes. On the PDP-6 and PDP-10, for example, bytes were commonly 7 bits long, but could be (and often were) any size between 1 and 36 bits. I mention this only because these influential but by now largely forgotten machines are poised to make a comeback, thanks to *several* PDP-10 emulators that were released in recent weeks. For more about one of the 36-bit machines, the DECSYSTEM-20, see: http://www.columbia.edu/kermit/dec20.html and see the links at the bottom if you want to find the emulators. For more about some of the other 36-bit machines (such as the IBM 701 and its descendants), see: http://www.columbia.edu/~fdc/timeline.html None of this has a thing to do with Unicode, so anybody who finds this interesting is encouraged to "turn on the wayback machine" at: news:alt.sys.pdp10 - Frank
[unicode] Re: UCS-2 Files
On Thu, 22 Mar 2001, Jeff Guevin wrote: The Web version of the Oxford English Dictionary (http://dictionary.oed.com) says a byte is always eight bits: [...] There is at least one computer currently in use whose bytes are 6 bits! :) It's the MIX machine by Donald Knuth, which you can find out about in The Art of Computer Programming, Volume 1, Fundamental Algorithms. I can't find the exact wording, since I don't have a copy here. It should be at the very beginning of Section 1.2 (or 1.3?). (And for those Knuth fans out there, yes I know that he has introduced MMIX now, that even adopts Unicode in the identifiers and strings of its assembly language, MMIXAL, but that came after the publication of the latest editions of the first three volumes, so all examples in the book are still in MIX. It's not official yet!) --roozbeh
[unicode] Re: UCS-2 Files
On Thu, 2001-03-22, marco.cimarosti wrote: Better if you also keep the distinction between "octet" (a series of 8 bits) and "byte" (a series of n bits, where n is often but NOT always 8). When is a byte not eight bits? When it is 6 bits or 12 bits or 16 bits or 18 bits... The Web version of the Oxford English Dictionary (http://dictionary.oed.com) says a byte is always eight bits: "A group of eight consecutive bits operated on as a unit in a computer." 1964 BLAAUW & BROOKS in IBM Systems Jrnl. III. 122 An 8-bit unit of information is fundamental to most of the formats [of the System/360]... That's what you (& OED) get for relying on Ill-Begotten Monstrosities for your definitions. To those maroons a queue is a "spool" because their systems were so primitive they had to put all files to be printed onto mag-tape, then have the operator physically move the mag-tape to a drive on another system that could do printing. On Control Data Cybers (designed by Seymour Cray) a byte was either 6 bits or 12 bits, depending on context, and the fundamental definition of byte was the number of bits needed to represent a character. It's only been recently that people have resorted to the clearer, less baggage-perverted term "octet" to be exactly 8 bits regardless of the system, and then speak of the number of octets needed to represent something (e.g. an IP or character). John G. Otto Nisus Software, Engineering www.infoclick.com www.mathhelp.com www.nisus.com software4usa.com EasyAlarms PowerSleuth NisusEMail NisusWriter MailKeeper QUED/M My opinions are probably not those of Nisus Software, Inc.
[unicode] Re: UCS-2 Files
When is a byte not eight bits? The Web version of the Oxford English Dictionary (http://dictionary.oed.com) says a byte is always eight bits: Well, just my cursory research shows that to be an overstatement. http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?query=byte says: A byte may be 9 bits on 36-bit computers. Some older architectures used "byte" for quantities of 6 or 7 bits, and the PDP-10 and IBM 7030 supported "bytes" that were actually bit-fields of 1 to 36 (or 64) bits! These usages are now obsolete [...] However, it is not difficult to find character encodings that are defined in terms of, or that refer to, 7-bit "bytes" -- ASCII [1] and ISO-2022-JP [2] being examples thereof. ISO 2022 [3] in fact defines a byte as "a bit string that is operated upon as a unit" and goes on to say "A graphic character shall have a coded representation comprising one or more 8-bit combinations (bytes) in an 8-bit code, and one or more 7-bit combinations (bytes) in a 7-bit code. Within a coded graphic character set each character shall be represented by the same number of such bit combinations." So you can see that "octets" is the preferable term when referring to units comprised of exactly 8 bits. [1] ftp://ftp.ecma.ch/ecma-st/Ecma-006.pdf (close enough) [2] http://www.faqs.org/rfcs/rfc1468.html [3] ftp://ftp.ecma.ch/ecma-st/Ecma-035.pdf (close enough)
[unicode] Re: UCS-2 Files
On 03/22/2001 12:09:09 PM unicode-bounce wrote: When is a byte not eight bits? When it's seven or less, and when it's nine or more. For some, the definition of byte allows such possibilities. This is reflected in the fact that ISO uses the term "octet" where you would use "byte". - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]