Re: C1 controls and terminals (was: Re: Euro character in ISO)
Frank da Cruz wrote: Doug Ewell wrote: That last paragraph echoes what Frank said about "reversing the layers," performing the UTF-8 conversion first and then looking for escape sequences. True UTF-8 support, in terminal emulators and in other software as well, really should depend on UTF-8 conversion being performed first. The irony is, when using ISO 2022 character-set designation and invocation, you have to handle the escape sequences first to know if you're in UTF-8. Therefore, this pushes the burden onto the end-user to preconfigure their emulator for UTF-8 if that is what is being used, when ideally this should happen automatically and transparently. I may be misunderstanding the above, but ISO 2022 says: ESC 2/5 F shall mean that the other coding system uses ESC 2/5 4/0 to return; ESC 2/5 2/15 F shall mean that the other coding system does not use ESC 2/5 4/0 to return (it may have an alternative means to return or none at all). Registration number 196 is for UTF-8 without implementation level, and its escape sequence is ESC 2/5 4/7. I believe that ISO 2022 was designed that way so that a decoder that does not know UTF-8 (or any other coding system invoked by ESC 2/5 F) could simply "skip" the octets in that encoding until it gets to the octets ESC 2/5 4/0. This means that it does not need to decode UTF-8 just to find the escape sequence ESC 2/5 4/0. UTF-8 does not do anything special with characters below U+0080 anyway (they're just single-byte ASCII), so it works, no? Of course, if you wanted to include any C1 controls inside the UTF-8 segment, they would have to be encoded in UTF-8, but ESC 2/5 4/0 is entirely in the ASCII range (less than 128), so those octets are encoded as is. Erik
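The "skip" behaviour Erik describes can be sketched as follows. This is only an illustration, not code from any real terminal emulator: the function name `strip_utf8_segments` is hypothetical, and it assumes the UTF-8 segment is announced with ESC 2/5 4/7 and ended with the standard return ESC 2/5 4/0. A plain byte search is enough because, in UTF-8, octets below 0x80 occur only as single-byte ASCII characters, so the return sequence cannot appear inside a multi-byte character.

```python
ESC = 0x1B
ENTER_UTF8 = bytes([ESC, 0x25, 0x47])   # ESC 2/5 4/7  (ESC % G)
RETURN_2022 = bytes([ESC, 0x25, 0x40])  # ESC 2/5 4/0  (ESC % @)


def strip_utf8_segments(stream: bytes) -> bytes:
    """Drop every ESC % G ... ESC % @ segment without decoding it.

    Hypothetical helper: models a decoder that does not know UTF-8
    and simply skips octets until the standard return appears.
    """
    out = bytearray()
    pos = 0
    while True:
        start = stream.find(ENTER_UTF8, pos)
        if start == -1:
            # No further UTF-8 segment: keep the remainder as is.
            out += stream[pos:]
            return bytes(out)
        out += stream[pos:start]
        end = stream.find(RETURN_2022, start + len(ENTER_UTF8))
        if end == -1:
            # No standard return seen: everything after the switch is skipped.
            return bytes(out)
        pos = end + len(RETURN_2022)
```

For example, a stream consisting of `abc`, the switch to UTF-8, some UTF-8 text, the standard return, and `def` comes out as `abcdef`.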
Re: C1 controls and terminals (was: Re: Euro character in ISO)
Erik van der Poel wrote: Frank da Cruz wrote: The irony is, when using ISO 2022 character-set designation and invocation, you have to handle the escape sequences first to know if you're in UTF-8. Therefore, this pushes the burden onto the end-user to preconfigure their emulator for UTF-8 if that is what is being used, when ideally this should happen automatically and transparently. I may be misunderstanding the above, but ISO 2022 says: ESC 2/5 F shall mean that the other coding system uses ESC 2/5 4/0 to return; ESC 2/5 2/15 F shall mean that the other coding system does not use ESC 2/5 4/0 to return (it may have an alternative means to return or none at all). Registration number 196 is for UTF-8 without implementation level, and its escape sequence is ESC 2/5 4/7. I believe that ISO 2022 was designed that way so that a decoder that does not know UTF-8 (or any other coding system invoked by ESC 2/5 F) could simply "skip" the octets in that encoding until it gets to the octets ESC 2/5 4/0. This means that it does not need to decode UTF-8 just to find the escape sequence ESC 2/5 4/0. UTF-8 does not do anything special with characters below U+0080 anyway (they're just single-byte ASCII), so it works, no? Yes, but I was thinking more about the ISO 2022 invocation features than the designation ones: LS2, LS3, LS1R, LS2R, LS3R, SS2, and SS3 are C1 controls. The situation *could* arise where these would be used prior to announcing (or switching to) UTF-8. In this case, the end-user would have to configure the software in advance to know whether the incoming byte stream is UTF-8. Not a big deal; just an illustration of what happens when we can't use the normal layering. - Frank
Re: C1 controls and terminals (was: Re: Euro character in ISO)
Frank da Cruz wrote: Yes, but I was thinking more about the ISO 2022 invocation features than the designation ones: LS2, LS3, LS1R, LS2R, LS3R, SS2, and SS3 are C1 controls. The situation *could* arise where these would be used prior to announcing (or switching to) UTF-8. In this case, the end-user would have to configure the software in advance to know whether the incoming byte stream is UTF-8. Shouldn't the UTF-8 segment switch back to ISO 2022 before invoking any of those C1 controls? This way, the decoder wouldn't have to know UTF-8, and could skip over it reliably. Erik
Re: Euro character in ISO
On Tue, 11 Jul 2000, Asmus Freytag wrote: The only safe way to encode a Euro in HTML appears to be to use Unicode - e.g. by using 8859-1 together with the numeric character reference (NCR) of &#x20AC;

&euro; is much safer. Netscape 4 doesn't recognize hexadecimal character references.

--roozbeh
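As a small illustration of the approach discussed here, Python's standard `xmlcharrefreplace` error handler writes any character missing from the target charset as a decimal numeric character reference. The function name `to_latin1_html` is hypothetical; the sketch assumes the page is served as iso-8859-1.

```python
def to_latin1_html(text: str) -> bytes:
    """Encode text as ISO 8859-1, falling back to decimal NCRs.

    Characters outside Latin-1 (such as U+20AC) become &#NNNN;
    so the byte stream stays valid 8859-1.
    """
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


print(to_latin1_html("price: 5 \u20ac"))
```

The handler emits the decimal form (8364 is 0x20AC), which sidesteps the problem with hexadecimal references mentioned above.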
Re: Euro character in ISO
Ar 15:30 -0800 2000-07-11, scríobh Asmus Freytag: At 01:25 PM 7/11/00 -0800, Leon Spencer wrote: Has ISO addressed the Euro character? Yes. It's at 0x20AC in ISO/IEC 10646-1. This is not a standard notation. Please use U+20AC or one of the other standard notations to refer to UCS code positions. ME
Re: Euro character in ISO
Ar 18:19 -0800 2000-07-11, scríobh Robert A. Rosenberg: The problem would go away if the ISO would get their heads out of their a$$ and drop the C1 junk from the NEW 'TOUCHED UP" 8859s and put the CP125x codes there. Excuse me, but that is not appropriate. The ISO/IEC 8859 series is conformant with ISO/IEC 2022, and protocols which adhere to that standard should not be compromised by what you suggest. Then when you said you used 8859-21 you'd get CP-1252 and Windows would no longer need to lie (or tell the truth by admitting it is CP-1252). The problem is that some companies do/did not correctly identify their code pages. The world can live with Latin-1 and CP-1252. It shouldn't have to live with CP-1252 being identified as Latin-1. Michael Everson ** Everson Gunn Teoranta ** http://www.egt.ie 15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169 27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire
Re: Euro character in ISO
Robert A. Rosenberg wrote:

At 15:30 -0800 on 07/11/00, Asmus Freytag wrote about Re: Euro character in ISO: There has been an attempt to create a series of 'touched up' 8859 standards. The problem with these is that you get all the issues of character set confusion that abound today with e.g. Windows CP 1252 mistaken for 8859-1 with a vengeance:

The problem would go away if the ISO would get their heads out of their a$$ and drop the C1 junk from the NEW 'TOUCHED UP" 8859s and put the CP125x codes there.

Sorry. It may work for CP1252/iso-8859-1, and CP1254/iso-8859-9, but won't for the others. Since Windows starts with the same letter as Word --or is the reason that they both come from the same company. No! I cannot believe that-- there are a couple of requirements that make the "other" codepages effectively slightly incompatible, such as the necessary presence of · at position B5 (because this is the character Word uses when you ask it to "display" the spaces, and this is hard-coded in the product).

Then when you said you used 8859-21 you'd get CP-1252 and Windows would no longer need to lie (or tell the truth by admitting it is CP-1252).

Even if 8859-21 is defined to be exactly the same as some stage of CP1252, and everyone in the standardization community admits this as such, habits are so entrenched, and love for Microsoft so rare in the Unix world, that you may bet a lot that such a standard will never gain wide acceptance. Furthermore, this is completely unnecessary, as nowadays such a standard exists, and it is called 'charset=windows-1252'...

The real problem is that:
- Windows browsers/MAs did not know that until 1999 (as it seems)
- Windows HTML-tools/MAs are reluctant to add the test for presence of non-Latin1 characters to either tag as iso-8859-1 or windows-1252. Apparently they are too lazy (because they already did such a test for ASCII).

Well, I am angry, because probably nowadays browsers do the job correctly.

Antoine
Re: Euro character in ISO
Robert A. Rosenberg wrote: Then when you said you used 8859-21 you'd get CP-1252 and Windows would no longer need to lie (or tell the truth by admitting it is CP-1252).

And because certain companies had (and still have) bugs in their comms products, incorrectly identifying CP1252 data as ISO 8859-1, ISO standards should reject ISO-2022 and populate C1 with graphic characters? I suppose other inconsiderate incompatibilities such as the incorrect encoding of half-pitch kana in ISO-2022-JP are the fault of ISO too?

Perhaps those companies that have these major bugs in their software, all of which have been repeatedly pointed out, should fix the problems there. The rest of the industry bends over backwards to accommodate these corrupt data, so a little effort on the part of the guilty would help a lot, and might prevent misguided postings like the above.

B=
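To make the mislabelling concrete, here is a minimal sketch of what a strictly conforming receiver sees when CP1252 data is tagged as ISO 8859-1. The helper `c1_count` and the sample text are made up for illustration; the codec names are Python's own.

```python
def c1_count(text: str) -> int:
    """Count characters in the C1 control range U+0080..U+009F."""
    return sum(1 for ch in text if 0x80 <= ord(ch) <= 0x9F)


# Curly quotes occupy 0x93 and 0x94 in CP1252.
raw = "\u201csmart\u201d quotes".encode("cp1252")

as_labelled = raw.decode("iso-8859-1")  # trusting the (wrong) label
as_intended = raw.decode("cp1252")      # what the sender actually meant

print(c1_count(as_labelled), c1_count(as_intended))
```

Decoded under the label, the two quote bytes become C1 control characters; decoded as CP1252, they are ordinary graphic characters.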
Re: C1 controls and terminals (was: Re: Euro character in ISO)
Frank da Cruz [EMAIL PROTECTED] wrote: . If you send a code in the 0x80-0x9f range to such a terminal or emulator, it properly treats it as a control code. If it was intended as a graphic character ("smart quote" or somesuch) the result is a fractured screen, sometimes even a frozen session. This is the widely reported compatibility problem between UTF-8 and terminals. I know I read somewhere, possibly on Markus Kuhn's Unicode page, possibly somewhere else, that ISO 2022 codes exist to switch out of "ISO 2022 mode" and into "UTF-8 mode" and to either allow or prevent switching back to 2022. Is there any progress on implementing this so terminals and emulators can live with UTF-8? Maybe Markus can clarify. I would be surprised if there's anything in ISO 2022 about UTF-8, except that it does provide a way to switch out of and back into ISO 2022 mode, allowing the use of character sets that do not comply with ISO 2022 and 4873. That's what the designating escape sequences "with standard return" and "without standard return" are for. But that's not quite the same thing. There is no good reason why UTF-8 couldn't be used by (say) a VT320 emulator without switching out of the ISO 2022 regime, except that UTF-8 contains C1 control codes as data. This was discussed here a while back and "the other Markus" showed how a C1-safe form of UTF-8 could have been designed: http://www.mindspring.com/~markus.scherer/utf-8c1.html But, as they say, "it's too late now". Therefore, those of us who want to make use of UTF-8 within the ISO 2022 regime must reverse the layers. First decode the UTF-8, then parse for escape sequences. Of course your emulator can get into awful trouble that way if the data stream isn't really UTF-8. But overall it's not that bad; we can live with it, and in fact have done it this way in practice in our own emulator. - Frank
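A rough sketch of the "reversed layers" idea follows: decode the UTF-8 first, then look for controls in the decoded characters. It is not taken from any actual emulator; the names `c1_candidates` and `decode_then_scan` are hypothetical, and the classification is deliberately simplified to C0, C1, and everything else.

```python
def c1_candidates(raw: bytes) -> list:
    """Offsets of octets in 0x80..0x9F, which a byte-level
    ISO 2022 parser would take for C1 controls."""
    return [i for i, b in enumerate(raw) if 0x80 <= b <= 0x9F]


def decode_then_scan(raw: bytes) -> list:
    """Decode UTF-8 first, then classify each decoded character.

    Raises UnicodeDecodeError if the stream is not really UTF-8,
    which is the failure mode mentioned in the message above.
    """
    result = []
    for ch in raw.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x20 or cp == 0x7F:
            kind = "C0"
        elif 0x80 <= cp <= 0x9F:
            kind = "C1"
        else:
            kind = "graphic"
        result.append((kind, ch))
    return result


euro_then_escape = "\u20ac".encode("utf-8") + b"\x1b[0m"
print(c1_candidates(euro_then_escape))
print(decode_then_scan(euro_then_escape)[:2])
```

The euro sign encodes as E2 82 AC, so a byte-level parser sees a C1 candidate (0x82) in the middle of a graphic character, while the decode-first path sees one graphic character followed by ESC.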
Re: Euro character in ISO
At 04:27 AM 07/12/2000 -0800, Michael Everson wrote: Ar 18:19 -0800 2000-07-11, scríobh Robert A. Rosenberg: The problem would go away if the ISO would get their heads out of their a$$ and drop the C1 junk from the NEW 'TOUCHED UP" 8859s and put the CP125x codes there. Excuse me, but that is not appropriate. The ISO/IEC 8859 series is conformant with ISO/IEC 2022, and protocols which adhere to that standard should not be compromised by what you suggest. Then when you said you used 8859-21 you'd get CP-1252 and Windows would no longer need to lie (or tell the truth by admitting it is CP-1252). The problem is that some companies do/did not correctly identify their code pages. The world can live with Latin-1 and CP-1252. It shouldn't have to live with CP-1252 being identified as Latin-1. Which is what I am saying when I talk about admitting that you are using CP-1252 not ISO-8859-1 (in your MIME/HTML headers) at least in the case where there are glyphs in the x80-x9F range in use. If a system can claim US-ASCII if no codes in the x80-xFF range appear and ISO-8859-1 otherwise (as many MUAs do), it should have the smarts to claim CP-1252 if in its scan it found a x80-x9F glyph). Michael Everson ** Everson Gunn Teoranta ** http://www.egt.ie 15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169 27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire
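The labelling rule proposed above is easy to sketch. The function name `guess_charset_label` is made up, and the sketch assumes the outgoing body was composed on a Windows Western European system, so that octets in 0x80-0x9F can only be CP1252 graphic characters.

```python
def guess_charset_label(body: bytes) -> str:
    """Pick the narrowest honest label for an outgoing message body.

    Assumption: the bytes come from a CP1252 environment.
    """
    if all(b < 0x80 for b in body):
        return "us-ascii"
    if any(0x80 <= b <= 0x9F for b in body):
        # Graphic characters in the C1 area: not valid ISO 8859-1 text.
        return "windows-1252"
    return "iso-8859-1"


print(guess_charset_label("5 \u20ac".encode("cp1252")))
```

This is the same single pass a mail program already makes when choosing between US-ASCII and ISO-8859-1, with one extra range check.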
Re: Euro character in ISO
At 08:56 PM 07/11/2000 -0800, Geoffrey Waigh wrote:

On Tue, 11 Jul 2000, Robert A. Rosenberg wrote:

At 15:30 -0800 on 07/11/00, Asmus Freytag wrote about Re: Euro character in ISO: There has been an attempt to create a series of 'touched up' 8859 standards. The problem with these is that you get all the issues of character set confusion that abound today with e.g. Windows CP 1252 mistaken for 8859-1 with a vengeance:

The problem would go away if the ISO would get their heads out of their a$$ and drop the C1 junk from the NEW 'TOUCHED UP" 8859s and put the CP125x codes there.

Except that would break all the systems that understand that C1 "junk," and a number of systems do so because they are adhering to other ISO standards. If you are going to force someone to change their datastreams to something new, they might as well go to some flavour of Unicode anyways.

Who is going to get broken if I say on my MIME header (or HTML) that my CHARSET is (example) ISO-8859-21? You are talking about uses where the computer is talking to a device and needs the C1 range to tell it what to do, not to another computer (where it is just passing a text stream). The C1 codes are DEVICE CONTROL and have no purpose (except to occupy slots that are better used for extra GLYPHS) in EMAIL or HTML transfer. I am NOT asking for anyone to change their mode of operation - only for ISO-8859-x codes that are designed for transfer of printable data. UNICODE is not a viable option since all we are talking about is the ability to select from a number of 256 codepoint 8-bit tables, not to go over to UTF-8 or UTF-16 (which would require changes to the program code).

Geoffrey "tilting at terminal emulators, err windmills."
Re: Euro character in ISO
On Wed, 12 Jul 2000 10:43:59 -0800, Robert A. Rosenberg wrote:

At 08:56 PM 07/11/2000 -0800, Geoffrey Waigh wrote:

On Tue, 11 Jul 2000, Robert A. Rosenberg wrote:

At 15:30 -0800 on 07/11/00, Asmus Freytag wrote: There has been an attempt to create a series of 'touched up' 8859 standards. The problem with these is that you get all the issues of character set confusion that abound today with e.g. Windows CP 1252 mistaken for 8859-1 with a vengeance:

The problem would go away if the ISO would get their heads out of their a$$ and drop the C1 junk from the NEW 'TOUCHED UP" 8859s and put the CP125x codes there.

Except that would break all the systems that understand that C1 "junk," and a number of systems do so because they are adhering to other ISO standards. If you are going to force someone to change their datastreams to something new, they might as well go to some flavour of Unicode anyways.

Who is going to get broken if I say on my MIME header (or HTML) that my CHARSET is (example) ISO-8859-21?

We go through this exercise about twice a year. First, let's recognize that ISO is not about to revoke Standards 4873 and 2022, so there's not much point in suggesting it.

Second, think of a terminal that complies with these standards. A physical terminal such as a VT320. I am using it to access my mail host in text mode, and I'm reading mail with (say) Unix 'mail'. The terminal does not interpret the MIME headers. It doesn't parse HTML. It implements a very straightforward finite state automaton that implements the ISO 2022 based terminal. Unix 'mail' sends to my terminal the bytes of the message, period.

Perhaps you're suggesting the Unix 'mail' should become a translation agent between the character set of the mail and that of the user's terminal? I hope not, since given that practically any character set anybody can dream up is "MIME-compliant" as long as it's tagged, then every mail program must know how to convert from every character set in existence to every other one. Or is it the mail transfer agent? Or both? It's really quite a mess; let's not go out of our way to make it worse.

To understand the implications of using 8-bit character sets that contain graphic characters in the C1 area FOR INTERCHANGE, imagine trying to do the same thing to the C0 area.

- Frank
Re: Euro character in ISO
On Wed, 12 Jul 2000, Frank da Cruz wrote:

Perhaps you're suggesting the Unix 'mail' should become a translation agent between the character set of the mail and that of the user's terminal? I hope not, since given that practically any character set anybody can dream up is "MIME-compliant" as long as it's tagged, then every mail program must know how to convert from every character set in existence to every other one.

Yes, it damn well should. And this is easy, as there is a standard Unix function that knows how to do this. (it's called iconv).

I'm logged into unix right now:

$ iconv
bash: iconv: command not found
$

How standard can it be? And what about VMS, VM/CMS, VOS, OS/390, OS/400, Tandem, and all the others? How does the mail client know what character set my terminal has?

Anyway, between you and me, there are potentially lots of places where character-set conversion can occur. Your mail client, your MTA, my MTA, my mail client, my Telnet server, my Telnet client, my terminal emulator. Let's think carefully about this before we have random combinations of these clients, agents, and servers stepping on each others' toes.

- Frank
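For readers wondering what the conversion step under debate amounts to, here is a minimal sketch using Python's built-in codecs rather than iconv itself. The function name `transcode` is hypothetical, and note that both charset names must be supplied by the caller: the sketch has no way of discovering the terminal's character set, which is exactly the open question raised above.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from the message charset to the terminal charset.

    Characters the target cannot represent are replaced with '?'.
    Both charset names are assumptions supplied by the caller.
    """
    text = data.decode(source)
    return text.encode(target, errors="replace")


print(transcode("Gr\u00fc\u00dfe".encode("utf-8"), "utf-8", "iso-8859-1"))
print(transcode("\u20ac".encode("utf-8"), "utf-8", "iso-8859-1"))
```

The second call shows the lossy case: the euro sign has no place in ISO 8859-1, so it degrades to a question mark.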
Re: Euro character in ISO
There are lots of Unixes: http://www.columbia.edu/kermit/unix.html How many of them have an iconv function?

rangda 47: man iconv
man: no entry for iconv in the manual.
rangda 48: cat /etc/motd
Welcome to Darwin!
rangda 49: well, hmmm...
zsh: command not found: well,
rangda 50:
RE: Euro character in ISO
The trick is HTML4. Since you sent the message in HTML format, the Euro is encoded as a numeric character reference. Exchange knows how to decode HTML and generate RTF, depending on what your email client needs. If you had sent plain text, the Euro would have turned into ?, as is the case in the plain text part of the multipart message.

This is the case for Outlook Express 5. Older versions of OE treated Windows-1252 and iso-8859-1 the same.

Here is the source of the message from my Outlook Express Sent Mail folder. (To see the source, open the message and press Ctrl-F3.)

From: "Chris Wendt" [EMAIL PROTECTED]
To: "Chris Wendt" [EMAIL PROTECTED]
Subject: Euro test
Date: Wed, 12 Jul 2000 15:17:49 -0700
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary="=_NextPart_000_0005_01BFEC14.57202A10"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400

This is a multi-part message in MIME format.

--=_NextPart_000_0005_01BFEC14.57202A10
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

abcdef ? abcdef

--=_NextPart_000_0005_01BFEC14.57202A10
Content-Type: text/html; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content=3D"text/html; charset=3Diso-8859-1" =
http-equiv=3DContent-Type>
<META content=3D"MSHTML 5.00.3103.1000" name=3DGENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=3D#ff>
<DIV><FONT color=3D#008000 face=3DVerdana size=3D2>abcdef &#8364;=20
abcdef</FONT></DIV></BODY></HTML>

--=_NextPart_000_0005_01BFEC14.57202A10--

-Original Message-
From: Leon Spencer [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 12, 2000 2:38 PM
To: Unicode List
Subject: RE: Euro character in ISO

Is Microsoft playing tricks in MS Outlook or IE? If I send text from Outlook Express to my exchange account, with charset set to iso-8859-1 but containing the Trademark symbol ((tm)) in the body, it shows up okay. The body of the message is in text/html. Is it possible that MS Outlook's HTML ActiveX control (which I'm assuming to be the same used for IE) is defaulting to Cp1252/Windows-1252 when it sees iso-8859-1?

Leon

BTW, The body also contains the Euro!
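The two decoding layers at work in the HTML part of that message (quoted-printable transfer encoding, then HTML character references) can be reproduced with Python's standard `quopri` and `html` modules. The helper name `decode_html_part` is hypothetical, and the sample fragment is a shortened stand-in for the body text: a decimal character reference for U+20AC (8364) followed by a quoted-printable space.

```python
import html
import quopri


def decode_html_part(qp_body: bytes, charset: str) -> str:
    """Undo quoted-printable, decode the declared charset,
    then resolve HTML character references."""
    raw = quopri.decodestring(qp_body)
    return html.unescape(raw.decode(charset))


print(decode_html_part(b"abcdef &#8364;=20abcdef", "iso-8859-1"))
```

The euro survives even though the part is labelled iso-8859-1, because the reference itself is pure ASCII and is only turned into U+20AC after the charset decoding is done.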
RE: Euro character in ISO
Does anyone know where I can easily download the latest ISO-8859-X specs? The ones at ftp.unicode.org seem to be dated 1996. Also, does anyone know which ISO-8859-X contains the Euro? Thanks.

Leon

-Original Message-
From: Murray Sargent [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, July 11, 2000 2:44 PM
To: Unicode List
Cc: Unicode List
Subject: RE: Euro character in ISO

The two statements are correct. ISO has addressed the problem by adding more ISO-8859-x standards, since changing 8859-1 would cause problems. The best thing to do is to use Unicode and avoid the codepage confusion :-)

Murray

-Original Message-
From: Leon Spencer [SMTP:[EMAIL PROTECTED]]
Sent: Tuesday, July 11, 2000 2:26 PM
To: Unicode List
Subject: Euro character in ISO

The Euro does not exist in iso-8859-1. It is in Cp1252 (WinLatin1) - Microsoft's code page superset of iso-8859-1. Is this correct? Has ISO addressed the Euro character? If so, is the issue more of vendors implementing it?

Leon
Re: Euro character in ISO
At 01:25 PM 7/11/00 -0800, Leon Spencer wrote: Has ISO addressed the Euro character?

Yes. It's at 0x20AC in ISO/IEC 10646-1.

There has been an attempt to create a series of 'touched up' 8859 standards. The problem with these is that you get all the issues of character set confusion that abound today with e.g. Windows CP 1252 mistaken for 8859-1 with a vengeance: Not only is 8859-15 slightly different from 8859-1, but the difference involves codes that are perfectly valid in 8859-1. Because for 99% of all text, there is no difference, people are almost certain to mix them up, mislabel HTML files etc. etc.

The only safe way to encode a Euro in HTML appears to be to use Unicode - e.g. by using 8859-1 together with the numeric character reference (NCR) of &#x20AC;

A./
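The point that 8859-15 reuses codes that are perfectly valid in 8859-1 can be checked directly against Python's bundled codec tables. The function name `differing_bytes` is made up for this sketch; only the upper half of the code space is compared, since the two sets agree everywhere else.

```python
def differing_bytes(a: str = "iso-8859-1", b: str = "iso-8859-15") -> dict:
    """Map each octet 0xA0..0xFF that decodes differently
    under the two charsets to its pair of characters."""
    diffs = {}
    for value in range(0xA0, 0x100):
        raw = bytes([value])
        char_a, char_b = raw.decode(a), raw.decode(b)
        if char_a != char_b:
            diffs[value] = (char_a, char_b)
    return diffs


for octet, (latin1, latin9) in differing_bytes().items():
    print(f"0x{octet:02X}: {latin1!r} vs {latin9!r}")
```

Only eight positions differ, among them 0xA4, which is the currency sign in 8859-1 and the euro sign in 8859-15. A mislabelled file therefore decodes without any error and simply shows the wrong character.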
Re: Euro character in ISO
At 15:30 -0800 on 07/11/00, Asmus Freytag wrote about Re: Euro character in ISO: There has been an attempt to create a series of 'touched up' 8859 standards. The problem with these is that you get all the issues of character set confusion that abound today with e.g. Windows CP 1252 mistaken for 8859-1 with a vengeance:

The problem would go away if the ISO would get their heads out of their a$$ and drop the C1 junk from the NEW 'TOUCHED UP" 8859s and put the CP125x codes there. Then when you said you used 8859-21 you'd get CP-1252 and Windows would no longer need to lie (or tell the truth by admitting it is CP-1252).
Re: Euro character in ISO
Robert,

I am a big fan of the Windows code pages; they often make my life easier. However, there is a disadvantage: even over the course of a few service packs (let alone a few operating systems!) the code pages have changed, and there is simply no good documentation that will tell you when (for example) Farsi characters U+06A9 and U+06AF were added to Windows CP1256 (Arabic). All that one knows for certain is that it was before Windows 98 SE and before NT4 SP5 (although it did not ship with NT4).

When you cannot figure out why an application works on one platform and not another, it can make you pine for a more stationary standard! :-)

My ISP moved to Windows 2000 so I do not have to worry about making them install things like newer code page files on the web server, but for a long time these differences plagued me heavily.

michka

- Original Message -
From: "Robert A. Rosenberg" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Cc: "Unicode List" [EMAIL PROTECTED]
Sent: Tuesday, July 11, 2000 7:19 PM
Subject: Re: Euro character in ISO

At 15:30 -0800 on 07/11/00, Asmus Freytag wrote about Re: Euro character in ISO: There has been an attempt to create a series of 'touched up' 8859 standards. The problem with these is that you get all the issues of character set confusion that abound today with e.g. Windows CP 1252 mistaken for 8859-1 with a vengeance:

The problem would go away if the ISO would get their heads out of their a$$ and drop the C1 junk from the NEW 'TOUCHED UP" 8859s and put the CP125x codes there. Then when you said you used 8859-21 you'd get CP-1252 and Windows would no longer need to lie (or tell the truth by admitting it is CP-1252).