RE: What's in a wchar_t string ...
Folks,

Since "ISO/IEC 9899 - Programming Language C" was quoted, I wonder if you are aware of the efforts of SC22/WG14 to develop a Technical Report that deals with the problems discussed in this thread. The document is

ISO/IEC DTR 19769 - Extensions for the programming language C to support new character data types

The project is currently in DTR ballot and, when approved, will certainly take some time to be implemented in C compilers and in operating systems. But it gives a good indication of the direction in which formal standardization of data types in the C language is going.

Here are some excerpts from DTR 19769:

Quote:

3 The new typedefs

This Technical Report introduces the following two new typedefs, char16_t and char32_t:

typedef T1 char16_t;
typedef T2 char32_t;

where T1 has the same type as uint_least16_t and T2 has the same type as uint_least32_t.

The new typedefs guarantee certain widths for the data types, whereas the width of wchar_t is implementation defined. The data values are unsigned, while char and wchar_t could take signed values.

This Technical Report also introduces the new header:

<uchar.h>

The new typedefs, char16_t and char32_t, are defined in <uchar.h>.

4 Encoding

C99 subclause 6.10.8 specifies that the value of the macro __STDC_ISO_10646__ shall be "an integer constant of the form yyyymmL (for example, 199712L), intended to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month." C99 subclause 6.4.5p5 specifies that wide string literals are initialized with a sequence of wide characters as defined by the mbstowcs function with an implementation-defined current locale.

Analogous to this macro, this Technical Report introduces two new macros.

If the header <uchar.h> defines the macro __STDC_UTF_16__, values of type char16_t shall have UTF-16 encoding. This allows the use of UTF-16 in char16_t even when wchar_t uses a non-Unicode encoding. In certain cases the compile-time conversion to UTF-16 may be restricted to members of the basic character set and universal character names (\U and \u) because for these the conversion to UTF-16 is defined unambiguously.

If the header <uchar.h> defines the macro __STDC_UTF_32__, values of type char32_t shall have UTF-32 encoding.

If the header <uchar.h> does not define the macro __STDC_UTF_16__, the encoding of char16_t is implementation defined. Similarly, if the header <uchar.h> does not define the macro __STDC_UTF_32__, the encoding of char32_t is implementation defined.

An implementation may define other macros to indicate a different encoding.

Unquote

The document, which of course is copyrighted by ISO, starts with a nice introduction that defines the problem. In addition to the excerpts above, it also addresses the following subjects:

5 String literals and character constants
5.1 String literals and character constants notations
5.2 The string concatenation
6 Library functions
6.1 The mbrtoc16 function
6.2 The c16rtomb function
6.3 The mbrtoc32 function
6.4 The c32rtomb function
7 ANNEX A Unicode encoding forms: UTF-16, UTF-32

Best regards
Arnold

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nelson H. F. Beebe
Sent: Wednesday, March 03, 2004 1:49 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: What's in a wchar_t string ...

"Frank Yung-Fong Tang" <[EMAIL PROTECTED]> asks on Wed, 3 Mar 2004 12:38:49 -0500:

>> Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__ is defined?
>> or does it only mean wchar_t hold the character in ISO_10646
>> (which mean it could be 2 bytes, 4 bytes or more than that?)

Here is the exact text from

INTERNATIONAL ISO/IEC STANDARD 9899
Second edition 1999-12-01
Programming languages -- C

>> ...
>> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
>> example, 199712L), intended to indicate
>> that values of type wchar_t are the coded
>> representations of the characters defined
>> by ISO/IEC 10646, along with all amendments
>> and technical corrigenda as of the
>> specified year and month.
>> ...

It says nothing more about the size of wchar_t, or what encodings are used: note the vague language "coded representations...". This means effectively that the implementation, not the Standard, decides.

Very few current Unix C or C++ compilers even define the symbol __STDC_ISO_10646__; the C/C++ feature test package at

ftp://ftp.math.utah.edu/pub/features
http://www.math.utah.edu/pub/features

probes that macro value, and many others. My logs of its runs in about 90 build environments show definitions with values 29 for GNU gcc versions 3.x (all platforms), Intel icc versions 7.x and 8.0 (Intel IA-32 and
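The kind of probing discussed in this thread can be sketched in a few lines of C. This is only a minimal illustration, assuming a toolchain that ships the <uchar.h> header described in the DTR (C11 later adopted it); the function names iso10646_version, char16_is_utf16, char32_is_utf32, bits_of and describe_char_types are hypothetical and are not taken from the DTR or from the feature test package mentioned above.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <uchar.h>   /* char16_t, char32_t; assumes the toolchain provides this header */
#include <wchar.h>

/* Value of __STDC_ISO_10646__ (form yyyymmL), or 0 when the
 * implementation does not define the macro at all. */
long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return (long)__STDC_ISO_10646__;
#else
    return 0L;
#endif
}

/* 1 when the implementation promises UTF-16 in char16_t, else 0
 * (the encoding is then implementation defined). */
int char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;
#endif
}

/* 1 when the implementation promises UTF-32 in char32_t, else 0. */
int char32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return 1;
#else
    return 0;
#endif
}

/* Storage width in bits of an object of nbytes bytes. */
unsigned bits_of(size_t nbytes)
{
    return (unsigned)(nbytes * CHAR_BIT);
}

/* Hypothetical helper: writes a short report into buf and returns
 * the snprintf result (number of characters the report needs). */
int describe_char_types(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size,
                    "wchar_t:  %u bits, __STDC_ISO_10646__ = %ld\n"
                    "char16_t: %u bits, UTF-16 promised: %s\n"
                    "char32_t: %u bits, UTF-32 promised: %s\n",
                    bits_of(sizeof(wchar_t)), iso10646_version(),
                    bits_of(sizeof(char16_t)), char16_is_utf16() ? "yes" : "no",
                    bits_of(sizeof(char32_t)), char32_is_utf32() ? "yes" : "no");
}
```

Calling describe_char_types with a buffer of a few hundred bytes from main and printing the result shows what a given compiler and library actually promise. A report of 0 for __STDC_ISO_10646__ only means the macro is undefined; it says nothing about the width of wchar_t, which is reported separately.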
RE: UNICODE & OTHER STANDARDS
And of course: COBOL, FORTRAN, C, C++, POSIX, 10176 (characters for identifiers in programming languages), 14651 (string ordering), 15897 (registry of cultural elements), the 8859 family, 15924 (names of scripts), 19769 (new character types in C), and more ...

Arnold

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Patrick Andries
Sent: Monday, December 29, 2003 2:38 PM
To: Markus Scherer; unicode
Subject: Re: UNICODE & OTHER STANDARDS

- Original Message -
From: "Markus Scherer" <[EMAIL PROTECTED]>

> It looks to me like Christopher is not after an analysis of what standards could somehow be squeezed
> to use Unicode charsets, but rather a list of standards that _specify_ (actively, not potentially)
> Unicode/10646.
>
> The obvious ones are of course
> HTML (at least since 4.01: http://www.w3.org/TR/html401/charset.html#h-5.1)
> XML
> ECMAScript
>
> I do not have a complete list.

Another one: ISO 14651 (collation), I believe. Ken Whistler (or Alain Labonté) can confirm (or deny) this.

P. A.
RE: Stability of WG2
Jill,

Speaking as an Austrian, I don't care why the UK does not participate in SC2/WG2. But I DO appreciate the information that I am not going to see an answer to this question. Please be kind to Michael.

Regards
Arnold

From: Arcane Jill [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 8:41 AM
To: [EMAIL PROTECTED]
Subject: RE: Stability of WG2

Speaking as a Brit, I would like to know the answer to this one too. What's the problem with answering online? And if you're really not going to answer this online, you could have just emailed Peter privately, instead of telling the whole list that you're going to keep the answer secret from all of us except Peter. What a wind up!

Jill

> -Original Message-
> From: Michael Everson [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, December 16, 2003 12:49 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Stability of WG2
>
> At 04:36 -0800 2003-12-16, Peter Kirk wrote:
>
> >Seriously, can you remind us briefly what the situation is, why
> >there is no current UK representation?
>
> I will answer this off-line.
> --
> Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Backslash n [OT] was Line Separator and Paragraph Separator
Jill,

The standard is available at
http://www.techstreet.com/cgi-bin/detail?product_id=232462

It is a bargain: the PDF file goes for $18.00 (yes, eighteen USD). The printed version is somewhat more expensive, $220.00.

Go order it, and your desire for a reference will be satisfied.

Regards
Arnold

-Original Message-
From: Jill Ramonsky [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 21, 2003 8:31 AM
To: [EMAIL PROTECTED]
Subject: RE: Backslash n [OT] was Line Separator and Paragraph Separator

I am very happy to be corrected. Thank you very much. I would also greatly appreciate the "chapter and verse" ... not because I want to carry on arguing (I don't), but simply because I would very much like to have that standard available to me as a reference work.

Thanks again, and my apologies, John.
Jill

> -Original Message-
> From: John Cowan [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, October 21, 2003 1:19 PM
> To: Jill Ramonsky
> Cc: [EMAIL PROTECTED]
> Subject: Re: Backslash n [OT] was Line Separator and Paragraph Separator
>
> Jill Ramonsky scripsit:
>
> > This is axiomatically *THE* definition. Period. Everything else is
> > merely quoting, rephrasing or reinterpreting this original.
>
> Absolutely not. The *standard* for the C programming language is now
> ISO/IEC 9899.
> Anyone have the standard handy to quote chapter and verse?
RE: Aramaic, Samaritan, Phoenician
I grew up in Austria more than 50 years ago, and trust me, cursive script was already ancient then. Yes, we had to learn it (1945 - 1948) in primary school, but even then it was not used any more (except for some VERY old people with grey or no hair at all). I might still be able to read it, but I was never able to write it legibly. Just checked with my children - writing cursive had disappeared from the schools altogether before the 1960's. Arnold PS.: I blame the fact that I had to learn to write cursive for my lousy handwriting today - at least it is a good excuse. -Original Message- From: Michael Everson [mailto:[EMAIL PROTECTED] Sent: Tuesday, July 15, 2003 9:54 AM To: [EMAIL PROTECTED] Subject: Re: Aramaic, Samaritan, Phoenician At 08:42 -0400 2003-07-15, Karljürgen Feuerherm wrote: > Michael Everson said: > > My native script isn't Hebrew but I am certain that no one who was could > > easily read a newspaper article written in Phoenician or Samaritan letters. > >Surely that is not an argument for encoding a separate script, is it? It is sometimes. :-) >Most German people I know can't read the German >cursive script used say 50 years ago. But the >characters clearly correspond to the Latin >characters in use today. The handwriting is difficult to read. One would think that in German schools it would be at least introduced so children would know about it. -- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Historians- what is origin of i18n, l10n, etc.?
Hideki,

You are most likely right that I18N was used much earlier than I was able to witness. I entered the standards game in 1989 (X3/L2) and started with the POSIX activity sometime in 1991.

Thanks for remembering.
Arnold

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Thursday, October 10, 2002 10:18 AM
To: Winkler, Arnold F
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: Historians- what is origin of i18n, l10n, etc.?

> From: "Winkler, Arnold F" <[EMAIL PROTECTED]>
> Sometime around 1991 in an IEEE P1003.1 (POSIX) meeting, Gary Miller (IBM)
> was writing on the blackboard. After having spelled out
> Internationalization a few times, he first abbreviated it to I--n and a bit
> later (obviously after counting the letters in between) used I18N. Sandra
> might have been at the meeting, and Keld - they might be able to confirm my
> recollection.

The acronym "I18N" appeared before 1991, since I recall I had already used I18N in '89 ;-).

As far as is on record, this kind of acronym began with S12N (Scherpenhuizen) at DEC, as his email address on DEC VMS. By 1985, I18N had become an acronym for Internationalization in the I18N team at DEC, following Scherpenhuizen's S12N convention.

Among the standards organizations, /usr/group (which later became UniForum) was the first to use I18N as an acronym for Internationalization, in '88.

--
hiura@{freestandards.org,OpenI18N.org,li18nux.org,unicode.org,sun.com}
Chair, Li18nux/Linux Internationalization Initiative, http://www.li18nux.org
Board of Directors, Free Standards Group, http://www.freestandards.org
Architect/Sr. Staff Engineer, Sun Microsystems, Inc, USA
eFAX: 509-693-8356
RE: Historians- what is origin of i18n, l10n, etc.?
Tex,

Here is my recollection:

Sometime around 1991 in an IEEE P1003.1 (POSIX) meeting, Gary Miller (IBM) was writing on the blackboard. After having spelled out Internationalization a few times, he first abbreviated it to I--n and a bit later (obviously after counting the letters in between) used I18N. Sandra might have been at the meeting, and Keld - they might be able to confirm my recollection.

L10N did not show up until quite some time later. I have no idea who used it first.

Regards
Arnold

-Original Message-
From: Tex Texin [mailto:[EMAIL PROTECTED]]
Sent: Thursday, October 10, 2002 2:02 AM
To: NE Localization SIG; Unicoders
Subject: Historians- what is origin of i18n, l10n, etc.?

I was asked about the origin of these acronyms. Does anyone know who created these or where they were first used?

tex

--
Tex Texin cell: +1 781 789 1898 mailto:[EMAIL PROTECTED]
Xen Master http://www.i18nGuy.com
XenCraft http://www.XenCraft.com
Making e-Business Work Around the World
RE: OCR characters
Folks, that is my VERY LAST post on this VERY OLD subject: In the L2 document register I found L2/98-397 http://www.unicode.org/L2/L2/98396.pdf which is a proposal for ISO/IEC TR 15907, a Type 3 TR for the revision of ISO 1073/II:1976. On page 18 is a note that says: NOTE – The glyphs previously defined with reference numbers 120 (CHARACTER ERASE) and 121 (GROUP ERASE) have been deleted. That's the end of my digging in older documents. And have a nice weekend too ! Arnold > -Original Message- > From: Otto Stolz [mailto:[EMAIL PROTECTED]] > Sent: Friday, August 16, 2002 10:30 AM > To: Winkler, Arnold F > Cc: Eric Muller; [EMAIL PROTECTED] > Subject: Re: OCR characters > > > Eric Muller had written: > > > In our OCR fonts, we have two glyphs named "erase" [...] > > > and "grouperase" [...] I suspect those are mandated by these > > standards. On the other hand, and I can't find traces of those in > > Unicode, >
RE: OCR characters
Otto,

I am looking at ISO 1073/II-1976:

The two erase characters are the only members of set #5; their reference numbers are 120 and 121. The "Remarks" column is empty.

6.4 says: Application advice is given in the column "Remarks", where it is indicated, inter alia, which characters are included for general purpose use only and should not be used for OCR purposes. (I guess an empty column means that the character can be used for OCR.)

I have not found any more information in ISO 1073/II:1976. Sorry

Arnold

> -Original Message-
> From: Otto Stolz [mailto:[EMAIL PROTECTED]]
> Sent: Friday, August 16, 2002 10:30 AM
> To: Winkler, Arnold F
> Cc: Eric Muller; [EMAIL PROTECTED]
> Subject: Re: OCR characters
>
> Eric Muller had written:
> > > In our OCR fonts, we have two glyphs named "erase" [...]
> > > and "grouperase" [...] I suspect those are mandated by these
> > > standards. On the other hand, and I can't find traces of those in
> > > Unicode,
>
> Arnold F. Winkler wrote:
> > I believe, Eric is talking about the characters on the attached page 8 of
> > the OCR standard.
>
> I don't have ISO 1073 at hand, only the German
> - DIN 66 008 (Jan 1978), which is essentially identical with ISO 1073/I-1976,
>   and
> - DIN 66 009 (Sept. 1977), which is based on, but not identical with,
>   ISO 1073/II-1976.
>
> DIN 66 008 contains the figure reported by Arnold Winkler. This standard
> does not specify the intended usage of these characters -- not beyond their
> expressive names.
>
> DIN 66 009 says about the equivalent OCR-B characters (my translation):
> > In case of a typo, a keyboard-driven device will print the Character Erase
> > on top of an erroneous character. This will cause the OCR reading device
> > to ignore this position.
> > The Group Erase may be either drawn by hand, or printed as discussed in
> > the previous paragraph. It will cause the OCR reading device to ignore
> > this position.
>
> So, these characters would never be read by an OCR device. They would be
> printed only in response to a function key (such as Erase Backwards), but
> never sent (encoded as characters) to a device. This means that they will
> not normally be encoded, hence there will probably be no need to assign
> Unicodes to them.
>
> The only exception could be a text discussing these characters, and
> their usage. I think this sort of text would use figures rather than
> characters, to show the effect of overprinting in several variants.
> (The Erase, and the erased, character's positions may slightly differ.)
>
> So I guess these characters are deliberately left off Unicode.
>
> Best wishes,
>    Otto Stolz
RE: OCR characters
I believe, Eric is talking about the characters on the attached page 8 of the OCR standard. Regards Arnold > -Original Message- > From: Eric Muller [mailto:[EMAIL PROTECTED]] > Sent: Thursday, August 15, 2002 7:44 PM > To: [EMAIL PROTECTED] > Subject: OCR characters > > > In our OCR fonts, we have two glyphs named "erase" (looks > like a black > square) and "grouperase" (looks like a long dash). I don't > have a copy > of the OCR standards, but I suspect those are mandated by these > standards. On the other hand, and I can't find traces of those in > Unicode, so I suspect they have been unified. But with which > characters? > More generally, are there other things like that we should aware of? > > Thanks, > Eric. > > > Page-8-OCR-B.pdf Description: Binary data
RE: UniCharacter (Re: Codes for codes for codes for... (RE: Chromatic font research))
Folks, WAIT A BIT. This method, as tempting as it is, would make all text "not accessible" for people with visual disabilities. And, as you all know, Section 508 requires that any electronic information from the government (e.g. web site) must be accessible to people with disabilities. Here goes a great idea unless we find an accessible way to "display" colors for the blind ! Assistive Technologies companies - here is your challenge !!! Arnold > -Original Message- > From: Rick McGowan [mailto:[EMAIL PROTECTED]] > Sent: Thursday, June 27, 2002 1:12 PM > To: [EMAIL PROTECTED] > Subject: Re: UniCharacter (Re: Codes for codes for codes for... (RE: > Chromatic font research)) > > > Tex wrote: > > > Lends a whole new meaning to unification! The single > character encoding, > > UniCharacter!. Just color what you need. > > Yeah! I like Tex's suggestion. It would eliminate all kinds > of problems. > We wouldn't have to worry about encoding anything ever again, > because users > would have all the tools they need to express whatever they > wanted just by > coloring in the bits! And nobody would have any problems decoding it! > > The only question that remains is, "how much resolution is > enough"? I > think if we have 512x512 bytes for 256x256 resolution at > 16-bits/pixel for > color, that ought to be enough resolution to satisfy anyone. So each > character would only require 2,097,152 bits. With all the > fancy compression > schemes we could cook up, that shouldn't pose any difficulty > at all. And > it really ought to appeal to the RAM manufacturers... 
> > Speaking of compression schemes, we could pick a space of say > 32 bits and > allow people to register the characters they like by NUMBER > (!), and we > could keep a whole technical committee engrossed for decades > in deciding > which proposed pictures were really the same and thus have > "already been > registered", and numbering things, then we could transmit > information > compactly by using the catalog numbers instead of the > pictures. That might > be helpful to users, I'm not sure... > > Rick > > >
Re: [OT] Re: The exact birthday of French: 0842-02-14
In this thread, the name "Illig" has been mentioned a few times. Here is some information about his book(s) on the subject: Heribert Illig : "Wer hat an der Uhr gedreht ?" (Wie 300 Jahre Mittelalter erfunden wurden) ISBN 3-612-26561-X, ECON Verlag This book is in German language, I have not seen a translation. Illig has also written an earlier book, called "Das erfundene Mittelalter". Its very first edition was called "Karl der Fiktive, genannt Karl der Große". Historic articles appeared also in publications such as "Vorzeit-Frühzeit-Gegenwart" and "Zeitensprünge" Keep up this discussion, I am enjoying it. Arnold F. Winkler Internationalization Evangelist Tel: 610-648-2055, NET-385-2055 Fax: 610-695-5473 E-mail: [EMAIL PROTECTED]
FW: Bar codes using unicode
Found that somewhat old e-mail from Clive, but the web site is still there ...

Good luck
Arnold

-Original Message-
From: Hohberger, Clive [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 11, 2001 5:34 AM
To: '[EMAIL PROTECTED]'
Subject: Bar codes using unicode

Speaking as a member of the AIM bar code standards committee, there are two new bar codes which support Unicode.

93i (designed by Sprague Ackey of Intermec) is a linear, error-correcting barcode that has been issued as an AIM International Technical Standard, and it encodes Unicode 2.0/2.1. For an overview, see:
http://www.aimglobal.org/standards/symbinfo/93i_overview.htm

Ultracode(r) and Color Ultracode (designed by me; Zebra Technologies Corporation) are 2-dimensional error-correcting symbologies in the AIM standards process. The Ultracode symbology is a constant-height, variable-length two-dimensional "linear matrix" using 9-cell high x 2-cell wide tiles containing 283 different values (originally 47). Ultracode can encode either 8-bit, multi-byte or the full 21-bit Unicode 3-series character sets. Because of the unique way in which characters are encoded, there is little difference in symbol length when either 8-bit or Unicode encoding is used with either Latin or non-Latin characters such as Chinese, Japanese and Korean. UTF-8 is the default input/output.

Black & white Ultracode is scheduled for completion this year... Color Ultracode in 2002. Anyone wishing a copy of the current Ultracode draft spec should contact me offline ([EMAIL PROTECTED])

Clive
Shape of the US Dollar Sign
Friends,

I got a request I can't answer, but I am sure one of you knows all about it:

> I'm in the middle of a research for my Commercial Laws
> IV subject, and I need to know what's the official US
> dollar sign: the S crossed by one or two vertical lines?
> Is there any law that says so?

Thanks for your help

Arnold F. Winkler
Internationalization Evangelist
Tel: 610-648-2055, NET-385-2055
Fax: 610-695-5473
E-mail: [EMAIL PROTECTED]
RE: Limbu script proposal
Folks, Michael Everson has produced a consolidated document N2339 on Limbu during the WG2 meeting - I don't have an electronic copy of the document, but I am sure, we can get it. It is essentially a combination of the 3 L2 documents. Arnold -Original Message- From: Asmus Freytag [mailto:[EMAIL PROTECTED]] Sent: Tuesday, May 22, 2001 1:27 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Limbu script proposal Mike, Unicode would like to submit the following script proposals with the intent for WG2 to consider them at the Singapore meeting for a future amendment to ISO/IEC 10646-1. You will find the softcopy of the documents on the L2 home page. The documents are L2 001-137 Proposal for Encoding of the Limbu Script L2 001-138 Summary proposal form (Limbu) L2 001-139 Printed samples of Limbu Please follow up with Rick or Ken if you have issues with any of the contents. A./ Asmus Freytag Unicode Liaison to WG2
RE: The Unicode Standard, Version 4.0
Made it UTC/L2 document L2/01-140. Lisa, can you please put it on the agenda for the next meeting in Pleasanton. We might even have some proposals by then. Regards Arnold -Original Message- From: Mark D.K. Whistler [mailto:[EMAIL PROTECTED]] Sent: Sunday, April 01, 2001 8:52 AM To: [EMAIL PROTECTED] Subject: The Unicode Standard, Version 4.0 We are pleased to announce the release of The Unicode Standard, Version 4.0. The character repertoire in Unicode 4.0 is so far identical to that of Unicode 3.0.1, but it will soon increase thanks to *your* help. The primary feature of Unicode 4.0 is that the addition of new code points has now been entirely *liberalized*. The Unicode Technical Committee, jointly with all involved national bodies, has verified that all characters that are likely to be needed for use on computers in the next millennium have already been added with Version 3.0.1. In a recent e-mail to this mailing list, a respected official of The Unicode Consortium wrote: "In other words, unless someone manages to wrest the standard away from the two committees and puts up a public website with an 'Encode Your Character Here For Free and Enter Our Sweepstakes!' interface, I'm not going to worry about 'precious codespace' and neither should anybody else." Well, now we can publicly announce that the words of that official were just an anticipation of the decision that the UTC was in process of making. Unicode has now so many empty slots that everybody can get one (or more) and encode whatever he or she wishes! Open the form attached to this mail and be the first one to take advantage of the new mechanism to propose Unicode characters. __ FREE Personalized Email at Mail.com Sign up at http://www.mail.com/?sr=signup