RE: Introducing the idea of a ROMAN VARIANT SELECTOR (was: Re: Proposing Fraktur)
Hi,

Ken wrote:

> <fraktur>Das sinkende Schiff sandte</fraktur> SOS<fraktur>-Rufe.</fraktur>
>
> or conversely, perhaps better:
>
> Das sinkende Schiff sandte <antiqua>SOS</antiqua>-Rufe.

In the end, it may be more useful to mark up the semantics rather than the formatting properties, i.e.

This is not a question of <foreign origin=DE>Zeitgeist</foreign>.

It is the responsibility of the rendering engine (style sheet, ...) to map that markup to whatever font/script/typeface should be used, according to users' (or typesetters') preferences, current environment and purpose.

- The author or some post-authoring process would (hopefully ;-) ) know where the linguistic expression originates from and can apply appropriate (semantic) markup, but doesn't need to care about typesetting conventions (in which the author may not be an expert).
- The rendering engine/typesetter doesn't need any linguistic information (such as a database of loan words); it only needs to know how to map foreign content to formatting properties in a given context.
- Third, depending on the environment and purpose, different stylistic conventions may be necessary for the same linguistic expression (fraktur in one document, no special formatting in another), so any formatting-oriented markup (or encoding, for that matter) will potentially reduce the reusability of the document.

Cheers,
Oli

Oliver Christ
TRADOS GmbH
Stuttgart
[partly off-topic] A specialized kind of website, a teleutopia webspace.
The recent sending of attachments in this unicode discussion group has led me to think once again about my idea for a specialized type of website. In view of the fact that, although I can do some client side JavaScript, I have no knowledge of server side scripting, I do not know whether my idea is feasible or, if it is feasible, whether it is a relatively quick task for someone who has the right skills or a major task. Although the idea was originated as a suggested infrastructural tool for the construction of distance education packages by an informal team of people located around the world, it also potentially has applications for this unicode community, so I wondered if perhaps some of the participants of this discussion group might perhaps be willing to comment. If the Unicode Consortium would like to implement the idea, then great. Here is the idea in general terms. It is called a teleutopia webspace: the word teleutopia has five syllables, tel - eu - top - i - a and is formed by joining the prefix tel- to the word eutopia. A teleutopia webspace can be used to produce a teleutopia of people individually working at a distance in an informal manner to produce a combined result. All users of the web would be able to access a website, say, for an example here to explain the idea, www.somewhere.com and upon reaching that site an automated system would generate and display a web page that includes two lists of files, each file name in each list provided as a hyperlink, as the home page of the website. The first list is a list of all of the files that have a .htm suffix that are in the home directory of the website at that time. The second list is a list of all files that have any suffix other than .htm that are in the home directory of the website at that time. Registered users of the www.somewhere.com website are able to send emails, each email being an email that has one and only one attachment, to [EMAIL PROTECTED] which is an automated receiving system. 
The automated receiving system takes the attachment and stores it in the home directory of the www.somewhere.com webspace, either under the name that the attachment carried or, if that name is already taken, under the next sequentially available name of a local standard naming system. The idea is that a registered user of the facility can look through the www.somewhere.com website to find graphics and web pages to which to link as hypertext links, generate on his or her local computer a .htm file including those graphics and links, then email the .htm file as an attachment to [EMAIL PROTECTED] whereupon it will be received and placed in the home directory of the www.somewhere.com website, and thus be shown on the home page of www.somewhere.com for each subsequent web access, by anyone, of the www.somewhere.com website. That is only a simple example of use. A person could add new graphics, Java applets and so on to the www.somewhere.com website by this email method, and then use them in his or her own page and thus also make them available for other registered users of the www.somewhere.com website. This example uses a simple scenario where the information in the email other than the attachment is ignored. If someone implements the idea of a teleutopia webspace then he or she might possibly consider using the information in the email to add other features, such as sending the original sender of the email a deletion code so that he or she may later delete a particular file or send an updated version of it. 
If someone does implement such features could he or she please note that one possible use of a teleutopia webspace is so that people can submit a portfolio of work for assessment for a distance education qualification, in which case the files need to be sent as undeletable so that any review or assessment is based on the files submitted at a particular time and that the provenance of such a review or assessment cannot be retrospectively undermined by the files which were reviewed or assessed being altered: so, if a deletion facility is included, please also implement an option that a submitted file may be stated to be undeletable and so marked in the list that appears on the www.somewhere.com webspace. Such a website might perhaps be useful to the participants in this discussion group, so that when several people each send in graphics of glyphs for a discussion, a web page could be constructed that showed all of the graphics displayed on one page, together with hyperlinks to relevant documents that are either in the same webspace or are available at other sites on the web. William Overington 1 February 2002 www.users.globalnet.co.uk/~ngo
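The storage rule and the two home-page lists described in the message above can be sketched roughly as follows. This is only an illustration under stated assumptions: the message does not define the "local standard naming system", so the `file0001`-style names, the `store_name` and `home_page_lists` helpers, and the use of an in-memory set instead of a real directory are all hypothetical.

```python
def store_name(requested, taken, prefix="file"):
    """Return the name under which an attachment would be stored.

    `taken` is the set of names already in the home directory.
    The fallback scheme (prefix + 4-digit counter) is an assumption;
    the original message leaves the naming system unspecified.
    """
    if requested not in taken:
        return requested
    # Keep the suffix so that .htm files still land in the first list.
    dot = requested.rfind(".")
    suffix = requested[dot:] if dot > 0 else ""
    n = 1
    while f"{prefix}{n:04d}{suffix}" in taken:
        n += 1
    return f"{prefix}{n:04d}{suffix}"


def home_page_lists(names):
    """Split file names into the two lists shown on the home page."""
    htm = sorted(n for n in names if n.lower().endswith(".htm"))
    other = sorted(n for n in names if not n.lower().endswith(".htm"))
    return htm, other
```

An "undeletable" flag, as requested for portfolio submissions, would simply be one more attribute stored alongside each name.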
Re: ICU's uconv vs Linux iconv and UTF-8
Dan> FYI I have reported this brain-dead mapping problem to Unicode
Dan> Consortium but never got an answer. Well, they are not public
Dan> society in a way they charge for the membership to say anything. One
Dan> of the reasons so many Japanese love to hate Unicode...

This kind of false information is why many Japanese continue to love to hate Unicode. If you were actually on the Unicode mailing list, you wouldn't be repeating garbage like this.

Sign up and send a message about the mapping tables. You will get an answer.

-
Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM 88003

"Orthodoxy, of whatever color, seems to demand a lifeless, imitative style."
-- Politics and the English Language, George Orwell
Re: ICU's uconv vs Linux iconv and UTF-8
On 2002.02.02, at 00:32, Jarkko Hietaniemi wrote:

>> So far as I see Linux iconv is ascii-preservative while ICU's is Unicode-strict. From Perl's point of view ASCII preservative should be default.
>
> Why?

I have already answered this in my previous mail (Subject: More on Unicode Mappings, Message-Id: [EMAIL PROTECTED]), but it is important, so let me repeat it.

For a good reason. The original Unicode mapping renders any Perl script (or C code) written in EUC, JIS, or Shift JIS unusable. In Japan, '\' has been mapped to the yen mark, because it happened to sit in the localizable area of ASCII. (I believe the localizable area of ASCII causes a lot of headaches for people such as the Danes, whose national variant exploits this feature to the fullest extent.) So source code in Japan comes with lots of yen marks instead of backslashes.

My very first implementation of Jcode used the Unicode table as is. This problem was pointed out less than an hour after release, so I fixed it.

Dan
Re: ICU's uconv vs Linux iconv and UTF-8
On 2002.02.01, at 23:57, Mark Leisher wrote:

> Dan> FYI I have reported this brain-dead mapping problem to Unicode
> Dan> Consortium but never got an answer. Well, they are not public
> Dan> society in a way they charge for the membership to say anything. One
> Dan> of the reasons so many Japanese love to hate Unicode...
>
> This kind of false information is why many Japanese continue to love to hate Unicode. If you were actually on the Unicode mailing list, you wouldn't be repeating garbage like this.
>
> Sign up and send a message about the mapping tables. You will get an answer.

I signed up to [EMAIL PROTECTED] long ago, and I thought I was still subscribed, since I am still getting invitations to conferences and such. But I checked with [EMAIL PROTECTED], and it subscribed my address again instead of returning an error message saying I was already subscribed. Hmm.

Anyway, I have resubscribed, so here I go.

Okay, here it is. Let me begin with the original message. Sorry for the repetition, folks on [EMAIL PROTECTED]

On 2002.02.01, at 19:24, Nick Ing-Simmons wrote:

> As part of the mystery of CJK encodings I notice that IBM's ICU's uconv and SuSE6.4 linux iconv differ as to the UTF-8 representation of table.euc
>
> Both converters will round-trip with themselves and give byte exact copy of table.euc
>
> Weirdly they differ in how they map '\' and '~' in ASCII space as well as some spots in higher characters.

Oh, yes. This is the problem of the original Unicode 2.x map: it is not ASCII-preserving. I posted this problem to perl-[EMAIL PROTECTED] when I first released Jcode. Several discussions later, I made Jcode preserve ASCII by default and added $Jcode::Unicode::PEDANTIC to change the behavior.

Here is the excerpt from Jcode::Unicode:

    VARIABLES
      $Jcode::Unicode::PEDANTIC
        When set to non-zero, x-to-unicode conversion becomes pedantic.
        That is, '\' (chr(0x5c)) is converted to zenkaku backslash and
        '~' (chr(0x7e)) to JIS-x0212 tilde.
        By Default, Jcode::Unicode leaves ascii ([0x00-0x7f]) as it is.

> Linux iconv will not take ICU's UTF-8. ICU's uconv will read the iconv output but does produce same as original table.euc.

So far as I see, Linux iconv is ASCII-preserving while ICU's is Unicode-strict. From Perl's point of view, ASCII-preserving should be the default.

FYI, I have reported this brain-dead mapping problem to the Unicode Consortium but never got an answer. Well, they are not a public society, in that they charge for membership to say anything. One of the reasons so many Japanese love to hate Unicode...

> Our current euc-jp.ucm is compatible with Linux iconv.

Right choice.

Dan the Man with So Many Charsets to Deal With

Now let me repeat the same question I asked long ago. Why does the Unicode - JISX2xxx map remain such that it does not preserve the ASCII part, despite the fact that most converters ignore the original map and leave the ASCII part as is?

One more question. Where have the contents of ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/ gone?

Dan Kogai
CEO, DAN co. ltd.
2-8-14-418 Shiomi Koto-ku Tokyo 135-0052 Japan
mailto: [EMAIL PROTECTED] / http://www.dan.co.jp/
Tel: +81 3-5665-6131 Fax: +81 3-5665-6132
GPG Key: http://www.dan.co.jp/~dankogai/dankogai.gpg.asc
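To make the difference between the two behaviours in the message above concrete, here is a minimal sketch restricted to the ASCII range. It is not the Jcode implementation: the function name `ascii_to_unicode`, the `pedantic` flag, and the two target code points are assumptions chosen for illustration (U+FF3C for the "zenkaku backslash"; U+FF5E is only a placeholder for the tilde target, which the excerpt does not spell out as a code point).

```python
# Assumed stand-ins for the "pedantic" targets; the real tables may differ.
PEDANTIC_TARGETS = {
    0x5C: "\uff3c",  # FULLWIDTH REVERSE SOLIDUS ("zenkaku backslash")
    0x7E: "\uff5e",  # FULLWIDTH TILDE, used here only as a placeholder
}


def ascii_to_unicode(data, pedantic=False):
    """Convert ASCII-range bytes to text, optionally in 'pedantic' mode."""
    out = []
    for b in data:
        if b > 0x7F:
            raise ValueError("this sketch only covers the ASCII range")
        if pedantic and b in PEDANTIC_TARGETS:
            out.append(PEDANTIC_TARGETS[b])   # strict table: not ASCII-preserving
        else:
            out.append(chr(b))                # default: leave 0x00-0x7F as is
    return "".join(out)


if __name__ == "__main__":
    for path in (b"C:\\Foo", b"~/foo"):
        print(ascii_to_unicode(path), ascii_to_unicode(path, pedantic=True))
```

With `pedantic=True`, both sample paths come back with a non-ASCII character in place of the separator or the tilde, which is the "mangling" the thread is concerned about.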
Re: ICU's uconv vs Linux iconv and UTF-8
On 2002.02.02, at 00:37, Nick Ing-Simmons wrote:

>> Oh, yes. This is the problem of the original Unicode 2.x map; It is not ASCII preservative. I have posted this problem to perl-[EMAIL PROTECTED] when I first released Jcode. Several discussions later, I made Jcode so that it preserves ASCII by default and added $Jcode::Unicode::PEDANTIC to change the behavior
>
> Ah. I take your point. If we used ICU's pedantic form Both UNIX ~/foo and MS C:\Foo get mangled.

EXACTLY!

> The other differences (having looked at diff in yudit) seems to be mapping ¢ (cent), £ (pound), ¬ (not) and one of the longer dashes to different width variants (full width for ICU).
>
> I am going off ICU ...

As I wrote to [EMAIL PROTECTED], yet another problem is that ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/ is now gone, so I don't have a practical way to check the mapping. I want the mapping back!

Dan
Re: ICU's uconv vs Linux iconv and UTF-8
Dan> As I addressed to [EMAIL PROTECTED], Yet another problems that
Dan> ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/ is now gone so I
Dan> don't have a practical way to check the mapping. I want the mapping
Dan> back!

*Sigh*

Readme.txt, which *is* in the Public/MAPPINGS/EASTASIA/ directory, states:

    The entire former contents of this directory are obsolete and have been moved to the OBSOLETE directory. The latest information may be found in the Unihan.txt file in the latest Unicode Character Database. August 1, 2001.

-
Mark Leisher
Re: ICU's uconv vs Linux iconv and UTF-8
Nick> ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE
Nick> ***HOWEVER** if you use the NON-INTUITIVE URL:
Nick> http://ftp.unicode.org/Public/MAPPINGS/
Nick> one gets redirected to
Nick> http://www.unicode.org/Public/MAPPINGS/
Nick> which is as you state.

Quite right. The change to web access happened a couple of years ago and I didn't pay attention to the URL, assuming it was web-based.

Nick> A URL to the location of the Unihan.txt file would be more helpful.

Indeed. It is easy to locate on http://www.unicode.org, but is directly available from ftp://www.unicode.org/Public/3.1-Update1/Unihan-3.1.1.txt.gz.

-
Mark Leisher
RE: ICU's uconv vs Linux iconv and UTF-8
Dan Kogai wrote:

> As I addressed to [EMAIL PROTECTED], Yet another problems that ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/ is now gone so I don't have a practical way to check the mapping. I want the mapping back!

The Unicode site is a bit labyrinthine sometimes.

The web version of the data seems more up to date than the ftp site. But don't bother going to http://www.unicode.org/Public/MAPPINGS/EASTASIA/, because it contains only a note which reads:

    The entire former contents of this directory are obsolete and have been moved to the OBSOLETE directory. The latest information may be found in the Unihan.txt file in the latest Unicode Character Database. August 1, 2001.

And don't bother downloading the 23 MB http://www.unicode.org/Public/UNIDATA/Unihan.txt file, because it contains mappings only for kanji.

So go directly to http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/, where you can find the old data, along with a note about mapping errors:

    [...]
    Below is some analysis by Asmus Freytag of specific problems raised by T. Kubota in this document: http://www.debian.or.jp/~kubota/unicode-symbols.html
    [...]
    The following are available as Full Width characters in the FFxx range. Therefore, the mappings of these characters are incorrect. This appears to be a *mapping file issue* as far as these characters are concerned

    FILE JIS0208.TXT--
    0x2140  U+005C  Na  # REVERSE SOLIDUS
    0x215D  U+2212  N   # MINUS SIGN
    0x2171  U+00A2  Na  # CENT SIGN
    0x2172  U+00A3  Na  # POUND SIGN
    0x224C  U+00AC  Na  # NOT SIGN
    [...]
    FILE JIS0212.TXT--
    0x2243  U+00A6  Na  # BROKEN BAR
    0x2234  U+00AF  Na  # MACRON
    0x2237  U+007E  Na  # TILDE
    [...]

I don't know if this helps solve your issues.

_ Marco
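For anyone who wants to check entries like the ones quoted in the note above mechanically, a small parser along these lines may help. It is a sketch that assumes the four-column layout shown in the note (JIS code, Unicode code point, width class, name); the names `parse_note_line` and `maps_into_ascii` are made up for this example, and the real mapping files may use a different column layout.

```python
import re

# One entry as quoted in the note: "0x2140 U+005C Na # REVERSE SOLIDUS"
ENTRY = re.compile(
    r"^(0x[0-9A-Fa-f]+)\s+U\+([0-9A-Fa-f]{4,6})\s+(\S+)\s+#\s*(.*)$"
)


def parse_note_line(line):
    """Return (jis_code, code_point, width_class, name) or None."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None  # headers, "[...]" markers, blank lines
    jis, ucs, width, name = m.groups()
    return int(jis, 16), int(ucs, 16), width, name


def maps_into_ascii(lines):
    """List the double-byte codes whose Unicode target lies in U+0000..U+007F."""
    hits = []
    for line in lines:
        entry = parse_note_line(line)
        if entry is not None and entry[1] <= 0x7F:
            hits.append(entry)
    return hits
```

Run over the entries quoted in the note, `maps_into_ascii` picks out the REVERSE SOLIDUS and TILDE rows, which are the two that collide with ASCII '\' and '~'.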
Re: ICU's uconv vs Linux iconv and UTF-8
> ICU's pedantic form

The goal for ICU is to be charset neutral, and support all of the conversions that are in modern use. There are a large number of variants of character sets; you can use the one you want. See:

http://oss.software.ibm.com/icu/charset/index.html

Mark

----- Original Message -----
From: "Dan Kogai" [EMAIL PROTECTED]
To: "Nick Ing-Simmons" [EMAIL PROTECTED]
Cc: "Nick Ing-Simmons" [EMAIL PROTECTED]; "SADAHIRO Tomoyuki" [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Friday, February 01, 2002 07:46
Subject: Re: ICU's uconv vs Linux iconv and UTF-8

On 2002.02.02, at 00:37, Nick Ing-Simmons wrote:

>> Oh, yes. This is the problem of the original Unicode 2.x map; It is not ASCII preservative. I have posted this problem to perl-[EMAIL PROTECTED] when I first released Jcode. Several discussions later, I made Jcode so that it preserves ASCII by default and added $Jcode::Unicode::PEDANTIC to change the behavior
>
> Ah. I take your point. If we used ICU's pedantic form Both UNIX ~/foo and MS C:\Foo get mangled.

EXACTLY!

> The other differences (having looked at diff in yudit) seems to be mapping ¢ (cent), £ (pound), ¬ (not) and one of the longer dashes to different width variants (full width for ICU).
>
> I am going off ICU ...

As I addressed to [EMAIL PROTECTED], Yet another problems that ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/ is now gone so I don't have a practical way to check the mapping. I want the mapping back!

Dan
Re: GB 18030 question
At 11:23 -0800 2002-02-01, Deborah Goldsmith wrote:

> There is an error on page 10 of the GB 18030-2000 standard, in that the character with code point A3FE maps to U+FFE3 (FULLWIDTH MACRON), but is shown with a glyph that corresponds to U+FF5E (FULLWIDTH TILDE). The position of the character in its code block would also seem to indicate that tilde was intended.
>
> Does anyone have any idea of which should be considered correct, the glyph or the Unicode mapping value?

Glyphs are informative in JTC1. I can only assume that the GB standards would follow suit.

--
Michael Everson *** Everson Typography *** http://www.evertype.com
Re: RE: ICU's uconv vs Linux iconv and UTF-8
Marco wrote...

> The web version of the data seems more up to date than the ftp site.

They are the same files, available through different protocols!

Rick
GB 18030 question
There is an error on page 10 of the GB 18030-2000 standard, in that the character with code point A3FE maps to U+FFE3 (FULLWIDTH MACRON), but is shown with a glyph that corresponds to U+FF5E (FULLWIDTH TILDE). The position of the character in its code block would also seem to indicate that tilde was intended. Does anyone have any idea of which should be considered correct, the glyph or the Unicode mapping value? Deborah Goldsmith Manager, Fonts Language Kits Apple Computer, Inc. [EMAIL PROTECTED]
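One quick way to see what a given implementation does with the code point in question is to decode the two bytes and print the resulting character name. The sketch below uses the `gb18030` codec and the `unicodedata` module from the Python standard library; the helper name `describe` is made up here, and the output only reflects the table shipped with that codec, so it says nothing about which reading of the standard is the intended one.

```python
import unicodedata


def describe(raw, codec="gb18030"):
    """Decode bytes and return (code point, character name) pairs."""
    text = raw.decode(codec)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    # 0xA3 0xFE is the code point discussed in the message above.
    print(describe(b"\xa3\xfe"))
```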
RE: ICU's uconv vs Linux iconv and UTF-8
> As part of the mystery of CJK encodings I notice that IBM's ICU's uconv and SuSE6.4 linux iconv differ as to the UTF-8 representation of table.euc
>
> Both converters will round-trip with themselves and give byte exact copy of table.euc
>
> Weirdly they differ in how they map '\' and '~' in ASCII space as well as some spots in higher characters.

That is understandable if they use different tables. The question is which one is the right EUC-JP, and which one do users want? ICU, as well as iconv, could have two tables with the different mappings. The question then is how to label them, and whether the labeling should be compatible between the two.

> Linux iconv will not take ICU's UTF-8. ICU's uconv will read the iconv output but does produce same as original table.euc.

I find the same statement confusing. Are you saying that uconv's UTF-8 is ill-formed?

Nick, would you mind emailing me (and just me, not the list) your table.euc sample file?

Thanks,
YA
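The two questions raised in the message above (is one converter's output ill-formed UTF-8, and where do two outputs diverge) can be checked independently of either converter. The following is a minimal sketch using only Python's built-in strict UTF-8 decoder; the function names are made up for this example, and the byte strings passed in would be the converter outputs read from disk.

```python
def is_well_formed_utf8(data):
    """True if the bytes decode under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def first_difference(a, b):
    """Byte offset of the first difference between two outputs, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One output may simply be a prefix of the other.
    return None if len(a) == len(b) else min(len(a), len(b))
```

If both outputs pass `is_well_formed_utf8` yet `first_difference` reports an offset, the converters are producing valid UTF-8 from different mapping tables, which is the situation the thread describes.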
Re: ICU's uconv vs Linux iconv and UTF-8
I'll answer this one.

On 2002.02.02, at 03:28, Yves Arrouye wrote:

> That is understandable if they use different tables. The question is which one is the right EUC-JP, and which one do users want? ICU, as well as iconv, could have two tables with the different mappings. The question then is how to label them, and whether the labeling should be compatible between the two.

I don't know which one is 'right'. But the most practical and widely used form of euc-jp is as follows:

    \x00 - \x7f        Maps to US-ASCII
    \xa1a1 - \xfefe    Maps to JISX-0208 (aka Zenkaku)
    \x8ea1 - \x8edf    Maps to JISX-0201 (aka Hankaku)

In addition, the extended form of euc-jp also includes:

    \x8fa1a1 - \x8ffefe    Maps to JISX-0212

That's what iconv, Tcl's *.enc, and my humble Jcode think euc-jp is.

> I find the same statement confusing. Are you saying that uconv's UTF-8 is ill-formed?
>
> Nick, Would you mind email me (and just me, not the list) your table.euc sample file?

Go get Jcode.pm via http://search.cpan.org/search?dist=Jcode and check under the t/ directory. You can find table.euc and x0212.euc.

Dan
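The byte ranges listed in the message above translate fairly directly into a scanner that splits a euc-jp byte string by character set. This is a sketch of that layout only, with no mapping to Unicode; the function name `classify_euc_jp` is invented for the example, and the check that every trail byte of a multi-byte sequence falls in 0xA1-0xFE is an assumption read off the quoted ranges.

```python
def classify_euc_jp(data):
    """Split euc-jp bytes into (charset, raw bytes) pairs."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:
            # \x00 - \x7f: US-ASCII, one byte
            out.append(("US-ASCII", data[i:i + 1]))
            i += 1
        elif b == 0x8E and i + 1 < n and 0xA1 <= data[i + 1] <= 0xDF:
            # \x8ea1 - \x8edf: JISX-0201 (Hankaku), two bytes
            out.append(("JISX-0201", data[i:i + 2]))
            i += 2
        elif (b == 0x8F and i + 2 < n
              and all(0xA1 <= c <= 0xFE for c in data[i + 1:i + 3])):
            # \x8fa1a1 - \x8ffefe: JISX-0212, three bytes
            out.append(("JISX-0212", data[i:i + 3]))
            i += 3
        elif 0xA1 <= b <= 0xFE and i + 1 < n and 0xA1 <= data[i + 1] <= 0xFE:
            # \xa1a1 - \xfefe: JISX-0208 (Zenkaku), two bytes
            out.append(("JISX-0208", data[i:i + 2]))
            i += 2
        else:
            raise ValueError(f"unexpected byte 0x{b:02x} at offset {i}")
    return out
```

Note that an ASCII backslash (0x5C) and the double-byte sequence 0xA1 0xC0 (JIS 0x2140 with the high bits set) fall into different branches here, so an ASCII-preserving converter can keep them distinct.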
Re: ICU's uconv vs Linux iconv and UTF-8
It is definitely a problem to try to interpret what any given label is supposed to be. The problem is that MIME labels and others are ambiguous, and are interpreted in different ways on different systems. MIME/IANA is the best registry we have, but there are a number of significant problems:

- because for most mappings there is no published mapping in the registry to and from Unicode/10646, it is not clear, and certainly not easy, to figure out exactly what the unambiguous decoding is.
- in practice, the industry does NOT interpret the same bytes the same way; for example, you will get different decodings from SJIS on different platforms.

One of the current projects under development for an upcoming release of ICU is to have a more precise API, where you can pass in a label AND a platform (AND version), and get what the platform interprets that label to mean. That way you can ask for EUC-JP as interpreted on, say, Solaris.

Mark
—
Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com

----- Original Message -----
From: Nick Ing-Simmons [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; SADAHIRO Tomoyuki [EMAIL PROTECTED]
Sent: Friday, February 01, 2002 10:21
Subject: Re: ICU's uconv vs Linux iconv and UTF-8

Mark Davis [EMAIL PROTECTED] writes:

>> ICU's pedantic form
>
> The goal for ICU is to be charset neutral, and support all of the conversions that are in modern use. There are a large number of variants of character sets;

Fair enough - but as shipped (I downloaded it earlier this week) it comes with a convrtrs.txt which maps MIME's EUC-JP onto something it calls ibm-33722, which has the behaviour I reported at the start of this thread.

> you can use the one you want.

It is not a question of which _I_ want - it is a question of which one(s) CJK perl users want/expect/need.

In so far as _I_ want any particular one, it is the one which is going to match the X11 font encoding, so I can in my naive westerner's way see what it looks like - and I have not a clue which one that is ...

> See: http://oss.software.ibm.com/icu/charset/index.html

A huge list, and I don't see how to grep it for the provenance of the table (not that many seem to have any).

So can the experts - ideally native reading experts, not theorists - tell me which ICU (or other open source) table(s) they want/expect/need, or failing that, which ones have proven troublesome.

There seem to be at least 4 EUC-JP mappings in that list: AIX, Solaris, glibc and Java.

If we cannot get any answers quickly then I think Dan is correct - we should un-bundle the whole CJK encoding stuff from the core into a family of CPAN modules.

Which gives me a design choice:

A. Bundle a pragmatic set of CJK encodings which are fast and cause the least build pain for non-CJK users (i.e. compact precompiled form)
B. Make it as easy as possible for the end user to drop in a new encoding from (say) a .ucm file.

I can obviously try for both - but they seem to be pulling in opposite directions at present.

Meanwhile I will go fix the bugs in the core's :encoding logic ...

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/
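The "label AND platform" lookup described at the top of the message above could take a shape like the following. This is purely a hypothetical sketch and not ICU's API: the `resolve` function and the `example-...` table names are invented placeholders, and only the EUC-JP to ibm-33722 default is taken from the thread (Nick's report about the shipped convrtrs.txt).

```python
# Hypothetical registry keyed on (label, platform). Only the default
# ("euc-jp", None) -> "ibm-33722" comes from the thread; every other
# table name is an invented placeholder.
TABLES = {
    ("euc-jp", None): "ibm-33722",
    ("euc-jp", "aix"): "example-euc-jp-aix",
    ("euc-jp", "solaris"): "example-euc-jp-solaris",
    ("euc-jp", "glibc"): "example-euc-jp-glibc",
    ("euc-jp", "java"): "example-euc-jp-java",
}


def resolve(label, platform=None):
    """Pick a table for a charset label as interpreted on a platform.

    Falls back to the platform-neutral default when the platform is
    unknown, and raises LookupError when the label itself is unknown.
    """
    key = label.strip().lower()
    plat = platform.strip().lower() if platform else None
    for candidate in ((key, plat), (key, None)):
        if candidate in TABLES:
            return TABLES[candidate]
    raise LookupError(f"no table registered for label {label!r}")
```

A design like this keeps the ambiguous MIME label as the user-facing name while making the choice of variant explicit, which bears on both of Nick's options A and B: a bundled default plus a way to register further tables.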
RE: ICU's uconv vs Linux iconv and UTF-8
> It is definitely a problem to try to interpret what any given label is supposed to be. The problem is that MIME labels and others are ambiguous, and are interpreted different ways on different systems.

Still, in the meantime it does make sense to have EUC-JP associated with the most common interpretation of it, doesn't it? Just for the sake of user satisfaction?

I am curious: is there a better name for the EUC-JP that ICU is using, one that would make everybody understand which one it is? If so, we could have EUC-JP for the one that the rest of the world wants.

YA