Re: entities with breve
At 10:04 -0500 2002-09-23, [EMAIL PROTECTED] wrote: On 09/23/2002 07:50:04 AM PRANI6 wrote: I wonder how to encode two entities with one breve or inverted breve below, for example k+s, or p+f. Are there characters for half breves left and right or something like that? The answer would be to encode characters comparable to U+0361. A combining double breve has already been approved for version 4.0. I intend to propose (unless someone gets around to it before me) a combining double inverted breve below. Double INVERTED breve below? -- Michael Everson * * Everson Typography * * http://www.evertype.com 48B Gleann na Carraige; Cill Fhionntain; Baile Átha Cliath 13; Éire Telephone +353 86 807 9169 * * Fax +353 1 832 2189 (by arrangement)
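The mechanism under discussion is the one U+0361 already uses: a double diacritic is typed between its two base characters and spans both of them. (For what it's worth, current Unicode does also encode U+035D COMBINING DOUBLE BREVE and U+035C COMBINING DOUBLE BREVE BELOW.) A small Python illustration, offered here as an aside rather than something from the thread:

```python
import unicodedata

# U+0361 COMBINING DOUBLE INVERTED BREVE sits between the two base
# letters and spans both, so "k" + U+0361 + "s" renders as a ligated
# k͡s in a capable font -- no "half breve" characters are needed.
ks = "k\u0361s"

# Canonical combining class 234 (Double_Above) is what tells a
# renderer the mark stretches over two bases; class 233 is the
# corresponding Double_Below value.
print(unicodedata.name("\u0361"))       # COMBINING DOUBLE INVERTED BREVE
print(unicodedata.combining("\u0361"))  # 234
```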
Keys. (derives from Re: Sequences of combining characters.)
The recent discussion on sequences has led me to have a look through the various combining characters and I have found the following. U+20E3 COMBINING ENCLOSING KEYCAP It has occurred to me that the use of a sequence of a base character, then one or more combining characters so as to produce a sequence which would be otherwise unlikely, followed by U+20E3 might be a very effective way to include specialised markup systems within a plain text file without disrupting the normal textual information conveying capabilities of a file. An all-Unicode font would then produce a graphic representation of the key, without any prior arrangement being necessary, so that such marked-up sequences could be produced using just a regular all-Unicode plain text editor. A receiving program with a specialized plug-in could then decode the markup, or it could be decoded manually in some cases. For example, I am looking at using the following sequence so as to produce a special purpose key within documents. U+2604 U+0302 U+20E3 Hopefully that sequence will be so unlikely to occur other than in my specialised application that the sequence can be used uniquely for that specialised application. I am also thinking in terms of using the following sequence to indicate the end of the markup sequence. U+2604 U+0302 U+20E2 I have it in mind that characters in the range U+2460 through to U+2473 could be used before parameters within the markup system. Also, I have noticed that in the document U02D0.pdf that U+20E4 is shown, in the listing, in magenta whereas U+20DF is shown in black. Could someone say what significance the magenta colouring in the document has please? Is it perhaps to indicate additions since the previous issue of the document? William Overington 25 September 2002
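The decoding step the author imagines for "a receiving program with a specialized plug-in" could be sketched as follows. This is an illustration of Mr. Overington's private scheme only, not any Unicode convention; the marker sequences are exactly the ones he proposes above.

```python
# Opening marker: U+2604 U+0302 U+20E3; closing marker: U+2604 U+0302 U+20E2.
START = "\u2604\u0302\u20e3"
END = "\u2604\u0302\u20e2"

def extract_markup(text):
    """Return the payloads found between start and end marker sequences."""
    payloads = []
    pos = 0
    while (begin := text.find(START, pos)) != -1:
        stop = text.find(END, begin + len(START))
        if stop == -1:
            break  # unterminated marker; ignore the rest
        payloads.append(text[begin + len(START):stop])
        pos = stop + len(END)
    return payloads

# U+2460 (circled digit one) used as a parameter marker, per the proposal.
sample = "plain text " + START + "\u2460param" + END + " more text"
print(extract_markup(sample))  # ['\u2460param']
```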
Re: entities with breve
Peter Constable wrote as follows. The answer would be to encode characters comparable to U+0361. A combining double breve has already been approved for version 4.0. I intend to propose (unless someone gets around to it before me) a combining double inverted breve below. In the meantime, one can encode these as PUA characters (which is an interim solution we're going to be using, at least for some purposes). Could you please say some more about what is going to be encoded in regular Unicode and with which code points? In relation to your encoding these characters as Private Use Area characters, I wonder if you could please say some more about this, both in relation to which code points you are intending to use and also as to whether encoding a combining accent character or a combining double diacritic into the Private Use Area could potentially lead to any problems over a rendering system recognizing the character as being a combining character (please know that I have no specific reason to think that it would, it is just a possibility about which I wondered when considering various uses of the Private Use Area). William Overington 25 September 2002
Re: entities with breve
On 09/25/2002 01:54:25 AM Michael Everson wrote: The answer would be to encode characters comparable to U+0361. A combining double breve has already been approved for version 4.0. I intend to propose (unless someone gets around to it before me) a combining double inverted breve below. Double INVERTED breve below? My mistake: COMBINING DOUBLE BREVE BELOW. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
Re: entities with breve
On 09/25/2002 04:04:40 AM William Overington wrote: Could you please say some more about what is going to be encoded in regular Unicode and with which code points please? You can look at http://www.unicode.org/unicode/alloc/Pipeline.html to see what's in the pipeline, but note that code points are not yet definite. There will be a beta period, beginning in January I believe. In relation to your encoding these characters as Private Use Area characters, I wonder if you could please say some more about this please, both in relation to which code points you are intending to use Our choice of code points is relevant only for our users and those with whom we might interchange data. Once we have implementations, such info will be available either with those implementations or on our Web site for the benefit of those using those implementations. I don't see any reason to discuss this on this list. and also as to whether encoding a combining accent character or a combining double into the Private Use Area could lead potentially to any problems over a rendering system recognizing the character as being a combining character (please know that I have no specific reason to think that it would, it is just a possibility about which I wondered when considering various uses of the Private Use Area). I can't comment about rendering systems in general: some may have issues that others do not have. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
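One concrete issue behind the question is worth spelling out: the standard property databases give every PUA code point General_Category Co and canonical combining class 0, so any process that consults standard properties (a normalizer, or a shaping engine without font-specific knowledge) will treat a PUA "combining mark" as a base character unless the font and renderer agree privately to do otherwise. A quick check with Python's stdlib, offered as an illustration rather than anything from the thread:

```python
import unicodedata

# U+0361 is a real combining mark: category Mn, nonzero combining class,
# so normalizers and renderers know to reorder and attach it.
print(unicodedata.category("\u0361"), unicodedata.combining("\u0361"))  # Mn 234

# A PUA code point carries no such properties: category Co, class 0.
# Property-driven software sees an ordinary base character here.
print(unicodedata.category("\ue000"), unicodedata.combining("\ue000"))  # Co 0
```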
Re: Keys. (derives from Re: Sequences of combining characters.)
William Overington WOverington at ngo dot globalnet dot co dot uk wrote: Also, I have noticed that in the document U02D0.pdf (actually U20D0.pdf) that U+20E4 is shown, in the listing, in magenta whereas U+20DF is shown in black. Could someone say what significance the magenta colouring in the document has please? Is it perhaps to indicate additions since the previous issue of the document? Since the previous release of Unicode. The magenta characters are those added in Unicode 3.2. They were marked specially in the draft copies of the code charts to indicate the changes (and probably to highlight the fact that the assignments were still tentative), and left that way after 3.2 went live. Whether this was intentional or not, I don't know. -Doug Ewell Fullerton, California
RE: Keys. (derives from Re: Sequences of combining characters.)
William Overington wrote: The recent discussion on sequences has led me to have a look through the various combining characters and I have found the following. U+20E3 COMBINING ENCLOSING KEYCAP It has occurred to me that the use of a sequence of a base character, then one or more combining characters so as to produce a sequence which would be otherwise unlikely, followed by U+20E3 might be a very effective way to include specialised markup systems within a plain text file [...] What the hell do key caps have to do with mark up or text files!!?? Mr. Overington, why do you have this irresistible compulsion to mix up apples and horses? (I feel that the usual apples and oranges is not enough to convey the idea fully.) Regards. _ Marco
Re: no replies
Roslyn, I will head off trouble for you, because your message is likely to be otherwise ignored or semi-flamed. The best place to get information on compiling and configuring php is on a php support or developer list. There must be information on how to subscribe to such lists on the php home page, which I am guessing is php.org. Another great source to find answers, which I use at least 10 times a day with a 90%+ success rate, is to search on related keywords on google.com and groups.google.com. OTTOMH, in your case I would try searching php enable-mbstring in those places and see what you find. This list is for questions related to Unicode. That is probably why no one has replied previously. Few if any people here are php developers, and even fewer are going to be versed in the details of configuring and compiling php. Hope this helps! Barry Caplan www.i18n.com At 04:35 AM 9/24/2002 -0700, you wrote: aaah finally, one reply to that question!! thankyou BOB. anyways, could anyone tell me how i can recompile php to include mbstring support. i used the ./configure enable-mbstring option, did the make install..etc etc, but i still can't seem to run any of the mbstring functions in my php code, i get fatal error: call to undefined function mb_(whatever)...could anyone pls assist me here. thanks regards, roslyn
Small 's' with grave?
Wednesday, September 25, 2002 A friend of a friend asked me if Unicode has a code for small s with a grave. I can't find one; am I overlooking it? Has it been added since 3.0? Thanks in advance. Regards, Jim Agenbroad ( [EMAIL PROTECTED] ) It is not true that people stop pursuing their dreams because they grow old; they grow old because they stop pursuing their dreams. Adapted from a letter by Gabriel Garcia Marquez. The above are purely personal opinions, not necessarily the official views of any government or any agency of any. Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A. Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.
Old Hungarian?
You can look at http://www.unicode.org/unicode/alloc/Pipeline.html to see what's in the pipeline, but note that code points are not yet definite. There will be a beta period, beginning in January I believe. Whatever happened to Old Hungarian, aka Hungarian Runic, aka rovasiras? (sorry for missing diacritics) I can see a proposal by Mr Everson back in 1998 (http://wwwold.dkuug.dk/JTC1/SC2/WG2/docs/n1686/n1686.htm, http://www.dkuug.dk/jtc1/sc2/wg2/docs/n1758.pdf) but I cannot see it in the above pipeline.
glyph selection for Unicode in browsers
Declaring lang for text should help a browser display the text more appropriately for the specified language (e.g. <span lang="es">Hola</span>). It seems especially appropriate for Unicode text, since an Asian character may have very different display requirements in different languages (CJKT), and the Han unification brought many of these glyph variants together. However, I am finding that browsers are not supporting this in a way that is useful for Unicode. What has been working so far is that the browsers can associate different fonts with different languages. So I might use a Japanese font such as Mincho for Japanese text and another font for Chinese text. However, now that there are Unicode fonts, if I assign a Unicode font such as Arial Unicode MS, or CODE2000, to all languages, then I see the same glyph for a character, regardless of the lang assignment. I would like to understand why this is. (Bear in mind, I don't know much more than the rudiments of font technology.) a) Do Unicode fonts include the language-based glyph variants of characters, so that a display system is capable of identifying or hinting which glyph should be used in a particular scenario? b) If the above is possible, then I assume the browsers have not implemented language-based selection yet. Are any browsers moving to using the appropriate glyphs based on language without depending on each language being assigned a different font? c) If the above is not possible, then configuring browsers for Unicode usage is greatly complicated by the need to have a lengthy list of fonts assigned to different languages. Is there an alternative approach that can be used, so users can easily view Unicode text and get the correct display while using a single Unicode font? Ideally (to my mind) I should be able to create web pages in Unicode, with appropriate lang declarations, and get reasonable displays on systems where a user does not do much more than have available a Unicode font.
However, this does not seem to be the case at the moment. If it will help I can post some test pages I have been using where I take a string of characters and repeat them with different lang assignments. The text looks the same unless I choose to assign different fonts to each language in the browser preferences. The examples are trivial so I haven't bothered to post them. I would be glad to learn if there is another approach which is easy for users to configure, that gives appropriate text rendering. -- - Tex Texin cell: +1 781 789 1898 mailto:[EMAIL PROTECTED] Xen Master http://www.i18nGuy.com XenCrafthttp://www.XenCraft.com Making e-Business Work Around the World -
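The kind of test page described above might look something like the following. This is a reconstruction for illustration, not Tex's actual file; the characters chosen (such as U+9AA8, which has well-known regional glyph differences) are this editor's assumption:

```html
<!-- Same Han string repeated under different lang tags. In a
     lang-aware browser each line could select language-appropriate
     glyphs; in practice they render identically unless per-language
     fonts are assigned in the browser preferences. -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<p lang="ja">骨直入</p>
<p lang="zh-Hans">骨直入</p>
<p lang="zh-Hant">骨直入</p>
<p lang="ko">骨直入</p>
```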
RE: glyph selection for Unicode in browsers
I would be happy if just this <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> would be enough to convince the browsers that the page is in UTF-8... It isn't if the HTTP server claims that the pages it serves are in ISO 8859-1. A sample of this is http://www.iki.fi/jhi/jp_utf8.html; it does have the meta charset, but since the webserver (www.hut.fi, really, a server outside of my control) thinks it's serving Latin 1, I cannot help the wrong result. (I guess some browsers might do better work at sniffing the content of the page, but at least IE6 and Opera 6.05 on Win32 seem to believe the server rather than the (HTML of the) page.)
Re: Small 's' with grave?
A friend of a friend asked me if Unicode has a code for small s with a grave. U+0073 U+0300 Has it been added since 3.0? Thanks in advance. Afaik, there are no precomposed characters added since Unicode 3.0, and there will not be any new ones.
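The combining-sequence answer can be checked with any normalization implementation: because no precomposed s-with-grave exists, NFC leaves the two-character sequence as is. A quick Python sketch:

```python
import unicodedata

# LATIN SMALL LETTER S followed by COMBINING GRAVE ACCENT.
s_grave = "s\u0300"

# No precomposed code point exists, so NFC leaves the sequence alone...
print(unicodedata.normalize("NFC", s_grave) == s_grave)  # True

# ...unlike e + combining acute, which composes to U+00E9.
print(unicodedata.normalize("NFC", "e\u0301"))  # é
```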
RE: glyph selection for Unicode in browsers
You would be happy, but others might not -- the standard specifically says that the http charset takes precedence. http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2 Yup. I guess I could argue both ways. The server admins want control; the users want control; the latter lose :-) However, what you say about user control of web server facilities being up to the administrator and not the page's author is true. Some of the servers allow users some control through directory-based files. I can send you a sample .htaccess file privately, if it will be of use to you. Please.
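For Apache, the sort of .htaccess fragment being offered would look something like this (a minimal sketch; whether these directives are honored depends on the server's AllowOverride settings):

```apache
# Declare UTF-8 as the charset for everything served from this
# directory, overriding a server-wide Latin-1 default.
AddDefaultCharset utf-8

# Or scope the charset to particular extensions only:
AddCharset utf-8 .html .txt
```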
Re: glyph selection for Unicode in browsers
Done. I almost forgot, I have a web page that also describes how to use .htaccess with Apache. See tip #1 in: http://www.i18nguy.com/markup/serving.html tex [EMAIL PROTECTED] wrote: You would be happy, but others might not- the standard specifically says that the http charset takes precedence. http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2 Yup. I guess I could argue both ways. The server admins want control; the users want control, the latter lose :-) However, what you say about user control of web server facilities being up to the administrator and not the page's author is true. Some of the servers allow users some control through directory-based files. I can send you a sample .htaccess file privately, if it will be of use to you. Please. -- - Tex Texin cell: +1 781 789 1898 mailto:[EMAIL PROTECTED] Xen Master http://www.i18nGuy.com XenCrafthttp://www.XenCraft.com Making e-Business Work Around the World -
Re: glyph selection for Unicode in browsers
Tex Texin wrote, ... However, I am finding that browsers are not supporting this in a way that is useful for Unicode. What has been working so far is that the browsers can associate different fonts with different languages. So I might use a Japanese font such as Mincho for Japanese text and another font for Chinese text. However, now that there are Unicode fonts, if I assign a Unicode font such as Arial Unicode MS, or CODE2000, to all languages, then I see the same glyph for a character, regardless of the lang assignment. I would like to understand why this is. (Bear in mind, I don't know much more than the rudiments of font technology.) a) Do Unicode fonts include the language-based glyph variants of characters, so that a display system is capable of identifying or hinting which glyph should be used in a particular scenario? ... OpenType allows for substitution of language-specific glyphs and many script and language tags are already registered. However, the last time I checked (quite recently), the Uniscribe engine only implements one language tag per script. OpenType is still nascent and tremendous strides have been made within the past few years. Once implementations do allow for multiple language based substitutions under a single script tag, there should be much improvement in browser display. (As long as the fonts get updated, too!) Meanwhile, the workable approach seems to remain assigning specific fonts in the style declaration. Best regards, James Kass.
Re: glyph selection for Unicode in browsers
Thanks James. Which registry are you referring to for script and language tags? Is this in the context of glyphs or do you just mean the IANA language tag registry? Given the (un)workable approach, do you then intend to have variants of code2000 for CJKT, so one can make the appropriate assignments? (ugh!) Also, this approach means I have to ask each Unicode font vendor, Which language is your multilingual font designed for? so I know which CJKT assignment is appropriate for that font... (I hope this doesn't read like I am attacking you, I am not. I am just trying to highlight the difficulty I am having with this.) tex [EMAIL PROTECTED] wrote: Tex Texin wrote, a) Do Unicode fonts include the language-based glyph variants of characters, so that a display system is capable of identifying or hinting which glyph should be used in a particular scenario? ... OpenType allows for substitution of language-specific glyphs and many script and language tags are already registered. However, the last time I checked (quite recently), the Uniscribe engine only implements one language tag per script. OpenType is still nascent and tremendous strides have been made within the past few years. Once implementations do allow for multiple language based substitutions under a single script tag, there should be much improvement in browser display. (As long as the fonts get updated, too!) Meanwhile, the workable approach seems to remain assigning specific fonts in the style declaration. Best regards, James Kass. -- - Tex Texin cell: +1 781 789 1898 mailto:[EMAIL PROTECTED] Xen Master http://www.i18nGuy.com XenCrafthttp://www.XenCraft.com Making e-Business Work Around the World -
RE: glyph selection for Unicode in browsers
*sigh* Time for me to call it a day and go home, it seems. Opera 6.05/Win32 does *not* get it right if you have it on View - Encoding - Automatic detection. The reason I was fooled in the message below is that the Encoding setting seems to stick even if I exit and restart Opera; that's why my test page seemed to be working. If I turn it back to autodetect, it doesn't autodetect the UTF-8-ness. (If nothing else this bumbling saga of mine illustrates how difficult it still is to get all this just to work.) -Original Message- From: Hietaniemi Jarkko (NRC/Boston) Sent: 25 September, 2002 04:56 PM To: Hietaniemi Jarkko (NRC/Boston); 'ext Tex Texin'; 'WWW International'; 'Unicoders' Subject: RE: glyph selection for Unicode in browsers I cannot help the wrong result. (I guess some browsers might do better work at sniffing the content of the page, but at least IE6 and Opera 6.05 on Win32 seem to believe the server rather than the (HTML of the) page.) After some experimentation it seems that I blamed Opera 6.05/Win32 wrongly; it guesses the charset right. But as pointed out by Tex, HTTP/HTML charset ponderings are probably not a Unicode issue as such, they are more a WWW issue, sorry about the slight off-topicalness.
Re: glyph selection for Unicode in browsers
On 09/25/2002 01:51:28 PM Tex Texin wrote: a) Do Unicode fonts include the language-based glyph variants of characters, so that a display system is capable of identifying or hinting which glyph should be used in a particular scenario? They *can*, and some do. When this is the case, then there needs to be some mechanism to modify the relationship between sequences of characters and sequences of glyphs to arrive at the particular glyphs intended for the given language. In general terms, the same kinds of mechanisms that can be used for rendering complex scripts can also be used here -- it's a glyph substitution, comparable to substituting an initial or final form of an Arabic character. Of course, there is a different triggering condition involved in these situations than in the case of a complex script such as Arabic: in the complex-script situation, the triggers are the character context (e.g. preceded by non-word-forming character and followed by word-forming character), whereas here the trigger is a metadata tag. Let's consider how this would be dealt with in terms of implementation, using OpenType as an example. The OpenType font format provides means for storing different glyph-transformation rules according to language. (1) The question is, then, what does it take for the rendering process to make use of one set of language-specific rules rather than another, or rather than a set of default rules (OT allows the font developer to specify a default). In OpenType, glyph-transformation rules are grouped by features, and a set of rules will be applied when the associated feature has been activated. (Thus, in OT text layout, what's processed is a feature-marked-up string of characters.) This applies to the language distinctions as well: the desired language must be specified in the input, otherwise the default rules will apply. (2) The idea is that application software must determine what features are activated at what point.
Now, hardly any software gets written to interact directly with the OpenType layout engine. Instead, higher-level text layout libraries have been written that wrap the OpenType functionality. Uniscribe is one example; indeed, in Win32 on Windows 2000 and later, there is even another layer, since the standard text-drawing functions (TextOut and ExtTextOut) wrap Uniscribe's functionality. Other examples of libraries that wrap up the OT interface and expose a higher-level interface include Adobe's CoolType engine (not a published interface, that I know of), ICU, Pango and Sun's recent Standard Type Services Framework project. So, at the OT interface, a language tag (3) has to be specified in order to get language-specific glyphs. But apps generally don't write to that interface (for good reason); they usually write to a higher interface. The crux of the issue is that none of the higher-level interfaces, that I know of, yet provide any mechanism for the app to specify a language tag. (4) Hence, the building blocks are there, but more infrastructure is still needed. Note that there's a bit more involved than simply re-writing higher-level APIs to expose a way to specify OT features. In particular, a critical issue has to do with the relationship between OpenType's language tags, and whatever system of language or locale tagging might be used elsewhere in a given platform. I've described the situation in terms of OpenType. Neither AAT nor Graphite provides exactly the same kind of mechanism for providing different glyph transformations for different languages, though I believe some consideration has been given to possibilities for both technologies. Both use feature mechanisms, so can certainly do what you're looking for; but neither has specifically defined features related to languages, let alone decided how these should be handled in terms of APIs.
It would be possible to implement an AAT or Graphite font that used a feature to get at language-specific glyphs, and apps that exposed a user-interface for setting AAT or Graphite features (5) would offer the user a way to control this. But there would not be any automation whereby an app would specify this based on other language or locale tagging. Notes: (1) I put language in quotation marks since it has not really been adequately worked out what these distinctions are; I think these are probably groups of writing systems. (2) OpenType glyph-transformation rules are organised hierarchically, first by script, then by language, and then according to the other features they are associated with. (3) OpenType's language tags have no specified relationship with ISO 639, RFC 3066 or any other system of language tags. (4) The same issue applies to OpenType features that pertain to optional aspects of typography and rendering that are up to the user's discretion rather than being obligatory behaviour for a script. For instance, there is an OpenType feature for selecting small cap forms, which a font developer can use to provide
RE: glyph selection for Unicode in browsers
I cannot help the wrong result. (I guess some browsers might do better work at sniffing the content of the page, but at least IE6 and Opera 6.05 on Win32 seem to believe the server rather than the (HTML of the) page. After some experimentation it seems that I blamed Opera 6.05/Win32 wrongly, it guesses the charset right. But as pointed out by Tex, HTTP/HTML charset ponderings are probably not Unicode issue as such, they are more a WWW issue, sorry about the slight off-topicalness.
Re: glyph selection for Unicode in browsers
On 09/25/2002 03:34:00 PM Tex Texin wrote: Thanks James. Which registry are you referring to for script and language tags? Is this in the context of glyphs or do you just mean the IANA language tag registry? The OpenType script and language tags are specific to OpenType. As I mentioned in my previous message, one of the problems yet to be solved is how to associate OT language tags with the kind of things used for metadata, e.g. RFC 3066 (and also determining whether resolving those associations is the responsibility of the app, of a higher-level layout engine, or of the OpenType layout engine), and it hasn't even been worked out yet (IMO) just what the OT language tags are. Given the (un)workable approach, do you then intend to have variants of code2000 for CJKT, so one can make the appropriate assignments? (ugh!) Also, this approach means I have to ask each Unicode font vendor, Which language is your multilingual font designed for? so I know which CJKT assignment is appropriate for that font... Unfortunately, that's where we're stuck for the time being. I wish it were otherwise, since we're in the process of coming up with new Latin / Cyrillic fonts for our users throughout the world, and there are various Latin characters for which different glyphs are preferred in different language communities. And the variations for one character don't necessarily correlate with those for another, so you get lots of possible combinations needed -- which would make it a pain to come up with a bunch of language-specific fonts. For now, we're going to give them the ability to select alternate glyphs via Graphite features,* but they'll only be able to use that in Graphite-enabled apps -- it won't work in Word! 
*Since our software tools are intended for use by linguists working in hundreds of languages / writing systems for which there is no support in commercial software platforms, we have for a long time provided mechanisms to specify writing-system-specific behaviours, such as sorting or character properties determining basic things like word-boundary detection and line breaking. In our new tools that support Graphite, there's an ability for the linguist setting up a system for their writing system to specify what features should be active by default for their writing system. This gives us an interim mechanism to handle language-specific typography requirements. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
Re: glyph selection for Unicode in browsers
Tex Texin wrote, Which registry are you referring to for script and language tags? Is this in the context of glyphs or do you just mean the IANA language tag registry? As Peter Constable already noted, in this case registered only means registered as an OpenType tag. More info about this can be found on Adobe's page: http://partners.adobe.com/asn/developer/opentype/appendices/ttoreg.html Given the (un)workable approach, do you then intend to have variants of code2000 for CJKT, so one can make the appropriate assignments? (ugh!) Code2000's coverage of CJKTV ideographs isn't adequate to support any language yet. Eventually and hopefully the repertoire will be completed. Given the current ceiling of 65536 max glyphs per font, it might not be feasible to try to have one font cover all scripts and variants, but time will tell. Also, this approach means I have to ask each Unicode font vendor, Which language is your multilingual font designed for? so I know which CJKT assignment is appropriate for that font... Sad but true. On a happier note, most Japanese users will already have a Japanese font set as default, Chinese users will have a Chinese (Simp or Trad) font installed, and so forth. Still, when you're trying to publish a multilingual page which can be properly displayed anywhere, this isn't much consolation. (I hope this doesn't read like I am attacking you, I am not. I am just trying to highlight the difficulty I am having with this.) You are not alone... Best regards, James Kass.
Re: glyph selection for Unicode in browsers
James, thanks as always for your reply. The 65K limit is ugly... With respect to the CJKT comment below, I guess it is true because of catch-22. For example, I set my browser to default to a Unicode font. I think everyone would if they could -- it's a knee-jerk response if the solution is adequate everywhere. You don't have to know which fonts work for which languages. For the Americas and Europe, users can easily just set a Unicode font. However, a Japanese user might have to choose a Japanese font, if the Unicode font does not favor (and cannot be made to favor with language tags) Japanese renderings. So it's catch-22. They have native fonts because Unicode fonts are inadequate, but we can be relieved that although Unicode fonts are inadequate, we are lucky the users don't use them. ugh! So where the differences are important, users are forced to select native fonts instead of Unicode fonts. This then creates the difficulty that to view a multilingual page, you need to a) acquire specialized fonts (tedious and costly perhaps), b) install them, c) assign them, d) finally view the page. Sadder still: content developers that want to use Unicode: a) can invest a lot of time in declaring lang around sections of text, and really get no bang for it at the moment. In truth browsers do very little with this information as far as I can tell. (I suspect it helps search engines, but I need to test that assumption more.) b) It is actually more beneficial to use native code pages than Unicode, since the browsers seem to do a better job of font selection here. (I need to test this statement more. However, from my own coding experience on Windows, knowing the code page allows easy setting of the script for the font, which has a major influence on Windows font selection. The language information wouldn't be available so easily for a Unicode file without it being carefully designed in to be passed from the markup layers down to the primitive font selection layers.)
To be fair, I think font coverage for Unicode has been steadily improving and it is much easier today to produce multilingual docs than in the past. But I am disappointed in the state of the art for browsers, and I suspect it is also true for other products that are not professional publishing software of one kind or another. I suspect that at the heart of the problem is that the rendering architecture has not carried language (as opposed to code page) down to the primitive layers, and this needs to be addressed throughout the architecture, since the language information can no longer be deduced or presumed when the encoding is Unicode. Whatever the reason, this needs to be fixed a) so Unicode can be recommended as best practice and b) so documents are rendered with appropriate glyphs, without extraordinary effort by users. tex [EMAIL PROTECTED] wrote: Also, this approach means I have to ask each Unicode font vendor, Which language is your multilingual font designed for? so I know which CJKT assignment is appropriate for that font... Sad but true. On a happier note, most Japanese users will already have a Japanese font set as default, Chinese users will have a Chinese (Simp or Trad) font installed, and so forth. Still, when you're trying to publish a multilingual page which can be properly displayed anywhere, this isn't much consolation. Best regards, James Kass. -- - Tex Texin cell: +1 781 789 1898 mailto:[EMAIL PROTECTED] Xen Master http://www.i18nGuy.com XenCraft http://www.XenCraft.com Making e-Business Work Around the World -
script detection program
Does anyone have a program or tool that can identify the scripts which the characters in a UTF-16 encoded file belong to? I'd like a program that can scan the data and return script tags such as those used in http://www.unicode.org/unicode/reports/tr24/ so if I had a UTF-16 encoded file with Latin and Cyrillic characters, the tool/program would scan the text and return the names latn and cyrl
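Python's stdlib does not expose the Script property, so a complete answer needs the UAX #24 data file (Scripts.txt). The sketch below hard-codes a few illustrative ranges only, as an assumption-laden stand-in for parsing the real data file; it is not a full implementation of the requested tool.

```python
# A tiny, hand-picked subset of UAX #24 script ranges, enough to
# illustrate the approach; a real tool would parse Scripts.txt.
RANGES = [
    (0x0041, 0x005A, "Latn"), (0x0061, 0x007A, "Latn"),
    (0x00C0, 0x024F, "Latn"),
    (0x0370, 0x03FF, "Grek"),
    (0x0400, 0x04FF, "Cyrl"),
    (0x0590, 0x05FF, "Hebr"),
    (0x4E00, 0x9FFF, "Hani"),
]

def scripts_in(text):
    """Return the set of script tags found in the given text."""
    found = set()
    for ch in text:
        cp = ord(ch)
        for lo, hi, tag in RANGES:
            if lo <= cp <= hi:
                found.add(tag)
                break
    return found

# For a UTF-16 file: scripts_in(open(path, encoding="utf-16").read())
print(sorted(scripts_in("Pravda \u041f\u0440\u0430\u0432\u0434\u0430")))
```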
Composition chart
Apropos the discussion on combining sequences: I had needed a quick reference for (canonical) composite characters, and written a quick chart (well, actually, a quick program for generating one). In case anyone is interested I posted it on: http://www.macchiato.com/unicode/composition_chart.html I found it a reasonably compact way to reference all composites, including the ones that are excluded from NFC. The presentation is not very polished, but tool-tips are enabled with character names. Red means 'excluded from composition'. The rest of the structure should be more or less clear. Curiously enough, it seems to expose a bug in Arial Unicode MS: the character U+1E4B gets a funny glyph. (This is on the AUMS from Office 2000; I have Office XP but haven't gone through the process of installing it yet, so it may be fixed there.) Mark __ http://www.macchiato.com ► “Eppur si muove” ◄
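The chart's distinction between ordinary composites and the red "excluded from composition" cells can be reproduced with any normalizer: a character with a canonical decomposition recomposes under NFC unless it is a composition exclusion. A sketch of that check in Python (an illustration, not the chart-generating program itself):

```python
import unicodedata

def is_composition_excluded(ch):
    """True if ch canonically decomposes but NFC refuses to recompose
    it -- i.e. the 'red' cells in the chart."""
    decomp = unicodedata.normalize("NFD", ch)
    return decomp != ch and unicodedata.normalize("NFC", decomp) != ch

# U+00E9 recomposes normally under NFC...
print(is_composition_excluded("\u00e9"))  # False
# ...but U+0958 DEVANAGARI LETTER QA is a composition exclusion,
# so NFC leaves ka + nukta decomposed.
print(is_composition_excluded("\u0958"))  # True
```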