Re: Script of U+0951 .. U+0954
There were some errors in my suggested update to Scripts.txt. A correction has been posted. Sorry about that.
Mark Davis mark dot davis at jtcsv dot com wrote:
Whatever their script property values, characters with general categories of Mn and Me should also inherit their script from their base character. The nominal script property value for these characters may be different from INHERITED in cases where the best interpretation of that character in isolation would be a specific script.
This is more than an explanatory or clarifying passage. It would add noticeable complexity to the Scripts model, because it would now be necessary to distinguish TWO types of inherited characters:
(1) those marked as belonging to the INHERITED meta-script, which inherit their script from a base character if any, but remain in INHERITED if they occur in isolation (for whatever reason), and
(2) those marked as belonging to a real script, but with general category Mn or Me, which also inherit their script from a base character if any, but remain in their original script (not INHERITED) if they occur in isolation.
Implementations of the Scripts model would need to haul around the general-category information for every character, which was not necessary before and which imposes significant overhead. (Yes, I know ICU already supports this, but suppose I want to roll my own lightweight implementation?)
Isn't there some way to keep the inherited logic relatively simple?
-Doug Ewell
Fullerton, California
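To make the two cases above concrete, here is a minimal sketch of how a lightweight implementation might resolve scripts under the proposed wording. The table NOMINAL_SCRIPT and the function resolve_scripts are hypothetical names, and the script values in the table are illustrative assumptions, not the published Scripts.txt data. Note that the general-category lookup is exactly the extra information the message objects to carrying around.

```python
import unicodedata

# Hypothetical, hand-made excerpt of a Scripts.txt-style table.
# The values are illustrative assumptions, not the published data.
NOMINAL_SCRIPT = {
    0x0061: "LATIN",       # LATIN SMALL LETTER A
    0x0915: "DEVANAGARI",  # DEVANAGARI LETTER KA
    0x0301: "INHERITED",   # COMBINING ACUTE ACCENT (case 1)
    0x0951: "DEVANAGARI",  # DEVANAGARI STRESS SIGN UDATTA (case 2, assumed value)
}


def resolve_scripts(text):
    """Return one resolved script name per character of text."""
    resolved = []
    base_script = None  # script of the most recent base character, if any
    for ch in text:
        nominal = NOMINAL_SCRIPT.get(ord(ch), "COMMON")
        # This lookup is the general-category data the proposal requires.
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if nominal == "INHERITED" or is_mark:
            # With a base character, both cases take the base's script.
            # In isolation, case 1 stays INHERITED and case 2 keeps its
            # nominal script, because nominal is used as the fallback.
            resolved.append(base_script if base_script is not None else nominal)
        else:
            base_script = nominal
            resolved.append(nominal)
    return resolved
```

Under these assumptions, resolve_scripts("a\u0951") yields LATIN for both characters, while resolve_scripts("\u0951") yields DEVANAGARI and resolve_scripts("\u0301") yields INHERITED.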
Re: Unihan Mandarin Readings
On Tuesday, December 3, 2002, at 03:17 AM, Andrew C. West wrote:
BTW, is it possible for Unicode to provide a Unihan.xml version of the Unihan database? The first thing I do is convert the Unihan.txt file into XML format for ease of processing.
As a rule, we tend to stick to older formats so that people don't have to rewrite their Perl scripts and other parsers. I know you're asking if we could add an XML format *in addition* to the non-XML one, but given the size of Unihan.txt, that isn't likely.
== John H. Jenkins [EMAIL PROTECTED] http://www.tejat.net/
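For readers who want to do the conversion themselves, here is a rough sketch of the kind of Unihan.txt to XML step described above. It assumes the tab-separated layout of Unihan.txt (code point, field name, value, with comment lines starting with #). The function name unihan_to_xml and the element and attribute names (unihan, char, cp, field, name) are invented for illustration and do not correspond to any official schema.

```python
import xml.etree.ElementTree as ET


def unihan_to_xml(lines):
    """Build an XML tree from Unihan.txt-style lines.

    Each data line is assumed to look like: U+XXXX<TAB>kField<TAB>value
    The element and attribute names are hypothetical.
    """
    root = ET.Element("unihan")
    chars = {}  # code point string -> its <char> element
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t", 2)
        elem = chars.get(code)
        if elem is None:
            # Drop the "U+" prefix and keep the hex digits as an attribute.
            elem = ET.SubElement(root, "char", cp=code[2:])
            chars[code] = elem
        ET.SubElement(elem, "field", name=field).text = value
    return root


# Usage with placeholder lines (not real Unihan data):
sample = [
    "# placeholder comment\n",
    "U+4E00\tkExampleField\texample value\n",
    "U+4E00\tkOtherField\tanother value\n",
]
tree = unihan_to_xml(sample)
print(ET.tostring(tree, encoding="unicode"))
```

Building the whole tree in memory is the simplest approach but may be heavy for a file as large as Unihan.txt; writing each char element out as it is completed would be one alternative.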
Order in which Unicode characters displayed
Hello, I am attempting to utilise the Unicode values from the CJK Unified Ideographs to undertake searches for the occurrence of the corresponding characters on a hard disk drive. When I look at a Chinese character with a hex editor I get a certain order for the hex or Unicode value for the character. For example, the English word 'abalone' in Chinese has a code of '8D9C 7C9C' when viewed in a hex editor, but when I referred to the CJK Unicode table, the value came out as '9C8D 9C7C'. Can you explain the different ordering of the code? When I conducted a test search of the contents of a hard drive known to contain the Chinese characters for 'abalone' I only found hits on the hex values, not on the CJK Unicode values. Any assistance would be appreciated. Regards, Mike Smith
Re: Order in which Unicode characters displayed
This is expected behavior. It has to do with the endianness of the processor on which you are running. From the glossary (http://www.unicode.org/glossary/):
Little-endian. A computer architecture that stores multiple-byte numerical values with the least significant byte (LSB) values first.
Big-endian. A computer architecture that stores multiple-byte numerical values with the most significant byte (MSB) values first.
You can also find several thousand web pages describing the issue by searching Google for these two terms.
MichKa
----- Original Message -----
From: Smith, Mike [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, December 07, 2002 1:16 PM
Subject: Order in which Unicode characters displayed
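To make the byte-order point concrete for the two code points in the question (U+9C8D U+9C7C), the sketch below encodes them as 16-bit units in both byte orders and searches a byte buffer for either form. It assumes the text on the disk is stored as UTF-16; the helper name find_utf16 is made up for this example.

```python
text = "\u9C8D\u9C7C"  # the two code points from the question

big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first

print(big.hex())     # 9c8d9c7c  (matches the code chart order)
print(little.hex())  # 8d9c7c9c  (matches what the hex editor showed)


def find_utf16(data: bytes, needle: str):
    """Return (offset, byte_order) for every match of needle in data,
    trying both UTF-16 byte orders."""
    hits = []
    for order in ("utf-16-be", "utf-16-le"):
        pattern = needle.encode(order)
        start = data.find(pattern)
        while start != -1:
            hits.append((start, order))
            start = data.find(pattern, start + 1)
    return hits
```

Searching for both byte orders avoids having to know in advance how the file was written. If the text is stored in some other encoding (UTF-8, GB2312, Big5), neither pattern will match and the needle would have to be encoded accordingly.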