Re: Script of U+0951 .. U+0954

2002-12-07 Thread Doug Ewell
There were some errors in my suggested update to Scripts.txt.  A
correction has been posted.  Sorry about that.

Mark Davis mark dot davis at jtcsv dot com wrote:

 Whatever their script property values, characters with general
 categories of Mn and Me should also inherit their script from their
 base character. The nominal script property value for these characters
 may be different from INHERITED in cases where the best interpretation
 of that character in isolation would be a specific script.

This is more than an explanatory or clarifying passage.  It would add
noticeable complexity to the Scripts model, because it would now be
necessary to distinguish TWO types of inherited characters:

(1) those marked as belonging to the INHERITED meta-script, which
inherit their script from a base character if any, but remain in
INHERITED if they occur in isolation (for whatever reason), and

(2) those marked as belonging to a real script, but with general
category Mn or Me, which also inherit their script from a base character
if any, but remain in their original script (not INHERITED) if they
occur in isolation.

Implementations of the Scripts model would need to haul around the
general-category information for every character, which was not
necessary before and which imposes significant overhead.  (Yes, I know
ICU already supports this, but suppose I want to roll my own lightweight
implementation?)  Isn't there some way to keep the inherited logic
relatively simple?
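
To make the cost concrete, here is a minimal Python sketch of the lookup
that the quoted wording would seem to require. The NOMINAL_SCRIPT table is
a hand-written, hypothetical excerpt (the DEVANAGARI value for U+0951 is
assumed purely for illustration, since that assignment is the point under
discussion), and resolved_scripts is an invented helper name. Unlisted
characters are treated as COMMON to keep the example short. The point is
only that the function needs general-category data (here taken from
Python's unicodedata module) in addition to the script table.

```python
import unicodedata

# Hypothetical, hand-written excerpt of script assignments. A real
# implementation would load these values from Scripts.txt.
NOMINAL_SCRIPT = {
    "\u0061": "LATIN",       # LATIN SMALL LETTER A
    "\u0915": "DEVANAGARI",  # DEVANAGARI LETTER KA
    "\u0301": "INHERITED",   # COMBINING ACUTE ACCENT
    "\u0951": "DEVANAGARI",  # DEVANAGARI STRESS SIGN UDATTA (value assumed)
}


def resolved_scripts(text):
    """Return one script name per character, applying both inheritance rules."""
    result = []
    base_script = None  # script of the most recent base character, if any
    for ch in text:
        nominal = NOMINAL_SCRIPT.get(ch, "COMMON")  # simplification
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if nominal == "INHERITED" or is_mark:
            # Types (1) and (2) both take the base character's script.
            # In isolation, each keeps its own nominal value.
            result.append(base_script if base_script else nominal)
        else:
            base_script = nominal
            result.append(nominal)
    return result
```

Under these assumptions, U+0951 after a Latin base resolves to LATIN, while
in isolation U+0951 stays DEVANAGARI and U+0301 stays INHERITED, which is
the two-way distinction described above.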

-Doug Ewell
 Fullerton, California





Re: Unihan Mandarin Readings

2002-12-07 Thread John H. Jenkins

On Tuesday, December 3, 2002, at 03:17 AM, Andrew C. West wrote:


BTW, is it possible for Unicode to provide a Unihan.xml version of the
Unihan database? The first thing I do is convert the Unihan.txt file
into XML format for ease of processing.


As a rule, we tend to stick to older formats so that people don't have
to rewrite their Perl scripts and other parsers.  I know you're asking
if we could add an XML format *in addition* to the non-XML one, but
given the size of Unihan.txt, that isn't likely.
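
For readers who want the XML form anyway, a local conversion along the
lines Andrew describes might look like the following Python sketch. It
assumes the usual Unihan.txt layout of tab-separated "code point, field,
value" lines with "#" comment lines. The function name and the element and
attribute names (unihan, char, cp, field, name) are invented for this
example and are not any official schema.

```python
import xml.etree.ElementTree as ET


def unihan_lines_to_xml(lines):
    """Group tab-separated Unihan lines into one <char> element per code point."""
    root = ET.Element("unihan")      # hypothetical root element name
    chars = {}                       # code point string -> its <char> element
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                 # skip blank lines and comments
        codepoint, field, value = line.split("\t", 2)
        elem = chars.get(codepoint)
        if elem is None:
            elem = ET.SubElement(root, "char", cp=codepoint)
            chars[codepoint] = elem
        ET.SubElement(elem, "field", name=field).text = value
    return root


# Illustrative input lines, not copied from the actual database.
sample = ["# comment", "U+4E00\tkMandarin\tYI1", "U+4E00\tkDefinition\tone"]
root = unihan_lines_to_xml(sample)
xml_text = ET.tostring(root, encoding="unicode")
```

Passing an open file object for Unihan.txt in place of the sample list
should work the same way, since the function only iterates over lines.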

==
John H. Jenkins
[EMAIL PROTECTED]
http://www.tejat.net/




Order in which unicode characters displayed

2002-12-07 Thread Smith, Mike
Hello

I am attempting to utilise the Unicode values from the CJK Unified
Ideographs to undertake searches for the occurrence of the corresponding
characters on a hard disk drive.

When I look at a Chinese character with a hex editor I get a certain
order for the hex or Unicode value for the character.  For example, the
English word 'abalone' in Chinese has a code of '8D9C 7C9C' when viewed
in a hex editor, but when I referred to the CJK Unicode table, the value
came out as '9C8D 9C7C'.

Can you explain the different ordering of the code?

When I conducted a test search of the contents of a hard drive known to
contain the Chinese characters for 'abalone' I only found hits on the
hex values, not on the CJK Unicode values.

Any assistance would be appreciated.

Regards

Mike Smith





Re: Order in which unicode characters displayed

2002-12-07 Thread Michael (michka) Kaplan
This is expected behavior. It has to do with the endianness of the
processor on which you are running. From the glossary
(http://www.unicode.org/glossary/):

Little-endian. A computer architecture that stores multiple-byte numerical
values with the least significant byte (LSB) values first.

Big-endian. A computer architecture that stores multiple-byte numerical
values with the most significant byte (MSB) values first.

You can also find several thousand web pages describing the issue by
searching Google for these two terms.
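
As a small illustration using the values from the question, the Python
sketch below encodes U+9C8D U+9C7C in both byte orders and searches a byte
buffer for either one. It assumes the text on disk is stored as UTF-16
(a file in some other encoding would need different byte patterns), and
the helper names utf16_patterns and find_utf16 are invented for this
example.

```python
def utf16_patterns(text):
    """Return the UTF-16 byte sequence for text in each byte order."""
    return {
        "little-endian": text.encode("utf-16-le"),
        "big-endian": text.encode("utf-16-be"),
    }


def find_utf16(data, text):
    """Return sorted (offset, byte order) pairs for every match in data."""
    hits = []
    for order, pattern in utf16_patterns(text).items():
        start = data.find(pattern)
        while start != -1:
            hits.append((start, order))
            start = data.find(pattern, start + 1)
    return sorted(hits)


abalone = "\u9C8D\u9C7C"
patterns = utf16_patterns(abalone)
print(patterns["little-endian"].hex())  # 8d9c7c9c, as seen in the hex editor
print(patterns["big-endian"].hex())     # 9c8d9c7c, matching the code chart
```

Searching for both patterns covers files written in either byte order,
which a search for only the code-chart order would miss on a
little-endian system.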

MichKa


- Original Message -
From: Smith, Mike [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, December 07, 2002 1:16 PM
Subject: Order in which unicode characters displayed

