Re: U+0140

2004-04-19 Thread Adam Twardoch
From: John Hudson [EMAIL PROTECTED]
 'Careful hairsplitting' always takes place when people care about
typography.

How very true.

On one hand, there's people who put a cedilla under a when typesetting
Polish; on the other hand, there's people who adjust the vertical position
of hyphens when typesetting all-caps. And there's a lot in between. But it is
important to realize that there _always_ were people who adjusted the hyphen
in all-caps settings. Gutenberg's own typesetting was careful hairsplitting.

This is a very typical and essential dilemma, which is one of the reasons
why there is no easy answer to the glyph vs. character question, or more
precisely, why the character definition in Unicode is so, well, vague.
Since the decision on what is a character and what is merely a glyph
variant is made somewhat arbitrarily (albeit in a committee process), there
are far too many exceptions to the rule for Unicode to be consistent and
easy-to-use. But since written human language never was consistent and
easy-to-use, I guess it's something very natural and we will all live with
that.

Adam





Downloading UCD 4.0.0

2004-04-19 Thread Theo Veenker
Hi,

Until now I always downloaded the latest version of the UCD
and worked with that. Now I want to download the UCD files for
4.0.0 again. I know it is all in
http://www.unicode.org/Public/4.0-Update/, but in
http://www.unicode.org/ucd/ I read this:
  The complete set of all files for a given version of the UCD
   consists of the files in the update directory for that version,
   together with all the files unchanged from earlier versions,
   which are kept in their respective update directories.
Do I really need to find out and download all unchanged files
from 3.2.0 and earlier, just to get the files for 4.0.0?
Theo




JIS X 0213: 2000 AMD-1 and Unihan.txt

2004-04-19 Thread Ernest Cline

Would it be reasonable to expect that data concerning the
ten characters added to JIS X 0213 by Amendment 1 will
make it into the next version of Unihan.txt?  I'm presuming
that this is official since ISO-IR-233, which updates
ISO-IR-228, was released on 13 April.

[Relevant data from ISO-IR-233]

Unicode = Min,Ku,Ten
U+4FF1  = 1,14,01
U+525D  = 1,15,94
U+541E  = 1,47,94
U+5653  = 1,84,07
U+59F8  = 1,94,90
U+5C5B  = 1,94,91
U+5E77  = 1,94,92
U+7626  = 1,94,93
U+7E6B  = 1,94,94
U+20B9F = 1,47,52
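
For anyone who wants to load the listing above into a program, here is a minimal C sketch of a line parser. The struct and function names are hypothetical, chosen only for this illustration; nothing here comes from ISO-IR-233 or Unihan.txt itself.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the listing: a Unicode code point and its JIS X 0213
 * position. All names here are hypothetical. */
struct jis_mapping {
    unsigned long code_point; /* e.g. 0x4FF1 */
    int plane;                /* first field ("Min" in the listing) */
    int ku;                   /* row */
    int ten;                  /* cell */
};

/* Parse a line such as "U+4FF1 = 1,14,01".
 * Returns 1 on success, 0 if the line does not match (for example the
 * "Unicode = Min,Ku,Ten" header line).
 * White space in a sscanf format matches any amount of white space,
 * including none, so uneven spacing around '=' is tolerated. */
int parse_jis_mapping(const char *line, struct jis_mapping *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, " U+%lx = %d ,%d ,%d",
                  &out->code_point, &out->plane,
                  &out->ku, &out->ten) == 4;
}
```

Calling parse_jis_mapping() on each line and skipping the lines for which it returns 0 is enough to read the ten rows.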

Ernest Cline
[EMAIL PROTECTED]






Re: Downloading UCD 4.0.0

2004-04-19 Thread Kenneth Whistler
Theo Veenker asked:

 Until now I always downloaded the latest version of the UCD
 and worked with that. Now I want to download the UCD files for
 4.0.0 again. I know it is all in
 http://www.unicode.org/Public/4.0-Update/,

That is an incorrect assumption.

 but in http://www.unicode.org/ucd/ I read this:
 
The complete set of all files for a given version of the UCD
 consists of the files in the update directory for that version,
 together with all the files unchanged from earlier versions,
 which are kept in their respective update directories.
 
 Do I really need to find out and download all unchanged files
 from 3.2.0 and earlier, just to get the files for 4.0.0?

Yes. The relevant information for *each* version of the
Unicode Standard is at:

http://www.unicode.org/standard/Versions/enumeratedversions.html

As it happens, almost *every* data file was updated for
Unicode 4.0, so almost everything is available specifically
in http://www.unicode.org/Public/4.0-Update/ The only
normative files that were unchanged from an earlier version were:

http://www.unicode.org/Public/3.2-Update/Jamo-3.2.0.txt
http://www.unicode.org/Public/3.2-Update/Unihan-3.2.0.zip

Of course, the update for Unihan.txt was one of the main reasons
for the Unicode 4.0.1 release.

The only other file that was unchanged was the character index
to the book:

http://www.unicode.org/Public/3.2-Update/Index-3.2.0.txt

which, for production reasons, was not updated again until the
release of Unicode 4.0.1.
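
As an illustration of piecing the 4.0.0 file set together, the following C sketch builds download URLs from an (update directory, file name) table. Only the three 3.2-Update entries named above are listed; the struct, table, and function names are hypothetical, and the files that live in 4.0-Update would have to be added by hand.

```c
#include <assert.h>
#include <string.h>

#define UCD_BASE "http://www.unicode.org/Public/"

/* Hypothetical table entry: which update directory holds which file. */
struct ucd_file {
    const char *dir;
    const char *file;
};

/* The files that 4.0.0 takes unchanged from 3.2, as listed above. */
const struct ucd_file unchanged_in_4_0_0[] = {
    { "3.2-Update", "Jamo-3.2.0.txt" },
    { "3.2-Update", "Unihan-3.2.0.zip" },
    { "3.2-Update", "Index-3.2.0.txt" }
};

/* Write UCD_BASE + dir + "/" + file into buf.
 * Returns 1 on success, 0 if buf (of `size` bytes) is too small. */
int ucd_url(const char *dir, const char *file, char *buf, size_t size)
{
    size_t need;

    assert(dir != NULL && file != NULL && buf != NULL);
    need = strlen(UCD_BASE) + strlen(dir) + 1 + strlen(file) + 1;
    if (need > size)
        return 0;
    strcpy(buf, UCD_BASE);
    strcat(buf, dir);
    strcat(buf, "/");
    strcat(buf, file);
    return 1;
}
```

Looping over the table and passing each URL to whatever download tool is at hand would fetch the unchanged files.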

--Ken




Re: Downloading UCD 4.0.0

2004-04-19 Thread Asmus Freytag
At 08:42 AM 4/19/2004, Theo Veenker wrote:
Hi,

Until now I always downloaded the latest version of the UCD
and worked with that. Now I want to download the UCD files for
4.0.0 again. I know it is all in
http://www.unicode.org/Public/4.0-Update/, but in
http://www.unicode.org/ucd/ I read this:
  The complete set of all files for a given version of the UCD
   consists of the files in the update directory for that version,
   together with all the files unchanged from earlier versions,
   which are kept in their respective update directories.
Do I really need to find out and download all unchanged files
from 3.2.0 and earlier, just to get the files for 4.0.0?
Yes.

And depending on what version of the UCD you are trying to piece together,
you may need versions of some files from several earlier updates.

A./

PS: we are looking into ways to make access to older versions
more straightforward. 





FW: Web Form: Subj: Unicode conversion- Microsoft Visual C++ compiler

2004-04-19 Thread Magda Danish (Unicode)
Mino,

I am sending your question to the Unicode public email list 
http://www.unicode.org/consortium/distlist.html for a possible answer from one of the 
list subscribers.

Regards, 

---
Magda Danish
Sr. Administrative Director
The Unicode Consortium
650-693-3921
[EMAIL PROTECTED]
 

 -Original Message-
 Date/Time:Mon Apr 19 05:09:20 EDT 2004
 Contact:  [EMAIL PROTECTED]
 Report Type:  Other Question, Problem, or Feedback
 Opt Subject:  Unicode conversion
 
 I would like to convert a 2 byte Unicode code into its
 corresponding Unicode character (for instance the decimal 1063 or the
 hexadecimal 0427 into 'Ч'). Is there a C function in order to make the
 conversion? What file .h do I need to include in the C program? Can I
 use the 6.0 version of the Microsoft Visual C++ compiler, or do I
 need a newer version?
 Thanks a lot in advance.
 Mino Napoletano
 
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 (End of Report)




Re: JIS X 0213: 2000 AMD-1 and Unihan.txt

2004-04-19 Thread John Jenkins
Yes, it's reasonable.  In fact, the data have already been added, but 
this was done just too late for inclusion in the 4.0.1 release.

On Apr 19, 2004, at 12:23 PM, Ernest Cline wrote:

Would it be reasonable to expect that data concerning the
ten characters added to JIS X 0213 by Amendment 1 will
make it into the next version of Unihan.txt?  I'm presuming
that this is official since ISO-IR-233, which updates
ISO-IR-228, was released on 13 April.
[Relevant data from ISO-IR-233]

Unicode = Min,Ku,Ten
U+4FF1  = 1,14,01
U+525D  = 1,15,94
U+541E  = 1,47,94
U+5653  = 1,84,07
U+59F8  = 1,94,90
U+5C5B  = 1,94,91
U+5E77  = 1,94,92
U+7626  = 1,94,93
U+7E6B  = 1,94,94
U+20B9F = 1,47,52
Ernest Cline
[EMAIL PROTECTED]





John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



Re: Web Form: Subj: Unicode conversion- Microsoft Visual C++ compiler

2004-04-19 Thread Raymond Mercier
Mino,
This is not at all clear:
the character U+0427 is Ч in the Cyrillic block, and what does this have to
do with the two characters Ð and §, which are U+00D0 and U+00A7?
Are you wondering how to store 0x0427 in a binary file? Or what?

Raymond Mercier

  Contact:  [EMAIL PROTECTED]
  Report Type:  Other Question, Problem, or Feedback
  Opt Subject:  Unicode conversion
 
  I would like to convert a 2 byte Unicode code into its
  corresponding Unicode character (for instance the decimal 1063 or the
  hexadecimal 0427 into 'Ч'). Is there a C function in order to make the
  conversion? What file .h do I need to include in the C program? Can I
  use the 6.0 version of the Microsoft Visual C++ compiler, or do I
  need a newer version?
  Thanks a lot in advance.
  Mino Napoletano
 
  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
  (End of Report)






Re: U+0140

2004-04-19 Thread Kenneth Whistler
John Hudson responded to Michael Everson:

 Michael Everson wrote:
 
  This would make the mid-dot too high. The top dot of the colon usually 
  sits toward the top of the x-height; the *mid*-dot should sit lower, 

  John, I just don't believe you. I don't believe that in all the history 
  of Greek and Catalan typography this careful hairsplitting has *always* 
  taken place; certainly in scientific transcription the HALF TRIANGULAR 
  COLON is just the top dot in the TRIANGULAR COLON, and in Americanist 
  transcription where the dot-colons are used instead of triangles I would 
  say the same applies.
 
 I never contested that the dots of a colon correspond to the triangles of the
 linguistic long vowel marker. They clearly do. What I contested was that the
 typographic mid-point (U+00B7) corresponded to the top dot of a colon. It
 clearly does not. It is called a mid-point because it sits midway up the
 x-height. It is used in this position for a variety of stylistic purposes, ...

I think we have two typographers here arguing somewhat at cross-purposes.
Clearly the typographic mid-point behaves as John has mentioned, and is
designed as such in many fine fonts (examples seen among the exhibits that
Asmus gathered).

But just as clearly, there is a long, long tradition in Americanist
orthographic practice (which is used widely for linguistic orthographies
outside of Native America as well) of using a raised dot for an indication
of vocalic (and occasionally consonantal) length. For 100 years, that
raised dot was mechanically generated by, among other means, filing the
lower dot off a colon key on a mechanical typewriter. (I have such a
typewriter sitting on my desk.) Linguists got used to this raised dot
height, coordinated with a colon in design (which then could be used, among
other things, to indicate a prolonged length, when two degrees of length
were in question), and that preference made its way into print, at least
for many North American languages, where the raised dot could be printed
at x-height, rather than at midway up the x-height, which would be too
low for most of the linguistic usage.

Enter the electronic age. ASCII had no MIDDLE DOT. It was period (.), colon (:)
or the highway. Early linguistic material on computers made do with those,
because they had no choice. The IBM PC and the Macintosh introduced a
MIDDLE DOT (0xFA [= IBM CDRA SD63 Middle Dot] and 0xE1, respectively).
When ISO 8859-1 was defined, it also had a MIDDLE DOT (0xB7). *Everybody*
made use of that MIDDLE DOT for anything that was vaguely in the ballpark --
the typographical mid-point, the linguistic length mark, the mathematical
multiplication operator, the Greek ano teleia, the dictionary hyphenation
point, and, yes, the Catalan middle dot. The fact that each of those usages 
might have extremely fine typographical hairs to split regarding the rendering
was so much horsepucky as far as the character identity was concerned. You
used what you had available to represent your data.

The Unicode Standard, for a variety of reasons -- some of which included
compatibility mapping concerns to other standards which had started to
proliferate middle dots -- added a collection of middle dots *besides*
U+00B7, *the* middle dot, to its repertoire. Those other middle dots give
people textual representation alternatives now, if they need to make
distinctions, and textual rendering alternatives, if they need to make
middle dots which display with slightly different heights, sizes, or
spacings, depending on the rendering requirements.

What is clear, however, is that it is utterly impossible to satisfy
everybody regarding middle dots. Typographical purists will always want
plain text to make more distinctions. Text processing requirements will
abhor the splitting of text representation into more and more difficult-to-
distinguish glyph representations without clear semantic differences.
And dot proliferation *always* poses difficulty for establishing
character properties.

Before people bluster on too much further on this thread, it would
be good for everyone to recall that the *reason* why U+00B7 has
problematical properties is that it was inherently ambiguous in
*preexisting* usage (that is, prior to Unicode altogether) as punctuation
versus length mark (and other things as well). This puts it in the
same grab bag of very difficult, ambiguous ASCII characters, such as
~, *, and ' which also acquired conflicting usages during their
reign among the small set of available punctuation and symbols in
ASCII.

History has consequences. The history of a character's encoding also
has consequences for how the Unicode Standard is to be used and
interpreted.

--Ken




Re: Web Form: Subj: Unicode conversion- Microsoft Visual C++ compiler

2004-04-19 Thread Kenneth Whistler
I think this was just a confused way of asking how to
convert UTF-16 into UTF-8:

U+0427 is the Unicode encoded character.

0x0427 is the UTF-16 character encoding form for it.

0xD0 0xA7 is the UTF-8 character encoding form for it.

Mino, sample code for how to do this is available at:

http://www.unicode.org/Public/PROGRAMS/CVTUTF/

Many Unicode support libraries will have a UTF-16 -- UTF-8
conversion routine built in somewhere. Check in the documentation
of the libraries for details.

This isn't a standard C function call -- it is in the libraries.
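
For readers who just want to see the arithmetic behind U+0427 becoming 0xD0 0xA7, here is a minimal C sketch of a UTF-8 encoder for a single code point. It is an illustration written for this note, not the CVTUTF sample code linked above, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode scalar value as UTF-8 into out, which must have
 * room for at least 4 bytes. Returns the number of bytes written, or
 * 0 for a surrogate code point or a value above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80UL) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)   /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000UL) {                   /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {                 /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With cp = 0x0427 the second branch applies: 0x0427 >> 6 is 0x10, giving 0xD0, and the low six bits are 0x27, giving 0xA7. The sketch takes a code point as input; converting from UTF-16 text additionally requires combining surrogate pairs first.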

--Ken

 Mino,
 This is not at all clear:
 the character U+0427 is Ч in the Cyrillic block, and what does this have to
 do with the two characters Ð and §, which are U+00D0 and U+00A7?
 Are you wondering how to store 0x0427 in a binary file? Or what?
 
 Raymond Mercier
 
   Contact:  [EMAIL PROTECTED]
   Report Type:  Other Question, Problem, or Feedback
   Opt Subject:  Unicode conversion
  
   I would like to convert a 2 byte Unicode code into its
   corresponding Unicode character (for instance the decimal 1063 or the
   hexadecimal 0427 into 'Ч'). Is there a C function in order to make the
   conversion? What file .h do I need to include in the C program? Can I
   use the 6.0 version of the Microsoft Visual C++ compiler, or do I
   need a newer version?
   Thanks a lot in advance.
   Mino Napoletano




Re: U+0140

2004-04-19 Thread Peter Kirk
On 19/04/2004 13:03, Kenneth Whistler wrote:

... Those other middle dots give
people textual representation alternatives now, if they need to make
distinctions, and textual rendering alternatives, if they need to make
middle dots which display with slightly different heights, sizes, or
spacings, depending on the rendering requirements.
 

Ken, does Unicode specify height, size and spacing distinctions between 
the various middle dots which you listed? If I understand correctly, it 
certainly doesn't do so exhaustively. So in effect what you are 
suggesting here is that people make and use their own private 
distinctions between characters which are not defined by Unicode. This 
sounds very like advising people to ignore Unicode character identities 
and properties and do their own thing. Rather strange advice from 
someone in your position, surely?

Surely, in the current situation and if further proliferation of middle 
dots is considered undesirable, users should be advised to presume that 
distinctions between middle dots are not a plain text matter and so 
should be handled by markup, including language selection.

And if (as I just suggested on the Hebrew list might be true of some 
variant Hebrew pointing systems) someone finds a well documented script 
in which a true middle dot and an x-height dot are used contrastively, 
the correct approach would be either to accept, reluctantly, that at 
least one new dot needs to be encoded; or else for Unicode to define 
clearly which existing character should be used for which dot in this 
script. The worst thing that could happen would be for different text 
providers to make different and incompatible selections among the 
existing characters, leading to total confusion. But that seems to be 
the approach which you, Ken, are advocating.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



RE: Web Form: Subj: Unicode conversion- Microsoft Visual C++ compiler

2004-04-19 Thread Rick Cameron
It may be even simpler than that: U+0427 may have appeared in his message in
UTF-8 because of his mail client.

It could be that he's asking how to convert from an int holding the number
1063 to a wchar_t holding U+0427.

The answer to this question is:

int charValue = 1063;

wchar_t utf16Char = (wchar_t)charValue;

Cheers

- rick 

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Kenneth Whistler
 Sent: April 19, 2004 13:47
 To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Subject: Re: Web Form: Subj: Unicode conversion- Microsoft 
 Visual C++ compiler
 
 I think this was just a confused way of asking how to convert 
 UTF-16 into UTF-8:
 
 U+0427 is the Unicode encoded character.
 
 0x0427 is the UTF-16 character encoding form for it.
 
 0xD0 0xA7 is the UTF-8 character encoding form for it.
 
 Mino, sample code for how to do this is available at:
 
 http://www.unicode.org/Public/PROGRAMS/CVTUTF/
 
 Many Unicode support libraries will have a UTF-16 -- UTF-8 
 conversion routine built in somewhere. Check in the 
 documentation of the libraries for details.
 
 This isn't a standard C function call -- it is in the libraries.
 
 --Ken
 
  Mino,
  This is not at all clear:
  the character U+0427 is Ч in the Cyrillic block, and what does this have to
  do with the two characters Ð and §, which are U+00D0 and U+00A7?
  Are you wondering how to store 0x0427 in a binary file? Or what?
  
  Raymond Mercier
  
    Contact:  [EMAIL PROTECTED]
    Report Type:  Other Question, Problem, or Feedback
    Opt Subject:  Unicode conversion
   
    I would like to convert a 2 byte Unicode code into its
    corresponding Unicode character (for instance the decimal 1063 or the
    hexadecimal 0427 into 'Ч'). Is there a C function in order to make the
    conversion? What file .h do I need to include in the C program? Can I
    use the 6.0 version of the Microsoft Visual C++ compiler, or do I
    need a newer version?
    Thanks a lot in advance.
    Mino Napoletano
 
 



RE: U+0140

2004-04-19 Thread Peter Constable
 And if... someone finds a well documented script
 in which a true middle dot and an x-height dot are used contrastively,

That would be a somewhat surprising and not-to-be-recommended design for
a writing system. Not to be completely ruled out, though. But we can
probably wait to cross that encoding bridge when we come to it.



Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division



Re: U+0140

2004-04-19 Thread Kenneth Whistler
Peter Kirk continued this...

 On 19/04/2004 13:03, Kenneth Whistler wrote:
 
 ... Those other middle dots give
 people textual representation alternatives now, if they need to make
 distinctions, and textual rendering alternatives, if they need to make
 middle dots which display with slightly different heights, sizes, or
 spacings, depending on the rendering requirements.
   
 
 
 Ken, does Unicode specify height, size and spacing distinctions between 
 the various middle dots which you listed? 

No.

 If I understand correctly, it 
 certainly doesn't do so exhaustively. 

Correct.

 So in effect what you are 
 suggesting here is that people make and use their own private 
 distinctions between characters which are not defined by Unicode. 

Not at all.

I am suggesting that people who use Unicode characters *will* use them
according to their identity. However, that doesn't mean that identification
of a character neatly solves all issues of their rendering, nor will it
automatically make things neat and tidy when people use characters in
different contexts which may have different rendering concerns.

The Unicode Standard is not prescriptive about rendering, beyond the
basics required to simply ensure correct mapping of textual content
into streams of characters. If one font vendor wants to have a raised
glyph for the MIDDLE DOT and another wants to have a lowered glyph for
the same character, it is not the Unicode Standard's business to put
the two vendors in a room until one gives up and admits the other one
is correct.

 This 
 sounds very like advising people to ignore Unicode character identities 
 and properties and do their own thing. Rather strange advice from 
 someone in your position, surely?

I love the way you put positions in people's mouths.

By the way, I challenge you to point to the Unicode character properties
in the Unicode Character Database which define the relative position for
middle dots with respect to x-height of a font, or the spacing of
middle dots, for example.

 
 Surely, in the current situation and if further proliferation of middle 
 dots is considered undesirable, 

It is undesirable, yes.

 users should be advised to presume that 
 distinctions between middle dots are not a plain text matter 

No, they should not. Because the existence of multiple different
middle dots in the standard which are *not* canonical equivalents
of each other makes it a plain text matter.

 and so 
 should be handled by markup, including language selection.

In some cases, yes -- it depends on the effect which is intended,
and the context and application it occurs in.

 
 And if (as I just suggested on the Hebrew list might be true of some 
 variant Hebrew pointing systems) someone finds a well documented script 
 in which a true middle dot and an x-height dot are used contrastively, 
 the correct approach would be either to accept, reluctantly, that at 
 least one new dot needs to be encoded; or else for Unicode to define 
 clearly which existing character should be used for which dot in this 
 script. 

Or: None of the Above

The users of characters for particular domains bear their own
responsibility to define their usage. It is not up to the Unicode
Consortium to go around defining everyone's spelling rules and
orthographic conventions for them.

If there are things unclear in the standard which are making its
use difficult for people in certain cases, then that is certainly
a concern of the Unicode Technical Committee. And if someone
brings in convincing evidence of the existence of a semantically
significant plain text distinction between two dots that cannot
plausibly be handled by *any* combination of the multitudinous dot
characters already present in the standard, then the UTC might
consider that sufficient justification to encode yet another
middle dot.

Given, however, the fact that there already are so many dot characters,
and given that their rendering often varies by font, the chance of
getting some additional pair of dot distinctions by height on the
line canonized with yet another dot encoding seems unlikely to me.

It is a will-o'-the-wisp to expect any and all multilingual
Unicode text to display correctly to any arbitrary n-th degree
of typographical rectitude with any and all Unicode-conformant
fonts. The use of specific fonts with specific designs is
*precisely* to enable plain text (or marked-up text, for that
matter) to be displayed as desired for particular contexts.

The criterion for Unicode plain text is basically *legible*
text. 

 The worst thing that could happen would be for different text 
 providers to make different and incompatible selections among the 
 existing characters, leading to total confusion. But that seems to be 
 the approach which you, Ken, are advocating.

I see. And thank you, Peter, for pointing that error out to me.

Text providers have their own responsibility to ensure that
they are using interoperable conventions for the representation
of 

Re: Diacritic Property and Phillipine Viramas

2004-04-19 Thread Kenneth Whistler
Ernest Cline asked:

 Is there a reason for the lack of the Diacritic property on
 the Tagalog and Hanunoo virama characters (U+1714
 and U+1734)? 

Human fallibility?

 All of the other virama characters (i.e.,
 those of combining class 9) have this property and it
 seems appropriate based on the description of these
 characters in Chapter 10.

I think you are correct.

--Ken

 
 Ernest Cline
 [EMAIL PROTECTED]




Re: U+0140

2004-04-19 Thread John Hudson
Peter Constable wrote:

And if... someone finds a well documented script
in which a true middle dot and an x-height dot are used contrastively,
That would be a somewhat surprising and not-to-be-recommended design for
a writing system. Not to be completely ruled out, though. But we can
probably wait to cross that encoding bridge when we come to it.
We already have contrasted use of a baseline dot (period or full stop) and a mid-dot (word 
separator or stylistic hyphen), so why would you be surprised by contrasted use of mid-dot 
and x-height dot? Vertical alignment is clearly sometimes a semantic feature. I've seen 
plenty of business cards in which the mid-dot is used as a stylistic division between 
parts of a telephone number instead of spaces, periods or hyphens. I don't like the style, 
but people do it. Presumably some Greek people do it also, in which case they are 
contrasting the mid-dot and the ano teleia.

John Hudson

--

Tiro Typeworks        www.tiro.com
Vancouver, BC        [EMAIL PROTECTED]
I often play against man, God says, but it is he who wants
  to lose, the idiot, and it is I who want him to win.
And I succeed sometimes
In making him win.
 - Charles Peguy


Re: Downloading UCD 4.0.0

2004-04-19 Thread Doug Ewell
Theo Veenker Theo dot Veenker at let dot uu dot nl wrote:

 Until now I always downloaded the latest version of the UCD
 and worked with that. Now I want to download the UCD files for
 4.0.0 again. I know it is all in
 http://www.unicode.org/Public/4.0-Update/,
 ...
 Do I really need to find out and download all unchanged files
 from 3.2.0 and earlier, just to get the files for 4.0.0?

and Kenneth Whistler kenw at sybase dot com responded:

 Yes. The relevant information for *each* version of the
 Unicode Standard is at:

 http://www.unicode.org/standard/Versions/enumeratedversions.html
 ...

I think the answer depends on what Theo really wants.  He asked about
downloading the data files for 4.0.0, but before that he mentioned
downloading the latest version, which is not 4.0.0 but 4.0.1.

If Theo really wants the 4.0.0 data files, he needs to download not only
from 4.0-Update but also from 3.2-Update, as Ken said.

If all he wants is the latest version (4.0.1), he can go to:

http://www.unicode.org/Public/UNIDATA/

which not only has all the files, but has the added advantage that he
doesn't have to strip the -x.x.x version number from the file names if
he's only interested in replacing old files with new ones.
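
For anyone who does end up with versioned names from the update directories, stripping the -x.x.x part can be sketched in C as below. The function name is hypothetical, and the sketch assumes the version sits between the last hyphen and the file extension, as in Jamo-3.2.0.txt.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Copy name to out without a "-x.y.z" version number before the
 * extension, e.g. "Jamo-3.2.0.txt" becomes "Jamo.txt".
 * Returns 1 if a version was removed, 0 if the name was copied
 * unchanged, and -1 if out (of `size` bytes) is too small. */
int strip_version(const char *name, char *out, size_t size)
{
    const char *dash;
    const char *q;
    size_t stem_len;

    assert(name != NULL && out != NULL);
    if (strlen(name) + 1 > size)
        return -1;

    /* The version, if any, starts right after the last hyphen. */
    dash = strrchr(name, '-');
    if (dash == NULL || !isdigit((unsigned char)dash[1])) {
        strcpy(out, name);
        return 0;
    }

    /* Skip digits and dots; q stops at the first extension letter. */
    q = dash + 1;
    while (isdigit((unsigned char)*q) || *q == '.')
        q++;
    if (q[-1] != '.' || *q == '\0') {
        strcpy(out, name);      /* no extension after the digits */
        return 0;
    }

    stem_len = (size_t)(dash - name);
    memcpy(out, name, stem_len);
    strcpy(out + stem_len, q - 1);  /* q - 1 is the dot before the extension */
    return 1;
}
```

An extension that itself begins with a digit would be swallowed by the scan, so this is only suitable for names of the shape shown in this thread.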

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Unihan.txt and the four dictionary sorting algorithm

2004-04-19 Thread Ernest Cline

While I would expect the answer to my question to be true,
one never knows what lurks in the heart of data files.
Unihan.txt contains at least two properties for each of the
four dictionaries used in the sorting algorithm.  One property
contains only characters that are actually in the dictionary
while the other contains interpolations as well.  Is it always
the case that a character is in one of these dictionaries
if and only if the two properties have the same value
and that value ends in 0?

For example, if there is a value of kIRGKangXi of the form
.YY0 there will always be the same value for the
kKangXi for that character and vice versa.
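
The test described above can be written down as a small C sketch. The function names are hypothetical, and whether the rule actually holds for every record in Unihan.txt is exactly the open question, so the code only expresses the proposed check.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if the property value ends in the digit '0'. */
int ends_in_zero(const char *value)
{
    size_t len;

    assert(value != NULL);
    len = strlen(value);
    return len > 0 && value[len - 1] == '0';
}

/* Proposed rule: the character is really in the dictionary when the
 * kIRG* value and the plain dictionary value are identical and the
 * shared value ends in 0. */
int is_actual_entry(const char *irg_value, const char *dict_value)
{
    assert(irg_value != NULL && dict_value != NULL);
    return strcmp(irg_value, dict_value) == 0 && ends_in_zero(irg_value);
}
```

If the rule holds, a kKangXi value could be reconstituted by copying the kIRGKangXi value whenever ends_in_zero() returns 1.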

I'm trying to pare Unihan.txt down to a less unwieldy size
for my own use by eliminating properties that are of no
interest to me and would like to be certain that eliminating
the four properties containing the actual values for those
dictionaries can be done safely because the information
can be reconstituted if necessary from the kIRG*
properties since I'm not certain if those properties
are of interest to me.

Ernest Cline
[EMAIL PROTECTED]






Re: Downloading UCD 4.0.0

2004-04-19 Thread Doug Ewell
I wrote:

 I think the answer depends on what Theo really wants.  He asked about
 downloading the data files for 4.0.0, but before that he mentioned
 downloading the latest version, which is not 4.0.0 but 4.0.1.

Reading Theo's question again, I see that he was talking about having
downloaded the latest version until now and now wants to download
4.0.0 again, which he recognizes is not the latest version.  So Ken's
answer was the appropriate one.  Read first, Doug, then write.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/