Jonathan Kaye wrote:
Sorry for being thick. I suspected this might be a "feature" rather than a
bug. As I'm working on Namibian languages (Nama and Khoekhoegowab,
specifically) this is going to be a problem. "Phonemes" in these languages
are often expressed by digraphs or even trigraphs. They need to be encoded
into a special sort field which respects their identity. For example "kh"
is not the same as k+h (just like English sh is not the same as s+h) and
the Namibians want it to occupy a special place in the collating sequence.
I have to encode all such cases as a single character and quickly run out of
normal characters. Way back in the days when this project was started I
used qsort to do the sorting. It worked on the 256 (- the reserved codes)
ascii codes and gave me a sort in strict numeric order regardless of
the "semantics" of the code. So ö ascii f6 was just a number, f6, and bore
no special relation to "o" which is 6f. When you consider that Namibian
languages have tones (up to 4 level plus more contour ones) and these come
out as accents in the final written form, you can start to appreciate the
scope of the problem. What's more, tones are not taken into consideration
for sorting unless they are the sole means of distinguishing two otherwise
identical forms. My coding strips off the tones (represented by numbers and
normally following the vowel they sit on) and puts them at the end of the
recoded string used for sorting.
So you are quite correct in saying that normally accented characters should
be treated this way but in my case this is a disaster. My question would
then be, is there a way of turning off this feature and having my codes
purely in terms of their ascii values.
Thanks for your patience and sorry about being thick in seeing what you are
talking about. If you have any suggestions maybe we could continue this thread
offlist between the two of us as it's getting rather technical and probably
doesn't interest the typical OO user.

There is no way of turning off this feature. (But I think that you might suggest to OpenOffice.org that they also allow a binary sort in addition to the language sorts.)

Now if Nama were one of the languages supported by Unicode, then a sort would have been set up which would allow for words in Nama to be sorted according to normal Nama rules, if that is what you want. (But I gather that you don't want standard Nama sorting, rather you want a tailored linguistic sort.)

Still, this may not be a problem if you don't use accented characters for your sort codes. See http://unicode.org/Public/UCA/latest/allkeys.txt for the Unicode sort codes. Search for "LATIN SMALL LETTER A" which marks the beginning of the Latin alphabetic characters. The first element of a sort code indicates the letter value. When the first element changes, this indicates a different letter. You could therefore select from the different non-standard letter codes for your values rather than using the letter-diacritic combinations which no longer work in many modern sort programs because they have the same initial sort code. You might limit yourself mostly to characters that appear in the WGL4 character set which is supported by many fonts. (See http://www.microsoft.com/typography/otspec/WGL4.htm .)
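To make the "first element" check concrete, here is a minimal Python sketch that pulls the primary weight of the first collation element out of lines written in the style of allkeys.txt. The function name `first_primary` and the three sample lines are assumptions for illustration only. The weight values are placeholders, not copied from any particular release of the file, and the real values change between versions.

```python
import re

# Matches the start of one collation element such as [.1CAD.0020.0002]
# and captures its first (primary) weight as hexadecimal text.
ELEMENT = re.compile(r"\[[.*]([0-9A-Fa-f]+)\.")

def first_primary(line):
    """Return (code points, primary weight of the first element),
    or None for comments, directives, and blank lines."""
    body = line.split("#", 1)[0]          # drop the trailing comment
    if ";" not in body:
        return None
    chars, elements = body.split(";", 1)
    match = ELEMENT.search(elements)
    if match is None:
        return None
    return chars.strip(), int(match.group(1), 16)

# Illustrative lines only: the weights are placeholders, not real data.
SAMPLE = """\
0061 ; [.1CAD.0020.0002] # LATIN SMALL LETTER A
00E1 ; [.1CAD.0020.0002][.0000.0024.0002] # LATIN SMALL LETTER A WITH ACUTE
0062 ; [.1CC6.0020.0002] # LATIN SMALL LETTER B
"""

primaries = dict(
    entry for entry in map(first_primary, SAMPLE.splitlines()) if entry
)

# Same first weight: sorted as the same letter at the first level.
same_letter = primaries["0061"] == primaries["00E1"]
# Different first weight: sorted as different letters.
different_letter = primaries["0061"] != primaries["0062"]
```

Run against the real file, a check like this would show which candidate characters have distinct first weights and so would be usable as separate sort codes.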

Note also that if you indicate tones by diacritics, the sort will automatically ignore the tones except where two forms are identical apart from their diacritics. Thus you need not mess around with extracting and dropping tones if you use diacritics to represent them.
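The behaviour described here can be approximated in a few lines of Python. This is only a simplified sketch of a two-level comparison, not the Unicode Collation Algorithm itself: it compares the words with diacritics stripped first and falls back to the full form only to break ties. The helper names and the sample strings are made up for illustration and are not real Nama words.

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks (accents) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def sort_key(word):
    # First level: base letters only. Second level: the fully
    # decomposed form, consulted only when the base letters are equal.
    return (strip_marks(word), unicodedata.normalize("NFD", word))

# Made-up forms: "k\u00f3b" is k + o-acute + b, "k\u00f2a" is k + o-grave + a.
words = ["k\u00f3b", "koa", "kob", "k\u00f2a"]
ordered = sorted(words, key=sort_key)
# ordered is: koa, k(o-grave)a, kob, k(o-acute)b
```

The accented forms land next to their unaccented counterparts instead of being pushed to the end, which is the effect obtained by stripping tones and appending them to the sort string.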

You also may be able to use the proper Unicode click symbols instead of kludging.

But if you also want to sort by using a database which doesn't support Unicode, then you are stuck with using a 256-character set.

Note that if you are using Windows, Windows has its own sorting algorithm, which is not always identical to the Unicode algorithm. (It predates the Unicode algorithm and is supposedly better in some ways.) Accordingly, some Windows programs may sort characters differently than OpenOffice.org does. Also, the Unicode sort values for particular characters have changed in the past and may change again in the future, so the order of characters might change.

For free fonts that support a large number of Unicode characters see http://www.alanwood.net/unicode/fonts.html .

In short, I think you may be able to still do what you wish to do.

Another possibility, if you have trigraphs, is to represent every character in the original word by a four-character code: "A" might be "aaaa", "Á" might be "áaaa", "B" might be "baaa", "K" might be "kaaa", "KH" might be "kzha" and so forth. This would allow creation of sort codes without using special characters, since each sort-character is four characters wide.

Indeed, you might be able to use only three characters or two characters to create such codes depending on how complex your system is.
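A fixed-width scheme of this kind could be sketched in Python as follows. This is one possible variation, not a worked-out solution: it generates two-digit codes from each unit's rank instead of using hand-picked letter codes, and the collating order in `ORDER` is invented for the example rather than taken from any real Nama or Khoekhoegowab alphabet. It also assumes that every "k" followed by "h" is the single unit "kh".

```python
# Hypothetical collating order: one entry per sort unit, listed in the
# desired order. Digraphs are units in their own right.
ORDER = ["a", "b", "g", "h", "k", "kh", "o", "s", "t", "ts"]

# Every unit gets a code of the same width, taken from its rank.
CODES = {unit: format(rank, "02d") for rank, unit in enumerate(ORDER)}

# Try longer units first so that "kh" wins over "k" followed by "h".
UNITS_LONGEST_FIRST = sorted(ORDER, key=len, reverse=True)

def encode(word):
    """Translate a word into a string of fixed-width sort codes."""
    out = []
    i = 0
    while i < len(word):
        for unit in UNITS_LONGEST_FIRST:
            if word.startswith(unit, i):
                out.append(CODES[unit])
                i += len(unit)
                break
        else:
            raise ValueError(f"no sort unit for {word[i]!r} in {word!r}")
    return "".join(out)

# Made-up words, used only to show the effect on ordering.
words = ["khab", "kos", "kab", "kho"]
by_code = sorted(words, key=encode)   # kab, kos, khab, kho
by_plain = sorted(words)              # kab, khab, kho, kos
```

Because the codes contain only digits, any binary or language-aware sort should order them the same way, and every word beginning with "k" plus a vowel comes before every word beginning with "kh".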

Jim Allan
