Jonathan Kaye wrote:
Sorry for being thick. I suspected this might be a "feature" rather than a
bug. As I'm working on Namibian languages (Nama and Khoekhoegowab,
specifically) this is going to be a problem. "Phonemes" in these languages
are often expressed by digraphs or even trigraphs. They need to be encoded
into a special sort field which respects their identity. For example "kh"
is not the same as k+h (just like English sh is not the same as s+h) and
the Namibians want it to occupy a special place in the collating sequence.
I have encoded all such cases as a single character and quickly run out of
normal characters. Way back in the days when this project was started I
used qsort to do the sorting. It worked on the 256 (- the reserved codes)
ascii codes and gave me a sort in strict numeric order regardless of
the "semantics" of the code. So ö ascii f6 was just a number, f6, and bore
no special relation to "o" which is 6f. When you consider that Namibian
languages have tones (up to 4 level plus more contour ones) and these come
out as accents in the final written form, you can start to appreciate the
scope of the problem. What's more, tones are not taken into consideration
for sorting unless they are the sole means of distinguishing two otherwise
identical forms. My coding strips off the tones (represented by numbers and
normally following the vowel they sit on) and puts them at the end of the
recoded string used for sorting.
So you are quite correct in saying that normally accented characters should
be treated this way but in my case this is a disaster. My question would
then be, is there a way of turning off this feature and having my codes
sorted purely in terms of their ascii values?
Thanks for your patience and sorry about being thick in seeing what you are
talking about. If you have any suggestions maybe we could continue this thread
offlist between the two of us as it's getting rather technical and probably
doesn't interest the typical OO user.
There is no way of turning off this feature. (But I think that you might
suggest to OpenOffice.org that they also allow a binary sort in addition
to the language sorts.)
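
To illustrate what a binary sort does, here is a minimal Python sketch
(not OpenOffice.org code; the word list is made up for the example).
Python compares strings by code value by default, so this is the kind
of strict numeric ordering that qsort gave:

```python
# Hypothetical sample words, chosen only to show the ordering.
words = ["zebra", "öl", "oben", "Zoo"]

# Default string comparison is by code point, i.e. a "binary" sort:
# 'Z' (0x5A) < 'o' (0x6F) < 'z' (0x7A) < 'ö' (0xF6).
binary_sorted = sorted(words)
print(binary_sorted)  # ['Zoo', 'oben', 'zebra', 'öl']
```

Here "ö" lands after "z" because only its code value is considered; a
language sort would normally place it next to "o".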
Now if Nama were one of the languages supported by Unicode, then a sort
would have been set up which would allow for words in Nama to be sorted
according to normal Nama rules, if that is what you want. (But I gather
that you don't want standard Nama sorting, rather you want a tailored
linguistic sort.)
Still, this may not be a problem if you don't use accented characters
for your sort codes. See
http://unicode.org/Public/UCA/latest/allkeys.txt for the Unicode sort
codes. Search for "LATIN SMALL LETTER A" which marks the beginning of
the Latin alphabetic characters. The first element of a sort code
indicates the letter value. When the first element changes, this
indicates a different letter. You could therefore select from the
different non-standard letter codes for your values rather than using
the letter-diacritic combinations which no longer work in many modern
sort programs because they have the same initial sort code. You might
limit yourself mostly to characters that appear in the WGL4 character
set which is supported by many fonts. (See
http://www.microsoft.com/typography/otspec/WGL4.htm .)
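
As a rough aid for reading allkeys.txt, the following Python sketch
pulls the first element of the sort code out of an entry line. The
function name is my own, and the weights in the three sample lines are
placeholders written in the file's general layout, not values copied
from the real file; check the actual file for real values.

```python
import re

# An entry looks roughly like:
#   <code point(s)> ; [.<first>.<second>.<third>...] # <name>
# The pattern captures the code point(s) and the first element only.
ENTRY = re.compile(r"^([0-9A-F ]+?)\s*;\s*\[[.*]([0-9A-F]+)\.")

def first_sort_element(line):
    """Return (code points, first element) or None for non-entry lines."""
    m = ENTRY.match(line)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2)

# Placeholder lines: the weights are illustrative, NOT real data.
samples = [
    "0061 ; [.1000.0020.0002] # LATIN SMALL LETTER A",
    "00E1 ; [.1000.0020.0002][.0000.0024.0002] # LATIN SMALL LETTER A WITH ACUTE",
    "0062 ; [.1001.0020.0002] # LATIN SMALL LETTER B",
]
parsed = [first_sort_element(s) for s in samples]
for code_point, first in parsed:
    print(code_point, first)
```

In the placeholder data the plain and accented "a" share a first
element, while "b" has a different one, which is the situation
described above: only a change in the first element marks a different
letter.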
Note also, that if you indicate tones by diacritics, then the tones will
automatically not affect the sort except in cases where forms are
identical except for diacritics. Thus you need not mess around with
extracting and dropping tones if you use diacritics to represent tones.
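
The effect can be imitated in a few lines of Python. This is only a
sketch of the idea, not the Unicode algorithm itself: it compares the
letters with diacritics removed first and uses the accented form only
to break ties. The words are invented forms, not real Nama vocabulary.

```python
import unicodedata

def tone_insensitive_key(word):
    """Sort key: base letters first, diacritics only as a tie-breaker."""
    decomposed = unicodedata.normalize("NFD", word)
    # Drop combining marks (accents) to get the base letters.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base, decomposed)

# Invented forms: "gà" is g + a-grave, "gá" is g + a-acute.
words = ["g\u00e0", "gab", "ga", "g\u00e1"]

tone_sorted = sorted(words, key=tone_insensitive_key)
print(tone_sorted)    # ['ga', 'gà', 'gá', 'gab']

# For comparison, a plain code-value sort pushes the accented forms
# past "gab".
print(sorted(words))  # ['ga', 'gab', 'gà', 'gá']
```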
You also may be able to use the proper Unicode click symbols instead of
kludging.
But if you also want to sort by using a database which doesn't support
Unicode, then you are stuck with using a 256-character set.
Note that if you are using Windows, Windows has its own sorting
algorithm which is not always identical to the Unicode algorithm. (It
predates the Unicode algorithm and is supposedly in some ways better.)
Accordingly some Windows programs may sort characters differently than
does OpenOffice.org. And the Unicode sort-values have changed in the
past and may possibly change in the future with respect to particular
characters, so the order of characters might change.
For free fonts that support a large number of Unicode characters see
http://www.alanwood.net/unicode/fonts.html .
In short, I think you may still be able to do what you wish to do.
Another possibility, if you have trigraphs, is to represent every
character in the original word by a four-character code, that is "A" might
be "aaaa", "Á" might be "áaaa", B might be "baaa", K might be "kaaa", KH
might be "kzha" and so forth. This would allow creation of sort codes
without using special characters, since each sort-character is four
characters wide.
Indeed, you might be able to use only three characters or two characters
to create such codes depending on how complex your system is.
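
A minimal Python sketch of such fixed-width codes follows. The table
is hypothetical: it uses lower-case letters, takes the codes for a, k
and kh from the examples above, and the codes for h and o are my own
guesses in the same style. Digraphs are matched before single letters
so that "kh" is never read as k+h.

```python
# Hypothetical code table: every unit maps to a four-character code.
CODES = {
    "a": "aaaa",
    "h": "haaa",
    "k": "kaaa",
    "kh": "kzha",  # sorts after every k + vowel combination
    "o": "oaaa",
}
MAX_LEN = max(len(unit) for unit in CODES)

def encode(word):
    """Build the sort string, always taking the longest matching unit."""
    out = []
    i = 0
    while i < len(word):
        for n in range(MAX_LEN, 0, -1):
            chunk = word[i:i + n]
            if chunk in CODES:
                out.append(CODES[chunk])
                i += n
                break
        else:
            raise ValueError(f"no sort code for {word[i]!r} in {word!r}")
    return "".join(out)

words = ["kha", "ko", "ka"]
coded_sorted = sorted(words, key=encode)
print(coded_sorted)   # ['ka', 'ko', 'kha']
print(sorted(words))  # ['ka', 'kha', 'ko']
```

With these codes "kha" sorts after "ko", because "kh" is treated as
one letter that follows plain "k" in the collating sequence.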
Jim Allan