[users] Re: an arcane property of OO sorting

Jim Allan Fri, 22 Jun 2007 12:22:10 -0700

Jonathan Kaye wrote:

Sorry for being thick. I suspected this might be a "feature" rather than a
bug. As I'm working on Namibian languages (Nama and Khoekhoegowab,
specifically) this is going to be a problem. "Phonemes" in these languages
are often expressed by digraphs or even trigraphs. They need to be encoded
into a special sort field which respects their identity. For example "kh"
is not the same as k+h (just like English sh is not the same as s+h) and
the Namibians want it to occupy a special place in the collating sequence.
I have encoded all such cases as a single character and quickly run out of
normal characters. Way back in the days when this project was started I
used qsort to do the sorting. It worked on the 256 (- the reserved codes)
ascii codes and gave me a sort in strict numeric order regardless of
the "semantics" of the code. So ö ascii f6 was just a number, f6, and bore
no special relation to "o" which is 6f. When you consider that Namibian
languages have tones (up to 4 level plus more contour ones) and these come
out as accents in the final written form, you can start to appreciate the
scope of the problem. What's more, tones are not taken into consideration
for sorting unless they are the sole means of distinguishing two otherwise
identical forms. My coding strips off the tones (represented by numbers and
normally following the vowel they sit on) and puts them at the end of the
recoded string used for sorting.
So you are quite correct in saying that normally accented characters should
be treated this way but in my case this is a disaster. My question would
then be, is there a way of turning off this feature and having my codes
sorted purely in terms of their ascii values?
Thanks for your patience and sorry about being thick in seeing what you are
talking about. If you have any suggestions maybe we could continue this thread
offlist between the two of us as it's getting rather technical and probably
doesn't interest the typical OO user.

There is no way of turning off this feature. (But I think that you might suggest to OpenOffice.org that they also allow a binary sort in addition to the language sorts.)

Now if Nama were one of the languages supported by Unicode, then a sort would have been set up which would allow for words in Nama to be sorted according to normal Nama rules, if that is what you want. (But I gather that you don't want standard Nama sorting, rather you want a tailored linguistic sort.)

Still, this may not be a problem if you don't use accented characters for your sort codes. See http://unicode.org/Public/UCA/latest/allkeys.txt for the Unicode sort codes. Search for "LATIN SMALL LETTER A" which marks the beginning of the Latin alphabetic characters. The first element of a sort code indicates the letter value. When the first element changes, this indicates a different letter. You could therefore select from the different non-standard letter codes for your values rather than using the letter-diacritic combinations which no longer work in many modern sort programs because they have the same initial sort code. You might limit yourself mostly to characters that appear in the WGL4 character set which is supported by many fonts. (See http://www.microsoft.com/typography/otspec/WGL4.htm .)
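If you want to check which characters share a first element, something like the following Python sketch might help. It assumes the usual allkeys.txt line layout (code points, a semicolon, then bracketed elements whose first hex field is the primary weight). The function name primary_weight is my own, and the weights in the sample lines are made up for illustration; they are not taken from the real file.

```python
import re

# First collation element, e.g. "[.1000.0020.0002]" or "[*0209.0020.0002]".
# The first hex field inside the brackets is the primary (letter) weight.
_ELEMENT = re.compile(r"\[[.*]([0-9A-Fa-f]+)\.")

def primary_weight(line):
    """Return (code_points, primary_weight) for one data line, else None."""
    body = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not body or body.startswith("@") or ";" not in body:
        return None                        # blank line, directive, or comment
    chars, elements = body.split(";", 1)
    match = _ELEMENT.search(elements)
    if match is None:
        return None
    code_points = tuple(int(cp, 16) for cp in chars.split())
    return code_points, int(match.group(1), 16)

# Illustrative lines only: these weights are invented, not real UCA values.
sample = [
    "@version 0.0.0",
    "0061 ; [.1000.0020.0002] # LATIN SMALL LETTER A",
    "00E1 ; [.1000.0020.0002][.0000.0024.0002] # LATIN SMALL LETTER A WITH ACUTE",
    "0062 ; [.1010.0020.0002] # LATIN SMALL LETTER B",
]
weights = dict(filter(None, map(primary_weight, sample)))

# "a" and "a with acute" share a primary weight, so they count as one letter.
same_letter = weights[(0x61,)] == weights[(0xE1,)]
```

Characters whose primary weights differ are the ones that would be usable as distinct sort codes.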

Note also, that if you indicate tones by diacritics, then the tones will automatically not affect the sort except in cases where forms are identical except for diacritics. Thus you need not mess around with extracting and dropping tones if you use diacritics to represent tones.
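The effect can be imitated roughly in Python with the standard unicodedata module. This is only a two-level approximation of my own, not the Unicode algorithm and not what OpenOffice.org does internally: the key compares the bare letters first and falls back to the accented form only to break ties. The name tone_blind_key and the sample words are invented.

```python
import unicodedata

def tone_blind_key(word):
    """Sort key: letters without diacritics first, full form as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", word)
    # Combining marks have a non-zero combining class; drop them for level one.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base, decomposed)

# Invented sample words; \u00e1 is a-acute and \u00e0 is a-grave.
words = ["\u00e1ba", "abb", "aba", "\u00e0ba"]

by_code_point = sorted(words)                    # plain binary order
by_letters = sorted(words, key=tone_blind_key)   # diacritics only break ties
```

In the binary order "abb" lands between "aba" and the accented forms, while with the key all three forms of "aba" stay together and "abb" follows them.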

You also may be able to use the proper Unicode click symbols instead of kludging.

But if you also want to sort by using a database which doesn't support Unicode, then you are stuck with using a 256-character set.

Note that if you are using Windows, Windows has its own sorting algorithm which is not always identical to the Unicode algorithm. (It predates the Unicode algorithm and is supposedly in some ways better.) Accordingly some Windows programs may sort characters differently than does OpenOffice.org. And the Unicode sort values have changed in the past and may possibly change in the future in respect to particular characters, so the order of characters might be changed.

For free fonts that support a large number of Unicode characters see http://www.alanwood.net/unicode/fonts.html .


In short, I think you may be able to still do what you wish to do.

Another possibility, if you have triglyphs, is to represent every character in the original word by a quadraglyph, that is "A" might be "aaaa", "Á" might be "áaaa", "B" might be "baaa", "K" might be "kaaa", "KH" might be "kzha" and so forth. This would allow creation of source codes without using special characters, since each sort-character is four characters wide.

Indeed, you might be able to use only three characters or two characters to create such codes depending on how complex your system is.
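A minimal Python sketch of the fixed-width idea, using two-letter codes. The unit list and its order are hypothetical and chosen only for illustration (the real collating sequence has to come from the Namibian side), and the names UNITS, make_codes and encode are my own. The sketch also assumes that every "kh" in a word is the digraph; a genuine k+h sequence would need a separator of some kind.

```python
import string

# Hypothetical collating sequence, for illustration only.
UNITS = ["a", "b", "g", "h", "k", "kh", "o", "s"]
WIDTH = 2  # two lowercase letters allow 26 * 26 = 676 distinct units

def make_codes(units, width=WIDTH):
    """Give each unit a fixed-width code that sorts by the unit's position."""
    letters = string.ascii_lowercase
    codes = {}
    for position, unit in enumerate(units):
        n, digits = position, []
        for _ in range(width):
            n, rem = divmod(n, len(letters))
            digits.append(letters[rem])
        codes[unit] = "".join(reversed(digits))
    return codes

def encode(word, codes):
    """Recode a word unit by unit, always trying the longest unit first."""
    longest = max(map(len, codes))
    out, i = [], 0
    while i < len(word):
        for size in range(longest, 0, -1):
            unit = word[i:i + size]
            if unit in codes:
                out.append(codes[unit])
                i += len(unit)
                break
        else:
            raise ValueError(f"no code for {word[i]!r} in {word!r}")
    return "".join(out)

CODES = make_codes(UNITS)
words = ["kha", "ka", "ko", "sa", "ha"]
tailored = sorted(words, key=lambda w: encode(w, CODES))
plain = sorted(words)
```

Because every code has the same width and uses only plain letters, sorting the recoded strings should give the same result whether the sort is binary or linguistic. With this unit list, "kha" sorts after "ko" in the tailored order, whereas a plain sort puts it between "ka" and "ko".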


Jim Allan
