What Simon says extends to another case: if you change the visible behavior of that function, even if it's a necessary bug fix, you end up with broken indices. That's true for every collation sequence, though.

If I interpret your code correctly, however, the user can switch the preferred language on the fly, changing the collation locale and thus the "response" of your function. This will *definitely* break the index. In other words: if s1 < s2 in one language but not in another (something that can occur, AFAIK), your index is valid only for the language it was created with. You would either have to avoid using an index on those columns (making sorting not exactly fast), or re-index when switching languages (making the switch expensive and file-global). See the first sketch at the end of this mail.

---

Other things:

Should "AD" + "ZV" really compare as "A" + "DZ" digraph + "V" in the respective language? I am not sure about the intended behavior, but it seems strange. (OTOH, language. It's always strange.)

Anyway, I would definitely Unicode-normalize the strings *before* putting them into the database (second sketch). You might avoid the special handling for the digraphs if you normalize /towards/ the digraph code points: only strings actually containing digraphs would escape your optimization.

---

With the problem above in mind: if you have to change the DB anyway when switching language, you could also store the sequence of comparison weights in a separate column, and index/sort on that (third sketch). I don't know the collation algorithm well enough to say, but it seems to me you could just throw away the "case" weight and then do a plain binary comparison. Pure-ASCII strings could be stored in a native lowercase representation and take "only" twice the space. This trades file size, working set and "switch performance" for comparison speed; not sure whether that makes sense for you.

---

Unit tests: I would isolate the comparison in a core function and primarily test that core function, basically

  assert( cmp("Ape", "monkey") < 0 );

Select strings for each branch (pure ASCII, "bad European ASCII", non-ASCII). For each pair, check that you get the opposite result when you swap the arguments. There's no need to test that, say, "DZ" sorts differently in Hungarian than in Albanian - unless you distrust the core functions. You just want to make sure you pick the right branch (memcmp, ASCII compare, full compare).

It is hard to test for speed optimizations, as the variance of normal execution speed trumps most gains. What I do in this case is add "performance counters" and test for them - in your case, three simple counters, one per branch:

  assert( mycollate.perf.asciicmp == 0 );
  assert( cmp("Ape", "monkey") < 0 );
  assert( mycollate.perf.asciicmp == 1 );

This is rather brittle, as such optimizations tend to change, so don't overindulge (fourth sketch below). Of course I assume you do a "manual" performance measurement, too.

---

I am not sure why you go the full length of doing it right, but then balk at the A vs. B define, picking the "dangerous" one. There are some oddities in the code, e.g. comparing g_bUseAscii against explicit 0 / 1 / -1 values, but that's not the topic of your post...
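Sketch 1 - the re-index route on a language switch. This is only a minimal illustration, not your code: my_locale_compare, the "mycoll" name and the locale context are made up, and the compare body is a placeholder.

  #include <sqlite3.h>
  #include <string.h>

  /* Placeholder compare: the real one would dispatch memcmp / ASCII /
     full locale comparison depending on content and the locale ctx. */
  static int my_locale_compare(void *ctx, int n1, const void *s1,
                               int n2, const void *s2)
  {
      int n = n1 < n2 ? n1 : n2;
      int r = memcmp(s1, s2, (size_t)n);
      return r ? r : n1 - n2;
  }

  void switch_language(sqlite3 *db, void *new_locale_ctx)
  {
      /* Re-register the collation so it picks up the new locale... */
      sqlite3_create_collation(db, "mycoll", SQLITE_UTF8,
                               new_locale_ctx, my_locale_compare);
      /* ...and rebuild every index built with it; otherwise lookups
         against the old ordering silently go wrong. */
      sqlite3_exec(db, "REINDEX mycoll;", 0, 0, 0);
  }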
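Sketch 2 - normalizing before INSERT, assuming ICU4C is available on your platform. This shows plain NFC; normalizing /towards/ the digraph code points would need a custom mapping on top of it, since I know of no stock normalization form that does that.

  #include <unicode/unorm2.h>

  /* Returns the normalized length, or -1 on error. */
  int32_t normalize_nfc(const UChar *src, int32_t srcLen,
                        UChar *dst, int32_t dstCap)
  {
      UErrorCode status = U_ZERO_ERROR;
      const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);
      if (U_FAILURE(status)) return -1;
      int32_t len = unorm2_normalize(nfc, src, srcLen, dst, dstCap, &status);
      return U_FAILURE(status) ? -1 : len;
  }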
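Sketch 3 - the "comparison weights in a separate column" idea, again assuming ICU. ucol_getSortKey produces a binary sort key that compares correctly with memcmp; lowering the collator strength is roughly what "throwing away the case weight" would look like there. Table and column names are invented.

  #include <unicode/ucol.h>

  /* Returns the length of the sort key written into key (or the length
     needed, if keyCap is too small). */
  int32_t make_sort_key(UCollator *coll, const UChar *txt, int32_t len,
                        uint8_t *key, int32_t keyCap)
  {
      ucol_setStrength(coll, UCOL_SECONDARY);  /* ignore case differences */
      return ucol_getSortKey(coll, txt, len, key, keyCap);
  }

  /* Schema idea:
       CREATE TABLE names(name TEXT, name_key BLOB);
       CREATE INDEX idx_name_key ON names(name_key);  -- plain binary order
     On a language switch, recompute name_key for every row. */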
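Sketch 4 - the test shape I had in mind, with a branch counter. cmp(), my_collate_perf and its fields are made-up names; the stub body exists only so the snippet compiles and runs.

  #include <assert.h>
  #include <string.h>

  struct { long memcmp_hits, ascii_hits, full_hits; } my_collate_perf;

  /* Stand-in for the isolated core function; the real one dispatches to
     memcmp / ASCII compare / full compare and bumps the matching counter. */
  int cmp(const char *a, const char *b)
  {
      my_collate_perf.ascii_hits++;
      return strcmp(a, b);
  }

  static void test_ascii_branch(void)
  {
      long before = my_collate_perf.ascii_hits;
      assert(cmp("Ape", "monkey") < 0);                 /* pure-ASCII pair */
      assert(cmp("monkey", "Ape") > 0);                 /* swapped order   */
      assert(my_collate_perf.ascii_hits == before + 2); /* right branch    */
  }

  int main(void)
  {
      test_ascii_branch();
      return 0;
  }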
That's true for every collation sequence, though. If I interpret your code correctly, however, the user can switch the preferred language on the fly, changing the collation locale and thus the "response" of your function. This will *definitely* break the index. i.a.W. if s1 < s2 in one language, but bot in another (somethign that can occur AFAIK), your index is valid only for the language it was created with. You would either have to avoid using an index at those columns (making sort not exactly fast), or re-index when switching language (making switching expensive and file-global) --- Other things: Should "AD" + "ZV" really compare as a "A" + "DZ" digraph +"V" in the respective language? I am not sure about the intended behavior, but it seems strange. (OTOH, language. It's always strange.) Anyway, I would definitely unicode-normalize the strings *before* putting them into the database. You might avoid the special handling for the digraphs if you normalize /towards/ the digraph code points: only strings actually containing digraphs would escape your optimization. ---- With the problem above in mind: If you have to change the DB anyway when switching language, you could also store the sequence of comparison weights in a separate column, and index/sort on this. I don't know the collation algorithm well enough to say, but it seems to me you could just throw away the "case" weight, and then do a full binary comparison. But pure ascii strings could be stored in a native lowercase representaiton and take "only" twice the space. This trades of course file size, working set and "switch performance" for comparison speed, not sure if this makes sense for you. ---- Unit Tests: I would isolate the comparison in a core function, and primarily test that core function, basically assert( cmp("Ape", "monkey") < 0); Select strings for each branch (pure ascii, "bad european ascii", non-ascii). For each pair, check that you get the opposite result when you switch the arguments. There's no need to test that, say, "DZ" sorts differently in huingarian than in Albanian - unless you distrust the core functions. You just want to make sure to pick the right branch (memcmp, ASCII compare, full compare). It is hard to test for speed optimizations, as the variance of normal execution speed trumps most gains. What I do in this case is adding "performance counters" and test for them: in your case, three simple counters for each of the branches: assert(mycollate.perf.asciicmp == 0); assert( cmp("Ape", "monkey") < 0); assert(mycollate.perf.asciicmp == 1); This is most brittle as such optimizations tend to change, so don't overindulge. Of course I assume you make a "manual" performance measurement, too. ---- I am not sure why you go the full length of doing it right, but then balk at the A vs B define, pickign the "dangerous" one. There are some oddities about the code, e.g. comparing a g_bUseAscii to explicit 0 / 1 / -1 values, but that's not the topic of your post... -- View this message in context: http://sqlite.1065341.n5.nabble.com/Collation-advice-tp70668p70679.html Sent from the SQLite mailing list archive at Nabble.com. _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users