What Simon says extends to another case: if you change the visible behavior
of that function, even if it's a necessary bug fix, you end up with broken
indices. 

That's true for every collation sequence, though. 

If I interpret your code correctly, however, the user can switch the
preferred language on the fly, changing the collation locale and thus the
"response" of your function. This will *definitely* break the index. 

In other words: if s1 < s2 in one language but not in another (something
that can happen, AFAIK), your index is valid only for the language it was
created with.

You would either have to avoid using an index on those columns (making
sorts not exactly fast), or re-index when switching languages (making the
switch expensive and file-global).
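
For the second route, SQLite can rebuild everything that depends on a named
collation in one go. A rough sketch - the collation name "mycoll" and
my_compare() are just placeholders for whatever you actually registered:

  /* Re-register the collation for the new locale, then rebuild every
     index that uses it. */
  sqlite3_create_collation(db, "mycoll", SQLITE_UTF8,
                           new_locale_state, my_compare);
  sqlite3_exec(db, "REINDEX mycoll;", 0, 0, 0);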

---

Other things:

Should "AD" + "ZV" really compare as a "A" + "DZ" digraph +"V" in the
respective language? I am not sure about the intended behavior, but it seems
strange. (OTOH, language. It's always strange.)
 
Anyway, I would definitely Unicode-normalize the strings *before* putting
them into the database. You might avoid the special handling for the
digraphs if you normalize /towards/ the digraph code points: only strings
actually containing digraphs would escape your optimization.
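
If you happen to link against ICU anyway, the normalization step itself is
small; a sketch (note that plain NFC will *not* compose "D"+"Z" into the
digraph code points, since those only have compatibility mappings, so the
"towards the digraphs" variant would need a custom mapping on top of this):

  #include <unicode/unorm2.h>

  /* NFC-normalize a UTF-16 string before it goes into the database.
     Returns the normalized length, or -1 on error. */
  int32_t normalize_nfc(const UChar *src, int32_t srcLen,
                        UChar *dst, int32_t dstCap)
  {
      UErrorCode status = U_ZERO_ERROR;
      const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);
      int32_t n;
      if (U_FAILURE(status)) return -1;
      n = unorm2_normalize(nfc, src, srcLen, dst, dstCap, &status);
      return U_FAILURE(status) ? -1 : n;
  }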

----

With the problem above in mind: If you have to change the DB anyway when
switching language, you could also store the sequence of comparison weights
in a separate column, and index/sort on this. 

I don't know the collation algorithm well enough to say, but it seems to me
you could just throw away the "case" weight and then do a full binary
comparison. But pure ASCII strings could be stored in a native lowercase
representation and take "only" twice the space.
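
With ICU that could look roughly like this - ucol_getSortKey() yields a
blob you can store in an extra column, index, and compare with plain
memcmp(), and reducing the strength drops the case level (names and the
per-call open are just for the sketch; in real code you'd keep the
collator around):

  #include <unicode/ucol.h>

  /* Sketch: build a locale-specific sort key for storage in a BLOB
     column.  SECONDARY strength discards the tertiary ("case") level. */
  int32_t make_sort_key(const char *locale, const UChar *s, int32_t len,
                        uint8_t *key, int32_t keyCap)
  {
      UErrorCode status = U_ZERO_ERROR;
      UCollator *coll = ucol_open(locale, &status);
      int32_t n;
      if (U_FAILURE(status)) return -1;
      ucol_setStrength(coll, UCOL_SECONDARY);
      n = ucol_getSortKey(coll, s, len, key, keyCap);
      ucol_close(coll);
      return n;
  }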

This of course trades file size, working set, and "switch performance" for
comparison speed; I'm not sure whether that makes sense for you.

----

Unit Tests:
I would isolate the comparison into a core function and primarily test that
core function; basically:

assert( cmp("Ape", "monkey") < 0);

Select strings for each branch (pure ASCII, "bad European ASCII",
non-ASCII).
For each pair, check that you get the opposite result when you swap the
arguments.
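
A sketch of that check - my_collate() stands in for your core function,
assuming it has the usual SQLite collation signature:

  #include <assert.h>
  #include <string.h>

  /* Swapping the arguments must flip the sign (or both must be zero).
     my_collate() is a stand-in for the core comparison function. */
  static void check_symmetry(const char *a, const char *b)
  {
      int ab = my_collate(0, (int)strlen(a), a, (int)strlen(b), b);
      int ba = my_collate(0, (int)strlen(b), b, (int)strlen(a), a);
      assert((ab < 0 && ba > 0) || (ab > 0 && ba < 0) ||
             (ab == 0 && ba == 0));
  }

Call that for each pair from the three groups above.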

There's no need to test that, say, "DZ" sorts differently in Hungarian than
in Albanian - unless you distrust the core functions. You just want to make
sure you pick the right branch (memcmp, ASCII compare, full compare).

It is hard to test for speed optimizations, as the variance of normal
execution speed trumps most gains. What I do in this case is add
"performance counters" and test for them: in your case, three simple
counters, one for each of the branches:

  assert(mycollate.perf.asciicmp == 0);
  assert( cmp("Ape", "monkey") < 0);
  assert(mycollate.perf.asciicmp == 1);
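
The counters themselves don't need to be anything fancy - names made up to
match the asserts above:

  struct collate_perf {
      unsigned memcmpcnt;   /* branch 1: plain memcmp            */
      unsigned asciicmp;    /* branch 2: case-folded ASCII       */
      unsigned fullcmp;     /* branch 3: full locale comparison  */
  };

Increment the matching counter at the top of each branch in the comparison
function.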

This is rather brittle, as such optimizations tend to change, so don't
overindulge.

Of course, I assume you do a "manual" performance measurement, too.


----


I am not sure why you go to the full length of doing it right, but then balk
at the A vs B define, picking the "dangerous" one.

There are some oddities about the code, e.g. comparing g_bUseAscii to
explicit 0 / 1 / -1 values, but that's not the topic of your post...


