Hello group,

I'd like to have the group opinion about a feature I would find utterly 
useful in _standard_ SQLite.

Here's a rewrite of mails sent to hwaci about it, without success so far.

Note: I guess that non pure ASCII characters in the sample strings 
below will translate to '?', but you can obviously get the idea.

---------------
First I'd like to congratulate contributors to SQLite. You know who and 
why!

Now I'd like to either submit a request or ask for a quotation wrt (yet 
another) new feature in SQLite.

SQLite has been supporting Unicode for a long time but Unicode brings 
its own problems. Collating is one of the most important. I know one 
can add a user-defined collating function but I think it doesn't solve 
all situations.

I feel the need for a different implementation of collating support. 
Not every user of SQLite needs full universal collating support à la 
ICU. It's huge and slows things down significantly.

Most users only have a very small number of close languages to deal 
with, e.g. European Western ones.  For 98% of those users only the 
diacritics are a problem, but these accented letters are only a tiny 
subset of Unicode for a given language.

Hence I believe a large user base could benefit of having a new 
"USERSTRING" collating support which could be implemented in only two 
couples of strings of equal length _stored_ in a reserved table of a 
database and a few lines of fast C code in the SQLite library.

The first string, call it OrderString, would define the specific 
character set required by this precise application and its sort order 
(ORDER BY). Any char not found there should be handled as per its 
Unicode value.  This would allow users to explicit the charset and 
sorting order they need, without having to bring giant tables and slow 
code in the picture.

The second string, call it ClassString, would hold the character class 
for each character in the first string.  Its use is for LIKE operator 
and FTS3 support, after lookup in Orderstring to get letter index in 
this string. You can choose to have both upper- and lower-case letters 
there (case sensitive LIKE) or only lowercase (case insensitive LIKE).

The third string, call it UpperString, would hold the uppercase version 
of the charset.

The fourth string, call it LowerString, would hold the lowercase 
version of the charset. Should be identical to UpperString for 
languages that don't have casing.

Let me elaborate how the OrderString can cope with letters in two or 
more intervals. To compare strings, we need only know how to compare 
single characters.
Let C and D be the characters to compare and let A be the first and Z 
the last character in OrderString. Also let Idx(char) be the position 
of char in Orderstring. We can proceed as follows:

ic = Index(C, OrderString)       // index of C or -1 if not found
id = Index(D, OrderString)       // index of D or -1 if not found
If C != -1
   If D != -1
     Return Sign(ic - id)
   Else
     Return Sign(D - A)
   Endif
Else
   If D != -1
     Return Sign(C - A)
   Else
     Return Sign(C - D)
   Endif
Endif

In plain language, if C & D are both inside or outside the user 
charset, then compare them directly within their set (user or 
rest-of-Unicode). Else compare the char not in userset against the 
first char in userset. This trick makes the userset a separate "branch" 
off Unicode, sorted apart from the Unicode "sequence".

The pseudo-code above should be compiled very efficiently by any decent 
compiler. OrderString not only defines the userset and its sort order, 
but it also defines the "letter" class for REGEXes, upper and lower 
classes being also well defined by UpperString and LowerString.

The LIKE operator goes along quite the same line. With notation above:

ic = Index(C, OrderString)       // index of C or -1 if not found
id = Index(D, OrderString)       // index of D or -1 if not found
If     C != -1 AND If D != -1
     Return (ClassString[ic] == ClassString[id])
ElseIf C != -1 AND If D != -1
     Return (C == D)
Else
     Return False
Endif

If we would like to have LIKE act depending on a CASE_SENSITIVE flag 
[sounds it could be the compareInfo.noCase flag], then it would require 
ClassString be in fact TWO strings, UpperClassString and 
LowerClassString. Anyway, once the choice made, it's as easy to 
implement with straightforward code.

As an example, I'm interested in finding customers first or last names, 
street addresses or city names mostly for French customers. I may also 
have some Spanish and Portugese names around, as well as names from 
Netherlands, Germany, Switzerland, Italy. I would use FTS3 virtual 
tables knowing that FTS3 will use my user supplied strings to record 
words in pure ASCII, but I could use any kind of search with data 
containing diacritics: it would work like a charm transparently.

I would start with something like (mailing this could partly destroy it!):

OrderString =
  
'AÀÁÂÃÄÅÆaàáâãäåæBbCÇcçDdEÈÉÊËeèéêëFfGgHhIÌÍÎÏiìíîïJjKkLlMmNÑnñOÒÓÔÕ֌oòóôõöœPpQqRrSßsTtUÙÚÛÜuùúûüVvWwXxYyýÿZz'

ClassString =
  
'AAAAAAAAaaaaaaaaBbCCccDdEEEEEeeeeeFfGgHhIIIIIiiiiiJjKkLlMmNNnnOOOOOOOoooooooPpQqRrSssTtUUUUUuuuuuVvWwXxYyyyZz';

UpperString =
  'AÀÁÂÃÄÅÆBCÇDEÈÉÊËFGHIÌÍÎÏJKLMNÑOÒÓÔÕ֌PQRSßTUÙÚÛÜVWXYYYZ';
                                                        ^^---- 
Uppercase version don't seem to exist, but no problem
LowerString =
  'aàáâãäåæbcçdeèéêëfghiìíîïjklMnñoòóôõöœpqrsstuùúûüvwxyýÿz';

If ever I would have to handle, say Hebrew, names as well (very 
unlikely because I can't read/write Hebrew!) I would only have to 
append [or prepend] the relevant Hebrew letters in the required order 
to be ready within minutes.

The only limitation in this scheme is that the user should be aware 
that using a character range in a regex class may span a non-trivial 
range. I spare you more detailed explanation, but it can be made to 
work rather intuitively without much complexity.


Of course such feature can't compete in universality with ICU and can't 
even deal with "multichar" cases like the german ß --> ss and similar 
"multiple characters" situations. BTW I believe most german users 
wouldn't object much at having 'Fuß' LIKE 'fus' returning true.

But this approach has many advantages in my view:
   -) it is very simple
   -) it is very flexible
   -) it is able to solve the collating problem in _most_ simple situations
   -) it can be changed on the fly without recompile
   -) it can be implemented with small and fast code
   -) it keeps size requirements in scale with what the user really needs
   -) it brings the metadata into the database (where it belongs), in a 
portable way
   -) it allows non standard collation / case translation (can fit ad 
hoc needs)
   -) it has near zero size and cycles overhead if not used
   -) devoting a special "system" table for holding such data allows to 
store
      more than one. Each set of four strings can be given a handy user 
name
      and be used when the need arises with very low penalty.
   -) it doesn't preclude the use of ICU for those who need caterpillars
   -) it's royalty-free (ah ah !!!)

As a sidenote I insist this would solve the problem wich arises when a 
database using a user-defined collating function is dissociated from 
the user code. It would make collation and case translation as portable 
as the database itself, provided it is standard in SQLite out of the "box".


Even if you aren't persuaded it is worth a second thought, I seriously 
ask again if you could quote a custom development for this into SQLite.


Let me eat one more minute of your time explaining why I won't do it 
myself.  I've been in IT for quite a long time now and I've coded a lot 
in too many languages.  I've spend the last 10+ years as a project 
leader in formal methods.
But now I'm currently in the situation of being a simple _user_ of 
someone else' products: I need to concentrate on making things work 
without taking the pain with low-level details.  I need little 
maintenance done by myself.

So I certainly don't want to have to fetch new source code for each new 
release, setup a development machine (I only have _production_ machines 
here), take the time to learn how to add the feature, bring in this 
feature in each new SQLite release without breaking other code, test 
the result (who knows what bug that silly compiler will produce this 
time ?), ...

My development environment is AutoIt (another wonderful piece of 
software) and SQLite. I have good reasons for keeping it that simple. I 
strongly believe that offering the above feature in standard SQLite 
would allow really much better and useful applications for a large 
number of users!


I apologize for my lengthy mail and naive english.

Thanks in advance for your answer.

---------------

Jean-Christophe

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to