[sqlite] Proposal for SQLite and non pure ASCII letters

Jean-Christophe Deschamps Wed, 11 Mar 2009 15:15:35 -0700

Hello group,

I'd like to have the group opinion about a feature I would find utterly
useful in _standard_ SQLite.

Here's a rewrite of mails sent to hwaci about it, without success so far.

Note: I guess that non pure ASCII characters in the sample strings
below will translate to '?', but you can obviously get the idea.

---------------
First I'd like to congratulate contributors to SQLite. You know who and
why!

Now I'd like to either submit a request or ask for a quotation wrt (yet
another) new feature in SQLite.

SQLite has been supporting Unicode for a long time but Unicode brings
its own problems. Collating is one of the most important. I know one
can add a user-defined collating function but I think it doesn't solve
all situations.

I feel the need for a different implementation of collating support.
Not every user of SQLite needs full universal collating support à la
ICU. It's huge and slows things down significantly.

Most users only have a very small number of close languages to deal
with, e.g. European Western ones. For 98% of those users only the
diacritics are a problem, but these accented letters are only a tiny
subset of Unicode for a given language.

Hence I believe a large user base could benefit of having a new
"USERSTRING" collating support which could be implemented in only two
couples of strings of equal length _stored_ in a reserved table of a
database and a few lines of fast C code in the SQLite library.

The first string, call it OrderString, would define the specific
character set required by this precise application and its sort order
(ORDER BY). Any char not found there should be handled as per its
Unicode value. This would allow users to explicit the charset and
sorting order they need, without having to bring giant tables and slow
code in the picture.

The second string, call it ClassString, would hold the character class
for each character in the first string. Its use is for LIKE operator
and FTS3 support, after lookup in Orderstring to get letter index in
this string. You can choose to have both upper- and lower-case letters
there (case sensitive LIKE) or only lowercase (case insensitive LIKE).

The third string, call it UpperString, would hold the uppercase version
of the charset.

The fourth string, call it LowerString, would hold the lowercase
version of the charset. Should be identical to UpperString for
languages that don't have casing.

Let me elaborate how the OrderString can cope with letters in two or
more intervals. To compare strings, we need only know how to compare
single characters.
Let C and D be the characters to compare and let A be the first and Z
the last character in OrderString. Also let Idx(char) be the position
of char in Orderstring. We can proceed as follows:

ic = Index(C, OrderString) // index of C or -1 if not found
id = Index(D, OrderString) // index of D or -1 if not found
If C != -1
If D != -1
Return Sign(ic - id)
Else
Return Sign(D - A)
Endif
Else
If D != -1
Return Sign(C - A)
Else
Return Sign(C - D)
Endif
Endif

In plain language, if C & D are both inside or outside the user
charset, then compare them directly within their set (user or
rest-of-Unicode). Else compare the char not in userset against the
first char in userset. This trick makes the userset a separate "branch"
off Unicode, sorted apart from the Unicode "sequence".

The pseudo-code above should be compiled very efficiently by any decent
compiler. OrderString not only defines the userset and its sort order,
but it also defines the "letter" class for REGEXes, upper and lower
classes being also well defined by UpperString and LowerString.

The LIKE operator goes along quite the same line. With notation above:

ic = Index(C, OrderString) // index of C or -1 if not found
id = Index(D, OrderString) // index of D or -1 if not found
If C != -1 AND If D != -1
Return (ClassString[ic] == ClassString[id])
ElseIf C != -1 AND If D != -1
Return (C == D)
Else
Return False
Endif

If we would like to have LIKE act depending on a CASE_SENSITIVE flag
[sounds it could be the compareInfo.noCase flag], then it would require
ClassString be in fact TWO strings, UpperClassString and
LowerClassString. Anyway, once the choice made, it's as easy to
implement with straightforward code.

As an example, I'm interested in finding customers first or last names,
street addresses or city names mostly for French customers. I may also
have some Spanish and Portugese names around, as well as names from
Netherlands, Germany, Switzerland, Italy. I would use FTS3 virtual
tables knowing that FTS3 will use my user supplied strings to record
words in pure ASCII, but I could use any kind of search with data
containing diacritics: it would work like a charm transparently.

I would start with something like (mailing this could partly destroy it!):

OrderString =

'AÀÁÂÃÄÅÆaàáâãäåæBbCÇcçDdEÈÉÊËeèéêëFfGgHhIÌÍÎÏiìíîïJjKkLlMmNÑnñOÒÓÔÕÖoòóôõöPpQqRrSßsTtUÙÚÛÜuùúûüVvWwXxYyýÿZz'

ClassString =

'AAAAAAAAaaaaaaaaBbCCccDdEEEEEeeeeeFfGgHhIIIIIiiiiiJjKkLlMmNNnnOOOOOOOoooooooPpQqRrSssTtUUUUUuuuuuVvWwXxYyyyZz';

UpperString =
'AÀÁÂÃÄÅÆBCÇDEÈÉÊËFGHIÌÍÎÏJKLMNÑOÒÓÔÕÖPQRSßTUÙÚÛÜVWXYYYZ';
^^----
Uppercase version don't seem to exist, but no problem
LowerString =
'aàáâãäåæbcçdeèéêëfghiìíîïjklMnñoòóôõöpqrsstuùúûüvwxyýÿz';

If ever I would have to handle, say Hebrew, names as well (very
unlikely because I can't read/write Hebrew!) I would only have to
append [or prepend] the relevant Hebrew letters in the required order
to be ready within minutes.

The only limitation in this scheme is that the user should be aware
that using a character range in a regex class may span a non-trivial
range. I spare you more detailed explanation, but it can be made to
work rather intuitively without much complexity.

Of course such feature can't compete in universality with ICU and can't
even deal with "multichar" cases like the german ß --> ss and similar
"multiple characters" situations. BTW I believe most german users
wouldn't object much at having 'Fuß' LIKE 'fus' returning true.

But this approach has many advantages in my view:
-) it is very simple
-) it is very flexible
-) it is able to solve the collating problem in _most_ simple situations
-) it can be changed on the fly without recompile
-) it can be implemented with small and fast code
-) it keeps size requirements in scale with what the user really needs
-) it brings the metadata into the database (where it belongs), in a
portable way
-) it allows non standard collation / case translation (can fit ad
hoc needs)
-) it has near zero size and cycles overhead if not used
-) devoting a special "system" table for holding such data allows to
store
more than one. Each set of four strings can be given a handy user
name
and be used when the need arises with very low penalty.
-) it doesn't preclude the use of ICU for those who need caterpillars
-) it's royalty-free (ah ah !!!)

As a sidenote I insist this would solve the problem wich arises when a
database using a user-defined collating function is dissociated from
the user code. It would make collation and case translation as portable
as the database itself, provided it is standard in SQLite out of the "box".

Even if you aren't persuaded it is worth a second thought, I seriously
ask again if you could quote a custom development for this into SQLite.

Let me eat one more minute of your time explaining why I won't do it
myself. I've been in IT for quite a long time now and I've coded a lot
in too many languages. I've spend the last 10+ years as a project
leader in formal methods.
But now I'm currently in the situation of being a simple _user_ of
someone else' products: I need to concentrate on making things work
without taking the pain with low-level details. I need little
maintenance done by myself.

So I certainly don't want to have to fetch new source code for each new
release, setup a development machine (I only have _production_ machines
here), take the time to learn how to add the feature, bring in this
feature in each new SQLite release without breaking other code, test
the result (who knows what bug that silly compiler will produce this
time ?), ...

My development environment is AutoIt (another wonderful piece of
software) and SQLite. I have good reasons for keeping it that simple. I
strongly believe that offering the above feature in standard SQLite
would allow really much better and useful applications for a large
number of users!

I apologize for my lengthy mail and naive english.

Thanks in advance for your answer.

---------------

Jean-Christophe

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] Proposal for SQLite and non pure ASCII letters

Reply via email to