Hello group,
I'd like to have the group opinion about a feature I would find utterly useful in _standard_ SQLite. Here's a rewrite of mails sent to hwaci about it, without success so far. Note: I guess that non pure ASCII characters in the sample strings below will translate to '?', but you can obviously get the idea. --------------- First I'd like to congratulate contributors to SQLite. You know who and why! Now I'd like to either submit a request or ask for a quotation wrt (yet another) new feature in SQLite. SQLite has been supporting Unicode for a long time but Unicode brings its own problems. Collating is one of the most important. I know one can add a user-defined collating function but I think it doesn't solve all situations. I feel the need for a different implementation of collating support. Not every user of SQLite needs full universal collating support à la ICU. It's huge and slows things down significantly. Most users only have a very small number of close languages to deal with, e.g. European Western ones. For 98% of those users only the diacritics are a problem, but these accented letters are only a tiny subset of Unicode for a given language. Hence I believe a large user base could benefit of having a new "USERSTRING" collating support which could be implemented in only two couples of strings of equal length _stored_ in a reserved table of a database and a few lines of fast C code in the SQLite library. The first string, call it OrderString, would define the specific character set required by this precise application and its sort order (ORDER BY). Any char not found there should be handled as per its Unicode value. This would allow users to explicit the charset and sorting order they need, without having to bring giant tables and slow code in the picture. The second string, call it ClassString, would hold the character class for each character in the first string. Its use is for LIKE operator and FTS3 support, after lookup in Orderstring to get letter index in this string. You can choose to have both upper- and lower-case letters there (case sensitive LIKE) or only lowercase (case insensitive LIKE). The third string, call it UpperString, would hold the uppercase version of the charset. The fourth string, call it LowerString, would hold the lowercase version of the charset. Should be identical to UpperString for languages that don't have casing. Let me elaborate how the OrderString can cope with letters in two or more intervals. To compare strings, we need only know how to compare single characters. Let C and D be the characters to compare and let A be the first and Z the last character in OrderString. Also let Idx(char) be the position of char in Orderstring. We can proceed as follows: ic = Index(C, OrderString) // index of C or -1 if not found id = Index(D, OrderString) // index of D or -1 if not found If C != -1 If D != -1 Return Sign(ic - id) Else Return Sign(D - A) Endif Else If D != -1 Return Sign(C - A) Else Return Sign(C - D) Endif Endif In plain language, if C & D are both inside or outside the user charset, then compare them directly within their set (user or rest-of-Unicode). Else compare the char not in userset against the first char in userset. This trick makes the userset a separate "branch" off Unicode, sorted apart from the Unicode "sequence". The pseudo-code above should be compiled very efficiently by any decent compiler. OrderString not only defines the userset and its sort order, but it also defines the "letter" class for REGEXes, upper and lower classes being also well defined by UpperString and LowerString. The LIKE operator goes along quite the same line. With notation above: ic = Index(C, OrderString) // index of C or -1 if not found id = Index(D, OrderString) // index of D or -1 if not found If C != -1 AND If D != -1 Return (ClassString[ic] == ClassString[id]) ElseIf C != -1 AND If D != -1 Return (C == D) Else Return False Endif If we would like to have LIKE act depending on a CASE_SENSITIVE flag [sounds it could be the compareInfo.noCase flag], then it would require ClassString be in fact TWO strings, UpperClassString and LowerClassString. Anyway, once the choice made, it's as easy to implement with straightforward code. As an example, I'm interested in finding customers first or last names, street addresses or city names mostly for French customers. I may also have some Spanish and Portugese names around, as well as names from Netherlands, Germany, Switzerland, Italy. I would use FTS3 virtual tables knowing that FTS3 will use my user supplied strings to record words in pure ASCII, but I could use any kind of search with data containing diacritics: it would work like a charm transparently. I would start with something like (mailing this could partly destroy it!): OrderString = 'AÀÁÂÃÄÅÆaàáâãäåæBbCÇcçDdEÈÉÊËeèéêëFfGgHhIÌÍÎÏiìíîïJjKkLlMmNÑnñOÒÓÔÕÖoòóôõöPpQqRrSßsTtUÙÚÛÜuùúûüVvWwXxYyýÿZz' ClassString = 'AAAAAAAAaaaaaaaaBbCCccDdEEEEEeeeeeFfGgHhIIIIIiiiiiJjKkLlMmNNnnOOOOOOOoooooooPpQqRrSssTtUUUUUuuuuuVvWwXxYyyyZz'; UpperString = 'AÀÁÂÃÄÅÆBCÇDEÈÉÊËFGHIÌÍÎÏJKLMNÑOÒÓÔÕÖPQRSßTUÙÚÛÜVWXYYYZ'; ^^---- Uppercase version don't seem to exist, but no problem LowerString = 'aàáâãäåæbcçdeèéêëfghiìíîïjklMnñoòóôõöpqrsstuùúûüvwxyýÿz'; If ever I would have to handle, say Hebrew, names as well (very unlikely because I can't read/write Hebrew!) I would only have to append [or prepend] the relevant Hebrew letters in the required order to be ready within minutes. The only limitation in this scheme is that the user should be aware that using a character range in a regex class may span a non-trivial range. I spare you more detailed explanation, but it can be made to work rather intuitively without much complexity. Of course such feature can't compete in universality with ICU and can't even deal with "multichar" cases like the german ß --> ss and similar "multiple characters" situations. BTW I believe most german users wouldn't object much at having 'Fuß' LIKE 'fus' returning true. But this approach has many advantages in my view: -) it is very simple -) it is very flexible -) it is able to solve the collating problem in _most_ simple situations -) it can be changed on the fly without recompile -) it can be implemented with small and fast code -) it keeps size requirements in scale with what the user really needs -) it brings the metadata into the database (where it belongs), in a portable way -) it allows non standard collation / case translation (can fit ad hoc needs) -) it has near zero size and cycles overhead if not used -) devoting a special "system" table for holding such data allows to store more than one. Each set of four strings can be given a handy user name and be used when the need arises with very low penalty. -) it doesn't preclude the use of ICU for those who need caterpillars -) it's royalty-free (ah ah !!!) As a sidenote I insist this would solve the problem wich arises when a database using a user-defined collating function is dissociated from the user code. It would make collation and case translation as portable as the database itself, provided it is standard in SQLite out of the "box". Even if you aren't persuaded it is worth a second thought, I seriously ask again if you could quote a custom development for this into SQLite. Let me eat one more minute of your time explaining why I won't do it myself. I've been in IT for quite a long time now and I've coded a lot in too many languages. I've spend the last 10+ years as a project leader in formal methods. But now I'm currently in the situation of being a simple _user_ of someone else' products: I need to concentrate on making things work without taking the pain with low-level details. I need little maintenance done by myself. So I certainly don't want to have to fetch new source code for each new release, setup a development machine (I only have _production_ machines here), take the time to learn how to add the feature, bring in this feature in each new SQLite release without breaking other code, test the result (who knows what bug that silly compiler will produce this time ?), ... My development environment is AutoIt (another wonderful piece of software) and SQLite. I have good reasons for keeping it that simple. I strongly believe that offering the above feature in standard SQLite would allow really much better and useful applications for a large number of users! I apologize for my lengthy mail and naive english. Thanks in advance for your answer. --------------- Jean-Christophe _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users