>So, for example, if one wanted to find all rows where  myNormalColumn
>ENDS WITH 'fi c d',  one could search myFlippedColumn like this:
>
>select * from LEXICON where myFlippedColumn LIKE 'd c if%'   -- allows index use

Make this

select * from LEXICON where myFlippedColumn LIKE flip('fi c d') || '%'

and you get rid of _this_ issue.
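
For illustration, here is a minimal sketch of that query using Python's 
sqlite3 module.  Note that flip() is not built into SQLite: the version 
registered below is a hypothetical helper that simply reverses the 
string codepoint by codepoint, and the sample rows are made up.

```python
import sqlite3

def flip(s):
    """Hypothetical helper: reverse a string codepoint by codepoint."""
    return s[::-1]

conn = sqlite3.connect(":memory:")
# Make flip() callable from SQL as a one-argument function.
conn.create_function("flip", 1, flip)

conn.execute("CREATE TABLE LEXICON (myNormalColumn TEXT, myFlippedColumn TEXT)")
conn.execute("CREATE INDEX lexicon_flipped ON LEXICON (myFlippedColumn)")

# Made-up sample data; the flipped column is filled in at insert time.
words = ["wifi c d", "fi c d", "abc"]
conn.executemany(
    "INSERT INTO LEXICON VALUES (?, flip(?))",
    [(w, w) for w in words],
)

# "Ends with" on the normal column becomes "starts with" on the flipped one.
rows = conn.execute(
    "SELECT myNormalColumn FROM LEXICON "
    "WHERE myFlippedColumn LIKE flip(?) || '%' "
    "ORDER BY myNormalColumn",
    ("fi c d",),
).fetchall()
matches = [r[0] for r in rows]
print(matches)
```

Whether the index is actually used with an expression on the right-hand 
side is worth checking with EXPLAIN QUERY PLAN; flipping the term in the 
application and binding the result as a parameter is an alternative.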

But if the 'À' that Igor gave as an example is stored decomposed (A 
plus combining grave) while the search term uses the single precomposed 
codepoint (or vice versa), or if either one carries any spacing 
modifier (and there is half an infinity of them!), then you lose any 
chance of a match.  Also, as Igor just replied, collation wouldn't work 
nicely.
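
To make the mismatch concrete, here is a small sketch in Python.  The 
sample word and the flip() helper are my own assumptions, not anything 
from the earlier posts.

```python
def flip(s):
    """Hypothetical helper: reverse a string codepoint by codepoint."""
    return s[::-1]

# Stored value uses the decomposed form: 'A' followed by U+0300 COMBINING GRAVE.
stored = "VOILA\u0300"
# Search term uses the precomposed form: U+00C0 LATIN CAPITAL LETTER A WITH GRAVE.
term = "L\u00C0"

# Both render the same on screen, but the codepoint sequences differ,
# so neither the plain suffix test nor the flipped prefix test matches.
plain_match = stored.endswith(term)
raw_match = flip(stored).startswith(flip(term))
print(plain_match, raw_match)
```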


>This doesn't really require combining-form intelligence on the part of
>the developer's code either.  As long as the search-term on the RHS gets
>flipped codepoint-by-codepoint and no attempt is made to "be
>intelligent" about the combining form, everything will be hunky-dory.

That seems to me another good instance of the "know your data" 
principle.  The best bet for a given proprietary database would be to 
work with strings conforming to some set of well-defined rules and to 
stick with them, at least for data subject to comparison.  The rules 
don't even have to be one of the "Normalized" forms; they can be any 
consistent invariant that fits the needs, the simpler the better of 
course.  If collation is needed, then a much more complex flipping is 
required in the general case.
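
As a rough sketch of what "stick with one rule" could look like, the 
snippet below applies the same invariant to both the stored value and 
the search term before flipping.  NFC is used here only as one possible 
choice of invariant, and canonical(), flip() and flipped_key() are 
hypothetical names.

```python
import unicodedata

def canonical(s):
    """Apply the chosen invariant; NFC is just one possible choice."""
    return unicodedata.normalize("NFC", s)

def flip(s):
    """Hypothetical helper: reverse a string codepoint by codepoint."""
    return s[::-1]

def flipped_key(s):
    """Normalize first, then flip, so both sides follow the same rule."""
    return flip(canonical(s))

# Decomposed on the storage side, precomposed on the search side.
stored_key = flipped_key("VOILA\u0300")
search_prefix = flipped_key("L\u00C0")

normalized_match = stored_key.startswith(search_prefix)
print(normalized_match)
```

The point is only that the rule is applied on every path into the 
comparison, at write time as well as at search time.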

Anyway, since the vast majority of DB applications appear to be in the 
business area, is there a common need to work with anything other than 
the most compact and easy-to-handle Norm C (NFC) strings, possibly with 
exotic spacing or modifiers filtered out, at the DB storage level?  By 
that I mean the "typical" data one is likely to index, search and 
compare in most applications.

BTW, this raises a side question.  Without hijacking the thread, I for 
one would be interested to know how other major RDBMSs handle Unicode 
data in their predefined fixed-size columns such as CHAR(25).  My wild 
guess is that the input layers apply a severe filter to every field, to 
avoid having 12 significant characters represented by a 453-codepoint 
sequence and then truncated to the first 25 codepoints, several of 
which carry no information.
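
A small sketch of the truncation problem, with made-up data: 12 base 
letters, each followed by four combining marks, cut at 25 codepoints 
the way a naive CHAR(25) might.  How any particular RDBMS really counts 
or truncates is not something this shows.

```python
import unicodedata

# Four combining marks piled onto every base letter (made-up example).
MARKS = "\u0301\u0302\u0303\u0304"
text = "".join(ch + MARKS for ch in "abcdefghijkl")

def base_count(s):
    """Count codepoints that are not combining marks."""
    return sum(1 for c in s if unicodedata.combining(c) == 0)

# Naive truncation by codepoint count, as a fixed-size column might do.
truncated = text[:25]

print(len(text), base_count(text), base_count(truncated))
```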



_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
