On 10/20/19 11:03 PM, Rowan Worth wrote:
> On Sun, 20 Oct 2019 at 17:04, Simon Slavin <[email protected]> wrote:
>
>> Another common request is full support for Unicode (searching, sorting,
>> length()). But even just the tables required to identify character
>> boundaries are huge.
>>
> Nitpick: there are no tables required to identify character boundaries. For
> utf-8 you know if there's another byte to come which is part of the current
> codepoint based on whether the current byte's high bit is set, and
> furthermore you know how many bytes to expect based on the initial byte.
>
> I'm less familiar with utf-16 which SQLite has some support for, but a
> quick read suggests there are exactly two reserved bit patterns you need to
> care about to identify surrogate pairs and thus codepoint boundaries.
>
> Tables relating to collation order, character case, and similar codepoint
> data can of course get huge, so your point stands.
> -Rowan
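
For reference, the table-free boundary detection described in the quoted
text can be sketched roughly as follows. This is only an illustration in
Python, not anything taken from SQLite; the function names are made up for
the example, and the input is assumed to be well-formed UTF-8 (continuation
bytes are not validated).

```python
from typing import List


def utf8_sequence_length(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with `lead`.

    Hypothetical helper: only the lead byte is inspected.
    """
    if lead < 0x80:            # 0xxxxxxx: ASCII, one byte
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx: two bytes (0xC0/0xC1 are overlong)
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx: three bytes
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx: four bytes (up to U+10FFFF)
        return 4
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")


def is_continuation(byte: int) -> bool:
    """True for 10xxxxxx bytes, which never start a code point."""
    return (byte & 0xC0) == 0x80


def codepoint_offsets(data: bytes) -> List[int]:
    """Byte offsets where each code point starts (assumes valid UTF-8)."""
    offsets = []
    i = 0
    while i < len(data):
        offsets.append(i)
        i += utf8_sequence_length(data[i])
    return offsets


def is_high_surrogate(unit: int) -> bool:
    """UTF-16 code unit that opens a surrogate pair (0xD800..0xDBFF)."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    """UTF-16 code unit that closes a surrogate pair (0xDC00..0xDFFF)."""
    return 0xDC00 <= unit <= 0xDFFF


if __name__ == "__main__":
    sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")  # 1, 2, 3, 4 bytes
    print(codepoint_offsets(sample))
```

Everything here is range checks on a single byte or 16-bit unit, which is
the point of the nitpick: no lookup tables are involved at this level.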
My memory is that Unicode is somewhat careful NOT to define what a
'character' is, because that can get really complicated and is often
application-specific.

You have code-units, which for utf-8 are basically bytes. You have
code-points, which are what most people think of as a 'character', each
with a single Unicode code-point number. Then you have graphemes, which
are clusters of code-points that tend to be rendered as a single glyph in
output (and some code-points don't generate any output at all). Dealing
with graphemes gets complicated, and that is where you run into the need
for lots of tables.

Code-points themselves are fairly simple to deal with. The problem is that
in some languages, dealing with code-points alone doesn't let you handle
some of the 'simple' operations, like sorting or case folding, with 100%
accuracy; that sometimes requires dealing with code-point clusters.

But you also run into the issue (as I understand it) that Unicode doesn't
really define a universal ordering for all characters. Ordering can be a
language-specific problem, and Unicode can't really solve that. (Two
languages might use some of the same characters but treat them differently
for sorting.)

-- 
Richard Damon
_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
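
As a rough illustration of the code-unit / code-point / grapheme
distinction in the reply above, here is a small Python sketch. The helper
names are invented for the example. The last function is only a crude
approximation of grapheme counting (it skips combining marks); real
grapheme segmentation follows more involved rules and is not implemented
here. Note that even this approximation leans on the Unicode character
database shipped with Python's `unicodedata` module, i.e. on tables.

```python
import unicodedata


def utf8_code_units(text: str) -> int:
    """Number of UTF-8 code units (bytes) needed for `text`."""
    return len(text.encode("utf-8"))


def utf16_code_units(text: str) -> int:
    """Number of 16-bit UTF-16 code units needed for `text`."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """Number of code points; a Python str is a sequence of code points."""
    return len(text)


def approx_graphemes(text: str) -> int:
    """Crude approximation: count code points that are not combining marks.

    Assumption for this sketch only; it ignores emoji sequences, Hangul
    jamo, and everything else proper grapheme segmentation handles.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


if __name__ == "__main__":
    text = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT, one glyph
    print(utf8_code_units(text))   # 3
    print(utf16_code_units(text))  # 2
    print(code_points(text))       # 2
    print(approx_graphemes(text))  # 1
```

The same visible 'character' gives three different answers for length(),
depending on which of the three levels the application cares about.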

